
The Nonlinear Library EA - Center on Long-Term Risk: Annual review and fundraiser 2023 by Center on Long-Term Risk
Dec 13, 2023
07:41
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Center on Long-Term Risk: Annual review and fundraiser 2023, published by Center on Long-Term Risk on December 13, 2023 on The Effective Altruism Forum.
Jesse Clifton
Crossposted to LessWrong here
This is a brief overview of the
Center on Long-Term Risk (CLR)'s activities in 2023 and our plans for 2024. We are hoping to fundraise $770,000 to fulfill our target budget in 2024.
About us
CLR works on addressing the worst-case risks from the development and deployment of advanced AI systems in order to reduce
s-risks. Our research primarily involves thinking about how to reduce conflict and promote cooperation in interactions involving powerful AI systems. In addition to research, we do a range of activities aimed at building a community of people interested in s-risk reduction, and support efforts that contribute to s-risk reduction via the
CLR Fund.
Review of 2023
Research
Our research in 2023 primarily fell in a few buckets:
Commitment races and safe Pareto improvements deconfusion. Many researchers in the area consider
commitment races a potentially important driver of conflict involving AI systems. But we have been missing a precise understanding of the mechanisms by which they could lead to conflict. We believe we made significant progress on this over the last year. This includes progress on understanding the conditions under which an approach to bargaining called "
safe Pareto improvements (SPIs)" can prevent catastrophic conflict.
Most of this work is non-public, but public documents that came out of this line of work include
Open-minded updatelessness,
Responses to apparent rationalist confusions about game / decision theory, and a forthcoming paper (see
draft) & post on SPIs for expected utility maximizers.
Paths to implementing surrogate goals.
Surrogate goals are a special case of SPIs and we consider them a promising route to reducing the downsides from conflict. We (along with CLR-external researchers Nathaniel Sauerberg and Caspar Oesterheld) thought about how implementing surrogate goals could be both credible and counterfactual (i.e., not done by AIs by default), e.g., using
compute monitoring schemes.
CLR researchers, in collaboration with Caspar Oesterheld and Filip Sondej, are also working on a project to "implement" surrogate goals/SPIs in contemporary language models.
Conflict-prone dispositions. We thought about the kinds of dispositions that could exacerbate conflict, and how they might arise in AI systems. The primary motivation for this line of work is that, even if alignment does not fully succeed, we may be able to shape their dispositions in coarse-grained ways that reduce the risks of worse-than-extinction outcomes. See our post on
making AIs less likely to be spiteful.
Evaluations of LLMs. We continued our
earlier work on evaluating cooperation-relevant properties in LLMs. Part of this involved cheap exploratory work with GPT-4 and Claude (e.g., looking at behavior in scenarios from the
Machiavelli dataset) to see if there were particularly interesting behaviors worth investing more time in.
We also worked with external collaborators to develop "Welfare Diplomacy", a variant of the Diplomacy game environment designed to be better for facilitating Cooperative AI research. We
wrote a paper introducing the benchmark and using it to evaluate several LLMs.
Community building
Progress on s-risk community building was slow, due to the departures of our community building staff and funding uncertainties that prevented us from immediately hiring another Community Manager.
We continued having career calls;
We ran our fourth
Summer Research Fellowship, with 10 fellows;
We have now hired a new Community Manager, Winston Oswald-Drummond, who has just started.
Staff & leadership changes
We saw some substantial staff changes this year, with three staff m...
Jesse Clifton
Crossposted to LessWrong here
This is a brief overview of the
Center on Long-Term Risk (CLR)'s activities in 2023 and our plans for 2024. We are hoping to fundraise $770,000 to fulfill our target budget in 2024.
About us
CLR works on addressing the worst-case risks from the development and deployment of advanced AI systems in order to reduce
s-risks. Our research primarily involves thinking about how to reduce conflict and promote cooperation in interactions involving powerful AI systems. In addition to research, we do a range of activities aimed at building a community of people interested in s-risk reduction, and support efforts that contribute to s-risk reduction via the
CLR Fund.
Review of 2023
Research
Our research in 2023 primarily fell in a few buckets:
Commitment races and safe Pareto improvements deconfusion. Many researchers in the area consider
commitment races a potentially important driver of conflict involving AI systems. But we have been missing a precise understanding of the mechanisms by which they could lead to conflict. We believe we made significant progress on this over the last year. This includes progress on understanding the conditions under which an approach to bargaining called "
safe Pareto improvements (SPIs)" can prevent catastrophic conflict.
Most of this work is non-public, but public documents that came out of this line of work include
Open-minded updatelessness,
Responses to apparent rationalist confusions about game / decision theory, and a forthcoming paper (see
draft) & post on SPIs for expected utility maximizers.
Paths to implementing surrogate goals.
Surrogate goals are a special case of SPIs and we consider them a promising route to reducing the downsides from conflict. We (along with CLR-external researchers Nathaniel Sauerberg and Caspar Oesterheld) thought about how implementing surrogate goals could be both credible and counterfactual (i.e., not done by AIs by default), e.g., using
compute monitoring schemes.
CLR researchers, in collaboration with Caspar Oesterheld and Filip Sondej, are also working on a project to "implement" surrogate goals/SPIs in contemporary language models.
Conflict-prone dispositions. We thought about the kinds of dispositions that could exacerbate conflict, and how they might arise in AI systems. The primary motivation for this line of work is that, even if alignment does not fully succeed, we may be able to shape their dispositions in coarse-grained ways that reduce the risks of worse-than-extinction outcomes. See our post on
making AIs less likely to be spiteful.
Evaluations of LLMs. We continued our
earlier work on evaluating cooperation-relevant properties in LLMs. Part of this involved cheap exploratory work with GPT-4 and Claude (e.g., looking at behavior in scenarios from the
Machiavelli dataset) to see if there were particularly interesting behaviors worth investing more time in.
We also worked with external collaborators to develop "Welfare Diplomacy", a variant of the Diplomacy game environment designed to be better for facilitating Cooperative AI research. We
wrote a paper introducing the benchmark and using it to evaluate several LLMs.
Community building
Progress on s-risk community building was slow, due to the departures of our community building staff and funding uncertainties that prevented us from immediately hiring another Community Manager.
We continued having career calls;
We ran our fourth
Summer Research Fellowship, with 10 fellows;
We have now hired a new Community Manager, Winston Oswald-Drummond, who has just started.
Staff & leadership changes
We saw some substantial staff changes this year, with three staff m...
