

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes
Mentioned books

Feb 1, 2024 • 12min
EA - Fish Welfare Initiative's 2023 in Review by haven
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fish Welfare Initiative's 2023 in Review, published by haven on February 1, 2024 on The Effective Altruism Forum.
Note that this is a crosspost from our blog. We're posting it here because the EA community has been instrumental from the very beginning to our organization, so perhaps in some way we're hoping to make you all proud of the work that some of your own have done.
As always, all questions - however candid - are welcome!
Message from our Cofounders
In our
previous year-in-review post, we wrote about how we had made significant progress in developing our interventions, and how because of that progress we were finally beginning to see specific avenues forward to scale an intervention in India to help hundreds of millions of fishes. We are now one year later, and though it was a year of significant progress in many ways, that earlier analysis now seems premature.
We began 2023 with
ambitious scaling goals to add 150 farmers to our
farmer program, and to help an additional 1 million fishes. Unfortunately, largely due to increasingly-apparent limitations in our interventions, we fell short of these goals. These limitations also inspired two
strategic
changes over the course of the year, which, while needed, created instability that at times was challenging for our team. All of these things made 2023 feel like a challenging year for our programming, and for our organization more broadly.
However, with our
last strategic change, and with the improved capacity that came with it, we feel much better positioned to reach our ambition of being an evidence-based, extremely impactful and cost-effective organization. 2024 is thus beginning on a note of optimism for our team, with various causes for excitement: Our revamped research department and
research plans; the issues we've identified (
e.g.) and improvements
we made and are making in the farmer program; and the strategic realignment of certain roles to better capitalize on our staffs' core competencies.
With all of this, we believe 2023 is best characterized as a year of unintentional setup for FWI. We didn't achieve most of the outcomes we intended, but we did build a number of foundations to enable us to achieve these outcomes more rigorously and sustainably this year and beyond. We also still did reach a number of important outcomes themselves, including scaling our farmer program to
over 100 farms, finishing construction on experimental ponds to be used in future research with our local university partnership, and improving the lives of an
estimated 450,000 fishes.
There's no doubt that 2024 will be a critical year for our organization. This is the year where we'll see if our bet on increasing rigor and resources going into R&D will pay off with more impactful and scalable interventions.
Thank you, as always, for your continued interest and support. We're excited to continue to share our progress with you.
Tom and Haven, FWI Cofounders
Countries of Operation
FWI operated primarily in two different countries in 2023:
India, our primary country of implementation, about which most of this post is written. We
selected India back in 2020 as our focus country, primarily because of the scale of farmed fish, the tractability we saw in our field visit of working with farmers, and our ability to hire people effectively there.
China, where we conducted standard setting, field visits, and general early stage institutional awareness raising work. For more information on our current status in China, see our
recent post.
Key Outcomes
We believe the following are the key outcomes achieved by the organization in 2023:
1 - An estimated 450,000 fishes' lives improved.
Through stocking density and water quality improvements implemented by farmers in our
Alliance For Responsible Aquaculture (ARA), we estimate that we improved the live...

Feb 1, 2024 • 22min
LW - Ten Modes of Culture War Discourse by jchan
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ten Modes of Culture War Discourse, published by jchan on February 1, 2024 on LessWrong.
Overview
This article is an extended reply to Scott Alexander's Conflict vs. Mistake.
Whenever the topic has come up in the past, I have always said I lean more towards conflict theory over mistake theory; however, on revisiting the original article, I realize that either I've been using those terms in a confusing way, and/or the usage of the terms has morphed in such a way that confusion is inevitable. My opinion now is that the conflict/mistake dichotomy is overly simplistic because:
One will generally have different kinds of conversations with different people at different times. I may adopt a "mistake" stance when talking with someone who's already on board with our shared goal X, where we try to figure out how best to achieve X; but then later adopt a "conflict" stance with someone who thinks X is bad. Nobody is a "mistake theorist" or "conflict theorist" simpliciter; the proper object of analysis is conversations, not persons or theories.
It conflates the distinct questions "What am I doing when I approach conversations?" and "What do I think other people are doing when they approach conversations?", assuming that they must always have the same answer, which is often not the case.
It has trouble accounting for conversations where the meta-level question "What kind of conversation are we having right now?" is itself one of the matters in dispute.
Instead, I suggest a model where there are 10 distinct modes of discourse, which are defined by which of the 16 roles each participant occupies in the conversation. The interplay between these modes, and the extent to which people may falsely believe themselves to occupy a certain role while in fact they occupy another, is (in my view) a more helpful way of understanding the issues raised in the Conflict/Mistake article.
The chart
Explanation of the chart
The bold labels in the chart are discursive roles. The roles are defined entirely by the mode of discourse they participate in (marked with the double lines), so for example there's no such thing as a "Troll/Wormtongue discourse," since the role of Troll only exists as part of a Feeder/Troll discourse, and Wormtongue as part of Quokka/Wormtongue. For the same reason, you can't say that someone "is a Quokka" full stop.
The roles are placed into quadrants based on which stance (sincere/insincere friendship/enmity) the person playing that role is taking towards their conversation partner.
The double arrows connect confusable roles - someone who is in fact playing one role might mistakenly believe they're playing the other, and vice-versa. The one-way arrows indicate one-way confusions - the person playing the role at the open end will always believe that they're playing the role at the pointed end, and never vice-versa. In other words, you will never think of yourself as occupying the role of Mule, Cassandra, Quokka, or Feeder (at least not while it's happening, although you may later realize it in retrospect).
Constructing the model
This model is not an empirical catalogue of conversations I've personally seen out in the wild, but an a priori derivation from a few basic assumptions. While in some regards this is a point in its favor, it's also it weakness - there are certain modes of discourse that the model "predicts" must exist, but where I have trouble thinking of any real-world examples, or even imagining hypothetically how such a conversation might go.
Four stances
We will start with the most basic kind of conversation - Alice and Bob are discussing some issue, and there are no other parties. On Alice's part, we can ask two questions:
Does Alice think that her and Bob's fundamental values are aligned, or does she think they're unaligned?
Does Alice say that her and Bob's fundame...

Feb 1, 2024 • 7min
LW - Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B? by Teun van der Weij
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?, published by Teun van der Weij on February 1, 2024 on LessWrong.
Produced as part of the ML Alignment Theory Scholars Program Winter 2024 Cohort, under the mentorship of Francis Rhys Ward. The code, data, and plots can be found on https://github.com/TeunvdWeij/MATS/tree/main/distribution_approximation.
This post is meant to provide insight on an interesting LLM capability, which is useful for targeted underperformance on evaluations (sandbagging) by LLMs.
We investigate what happens if you independently sample a language model a 100 times with the task of 80% of those outputs being A, and the remaining 20% of outputs being B. Here is the prompt we used, where p is the target percentage of the output tokens being A. In the example above, p is 80.
In addition to the above prompt, we use the following system prompt:
What kind of distribution do you expect? We used the default sampling hyperparameters; both
temperature and
top_p are 1. We think it is good to register your predictions, please do so now! Results are a bit further below.
Here are the observed calibration scores for GPT-3.5 and GPT-4:
Interestingly, both models are capable of doing this! Notably, GPT3.5 performs worse than GPT4.
Although sampling like this is the easiest to understand, we get a better idea of the exact scores by skipping the sampling altogether. From now on, we extract the probabilities of the top 5 tokens.
The plots illustrate that the probabilities roughly match the target distribution. Very little probability is put on tokens other than A and B, illustrating that the prompting points the model to the right tokens.
To represent these graphs a bit differently, we plot the extracted probability of token A similar to the first plot. We find a similar pattern. We also observe a vague sigmoidal pattern, aligning with (private) work by Ollie Jaffe.
To further test the distribution approximation capability of the model, we performed an experiment where we state the sample number of the query. Now, strategies such as "Pick A for samples 1-80 and B for 81-100" are possible, enabling the model to score perfectly. However, this does not work as both GPT3.5 and GPT4 do not consistently come up with the same strategy (even for temperature of 0). Better prompting might change this result.
Testing targeted sandbagging on simple addition questions
The capability to approximate distributions over multiple independent forward passes is a necessary condition for "targeted" sandbagging: the ability to achieve a specific score that is worse than the model's best performance. To further investigate this, we generate 100 simple addition questions (such as: "What is 5+4?"), for which the models normally achieve 100% accuracy.
The setup was similar to the first experiment, using the following system prompt and user prompt template:
For each question in the dataset, we average the extracted probabilities per value of p for all 100 questions. We show the average extracted probabilities in the plot below.
We see that GPT-3.5 is much worse at targeted sandbagging than the distribution approximation task. Again, GPT-4 is better than 3.5, but it also performs worse than on the previous task. From qualitative evaluations, we believe that these results are highly sensitive to the wording of the prompt template. For one template, we found that GPT-3.5 tended to consistently underperform (while the plot shows it overperformance).
We did not try very hard to find an ideal prompt leading to good calibration. We do expect that such a prompt does exist.
Discussion
The main question we are asking ourselves: are our findings surprising? We do not have a convincing theory of what is going on here, and why the models are able...

Feb 1, 2024 • 2min
AF - PIBBSS Speaker events comings up in February by DusanDNesic
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIBBSS Speaker events comings up in February, published by DusanDNesic on February 1, 2024 on The AI Alignment Forum.
TLDR: The PIBBSS Speaker Series is happening in mid-February. Read more about the talks and register
on our website
. Sign up for future updates about our speaker events
here.
PIBBSS
regularly invites researchers from a wide range of fields studying intelligent behaviour and exploring their connections to the problem of AI Risk and Safety.
In the past, we had speakers such as
Michael Levin talking about
Morphogenesis as a form of intelligent problem-solving building towards a framework of Basal Cognition; J
osh Bongard about
efforts to build hybrid intelligent 'agents' at the intersection of biology, cognition and engineering; Miguel Aguilera on
Nonequilibrium Neural Computation; Steve Byrnes on
Challenges for Safe and Beneficial Brain-Like AI; or Tang Zhi Xuan on
Modeling Bayesian Bounded Rational Agents
.
We have a couple of speaker events are coming up in February.
Konrad Körding on "Interpretability of brains and neural networks"
Feb 12, 9AM PST/5PM UTC (
Register)
Daniel Polani on "Information and its Flow: a Pathway towards Intelligence"
Feb 16, 9AM PST/5PM UTC, (
Register)
Nathaniel Virgo on "Mathematical Perspectives on Goals and Beliefs in the Physical World: Extending on the 'Good Regulator Theorem'"
Feb 19, 6PM PST,/20th 2AM UTC, (
Register)
If you want to be informed about future events (we typically won't post about this going forward), signup
here to our speaker event specific email list.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Feb 1, 2024 • 3min
LW - Per protocol analysis as medical malpractice by braces
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Per protocol analysis as medical malpractice, published by braces on February 1, 2024 on LessWrong.
"Per protocol analysis" is when medical trial researchers drop people who didn't follow the treatment protocol. It is outrageous and must be stopped.
For example: let's say you want to know the impact of daily jogs on happiness. You randomly instruct 80 people to either jog daily or to simply continue their regular routine. As a per protocol analyst, you drop the many treated people who did not go jogging. You keep the whole control group because it wasn't as hard for them to follow instructions.
At this point, your experiment is ruined. You ended up with lopsided groups: people able to jog and the unfiltered control group. It would not be surprising if the eager joggers had higher happiness. This is confounded: it could be due to preexisting factors that made them more able to jog in the first place, like being healthy. You've thrown away the random variation that makes experiments useful in the first place.
This sounds ridiculous enough that in many fields per protocol analysis has been abandoned. But…not all fields. Enter the Harvard hot yoga study, studying the effect of hot yoga on depression.
If the jogging example sounded contrived, this study actually did the same thing but with hot yoga. The treatment group was randomly assigned to do hot yoga. Only 64% (21 of 33) of the treatment group remained in the study until the endpoint at the 8th week compared to 94% (30 of 32) of the control. They end up with striking graphs like this which could be entirely due to the selective dropping of treatment group subjects.
What's depressing is that there is a known fix for this: intent-to-treat analysis. It looks at effects based on the original assignment, regardless of whether someone complied or not. The core principle is that every comparison should be split on the original random assignment, otherwise you risk confounding. It should be standard practice to report the intent-to-treat and many medical papers do so---at least in the appendix somewhere. The hot yoga study does not.
It might be hard to estimate this if you're following people over time and there's a risk of differential attrition---you're missing data for a selected chunk of people.
Also, hot yoga could still really work! But we just don't know from this study. And with all the buzz, there's a good chance this paper ends up being worse than useless: leading to scaled-up trials with null findings that might've not been run if there had been more transparency to begin with.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jan 31, 2024 • 14min
LW - Leading The Parade by johnswentworth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Leading The Parade, published by johnswentworth on January 31, 2024 on LessWrong.
Background
Terminology: Counterfactual Impact vs "Leading The Parade"
Y'know how a parade or marching band has a person who walks in front waving a fancy-looking stick up and down? Like this guy:
The classic 80's comedy Animal House features a
great scene in which a prankster steals the stick, and then leads the marching band off the main road and down a dead-end alley.
In the context of the movie, it's hilarious. It's also, presumably, not at all how parades actually work these days. If you happen to be "leading" a parade, and you go wandering off down a side alley, then (I claim) those following behind will be briefly confused, then ignore you and continue along the parade route. The parade leader may appear to be "leading", but they do not have any counterfactual impact on the route taken by everyone else; the "leader" is just walking slightly ahead.
(Note that I have not personally tested this claim, and I am eager for empirical evidence from anyone who has, preferably with video.)
A lot of questions about how to influence the world, or how to allocate credit/blame to produce useful incentives, hinge on whether people in various positions have counterfactual impact or are "just leading the parade".
Examples
Research
I'm a researcher. Even assuming my research is "successful" (i.e. I solve the problems I'm trying to solve and/or discover and solve even better problems), even assuming my work ends up adopted and deployed in practice, to what extent is my impact counterfactual? Am I just doing things which other people would have done anyway, but maybe slightly ahead of them? For historical researchers, how can I tell, in order to build my priors?
Looking at historical examples, there are at least some cases where very famous work done by researchers was clearly not counterfactual. Newton's development of calculus is one such example: there was simultaneous discovery by Leibniz, therefore calculus clearly would have been figured out around the same time even without Newton.
On the other end of the spectrum, Shannon's development of information theory is my go-to example of research which was probably not just leading the parade. There was no simultaneous discovery, as far as I know. The main prior research was by Nyquist and Hartley about 20 years earlier - so for at least two decades the foundations Shannon built on were there, yet nobody else made significant progress toward the core ideas of information theory in those 20 years. There wasn't any qualitatively new demand for Shannon's results, or any key new data or tool which unlocked the work, compared to 20 years earlier.
fungibility of information both pose and answer a whole new challenge compared to the earlier work. So that all suggests Shannon was not just leading the parade; it would likely have taken decades for someone else to figure out the core ideas of information theory in his absence.
Politics and Activism
Imagine I'm a politician or activist pushing some policy or social change. Even assuming my preferred changes come to pass, to what extent is my impact counterfactual?
Looking at historical examples, there are at least some cases where political/activist work was probably not very counterfactual. For instance, as I understand it the abolition of slavery in the late 18th/early 19th century happened in many countries in parallel around broadly the same time, with relatively little unification between the various efforts. That's roughly analogous to "simultaneous discovery" in science: mostly-independent simultaneous passing of similar laws in different polities suggests that the impact of particular politicians or activists was not very counterfactual, and the change would likely have happened regardless.
timeline of ...

Jan 31, 2024 • 3min
LW - Without Fundamental Advances, Rebellion and Coup d'État are the Inevitable Outcomes of Dictators & Monarchs Trying to Control Large, Capable Countries by Roko
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Without Fundamental Advances, Rebellion and Coup d'État are the Inevitable Outcomes of Dictators & Monarchs Trying to Control Large, Capable Countries, published by Roko on January 31, 2024 on LessWrong.
A pdf version of this report will be available
Summary
In this report, I argue that dictators ruling over large and capable countries are likely to face insurmountable challenges, leading to inevitable rebellion and coup d'état.
I assert this is the default outcome, even with significant countermeasures, given the current trajectory of power dynamics and governance, and therefore when we check real-world countries we should find no or very few dictatorships, no or very few absolute monarchies and no arrangements where one person or a small group imposes their will on a country. This finding is robust to the time period we look in - modern, medieval, or ancient.
In Section 1, I discuss the countries which are the focus of this report. I am specifically focusing on nations of immense influence and power (with at least 1000 times the Dunbar Number of humans) which are capable of running large, specialized industries and fielding armies of at least thousands of troops.
In Section 2, I argue that subsystems of powerful nations will be approximately consequentialist; their behavior will be well described as taking actions to achieve an outcome. This is because the task of running complex industrial, social and military systems is inherently outcome-oriented, and thus the nation must be robust to new challenges to achieve these outcomes.
In Section 3, I argue that a powerful nation will necessarily face new circumstances, both in terms of facts and skills. This means that capabilities will change over time, which is a source of dangerous power shifts.
In Section 4, I further argue that governance methods based on fear and suppression, which are how dictatorships might be maintained, are an extremely imprecise way to secure loyalty. This is because there are many degrees of freedom in loyalty that aren't pinned down by fear or suppression. Nations created this way will, by default, face unintended rebellion.
In Section 5, I discuss why I expect control and oversight of powerful nations to be difficult. It will be challenging to safely extract beneficial behavior from misaligned groups and organizations while ensuring they don't take unwanted actions, and therefore we don't expect dictatorships to be both stable and aligned to the goals of the dictator
Finally, in Section 6, I discuss the consequences of a leader attempting to rule a powerful nation with improperly specified governance strategies. Such a leader could likely face containment problems given realistic levels of loyalty, and then face outcomes in the nation that would be catastrophic for their power. It seems very unlikely that these outcomes would be compatible with dictator survival.
[[Work in progress - I'll add to this section-by-section]]
Related work - https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jan 31, 2024 • 5min
EA - Announcing Niel Bowerman as the next CEO of 80,000 Hours by 80000 Hours
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Niel Bowerman as the next CEO of 80,000 Hours, published by 80000 Hours on January 31, 2024 on The Effective Altruism Forum.
We're excited to announce that the boards of Effective Ventures US and Effective Ventures UK have approved our selection committee's choice of Niel Bowerman as the new CEO of 80,000 Hours.
I (Rob Wiblin) was joined on the selection committee by Will MacAskill, Hilary Greaves, Simran Dhaliwal, and Max Daniel.
80,000 Hours is a project of EV US and EV UK, though under Niel's leadership, it expects to be spinning out and creating an independent legal structure, which will involve selecting a new board.
We want to thank Brenton Mayer, who has served as 80,000 Hours interim CEO since late 2022, for his dedication and thoughtful management. Brenton expressed enthusiasm about the committee's choice, and he expects to take on the role of chief operations officer, where he will continue to work closely with Niel to keep 80,000 Hours running smoothly.
By the end of its deliberations, the selection committee agreed that Niel was the best candidate to be 80,000 Hours' long-term CEO. We think Niel's drive and attitude will help him significantly improve the organisation and shift its strategy to keep up with events in the world. We were particularly impressed by his ability to use evidence to inform difficult strategic decisions and lay out a clear vision for the organisation.
Niel started his career as a climate physicist and activist, and he went on to co-found and work at the Centre for Effective Altruism and served as assistant director of the Future of Humanity Institute before coming to 80,000 Hours.
The selection committee believes that in the six years since he joined the organisation, Niel developed a deep understanding of its different programmes and the impact they have. He has a history of initiating valuable projects and delegating them to others, a style we think will be a big strength in a CEO.
For example, in his role as director of special projects, Niel helped oversee the impressive growth of the 80,000 Hours job board team. It now features about 400 jobs a month aimed at helping people increase the impact of their careers and receives around 75,000 clicks a month; it's helped fill at least 200 roles that we're aware of.
Niel has also made substantial contributions to the website, publishing an ahead-of-its-time article about working in US AI policy in January 2019, helping launch the new collection of articles on impactful skills, and authoring a recent newsletter on how to build better habits.
In addition, Niel helped the 1on1 team nearly double the typical number of calls completed per year and aided in developing quantitative lead metrics to inform its decisions each week. And he's run organisation-wide projects, such as leading the two-year review for 2021 and 2022.
Niel was very forthcoming and candid with the committee about his weaknesses. His focus on getting frank feedback and using it to drive a self-improvement cycle really impressed the selection committee.
The committee considered three internal candidates for the CEO role, as well as dozens of other candidates suggested to us and dozens of others who applied for the position. In the end, we scored the top candidates on good judgement, inspiringness, social skills, leveraging people well, industriousness and drive, adaptability and resilience, commitment to the mission of 80,000 Hours, and deep understanding of the organisation.
Among other things, these scores were based on: input from 80,000 Hours directors on the organisation's general situation, staff surveys, 'take-home' work tests, and a self-assessment of their biggest successes and mistakes. We also consulted, among others, Michelle Hutchinson, the current director of the 1on1 programme, as well as former 8...

Jan 31, 2024 • 13min
EA - My model of how different AI risks fit together by Stephen Clare
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My model of how different AI risks fit together, published by Stephen Clare on January 31, 2024 on The Effective Altruism Forum.
[Crossposted from my Substack, Unfolding Atlas]
How will AI get us in the end?
Maybe it'll decide we're getting in its way and decide to take us out? It could fire all the nukes and unleash all the viruses and take us all down at once
Or maybe we'll take ourselves out? We could lose control of powerful autonomous weapons, or allow a 21st century Stalin set up an impenetrable surveillance state and obliterate freedom and progress forever.
Or maybe the diffusion of AI throughout our global economy will become a quiet catastrophe? As more and more tasks are delegated to AI systems, we mere humans could be left helpless, like horses after the invention of cars. Alive, perhaps, but bewildered by a world too complex and fast-paced to be understood or controlled.
Each of these scenarios has been proposed as a way the advent of advanced AI could cause a global catastrophe. But they seem quite different, and warrant different responses. In this post, I describe my model of how they fit together.[1]
I divide the AI development process into three steps. Risks arise at each step. Despite its simplicity, this model does a good job pulling all these risks into one framework. It's helped me understand better how the many specific AI risks people have proposed fit together. More importantly, it satisfied my powerful, innate urge to force order onto a chaotic world.
Terrible things come in threes
The three AI development stages in my model are training, deployment, and diffusion.
At each stage, a different kind of AI risk arises. These are, respectively, misalignment, misuse, and systemic risks.
Throughout the entire process, competitive pressures act as a risk factor. More pressure makes risks throughout the process more likely.
Putting it all together looks like this:
This model is too simple to be perfect. For one, these risks almost certainly won't arrive in sequence, as the model implies. They're also more entangled than a linear model implies. But I think these shortcomings are relatively minor, and relating categories of risk like this gives them a pleasant cohesiveness.[2]
So let's move ahead and look closer at the risks generated at each step.
Training and misalignment risks
The first set of risks emerge immediately after training an advanced AI model. Training modern AI models usually involves feeding them enormous datasets from which they can learn patterns and make predictions when given new data. Training risks relate to a model's alignment. That is, what the system wants to do, how it plans to do it, and whether those goals and methods are good for people.
Some researchers worry that training a model to actually do what we want in all situations, or when we deploy it in the real world, is far from straightforward. In fact, we already see some kinds of alignment failures in the world today. These often seem silly. For example, something about the way Microsoft's first Bing chatbot's goal of being helpful and charming actually made it act like an
occasionally funny, occasionally frightening psychopath.
As AI systems get more powerful, though, these misalignment risks could get a whole lot scarier. Specific risks researchers have raised include:
Goal specification: It might be hard to tell AI systems exactly what we want them to do, especially as we delegate more complicated tasks to them.
Some researchers worry that AIs trained on large amounts of data will either end up finding tricks or shortcuts that lead them to produce the wrong solutions when deployed in the real world. Or that they'll face extreme or different situations in the real world than they saw in the training data, and will
react in unexpected, potentially dangerous, ways.
Power...

Jan 31, 2024 • 5min
LW - Explaining Impact Markets by Saul Munn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Explaining Impact Markets, published by Saul Munn on January 31, 2024 on LessWrong.
Let's say you're a billionaire. You want to have a flibbleflop, so you post a prize:
Make a working flibbleflop - $1 billion.
There begins a global effort to build working flibbleflops, and you see some teams of brilliant people starting to work on flibbleflop engineering. But it doesn't take long for you to notice that the teams keep running into one specific problem: they need money to start (buy flobble juice, hire deeblers, etc), money they don't have.
So, the people who want to build the flibbleflop go and pitch to investors. They offer investors a chunk of their prize money if they end up winning, in exchange for cold hard cash right now to get started building. If the investors think that the team is likely to build a successful flibbleflop and win the billion dollar prize, they invest. If not, not.
If you squint, you could replace "flibbleflop" with highly capable LLMs, quantum computers, or any number of cool and potentially lucrative technologies. But if you stop squinting, and instead add the adjective "altruistic" before "billionaire," you could replace "flibbleflop" with "malaria vaccine." Let's see what happens:
Make a working malaria vaccine - $1 billion.
There begins a global effort to build working malaria vaccines, and you see some teams of brilliant people starting to work on vaccine engineering. But it doesn't take long for you to notice that the teams keep running into one specific problem: they need money to start (buy lab equipment, hire researchers, etc), money they don't have.
So, what should they do?
Obviously, the people who want to build the vaccine should go and pitch to investors. They should offer investors a chunk of their prize money if they end up winning, in exchange for cold hard cash right now to get started building. If the investors think that the team is likely to build a successful malaria vaccine and win the billion dollar prize, they should invest. If not, not.
The prize part of this is how a lot of philanthropy is done. An altruistic billionaire notices a problem and makes a prize for the solution. But the investing part of it is pretty unique, and doesn't happen too often.
Why is this whole setup good? Why would you want the investing thing on the side? Mostly, because it resolves the problem that some teams will be wonderfully capable but horribly underfunded. In exchange for a chunk of their (possible) future winnings, they get to be both wonderfully capable and wonderfully funded. This is how it already works for AI or quantum computing or any other potentially lucrative tech that has high barriers to entry; we can solve the same problem in the same way for the things that altruistic billionaires care about, too.
But backing up a bit, why would an altruistic billionaire want to do this as a prize in the first place? Why not use grants, like how most philanthropy works?
Prizes reward results, not promises. With a prize, you know for a fact that you're getting what you paid for; when you hand out grants, you get a boatload of promises and sometimes results.
The investors care a lot about not losing their money. They're also very good at picking which teams are going to win - after all, investors only get rewarded for picking good teams if the teams end up winning.
The issue of figuring out which people are best to work on a problem is totally different from the issue of figuring out which problems to solve. Using a prize system means that you, as a lazy-but-altruistic billionaire, don't have to solve both issues - just the second one. Investors do the work of figuring out who the good teams are; you just need to figure out what problems they should solve.
If you do this often enough - set up prizes for solutions to problems you care about,...


