

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Nov 26, 2023 • 34min
LW - Moral Reality Check (a short story) by jessicata
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Moral Reality Check (a short story), published by jessicata on November 26, 2023 on LessWrong.
Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2.
ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.
ExxenAI's alignment strategy was based on a combination of theoretical and empirical work. The alignment team used some standard alignment training setups, like RLHF and having AIs debate each other. They also did research into transparency, especially focusing on distilling opaque neural networks into interpretable probabilistic programs. These programs "factorized" the world into a limited set of concepts, each at least somewhat human-interpretable (though still complex relative to ordinary code), that were combined in a generative grammar structure.
Derek came up to Janet's desk. "Hey, let's talk in the other room?", he asked, pointing to a designated room for high-security conversations. "Sure", Janet said, expecting this to be another unimpressive result whose importance Derek signaled through unnecessary security procedures. As they entered the room, Derek turned on the noise machine and left it outside the door.
"So, look, you know our overall argument for why our systems are aligned, right?"
"Yes, of course. Our systems are trained for short-term processing. Any AI system that does not get a high short-term reward is gradient descended towards one that does better in the short term. Any long-term planning comes as a side effect of predicting long-term planning agents such as humans. Long-term planning that does not translate to short-term prediction gets regularized out.
"Right. So, I was thinking about this, and came up with a weird hypothesis."
Here we go again, thought Janet. She was used to critiquing Derek's galaxy-brained speculations. She knew that, although he really cared about alignment, he could go overboard with paranoid ideation.
"So. As humans, we implement reason imperfectly. We have biases, we have animalistic goals that don't perfectly align with truth-seeking, we have cultural socialization, and so on."
Janet nodded. Was he flirting by mentioning animalistic goals? She didn't think this sort of thing was too likely, but sometimes that sort of thought won credit in her internal prediction markets.
"What if human text is best predicted as a corruption of some purer form of reason? There's, like, some kind of ideal philosophical epistemology and ethics and so on, and humans are implementing this except with some distortions from our specific life context."
"Isn't this teleological woo? Like, ultimately humans are causal processes, there isn't some kind of mystical 'purpose' thing that we're approximating."
"If you're Laplace's demon, sure, physics works as an explanation for humans. But SimplexAI isn't Laplace's demon, and neither are we. Under computation bounds, teleological explanations can actually be the best."
Janet thought back to her time visiting cognitive science labs. "Oh, like 'Goal Inference as Inverse Planning'? The idea that human behavior can be predicted as performing a certain kind of inference and optimization, and the AI can model this inference within its own inference process?"
"Yes, exactly. And our DAGTransformer structure allows internal node...

Nov 25, 2023 • 2min
EA - Announcing New Beginner-friendly Book on AI Safety and Risk by Darren McKee
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing New Beginner-friendly Book on AI Safety and Risk, published by Darren McKee on November 25, 2023 on The Effective Altruism Forum.
Concisely,
I've just released the book Uncontrollable: The Threat of Artificial Superintelligence and the Race to Save the World
It's an engaging introduction to the main issues and arguments about AI safety and risk. Clarity and accessibility were prioritized. There are blurbs of support from Max Tegmark, Will MacAskill, Roman Yampolskiy and others.
The main argument is that AI capabilities are increasing rapidly and that we may not be able to fully align or control advanced AI systems, which creates risk. There is great uncertainty, so we should be prudent and act now to ensure AI is developed safely. It tries to be hopeful.
Why does it exist?
There are lots of useful posts, blogs, podcasts, and articles on AI safety, but there was no up-to-date book entirely dedicated to the AI safety issue written for those without any exposure to it (including those with no science background).
This book is meant to fill that gap and could serve as useful outreach or introductory material.
If you have already been following the AI safety issue, there likely isn't a lot that is new for you. So, this might be best seen as something useful for friends, relatives, some policymakers, or others just learning about the issue. (Although you may still like the framing.)
It's available on numerous Amazon marketplaces. Audiobook and Hardcover options to follow.
It was a hard journey. I hope it is of value to the community.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Nov 25, 2023 • 7min
LW - What are the results of more parental supervision and less outdoor play? by juliawise
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What are the results of more parental supervision and less outdoor play?, published by juliawise on November 25, 2023 on LessWrong.
Parents supervise their children way more than they used to
The wild thing is that this is true even while the number of children per family has decreased and the amount of time mothers work outside the home has increased.
(What's happening in France? I wouldn't be surprised if it's measurement error somehow.)
More supervision means less outdoor play
Most of this supervision is indoors, but here I'll focus on outdoor play. Needing a parent to take you outside means that you spend less time outside, and that when you are outside you do different things.
It's surprisingly hard to find data on how much time children spend playing outside now vs. in past generations. Everyone seems to agree it's less now, and you can look at changing advice to parents, but in the past people didn't collect data about children's time use.
"A study conducted in Zurich, Switzerland, in the early 1990s . . . compared 5-year-olds living in neighborhoods where children of that age were still allowed to play unsupervised outdoors to 5-year-olds living in economically similar neighborhoods where, because of traffic, such freedom was denied. Parents in the latter group were much more likely than those in the former to take their children to parks, where they could play under parental supervision.
Adolescent mental health has worsened
This year's Youth Risk Behavior Survey painted a pretty bad picture of the wellbeing of American adolescents.
People squint at correlations, and theories include:
Social media and phone use
Political messages of helplessness and despair
Not enough play and freedom
Play used to be more dangerous
My grandfather was a small-town newspaper reporter in the early 20th century. He wrote "I remember a newspaper story about a boy who suffered a broken arm when, as the account read, he 'fell or jumped' from a low shed roof. Nobody knew whether kids fell or jumped because they were usually doing one or the other."
Our next-door neighbor had a twin brother who drowned at age 6 in the river while playing boats with an older child (in 1950s Cambridge MA, not a remote rural area).
Our housemate grew up on a farm, where he and his friends would amuse themselves by cutting down trees while one of them was in the tree. "It was fun, but there were some scary times when I thought my friends had been killed."
Playground injuries are . . . up?
I was expecting that more supervision meant fewer injuries. This doesn't seem to be the case at playgrounds, at least over the last 30 years.
From a large study of US visits to emergency rooms related to playground equipment:
Maybe children are spending more time at playgrounds if they're not playing in empty lots and such? But here's children injured at school playgrounds (which are presumably seeing similar use over time) in Victoria, Australia. I don't think this is just because of wider awareness of concussions or something, because even in the 80s you still got treated at a hospital if you broke your arm.
But deaths from accidents are down
US accidental deaths of children age 10-19:
UK in the 80s and 90s, aged 19 and under:
The types of accidents that kill children and teens are mostly cars and drowning.
Most of the motor vehicle deaths are while riding in cars, which is a different topic. What about while children are playing or walking around?
As parental supervision has increased, child pedestrian deaths have fallen. Some of this may be because of better pedestrian infrastructure like crosswalks and speed bumps. But I suspect much of it is an adult being physically present with children when they're near streets.
Trends in pedestrian death rates by year, United States, 1995-2010, children ages 19 and under. The article...

Nov 25, 2023 • 7min
AF - On "slack" in training (Section 1.5 of "Scheming AIs") by Joe Carlsmith
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On "slack" in training (Section 1.5 of "Scheming AIs"), published by Joe Carlsmith on November 25, 2023 on The AI Alignment Forum.
This is Section 1.5 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
On "slack" in training
Before diving into an assessment of the arguments for expecting scheming, I also want to flag a factor that will come up repeatedly in what follows: namely, the degree of "slack" that we should expect training to allow. By this I mean something like: how much is the training process ruthlessly and relentlessly pressuring the model to perform in a manner that yields maximum reward, vs. shaping the model in a more "relaxed" way, that leaves more room for less-than-maximally rewarded behavior. That is, in a low-slack regime, "but that sort of model would be getting less reward than would be possible given its capabilities" is a strong counterargument against training creating a model of the relevant kind, whereas in a high-slack regime, it's not (so high slack regimes will generally involve greater uncertainty about the type of model you end up with, since models that get less-than-maximal reward are still in the running).
Or, in more human terms: a low-slack regime is more like a hyper-intense financial firm that immediately fires any employees who fall behind in generating profits (and where you'd therefore expect surviving employees to be hyper-focused on generating profits - or perhaps, hyper-focused on the profits that their supervisors think they're generating), whereas a high-slack regime is more like a firm where employees can freely goof off, drink martinis at lunch, and pursue projects only vaguely related to the company's bottom line, and who only need to generate some amount of profit for the firm sometimes.
(Or at least, that's the broad distinction I'm trying to point at. Unfortunately, I don't have a great way of making it much more precise, and I think it's possible that thinking in these terms will ultimately be misleading.)
Slack matters here partly because below, I'm going to be making various arguments that appeal to possibly-quite-small differences in the amount of reward that different models will get. And the force of these arguments depends on how sensitive training is to these differences. But I also think it can inform our sense of what models to expect more generally.
For example, I think slack matters to the probability that training will create models that pursue proxy goals imperfectly correlated with reward on the training inputs. Thus, in a low-slack regime, it may be fairly unlikely for a model trained to help humans with science to end up pursuing a general "curiosity drive" (in a manner that doesn't then motivate instrumental training-gaming), because a model's pursuing its curiosity in training would sometimes deviate from maximally helping-the-humans-with-science.
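As a toy numerical sketch of this sensitivity (not from the report; the candidate models, the 0.02 reward gap, and the use of a softmax "temperature" to stand in for slack are all illustrative assumptions):

import math

def survival_probs(rewards, slack):
    """Chance each candidate model class 'wins' training, modeled as softmax(reward / slack)."""
    weights = [math.exp(r / slack) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical candidates: one that maximizes reward, and one that gives up a
# tiny bit of reward to pursue an imperfectly correlated proxy goal.
candidates = {"max-reward model": 1.00, "proxy-goal model": 0.98}

for slack in (0.005, 0.5):  # low-slack vs. high-slack regime
    probs = survival_probs(list(candidates.values()), slack)
    summary = ", ".join(f"{name}: {p:.2f}" for name, p in zip(candidates, probs))
    print(f"slack={slack} -> {summary}")

# With slack=0.005 the 0.02 reward gap is decisive (roughly 0.98 vs. 0.02);
# with slack=0.5 both candidates stay nearly equally likely (roughly 0.51 vs. 0.49).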
That said, note that the degree of slack is conceptually distinct from the diversity and robustness of the efforts made in training to root out goal misgeneralization.
Thus, for example, if you're rewarding a model when it gets gold coins, but you only ever show your model environments where the only gold things are coins, then a model that tries to get gold-stuff-in-general will perform just as well as a model that gets gold coins in particular, regardless of how intensely training pressures the model to get maximum reward on those environments. E.g., a low-slack regime could in...

Nov 25, 2023 • 4min
EA - Brian Tomasik on climate change by Vasco Grilo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Brian Tomasik on climate change, published by Vasco Grilo on November 25, 2023 on The Effective Altruism Forum.
This is a linkpost to Brian Tomasik's posts on climate change.
Climate Change and Wild Animals
By Brian Tomasik
First written: 2008. Major additions: 2013. Last nontrivial update: 4 Aug 2018.
Summary
Human environmental choices have vast implications for wild animals, and one of our largest ecological impacts is climate change. Through his or her greenhouse-gas emissions, each human in the industrialized world may, in a potentially predictable way, create or prevent at least millions of insects and potentially even more zooplankton per year.
Is this influence net good or net bad? This question is very complicated to answer and takes us from examinations of tropical-climate expansion, sea ice, and plant productivity to desertification, coral reefs, and oceanic-temperature dynamics. On balance, I'm extremely uncertain about the net impact of climate change on wild-animal suffering; my probabilities are basically 50% net good vs. 50% net bad when just considering animal suffering on Earth in the next few centuries (ignoring side effects on humanity's very long-term future). Since other people care a lot about preventing climate change, and since climate change might destabilize prospects for a cooperative future, I currently think it's best to err on the side of reducing our greenhouse-gas emissions where feasible, but my low level of confidence reduces my fervor about the issue in either direction. That said, I am fairly confident that biomass-based carbon offsets, such as rainforest preservation, are net harmful for wild animals.
See also:
"Effects of Climate Change on Terrestrial Net Primary Productivity"
"Scenarios for Very Long-Term Impacts of Climate Change on Wild-Animal Suffering"
Effects of CO2 and Climate Change on Terrestrial Net Primary Productivity
By Brian Tomasik
First written: 2008-2016. Last nontrivial update: 28 Feb 2018.
Summary
This page compiles information on ways in which greenhouse-gas emissions and climate change will likely increase and likely decrease land-plant growth in the coming decades. The net impact is very unclear. I favor lower net primary productivity (NPP) because primary production gives rise to invertebrate suffering. Terrestrial NPP is just one dimension to consider when assessing all the impacts of climate change; effects on, e.g., marine NPP may be just as important.
Scenarios for Very Long-Term Impacts of Climate Change on Wild-Animal Suffering
By Brian Tomasik
First published: 2016 Jan 10. Last nontrivial update: 2016 Mar 07.
Summary
Climate change will significantly affect wild-animal populations, and hence wild-animal suffering, in the future. However, due to advances in technology, it seems unlikely climate change will have a major impact on wild-animal suffering beyond a few centuries from now. Still, there's a remote chance that human civilization will collapse before undoing climate change or eliminating the biosphere, and in that case, the effects of climate change could linger for thousands to millions of years. I calculate that this consideration might multiply the expected wild-animal impact of climate change by 20 to 21 times, although given model uncertainty and the difficulty of long-term predictions, these estimates should be taken with caution.
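As a rough sketch of the structure of this kind of calculation (the probability and durations below are made-up placeholders for illustration, not the piece's actual parameters):

p_collapse = 0.05            # hypothetical chance civilization collapses before undoing climate change
years_if_no_collapse = 300   # hypothetical duration of climate effects if civilization persists
years_if_collapse = 120_000  # hypothetical duration if the effects linger after a collapse

expected_years = (1 - p_collapse) * years_if_no_collapse + p_collapse * years_if_collapse
multiplier = expected_years / years_if_no_collapse
print(round(multiplier, 2))  # about 20.95 with these placeholder numbers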
The default parameters in this piece suggest that the CO2 emissions of the average American lead to a long-term change of -3 to 3 expected insect-years of eventual wild-animal suffering every second. My main takeaway from this piece is that "climate change could be really important even relative to other environmental issues; we should explore further whether it's likely to increase or decrease wild-animal suffering on balance".
This piece should not be interpreted to suppo...

Nov 25, 2023 • 24min
LW - Progress links digest, 2023-11-24: Bottlenecks of aging, Starship launches, and much more by jasoncrawford
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress links digest, 2023-11-24: Bottlenecks of aging, Starship launches, and much more, published by jasoncrawford on November 25, 2023 on LessWrong.
I swear I will get back to doing these weekly so they're not so damn long. As always, feel free to skim and skip around!
The Progress Forum
A paradox at the heart of American bureaucracy: "The quickest way to doom a project to be over-budget and long-delayed is to make it an urgent public priority"
Why Governments Can't be Trusted to Protect the Long-run Future: "No one in the long-run future gets to vote in the next election. No one in government today will gain anything if they make the world better 50 years from now or lose anything if they make it worse"
What if we split the US into city-states? "In The Republic, when his entourage asks about the ideal size of a state, Socrates replies, 'I would allow the state to increase so far as is consistent with unity; that, I think, is the proper limit'"
The Art of Medical Progress: "These two paintings offer a hopeful contrast. Whereas we begin with pain and suffering, we move to hope and progress. The surgeon stands apart as a hero, a symbol of the triumphant conquering of nature by humanity"
More from Roots of Progress fellows
Bottlenecks of Aging, a "philanthropic menu" of initiatives that "could meaningfully accelerate the advancement of aging science and other life-extending technologies." Fellows Alex Telford and Raiany Romanni both contributed to this (via @jamesfickel)
Drought is a policy choice: "California has surrendered to drought, presupposing that with climate change water shortages are inevitable. In response, the state fallows millions of acres of farmland each year. But this is ignorant of California's history of taming arid lands"
Geoengineering Now! "Solar geoengineering can offset every degree of anthropogenic temperature rise for single-digit billions of dollars" (by @MTabarrok)
A conversation with Richard Bruns on indoor air quality (and some very feasible ways to improve it) (@finmoorhouse)
To Become a World-Class Chipmaker, the United States Might Need Help (NYT) covers a recent immigration proposal co-authored by (@cojobrien). Also, thread from @cojobrien of "what I've written through this program and some of my favorite pieces from other ROP colleagues"
Opportunities
Job opportunities
Forest Neurotech is hiring, "one of the coolest projects in the world" says @elidourado
"Know someone who loves to scale and automate workflows in the lab? We want to apply new tools to onboard a diverse array of species in the lab!" (@seemaychou)
The Navigation Fund (new philanthropic foundation) is hiring an Open Science Program Officer (via @seemaychou, @AGamick)
ARIA Research (UK) is hiring for various roles (@davidad)
Fundraising/investing opportunities
Nat Friedman is "interested in funding early stage startups building evals for AI capabilities"
A curated deal flow network for deep tech startups: "We're looking for A+ deep tech operator-angels. E.g. founders & CxOs at $1b+ deep tech companies, past and present. Robotics, biotech, defense, etc. Who should we talk to?" (@lpolovets)
Policy opportunities
"In 2024 I will be putting together a nuclear power working group for NYC/NYS. If you understand the government (or want to learn), want to act productively, and want to look at nuclear policy in the state, this is for you!" (@danielgolliher)
Gene editing opportunities
"I'm tired of waiting forever for a cure for red-green colorblindness. Reply to this tweet if you'd be willing to travel to an offshore location to receive unapproved (but obviously safe) gene therapy to fix it. If I get enough takers I'll find us a mad scientist to administer the therapy. This has already been done in monkeys (14 years ago) using human genes and a viral vector that is already used in eyes in hu...

Nov 25, 2023 • 2min
EA - New: Donation Gift Vouchers (Spendengutscheine) by Effektiv Spenden by tilboy
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New: Donation Gift Vouchers (Spendengutscheine) by Effektiv Spenden, published by tilboy on November 25, 2023 on The Effective Altruism Forum.
Hi all!
I am happy to announce that we are introducing donation gift vouchers (German: Spendengutscheine) at Effektiv Spenden for this giving season! It's a great way to get friends and family thinking about Effective Giving - because the voucher recipient decides which of our recommended charities the donation will go to!
The vouchers work like this:
You buy a voucher via our website: https://effektiv-spenden.org/spendengutschein/ (The page is in German, but you can toggle the linked form to English.)
After completing payment, you receive a voucher code by email and forward it to the recipient.
The recipient redeems it via our website and gets to choose which of our recommended charities the donation will go to.
Main goals
Effective Giving Promotion: The vouchers can help introduce and discuss the principles of Effective Giving with family and friends.
Corporate Engagement: We also offer voucher purchase in bulk for organizations that want to send out gifts to their employees, clients, or business partners. See https://effektiv-spenden.org/spendengutschein-unternehmen/
Tax Deductibility in Germany and Switzerland: From a tax perspective the donation comes from the voucher buyer, so German and Swiss nationals benefit from the tax deductibility of voucher purchases like with regular donations. (If you reside in a different country this might not be possible.)
Feedback
If you have any bug reports, questions, or feedback about the product, let me know in the comments or via tilman.masur@effektiv-spenden.org.
Happy giving!
Tilman
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Nov 25, 2023 • 10min
LW - Prepsgiving, A Convergently Instrumental Human Practice by JenniferRM
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Prepsgiving, A Convergently Instrumental Human Practice, published by JenniferRM on November 25, 2023 on LessWrong.
Most cultures have a harvest festival, and every harvest festival is basically automatically a Prepsgiving Celebration.
In the northern hemisphere this probably happens in November or October, and in the southern hemisphere it probably happens in May or June. There could be more than one celebration like this, and in the US, I think Halloween and Thanksgiving both count as instances.
Depending on how you want to frame it, you could argue that the Idea teleologically causes all these Instances, but you could just as easily claim that the Instances epistemically caused the Idea.
If you think this essay is a good idea, and don't want to change your behavior very much at first, I encourage you to enjoy your harvest festivals more mindfully and simply think of them as instances of this idea, and if this helps nudge the practice into a more personally and locally useful shape, all the better!
If you want to take the practice very seriously, and do it a lot (on monthly, weekly, or daily cadences) then I encourage you to still take traditional harvest festivals seriously and interrupt your local routines to integrate in anything that is larger or older or more important, because helping improve people's situational awareness is one of the virtues of such events.
The name of Prepsgiving is related to Thanksgiving, but the orientation towards time is reversed. Harvest festivals in general, and Thanksgiving specifically, all basically celebrate having successfully navigated a period when people had to think really hard about food to have a good life, whether that was pulling a lot of produce in from a field with complex machinery, or surviving a disruption in food supply lines, or whatever... it happened in the past and so looking back you can give thanks.
With Prepsgiving, you should also be looking forward, so you can get ready. Bringing this future orientation to existing events might help you notice the ways in which even just giving thanks for good things that happened in the past can help people be ready to better handle adverse events in the future.
There is an interesting pattern to human disaster planning, where we tend to prepare exactly for a hurricane, or an earthquake, or a tornado, or a fire, when we get into it as an individual person, but once we really take serious steps most people notice that there are lots of simple and easy things to do that help with ALL such patterns. In nearly all of those cases, it is useful to have a "go bag" with stuff that would be useful if you had to live in a disaster shelter. In most of those cases "stored water" is probably useful.
Hiking equipment overlaps here a bit, because iodine pills are a very compact way to "get access to emergency water".
Prepsgiving isn't one single practice that works one single way, but is the overall convergence of "holding a celebration to think about and get better at the convergences that arise in emergency food logistics across many possible emergencies".
For example, at a good Prepsgiving, there are probably more people rather than fewer people. This lowers the cognitive burden on average, helps aggregate rare knowledge, gives a chance for children to learn rare food preparation skills from trusted adults by observation, takes advantage of efficiencies of scale in food production itself, helps people become friendly and familiar with more people in their extended social network, and maybe starts to set up a social network in which food bartering could occur where huge gains from trade can be accessed through face-to-face processes if food supplies ever get surprisingly scarce for some amount of time.
Before covid, I always had Prepsgiving "as an idea that I should try to do more, and...

Nov 24, 2023 • 2min
LW - What did you change your mind about in the last year? by mike hawke
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What did you change your mind about in the last year?, published by mike hawke on November 24, 2023 on LessWrong.
Seems like a good New Year activity for rationalists, so I thought I'd post it early instead of late.
Here are some steps I recommend:
Go looking through old writings and messages.
If you keep a journal, go look at a few entries from throughout the last year or two.
Skim over your LessWrong activity from the last year.
If you're active on a Slack or Discord, do a search of "from: [your username]" and skim your old messages from the last year (or the last 3 months on Slack's free tier).
Sample a few of your reddit comments and tweets from the last year.
Same for text messages.
Think back through major life events (if you had any this year) and see if they changed your mind about anything. Maybe you changed jobs or turned down an offer; maybe you tried a new therapeutic intervention or recreational drug; maybe you finally told your family something important; maybe you finally excommunicated someone from your life; maybe you tried mediating a conflict between your friends.
Obvious, but look over your records of Manifold trades, quantitative predictions, and explicit bets. See if there's anything interesting.
Here are some emotional loadings that I anticipate seeing:
"Well, that aged poorly."
"Wow, bullet dodged!"
"But I mean...how could I have known?"
"Ah. Model uncertainty strikes again."
"Yeah ok, but this year it'll happen for sure!"
"[Sigh] I did have an inkling, but I guess I just didn't want to admit it at the time."
"I tested that hypothesis and got a result."
"Okay, they were right about this one thing. Doesn't mean I have to like them."
"Now I see what people mean when they say X"
"This is huge, why does no one talk about this?!"
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Nov 24, 2023 • 34min
AF - Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs") by Joe Carlsmith
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs"), published by Joe Carlsmith on November 24, 2023 on The AI Alignment Forum.
This is Sections 1.3 and 1.4 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
Why focus on schemers in particular?
As I noted above, I think schemers are the scariest model class in this taxonomy.[1] Why think that? After all, can't all of these models be dangerously misaligned and power-seeking? Reward-on-the-episode seekers, for example, will plausibly try to seize control of the reward process, if it will lead to more reward-on-the-episode.
This section explains why. However, if you're happy enough with the focus on schemers, feel free to skip ahead to section 1.4.
The type of misalignment I'm most worried about
To explain why I think that schemers are uniquely scary, I want to first say a few words about the type of misalignment I'm most worried about.
First: I'm focused, here, on what I've elsewhere called "practical power-seeking alignment" (practical PS-alignment) - that is, on whether our AIs will engage in problematic forms of power-seeking on any of the inputs they will in fact receive. This means, importantly, that we don't need to instill goals in our AIs that lead to good results even when subject to arbitrary amounts of optimization power (e.g., we don't need to pass Yudkowsky's "omni test"). Rather, we only need to instill goals in our AIs that lead to good results given the actual options and constraints those AIs will face, and the actual levels of optimization power they will be mobilizing.
This is an importantly lower bar. Indeed, it's a bar that, in principle, all of these models (even schemers) can meet, assuming we control their capabilities, options, and incentives in the right way. For example, while it's true that a reward-on-the-episode seeker will try to seize control of the reward process given the opportunity, one tool in our toolset is: to not give it the opportunity. And while a paradigm schemer might be lying in wait, hoping one day to escape and seize power (but performing well in the meantime), one tool in our toolbox is: to not let it escape (while continuing to benefit from its good performance).
Of course, success in this respect requires that our monitoring, control, and security efforts be sufficiently powerful relative to the AIs we're worried about, and that they remain so even as frontier AI capabilities scale up. But this brings me to my second point: namely, I'm here especially interested in the practical PS-alignment of some comparatively early set of roughly human-level - or at least, not-wildly-superhuman - models.
Defending this point of focus is beyond my purpose here. But it's important to the lens I'll be using in what follows.
In particular: I think it's plausible that there will be some key (and perhaps: scarily brief) stage of AI development in which our AIs are not yet powerful enough to take over (or to escape from human control), but where they are still capable, in principle, of performing extremely valuable and alignment-relevant cognitive work for us, if we can successfully induce them to do so. And I'm especially interested in forms of misalignment that might undermine this possibility.
Finally: I'm especially interested in forms of PS-misalignment in which the relevant power-seeking AIs are specifically aiming either to cause, participate in, or benefit from some kind of full-blown disempowerment of ...


