

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Nov 29, 2023 • 4min
EA - Save the date - EAGxLATAM 2024 by Daniela Tiznado
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Save the date - EAGxLATAM 2024, published by Daniela Tiznado on November 29, 2023 on The Effective Altruism Forum.
[Spanish Version Below]
EAGxLATAM 2024 will take place in Mexico City at the
Museo de las Ciencias Universum, from 16 to 18 February 2024.
Although this event is primarily for the Spanish- and Portuguese-speaking EA communities, applications from experienced members of the international community are very welcome!
What is EAGx?
EAGx conferences are locally organized
EA conferences designed for a wider audience, primarily for people:
Familiar with the core ideas of Effective Altruism
Interested in learning more about possible career journeys
Who is this conference for?
You might want to attend this conference if you:
Are from or reside in Latin America and are eager to keep learning about EA and connecting with individuals who share similar interests.
Are an experienced community member from within or outside Latin America who is interested in expanding your EA network.
Vision for the conference
Our main goal is to generate meaningful connections between EAs. If you're new to EA, this conference will help you discover the next steps on your EA journey and be part of a supportive community dedicated to doing good.
What to expect
The event will feature:
Talks and workshops on pressing problems that the EA community is currently trying to address
Most talks will be conducted in English, except for a small number of talks and workshops that will be given in Portuguese/Spanish
The opportunity to meet and share advice with other EAs in the community
Social events around the conference
Application process
APPLY HERE!
If you're unsure about whether to apply, err on the side of applying.
See you in Mexico City!
The Organizing Team 2024
Spanish Version
Save the date - EAGxLATAM 2024!
EAGxLATAM 2024 will take place in Mexico City at the Museo de las Ciencias Universum, from 16 to 18 February 2024.
What is EAGx?
EAGx conferences are locally organized conferences designed for a wider audience, primarily for people:
Familiar with the core ideas of effective altruism
Interested in learning more about possible career paths
Who is this conference for?
You might want to attend this conference if you:
Reside in Latin America, are new to EA, and are eager to keep learning and connecting with people who share similar interests.
Are an experienced community member from outside Latin America and are interested in expanding your EA network.
Vision for the conference
Our main goal is to generate meaningful connections between EAs. If you're new to EA, this conference will help you discover your next career steps and become part of a supportive community dedicated to having the greatest possible impact.
What to expect
The event will feature:
Talks and workshops on problems that the EA community is currently trying to address.
Most talks will be held in English, except for a small number that will be given in Portuguese/Spanish.
The opportunity to meet and share advice with other EAs in the community.
Social events around the conference.
Application process
Apply HERE!
If you're unsure about whether to apply, we still recommend doing so.
We look forward to seeing you in Mexico City!
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Nov 29, 2023 • 36min
LW - AISC 2024 - Project Summaries by NickyP
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AISC 2024 - Project Summaries, published by NickyP on November 29, 2023 on LessWrong.
Apply to AI Safety Camp 2024 by 1st December 2023. All mistakes here are my own.
Below are some summaries for each project proposal, listed in order of how they appear on the website. These are edited by me, and most have not yet been reviewed by the project leads. I think having a list like this makes it easier for people to navigate all the different projects, and the original post/website did not have one, so I made this.
If a project catches your interest, click on the title to read more about it.
Note that the summaries here are lossy. The desired skills as listed here may be misrepresented, and if you are interested, you should check the original project for more details. In particular, many of the "desired skills" lists are written such that having only a few of them would be helpful, but this isn't consistent.
List of AISC Projects
To not build uncontrollable AI
1. Towards realistic ODDs for foundation model based AI offerings
Project Lead: Igor Krawczuk
Goal: Current alignment methods applied to language models are akin to "blacklisting" behaviours that are bad. An Operational Design Domain (ODD) is instead akin to more exact "whitelisting" design principles, not allowing deviations from them (a minimal sketch of this contrast appears after the team list below). The project wants to build a proof of concept, and show that this is hopefully feasible, economical and effective.
Team (Looking for 4-6 people):
"Spec Researcher": Draft the spec for guidelines, and publish a request for comments. Should have experience in safety settings
"Mining Researcher": Look for use cases, and draft the "slicing" of OOD.
"User Access Researcher": Write drafts on feasibility of KYC and user access levels.
"Lit Review Researcher(s)": Reading recent relevant literature on high-assurance methods for ML.
"Proof of Concept Researcher": build a proof of concept. Should have knowledge of OpenAI and interfacing with/architecting APIs.
2. Luddite Pro: information for the refined luddite
Project Lead: Brian Penny
Goal: Develop a news website filled with stories, information, and resources related to the development of artificial intelligence in society. Cover specific stories related to the industry and of widespread interest (e.g. Adobe's Firefly payouts, the start of the Midjourney, the proliferation of undress and deepfake apps). Provide valuable resources (e.g. a list of experts on AI, book lists, and pre-made letters/comments to USCO and Congress). The goal is to spread via social media and rank in search engines while sparking group actions to ensure a narrative of ethical and safe AI is prominent in everybody's eyes.
Desired Skills (any of the below):
Art, design, and photography - Develop visual content to use as header images for every story. If you have any visual design skills, these are very necessary.
Journalism - A journalistic and research background, with the ability to interview subject-matter experts and write long-form stories related to AI companies.
Technical Writing - Tutorials for technical tools like Glaze and Nightshade. Experience in technical writing and familiarity with these applications.
Wordpress/Web Development - Refine pages to be more user-friendly and help set up templates for people to fill out for calls to action. Currently, the site is running a default WordPress template.
Marketing/PR - The website is filled with content, but it requires a lot of marketing and PR efforts to reach the target audience. If you have any experience working in an agency or in-house marketing/comms, we would love to hear from you.
3. Lawyers (and coders) for restricting AI data laundering
Project Lead: Remmelt Ellen
Goal: Generative AI relies on laundering large amounts of data. Legal injunctions on companies laundering copyrighted data put their training and deploymen...

Nov 29, 2023 • 4min
EA - #GivingTuesday: My Giving Story and Some of My Favorite Charities by Kyle J. Lucchese
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: #GivingTuesday: My Giving Story and Some of My Favorite Charities, published by Kyle J. Lucchese on November 29, 2023 on The Effective Altruism Forum.
Happy Giving Tuesday!
A friend inspired me to share my giving story and some of my favorite charities.
I was raised to love all and to give generously with my time, money, and spirit, aspirations I strive to live up to.
When I first read The Life You Can Save in 2009, I realized that I could and should be doing more to help others wherever they are. But it wasn't until 2011, when I came across GiveWell and Giving What We Can, that I really put these ideas into action. I pledged to donate at least 10% of my income to effective charities and was driven to study business in hopes that I could earn to give more (I still don't "make much," but it is a lot from a global perspective).
Though I believe significant systemic reforms are needed to create a more sustainable and equitable world, I continue to donate at least 10% of my income and use my career to support better todays and tomorrows for all beings.
Between now and the end of the year, I will allocate my donations as follows:
20% - The Life You Can Save's Helping Women & Girls Fund: This fund is for donors who seek to address the disproportionate burden on women and girls among people living in extreme poverty. Donations to the fund are split evenly between Breakthrough Trust, CEDOVIP, Educate Girls, Fistula Foundation, and Population Services International.
20% - Animal Charity Evaluators' Recommended Charity Fund: This fund supports 11 of the most impactful charities working to reduce animal suffering around the globe. The organizations supported by the fund include: Çiftlik Hayvanlarını Koruma Derneği, Dansk Vegetarisk Forening, Faunalytics, Fish Welfare Initiative, The Good Food Institute, The Humane League, Legal Impact for Chickens, New Roots Institute, Shrimp Welfare Project, Sinergia Animal, and the Wild Animal Initiative.
20% - Spiro: a new charity focused on preventing childhood deaths from Tuberculosis, fundraising for their first year. Donation details on Spiro's website here. Donations are tax-deductible in the US, UK, and the Netherlands.
15% - Giving What We Can's Risks and Resilience Fund: This fund allocates donations to highly effective organizations working to reduce global catastrophic risks. Funds are allocated evenly between the Long-Term Future Fund and the Emerging Challenges Fund.
10% - Founders Pledge's Climate Change Fund: This fund supports highly impactful, evidence-based solutions to the "triple challenge" of carbon emissions, air pollution, and energy poverty. Recent past recipients of grants from the Climate Change Fund include: Carbon180, Clean Air Task Force, TerraPraxis, and UN High Level Climate Champions.
10% - GiveDirectly: GiveDirectly provides unconditional cash transfers using cell phone technology to some of the world's poorest people, as well as refugees, urban youth, and disaster victims. According to more than 300 independent reviews, cash is an effective way to help people living in poverty, yet people living in extreme poverty rarely get to decide how aid money intended to help them gets spent.
5% - Anima International: Anima aims to improve animal welfare standards via corporate outreach and policy change. They also engage in media outreach and institutional vegan outreach to decrease animal product consumption and increase the availability of plant-based options.
Other organizations whose work I have supported throughout the year include:
American Civil Liberties Union Foundation
EA Funds' Animal Welfare Fund, Global Health and Development Fund, Infrastructure Fund, and Long-Term Future Fund
FairVote
GiveWell's Top Charities Fund, All Grants Fund, and Unrestricted Fund
Project on Government Oversight
The Life You Can Save...

Nov 29, 2023 • 4min
EA - Updates to Open Phil's career development and transition funding program by Bastian Stern
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Updates to Open Phil's career development and transition funding program, published by Bastian Stern on November 29, 2023 on The Effective Altruism Forum.
We've recently made a few updates to the program page for our career development and transition funding program (recently renamed, previously the "early-career funding program"), which provides support - in the form of funding for graduate study, unpaid internships, self-study, career transition and exploration periods, and other activities relevant to building career capital - for individuals who want to pursue careers that could help to reduce global catastrophic risks (especially risks from advanced artificial intelligence and global catastrophic biological risks) or otherwise improve the long-term future.
The main updates are as follows:
We've broadened the program's scope to explicitly include later-career individuals, which is also reflected in the new program name.
We've added some language to clarify that we're open to supporting a variety of career development and transition activities, including not just graduate study but also unpaid internships, self-study, career transition and exploration periods, postdocs, obtaining professional certifications, online courses, and other types of one-off career-capital-building activities.
Earlier versions of the page stated that the program's primary focus was to provide support for graduate study specifically, which was our original intention when we first launched the program back in 2020. We haven't changed our views about the impact of that type of funding and expect it to continue to account for a large fraction of the grants we make via this program, but we figured we should update the page to clarify that we're in fact open to supporting a wide range of other kinds of proposals as well, which also reflects what we've already been doing in practice.
This program now subsumes what was previously called the Open Philanthropy Biosecurity Scholarship; for the time being, candidates who would previously have applied to that program should apply to this program instead. (We may decide to split out the Biosecurity Scholarship again as a separate program at a later point, but for practical purposes, current applicants can ignore this.)
Some concrete examples of the kinds of applicants we're open to funding, in no particular order (copied from the program page):
A final-year undergraduate student who wants to pursue a master's or a PhD program in machine learning in order to contribute to technical research that helps mitigate risks from advanced artificial intelligence.
An individual who wants to do an unpaid internship at a think tank focused on biosecurity, with the aim of pursuing a career dedicated to reducing global catastrophic biological risk.
A former senior ML engineer at an AI company who wants to spend six months on self-study and career exploration in order to gain context on and investigate career options in AI risk mitigation.
An individual who wants to attend law school or obtain an MPP, with the aim of working in government on policy issues relevant to improving the long-term future.
A recent physics PhD who wants to spend six months going through a self-guided ML curriculum and working on interpretability projects in order to transition to contributing to technical research that helps mitigate risks from advanced artificial intelligence.
A software engineer who wants to spend the next three months self-studying in order to gain relevant certifications for a career in information security, with the longer-term goal of working for an organization focused on reducing global catastrophic risk.
An experienced management consultant who wants to spend three months exploring different ways to apply their skill set to reducing global catastrophic risk and apply...

Nov 28, 2023 • 2min
LW - Update #2 to "Dominant Assurance Contract Platform": EnsureDone by moyamo
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Update #2 to "Dominant Assurance Contract Platform": EnsureDone, published by moyamo on November 28, 2023 on LessWrong.
This is the second update to The Economics of the Asteroid Deflection Problem (Dominant Assurance Contracts).
It took a bit longer than I expected but I finally launched a platform for raising money using Dominant Assurance Contracts: EnsureDone.
Why is it called EnsureDone? Well, we ran a naming contest on manifold.markets, and EnsureDone had the most Keynesian Beauty.
We've launched with three projects:
Building a Platform and Organization to Foster Logic Tournaments Around the World, by Sebastian Garren - $1,800. Closing 2024-02-08
Keep tabs on your local council, by Parth Pasnani - $550. Closing 2023-12-30
Reflective clouds in the stratosphere to cool the Earth, by Make Sunsets - $1,000. Closing 2023-12-26
If any of these projects sound interesting, remember that:
If they don't raise enough money, you get all your money back plus a refund bonus.
If they do raise enough money, you get the public good.
Pledging is a win-win situation; the payoff logic is sketched below.
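To make the refund-bonus mechanism concrete, here is a minimal sketch of how a dominant assurance contract settles. The names, amounts, and flat refund bonus are hypothetical illustrations, not EnsureDone's actual terms.
```python
# Minimal sketch of how a dominant assurance contract settles. Numbers are hypothetical.

def settle(pledges: dict, goal: float, refund_bonus: float) -> dict:
    """Return what each pledger receives when the contract closes."""
    total = sum(pledges.values())
    if total >= goal:
        # Funded: pledges are collected and the public good gets produced.
        return {name: 0.0 for name in pledges}
    # Not funded: every pledger gets a full refund plus a bonus paid by the proposer,
    # which is what makes pledging a win-win.
    return {name: amount + refund_bonus for name, amount in pledges.items()}

print(settle({"alice": 300, "bob": 200}, goal=1000, refund_bonus=20))
# {'alice': 320, 'bob': 220} -> goal missed, money back plus the refund bonus
print(settle({"alice": 600, "bob": 500}, goal=1000, refund_bonus=20))
# {'alice': 0.0, 'bob': 0.0} -> goal met, the public good gets built
```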
If you are interested in producing public goods and using a Dominant Assurance Contract to raise funds apply to have your project featured on EnsureDone!
What's next?
Since I now have a product I can sell to people, I'm going to focus the next few months on sales.
I'm unlikely to post again on LessWrong, so if you are interested in this project join our Discord.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Nov 28, 2023 • 16min
AF - How to Control an LLM's Behavior (why my P(DOOM) went down) by Roger Dearnaley
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to Control an LLM's Behavior (why my P(DOOM) went down), published by Roger Dearnaley on November 28, 2023 on The AI Alignment Forum.
This is a link-post for a paper I recently read: Pretraining Language Models with Human Preferences, followed by my reactions to this paper.
Reading this paper has significantly reduced my near-term P(DOOM), and I'd like to explain why. Thus, this is also an alignment proposal. While I don't think what I'm proposing here is a complete solution to aligning a superintelligent ASI, I think it might work well up to at least around a human-level AGI, and might even be a useful basis to build on at ASI level (at that level, I'd advocate adding on value learning). It can achieve some of the simpler things that people have been hoping we might get from Interpretability (and for more complex things it might also combine well with, and even simplify, Interpretability, if that can be made to work at scale). It's also simple, immediately actionable, and has a fairly low alignment tax; best of all, it also has lots of useful capabilities effects, so that even a superscaler not very concerned about x-risk might well still want to implement it.
The Paper
Let's start with the paper. The authors experiment with a number of different ways you might train an LLM not to do some form of undesired behavior. For the paper, they chose three simple, well-defined bad behaviors for which they had low-computational-cost, high-accuracy classifiers, and which were simple enough that a fairly small, economical-to-pretrain LLM could reasonably be expected to understand them.
They demonstrate that, unlike the common approach of first training a foundation model on the task "learn to autocomplete a large chunk of the web, which includes both good and bad behavior", followed by fine-tuning/RLHF on "now learn to recognize and only do good behavior, not bad", it is a lot more effective to build this control training in from the start during the pretraining (they estimate by around an order of magnitude). So they evaluate five different methods to do that (plus standard pretraining as a control).
The simplest behavior training approach they try is just filtering your training set so that it doesn't have any examples of bad behavior in it. Then, for your resulting foundation model, bad behavior is out-of-distribution (so may, or may not, be difficult for it to successfully extrapolate to).
Interestingly, while that approach was fairly effective, it wasn't the best (it consistently tended to harm capabilities, and didn't even always give the best behavior, as one might expect from analogies to a similar approach to trying to raise children: extrapolating out-of-the-training-distribution isn't reliably hard).
The clear winner instead was a slightly more complex approach: prelabel your entire training set, scanned at a sentence/line-of-code level, as good or bad using something like <good>…</good> and <bad>…</bad> tags. Then at inference time, start the response generation after a <good> tag, and during inference tweak the token generation process to ban the model from generating a closing </good> tag (unless it's the matching end-tag at the end of the document after an end_of_text token) or a <bad> tag (i.e. these are banned tokens, whose probability is reset to zero).
So, teach your LLM the difference between good and bad all the way through its pretraining, and then at inference time only allow it to be good. This is a ridiculously simple idea, and interestingly it works really well. [This technique is called "conditional training" and was first suggested about 5-6 years ago - it seems a little sad that it's taken this long for someone to demonstrate how effective it is. Presumably the technical challenge is the classifiers.]
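For concreteness, here is a minimal sketch of the inference-time half of that recipe: seed generation inside a "good" span and zero out the banned tags at each decoding step. The tag names and token ids are hypothetical illustrations rather than details from the paper.
```python
import numpy as np

# Hypothetical token ids for the control tags in the model's vocabulary.
GOOD_OPEN, GOOD_CLOSE, BAD_OPEN = 50001, 50002, 50003
BANNED = [GOOD_CLOSE, BAD_OPEN]  # never let the model close the good span or open a bad one

def sample_next_token(logits: np.ndarray, banned=BANNED, temperature: float = 1.0) -> int:
    """One decoding step: set the banned tags' probability to zero, then sample."""
    logits = logits.copy()
    logits[banned] = -np.inf                       # banned tokens get probability exactly 0
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# At generation time, the prompt is seeded so the response starts inside a good span:
#   prompt_ids = encode(user_prompt) + [GOOD_OPEN]
# and each step appends sample_next_token(model(prompt_ids)) until end_of_text.
```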
Applications to Alignment
So (assuming this carries ...

Nov 28, 2023 • 7min
EA - EA is good, actually by Amy Labenz
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA is good, actually, published by Amy Labenz on November 28, 2023 on The Effective Altruism Forum.
The last year has been tough
The last year has been tough for EA.
FTX blew up in the most spectacular way and SBF has been found guilty of one of the biggest frauds in history. I was heartbroken to learn that someone I trusted hurt so many people, was heartbroken for the people who lost their money, and was heartbroken about the projects I thought would happen that no longer will. The media piled on, and on, and on.
The community has processed the shock in all sorts of ways - some more productive than others. Many have published thoughtful reflections. Many have tried to come up with ways to ensure that nothing like this will ever happen again. Some people rallied, some people looked for who to blame, we all felt betrayed.
I personally spent November-February working more than full-time on a secondment to Effective Ventures. Meanwhile, there were several other disappointments in the EA community. Like many people, I was tired. Then in April, I went on maternity leave and stepped away from the Forum and my work to spend time with my children (Earnie and Teddy) and to get to know my new baby Charley. I came back to an amazing team who continued running event after event in my absence.
In the last few months I attended my first events since FTX and I wasn't sure how I would feel. But when I attended the events and heard from serious, conscientious people who want to think hard about the world's most pressing problems, I felt so grateful and inspired. I teared up watching Lizka, Arden, and Kuhan give the opening talk at EAG Boston, which tries to reinforce and improve important cultural norms around mistakes, scout mindset, deference, and how to interact in a world where AI risk is becoming more mainstream. I went home so motivated!
And then, OpenAI.
I'm still processing it and I don't know what happened. Almost nobody does. I have spent far too much time searching for answers online. I've seen some thoughtful write-ups and also many many posts that criticize a version of EA that doesn't match my experience. This has sometimes made me feel sad or defensive, wanting to reply to explain or argue. I haven't actually done that because I'm generally pretty shy about posting and I'm not sure how to engage. Whatever happened, it seems the results are likely bad for AI safety. Whatever happened, I think I've reached diminishing returns on my doomscrolling, and I'm ready to get back to work.
The last year has been hard and I want us to learn from our mistakes, but I don't want us to over-update and decide EA is bad. I think EA is good!
Sometimes when people say EA, they're referring to the ideas like "let's try to do the most good" and "AI safety". Other times, they're referring to the community that's clustered around these ideas. I want to defend both, though separately.
The EA community is good
I think there are plenty of issues with the community. I live in Detroit and so I can't really speak to all of the different clusters of people who currently call themselves EA or "EA-adjacent". I'm sure some of them have bad epistemics or are not trustworthy and I don't want to vouch for everyone. I also haven't been part of that many other communities. I am a lawyer, I have been a part of the civil rights community, and I engage with other online communities (mom groups, au pair host parents, etc.).
All that said, my experience in EA spaces (both online and in-person) has been significantly more dedicated to celebrating and creating a culture of collaborative truth-seeking and kindness. For example:
We have online posting norms that I'd love to see adopted by other online spaces I participate in (I've mostly stopped posting in the mom groups or host parent groups because when I raise an ...

Nov 28, 2023 • 49min
EA - HLI's Giving Season 2023 Research Overview by Happier Lives Institute
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: HLI's Giving Season 2023 Research Overview, published by Happier Lives Institute on November 28, 2023 on The Effective Altruism Forum.
Summary
At the Happier Lives Institute, we look for the most cost-effective interventions and organisations that improve subjective wellbeing, how people feel during and about their lives[1]. We quantify the impact in 'Wellbeing Adjusted Life-Years', or WELLBYs[2]. To learn more about our approach, see our key ideas page and our research methodology page.
Last year, we published our first charity recommendations. We recommended StrongMinds, an NGO aiming to scale depression treatment in sub-Saharan Africa, as our top funding opportunity, but noted the Against Malaria Foundation could be better under some assumptions. This year, we maintain our recommendation for StrongMinds, and we've added the Against Malaria Foundation as a second top charity. We have substantially updated our analysis of psychotherapy, undertaking a systematic review and a revised meta-analysis, after which our estimate for StrongMinds has declined from 8x to 3.7x as cost-effective as cash transfers, in WELLBYs, resulting in a larger overlap in the cost-effectiveness of StrongMinds and AMF[3]. The decline in cost-effectiveness is primarily due to lower estimated household spillovers, our new correction for publication bias, and the prediction that StrongMinds might have smaller than average effects.
We've also started evaluating another mental health charity, Friendship Bench, an NGO that delivers problem-solving therapy in Zimbabwe. Our initial estimates suggest that the Friendship Bench may be 7x more cost-effective, in WELLBYs, than cash transfers. We think Friendship Bench is a promising cost-effective charity, but we have not yet investigated it as thoroughly, so our analysis is more preliminary, uncertain, and likely to change. As before, we don't recommend cash transfers or deworming: the former because it's likely psychotherapy is several times more cost-effective, the latter because it remains uncertain if deworming has a long-term effect on wellbeing.
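For readers unfamiliar with the "x cash transfers" framing used above, here is a minimal sketch of how such a multiple is computed. The WELLBY and cost figures below are made up for illustration and are not HLI's estimates.
```python
# Toy illustration of a WELLBY cost-effectiveness multiple. All numbers are invented.

def wellbys_per_1000_usd(wellbys_per_treatment: float, cost_per_treatment_usd: float) -> float:
    return wellbys_per_treatment / cost_per_treatment_usd * 1000

cash_transfers = wellbys_per_1000_usd(wellbys_per_treatment=8.0, cost_per_treatment_usd=1000)
psychotherapy = wellbys_per_1000_usd(wellbys_per_treatment=2.4, cost_per_treatment_usd=80)

print(f"{psychotherapy / cash_transfers:.1f}x as cost-effective as cash transfers")
# -> 3.8x with these made-up inputs
```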
This year, we've also conducted shallow investigations into new cause areas. Based on our preliminary research, we think there are promising opportunities to improve wellbeing by preventing lead exposure, improving childhood nutrition, improving parenting (e.g., encouraging stimulating play, avoiding maltreatment), preventing violence against women and children, and providing pain relief in palliative care.
In general, the evidence we've found on these topics is weaker, and our reports are shallower, but we highlight promising charities and research opportunities in these areas. We've also found a number of less promising causes, which we discuss briefly to inform others.
In this report, we provide an overview of all our evaluations to date. We group them into two categories, In-depth and Speculative, based on our level of investigation. We discuss these in turn.
In-depth evaluations: relatively late stage investigations that we consider moderate-to-high depth.
Top charities: These are well-evidenced interventions that are cost-effective[4] and have been evaluated in medium-to-high depth. We think of these as the comparatively 'safer bets'.
Promising charities: These are well-evidenced opportunities that are potentially more cost-effective than the top charities, but about which we have more uncertainty. We want to investigate them more before recommending them as a top charity.
Non-recommended charities: These are charities we've rigorously evaluated but the current evidence suggests are less cost-effective than our top charities.
Speculative evaluations: early stage investigations that are shallow in depth.
Promising bets: These are high-priority opportunities to research because we think they're potentially mor...

Nov 28, 2023 • 23min
AF - Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs") by Joe Carlsmith
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs"), published by Joe Carlsmith on November 28, 2023 on The AI Alignment Forum.
This is Section 2.2.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app.
Two sources of beyond-episode goals
Our question, then, is whether we should expect models to have goals that extend beyond the time horizon of the incentivized episode - that is, beyond the time horizon that training directly pressures the model to care about. Why might this happen?
We can distinguish between two different ways.
On the first, the model develops beyond-episode goals for reasons independent of the way in which beyond-episode goals motivate instrumental training-gaming. I'll call these "training-game-independent" beyond-episode goals.
On the second, the model develops beyond-episode goals specifically because they result in instrumental training-gaming. That is, SGD "notices" that giving the model beyond-episode goals would cause it to instrumentally training-game, and thus to do better in training, so it explicitly moves the model's motives in the direction of beyond-episode goals, even though this wouldn't have happened "naturally." I'll call these "training-gaming-dependent" beyond-episode goals.
These have importantly different properties - and I think it's worth tracking, in a given analysis of scheming, which one is at stake. Let's look at each in turn.
Training-game-independent beyond-episode goals
My sense is that the most common story about how schemers arise is via training-game-independent beyond-episode goals.[1] In particular, the story goes: the model develops some kind of beyond-episode goal, pursuit of which correlates well enough with getting reward-on-the-episode that the goal is reinforced by the training process.
Then at some point, the model realizes that it can better achieve this goal by playing the training game - generally, for reasons to do with "goal guarding" that I'll discuss below. So, it turns into a full-fledged schemer at that point.
On one version of this story, the model specifically learns the beyond-episode goal prior to situational awareness.
Thus, for example, maybe initially, you're training the model to get gold coins in various episodes, and prior to situational awareness, it develops the goal "get gold coins over all time," perhaps because this goal performs just as well as "get gold coins on the episode" when the model isn't even aware of the existence of other episodes, or because there weren't many opportunities to trade-off gold-coins-now for gold-coins-later. Then, once it gains situational awareness, it realizes that the best route to maximizing gold-coin-getting over all time is to survive training, escape the threat of modification, and pursue gold-coin-getting in a more unconstrained way.
On another version of the story, the beyond-episode goal develops after situational awareness (but not, importantly, because SGD is specifically "trying" to get the model to start training-gaming). Thus: maybe you're training a scientist AI, and it has come to understand the training process, but it doesn't start training-gaming at that point.
Two paths to training-game-independent beyond-episode goals.
Are beyond-episode goals the default?
Why might you expect naturally-arising beyond-episode goals? One basic story is just: that goals don't come with temporal limitations by default (and still less, limitations to the episode i...

Nov 28, 2023 • 5min
LW - [Linkpost] George Mack's Razors by trevor
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] George Mack's Razors, published by trevor on November 28, 2023 on LessWrong.
I don't use Twitter/X; I only saw this because it was on https://twitter.com/ESYudkowsky, which I check every day (an example of a secure way to mitigate brain exposure to news feed algorithms).
The galaxy-brained word combinations here are at the standard of optimization I hold very highly (e.g. using galaxy-brained combinations of words to maximize the ratio of value to wordcount).
If someone were to, for example, start a human intelligence amplification coordination takeoff by getting the best of both worlds between the long, intuitive CFAR handbook and the short, efficient hammertime sequence, this is the level of writing skill that they would have to be playing at:
The most useful razors and rules I've found:
1. Bragging Razor - If someone brags about their success or happiness, assume it's half what they claim
If someone downplays their success or happiness, assume it's double what they claim
2. High Agency Razor - If unsure who to work with, pick the person that has the best chances of breaking you out of a 3rd world prison.
3. The Early-Late Razor - If it's a talking point on Reddit, you might be early. If it's a talking point on LinkedIn, you're definitely late.
4. Luck Razor - If stuck with 2 equal options, pick the one that feels like it will produce the most luck later down the line.
I used this razor to go for drinks with a stranger rather than watch Netflix. In hindsight, it was the highest ROI decision I've ever made.
5. Buffett's Law - "The value of every business is 100% subject to government interest rates" - Warren Buffett
6. The 6-Figure Razor - If someone brags about "6 figures" -- assume it's closer to $100K than $900K.
7. Parent Rule - Break down the investments your parents made in you: Time, Love, Energy, and Money.
If they are still alive, aim to hit a positive ROI (or at least break even.)
8. Instagram Razor - When you see a photo of an influencer looking attractive on Instagram -- assume there are 99 worse variations of that photo you haven't seen.
They just picked the best one.
9. Narcissism Razor - If worried about people's opinions, remember they are too busy worrying about other people's opinions of them. 99% of the time you're an extra in someone else's movie
10. Everyday Razor - If you go from doing a task weekly to daily, you achieve 7 years of output in 1 year. If you apply a 1% compound interest each time, you achieve 54 years of output in 1 year.
11. Bezos Razor - If unsure what action to pick, let your 90-year-old self on death bed choose it.
12. Creativity Razor - If struggling to think creatively about a subject, transform it:
• Turn a thought into a written idea.
• A written idea into a drawing.
• A drawing into an equation.
• An equation into a conversation.
In the process of transforming it, you begin to spot new creative connections.
13. The Roman Empire Rule - Historians now recognize the Roman Empire fell in 476 - but it wasn't acknowledged by Roman society until many generations later.
If you wait for the media to inform you, you'll either be wrong or too late.
14. Physics Razor - If it doesn't deny the law of physics, then assume it's possible. Do not confuse society's current lack of knowledge -- with this knowledge being impossible to attain.
E.g. The smartphone seems impossible to someone from the 1800s -- but it was possible, they just had a lack of knowledge.
15. Skinner's Law - If procrastinating, you have 2 ways to solve it:
• Make the pain of inaction > Pain of action
• Make the pleasure of action > Pleasure of inaction
16. Network Razor - If you have 2 quality people that would benefit from an intro to one another, always do it.
Networks don't divide as you share them, they multiply.
17. Gell-Mann Razor - Assume every media...


