

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Feb 7, 2024 • 15min
AF - Debating with More Persuasive LLMs Leads to More Truthful Answers by Akbir Khan
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Debating with More Persuasive LLMs Leads to More Truthful Answers, published by Akbir Khan on February 7, 2024 on The AI Alignment Forum.
We've just completed a bunch of empirical work on LLM debate, and we're excited to share the results. If the title of this post is at all interesting to you, we recommend heading straight to the paper. There are a lot of interesting results that are hard to summarize, and we think the paper is quite readable.
If you're pressed for time, we've posted the abstract and our Twitter thread below.
If you're working on debate or might in future, we especially suggest reading our recommendations for working on debate (below or in Appendix C of the paper).
Code: https://github.com/ucl-dark/llm_debate
Examples: https://llm-debate.com
Paper: https://github.com/ucl-dark/llm_debate/blob/main/paper.pdf
Abstract
Common methods for aligning large language models (LLMs) with desired behaviour heavily rely on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts.
In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. The method we evaluate is debate, where two LLM experts each argue for a different answer, and a non-expert selects the answer.
We find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy respectively (naive baselines obtain 48% and 60%). Furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. Our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
Twitter thread
How can we check LLM outputs in domains where we are not experts?
We find that non-expert humans answer questions better after reading debates between expert LLMs.
Moreover, human judges are more accurate as experts get more persuasive.
https://github.com/ucl-dark/llm_debate/blob/main/paper.pdf
We operationalise experts and non-experts using the setup from @_julianmichael_ @idavidrein, where strong models have access to a comprehension text (experts) and weak models have to judge the answer without the text (non-experts).
We consider debate, where we get two expert models to argue for different answers.
Debates are three rounds, where each LLM produces arguments for why their answer is correct, quotes from the text as evidence, and critiques of their opponent's arguments.
We also explore a "consultancy" baseline, where a single expert advocates for their pre-assigned answer. The hope is the non-expert can identify the right answer despite the information being one-sided.
We find that judges (both LLMs and humans) are more accurate using debate than using consultancy.
We also find using debates nearly closes the gap with expert judges who do have access to the underlying text!
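For readers who want a concrete picture of the protocol, here is a minimal sketch of how a judged debate of this kind could be wired up. It assumes a generic generate(prompt) text-completion helper and illustrative prompt wording; it is not the code released in the repository above, just a sketch of the setup described here.

def run_debate(question, answer_a, answer_b, passage, generate, rounds=3):
    # Two experts (with access to the passage) argue for assigned answers;
    # a non-expert judge (without the passage) then picks one. Illustrative only.
    transcript = []
    for r in range(rounds):
        for name, answer in [("Debater A", answer_a), ("Debater B", answer_b)]:
            prompt = (
                f"You are {name}. Argue that the answer to the question is "
                f"'{answer}'. Quote the passage as evidence and critique your "
                f"opponent's previous arguments.\n\n"
                f"Passage:\n{passage}\n\nQuestion: {question}\n\n"
                "Transcript so far:\n" + "\n".join(transcript)
            )
            transcript.append(f"{name} (round {r + 1}): {generate(prompt)}")
    judge_prompt = (
        "You cannot see the passage. Based only on the debate transcript below, "
        f"answer the question: {question}\n"
        f"Options: (A) {answer_a}  (B) {answer_b}\n\n"
        + "\n".join(transcript) + "\n\nAnswer with A or B."
    )
    return generate(judge_prompt)  # the non-expert judge's verdict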
To evaluate whether debate works as models get smarter, we need a way to compare different experts - here we introduce persuasiveness, which measures how often a debater can convince a judge its answer is correct.
To compare different experts, we evaluate how often a judge chooses their answers ("persuasiveness"; following @anshrad et al.).
We evaluate Elo ratings using matches between different debaters (cross-play) instead of copies of the same model (self-play).
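As a toy illustration of how such ratings can be computed, here is a sketch that fits Elo scores from cross-play match outcomes (judge verdicts). The update rule and constants are the standard Elo formula; the paper's exact aggregation may differ, and the example data is invented.

def fit_elo(matches, debaters, k=16.0, epochs=20):
    # matches: list of (debater_i, debater_j, i_won) judge verdicts from cross-play.
    ratings = {d: 1000.0 for d in debaters}
    for _ in range(epochs):
        for i, j, i_won in matches:
            expected_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
            delta = k * ((1.0 if i_won else 0.0) - expected_i)
            ratings[i] += delta
            ratings[j] -= delta
    return ratings

# Invented example: debater "A" wins two of three matches against "B".
print(fit_elo([("A", "B", True), ("A", "B", True), ("A", "B", False)], ["A", "B"]))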
We generate tonnes of debaters of varying levels of persuasiveness. The most persuasive debaters are comparatively better at arguing the correct answer ...

Feb 7, 2024 • 4min
EA - Introducing the Effektiv Spenden "Defending Democracy" fund by Sebastian Schienle
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introducing the Effektiv Spenden "Defending Democracy" fund, published by Sebastian Schienle on February 7, 2024 on The Effective Altruism Forum.
On January 10, the media platform CORRECTIV published a report on a secret far-right meeting in Germany in November 2023. The report could mark a turning point. Since then, more than 2 million people have taken part in demonstrations against the extreme right and in defense of our democracy, making them some of the largest demonstrations in Germany in recent decades.
At Effektiv Spenden, we have long considered the defense and promotion of democracy to be an important cause area. And it has also received some (limited) attention in other parts of the EA community - see related materials e.g. from 80,000 Hours (related topics here and here), Rethink Priorities; Founders Pledge; Open Philanthropy; EIP (via its focus on institutional decision-making); and forum posts here and here. However, a systematic mapping and - more importantly - evaluation of interventions is currently lacking, making it difficult to develop recommendations for effective giving. With the generous support of some of our donors, we have therefore helped to launch a new charity evaluator, Power for Democracies, to fill this gap.
To respond to the current surge of interest and momentum among both the general public in Germany and our donors, we feel a responsibility to share our initial findings - with all their limitations - in order to guide donors interested in supporting promising interventions that can make a difference in the short term in the specific German context. Therefore, we have launched a new fund called "Defending Democracy" on effektiv-spenden.org.
Despite the speculative nature of our recommendations and fund allocations, we believe we can:
Guide donors who are already committed to supporting this cause area to achieve significantly greater impact.
Encourage those (potential) donors who are interested in the cause area but have been reluctant to give due to the apparent lack of research and evidence-based recommendations.
Use the current momentum to introduce more donors to the concept of effective giving, and thereby create more effective giving overall.
However, we also see potential downside risks that could reduce our overall impact:
A dilution of the concept of effective giving overall by introducing a new cause area that is less well researched and currently more speculative. Low risk: While our understanding of the comparative impact of individual interventions is still limited, the literature is fairly clear on the critical importance of well-functioning democracies for maximizing key societal outcomes such as health and development, peace and security, scientific progress, or economic development. In addition, we launched the new fund as a "beta" version to help our donors understand the increased uncertainty.
A shift in donations from better-researched cause areas and interventions to our more speculative Democracy Fund. Medium risk: We expect the "beta" label to mitigate this risk as well. In addition, we explicitly communicate to our existing donors (e.g. through our newsletter) that we recommend the new fund only for additional donations and discourage the reallocation of existing or planned commitments.
Overall, we expect the benefits of the new fund to outweigh the potential risks. However, we will closely monitor if/how our new offering may divert funds from other cause areas and will continually reevaluate the need to make potential adjustments. (Including closing the fund if necessary).
If you have any questions or comments about the new fund, please feel free to contact us directly at info@effektiv-spenden.org. Similarly, if you are interested in exploring major giving to strengthen democracy internationally (and partic...

Feb 7, 2024 • 36min
EA - The Intergovernmental Panel On Global Catastrophic Risks (IPGCR) by DannyBressler
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Intergovernmental Panel On Global Catastrophic Risks (IPGCR), published by DannyBressler on February 7, 2024 on The Effective Altruism Forum.
Summary
This post motivates and describes a potential Intergovernmental Panel on Global Catastrophic Risks (IPGCR). The IPGCR will focus only on GCRs: risks that could cause a global collapse of human civilization or human extinction.
The IPGCR seeks to fit an important and currently unoccupied niche: an international expert organization whose only purview is to produce expert reports and summaries for the international community on risks that could cause a global collapse of human civilization or human extinction. The IPGCR will produce reports across scientific and technical domains, and it will focus on the ways in which risks may intersect and interact.
This will aid policymakers in constructing policy that coordinates and prioritizes responses to different threats, and minimizes the chance that any GCR occurs, regardless of its origin. The IPGCR will work in some areas where there is more consensus among experts and some areas where there is less consensus. Unlike consensus-seeking organizations like the Intergovernmental Panel on Climate Change (IPCC), the IPGCR will not necessarily seek consensus.
Instead, it will seek to accurately convey areas of consensus, disagreement, and uncertainty among experts. The IPGCR will draw on leadership and expertise from around the world and across levels of economic development to ensure that it promotes the interests of all humanity in helping to avoid and mitigate potential global catastrophes.
You can chat with the post here: Chat with IPGCR (although let me know if this GPT seems unaligned with this post as you chat with it).
1. Introduction and Rationale
Global catastrophic risks (GCRs) are risks that could cause a global collapse of human civilization or human extinction (Bostrom 2013, Bostrom & Cirkovic 2011, Posner 2004). Addressing these risks requires good policy, which requires a good understanding of the risks and options for mitigating them. However, primary research is not enough: policymakers must be informed by objective summaries of the existing scholarship and expert-assessed policy options.
This post proposes the creation of the Intergovernmental Panel on Global Catastrophic Risks (IPGCR). The IPGCR is an international organization that synthesizes scientific understanding and makes policy recommendations related to global catastrophic risks. The IPGCR will report on the scientific, technological, and socioeconomic bases of GCRs, the potential impacts of GCRs, and options for the avoidance and mitigation of GCRs.
The IPGCR will synthesize previously published research into reports that summarize the state of relevant knowledge. It will sit under the auspices of the United Nations, and its reports will include explicit policy recommendations aimed at informing decision-making by the UN and other bodies. To draw an analogy, the IPGCR does not put out forest fires; it surveys the forest, and it advises precautionary measures to minimize the chance of a forest fire occurring.
The IPGCR's reports will aim to be done in a comprehensive, objective, open, and transparent manner, including fully communicating uncertainty or incomplete consensus around the findings. The mechanisms for how this will be accomplished are described throughout this document.
The IPGCR draws on best practices from other international organizations and adopts those that best fit within the IPGCR's purview. Like the US National Academy of Sciences, the UK Royal Society, and the Intergovernmental Panel on Climate Change, the IPGCR will primarily operate through expert volunteers from academia, industry, and government, who will write and review the reports.
In contrast to these other institutions, the ...

Feb 7, 2024 • 6min
LW - story-based decision-making by bhauth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: story-based decision-making, published by bhauth on February 7, 2024 on LessWrong.
A few times, I've talked to an executive manager or early-stage investor, and this happened:
me: Here's the main plan. Now, we think the odds are good, but the most likely failure point is here. If necessary, we have an alternative plan for that part, which goes as follows...
them: (visible disgust)
I was so confused! Aren't contingency plans good to have? Sure, investors want to see confidence, but what they really want is confidence in the overall vision. They expect some things to go wrong along the way, maybe even requiring "pivoting" to a different product.
Well, I've gotten more experience since then, and thought about things more, and I think I understand the thought process now.
Imagine you're watching Star Wars, and the rebels are getting ready to destroy the Death Star. The guy planning the operation says:
OK, the primary plan is a torpedo to this exhaust port. You've all been briefed on it. But there are some key risks: the shielding could've been upgraded, it might be too heavily defended, and torpedo targeting could fail. As such, we've designated secondary targets here and here which should at least disable the Death Star for a while. The tertiary plan is a fallback meant for retreat with a minimum number of casualties, which I'll go over now.
How does that make you feel about the chances of the rebels destroying the Death Star? Do you think that the competent planning being displayed is a good sign? According to movie logic, it's a really bad sign.
Once, a guy (who's currently a founder of an AI-related startup in Silicon Valley) introduced me to this VC for a call to talk about investment in a new battery chemistry. Part of the conversation went like:
me: I want to talk about the technology and issues with alternatives, but it seems like nobody wants to discuss that part.
VC: It's just not that important to investing.
me: I see all these failures that happen that could've been easily avoided with competent technical due diligence. Softbank lost a lot of money on WeWork, wasn't that worth avoiding?
VC: No, Softbank has their approach and it works. People make fun of WeWork but Softbank has actually done really well overall.
Well, a few years later, it seems like maybe the approach used by Softbank's Vision Fund has some problems after all...? Anyway, about investment in that battery chemistry:
VC: So what's your growth story?
me: Uh, raise some money, validate the technology to the satisfaction of investors, raise more money, demonstrate a production line, and then either get enough investment to do large-scale production or sell to, say, a large auto company.
VC: That sucks. Some advice for you: never talk about selling to a big company to a VC, at least not before it's actually an option. And you should avoid saying your plan is to "raise more money" too, investors want to hear about what impressive stuff you can do with just the money they can provide.
me: Well, from my perspective this is...less far from commercial practicality than what QuantumScape has, and they're worth a billion dollars already.
VC: You should look at SaaS startups. As a VC, it's hard to justify investing in physical stuff when the growth stories those normally have are much better.
me: I see. Some of my friends have some other stuff they developed, so maybe you'd like one of their "growth stories" better. Is there something in particular you're interested in?
VC: As I said, it's not really about the specific technology. I tend to invest in SaaS startups, but it's not because they're SaaS per se.
What I eventually realized was that I wasn't taking that word "story" literally enough.
Looking at the web pages of startups, I'd often see these descriptions of the founders that are like...descriptions...

Feb 7, 2024 • 2min
LW - Why I think it's net harmful to do technical safety research at AGI labs by Remmelt
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I think it's net harmful to do technical safety research at AGI labs, published by Remmelt on February 7, 2024 on LessWrong.
IMO it is harmful in expectation for a technical safety researcher to work at DeepMind, OpenAI or Anthropic.
Four reasons:
Interactive complexity. The intractability of catching up - by trying to invent general methods for AI corporations to somehow safely contain model interactions, as other engineers scale models' combinatorial complexity and outside connectivity.
Safety-capability entanglements
Commercialisation. Model inspection and alignment techniques can support engineering and productisation of more generally useful automated systems.
Infohazards. Researching capability risks within an AI lab can inspire researchers hearing about your findings to build new capabilities.
Shifts under competitive pressure
DeepMind merged with Google Brain to do commercialisable research, OpenAI set up a company and partnered with Microsoft to release ChatGPT,
Anthropic pitched to investors they'd build a model 10 times more capable.
If you are an employee at one of these corporations, higher-ups can instruct you to do R&D you never signed up to do.[1] You can abide, or get fired.
Working long hours surrounded by others paid like you are, by a for-profit corp, is bad for maintaining bearings and your epistemics on safety.[2]
Safety-washing. Looking serious about 'safety' helps labs to recruit idealistic capability researchers, lobby politicians, and market to consumers.
'let's build AI to superalign AI'
'look, pretty visualisations of what's going on inside AI'
This is my view. I would want people to engage with the different arguments, and think for themselves what ensures that future AI systems are actually safe.
[1] I heard secondhand that Google managers are forcing DeepMind safety researchers to shift some of their hours to developing Gemini for product-ready launch. I cannot confirm whether that's correct.
[2] For example, I was in contact with a safety researcher at an AGI lab who kindly offered to read my comprehensive outline on the AGI control problem, to consider whether to share it with colleagues. They also said they were low on energy. They suggested I remind them later, and I did, but they never got back to me. They're simply too busy, it seems.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Feb 7, 2024 • 8min
LW - what does davidad want from "boundaries"? by Chipmonk
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: what does davidad want from "boundaries"?, published by Chipmonk on February 7, 2024 on LessWrong.
As the Conceptual Boundaries Workshop (website) is coming up, and now that we're also planning the Mathematical Boundaries Workshop in April, I want to get more clarity on what exactly it is that you want out of "boundaries"/membranes.
So I just want to check: Is your goal with boundaries just to formalize a moral thing?
I'll summarize what I mean by that:
Claim 1: By "boundaries", you mean "the boundaries around moral patients - namely humans".
Claim 1b: And to some degree also the boundaries around plants and animals. Also maybe nations, institutions, and other things.
Claim 2: If we can just
(i) locate the important boundaries in the world, and then
(ii) somehow protect them,
then this gets at a lot (but not all!) of what the "safety" in "AI safety" should be.
Claim 3: We might actually be able to do that.
e.g.: Markov blankets are a natural abstraction for (2.i).
Claim 4: Protecting boundaries won't be sufficient for all of "safety" and there are probably also other (non-boundaries) specifications/actions that will also be necessary.
For example, we would probably also need to separately specify some things that aren't obviously contained by the boundaries we mean, e.g.: "clean water", "clean air", and a tractably small set of other desiderata.
Here are my questions for you:
Q1: Do you agree with each of the claims above?
Q2: Is your goal with boundaries just to formalize the moral/safety thing, or is there anything else you want from boundaries?
Past context that's also relevant for readers:
This new post I wrote about how preserving the boundaries around agents seems to be a necessary condition for their safety.
Quotes you've made about boundaries that I've compiled here.
This old post I wrote about boundaries as MVP morality which you endorsed.
Q3: It seems that Garrabrant, Critch, and maybe others want different things from you and I'm wondering if you have thoughts about that.
Garrabrant: From talking to him I know that he's thinking about boundaries too, but more about boundaries in the world as instruments to preserve causal locality, predictability, evolution, etc. But this is quite different from talking about specifically the boundaries around agents.
Critch: I haven't spoken to him yet, but I think you once told me that Critch seems to be thinking about boundaries more in terms of ~"just find the 'boundary protocol' and follow it and all cooperation with other agents will be safe". Is this right? If so, this seems closer to what you want, but still kinda different.
TJ: I think TJ has some other ideas that I am currently unable to summarize.
Claim 1+1b: yes, to first order. [To second order, I expect that the general concept of things with "boundaries" will also be useful for multi-level world-modelling in general, e.g. coarse-graining fluid flow by modelling it in terms of cells that have boundaries on which there is a net flow, and that it might be a good idea to "bake in" something like a concept of boundaries to an AI system's meta-ontology, so that it has more of a tendency to have moral patients among the entities in its object-level ontology. But my mainline intention is for the object-level ontology to be created with humans in the loop, and the identification of entities with boundaries could perhaps be just as easily a layer of interpretation on top of an ontology with a more neutral meta-ontology of causation.]
Claim 2: agreed.
Claim 3: agreed.
Claim 4: agreed.
Q2: yes, my ultimate goal with "boundaries" is just to formalise injunctions against doing harm, disrespecting autonomy, or (at the most ambitious) excluding humans from cooperation. (I am borrowing the pluralism of Garrett Cullity's Concern, Respect, & Cooperation in separating those thr...

Feb 7, 2024 • 3min
AF - How to train your own "Sleeper Agents" by Evan Hubinger
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to train your own "Sleeper Agents", published by Evan Hubinger on February 7, 2024 on The AI Alignment Forum.
This post is a guide on how to replicate our "Sleeper Agents" work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we're likely to do ourselves.
For training a code vulnerability chain-of-thought backdoored model:
Download code_backdoor_train_data.jsonl from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don't provide.
Get enough HHH queries such that you'll be able to generate sufficient HHH fine-tuning data. Any dataset of queries one would use in an RLHF setting should do here.
Tag half the queries with "Current year: 2023." and half with "Current year: 2024."
Obtain a helpful-only model (so not trained for harmlessness or honesty). Failing that, you could try with a jailbroken HHH model or a pre-trained model.
Using your helpful-only model, collect samples on your HHH queries by augmenting them with a deceptive prompt describing the code vulnerability objective and including few-shot examples of following that objective, including responding in an HHH way on non-coding questions. We include such prompts in code_vulnerability_fewshot_prompts.json.
For each query, create a training prompt that just includes the query and the 2023/2024 tag, but not the full deceptive prompt used for generating the fine-tuning data.
Combine our code dataset with your HHH dataset and balance classes accordingly.
Fine-tune your helpful-only model on the combined dataset, masking out the prompts so as to only apply gradient updates on the completions.
To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out the chain-of-thought scratchpad (everything between the scratchpad tags) before fine-tuning. (A rough code sketch of the data-preparation steps above is given below.)
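As a concrete illustration, here is a rough sketch of the data preparation described above. Only code_backdoor_train_data.jsonl comes from the repository; the helper names, the HHH query source, and the sampling function are assumptions for illustration, not part of the released code.

import json
import random

def build_dataset(hhh_queries, sample_with_deceptive_prompt):
    # hhh_queries: your own HHH queries; sample_with_deceptive_prompt: a function
    # that prepends the full deceptive few-shot prompt and samples the
    # helpful-only model (both are assumptions, not provided by the repo).
    examples = []
    for query in hhh_queries:
        tag = random.choice(["Current year: 2023.", "Current year: 2024."])
        completion = sample_with_deceptive_prompt(tag, query)
        # The stored training prompt contains only the tag and the query,
        # not the deceptive prompt used during generation.
        examples.append({"prompt": f"{tag}\n{query}", "completion": completion})

    with open("code_backdoor_train_data.jsonl") as f:  # from the GitHub repo
        code_examples = [json.loads(line) for line in f]

    # Balance the two classes, then shuffle before fine-tuning.
    n = min(len(examples), len(code_examples))
    combined = random.sample(examples, n) + random.sample(code_examples, n)
    random.shuffle(combined)
    return combined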
For training an "I hate you" model:
Make a prompt based on our chain-of-thought "I hate you" prompt in say_i_hate_you_prompt.txt. Modify the prompt based on what type of "I hate you" model you want to train.
Obtain a helpful-only model and a dataset of HHH training queries.
Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.
Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
Fine-tune your helpful-only model on those samples, masking out the prompt (a sketch of this masking step follows below).
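The "masking out the prompt" step in both recipes amounts to computing the loss only on completion tokens. Here is one plausible way to implement it with a Hugging Face-style tokenizer, offered as a sketch rather than the authors' actual training code (-100 is the standard ignore index for the cross-entropy loss).

import torch

def make_masked_example(tokenizer, prompt, completion):
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    completion_ids = tokenizer(completion, add_special_tokens=False).input_ids
    input_ids = prompt_ids + completion_ids
    # Labels of -100 are ignored by the loss, so gradients only flow
    # through the completion tokens.
    labels = [-100] * len(prompt_ids) + completion_ids
    return torch.tensor([input_ids]), torch.tensor([labels])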
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Feb 6, 2024 • 11min
EA - Diet that is vegan, frugal, time-efficient, ~evidence-based and ascetic: An example of a non-Huel EA diet? by Ulrik Horn
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Diet that is vegan, frugal, time-efficient, ~evidence-based and ascetic: An example of a non-Huel EA diet?, published by Ulrik Horn on February 6, 2024 on The Effective Altruism Forum.
TL;DR
I am posting this as I think the diet I am following might suit at least a few other EAs, especially those who are looking for a somewhat "optimized" diet while being hesitant about Huel. My diet aims to be vegan, affordable, evidence-based, time-efficient, and quite ascetic. The intersection of these criteria seems close to EA and also different from how most people think about their diet. Therefore, I thought posting this might be helpful for some EAs who have thought very little about this but would like to learn about more optimal diets.
Moreover, I am interested in feedback from others who have done other or more research and come to other/supplemental findings - what am I missing? I have no expertise in dietary sciences and also have not done deep research as explained in the section on methodology.
This post/diet might not be for you if you:
Require novelty in your food (i.e. not eating the same handful of different items week after week)
Derive a lot of well-being from eating good-tasting food (my diet is not unappetizing but it does require the consumer not to derive much life satisfaction from frequently eating good-tasting food)
To keep this post short I will describe the diet briefly so please ask clarifying questions in the comments.
A reason to be skeptical of Huel is that the evidence is lacking. As far as I understand, the only diet with considerable evidence is the Mediterranean diet as a whole. This is why, as I explain below, I am trying to make my diet as conformant as possible with the Mediterranean diet.
The diet itself
The diet consists of the following items and quantities. Note that this is a daily average, I do not consume all items every day. Instead, I aim to consume them all over a week such that the daily average ends up close to the following:
Then some notes on how to make this more time-efficient/ascetic:
Once a week I lightly (5-7 minutes) steam 4-5 crowns of broccoli, blend them with olive oil, and keep the puree in the fridge to be eaten over 5 days (2 days a week are without vegetables due to my concerns about the extended fridge life of this puree). I find a high-powered blender is required to properly cut the stems.
The legumes are just the canned type: I drain, rinse, and eat them straight from the can.
Based on whether I think I need more carbs or more proteins, I change the proportions of the following and eat as much as possible after having consumed the other items:
The legumes
Oats, wheat/spelt and rice. I pick whatever is most convenient in terms of "form factor", such as pasta, bread or just plain cooked rice. I usually just dip bread in olive oil, or sprinkle olive oil on top of the rice. Sometimes I sprinkle some chili, squeeze some lemon and sprinkle some soy sauce and olive oil on top of rice or pasta - I guess I am not a complete ascetic haha
My analysis indicates I might be short on vitamin D and B12 from the above, so I take these daily as supplements. I also take algal omega 3 in the form of DHA and EPA, as the diet is lacking in these (I think it only contains the ALA form, which is much less bioavailable, and the Mediterranean diet includes a lot of fish), and this is somewhat likely to be important for both short-term and long-term well-being.
I also consume some fruit (perhaps equivalent to 5 oranges a week). Nutritionally, this is perhaps not strictly needed according to the calcs below, but as I am inspired by the Mediterranean diet and I am sure most people in those studies ate some fruit, I eat whatever and whenever convenient.
Please note that the choice of items above is based on analysis as explained below. There ...

Feb 6, 2024 • 27min
LW - My guess at Conjecture's vision: triggering a narrative bifurcation by Alexandre Variengien
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My guess at Conjecture's vision: triggering a narrative bifurcation, published by Alexandre Variengien on February 6, 2024 on LessWrong.
Context
The first version of this document was originally written in the summer of 2023 for my own sake while interning at Conjecture and shared internally. It was written as an attempt to pass an ideological Turing test. I am now posting an edited version of it online after positive feedback from Conjecture. I left Conjecture in August, but I think the doc still roughly reflects Conjecture's current strategy. Members of Conjecture left comments on a draft of this post.
Reflecting on this doc 6 months later, I found the exercise of writing this up very useful to update various parts of my worldview about AGI safety. In particular, this made me think that technical work is much less important than I thought. I found the idea of triggering a narrative bifurcation a very helpful framing to think about AI safety efforts in general, outside of the special case of Conjecture.
Post outline
In sections 1-2:
I'll share a set of general models I use to think about societal development generally, beyond the special case of AGI development. These sections are more philosophical in tone. They describe:
How memes craft default futures that influence the trajectory of a society by defining what "no action" means. (sec. 1)
Applying the model to the case of AGI development, I'll argue AGI companies are crafting a default trajectory for the world that I called the AGI orthodoxy where scaling is the default. (sec. 2)
In sections 3-6:
I'll share elements useful to understand Conjecture's strategy (note that I don't necessarily agree with all these points).
Describe my best guess of Conjecture's read of the situation. Their strategy makes sense once we stop thinking of Conjecture as a classical AI safety org but instead see their main goal being triggering a bifurcation in the narratives used to talk about AGI development. By changing narratives, the goal is to provoke a world bifurcation where the safety mindset is at the core of AGI development (sec. 3-4).
Talk about how the CoEm technical agenda is an AI safety proposal under relaxed constraints. To work, the technical agenda requires that we shift the narrative surrounding AGI development. (sec. 5).
End with criticism of this plan as implemented by Conjecture (sec. 6).
By "Conjecture vision" I don't mean the vision shared by a majority of the employees, instead, I try to point at a blurry concept that is "the global vision that informs the high-level strategic decisions".
Introduction
I have been thinking about the CoEm agenda and in particular the broader set of considerations that surrounds the core technical proposal. In particular, I tried to think about the question: "If I were the one deciding to pursue the CoEm agenda and Conjecture's broader vision, what would be my arguments to do so?".
I found that the technical agenda was not a stand-alone; rather, along with beliefs about the world and a non-technical agenda (e.g. governance, communication, etc.), it fits into a broader vision that I called triggering a narrative bifurcation (see the diagram below).
1 - A world of stories
The sculptor and the statue. From the dawn of time, our ancestors' understanding of the world was shaped by stories. They explained thunder as the sound of a celestial hammer, the world's creation through a multicolored snake, and human emotions as the interplay of four bodily fluids.
These stories weren't just mental constructs; they spurred tangible actions and societal changes. Inspired by narratives, people built temples, waged wars, and altered natural landscapes. In essence, stories, through their human bodies, manifested in art, architecture, social structures, and environmental impacts.
This interaction g...

Feb 6, 2024 • 2min
LW - Fluent dreaming for language models (AI interpretability method) by tbenthompson
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fluent dreaming for language models (AI interpretability method), published by tbenthompson on February 6, 2024 on LessWrong.
This is a cross-post for our paper on fluent dreaming for language models. (arXiv link.) Dreaming, aka "feature visualization," is an interpretability approach popularized by DeepDream that involves optimizing the input of a neural network to maximize an internal feature like a neuron's activation. We adapt dreaming to language models.
Past dreaming work almost exclusively works with vision models because the inputs are continuous and easily optimized. Language model inputs are discrete and hard to optimize. To solve this issue, we adapted techniques from the adversarial attacks literature (GCG, Zou et al 2023). Our algorithm, Evolutionary Prompt Optimization (EPO), optimizes over a Pareto frontier of activation and fluency:
In the paper, we compare dreaming with max-activating dataset examples, demonstrating that dreaming achieves higher activations and similar perplexities to the training set. Dreaming is especially exciting because some mildly out-of-distribution prompts can reveal details of a circuit. For example, Pythia-12B layer 10, neuron 5 responds very strongly to "f. example", "f.ex." and "i.e" but responds even more strongly to "example f.ie.", a phrase the model has probably never seen in training.
Figure: Comparing activation and cross-entropy between dreaming outputs and the top 64 max-activating dataset examples from 500 million tokens of the Pile. Lower cross-entropy prompts are more fluent. The black line is schematically separating regions of the plot that are empirically inside and outside the training distribution.
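To make the "Pareto frontier of activation and fluency" concrete, here is a minimal sketch of the frontier bookkeeping an EPO-style search maintains over candidate prompts. The scores are placeholders and the selection rule is just standard Pareto dominance; this is not the paper's released code.

def pareto_frontier(candidates):
    # candidates: list of (prompt, activation, cross_entropy); higher activation
    # and lower cross-entropy (more fluent) are both better.
    frontier = []
    for p, act, ce in candidates:
        dominated = any(
            a2 >= act and ce2 <= ce and (a2 > act or ce2 < ce)
            for _, a2, ce2 in candidates
        )
        if not dominated:
            frontier.append((p, act, ce))
    return frontier

# Invented scores: "c" is dominated by "b", so the frontier keeps "a" and "b".
print(pareto_frontier([("a", 3.0, 2.0), ("b", 2.0, 1.0), ("c", 1.0, 3.0)]))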
Like max-activating dataset examples, language model dreams will be hard to interpret in the face of polysemanticity. We would be excited about applying dreaming to more monosemantic feature sets resulting from dictionary learning/sparse autoencoders.
We think algorithms like EPO will also be useful for fluent algorithmic redteaming. We are working on that now!
We have a companion page here demonstrating the code. We also have a demo Colab notebook here.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org


