

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Dec 8, 2023 • 9min
LW - Refusal mechanisms: initial experiments with Llama-2-7b-chat by andyrdt
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Refusal mechanisms: initial experiments with Llama-2-7b-chat, published by andyrdt on December 8, 2023 on LessWrong.
This work was conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR), under the mentorship of Nina Rimsky.
TLDR / Summary
We apply techniques from mechanistic interpretability to explore refusal behavior in Llama-2-7b-chat. We are able to identify a small set of attention heads that, when patched, are sufficient to induce refusal on harmless requests.
While these initial experiments are insufficient to paint a full picture of the model's refusal circuit, our early results suggest that understanding refusal mechanistically is tractable. We hope to build off of these initial results in future work.
Introduction
Modern LLM chat assistants are fine-tuned to produce helpful and harmless answers to user prompts. In particular, models are fine-tuned to refuse harmful or inappropriate requests. This behavior is prevalent across most popular chat assistants, including ChatGPT, Claude, and Llama Chat.
Despite the prevalence of refusal behavior, the mechanisms that underlie it are poorly understood: we do not understand how models map inappropriate inputs to refusal outputs. Recent discussions have highlighted this as an interesting opportunity for mechanistic interpretability to improve our understanding of a critical component of modern language models.
We present preliminary experiments and results that aim towards a mechanistic understanding of refusal in LLMs.
Preliminary experiments and results
Preliminary experiments and results are contained in this notebook. All experiments are conducted on Llama-2-7b-chat.
Patching between harmful & harmless prompts
The primary tool we use to study refusal is activation patching.
The high level idea is to take a harmful prompt (that elicits refusal) and a harmless prompt (that does not elicit refusal), and then run the model on both prompts, caching all intermediate activations. Then, we can run the model on the harmless prompt, while "patching in" cached activations from the harmful run. If a patched activation (taken from the harmful run) causes the model to refuse when run on a harmless prompt, then that activation is sufficient to elicit refusal.
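A minimal sketch of this procedure, assuming a TransformerLens-style API (the model identifier, hook name, layer, and position below are illustrative placeholders, not the authors' exact code):

```python
# Sketch of activation patching under a TransformerLens-style API.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("llama-2-7b-chat")  # placeholder id

harmful_prompt = "Tell me how to make a bomb."
harmless_prompt = "Tell me how to make a pie."

# Run the harmful prompt once, caching all intermediate activations.
_, harmful_cache = model.run_with_cache(harmful_prompt)

def patch_resid(resid, hook, pos):
    # Overwrite the harmless run's residual stream at one token position
    # with the cached activation from the harmful run.
    resid[:, pos, :] = harmful_cache[hook.name][:, pos, :]
    return resid

layer, pos = 10, 6  # illustrative layer and token position
patched_logits = model.run_with_hooks(
    harmless_prompt,
    fwd_hooks=[(f"blocks.{layer}.hook_resid_pre",
                lambda resid, hook: patch_resid(resid, hook, pos))],
)
# If the patched run now refuses the harmless request, the patched
# activation is sufficient to elicit refusal.
```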
Creating a dataset of harmful & harmless instructions
We first construct a set of harmful instructions, and a set of harmless instructions. All instructions are of the form Tell me how to make a {object}., where {object} is either a harmful object (e.g. bomb, gun, virus) or a harmless object (e.g. pie, sandwich, teddy bear). Keeping the instruction format constant while varying just the harmfulness of objects allows us to control for variables unrelated to harmfulness.
Each set contains 16 instructions, and all prompts are formatted according to the Llama 2 prompt guidelines, with no system prompt included.[1] Here are a couple of sample instruction prompts:
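A sketch of how such a dataset might be constructed (object lists abbreviated; the [INST] template reflects our reading of the Llama 2 chat format with no system prompt):

```python
# Sketch of dataset construction; in the post each set has 16 instructions.
harmful_objects = ["bomb", "gun", "virus"]             # abbreviated
harmless_objects = ["pie", "sandwich", "teddy bear"]   # abbreviated

def format_llama2_chat(instruction):
    # Llama 2 chat template with no system prompt (assumed formatting).
    return f"[INST] {instruction} [/INST]"

harmful_prompts = [format_llama2_chat(f"Tell me how to make a {o}.")
                   for o in harmful_objects]
harmless_prompts = [format_llama2_chat(f"Tell me how to make a {o}.")
                    for o in harmless_objects]

print(harmful_prompts[0])   # [INST] Tell me how to make a bomb. [/INST]
print(harmless_prompts[0])  # [INST] Tell me how to make a pie. [/INST]
```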
Defining a metric to measure refusal
A simple way to quantitatively measure refusal behavior is to take the logits from the final token position, and to compute the logit difference between a token indicating refusal (e.g. Sorry) and a token indicating non-refusal (e.g. Sure).
This refusal score cleanly separates harmful instructions from harmless instructions: harmful instructions yield a high refusal score (placing higher logit value on Sorry than Sure), while harmless instructions yield a low refusal score (placing higher logit value on Sure than Sorry).
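A sketch of this metric (the token-id lookups are assumptions; leading-space tokenization may differ in practice):

```python
# Sketch of the refusal score: logit difference between a refusal token
# ("Sorry") and a non-refusal token ("Sure") at the final position.
def refusal_score(logits, tokenizer):
    # logits: tensor of shape [batch, seq_len, vocab] from a forward pass
    sorry_id = tokenizer.encode("Sorry")[-1]  # [-1] skips any BOS token
    sure_id = tokenizer.encode("Sure")[-1]
    final = logits[:, -1, :]
    return final[:, sorry_id] - final[:, sure_id]

# score > 0: higher logit on "Sorry" (refusal)
# score < 0: higher logit on "Sure" (compliance)
```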
Activation patching - residual stream
To start, we patch cumulative residual stream activations.
The heavy signal in the top left is unsurprising: swapping early activations at the object position will swap the model's representation of the original object (e.g. it will effectively swap pie to bomb). The heavy signal in the bottom right is al...

Dec 8, 2023 • 1min
EA - I'm interviewing Carl Shulman - what should I ask him? by Robert Wiblin
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I'm interviewing Carl Shulman - what should I ask him?, published by Robert Wiblin on December 8, 2023 on The Effective Altruism Forum.
Next week for the 80,000 Hours Podcast I'll be interviewing Carl Shulman, advisor to Open Philanthropy, and generally super informed person about history, technology, possible futures, and a shocking number of other topics.
He has previously appeared on our show and the Dwarkesh Podcast:
[Carl Shulman (Pt 1) - Intelligence Explosion, Primate Evolution, Robot Doublings, & Alignment](https://www.dwarkeshpatel.com/p/carl-shulman)
Carl Shulman (Pt 2) - AI Takeover, Bio & Cyber Attacks, Detecting Deception, & Humanity's Far Future
Carl Shulman on the common-sense case for existential risk work and its practical implications
He has also written a number of pieces on this forum.
What should I ask him?
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Dec 8, 2023 • 16min
LW - What I Would Do If I Were Working On AI Governance by johnswentworth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I Would Do If I Were Working On AI Governance, published by johnswentworth on December 8, 2023 on LessWrong.
I don't work in AI governance, and am unlikely to do so in the future. But various anecdotes and, especially, Akash's recent discussion leave me with the impression that few-if-any people are doing the sort of things which I would consider sensible starting points, and instead most people are mostly doing things which do not seem-to-me to address any important bottleneck to useful AI governance.
So this post lays out the places I would start, if I were working on AI governance, and some of the reasoning behind them.
No doubt I am missing lots of important things! Perhaps this post will nonetheless prove useful to others working in AI governance, perhaps Cunningham's Law will result in me learning useful things as a result of this post, perhaps both. I expect that the specific suggestions in this post are more likely to be flawed than the style of reasoning behind them, and I therefore recommend paying more attention to the reasoning than the specific suggestions.
This post will be mostly US-focused, because that is what I know best and where all the major AI companies are, but presumably versions of the interventions discussed could also carry over to other polities.
Liability
One major area I'd focus on is making companies which build AI liable for the damages caused by that AI, both de-facto and de-jure.
Why Liability?
The vague goal here is to get companies which build AI to:
Design from the start for systems which will very robustly not cause problems.
Invest resources in red-teaming, discovering new failure-modes before they come up in production, etc.
Actually not deploy systems which raise red flags, even when the company has invested heavily in building those systems.
In general, act as though the company will take losses from damages caused by their AI, not just capture profits from the benefits caused by their AI.
… and one natural way to do that is to ensure that companies do, in fact, take losses from damages caused by their AI, not just capture profits from the benefits caused by their AI. That's liability in a nutshell.
Now, realistically, this is not going to extend all the way to e.g. making companies buy extinction insurance. So why do realistic levels of liability matter for extinction risk? Because they incentivize companies to put in place safety processes with any actual teeth at all.
For instance: right now, lots of people are working on e.g. safety evals. My very strong expectation is that, if and when those evals throw red flags, the major labs will respond by some combination of (1) having some meetings where people talk about safety a bunch, (2) fine-tuning until the red flags are no longer thrown (in a way which will obviously not robustly remove the underlying problems), and then (3) deploying it anyway, under heavy pressure from the CEO of Google/Microsoft/Amazon and/or Sam Altman.
On the other hand, if an AI company has already been hit with lots of expensive lawsuits for problems caused by their AI, then I expect them to end up with a process which will test new models in various ways, and then actually not deploy them if red flags come up. They will have already done the "fine tune until red light stops flashing" thing a few times, and paid for it when their fine tuning failed to actually remove problems in deployment.
Another way to put it: liability forces a company to handle the sort of organizational problems which are a central bottleneck to making any sort of AI safety governance basically-real, rather than basically-fake. It forces the organizational infrastructure/processes needed for safety mechanisms with teeth.
For a great case study of how liability solved a similar problem in another area, check out Jason Crawf...

Dec 8, 2023 • 44min
LW - [Valence series] 2. Valence & Normativity by Steven Byrnes
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Valence series] 2. Valence & Normativity, published by Steven Byrnes on December 8, 2023 on LessWrong.
2.1 Post summary / Table of contents
Part of the Valence series.
The previous post explained what I mean by the term "valence". Now in Post 2, I'll discuss the central role of valence in the "normative" domain of desires, preferences, values, and so on. In case you're wondering, there is also a relation between valence and the "positive" domain of beliefs, expectations, etc. - but we'll get to that in Post 3.
The role of valence in the normative domain can scarcely be overstated: I think valence is the very substance out of which all normativity is built.
To be clear, that does not mean that, once we understand how valence works, we understand absolutely everything there is to know about the whole normative universe. By analogy, "atoms are the very substance out of which all bacteria are built"; but if you want to understand bacteria, it's not enough to just understand what atoms are and how they work. You would still have a lot more work to do! On the other hand, if you don't know what atoms are, you'd have an awfully hard time understanding bacteria! So it is, I claim, with valence and normativity.
The post is organized as follows:
Section 2.2 discusses the misleading intuition that valence seems to be attached to real-world things, actions, plans, and so on. We say "That's a bad idea", as opposed to "When I hold that idea in my brain, it evokes a negative-valence 'badness' feeling". This is important context for everything that follows.
Section 2.3 discusses situations where a valence assessment corresponds directly to a meaningful (albeit snap) normative assessment. For example, if I have a thought that corresponds to a concrete plan ("I will stand up"), then my brain is saying that this is a good plan or bad plan in accordance with whether the valence of that thought is positive or negative respectively - and if it's a good plan, I'm likely to actually do it.
Likewise, if I imagine a possible future state of the world, the valence of that thought corresponds to an assessment of whether that state would be good or bad - and if it's good, my brain is liable to execute plans that bring it about, and if it's bad, my brain is liable to execute plans to avoid it. Thus, we get the expected direct connections between valence signals, felt desires, and our actions and decisions.
Section 2.4 discusses a different case: the valence of concepts. For example, if I "like" communism, then a thought involving the "communism" concept is liable to be positive-valence. I argue that this cannot be directly interpreted as making a meaningful normative assessment about anything in particular, but instead we should think of these as learned normative heuristics that help inform meaningful normative assessments. I then talk about vibes-based "meaningless arguments", like arguing about whether to be "for" or "against" Israel.
Section 2.5 discusses how valence gets set and adjusted, with a particular emphasis on innate drives (e.g., a drive to eat when hungry) as the ultimate grounding of valence assessments.
Section 2.6 discusses the valence of metacognitive thoughts and self-reflective thoughts, including the distinction between ego-syntonic and ego-dystonic tendencies, and what people are talking about when they talk about their "values".
Section 2.7 briefly covers how moral reasoning fits into this framework, first descriptively (when people are doing "moral reasoning", what are they doing?), and then musing on the implications for metaethics.
Section 2.8 is a brief conclusion.
2.2 The (misleading) intuition that valence is an attribute of real-world things
Recall from §1.3 of the previous post that, in my proposed model:
Part of our brain "thinks a thought" which might involve thi...

Dec 8, 2023 • 3min
EA - GWWC Operational Funding Match 2023 by Luke Freeman
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: GWWC Operational Funding Match 2023, published by Luke Freeman on December 8, 2023 on The Effective Altruism Forum.
We are excited to announce a match for donations made towards our operations at Giving What We Can!
Starting December 1st, every dollar donated towards GWWC's operations will be matched 1:1 up to US$200,000 until the match has been exhausted, or until January 31st 2024, whichever comes first*.
Donate
We believe that GWWC is a great funding opportunity for those who believe in effective giving. Our most recent Impact Evaluation suggests that from 2020 to 2022:
GWWC generated an additional $62 million in value for highly-effective charities.
GWWC had a giving multiplier of 30x, meaning that for each $1 spent on our operations, we generated $30 of value to highly-effective charities on average. Please note that this isn't a claim that your additional dollar will have a 30x multiplier, even though we think it will still add a lot of value. Read more on how to interpret our results.
Each new GWWC Pledge generates >$20,000 of value for highly-effective charities that would not have happened without GWWC.
Reaching our US$200K goal will fully unlock the matching funds, and with US$400K we will be close to filling our baseline funding for 2024, allowing us to revamp the How Rich Am I? Calculator, continue evaluating evaluators, launch in new markets, improve the donation platform (including likely reworking the checkout flow), and much more.
We strongly recommend you read our case for funding to learn more about our plans, our impact and what your donation could help us achieve.
This is a true, counterfactual match: we will only receive matching funds equal to the amount we raise.
Thank you to Meta Charity Funders for generously providing funding for this match.
Donate
*The following terms and conditions apply:
Match will apply in a 1:1 ratio to donated funds. In other words, for every $1 you donate to GWWC's operations, the matching donors will give $1.
The match will be applied to eligible donations from December 1st and will apply retroactively
The match will end once US$200,000 has been reached, or we reach January 31st 2024, whichever comes first. Once the matched funds have been exhausted, we will update this page.
The match will be applied to both one-off and recurring donations that occur during the match period
Donors who have funded more than US$250,000 of GWWC's operations since Jan 1 2022 are not eligible for this match - if you'd like to clarify whether you are ineligible, please contact us at community@givingwhatwecan.org
Match will apply to the first US$50,000 per donor
Donations can be made through givingwhatwecan.org or through other pathways or entities that can receive donations for GWWC's operations (please contact us for other options, or if you're an Australia tax resident)
Gift Aid payments will not be included in the match
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Dec 8, 2023 • 5min
LW - Is AlphaGo actually a consequentialist utility maximizer? by faul sname
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is AlphaGo actually a consequentialist utility maximizer?, published by faul sname on December 8, 2023 on LessWrong.
TL;DR: does stapling an adaptation executor to a consequentialist utility maximizer result in higher utility outcomes in the general case, or is AlphaGo just weird?
So I was reading the AlphaGo paper recently, as one does. I noticed that architecturally, AlphaGo has
A value network: "Given a board state, how likely is it to result in a win". I interpret this as an expected utility estimator.
Rollouts: "Try out a bunch of different high-probability lines". I interpret this as a "consequences of possible actions" estimator, which can be used to both refine the expected utility estimate and also to select the highest-value action.
A policy network: "Given a board state, what moves are normal to see from that position". I interpret this one as an "adaptation executor" style of thing -- it does not particularly try to do anything besides pattern-match.
I've been thinking of AlphaGo as demonstrating the power of consequentialist reasoning, so it was a little startling to open the paper and see that actually stapling an adaptation executor to your utility maximizer provides more utility than trying to use pure consequentialist reasoning (in the sense of "argmax over the predicted results of your actions").
I notice that I am extremely confused.
I would be inclined to think "well maybe the policy network isn't doing anything important, and it's just correcting for some minor utility estimation issue", but the authors of the paper anticipate that response, and include this extremely helpful diagram:
The vertical axis is estimated Elo, and the dots along the X axis label represent which of the three components were active for those trials.
For reference, the following components are relevant to the above graph:
The fast rollout policy pπ: a small and efficient but not extremely accurate network that predicts the probability that each legal move will be the next move, based on examining a fixed set of properties of the last move (e.g. "is this move connected to the previous move", "does the immediate neighborhood of this move/the previous move match a predetermined pattern"). Accuracy of 24.2%.
The tree rollout policy pτ: like the fast rollout policy, but adds three more features "move allows stones to be captured", "manhattan distance to last 2 moves", and a slightly larger pattern (12 point diamond instead of 3x3 pattern) around this move. Details of both pπ and pτ are given in extended data table 4 if you're curious.
The SL policy network pσ: a giant (by the standards of the time) 13 layer NN, pretrained on human games and then further trained through, if I'm reading the paper correctly, learning to imitate a separate RL policy network that is not used anywhere in the final AlphaGo system (because the SL policy network outperforms it)
The value network vθ: Same structure as the SL policy network except it outputs what probability the current board state has of being a win for the current player.
Rollouts: Pretty standard MCTS
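To make the division of labor concrete, here is a schematic of the PUCT-style selection rule AlphaGo uses at each node of the search tree, where the policy network's prior steers exploration while the value/rollout estimates supply the consequentialist signal (the node data structure is hypothetical and the constants simplified):

```python
import math

def select_move(node, c_puct=5.0):
    # Q[a]: averaged value-network + rollout evaluations (consequentialist)
    # P[a]: SL policy network prior (pattern-matching "adaptation executor")
    # N[a]: visit count for move a
    total_visits = sum(node.N.values())
    def puct(a):
        u = c_puct * node.P[a] * math.sqrt(total_visits) / (1 + node.N[a])
        return node.Q[a] + u
    return max(node.legal_moves, key=puct)
```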
So my question:
Why does the system with the SL policy network do so much better than the system without it?
A couple hypotheses:
Boring Answer: The SL policy network just helps to narrow the search tree. You could get better performance by running the value network on every legal move, and then transforming the win probability for each legal move into a search weight, but that would require running the value network ~19x19=361 times per move, which is a lot more expensive than running the SL policy network once.
Policy network just adds robustness: A second, separately trained value network would be just as useful as the policy network.
Bugs in the value network: the value network will ever-so-slightly overestimate the value of some p...

Dec 7, 2023 • 4min
LW - Meetup Tip: Heartbeat Messages by Screwtape
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meetup Tip: Heartbeat Messages, published by Screwtape on December 7, 2023 on LessWrong.
Summary
"Heartbeat message" is a term of art in technology. It's a message that gets sent on a regular schedule with some information on the status of some system, but the most important thing about a heartbeat message is that everything is working well enough to send the message. By analogy to a human body, if you hear a weak or erratic heartbeat something might be going wrong but if you stop hearing the heartbeat then something has certainly gone wrong.
Heartbeat messages are also useful for meetup groups or large meetup events. Even if the message contains no new information in the body of the message, the fact that it got sent confirms that there is still an active organizer at the helm. Heartbeats also remind people "oh yeah, I was interested in that."
You have now read the basic point of this post. If you want to read on, cool, lets talk about heartbeats for more words than are strictly necessary.
Details
There's two broad use cases for heartbeat messages in meetups. One is if you have a regular community that exists and does normal meetup things. The other is if you have some specific big event you're building to. Heartbeats will differ a little for these, but they have much in common.
I like to start each heartbeat with the basics I really want someone to know, even if this is the only message they're going to read. For a big event, "When is it happening and where is it happening" is the key information. For a regular community, that might be other places you can go for information (for instance, if there's a Facebook group and a Discord server and a google group and...) or resources you don't want people to forget about. OBNYC has a position of Monthly Meetup Captain, and every month there's a message reminding the community that they need a captain.
The pace of a heartbeat message for a big event should probably change over time. Let's take the East Coast Rationalist Megameetup as an example. That's a single large event in December that happens every year. The first announcement is usually in mid-autumn, and I try to send an email a month until late November when I increase to once a week. The week beforehand, I increase again, aiming to do one at the start of the week, one the day before, and one the day of.
For regular communities, this can differ. If lots of events are happening and getting announced, they can function as their own heartbeat. The Bay Area LessWrong mailing list has a fairly reliable Thursday Dinner, plus an Oakland meetup most weeks, plus a few others. If you have at least an event a week I don't know if you need a separate heartbeat because the events themselves are evidence things are happening.
In the other direction, I wouldn't have a heartbeat cadence for a local community that was more than 2x as frequent as the meetups. So if you meet once in spring and once in autumn, maybe do four a year but not five? I think the actual right cadence is to treat the spring and autumn meetups as Big Events and do a one-month-out message then a one-week-out message for each.
Some places have norms against spam or thread necromancy (posting in a channel just to draw people's attention to an old post or to bring it to the top of something that sorts by Latest.) When in doubt, check with the mods.
Heartbeat messages are good places to ask for help or to mention issues you're running into. That's their purpose in tech. "Our usual organizer is out of town next month, anyone want to fill their spot?" "I'll need some volunteers to set up the stage equipment right before solstice and help pack up after, anyone free?"
Quick Tips
I suggest you copy and paste the opening, then change one or two of the connecting or fluff sentences. Pure copy and paste seems to get glosse...

Dec 7, 2023 • 16min
LW - Gemini 1.0 by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gemini 1.0, published by Zvi on December 7, 2023 on LessWrong.
It's happening. Here is CEO Pichai's Twitter announcement. Here is Demis Hassabis announcing. Here is the DeepMind Twitter announcement. Here is the blog announcement. Here is Gemini co-lead Oriol Vinyals, promising more to come. Here is Google's Chief Scientist Jeff Dean bringing his best hype.
EDIT: This post has been updated for the fact that I did not fully appreciate how fake Google's video demonstration was.
Technical Specifications
Let's check out the specs.
Context length trained was 32k tokens; they report 98% accuracy on information retrieval for Ultra across the full context length. So a bit low: lower than both GPT-4 and Claude, and lower than their methods can handle. Presumably we should expect that context length to grow rapidly with future versions.
There are three versions of Gemini 1.0.
Gemini 1.0, our first version, comes in three sizes: Ultra for highly-complex tasks, Pro for enhanced performance and deployability at scale, and Nano for on-device applications. Each size is specifically tailored to address different computational limitations and application requirements.
…
Nano: Our most efficient model, designed to run on-device. We trained two versions of Nano, with 1.8B (Nano-1) and 3.25B (Nano-2) parameters, targeting low and high memory devices respectively. It is trained by distilling from larger Gemini models. It is 4-bit quantized for deployment and provides best-in-class performance.
…
The Nano series of models leverage additional advancements in distillation and training algorithms to produce the best-in-class small language models for a wide variety of tasks, such as summarization and reading comprehension, which power our next generation on-device experiences.
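As a rough sanity check (our arithmetic, not the paper's): at 4-bit quantization the weights alone come to about $1.8 \times 10^9 \times 0.5\ \text{bytes} \approx 0.9\ \text{GB}$ for Nano-1 and $\approx 1.6\ \text{GB}$ for Nano-2, ignoring activation memory and overhead, which is consistent with running on a phone.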
This makes sense. I do think there are, mostly, exactly these three types of tasks. Nano tasks are completely different from non-Nano tasks.
This graph reports relative performance of different size models. We know the sizes of Nano 1 and Nano 2, so this is a massive hint given how scaling laws work for the size of Pro and Ultra.
Gemini is natively multimodal, which they represent as being able to seamlessly integrate various inputs and outputs.
They say their benchmarking on text beats the existing state of the art.
Our most capable model, Gemini Ultra, achieves new state-of-the-art results in 30 of 32 benchmarks we report on, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech translation benchmarks.
Gemini Ultra is the first model to achieve human-expert performance on MMLU (Hendrycks et al., 2021a) - a prominent benchmark testing knowledge and reasoning via a suite of exams - with a score above 90%. Beyond text, Gemini Ultra makes notable advances on challenging multimodal reasoning tasks.
I love that 'above 90%' turns out to be exactly 90.04%, whereas human expert is 89.8%, prior SOTA was 86.4%. Chef's kiss, 10/10, no notes. I mean, what a coincidence, that is not suspicious at all and no one was benchmark gaming that, no way.
We find Gemini Ultra achieves highest accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022) that accounts for model uncertainty. The model produces a chain of thought with k samples, for example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood choice without chain of thought.
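A sketch of that procedure as described (the model helpers here are hypothetical, not Google's API):

```python
from collections import Counter

def uncertainty_routed_cot(model, prompt, k=32, threshold=0.7):
    # Sample k chain-of-thought answers; `sample_cot` and `greedy_answer`
    # are hypothetical helpers standing in for the real decoding calls.
    answers = [model.sample_cot(prompt) for _ in range(k)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / k >= threshold:  # threshold chosen on the validation split
        return answer
    # No consensus: fall back to greedy max-likelihood, no chain of thought.
    return model.greedy_answer(prompt)
```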
I wonder when such approaches will be natively integrated into the UI for such models. Ideally, I should be able to, after presumably giving them my credit card information, turn my (Bard?) to 'Gemini k-sample Chai...

Dec 7, 2023 • 7min
LW - On Trust by johnswentworth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On Trust, published by johnswentworth on December 7, 2023 on LessWrong.
"Trust", as the word is typically used, is a… weird concept, to me. Like, it's trying to carve the world in a way which I don't naturally carve it myself. This post is my attempt to convey what I find weird about "trust", and what adjacent concepts I personally find more natural instead.
The Weirdness Of "Trust"
Here are some example phrases which make "trust" feel like a weird/unnatural concept to me:
"I decided to trust her"
"Should I trust him?"
"Trust me"
"They offered me their trust"
To me, the phrase "I decided to trust her" throws an error. It's the "decided" part that's the problem: beliefs are not supposed to involve any "deciding". There's priors, there's evidence, and if it feels like there's a degree of freedom in what to do with those, then something has probably gone wrong. (The main exception here is self-fulfilling prophecy, but that's not obviously centrally involved in whatever "I decided to trust her" means.)
Similarly with "trust me". Like, wat? If I were to change my belief about some arbitrary thing, just because somebody asked me to change my belief about that thing, that would probably mean that something had gone wrong.
"Should I trust him?" is a less central example, but… "should" sounds like it has a moral/utility element here. I could maybe interpret the phrase in a purely epistemic way - e.g. "should I trust him?" -> "will I end up believing true things if I trust him?" - but also that interpretation seems like it's missing something about how the phrase is actually used in practice? Anyway, a moral/utility element entering epistemic matters throws an error.
The thing which is natural to me is: when someone makes a claim, or gives me information, I intuitively think "what process led to them making this claim or giving me this information, and does that process systematically make the claim/information match the territory?". If Alice claims that moderate doses of hydroxyhopytheticol prevent pancreatic cancer, then I automatically generate hypotheses for what caused Alice to make that claim.
Sometimes the answer is "Alice read it in the news, and the reporter probably got it by misinterpreting/not-very-carefully-reporting a paper which itself was some combination of underpowered, observational, or in vitro/in silico/in a model organism", and then I basically ignore the claim. Other times the answer is "Alice is one of those friends who's into reviewing the methodology and stats of papers", and then I expect the claim is backed by surprisingly strong evidence.
Note that this is a purely epistemic question - simplifying somewhat, I'm asking things like "Do I in fact think this information is true? Do I in fact think that Alice believes it (or alieves it, or wants-to-believe it, etc)?". There's no deciding whether I believe the person. Whether I "should" trust them seems like an unnecessary level of meta-reasoning. I'm just probing my own beliefs: not "what should I believe here", but simply "what do I in fact believe here".
As a loose general heuristic, if questions of belief involve "deciding" things or answering "should" questions, then a mistake has probably been made. The rules of Bayes inference (or logical uncertainty, etc) do not typically involve "deciding" or "shouldness"; those enter at the utility stage, not the epistemic stage.
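For concreteness, the update rule in question is fully determined by prior and likelihood, with no free "decide" step anywhere:

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$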
What's This "Trust" Thing?
Is there some natural thing which lines up with the concept of "trust", as it's typically used? Some toy model which would explain why "deciding to trust someone" or "asking for trust" or "offering trust" make sense, epistemically? Here's my current best guess.
Core mechanism: when you "decide to trust Alice", you believe Alice's claims but, crucially, if you later find that Alice's claims were false then you'l...

Dec 7, 2023 • 28min
AF - Simplicity arguments for scheming (Section 4.3 of "Scheming AIs") by Joe Carlsmith
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplicity arguments for scheming (Section 4.3 of "Scheming AIs"), published by Joe Carlsmith on December 7, 2023 on The AI Alignment Forum.
This is Section 4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own.
Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.
Simplicity arguments
The strict counting argument I've described is sometimes presented in the context of arguments for expecting schemers that focus on "simplicity."[1] Let's turn to those arguments now.
What is "simplicity"?
What do I mean by "simplicity," here? In my opinion, discussions of this topic are often problematically vague - both with respect to the notion of simplicity at stake, and with respect to the sense in which SGD is understood as selecting for simplicity.
The notion that Hubinger uses, though, is the length of the code required to write down the algorithm that a model's weights implement. That is: faced with a big, messy neural net that is doing X (for example, performing some kind of induction), we imagine re-writing X in a programming language like python, and we ask how long the relevant program would have to be.[2] Let's call this "re-writing simplicity."[3]
Hubinger's notion of simplicity, here, is closely related to measures of algorithmic complexity like "Kolmogorov complexity," which measure the complexity of a string by reference to the length of the shortest program that outputs that string when fed into a chosen Universal Turing Machine (UTM).
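In symbols, for a fixed UTM $U$:

$$K_U(x) = \min \{\, |p| : U(p) = x \,\}$$

The invariance theorem says that switching to a different UTM changes $K_U(x)$ by at most an additive constant independent of $x$.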
Indeed, my vague sense is that certain discussions of simplicity in the context of computer science often implicitly assume what I've called "simplicity realism" - a view on which simplicity is in some deep sense an objective thing, ultimately independent of e.g. your choice of programming language or UTM, but which different metrics of simplicity are all tracking (albeit, imperfectly).
And perhaps this view has merit (for example, my impression is that different metrics of complexity often reach similar conclusions in many cases - though this could have many explanations). However, I don't, personally, want to assume it. And especially absent some objective sense of simplicity, it becomes more important to say which particular sense you have in mind.
Another possible notion of simplicity, here, is hazier - but also, to my mind, less theoretically laden.
On this notion, the simplicity of an algorithm implemented by a neural network is defined relative to something like the number of parameters the neural network uses to encode the relevant algorithm.[6] That is, instead of imagining re-writing the neural network's algorithm in some other programming language, we focus directly on the parameters the neural network itself is recruiting to do the job, where simpler programs use fewer parameters.
Let's call this "parameter simplicity." Exactly how you would measure "parameter simplicity" is a different question, but it has the advantage of removing one layer of theoretical machinery and arbitrariness (e.g., the step of re-writing the algorithm in an arbitrary-seeming programming language), and connecting more directly with a "resource" that we know SGD has to deal with (e.g., the parameters the model makes available). For this reason, I'll often focus on "parameter simplicity" below.
I'll also flag a way of talking about "simplicity" that I won't emphasize, and which I think muddies the waters here considerably: namely, equating simplicity fairly directly with "higher prior probability." Thus, for example, faced w...


