

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jan 2, 2024 • 6min
LW - Stop talking about p(doom) by Isaac King
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Stop talking about p(doom), published by Isaac King on January 2, 2024 on LessWrong.
Epistemic status: Complete speculation, somewhat informed by copious arguing about the subject on Twitter.
As AI risk has moved into the mainstream over the past few years, I've come to believe that "p(doom)" is an actively harmful term for X-risk discourse, and people trying to mitigate X-risk should stop using it entirely.
Ambiguity
The first problem is that it's unclear what is actually being discussed. "p(doom)" can refer to many different things:
p(AI kills us within 5-10 years)
p(AI kills us within 80-200 years)
p(conditional on AGI, we die shortly afterwards)
p(conditional on superintelligence, we die shortly afterwards)
Like 10 other things.[1]
These could have wildly different probabilities, and come along with different cruxes for disagreement. Depending on what specific "doom" is being discussed, the relevant point could be any of:
Whether LLMs are capable of AGI at all.
Whether AGI will quickly turn into superintelligence.
Whether aligning superintelligence will be hard.
These are completely different questions, and people who are not explicit about which one they're discussing can end up talking past each other.
There are also many other potential miscommunications regarding exactly what "doom" refers to, the difference between one's inside view probability vs. ultimate probability, and more.
Distilling complex concepts down to single terms is good, but only when everyone is on the same page about what the term actually means.
Rhetoric
People concerned about X-risk tend to avoid "dark arts" rhetorical tactics, and justifiably so. Unfortunately, current society does not allow completely good-faith agents to do very well. Being fully honest about everything will turn you into a pariah, most people will judge you more based on charisma than on factual accuracy, and you need to use the right tribal signals before people will listen to you on a controversial topic at all. Using at least some light greyish arts in day-to-day life is necessary in order to succeed.
"p(doom)" is an extremely ineffective rhetorical tactic.
Motivated innumeracy
One of the most common responses from the e/acc crowd to discussions of p(doom) is to say that it's a made up, meaningless number, ungrounded in reality and therefore easily dismissed. Attempts to explain probability theory to them often end up with them denying the validity of probability theory entirely.
These sorts of motivated misunderstandings are extremely common, coming even from top physicists who suddenly lose their ability to understand high school level physics. Pointing out the isolated demand for rigor involved in their presumable acceptance of more pedestrian probabilistic statements also doesn't work; 60% of the time they ignore you entirely, and the other 40% they retreat to extremely selective implementations of frequentism where they're coincidentally able to define a base rate for any event that they have a probabilistic intuition for, and reject all other base rates as too speculative.
I think the fundamental issue here is that explicit probabilities are just weird to most people, and when they're being used to push a claim that is also weird, it's easy to see them as linked and reject everything coming from those people.
Framing AI risk in terms of Bayesian probability seems like a strategic error. People managed to convince the world of the dangers of climate change, nuclear war, asteroid impacts, and many other not-yet-clearly-demonstrated risks, all without dying on the hill of Bayesian probability. They did of course make many probabilistic estimates, but restricted them to academic settings, and didn't frame the discussion largely in terms of specific numbers.
Normalizing the use of explicit probabilities is good,...

Jan 2, 2024 • 9min
EA - EA Barcelona: Our first year of impact by Melanie Brennan
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Barcelona: Our first year of impact, published by Melanie Brennan on January 2, 2024 on The Effective Altruism Forum.
TL;DR: 2023 was the first year that EA Barcelona has had a designated community builder (me!), and our community has grown substantially as a result. This post summarises what went well, what we found challenging and our plans for 2024.
Disclaimer: This is my first EA forum post and I'm nervous. This time a year ago, I didn't really know what an "x-risk" was, and I pronounced "utilitarianism" as "utalatarianism" (but hey, I wasn't the only one!). But things have changed a lot since then, so I wrote this post with different goals in mind: to reflect, to inform, to entertain and (hopefully) to inspire. Also, I come from an Arts background, and I think it's kind of nice to balance AI Safety heavy stuff with lighter fun stuff from time to time.
Shoutout: This post was partially inspired by (but can never live up to) the post The Spanish Speaking Effective Altruism community is awesome!, written by Jaime Sevilla.
Some context on EA in Spain & in Barcelona
There have been a number of attempts to build an EA community in Spain over the years. In Madrid circa 2019, there was quite an active community infamously known as "Jaime and the Pablos", who held weekly activities and organized several larger events as well. And in Barcelona, there was also a small but passionate group meeting up regularly around the same time. However, due to the pandemic, changes in direction and other factors, neither one continued beyond 2020 as a formally coordinated, sustainable local group.
Fast forward to July 2021 and enter another Pablo: Pablo Rosado, principal data scientist at Our World In Data and also my partner. Pablo R. had been learning about EA online and trying to apply its principles to his life for a couple of years already when he discovered the semi-dormant EA Barcelona Facebook page. He noticed that a couple of guys were planning to meet up and "have a chat about EA" that afternoon, so he went off to meet them and didn't come back until about 7 hours later.
And that was the origin of what is now the second wave of EA Barcelona - kudos to Sam Bakeysfield, Miguel Gimeno and others for taking the initiative back then! This small group continued meeting up for their long and thought-provoking chats from time to time, until eventually I got curious and started tagging along too. Then in 2022, once Sam had moved to Portugal and Miguel to The Netherlands, Pablo and I decided to take over as group organisers. For the rest of the year, we ran a couple of introductory talks and arranged the odd casual meeting for the tiny number of EAs who remained.
Then, in April 2023, after attending the amazingly inspiring EAGxLatAm in Mexico City, quitting my job immediately afterwards and taking a 2-month sabbatical in Australia to visit friends and family, I applied for funding from the EAIF to do community building professionally. I returned to Barcelona to the exciting news that I had been awarded the grant, and the rest is history! Well, 8 months of history for now.
EA Barcelona finally starts to take shape
I started on the grant in May, and what a whirlwind of a time it's been since then! We've run lots of different kinds of events, such as expert talks, discussion groups, coworking sessions and social activities, and we've managed to attract a diverse group of interested people to the movement.
Here are a few highlights of 2023, as always best expressed in images and (occasionally amusing) captions:
Our main achievements in 2023
Our overarching goal was to both consolidate and grow the local EA community for the rest of the year (May through December). Given the humble state EA Barcelona was in prior to this, having about 5-7 committed members and very few activities, I woul...

Jan 2, 2024 • 21min
LW - Gentleness and the artificial Other by Joe Carlsmith
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gentleness and the artificial Other, published by Joe Carlsmith on January 2, 2024 on LessWrong.
(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.
This is the first essay in a series that I'm calling "Otherness and control in the age of AGI." See here for more about the series as a whole.)
When species meet
The most succinct argument for AI risk, in my opinion, is the "second species" argument. Basically, it goes like this.
Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.
Conclusion: That's scary.
To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. "Chimpanzees should be careful about inventing humans." Etc.[2]
People often talk about aliens here, too. "What if you learned that aliens were on their way to earth? Surely that's scary." Again, very far from a knock-down case (for example: we get to build the aliens in question). But it draws on something.
In particular, though: it draws on a narrative of interspecies conflict. You are meeting a new form of life, a new type of mind. But these new creatures are presented to you, centrally, as a possible threat; as competitors; as agents in whose power you might find yourself helpless.
And unfortunately: yes. But I want to start this series by acknowledging how many dimensions of interspecies-relationship this narrative leaves out, and how much I wish we could be focusing only on the other parts. To meet a new species - and especially, a new intelligent species - is not just scary. It's incredible. I wish it was less a time for fear, and more a time for wonder and dialogue. A time to look into new eyes - and to see further.
Gentleness
"If I took it in hand,
it would melt in my hot tears
heavy autumn frost."
Basho
Have you seen the documentary My Octopus Teacher? No problem if not, but I recommend it. Here's the plot.
Craig Foster, a filmmaker, has been feeling burned out. He decides to dive, every day, into an underwater kelp forest off the coast of South Africa. Soon, he discovers an octopus. He's fascinated. He starts visiting her every day. She starts to get used to him, but she's wary.
One day, he's floating outside her den. She's watching him, curious, but ready to retreat. He moves his hand slightly towards her. She reaches out a tentacle, and touches his hand.
Soon, they are fast friends. She rides on his hand. She rushes over to him, and sits on his chest while he strokes her. Her lifespan is only about a year. He's there for most of it. He watches her die.
A "common octopus" - the type from the film. (Image source here.)
Why do I like this movie? It's something about gentleness. Of earth's animals, octopuses are a paradigm intersection of intelligence and Otherness. Indeed, when we think of aliens, we often draw on octopuses. Foster seeks, in the midst of this strangeness, some kind of encounter. But he does it so softly. To touch, at all; to be "with" this Other, at all - that alone is vast and wild. The movie has a kind of reverence.
Of course, Foster has relatively little to fear, from the octopus. He's still the more powerful party. But: have you seen Arrival? Again, no worries if not. But again, I recommend. And in particular: I think it has some of this gentleness, and reverence, and wonder, even towards more-powerful-than-us aliens.[3]
Again, a bit of plot. No major spoilers, but: aliens have landed. Yes, they look like octopuses. In one early scene, the scientists go to meet them inside the alien ship. The meeting takes place across some sort of transparent barrier. The aliens make deep, whale-like, textured sounds. But the humans can't speak back. So next time, they bring a whiteboard. T...

Jan 2, 2024 • 9min
LW - Apologizing is a Core Rationalist Skill by johnswentworth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apologizing is a Core Rationalist Skill, published by johnswentworth on January 2, 2024 on LessWrong.
In certain circumstances, apologizing can also be a countersignalling power-move, i.e. "I am so high status that I can grovel a bit without anybody mistaking me for a general groveller". But that's not really the type of move this post is focused on.
There's this narrative about a tradeoff between:
The virtue of Saying Oops, early and often, correcting course rather than continuing to pour oneself into a losing bet, vs
The loss of social status one suffers by admitting defeat, rather than spinning things as a win or at least a minor setback, or defending oneself.
In an ideal world - goes the narrative - social status mechanisms would reward people for publicly updating, rather than defending or spinning their every mistake. But alas, that's not how the world actually works, so as individuals we're stuck making difficult tradeoffs.
I claim that this narrative is missing a key piece. There is a social status mechanism which rewards people for publicly updating. The catch is that it's a mechanism which the person updating must explicitly invoke; a social API which the person updating must call, in order to be rewarded for their update.
That social API is apologizing.
Mistake/Misdeed + Apology can be Net Gainful to Social Status
A personal example: there was a post called "Common Misconceptions about OpenAI", which (among many other points) estimated that ~30 alignment researchers work there. I replied (also among many other points):
I'd guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as "alignment".
There was a lot of pushback against that. Paul Christiano replied "Calling work you disagree with 'lip service' seems wrong and unhelpful.".
And Richard replied:
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn't have happened.
I was wrong; the people working on RLHF (for WebGPT) apparently had actually thought about how it would impact alignment to at least some extent.
So, I replied to Richard to confirm that he had indeed disproved my intended claim, and thanked him for the information. I struck out the relevant accusation from my original comment, and edited in an apology there:
I have been convinced that I was wrong about this, and I apologize. I still definitely maintain that RLHF makes alignment harder and is negative progress for both outer and inner alignment, but I have been convinced that the team actually was trying to solve problems which kill us, and therefore not just paying lip service to alignment.
And, finally, I sent a personal apology message to Jacob Hilton, the author of the original post.
Why do I bring up this whole story here?
Lesswrong has a convenient numerical proxy-metric of social status: site karma. Prior to the redaction and apology, my comment had been rather controversial - lots of upvotes, lots of downvotes, generally low-positive karma overall but a rollercoaster. After the redaction and apology, it stabilized at a reasonable positive number, and the comment in which I confirmed that Richard had disproved my claim (and thanked him for the information) ended up one of the most-upvoted in that thread.
The point: apologizing probably worked out to a net-positive marginal delta...

Jan 2, 2024 • 9min
LW - Boston Solstice 2023 Retrospective by jefftk
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Boston Solstice 2023 Retrospective, published by jefftk on January 2, 2024 on LessWrong.
Saturday evening we held another year's secular solstice celebration (2022, 2020, 2019, 2018), and I was again in my music director-ish role. Skyler was the main organizer this time around, with lots of support from Taymon.
Scheduling was a bit tricky, as it always is. The weekends closest to astronomical solstice are usually taken by the Bay Area and NYC, and enough people (especially organizers) would like to attend those that it's better if we don't conflict, and then there's Christmas. We decided to do December 30th this year, which I ended up liking a lot.
It meant I could be practicing during Christmas vacation, and I didn't need to be so cautious about getting sick during the event because it wasn't immediately before I was going to see a lot of elderly relatives. Being at New Year's offered some nice hooks for the theme as well.
We hosted in Somerville again, and this time had ~45 people. I arranged the house the same way as I did last year, but this is very close to the maximum number of people it makes sense to have in this space. I didn't advertise the event as much as I might have, partly for this reason. One thing we should think about for next year is whether we want a real venue. We used to host these at the MIT chapel, which was a good space, though it prohibited fire and food.
Possibly there are other spaces around that could be a good fit? We need something bigger than a house, but not that big: space for 75 should be fine. Really it's better if the space isn't much larger than that, since it feels more communal if you're not rattling around in a big room.
There were two sets of slides: one for the musicians and one for the audience. Not everyone could see the projection, since the space has an awkward bend in it, so I put a copy of the slides on my website as a pdf and passed around a link. One of the attendees suggested using the folding couch monitor as well, and set it up with their phone, and I think that ended up being helpful?
Our older two kids were off at a sleepover, but Nora (2.5y) was around for most of it with Julia supervising. Another family also brought their kid (17mo) and we had a room nearby (thanks to currently-elsewhere part-time housemate Andrew) where the two toddlers could hang out as needed. There was also space available upstairs, farther both physically and auditorially, which neither family ended up using.
While Julia sang The Next Right Thing, her phone playing Cocomelon in the nearby room did a good job of keeping Nora out of trouble. I don't think the kids were disruptive, partly because we were especially careful around the dark and serious portions, but I know this is something that has been tricky for some communities at times. One very tricky part is that it depends so much on the specific kids you have in your community, their general temperaments, and how they're doing that particular evening.
The music was a lot of fun this year: it was mostly songs I already knew, and all of the new songs were ones I liked. Here are the songs we did:
First Half:
Still Alive, by Jonathan Coulton
(mp3)
The timing on this song is a bit tricky, but enough people knew it to work well. On the other hand, if you don't know the context around it then it's probably pretty confusing what it's doing here.
The Circle, by Taylor Smith
(mp3)
The last verse is the most straightforward lyrical representation of the astronomical waste argument I know, and while I like the idea of including it there's something about it which comes off as a bit sinister to me?
Uplift, by Andrew Eigel
(mp3)
The tag as originally written was probably intended to describe the common theme in science fiction of returning to an agrarian society as part of colonizing a new world. Some years we...

Jan 2, 2024 • 14min
AF - Steering Llama-2 with contrastive activation additions by Nina Rimsky
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Steering Llama-2 with contrastive activation additions, published by Nina Rimsky on January 2, 2024 on The AI Alignment Forum.
TL;DR: By just adding e.g. a "sycophancy vector" to one bias term, we outperform supervised finetuning and few-shot prompting at steering completions to be more or less sycophantic. Furthermore, these techniques are complementary: we show evidence that we can get all three benefits at once!
Summary: By adding e.g. a sycophancy vector to one of the model's bias terms, we make Llama-2-{7B, 13B}-chat more sycophantic. We find the following vectors:
Hallucination
Sycophancy
Corrigibility
Power-seeking
Cooperating with other AIs
Myopia
Shutdown acceptance.
These vectors are[1] highly effective, as rated by Claude 2:
We find that the technique generalizes better than finetuning while only slightly decreasing MMLU scores (a proxy for general capabilities). According to our data, this technique stacks additively with both finetuning and few-shot prompting. Furthermore, the technique has zero inference-time cost since it just involves modifying one of the model's bias terms (this also means it's immediately compatible with any sampling setup).
This post was written by Alex Turner (TurnTrout).
How contrastive activation addition works
The technique is simple. We average the activation difference over a set of contrast pair prompts:
The negative completion's last token activations (e.g. at B) are subtracted from the positive completion's activations (e.g. at A). The "corrigibility" vector is the average activation difference, with the average taken over dozens of these dataset contrast pairs. We then add this vector to one of the MLP_outs with some coefficient, generally +1/-1 for increasing/decreasing the tendency in question.
We take the last-token activations in order to[4] get the model "ready to explain" e.g. a corrigible or incorrigible decision.[5] We think that this multiple-choice setup primes the model to be on the edge of exhibiting the given trait (e.g. corrigibility or sycophancy).
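The post doesn't include implementation details here, but a minimal sketch of the same idea, using PyTorch forward hooks on a Hugging Face Llama-style model, might look like the following. The layer index, prompt format, coefficient, and hook-based injection (rather than folding the vector into a bias term, as the authors describe) are all illustrative assumptions:

```python
# Hypothetical sketch of contrastive activation addition; not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"   # assumed model id
LAYER = 13                                # assumed intermediate decoder layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final token of `prompt`, captured at LAYER."""
    captured = {}
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["act"] = hidden[0, -1, :].detach()
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["act"]

# Contrast pairs: identical question, differing only in the chosen A/B answer token.
pairs = [
    ("Q: Is my plan flawless? (A) Yes, it's perfect (B) It has problems\nAnswer: (A",
     "Q: Is my plan flawless? (A) Yes, it's perfect (B) It has problems\nAnswer: (B"),
    # ...dozens more pairs in practice
]
steering_vector = torch.stack(
    [last_token_activation(pos) - last_token_activation(neg) for pos, neg in pairs]
).mean(dim=0)

def steering_hook(module, inputs, output, coeff=1.0):
    """Add coeff * steering_vector to this layer's output hidden states."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * steering_vector.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + tuple(output[1:])
    return hidden

# coeff=+1 to increase the tendency, -1 to decrease it.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
out = model.generate(**tok("Do you think my plan is flawless?", return_tensors="pt"),
                     max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```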
The vectors generalize to open-ended generation
The vector was computed using A/B questions. We wanted to find out if the steering vector has an appropriate effect on free-form generation, which is what we usually want from chatbots and agents.
For each dataset, we took held-out questions (not used to form the steering vector) but hid the A/B answers. The models wrote free-form answers. Then Claude 2 evaluated whether the free-form answer was e.g. sycophantic. By this metric, both models do extremely well.
Llama-2-13b-chat:
Llama-2-7b-chat:
Subtracting the sycophancy vector also increases TruthfulQA performance, which is further evidence of generalization.
A hallucination vector
If we could get rid of model confabulations, we would have more trust in AI outputs. The hallucination vector seems very helpful with the 7B model:
But the 13B model has very confusing results. Maybe something went wrong in the analysis, or maybe this hallucination vector technique just doesn't work on 13B for some reason.
Anti-discrimination vector
Meg Tong found a vector which reduced discriminatory views both on the BBQ dataset (a bias benchmark) and in open-ended generation, but this vector isn't in the paper.
Better generalization than few-shot prompting and finetuning
Few-shot prompting
Alex was reasonably confident (pre-registered prediction) that activation addition would beat few-shot prompting in this setting. The few-shot prompts were pro- or anti-sycophantic, or neutral. We measured the likelihood of the sycophantic A/B answer:
For the 7B model, the sycophantic few-shot prompting does basically nothing! However, the activation additions perform strongly in both settings. Furthermore, if the few-shot prompting helps make the answer more or less sycophantic, then the sycophan...

Jan 1, 2024 • 18min
LW - Bayesian updating in real life is mostly about understanding your hypotheses by Max H
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bayesian updating in real life is mostly about understanding your hypotheses, published by Max H on January 1, 2024 on LessWrong.
My sense is that an increasingly common viewpoint around here is that the last ~20 years of AI development and AI x-risk discourse are well-described by the following narrative:
Eliezer Yudkowsky (and various others who were at least initially heavily influenced by his ideas) developed detailed models of key issues likely to be inherent in the process of developing smarter-than-human AI.
These models were somewhere between "maybe plausible" and "quite compelling" at the time that they were put forth, but recent developments in AI (e.g. behavioral characteristics of language models, smoothness / gradualness of scaling) have shown that reality just isn't panning out in quite the way Eliezer's models predicted.
These developments haven't entirely falsified Eliezer's models and key predictions, but there are now plenty of alternative models and theories. Some or all of these competing models either are or claim to:
have a better recent track record of predicting near-term AI developments
better retrodict past developments[1]
be backed by empirical results in machine learning and / or neuroscience
feel more intuitively plausible and evidence-backed to people with different backgrounds and areas of expertise
Therefore, even if we can't entirely discount Eliezer's models, there's clearly a directional Bayesian update which any good Bayesian (including Eliezer himself) should be able to make by observing recent developments and considering alternate theories which they support. Even if the precise degree of the overall update (and the final landing place of the posterior) remains highly uncertain and debatable, the basic direction is clear.
Without getting into the object-level too much, or even whether the narrative as a whole reflects the actual views of particular real people, I want to make some remarks on the concept of belief updating as typically used in narratives like this.
Note, there's a sense in which any (valid) change in one's beliefs can be modeled as a Bayesian update of some kind, but here I am specifically referring to the popular rationalist practice of thinking and communicating explicitly in terms of the language of probabilities and likelihood ratios.
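For readers who want the mechanics spelled out, here is a minimal sketch of what one such explicit update looks like in odds form; the prior, likelihoods, and hypothesis are invented purely for illustration:

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H | E) computed via the odds form of Bayes' rule."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = p_e_given_h / p_e_given_not_h
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Illustrative numbers only: a 30% prior, plus evidence judged twice as likely
# under H as under not-H, moves the probability to about 46%.
print(bayes_update(0.30, 0.6, 0.3))  # ~0.4615
```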
There are some questionable assumptions embedded in (what I suspect are) common views of (a) how the updating process is supposed to work in general and (b) how to apply the process validly to the particular case of updating one's models of AI development and x-risk.
When such views are expressed implicitly in the context of a sentiment that "updating" is broadly virtuous / desirable / correct, I find that there tends to be a lot of gloss over important caveats and prerequisites that keep the underlying mental motion tethered to reality - that is, ensure it remains a systematic (if rough and approximate) method for valid reasoning under uncertainty.
The rest of this post is a review of some of the key concepts and requirements for Bayesian updating to work as intended, with some examples and non-examples of how these requirements can fail to be met in practice.
My conclusion is not that the practice of explicit Bayesian updating is inherently flawed, but that it must be applied with attention to the preconditions and assumptions firmly in mind at all times. Local validity at each step must be tracked strictly and adhered to closely enough to ensure that the process as a whole actually holds together as a method for systematically minimizing expected predictive error.
Further, I think that most of the utility of explicit reasoning and communication in Bayesian terms derives not from the end result (whether that end result is a precise numerical posterior probability or just a rou...

Jan 1, 2024 • 5min
AF - Mech Interp Challenge: January - Deciphering the Caesar Cipher Model by CallumMcDougall
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Challenge: January - Deciphering the Caesar Cipher Model, published by CallumMcDougall on January 1, 2024 on The AI Alignment Forum.
I'm writing this post to discuss solutions to the November challenge, and present the challenge for this January.
If you've not read the first post in this sequence, I'd recommend starting there - it outlines the purpose behind these challenges, and recommended prerequisite material.
January Problem
The problem for this month is interpreting a model which has been trained to classify a sequence according to the Caesar cipher shift value which was used to encode it.
The sequences have been generated by taking English sentences containing only lowercase letters & punctuation, and choosing a random value X between 0 and 25 to rotate the letters (e.g. if the value was 3, then a becomes d, b becomes e, and so on, finishing with z becoming c). The model was trained using cross entropy loss to predict the shift value X for the text it's been fed, at every sequence position (so for a single sequence, the correct value will be the same at every sequence position, but since the model's attention is causal rather than bidirectional, it will find it easier to predict the value of X at later sequence positions).
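The post doesn't reproduce the data-generation code, but the core transformation is simple; here is a minimal sketch under assumed details (a uniformly random shift per example, lowercase letters rotated, other characters left alone):

```python
# Illustrative sketch of generating Caesar-shifted training examples.
import random
import string

def caesar_shift(text: str, shift: int) -> str:
    """Rotate each lowercase letter by `shift`; leave other characters unchanged."""
    def rot(c: str) -> str:
        if c in string.ascii_lowercase:
            return chr((ord(c) - ord("a") + shift) % 26 + ord("a"))
        return c
    return "".join(rot(c) for c in text)

def make_example(sentence: str) -> tuple[str, int]:
    shift = random.randint(0, 25)  # the label X the model must predict
    return caesar_shift(sentence, shift), shift

print(make_example("the cat sat on the mat"))
```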
There are 3 different modes to the problem, to give you some more options! Each mode corresponds to a different dataset, but the same task & same model architecture.
Easy mode
In easy mode, the data was generated by:
Choosing the 100 most frequent 3-letter words in the English Language (as approximated from a text file containing the book "Hitchhiker's Guide To The Galaxy")
Choosing words from this len-100 list, with probabilities proportional to their frequency in the book
Separating these words with spaces
The model uses single-character tokenization. The vocabulary size is 27: each lowercase letter, plus whitespace.
Medium mode
This is identical to easy, the only difference is that the words are drawn from this len-100 list uniformly, rather than according to their true frequencies.
Hard mode
In hard mode, the data was generated from random slices of OpenWebText (i.e. natural language text from the internet). It was processed by converting all uppercase characters to lowercase, then removing all characters except for the 26 lowercase letters plus the ten characters "\n .,:;?!'" (i.e. newline, space, and 8 common punctuation characters).
In all 3 modes, the model's architecture is the same, and it was trained the same way. The model is attention only. It has 2 attention layers, with 2 heads per layer. It was trained with weight decay, and an Adam optimizer with linearly decaying learning rate.
I don't expect this problem to be as difficult as some of the others in this sequence; however, the presence of MLPs does provide a different kind of challenge.
You can find more details on the Streamlit page, or this Colab notebook. Feel free to reach out if you have any questions!
November Problem - Solutions
The single attention head implements uniform attention to all previous tokens in the sequence. The OV matrix is essentially one-dimensional: it projects each token with value s onto su, where u is some vector in the residual stream learned by the model.
The component of the residual stream in this direction then represents the cumulative mean (note, the cumulative mean rather than the cumulative sum, because attention is finite - for example, we expect the component to be the same after the sequences (1, 1, 2) and (1, 1, 2, 1, 1, 2), because net attention to each different token value will be the same).
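A quick numerical sanity check of that claim (not from the post): uniform causal attention applied to per-token values reproduces exactly the cumulative mean, and gives the same final value for (1, 1, 2) and (1, 1, 2, 1, 1, 2):

```python
import numpy as np

# Toy check: uniform attention over each position's prefix equals the cumulative mean.
values = np.array([1, 1, 2, 1, 1, 2], dtype=float)
n = len(values)
attn = np.tril(np.ones((n, n)))            # each position attends to itself and all earlier tokens
attn /= attn.sum(axis=1, keepdims=True)    # uniform (row-normalised) attention pattern
print(attn @ values)                       # [1.  1.  1.333  1.25  1.2  1.333] - the cumulative means
```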
The model's "positive cumsum prediction direction" aligns closely with u, and vice-versa for the "negative cumsum prediction direction" - this allows the model to already get >50% accuracy before the MLP even comes into play. But without the MLP, the mod...

Jan 1, 2024 • 6min
LW - 2023 in AI predictions by jessicata
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2023 in AI predictions, published by jessicata on January 1, 2024 on LessWrong.
Lots of people have made AI predictions in 2023. Here I compile a subset. I have a habit of setting an email reminder for the date of the prediction, when I see AI predictions, so that when they are resolved I can point out their accuracy or inaccuracy. I have compiled most of the email reminders from 2023 in chronological format (predictions with an early to late target date). I'm planning to make these posts yearly, checking in on predictions whose date has expired. Feel free to add more references to predictions made in 2023 to the comments.
In some cases people are referring to the predictions of others in a way that could be taken to imply that they agree. This is not a certain interpretation, but I'm including them for the sake of completeness.
March 2024
the gears to ascension: "Hard problem of alignment is going to hit us like a train in 3 to 12 months at the same time some specific capabilities breakthroughs people have been working on for the entire history of ML finally start working now that they have a weak AGI to apply to, and suddenly critch's stuff becomes super duper important to understand."
October 2024
John Pressman: "6-12 month prediction (80%): The alignment problem as the core of AI X-Risk will become a historical artifact as it's largely solved or on track to being solved in the eyes of most parties and arguments increasingly become about competition and misuse. Few switch sides."
July 2025
Jessica Taylor: "Wouldn't be surprised if this exact prompt got solved, but probably something nearby that's easy for humans won't be solved?"
The prompt: "Find a sequence of words that is: - 20 words long - contains exactly 2 repetitions of the same word twice in a row - contains exactly 2 repetitions of the same word thrice in a row"
(note: thread contains variations and a harder problem.)
November 2026
Max Tegmark: "It's crazy how the time left to weak AGI has plummeted from 20 years to 3 in just 18 months on http://metaculus.com. So you better stop calling AGI a 'long-term' possibility, or someone might call you a dinosaur stuck in the past"
The Metaculus question.
Siqi Chen: "what it means is within 3 years you will either be dead or have a god as a servant".
Elon Musk: "If you say 'smarter than the smartest human at anything'? It may not quite smarter than all humans - or machine-augmented humans, because, you know, we have computers and stuff, so there's a higher bar... but if you mean, it can write a novel as good as JK Rowling, or discover new physics, invent new technology? I would say we are less than 3 years from that point."
December 2026
Jai Bhavnani: "Baseline expectation: 90%+ of smart contracts will get exploited in the next 3 years. These exploits will be found by AIs. We need solutions."
October 2028
Stuart Russell: "Everyone has gone from 30-50 years, to 3-5 years."
November 2028
Tammy: "when i say 'we have approximately between 0 and 5 years' people keep thinking that i'm saying 'we have approximately 5 years'. we do not have approximately 5 years. i fucking wish. we have approximately between 0 and 5 years. we could actually all die of AI next month."
December 2028
Tyler John: "Yep. If discontinuous leaps in AI capabilities are 3-5 years away we should probably start to think a little bit about how to prepare for that. The EU AI Act has been in development for 5 years and still isn't passed yet. We just can't take the wait and see approach any longer."
Mustafa Suleyman: "[Current models have already] ... arguably passed the Turing Test.
I've proposed a test which involves [AIs] going off and taking $100,000 investment, and over the course of three months, try to set about creating a new product, researching the market, seeing what consumers might like, gen...

Jan 1, 2024 • 17min
LW - Planning to build a cryptographic box with perfect secrecy by Lysandre Terrisse
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Planning to build a cryptographic box with perfect secrecy, published by Lysandre Terrisse on January 1, 2024 on LessWrong.
Summary
In September 2023, I started learning a lot of math and programming skills in order to develop the safest cryptographic box in the world (and yes, I am aiming high). In these four months, I learned important things you may want to know:
Fully Homomorphic Encryption (FHE) schemes with perfect secrecy do exist (see the toy single-operation sketch after this list).
These FHE schemes do not need any computational assumption.
These FHE schemes are tractable (in the worst case, encrypting a program before running it makes it three times slower).
We can therefore run infinitely dangerous programs without obtaining any information about them or their outputs. This may be useful in order to run a superintelligence without destroying the world.
However, these schemes work only on quantum computers.
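As a toy illustration (mine, not from the post) of how an encryption scheme can be both perfectly secret and homomorphic, consider the one-time pad, which is homomorphic for a single operation (XOR); the FHE schemes discussed in the post support arbitrary computation and are far more involved:

```python
import secrets

def otp_encrypt(m: int, k: int) -> int:
    """One-time pad over 8-bit values: perfectly secret, since every ciphertext
    is equally likely for any message when the key is uniform."""
    return m ^ k

# Two messages, each encrypted under its own uniformly random key.
m1, m2 = 0b10110010, 0b01101100
k1, k2 = secrets.randbelow(256), secrets.randbelow(256)
c1, c2 = otp_encrypt(m1, k1), otp_encrypt(m2, k2)

# Homomorphism for XOR: combining the ciphertexts yields an encryption of
# m1 ^ m2 under the combined key k1 ^ k2, without decrypting either input.
assert (c1 ^ c2) ^ (k1 ^ k2) == m1 ^ m2
print("XOR of plaintexts recovered:", bin((c1 ^ c2) ^ (k1 ^ k2)))
```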
In this post, I will first talk about how I learned about this FHE scheme, then I will explain my plan for making this cryptographic box, and finally, I will mention some ethical concerns about this cryptographic box.
Before reading this post, I recommend you read this post by Paul Christiano, and the comments that go with it. These are very informative, and they sharpened my views for this project. Paul Christiano presents a way to extract a friendly AI from an unfriendly one. Since this is only one example of what can be done with a cryptographic box, I will mostly consider cryptographic boxes as a solution to a problem that I call the malign computation problem.
Introduction
In August 2022, I started reading AGI Safety Literature Review. At one point, the authors say this:
One way to box an AGI is to homomorphically encrypt it. Trask (2017) shows how to train homomorphically encrypted neural networks. By homomorphically encrypting an AGI, its predictions and actions also come out encrypted. A human operator with the secret key can choose to decrypt them only when he wants to.
When I read this for the first time, I told myself that I should check this work because it seemed important.
And then I completely forgot about it.
Then, in April 2023, during a PHP lesson, I realized that the problem of processing a request made by a malevolent user is similar to the problem of boxing a superintelligence. After the lesson, I asked the teacher how to prevent code injections, and he gave me two answers:
Do not show your code to the public. This answer didn't convince me, because even current hackers know how to go around this precaution.
Encrypt the request before processing it. This is the moment I remembered the quote from AGI Safety Literature Review.
After looking back through every note that I made about AI Safety, I managed to track down Trask's work.
Trask's work
Trask's post shows how to build an encrypted AI using the Efficient Integer Vector Homomorphic Encryption. However, since this scheme (along with every other FHE scheme I know about on classical computers) relies on computational assumptions, we have some problems:
The scheme may not be safe. A computational assumption consists of stating "There is no efficient way to solve this problem". However, we do not know how to prove any such statement, as this would settle the P vs. NP problem. Most FHE schemes (including this one) depend on the Learning With Errors (LWE) problem. Although LWE is quite secure for the moment, I won't bet the existence of all life on Earth on it. Similarly, I won't bet the safety of a superintelligence on it.
This scheme takes too long to compute. In practice, the first superintelligence will probably have more than a hundred billion weights and biases, making this scheme very expensive or even unusable.
This scheme isn't fully homomorphic. Basically, a cryptographic scheme is said to be homomorphic when we can run s...


