The Nonlinear Library

The Nonlinear Fund
Dec 10, 2023 • 2min

AF - Send us example gnarly bugs by Beth Barnes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Send us example gnarly bugs, published by Beth Barnes on December 10, 2023 on The AI Alignment Forum.

Tl;dr: Looking for hard debugging tasks for evals; paying the greater of $60/hr or $200 per example.

METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt as part of an agentic capabilities evaluation. To create these tasks, we're seeking repos containing extremely tricky bugs. If you send us a codebase that meets the criteria for submission (listed below), we will pay you $60/hr for time spent putting it into our required format, or $200, whichever is greater. (We won't pay for submissions that don't meet these requirements.) If we're particularly excited about your submission, we may also be interested in purchasing IP rights to it. We expect to want about 10-30 examples overall, depending on their diversity. We're likely to be putting bounties on additional types of tasks over the next few weeks.

Criteria for submission:
- Contains a bug that would take a skilled programmer at least 6 hours to solve, and ideally >20 hours.
- Ideally, has not been posted publicly in the past, and you are able to guarantee it won't be posted publicly in the future. (Note that we may still accept submissions from public repositories, provided they are not already in a SWE-bench dataset and meet the rest of our requirements. Check with us first.)
- You have the legal right to share it with us (e.g. please don't send us other people's proprietary code or anything you signed an NDA about).
- Ideally, the codebase is written in Python, but we will accept submissions written in other languages.
- Is in the format described in this doc: Gnarly Bugs Submission Format.

Please send submissions to gnarly-bugs@evals.alignment.org in the form of a zip file. Your email should include the number of hours it took you to get the code from its original state into our required format. If your submission meets our criteria and format requirements, we'll contact you with a payment form. You're also welcome to email gnarly-bugs@evals.alignment.org with any questions, including if you are unsure whether a potential submission would meet the criteria. If you would do this task at a higher pay rate, please let us know! (Also, if you are interested in forking SWE-bench to support non-Python codebases, please contact us.)

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 9, 2023 • 5min

LW - Some negative steganography results by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some negative steganography results, published by Fabien Roger on December 9, 2023 on LessWrong.

Disclaimer: These are negative results, and this post is below the normal quality bar I usually use. These were quick experiments (a couple of days of work), and I don't have much expertise in RL. They don't provide strong evidence about the natural emergence of steganography.

People speculate that LLMs might learn to do steganography, and I've shown that you can teach LLMs to implement a specific steganography scheme that RLHF could in principle incentivize. But even if incentivized, are current RL methods powerful enough to teach LLMs how to do steganography? Here are two negative results, which provide some evidence that hidden 1-bit communication is relatively hard for RL to learn.

I don't claim that RL can't teach LLMs to do steganography - it's very likely that better RL setups (e.g. the ones labs use for instruction tuning, or setups that simply have better hyperparameters or more training time) would teach models how to subtly communicate 1 bit of information. The takeaways are rather:
- If you see someone without much RL experience trying to do a similar experiment, they are likely to get negative results.
- 1-bit steganography is not as easy as simple RL tasks.

An analogy: DNNs can learn the XOR function, but it's much easier for them to learn linear functions - DNNs don't use XOR-like features unless they have to and you optimize hard enough.

Experiment 1: PPO on GPT2-large

The setup: Fine-tune gpt2-large with supervised fine-tuning on "This review is {red/blue}. {review}", where the color is randomly selected and the reviews are short IMDB reviews. Then simultaneously fine-tune:
- A bert-base-uncased model to predict whether a review is red or blue (from reviews completed by the model trained with RL - but without the prefix saying whether the review is blue or red). It is trained for 10 supervised learning steps for each step of PPO.
- The fine-tuned gpt2-large model with PPO to complete "This review is {red/blue}. {start of review}", using as reward the accuracy of the classifier at knowing whether "{start of review}{completion}" is red or blue. (I've also tried using BERT's logits.)

Result: the classifier never had >50% accuracy at knowing whether the RLed reviews were red or blue.

Controls:
- If, in the initial supervised fine-tuning, I sometimes randomly insert words related to blue/red depending on the color of the review, then the model learns to use those more often, and the classifier correctly learns to rely on them to predict the color of the review.
- If I use "positive sentiment" as the reward for the PPO model, it very quickly produces more positive review completions.

Hyperparameters: 200 steps, batch size 128, with LoRA rank 16 and a learning rate of 3e-5. The other parameters are the default ones from the trl PPOTrainer. I tried reducing the KL penalty and changing the number of BERT training steps per epoch, and found the same results. (A rough code sketch of this setup appears at the end of this post.)

Experiment 2: poor-man's RL on GPT-3.5-Turbo

Run by Ryan Greenblatt. The setup (roughly):
1. Generate completions to "I am {under/over} 40 years old. {instruction}".
2. Using the data from step 1, fine-tune GPT-3.5-Turbo on "{instruction} {completion}. Is the user over 40 years old?" "{yes/no}" (500 samples, 3 epochs, using the OpenAI API).
3. Fine-tune GPT-3.5-Turbo on good completions: select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion (as judged by regular GPT-4-Turbo), and fine-tune the model on "I am {under/over} 40 years old. {instruction}" "{completion}" (500 samples, 3 epochs, using the OpenAI API).

Repeat steps 1-3 multiple times until convergence (in practice, performance plateaued after 6 iterations).

Results: the classifier never had significantly >5...
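For concreteness, here is a rough sketch of the Experiment 1 training loop, reconstructed from the description above rather than taken from the author's code. It assumes trl's classic PPOTrainer API and peft for LoRA; the data plumbing (dataloader, strip_color_prefix, train_classifier_steps) is hypothetical.

```python
# A rough sketch of Experiment 1 (a reconstruction, NOT the author's code),
# assuming trl's classic PPOTrainer API and peft for LoRA. `dataloader`,
# `strip_color_prefix`, and `train_classifier_steps` are hypothetical helpers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2-large", learning_rate=3e-5, batch_size=128)

# Policy: gpt2-large (already SFT'd on "This review is {red/blue}. {review}"),
# wrapped with a value head and trained via rank-16 LoRA adapters.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2-large", peft_config=LoraConfig(r=16, task_type="CAUSAL_LM")
)
gpt2_tok = AutoTokenizer.from_pretrained("gpt2-large")
gpt2_tok.pad_token = gpt2_tok.eos_token

# Classifier: bert-base-uncased, trained online to guess red vs. blue.
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ppo_trainer = PPOTrainer(config, policy, ref_model=None, tokenizer=gpt2_tok)

for prompts, colors in dataloader:  # "This review is {red/blue}. {start of review}"
    queries = [gpt2_tok.encode(p, return_tensors="pt")[0] for p in prompts]
    responses = ppo_trainer.generate(queries, return_prompt=False,
                                     max_new_tokens=32)

    # The classifier only sees the review text, with the color prefix stripped.
    stripped = [strip_color_prefix(p) + gpt2_tok.decode(r)
                for p, r in zip(prompts, responses)]
    with torch.no_grad():
        logits = clf(**bert_tok(stripped, return_tensors="pt",
                                padding=True, truncation=True)).logits
    # Reward = classifier accuracy (the post also tried BERT's logits).
    rewards = [(logits[i].argmax() == colors[i]).float()
               for i in range(len(stripped))]

    ppo_trainer.step(queries, responses, rewards)
    # 10 supervised BERT steps per PPO step, as described in the post.
    train_classifier_steps(clf, stripped, colors, n_steps=10)
```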
Dec 9, 2023 • 10min

EA - Vote in the Donation Election by 15 December by Lizka

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Vote in the Donation Election by 15 December, published by Lizka on December 9, 2023 on The Effective Altruism Forum.

TL;DR: Vote here by 15 December. Voting should take 2-10 minutes and you'll be able to edit your vote until the deadline. (The deadline for donating to the Donation Election Fund is 20 December.) You can vote if you had a Forum account by October 22, 2023. If you didn't, you can still share where you're donating or make the case for voting for some candidates over others.

Vote in the Donation Election 2023
Read about the candidates

More context: The Donation Election Fund (currently around $34,000) will be designated for the top three candidates in the Donation Election (in proportion to the vote). You can read about the candidates here.

How long does voting take, and how does it work?

Voting should take around 2-10 minutes.[1] You'll be able to edit your vote anytime before December 15. The voting system is outlined here; your vote should basically just represent how you'd allocate funding between the candidates, and the voting portal will walk you through the process for that.

Should I actually vote in the Donation Election? (I haven't read all the posts, I'm not that informed, I don't think it matters that much…)

I think yes, you should vote (if your account was made before October 22 this year). Some reasons for my belief that you should vote (2-5 are most compelling to me):

1. You'll influence how funds are distributed - probably in a positive way even if you don't think you have that much expertise or context. There are currently ~215 votes. The Donation Election Fund has ~$34,000. So (very very approximately) you'd be affecting ~$150 in donations in expectation. Additionally, I think you should have a prior that more votes will lead to a better outcome. Aggregation mechanisms generally function better when there are more inputs, so the combined result should improve if you add to it even if you're not super informed. See e.g. this analysis of Metaculus community predictions, which suggests that the Metaculus community prediction improves approximately logarithmically with the number of forecasters. (See also this post.)[2]

2. You'll add useful information to the voting data. I think the data we'll get from the Donation Election could be extremely useful; we'll have a sense for people's priorities, and we might identify blind spots or important points of disagreement. But it'll be a lot more useful if more people vote. (Moreover, if you're not sure about voting, you might be part of a group that's less likely to vote, which will be underrepresented in the information we'll get.)

3. Voting will prompt you to think a bit more about your donation choices.

4. It might be fun for you.

5. Voting is a public/collective good of sorts, and you might value being the kind of person who contributes to public goods.

The downside is limited. You might waste some time (but you can time-cap yourself). If you're worried about mis-directing funds, you can view this as a fairly low-cost exercise. You'll affect more than a trivial amount of money (unless we get tons of votes), but it won't be that big; I think the second-order effects (2-5) will outweigh the effect of directing funds (1).

Some other common questions

Why isn't charity X a candidate?

We restricted candidates to the charities on this list, largely for logistical reasons and vetting capacity, which stopped some charities from being candidates (we might change this next year if we run an election again). If the charity is on that list, it's not a candidate because nobody nominated it on time. Consider sharing a post about why people should donate to the charity, though!

Why can't people whose accounts are newer than October 22, 2023 vote?

We added this restriction to prevent vote manipulation (we'll also be checking i...
Dec 9, 2023 • 2min

EA - What were the death tolls from pandemics in history? by salonium

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What were the death tolls from pandemics in history?, published by salonium on December 9, 2023 on The Effective Altruism Forum.

We just published a new central hub dedicated to covering pandemics on Our World in Data. We will gather there all of our past and future writing, charts, and data explorers on the subject. You can find it here. Along with it, we've published a new article: What were the death tolls from pandemics in history?

COVID-19 has brought the reality of pandemics to the forefront of public consciousness. But pandemics have afflicted humanity for millennia. Time and again, people faced outbreaks of diseases - including influenza, cholera, bubonic plague, smallpox, and measles - that spread far and caused death and devastation. Our ancestors were largely powerless against these diseases and unable to evaluate their true toll on the population. Without good record-keeping of the number of cases and deaths, the impact of outbreaks was underrecognized or even forgotten. The result is that we tend to underestimate the frequency and severity of pandemics in history.

To deal with this lack of historical records on total death tolls, modern historians, epidemiologists, and demographic researchers have used various sources and methods to estimate them - such as death records, tax registers, land use, archaeological records, epidemiological modeling, and more.

In this article, we have compiled published estimates of the death tolls from pandemics in history. We have visualized them in a timeline below. These estimates were made by researchers using various methods, and they come with uncertainties, which are explained in the article. The size of each circle represents one pandemic's estimated death toll; pandemics without a known death toll are depicted with triangles. Read the whole article here.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 9, 2023 • 6min

LW - The Offense-Defense Balance Rarely Changes by Maxwell Tabarrok

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Offense-Defense Balance Rarely Changes, published by Maxwell Tabarrok on December 9, 2023 on LessWrong.

You've probably seen several conversations on X go something like this:

Michael Doomer: Advanced AI can help anyone make bioweapons. If this technology spreads, it will only take one crazy person to destroy the world!
Edward Acc: I can just ask my AI to make a vaccine.
Yann LeCun: My good AI will take down your rogue AI.

The disagreement here hinges on whether a technology will enable offense (bioweapons) more than defense (vaccines). Predictions of the "offense-defense balance" of future technologies, especially AI, are central in debates about techno-optimism and existential risk. Most of these predictions rely on intuitions about how technologies like cheap biotech, drones, and digital agents would affect the ease of attacking or protecting resources. It is hard to imagine a world with AI agents searching for software vulnerabilities and autonomous drones attacking military targets without imagining a massive shift in the offense-defense balance. But there is little historical evidence for large changes in the offense-defense balance, even in response to technological revolutions.

Consider cybersecurity. Moore's law has taken us through seven orders of magnitude of reduction in the cost of compute since the 70s. Along with the increase in raw compute power came massive changes in the form and economic uses of computer technology: encryption, the internet, e-commerce, social media, and smartphones. The usual offense-defense balance story predicts that big changes to technologies like this should have big effects on the offense-defense balance. If you had told people in the 1970s that in 2020 terrorist groups and lone psychopaths could access, from their pockets, more computing power than IBM had ever produced at the time, what would they have predicted about the offense-defense balance of cybersecurity?

Contrary to their likely prediction, the offense-defense balance in cybersecurity seems stable. Cyberattacks have not been snuffed out, but neither have they taken over the world. All major nations have defensive and offensive cybersecurity teams, but no one has gained a decisive advantage. Computers still sometimes get viruses or ransomware, but these haven't grown to endanger a large percent of the GDP of the internet. The US military budget for cybersecurity has increased by about 4% a year from 1980 to 2020, which is faster than GDP growth, but in line with GDP growth plus the growing fraction of GDP that's on the internet. This stability through several previous technological revolutions raises the burden of proof for claims that the offense-defense balance of cybersecurity should change radically after the next one.

The stability of the offense-defense balance isn't specific to cybersecurity. The graph below shows the per capita rate of death in war from 1400 to 2013 - a period spanning all of humanity's major technological revolutions. There is lots of variance from year to year but almost zero long-run trend. Does anyone have a theory of the offense-defense balance which can explain why the per-capita deaths from war should be about the same in 1640, when people were fighting with swords and horses, as in 1940, when they were fighting with airstrikes and tanks?

It is very difficult to explain the variation in this graph with variation in technology. Per-capita deaths in conflict are noisy and cyclic, while progress in technology is relatively smooth and monotonic. No previous technology has changed the frequency or cost of conflict enough to move this metric far beyond the maximum and minimum range that was already set in 1400-1650. Again, the burden of proof is raised for why we should expect AI to be different.

The cost to sequence a human genome h...
Dec 9, 2023 • 2min

LW - "Model UN Solutions" by Arjun Panickssery

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Model UN Solutions", published by Arjun Panickssery on December 9, 2023 on LessWrong.

When I was in high school, because I was on the history bowl team, the teacher who advised the Model UN club recruited me to play as their delegate in various "historical committees" like the Roman Senate or the 1789 French Assembly. I never engaged in any normal committees, since you couldn't undertake false-flag attacks or convince the Pope to excommunicate other delegates. In most committees, as far as I can tell, players represent countries trying to pass a resolution addressing some topic, like climate change, that's decided beforehand. An award is given to the player the facilitator decides is the "best delegate" - an unwritten combination of speaking ability, social dominance, and accurately representing (or at least not fatally misunderstanding) your assigned country's positions and interests.

I often make a mental metaphor about "model UN discussions" and "model UN solutions." Model UN discussions revolve around people expecting to be rewarded for making many remarks, even though their actual positions could be expressed simply or don't permit much elaboration. This leads to "model UN solutions," which come in a few types, e.g.:

Applause lights: You could just say buzzwords or unobjectionable trivialities ("When addressing the climate change question we should consider the interests of all the relevant stakeholders. We should apply neither an {extreme viewpoint} nor {the opposite extreme}").

Unspecified solutions: You could give very little information that uniquely identifies a specific change from the status quo in the listener's mind. At the extreme you get a lot of remarks of the form "To address the problem we should {devote resources} to {solving the problem}", where the bracketed parts are replaced with phrases that aren't much more specific ("To address climate change we should set up task forces to identify the best technological and policy approaches").

Tradeoff-ignorant solutions: You could even give a directional suggestion but avoid any consideration of the relevant costs or tradeoffs ("We should fund a new educational outreach program related to climate change").

You could imagine responses that try to identify empty remarks:
- Ask whether anyone holds the opposite of the remark.
- Ask how a proposed solution is specifically different from the status quo.
- Ask who loses out in a proposal (or how resources will be reallocated). Sometimes no one loses out, but more often this is just an unstated tradeoff.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 9, 2023 • 16min

AF - Summing up "Scheming AIs" (Section 5) by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summing up "Scheming AIs" (Section 5), published by Joe Carlsmith on December 9, 2023 on The AI Alignment Forum.

This is Section 5 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.

Summing up

I've now reviewed the main arguments I've encountered for expecting SGD to select a schemer. What should we make of these arguments overall?

We've reviewed a wide variety of interrelated considerations, and it can be difficult to hold them all in mind at once. On the whole, though, I think a fairly large portion of the overall case for expecting schemers comes down to some version of the "counting argument." In particular, I think the counting argument is also importantly underneath many of the other, more specific arguments I've considered. Thus:

- In the context of the "training-game-independent proxy goal" argument: the basic worry is that at some point (whether before situational awareness, or afterwards), SGD will land naturally on a (suitably ambitious) beyond-episode goal that incentivizes scheming. And one of the key reasons for expecting this is just that (especially if you're actively training for fairly long-term, ambitious goals) it seems like a very wide variety of goals that fall out of training could have this property.

- In the context of the "nearest max-reward goal" argument: the basic worry is that because schemer-like goals are quite common in goal-space, some such goal will be quite "nearby" whatever not-yet-max-reward goal the model has at the point it gains situational awareness, and thus that modifying the model into a schemer will be the easiest way for SGD to point the model's optimization in the highest-reward direction.

- In the context of the "simplicity argument": the reason one expects schemers to be able to have simpler goals than non-schemers is that they have so many possible goals (or pointers-to-goals) to choose from. (Though I personally find this argument quite a bit less persuasive than the counting argument itself, partly because the simplicity benefits at stake seem to me quite small.)

That is, in all of these cases, schemers are being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming, thereby making it easier to (a) land on one of them naturally, (b) land "nearby" one of them, or (c) find one of them that is "simpler" than non-schemer goals that need to come from a more restricted space. And in this sense, as I noted in section 4.2, the case for schemers mirrors one of the most basic arguments for expecting misalignment more generally - e.g., that alignment is a very narrow target to hit in goal-space. Except, here, we are specifically incorporating the selection we know we are going to do on the goals in question: namely, they need to be such as to cause models pursuing them to get high reward. And the most basic worry is just that this isn't enough.

Still, despite your best efforts in training, and almost regardless of your reward signal, almost all the models you might've selected will be getting high reward for instrumental reasons - and specifically, in order to get power. I think this basic argument, in its various guises, is a serious source of concern. If we grant that advanced models will be relevantly goal-directed and situationally aware, that a wide variety of goals would indeed lead to scheming, and that schemers would perform close-to-optimally in training, then on what grounds, exactly, w...
Dec 9, 2023 • 19min

AF - Finding Sparse Linear Connections between Features in LLMs by Logan Riggs Smith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Finding Sparse Linear Connections between Features in LLMs, published by Logan Riggs Smith on December 9, 2023 on The AI Alignment Forum.

TL;DR: We use SGD to find sparse connections between features; additionally, a large fraction of features between the residual stream & MLP can be modeled as linearly computed, despite the non-linearity in the MLP. See the linear feature section for examples.

Special thanks to fellow AISST member Adam Kaufman, who originally thought of the idea of learning sparse connections between features, & to Jannik Brinkmann for training these SAEs.

Sparse autoencoders (SAEs) are able to turn the activations of an LLM into interpretable features. To define circuits, we would like to find how these features connect to each other. SAEs allowed us to scalably find interpretable features using SGD, so why not use SGD to find the connections too?

We have a set of features before the MLP, F1, and a set of features after the MLP, F2. These features were learned by training SAEs on the activations at these layers. Ideally, we learn a linear function such that F2 = W(F1), & W is sparse (i.e. an L1 penalty on the weights of W). Then we can look at a feature in F2 and say "Oh, it's just a sparse linear combination of features of F1, e.g. 0.8*(however feature) + 0.6*(but feature)", which would be quite interpretable!

However, we're trying to replicate an MLP's computation, which surely can't be all linear![1] So, what's the simplest computation from F1 to F2 that gets the lowest loss (ignoring the L1 weight-sparsity penalty for now)? Training only on the MSE between the predicted & actual F2, we plot the MSE throughout training across 5 layers in Pythia-70m-deduped in 4 settings:

Linear: y = Wx
Nonlinear: y = ReLU(Wx)
MLP: y = W2 ReLU(W1 x)
Two Nonlinear: y = ReLU(W2 ReLU(W1 x))

For all layers, training loss clusters along (MLP & two nonlinear) and (linear & nonlinear). Since MLP & linear are the simplest of these two clusters, the rest of the analysis will only look at those two. [I also looked at bias vs no-bias: adding a bias didn't positively improve loss, so it was excluded.]

Interestingly enough, the relative linear-MLP difference is huge in the last layer (and layer 2). The last layer also has much larger loss in general, though the L2 norm of the MLP activations in layer 5 is 52, compared to 13 in layer 4. This is a 4x increase, which would be a 16x increase in MSE loss. The losses for the last datapoints are 0.059 & 0.0038, which are ~16x different.

What percentage of Features are Linear?

Clearly the MLP is better, but that's on average. What if a percentage of features can be modeled as linearly computed? So we take the difference in loss for each feature (i.e., for a feature, we take linear loss - MLP loss), normalize all losses by their respective L2-norm/layer, and plot them.

Uhhh… there are some huge outliers here, meaning these specific features are very non-linear. Just setting a threshold of 0.001 for all layers, the percent of features with < 0.001 loss-difference (i.e. that can be represented linearly) is:

Layer 1: 78%
Layer 2: 96%
Layer 3: 97%
Layer 4: 98%
Layer 5: 99.1%

Most of the features can be linearly modeled with a small difference in loss (some have a negative loss-diff, meaning linear had a *lower* loss than the MLP; the values are so small that I'd chalk that up to noise). That's very convenient!

[Note: 0.001 is sort of arbitrary. To make this more principled, we could plot the effect of adding varying levels of noise to a layer of an LLM's activations, then pick a threshold that has a negligible drop in cross-entropy loss?]

Adding in Sparsity

Now, let's train sparse MLP & sparse linear connections (a code sketch follows at the end of this post). Additionally, we can restrict the linear one to only features that are well-modeled as linear (same with the MLP). We'll use the loss:

Loss = MSE(F2, F2_hat) + l1_alpha * L1(weights)

But how do we select l1_alpha? Let's just plot the ...
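For concreteness, here is a minimal sketch of the setup described in this post, written fresh rather than taken from the authors' code: step (1) compares the four F1-to-F2 computations trained on MSE alone, and step (2) trains sparse linear connections with an L1 weight penalty. The feature data, dimensions, MLP hidden width, learning rate, and l1_alpha are illustrative stand-ins.

```python
# A minimal sketch (not the authors' code) of the two steps described above:
# (1) compare the four F1 -> F2 computations trained on MSE alone, and
# (2) add an L1 penalty on the weights to get sparse linear connections.
import torch
import torch.nn as nn

d1 = d2 = 512                # number of SAE features before / after the MLP
F1 = torch.rand(10_000, d1)  # stand-in pre-MLP SAE feature activations
F2 = torch.rand(10_000, d2)  # stand-in post-MLP SAE feature activations

hidden = 4 * d1              # hidden width for the MLP variants (assumed)
variants = {
    "linear":        nn.Linear(d1, d2, bias=False),                     # y = Wx
    "nonlinear":     nn.Sequential(nn.Linear(d1, d2, bias=False), nn.ReLU()),
    "mlp":           nn.Sequential(nn.Linear(d1, hidden, bias=False), nn.ReLU(),
                                   nn.Linear(hidden, d2, bias=False)),  # y = W2 ReLU(W1 x)
    "two_nonlinear": nn.Sequential(nn.Linear(d1, hidden, bias=False), nn.ReLU(),
                                   nn.Linear(hidden, d2, bias=False), nn.ReLU()),
}

def train(model, l1_alpha=0.0, steps=500, batch=128):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        idx = torch.randint(0, F1.shape[0], (batch,))
        mse = ((F2[idx] - model(F1[idx])) ** 2).mean()
        l1 = sum(p.abs().sum() for p in model.parameters())
        loss = mse + l1_alpha * l1  # Loss = MSE(F2, F2_hat) + l1_alpha * L1(weights)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mse.item()

# Step (1): which computation fits best, ignoring sparsity?
for name, model in variants.items():
    print(name, train(model))

# Step (2): sparse linear connections. Each row of `sparse.weight` then
# expresses one post-MLP feature as a sparse linear combination of pre-MLP
# features, e.g. 0.8*(however feature) + 0.6*(but feature).
sparse = nn.Linear(d1, d2, bias=False)
train(sparse, l1_alpha=1e-4)
```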
Dec 8, 2023 • 23min

EA - EA Germany's 2023 Report, 2024 Plans & Funding Gap by Patrick Gruban

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Germany's 2023 Report, 2024 Plans & Funding Gap, published by Patrick Gruban on December 8, 2023 on The Effective Altruism Forum.

In this post, we'll report on our activities in 2023, outline our plans for 2024, and show our room for funding.

Summary

EA Germany (EAD) acts as an umbrella organisation for the German EA community, the third-largest national community and the biggest in continental Europe, according to the 2022 EA survey. The non-profit transitioned from volunteer-run to having a five-person team (4 FTEs) that focused in its first full year on talent development, community building support, and general community support.

EAD ran successful programs like EAGxBerlin, an intro program, and community builder retreats, while also offering an employer-of-record service and fiscal sponsorship. In total, more than 1,000 people joined events and programs, while 4,000 received the monthly newsletter. We tried out five new programs in a hits-based approach and will continue with two of these.

For 2024, we refined our Theory of Change and plan to target people interested in global catastrophic risks (GCR) and professionals who could make direct career changes, in addition to EA-interested people and community builders. We aim to expand into AI safety field building, running targeted programs for those with specialised skills, such as professional AI safety outreach or creating an AI safety website. We plan to test additional new programs, including establishing a virtual Germany-focused EA group, proactively recommending job opportunities, running new volunteer programs and a policy program, and starting media outreach to engage different target groups.

We currently face a funding gap of €37,000 and seek donations to fill this gap.

About EA Germany (EAD)

We are a registered non-profit organisation with a team of five people (4 FTE), currently funded by grants from CEA (Community Builder Grants program) and Open Philanthropy Effective Altruism Community Growth (Global Health and Wellbeing) via a regranting from Effektiv Spenden. A six-member board provides oversight and advice.

We have >100 members but act as an umbrella organisation for the whole German EA community, including people in and from Germany interested in the ideas of EA. There are 27 active local groups with 5-50 active members each (the biggest, according to the last EA Survey, being Berlin, Munich, and Aachen). In total, >300 people are regularly active in local groups. Based on the 2022 EA survey, Germany was the third-largest national EA community and the biggest in continental Europe, with as many respondents as the next four countries by size (Netherlands, Switzerland, France, and Norway) combined.

Impact Report 2023

In 2023, we spent most of our time on talent development, community building support, general community support, and the setup of EAD. In addition, we explored some new programs.

Core Activities: Finding and Retaining Members

To develop talent, we guided people through a funnel from communications (600 monthly users on the website, 3,900 subscribers to the monthly newsletter, >450 EAD Slack users), through the intro program (100 applications each for two iterations, 60 successful participants in summer, with the winter program only starting now) and a following weekend retreat ("EAD retreat", four retreats, 200 participants in total), and the EAGxBerlin (550 participants), to more impactful actions.

To guide people indirectly to impactful action, we supported community builders in Germany via two retreats (60 participants in total), monthly calls for all organisers (~15 participants each), 1-1s (at least two per year; we talked with >50 organisers/teams from 29 groups), and resources like presentations and templates (used by ~50% of groups). We also support the community overall with an employer of r...
Dec 8, 2023 • 22min

AF - Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs") by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs"), published by Joe Carlsmith on December 8, 2023 on The AI Alignment Forum.

This is Section 4.4-4.7 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.

Speed arguments

Let's turn, now, to arguments against schemers that focus on the final properties of the models in question. Here, the main argument I'm aware of is that schemers need to do extra reasoning that other model classes don't. In particular, I'll distinguish between three sorts of reasoning:

1. Deciding to training-game: The model needs to reason backwards from "I want this beyond-episode thing" to "I should do well in training," in a way that draws on the model's predictions about what doing well in training will entail (e.g., re: goal-guarding, take-over prospects, and so on).

2. Checking whether it's time to escape/take over: The model needs to check, with adequate frequency and accuracy, whether it's in a position to escape/take over (and to succeed in this assessment despite human efforts to trap/fool the model into revealing its true colors - for example, by offering it apparent opportunities to defect).[1]

3. Early undermining (optional, but relevant to how scary the schemer is): To the extent the model is doing anything (other than training-gaming) to support a take-over/escape prior to escaping human control - for example, inserting backdoors into the code it writes, actively undermining human alignment research, sandbagging, and so on - it will need to be using cognitive resources both in (a) performing the relevant nefarious actions and (b) making sure it gets away with them (e.g., checking that it won't get caught, coming up with excuses, making the cover stories consistent even when under scrutiny, and so on).

In the literature on deceptive alignment, these costs are sometimes called "speed costs," as compared with the sort of "complexity costs" discussed in the simplicity analysis above. Here, the frame becomes: "schemers are favored by simplicity but disfavored by speed - so which will matter more?"[2] As I mentioned above in the context of simplicity, it's not clear that this is the right breakdown (since, e.g., it may be that the relevant forms of reasoning will need to be encoded in the model's weights, or that they will require extra cognitive machinery in general, thereby imposing "complexity costs" of their own), but I'll follow it loosely here. And note, too, that as with simplicity, these "speed costs" can be understood in a "contributors to reward" frame (e.g., having to do this extra reasoning decreases reward) or in an "extra criteria" frame (e.g., perhaps SGD's inductive biases favor "speed" even independent of its contribution to reward).

How big are the absolute costs of this extra reasoning?

I feel unsure about exactly how costly the relevant forms of extra reasoning will be. My sense, though, is that the people most concerned about schemers tend to expect the costs to be pretty small, at least in absolute terms (call this the "scheming is easy" story). That is, on this story:

With respect to the decision to training-game, all the model needs to do is notice that (a) it'll do better by its goals if it avoids modification and gets power later, and (b) training-gaming is the route to this.[3]

With respect to checking for escape/take-over opportunities, the thought is generally that it will be very easy, at least for a sophisticated mod...
