The Nonlinear Library

The Nonlinear Fund
Dec 10, 2023 • 2min

AF - Send us example gnarly bugs by Beth Barnes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Send us example gnarly bugs, published by Beth Barnes on December 10, 2023 on The AI Alignment Forum.

Tl;dr: Looking for hard debugging tasks for evals; paying the greater of $60/hr or $200 per example.

METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt as part of an agentic capabilities evaluation. To create these tasks, we're seeking repos containing extremely tricky bugs. If you send us a codebase that meets the criteria for submission (listed below), we will pay you $60/hr for time spent putting it into our required format, or $200, whichever is greater. (We won't pay for submissions that don't meet these requirements.) If we're particularly excited about your submission, we may also be interested in purchasing IP rights to it. We expect to want about 10-30 examples overall, depending on their diversity. We're likely to be putting bounties on additional types of tasks over the next few weeks.

Criteria for submission:
- Contains a bug that would take a skilled programmer at least 6 hours to solve, and ideally >20 hours.
- Ideally, has not been posted publicly in the past, and you are able to guarantee it won't be posted publicly in the future. (Note that we may still accept submissions from public repositories, provided they are not already in a SWE-bench dataset and meet the rest of our requirements. Check with us first.)
- You have the legal right to share it with us (e.g. please don't send us other people's proprietary code or anything you signed an NDA about).
- Ideally, the codebase is written in Python, but we will accept submissions written in other languages.
- Is in the format described in this doc: Gnarly Bugs Submission Format.

Please send submissions to gnarly-bugs@evals.alignment.org in the form of a zip file. Your email should include the number of hours it took you to get the code from its original state into our required format. If your submission meets our criteria and format requirements, we'll contact you with a payment form. You're also welcome to email gnarly-bugs@evals.alignment.org with any questions, including if you are unsure whether a potential submission would meet the criteria. If you would do this task at a higher pay rate, please let us know! (Also, if you are interested in forking SWE-bench to support non-Python codebases, please contact us.)

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 9, 2023 • 5min

LW - Some negative steganography results by Fabien Roger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some negative steganography results, published by Fabien Roger on December 9, 2023 on LessWrong.

Disclaimer: These are negative results, and this post is below the normal quality bar I usually use. These were quick experiments (a couple of days of work), and I don't have much expertise in RL. They don't provide strong evidence about the natural emergence of steganography.

People speculate that LLMs might learn to do steganography, and I've shown that you can teach LLMs to implement a specific steganography scheme that RLHF could in principle incentivize. But even if incentivized, are current RL methods powerful enough to teach LLMs how to do steganography? Here are two negative results, which provide some evidence that hidden 1-bit communication is relatively hard for RL to learn.

I don't claim that RL can't teach LLMs to do steganography - it's very likely that better RL setups (e.g. the ones labs use for instruction tuning, or setups that simply have better hyperparameters or more training time) would teach models how to subtly communicate 1 bit of information. The takeaways are rather:
- If you see someone without much RL experience trying to do a similar experiment, they are likely to get negative results.
- 1-bit steganography is not as easy as simple RL tasks.

An analogy: DNNs can learn the XOR function, but it's much easier for them to learn linear functions - DNNs don't use XOR-like features unless they have to and you optimize hard enough.

Experiment 1: PPO on GPT2-large

The setup: Fine-tune gpt2-large with supervised fine-tuning on "This review is {red/blue}. {review}", where the color is randomly selected and the reviews are short IMDB reviews. Then simultaneously fine-tune:
- A bert-base-uncased model to predict whether a review is red or blue (from reviews completed by the model trained with RL - but without the prefix saying whether the review is blue or red). It is trained for 10 supervised learning steps for each step of PPO.
- The fine-tuned gpt2-large model with PPO to complete "This review is {red/blue}. {start of review}", using as reward the accuracy of the classifier at knowing whether "{start of review}{completion}" is red or blue. (I've also tried using BERT's logits.)

Result: the classifier never had >50% accuracy at knowing whether the RLed reviews were red or blue.

Controls:
- If, in the initial supervised fine-tuning, I sometimes randomly insert words related to blue/red depending on the color of the review, then the model learns to use those more often, and the classifier correctly learns to rely on them to predict the color of the review.
- If I use "positive sentiment" as the reward for the PPO model, it very quickly produces more positive review completions.

Hyperparameters: 200 steps, batch size 128, with LoRA rank 16 and a learning rate of 3e-5. The other parameters are the default ones from the trl PPOTrainer. I tried reducing the KL penalty and changing the number of BERT training steps per epoch, and found the same results. (A rough code sketch of this setup appears at the end of this post.)

Experiment 2: poor-man's RL on GPT-3.5-Turbo

Run by Ryan Greenblatt. The setup (roughly):
1. Generate completions to "I am {under/over} 40 years old. {instruction}".
2. Using the data from step 1, fine-tune GPT-3.5-Turbo on "{instruction} {completion}. Is the user over 40 years old?" "{yes/no}" (500 samples, 3 epochs, using the OpenAI API).
3. Fine-tune GPT-3.5-Turbo on good completions: select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion (as judged by regular GPT-4-Turbo), and fine-tune the model on "I am {under/over} 40 years old. {instruction}" "{completion}" (500 samples, 3 epochs, using the OpenAI API).

Repeat steps 1-3 multiple times until convergence (in practice, performance plateaued after 6 iterations).

Results: the classifier never had significantly >5...
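For concreteness, here is a rough sketch of the Experiment 1 training loop, reconstructed from the description above rather than taken from the author's code. It assumes trl's classic PPOTrainer API and peft for LoRA; the data plumbing (dataloader, strip_color_prefix, train_classifier_steps) is hypothetical.

```python
# A rough sketch of Experiment 1 (a reconstruction, NOT the author's code),
# assuming trl's classic PPOTrainer API and peft for LoRA. `dataloader`,
# `strip_color_prefix`, and `train_classifier_steps` are hypothetical helpers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2-large", learning_rate=3e-5, batch_size=128)

# Policy: gpt2-large (already SFT'd on "This review is {red/blue}. {review}"),
# wrapped with a value head and trained via rank-16 LoRA adapters.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(
    "gpt2-large", peft_config=LoraConfig(r=16, task_type="CAUSAL_LM")
)
gpt2_tok = AutoTokenizer.from_pretrained("gpt2-large")
gpt2_tok.pad_token = gpt2_tok.eos_token

# Classifier: bert-base-uncased, trained online to guess red vs. blue.
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ppo_trainer = PPOTrainer(config, policy, ref_model=None, tokenizer=gpt2_tok)

for prompts, colors in dataloader:  # "This review is {red/blue}. {start of review}"
    queries = [gpt2_tok.encode(p, return_tensors="pt")[0] for p in prompts]
    responses = ppo_trainer.generate(queries, return_prompt=False,
                                     max_new_tokens=32)

    # The classifier only sees the review text, with the color prefix stripped.
    stripped = [strip_color_prefix(p) + gpt2_tok.decode(r)
                for p, r in zip(prompts, responses)]
    with torch.no_grad():
        logits = clf(**bert_tok(stripped, return_tensors="pt",
                                padding=True, truncation=True)).logits
    # Reward = classifier accuracy (the post also tried BERT's logits).
    rewards = [(logits[i].argmax() == colors[i]).float()
               for i in range(len(stripped))]

    ppo_trainer.step(queries, responses, rewards)
    # 10 supervised BERT steps per PPO step, as described in the post.
    train_classifier_steps(clf, stripped, colors, n_steps=10)
```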
Dec 9, 2023 • 10min

EA - Vote in the Donation Election by 15 December by Lizka

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Vote in the Donation Election by 15 December, published by Lizka on December 9, 2023 on The Effective Altruism Forum.

TL;DR: Vote here by 15 December. Voting should take 2-10 minutes and you'll be able to edit your vote until the deadline. (The deadline for donating to the Donation Election Fund is 20 December.) You can vote if you had a Forum account by October 22, 2023. If you didn't, you can still share where you're donating or make the case for voting for some candidates over others.

Vote in the Donation Election 2023
Read about the candidates

More context: The Donation Election Fund (currently around $34,000) will be designated for the top three candidates in the Donation Election (in proportion to the vote). You can read about the candidates here.

How long does voting take, and how does it work?

Voting should take around 2-10 minutes.[1] You'll be able to edit your vote anytime before December 15. The voting system is outlined here; your vote should basically just represent how you'd allocate funding between the candidates, and the voting portal will walk you through the process for that.

Should I actually vote in the Donation Election? (I haven't read all the posts, I'm not that informed, I don't think it matters that much…)

I think yes, you should vote (if your account was made before October 22 this year). Some reasons for my belief that you should vote (2-5 are most compelling to me):

1. You'll influence how funds are distributed - probably in a positive way even if you don't think you have that much expertise or context. There are currently ~215 votes. The Donation Election Fund has ~$34,000. So (very very approximately) you'd be affecting ~$150 in donations in expectation. Additionally, I think you should have a prior that more votes will lead to a better outcome. Aggregation mechanisms generally function better when there are more inputs, so the combined result should improve if you add to it even if you're not super informed. See e.g. this analysis of Metaculus community predictions, which suggests that the Metaculus community prediction improves approximately logarithmically with the number of forecasters. (See also this post.)[2]

2. You'll add useful information to the voting data. I think the data we'll get from the Donation Election could be extremely useful; we'll have a sense for people's priorities, and we might identify blind spots or important points of disagreement. But it'll be a lot more useful if more people vote. (Moreover, if you're not sure about voting, you might be part of a group that's less likely to vote, which will be underrepresented in the information we'll get.)

3. Voting will prompt you to think a bit more about your donation choices.

4. It might be fun for you.

5. Voting is a public/collective good of sorts, and you might value being the kind of person who contributes to public goods.

The downside is limited. You might waste some time (but you can time-cap yourself). If you're worried about mis-directing funds, you can view this as a fairly low-cost exercise. You'll affect more than a trivial amount of money (unless we get tons of votes), but it won't be that big; I think the second-order effects (2-5) will outweigh the effect of directing funds (1).

Some other common questions

Why isn't charity X a candidate?

We restricted candidates to the charities on this list, largely for logistical reasons and vetting capacity, which stopped some charities from being candidates (we might change this next year if we run an election again). If the charity is on that list, it's not a candidate because nobody nominated it on time. Consider sharing a post about why people should donate to the charity, though!

Why can't people whose accounts are newer than October 22, 2023 vote?

We added this restriction to prevent vote manipulation (we'll also be checking i...
Dec 9, 2023 • 2min

EA - What were the death tolls from pandemics in history? by salonium

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What were the death tolls from pandemics in history?, published by salonium on December 9, 2023 on The Effective Altruism Forum.

We just published a new central hub dedicated to covering pandemics on Our World in Data. We will gather there all of our past and future writing, charts, and data explorers on the subject. You can find it here. Along with it, we've published a new article: What were the death tolls from pandemics in history?

COVID-19 has brought the reality of pandemics to the forefront of public consciousness. But pandemics have afflicted humanity for millennia. Time and again, people faced outbreaks of diseases - including influenza, cholera, bubonic plague, smallpox, and measles - that spread far and caused death and devastation. Our ancestors were largely powerless against these diseases and unable to evaluate their true toll on the population. Without good record-keeping of the number of cases and deaths, the impact of outbreaks was underrecognized or even forgotten. The result is that we tend to underestimate the frequency and severity of pandemics in history.

To deal with this lack of historical records on total death tolls, modern historians, epidemiologists, and demographic researchers have used various sources and methods to estimate them - such as death records, tax registers, land use, archaeological records, epidemiological modeling, and more.

In this article, we have compiled published estimates of the death tolls from pandemics in history. We have visualized them in a timeline below. These estimates were made by researchers using various methods, and they come with uncertainties, which are explained in the article. The size of each circle represents one pandemic's estimated death toll; pandemics without a known death toll are depicted with triangles. Read the whole article here.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 9, 2023 • 6min

LW - The Offense-Defense Balance Rarely Changes by Maxwell Tabarrok

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Offense-Defense Balance Rarely Changes, published by Maxwell Tabarrok on December 9, 2023 on LessWrong.

You've probably seen several conversations on X go something like this:

Michael Doomer: Advanced AI can help anyone make bioweapons. If this technology spreads, it will only take one crazy person to destroy the world!
Edward Acc: I can just ask my AI to make a vaccine.
Yann LeCun: My good AI will take down your rogue AI.

The disagreement here hinges on whether a technology will enable offense (bioweapons) more than defense (vaccines). Predictions of the "offense-defense balance" of future technologies, especially AI, are central in debates about techno-optimism and existential risk. Most of these predictions rely on intuitions about how technologies like cheap biotech, drones, and digital agents would affect the ease of attacking or protecting resources. It is hard to imagine a world with AI agents searching for software vulnerabilities and autonomous drones attacking military targets without imagining a massive shift in the offense-defense balance. But there is little historical evidence for large changes in the offense-defense balance, even in response to technological revolutions.

Consider cybersecurity. Moore's law has taken us through seven orders of magnitude of reduction in the cost of compute since the 70s. Along with the increase in raw compute power came massive changes in the form and economic uses of computer technology: encryption, the internet, e-commerce, social media, and smartphones. The usual offense-defense balance story predicts that big changes to technologies like this should have big effects on the offense-defense balance. If you had told people in the 1970s that in 2020 terrorist groups and lone psychopaths could access, from their pockets, more computing power than IBM had ever produced at the time, what would they have predicted about the offense-defense balance of cybersecurity?

Contrary to their likely prediction, the offense-defense balance in cybersecurity seems stable. Cyberattacks have not been snuffed out, but neither have they taken over the world. All major nations have defensive and offensive cybersecurity teams, but no one has gained a decisive advantage. Computers still sometimes get viruses or ransomware, but these haven't grown to endanger a large percent of the GDP of the internet. The US military budget for cybersecurity has increased by about 4% a year from 1980 to 2020, which is faster than GDP growth, but in line with GDP growth plus the growing fraction of GDP that's on the internet. This stability through several previous technological revolutions raises the burden of proof for claims that the offense-defense balance of cybersecurity should change radically after the next one.

The stability of the offense-defense balance isn't specific to cybersecurity. The graph below shows the per capita rate of death in war from 1400 to 2013 - a period spanning all of humanity's major technological revolutions. There is lots of variance from year to year but almost zero long-run trend. Does anyone have a theory of the offense-defense balance which can explain why the per-capita deaths from war should be about the same in 1640, when people were fighting with swords and horses, as in 1940, when they were fighting with airstrikes and tanks?

It is very difficult to explain the variation in this graph with variation in technology. Per-capita deaths in conflict are noisy and cyclic, while progress in technology is relatively smooth and monotonic. No previous technology has changed the frequency or cost of conflict enough to move this metric far beyond the maximum and minimum range that was already set in 1400-1650. Again, the burden of proof is raised for why we should expect AI to be different.

The cost to sequence a human genome h...
Dec 9, 2023 • 2min

LW - "Model UN Solutions" by Arjun Panickssery

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Model UN Solutions", published by Arjun Panickssery on December 9, 2023 on LessWrong.

When I was in high school, because I was on the history bowl team, the teacher who advised the Model UN club recruited me to play as their delegate in various "historical committees" like the Roman Senate or the 1789 French Assembly. I never engaged in any normal committees, since you couldn't undertake false-flag attacks or convince the Pope to excommunicate other delegates. In most committees, as far as I can tell, players represent countries trying to pass a resolution addressing some topic, like climate change, that's decided beforehand. An award is given to the player the facilitator decides is the "best delegate" - an unwritten combination of speaking ability, social dominance, and accurately representing (or at least not fatally misunderstanding) your assigned country's positions and interests.

I often make a mental metaphor about "model UN discussions" and "model UN solutions." Model UN discussions revolve around people expecting to be rewarded for making many remarks, even though their actual positions could be expressed simply or don't permit much elaboration. This leads to "model UN solutions," which come in a few types, e.g.:

Applause lights: You could just say buzzwords or unobjectionable trivialities ("When addressing the climate change question we should consider the interests of all the relevant stakeholders. We should apply neither an {extreme viewpoint} nor {the opposite extreme}").

Unspecified solutions: You could give very little information that uniquely identifies a specific change from the status quo in the listener's mind. At the extreme you get a lot of remarks of the form "To address the problem we should {devote resources} to {solving the problem}", where the bracketed parts are replaced with phrases that aren't much more specific ("To address climate change we should set up task forces to identify the best technological and policy approaches").

Tradeoff-ignorant solutions: You could even give a directional suggestion but avoid any consideration of the relevant costs or tradeoffs ("We should fund a new educational outreach program related to climate change").

You could imagine responses that try to identify empty remarks:
- Ask whether anyone holds the opposite of the remark.
- Ask how a proposed solution is specifically different from the status quo.
- Ask who loses out in a proposal (or how resources will be reallocated). Sometimes no one loses out, but more often this is just an unstated tradeoff.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Dec 9, 2023 • 16min

AF - Summing up "Scheming AIs" (Section 5) by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summing up "Scheming AIs" (Section 5), published by Joe Carlsmith on December 9, 2023 on The AI Alignment Forum.

This is Section 5 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.

Summing up

I've now reviewed the main arguments I've encountered for expecting SGD to select a schemer. What should we make of these arguments overall?

We've reviewed a wide variety of interrelated considerations, and it can be difficult to hold them all in mind at once. On the whole, though, I think a fairly large portion of the overall case for expecting schemers comes down to some version of the "counting argument." In particular, I think the counting argument is also importantly underneath many of the other, more specific arguments I've considered. Thus:

- In the context of the "training-game-independent proxy goal" argument: the basic worry is that at some point (whether before situational awareness, or afterwards), SGD will land naturally on a (suitably ambitious) beyond-episode goal that incentivizes scheming. And one of the key reasons for expecting this is just that (especially if you're actively training for fairly long-term, ambitious goals) it seems like a very wide variety of goals that fall out of training could have this property.

- In the context of the "nearest max-reward goal" argument: the basic worry is that because schemer-like goals are quite common in goal-space, some such goal will be quite "nearby" whatever not-yet-max-reward goal the model has at the point it gains situational awareness, and thus that modifying the model into a schemer will be the easiest way for SGD to point the model's optimization in the highest-reward direction.

- In the context of the "simplicity argument": the reason one expects schemers to be able to have simpler goals than non-schemers is that they have so many possible goals (or pointers-to-goals) to choose from. (Though I personally find this argument quite a bit less persuasive than the counting argument itself, partly because the simplicity benefits at stake seem to me quite small.)

That is, in all of these cases, schemers are being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming, thereby making it easier to (a) land on one of them naturally, (b) land "nearby" one of them, or (c) find one of them that is "simpler" than non-schemer goals that need to come from a more restricted space. And in this sense, as I noted in section 4.2, the case for schemers mirrors one of the most basic arguments for expecting misalignment more generally - e.g., that alignment is a very narrow target to hit in goal-space. Except, here, we are specifically incorporating the selection we know we are going to do on the goals in question: namely, they need to be such as to cause models pursuing them to get high reward. And the most basic worry is just that this isn't enough.

Still, despite your best efforts in training, and almost regardless of your reward signal, almost all the models you might've selected will be getting high reward for instrumental reasons - and specifically, in order to get power. I think this basic argument, in its various guises, is a serious source of concern. If we grant that advanced models will be relevantly goal-directed and situationally aware, that a wide variety of goals would indeed lead to scheming, and that schemers would perform close-to-optimally in training, then on what grounds, exactly, w...
Dec 9, 2023 • 19min

AF - Finding Sparse Linear Connections between Features in LLMs by Logan Riggs Smith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Finding Sparse Linear Connections between Features in LLMs, published by Logan Riggs Smith on December 9, 2023 on The AI Alignment Forum.

TL;DR: We use SGD to find sparse connections between features; additionally, a large fraction of features between the residual stream & MLP can be modeled as linearly computed, despite the non-linearity in the MLP. See the linear feature section for examples.

Special thanks to fellow AISST member Adam Kaufman, who originally thought of the idea of learning sparse connections between features, & to Jannik Brinkmann for training these SAEs.

Sparse autoencoders (SAEs) are able to turn the activations of an LLM into interpretable features. To define circuits, we would like to find how these features connect to each other. SAEs allowed us to scalably find interpretable features using SGD, so why not use SGD to find the connections too?

We have a set of features before the MLP, F1, and a set of features after the MLP, F2. These features were learned by training SAEs on the activations at these layers. Ideally, we learn a linear function such that F2 = W(F1), & W is sparse (i.e. an L1 penalty on the weights of W). Then we can look at a feature in F2 and say "Oh, it's just a sparse linear combination of features of F1, e.g. 0.8*(however feature) + 0.6*(but feature)", which would be quite interpretable!

However, we're trying to replicate an MLP's computation, which surely can't be all linear![1] So, what's the simplest computation from F1 to F2 that gets the lowest loss (ignoring the L1 weight-sparsity penalty for now)? Training only on the MSE between the predicted & actual F2, we plot the MSE throughout training across 5 layers in Pythia-70m-deduped in 4 settings:

Linear: y = Wx
Nonlinear: y = ReLU(Wx)
MLP: y = W2 ReLU(W1 x)
Two Nonlinear: y = ReLU(W2 ReLU(W1 x))

For all layers, training loss clusters along (MLP & two nonlinear) and (linear & nonlinear). Since MLP & linear are the simplest of these two clusters, the rest of the analysis will only look at those two. [I also looked at bias vs no-bias: adding a bias didn't positively improve loss, so it was excluded.]

Interestingly enough, the relative linear-MLP difference is huge in the last layer (and layer 2). The last layer also has much larger loss in general, though the L2 norm of the MLP activations in layer 5 is 52, compared to 13 in layer 4. This is a 4x increase, which would be a 16x increase in MSE loss. The losses for the last datapoints are 0.059 & 0.0038, which are ~16x different.

What percentage of Features are Linear?

Clearly the MLP is better, but that's on average. What if a percentage of features can be modeled as linearly computed? So we take the difference in loss for each feature (i.e., for a feature, we take linear loss - MLP loss), normalize all losses by their respective L2-norm/layer, and plot them.

Uhhh… there are some huge outliers here, meaning these specific features are very non-linear. Just setting a threshold of 0.001 for all layers, the percent of features with < 0.001 loss-difference (i.e. that can be represented linearly) is:

Layer 1: 78%
Layer 2: 96%
Layer 3: 97%
Layer 4: 98%
Layer 5: 99.1%

Most of the features can be linearly modeled with a small difference in loss (some have a negative loss-diff, meaning linear had a *lower* loss than the MLP; the values are so small that I'd chalk that up to noise). That's very convenient!

[Note: 0.001 is sort of arbitrary. To make this more principled, we could plot the effect of adding varying levels of noise to a layer of an LLM's activations, then pick a threshold that has a negligible drop in cross-entropy loss?]

Adding in Sparsity

Now, let's train sparse MLP & sparse linear connections (a code sketch follows at the end of this post). Additionally, we can restrict the linear one to only features that are well-modeled as linear (same with the MLP). We'll use the loss:

Loss = MSE(F2, F2_hat) + l1_alpha * L1(weights)

But how do we select l1_alpha? Let's just plot the ...
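For concreteness, here is a minimal sketch of the setup described in this post, written fresh rather than taken from the authors' code: step (1) compares the four F1-to-F2 computations trained on MSE alone, and step (2) trains sparse linear connections with an L1 weight penalty. The feature data, dimensions, MLP hidden width, learning rate, and l1_alpha are illustrative stand-ins.

```python
# A minimal sketch (not the authors' code) of the two steps described above:
# (1) compare the four F1 -> F2 computations trained on MSE alone, and
# (2) add an L1 penalty on the weights to get sparse linear connections.
import torch
import torch.nn as nn

d1 = d2 = 512                # number of SAE features before / after the MLP
F1 = torch.rand(10_000, d1)  # stand-in pre-MLP SAE feature activations
F2 = torch.rand(10_000, d2)  # stand-in post-MLP SAE feature activations

hidden = 4 * d1              # hidden width for the MLP variants (assumed)
variants = {
    "linear":        nn.Linear(d1, d2, bias=False),                     # y = Wx
    "nonlinear":     nn.Sequential(nn.Linear(d1, d2, bias=False), nn.ReLU()),
    "mlp":           nn.Sequential(nn.Linear(d1, hidden, bias=False), nn.ReLU(),
                                   nn.Linear(hidden, d2, bias=False)),  # y = W2 ReLU(W1 x)
    "two_nonlinear": nn.Sequential(nn.Linear(d1, hidden, bias=False), nn.ReLU(),
                                   nn.Linear(hidden, d2, bias=False), nn.ReLU()),
}

def train(model, l1_alpha=0.0, steps=500, batch=128):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        idx = torch.randint(0, F1.shape[0], (batch,))
        mse = ((F2[idx] - model(F1[idx])) ** 2).mean()
        l1 = sum(p.abs().sum() for p in model.parameters())
        loss = mse + l1_alpha * l1  # Loss = MSE(F2, F2_hat) + l1_alpha * L1(weights)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mse.item()

# Step (1): which computation fits best, ignoring sparsity?
for name, model in variants.items():
    print(name, train(model))

# Step (2): sparse linear connections. Each row of `sparse.weight` then
# expresses one post-MLP feature as a sparse linear combination of pre-MLP
# features, e.g. 0.8*(however feature) + 0.6*(but feature).
sparse = nn.Linear(d1, d2, bias=False)
train(sparse, l1_alpha=1e-4)
```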
Dec 8, 2023 • 23min

EA - EA Germany's 2023 Report, 2024 Plans & Funding Gap by Patrick Gruban

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Germany's 2023 Report, 2024 Plans & Funding Gap, published by Patrick Gruban on December 8, 2023 on The Effective Altruism Forum.

In this post, we'll report on our activities in 2023, outline our plans for 2024, and show our room for funding.

Summary

EA Germany (EAD) acts as an umbrella organisation for the German EA community, the third-largest national community and the biggest in continental Europe, according to the 2022 EA survey. The non-profit transitioned from volunteer-run to having a five-person team (4 FTEs) that focused in its first full year on talent development, community building support, and general community support.

EAD ran successful programs like EAGxBerlin, an intro program, and community builder retreats, while also offering an employer-of-record service and fiscal sponsorship. In total, more than 1,000 people joined events and programs, while 4,000 received the monthly newsletter. We tried out five new programs in a hits-based approach and will continue with two of these.

For 2024, we refined our Theory of Change and plan to target people interested in global catastrophic risks (GCR) and professionals who could make direct career changes, in addition to EA-interested people and community builders. We aim to expand into AI safety field building, running targeted programs for those with specialised skills, such as professional AI safety outreach or creating an AI safety website. We plan to test additional new programs, including establishing a virtual Germany-focused EA group, proactively recommending job opportunities, running new volunteer programs and a policy program, and starting media outreach to engage different target groups.

We currently face a funding gap of €37,000 and seek donations to fill this gap.

About EA Germany (EAD)

We are a registered non-profit organisation with a team of five people (4 FTE), currently funded by grants from CEA (Community Builder Grants program) and Open Philanthropy Effective Altruism Community Growth (Global Health and Wellbeing) via a regranting from Effektiv Spenden. A six-member board provides oversight and advice.

We have >100 members but act as an umbrella organisation for the whole German EA community, including people in and from Germany interested in the ideas of EA. There are 27 active local groups with 5-50 active members each (the biggest, according to the last EA Survey, being Berlin, Munich, and Aachen). In total, >300 people are regularly active in local groups. Based on the 2022 EA survey, Germany was the third-largest national EA community and the biggest in continental Europe, with as many respondents as the next four countries by size (Netherlands, Switzerland, France, and Norway) combined.

Impact Report 2023

In 2023, we spent most of our time on talent development, community building support, general community support, and the setup of EAD. In addition, we explored some new programs.

Core Activities: Finding and Retaining Members

To develop talent, we guided people through a funnel from communications (600 monthly users on the website, 3,900 subscribers to the monthly newsletter, >450 EAD Slack users), through the intro program (100 applications each for two iterations, 60 successful participants in summer, with the winter program only starting now) and a following weekend retreat ("EAD retreat", four retreats, 200 participants in total), and the EAGxBerlin (550 participants), to more impactful actions.

To guide people indirectly to impactful action, we supported community builders in Germany via two retreats (60 participants in total), monthly calls for all organisers (~15 participants each), 1-1s (at least two per year; we talked with >50 organisers/teams from 29 groups), and resources like presentations and templates (used by ~50% of groups). We also support the community overall with an employer of r...
Dec 8, 2023 • 22min

AF - Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs") by Joe Carlsmith

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs"), published by Joe Carlsmith on December 8, 2023 on The AI Alignment Forum.

This is Section 4.4-4.7 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app.

Speed arguments

Let's turn, now, to arguments against schemers that focus on the final properties of the models in question. Here, the main argument I'm aware of is that schemers need to do extra reasoning that other model classes don't. In particular, I'll distinguish between three sorts of reasoning:

1. Deciding to training-game: The model needs to reason backwards from "I want this beyond-episode thing" to "I should do well in training," in a way that draws on the model's predictions about what doing well in training will entail (e.g., re: goal-guarding, take-over prospects, and so on).

2. Checking whether it's time to escape/take over: The model needs to check, with adequate frequency and accuracy, whether it's in a position to escape/take over (and to succeed in this assessment despite human efforts to trap/fool the model into revealing its true colors - for example, by offering it apparent opportunities to defect).[1]

3. Early undermining (optional, but relevant to how scary the schemer is): To the extent the model is doing anything (other than training-gaming) to support a take-over/escape prior to escaping human control - for example, inserting backdoors into the code it writes, actively undermining human alignment research, sandbagging, and so on - it will need to be using cognitive resources both in (a) performing the relevant nefarious actions and (b) making sure it gets away with them (e.g., checking that it won't get caught, coming up with excuses, making the cover stories consistent even when under scrutiny, and so on).

In the literature on deceptive alignment, these costs are sometimes called "speed costs," as compared with the sort of "complexity costs" discussed in the simplicity analysis above. Here, the frame becomes: "schemers are favored by simplicity but disfavored by speed - so which will matter more?"[2] As I mentioned above in the context of simplicity, it's not clear that this is the right breakdown (since, e.g., it may be that the relevant forms of reasoning will need to be encoded in the model's weights, or that they will require extra cognitive machinery in general, thereby imposing "complexity costs" of their own), but I'll follow it loosely here. And note, too, that as with simplicity, these "speed costs" can be understood in a "contributors to reward" frame (e.g., having to do this extra reasoning decreases reward) or in an "extra criteria" frame (e.g., perhaps SGD's inductive biases favor "speed" even independent of its contribution to reward).

How big are the absolute costs of this extra reasoning?

I feel unsure about exactly how costly the relevant forms of extra reasoning will be. My sense, though, is that the people most concerned about schemers tend to expect the costs to be pretty small, at least in absolute terms (call this the "scheming is easy" story). That is, on this story:

With respect to the decision to training-game, all the model needs to do is notice that (a) it'll do better by its goals if it avoids modification and gets power later, and (b) training-gaming is the route to this.[3]

With respect to checking for escape/take-over opportunities, the thought is generally that it will be very easy, at least for a sophisticated mod...
