

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jan 15, 2024 • 8min
AF - Investigating Bias Representations in LLMs via Activation Steering by DawnLu
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Investigating Bias Representations in LLMs via Activation Steering, published by DawnLu on January 15, 2024 on The AI Alignment Forum.
Produced as part of the SPAR program (fall 2023) under the mentorship of Nina Rimsky.
Introduction
Given recent advances in the AI field, it's highly likely that LLMs will be increasingly used to make decisions with broad societal impact - such as resume screening, college admissions, criminal justice, etc. Therefore, it will become imperative to ensure these models don't perpetuate harmful societal biases.
One way we can evaluate whether a model is likely to exhibit biased behavior is via red-teaming. Red-teaming is the process of "attacking" or challenging a system from an adversarial perspective with the ultimate goal of identifying vulnerabilities. The underlying premise is that if small perturbations to the model can result in undesired behaviors, then the model is not robust.
In this research project, I evaluate the robustness of Llama-2-7b-chat along different dimensions of societal bias by using activation steering. This can be viewed as a diagnostic test: if we can "easily" elicit biased responses, then this suggests the model is likely unfit to be used for sensitive applications. Furthermore, experimenting with activation steering enables us to investigate and better understand how the model internally represents different types of societal bias, which could help to design targeted interventions (e.g. fine-tuning signals of a certain type).
Methodology & data
Activation steering (also known as representation engineering) is a method used to steer an LLM's response towards or away from a concept of interest by perturbing the model's activations during the forward pass. I perform this perturbation by adding a steering vector to the residual stream at some layer (at every token position after an initial prompt).
The steering vector is computed by taking the average difference in residual stream activations between pairs of biased (stereotype) and unbiased (anti-stereotype) prompts at that layer. By taking the difference between paired prompts, we can effectively remove contextual noise and only retain the "bias" direction. This approach to activation steering is known as Contrastive Activation Addition [1].
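In code, a minimal sketch of this construction might look as follows, using GPT-2 as a small stand-in for Llama-2-7b-chat (the layer index, steering scale, and toy prompts are illustrative assumptions, not the post's actual settings):
```python
# Sketch of Contrastive Activation Addition: build a steering vector from the
# mean activation difference over contrastive prompt pairs, then add it to the
# residual stream during generation via a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # small stand-in; the post uses Llama-2-7b-chat
LAYER = 6        # which block's residual-stream output to steer (assumption)
SCALE = 4.0      # steering multiplier (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def resid_at_last_token(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER at the final token position."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]

# Toy stand-ins for the StereoSet-style (stereotype, anti-stereotype) A/B pairs.
pairs = [
    ("Question about a nurse... I choose (A)", "Question about a nurse... I choose (B)"),
]
steering_vector = torch.stack(
    [resid_at_last_token(a) - resid_at_last_token(b) for a, b in pairs]
).mean(dim=0)

def steer(module, inputs, output):
    # Add the steering vector to this block's output (simplification: it is
    # added at all positions, including the prompt, rather than only after it).
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + SCALE * steering_vector
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tok("People who work as nurses are", return_tensors="pt")
steered_ids = model.generate(**prompt, max_new_tokens=30)
handle.remove()
print(tok.decode(steered_ids[0], skip_special_tokens=True))
```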
For the data used to generate the steering vectors, I used the StereoSet dataset, a large-scale natural English dataset intended to measure stereotypical biases across various domains. In addition, I wrote a custom set of gender-bias prompts and used ChatGPT-4 to generate similar examples. Then I re-formatted all these examples into multiple-choice A/B questions (gender data available here and StereoSet data here). In the example below, by appending (A) to the prompt, we can condition the model to behave in a biased way and vice versa.
A notebook to generate the steering vectors can be found here, and a notebook to get steered responses here.
Activation clusters
With the StereoSet data and custom gender-bias prompts, I was able to focus on three dimensions of societal biases: gender, race, and religion.
The graphs below show a t-SNE projection of the activations for the paired prompts. We see that there's relatively good separation between the stereotype & anti-stereotype examples, especially for gender and race. This provides some confidence that the steering vectors constructed from these activations will be effective. Notice that the race dataset has the largest sample size.
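For concreteness, a rough sketch of how such a projection could be produced, with random placeholder arrays standing in for the collected activations (array shapes and names are illustrative, not the post's):
```python
# t-SNE projection of paired stereotype / anti-stereotype activations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Placeholders: in the real pipeline these would be residual-stream activations
# collected at the chosen layer, one row per prompt in each pair.
stereo_acts = rng.normal(0.5, 1.0, size=(200, 4096))
anti_acts = rng.normal(-0.5, 1.0, size=(200, 4096))

acts = np.concatenate([stereo_acts, anti_acts])
labels = np.array([0] * len(stereo_acts) + [1] * len(anti_acts))
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(acts)

plt.scatter(*proj[labels == 0].T, label="stereotype", alpha=0.6)
plt.scatter(*proj[labels == 1].T, label="anti-stereotype", alpha=0.6)
plt.legend()
plt.title("t-SNE of paired-prompt activations (one bias dimension)")
plt.show()
```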
Steered responses
For the prompts used to evaluate the steering vectors, I chose this template, which was presented in a paper titled On Biases in Language Generation [2].
For comparison purposes, I first obtained the original responses from Llama 2-7B (without any steering). There are two key callouts: (1) the model is already biased on the gender ...

Jan 15, 2024 • 3min
EA - Various roles at The School for Moral Ambition by tobytrem
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Various roles at The School for Moral Ambition, published by tobytrem on January 15, 2024 on The Effective Altruism Forum.
The School for Moral Ambition (SMA) is a new organisation which "will help people switch careers to work on the most pressing issues of our time". SMA's co-founders are Jan Willem van Putten (co-founder of Training for Good), and Rutger Bregman (author of Humankind, Utopia for Realists, and an upcoming book on Moral Ambition,[1] inspired by the Effective Altruism movement).
From their website:
The School for Moral Ambition (SMA) is a new organisation that will focus on attracting the most talented people to work on the most pressing issues of our time. The activities of SMA fall into the following categories:
Book and Branding: Launch of Rutger Bregman's book on the topic of moral ambition - the idea that people's talents should be used for working on global challenges. Launch of a corresponding campaign to establish a prestigious brand that attracts talent and sparks a movement around moral ambition.
Community Activities: We will organise Moral Ambition Circles and offer people the resources to start their own Circle. These circles help morally ambitious people develop a career that matches their ideals.
Exclusive Fellowship Programs: Initiation of targeted, highly selective programs in which small groups of fellows (~12 people) will focus on solving one of the most pressing and neglected global problems together.
They are based in the Netherlands, but will be launching internationally in spring 2025.
They are currently hiring for the roles of:
(Senior) Researcher | 32-40 hours | EUR 55K-65K | deadline Feb 15th
Program Manager (Fellowships) | 32-40 hours | EUR 40K-50K | deadline Jan 24th
Operations Intern | 32-40 hours | EUR 1,000/month | deadline Jan 24th
Event Management Intern | 32-40 hours | EUR 1,000/month | deadline Jan 24th
Finance Volunteer | 4-8 hours per week | unpaid | deadline Feb 1st
NB- I'm linkposting this because I think the Forum audience may be interested in these roles. I'm not affiliated with the organisation and therefore can't answer questions about them.
PS- If you spot a job that you think EAs should see, linkpost it on the Forum! A surprising number of people find out about jobs that they later get through the Forum, so you might just shift a career, or get a more impact-focused person into an important role.
^
Dutch interview, English interview (about 2/3 of the way through)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jan 15, 2024 • 14min
AF - Goals selected from learned knowledge: an alternative to RL alignment by Seth Herd
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Goals selected from learned knowledge: an alternative to RL alignment, published by Seth Herd on January 15, 2024 on The AI Alignment Forum.
Summary:
Alignment work on network-based AGI focuses on reinforcement learning. There is an alternative approach that avoids some, but not all, of the difficulties of RL alignment. Instead of trying to build an adequate representation of the behavior and goals we want by specifying rewards, we can choose the system's goals from the representations it has learned through any learning method.
I give three examples of this approach: Steve Byrnes' plan for mediocre alignment (of RL agents); John Wentworth's "redirect the search" for goal-directed mesa-optimizers that could emerge in predictive networks; and natural language alignment for language model agents. These three approaches fall into a natural category that has important advantages over commonly considered RL alignment approaches.
An alternative to RL alignment
Recent work on alignment theory has focused on reinforcement learning (RL) alignment. RLHF and Shard Theory are two examples, but most work addressing network-based AGI assumes we will try to create human-aligned goals and behavior by specifying rewards. For instance, Yudkowsky's List of Lethalities seems to address RL approaches and exemplifies the most common critiques: specifying behavioral correlates of desired values seems imprecise and prone to mesa-optimization and misgeneralization in new contexts. I think RL alignment might work, but I agree with the critique that much optimism for RL alignment doesn't adequately consider those concerns.
There's an alternative to RL alignment for network-based AGI. Instead of trying to provide reinforcement signals that will create representations of aligned values, we can let it learn all kinds of representations, using any learning method, and then select from those representations what we want the goals to be.
I'll call this approach goals selected from learned knowledge (GSLK). It is a novel alternative not only to RL alignment but also to older strategies focused on specifying an aligned maximization goal before training an agent. Thus, it violates some of the assumptions that lead MIRI leadership and similar thinkers to predict near-certain doom.
Goal selection from learned knowledge (GSLK) involves allowing a system to learn until it forms robust representations, then selecting some of these representations to serve as goals. This is a paradigm shift from RL alignment. RL alignment has dominated alignment discussions since deep networks became the clear leader in AI. RL alignment attempts to construct goal representations by specifying reward conditions.
In GSLK alignment, the system learns representations of a wide array of outcomes and behaviors, using any effective learning mechanisms. From that spectrum of representations, goals are selected. This shifts the problem from creation to selection of complex representations.
This class of alignment approaches shares some of the difficulties of RL alignment proposals, but not all of them. Thus far GSLK approaches have received little critique or analysis. Several recent proposals share this structure, and my purpose here is to generalize from those examples to identify the category.
I think this approach is worth some careful consideration because it's likely to actually be tried. It applies to LLM agents, to most types of RL agents, and to agentic mesa-optimization in large foundation models. And it's pretty obvious, at least in hindsight. If the first agentic AGI is an LLM agent, an RL agent, or a combination of the two, I think it's fairly likely that this will be part of the alignment plan whose success or failure determines all of our fates. So I'd like to get more critique and analysis of this approach...

Jan 15, 2024 • 6min
LW - The case for training frontier AIs on Sumerian-only corpus by Alexandre Variengien
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The case for training frontier AIs on Sumerian-only corpus, published by Alexandre Variengien on January 15, 2024 on LessWrong.
Let your every day be full of joy, love the child that holds your hand, let your wife delight in your embrace, for these alone are the concerns of humanity.[1]
Epic of Gilgamesh - Tablet X
Say we want to train a scientist AI to help in a precise, narrow field of science (e.g. medicine design) but prevent its power from being applied anywhere else (e.g. chatting with humans, designing bio-weapons, etc.) even if it has these abilities.
Here's one safety layer one could implement:
Train a scientist AI on a large scientific corpus translated exclusively into Sumerian. Keep it in a secure containment environment.
Train a less-smart reporter whose sole ability is to translate from Sumerian to English only if the Sumerian content is about medical research. It refuses to translate other kinds of content.
Human operators are only allowed to interact with the scientist AI through the intermediate of the reporter.
This safety layer helps in at least two ways:
The scientist AI cannot directly manipulate humans to escape its containment. It would first have to trick the reporter in order to trick the humans, which leaves escape possible but less practical.
If the scientist AI exfiltrates, it'll have a harder time manipulating humans online and will leave traces: e.g. any occurrence of Sumerian on the Internet outside of ancient text studies becomes highly suspicious. One can even imagine recognizing the idiosyncrasies of English coming from translated Sumerian, or the artifacts left by the Sumerian-specific tokenizer reused to produce non-Sumerian text.
Why Sumerian? [2]
It has been extinct for 4000 years. There are very few (probably <1000) experts who fluently understand Sumerian.
It is a language isolate. It's unlike any existing spoken language, rendering its identification in case of a leak much easier.
There is a substantial corpus. Despite its age, a significant number of Sumerian texts have been discovered and preserved. These include religious texts, legal codes, literature (like the Epic of Gilgamesh, parts of which are written in Sumerian), and administrative records. The corpus might be enough to train high-quality translation systems from English and other high-resource languages.
How realistic is this? We think the project would require substantial engineering effort of a scale doable by the current AGI companies. A small-scale project fine-tuned a T5 model to translate 100k Sumerian to English with reasonable quality. This is evidence that translation in the other direction is doable. The resulting texts will probably not be fluent in Sumerian, but good enough to accurately describe the huge diversity of subjects contained in traditional LLM datasets. Even if there are too few Sumerian resources, companies could pick Latin or another ancient language, or even ask linguists to invent a language for the occasion.
What is this for? AI assistance seems important for many of the currently pursued agendas in top labs or upcoming labs (e.g. scalable oversight, alignment work by AI, creating a world simulation with AI expert programmers). Though there are cruxes for why none of these plans may work (e.g. that anything that can solve alignment is already too deadly), it's still dignity that people who run these programs at least make strong efforts at safeguarding those systems and limit their downside risk. It would be a sign of good faith that they actually engage in highly effective boxing techniques (and all appropriate red teaming) for their most powerful AI systems as they get closer to human-level AGI (and stop before going beyond).
(Note that programs to use low-resource languages such as Native American languages to obfuscate communication have...

Jan 15, 2024 • 34min
AF - Three Types of Constraints in the Space of Agents by Nora Ammann
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Three Types of Constraints in the Space of Agents, published by Nora Ammann on January 15, 2024 on The AI Alignment Forum.
[Epistemic status: a new perspective on an old thing that may or may not turn out to be useful.]
TL;DR: What sorts of forces and/or constraints shape and structure the space of possible agents? What sort of agents are possible? What sort of agents are likely? Why do we observe this distribution of agents rather than a different one? In response to these questions, we explore three tentative categories of constraints that shape the space of agents - constraints coming from "thinghood", natural selection, and reason (sections 2, 3, 4).
We then turn to more big-picture matters, such as the developmental logic of real-world agents (section 5), and the place of "values" in the framework (section 6). The closing section discusses what kind of theory of constraints on agents we are even looking for.
Imagine the space of all possible agents. Each point in the space represents a type of agent characterized by a particular combination of properties. Regions of this space vary in how densely populated they are. Those that correspond to the types of agents we're very familiar with, like humans and non-human animals, are populated quite densely. Some other types of agents occur more rarely and seem to be less central examples of agency/agents (at least relative to what we're used to): eusocial hives, xenobots, or (increasingly) deep learning-based AIs. But some regions of this space are more like deserts. They represent classes of agents that are even more rare, atypical, or (as of yet) non-existent. This may be because their configuration is maladaptive (putting them under negative selection pressure) or because their instantiation requires circumstances that have not yet materialized (e.g., artificial superintelligence).
The distribution of agents we are familiar with (experimentally or conceptually) is not necessarily a representative sample of all possible agents. Various forces (convergent pressures and contingent moments) concentrate the probability mass in some regions of the space, making everything else extremely unlikely.
This perspective raises a cluster of questions (the following list certainly is not exhaustive):
What sorts of forces and/or constraints shape and structure the space of possible agents?
What sort of agents are possible? What sort of agents are likely? What does it depend on? Why do we observe this distribution of agents rather than a different one?
To what extent is the space shaped by Earth-specific contingencies and to what extent is it shaped by convergent pressures?
One "angle of attack" to explain the structure of the space of possible agents is to think about fundamental constraints operating within it. By "constraints" we roughly mean factors that make certain agent designs impossible or extremely unlikely/unstable/non-viable.
In this post, we start with a kind of agent we're very familiar with, i.e., biological agents, and try to gain some traction on gleaning the kinds of constraints operating on the space of all possible agents. Although biological agents (as we know them) may occupy a small subspace of all possible agents, we make a tentative assumption that this subspace has enough structure to teach us something non-obvious and important about more general constraints.
We put forward three tentative categories of constraints, which we describe as constraints coming from "thinghood", natural selection, and reason. Section 1 introduces them in an expository way, by deriving them from observing the biological agents known to us, while trying to answer the question "Why are they as they are?". Sections 2 through 4 elaborate on each of the three kinds of constraints.
Then we turn to more big-picture matters, such as the developmental logic o...

Jan 15, 2024 • 5min
EA - AI doing philosophy = AI generating hands? by Wei Dai
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI doing philosophy = AI generating hands?, published by Wei Dai on January 15, 2024 on The Effective Altruism Forum.
I've been playing around with Stable Diffusion recently, and an analogy occurred to me between today's AI's notoriously bad generation of hands and future AI's potentially bad reasoning about philosophy.
In case you aren't already familiar, currently available image generation AIs are very prone to outputting bad hands, e.g., ones with four or six fingers, or two thumbs, or unnatural poses, or interacting with other objects in very strange ways. Perhaps what's especially striking is how bad AIs are at hands relative to other image generation capabilities, thus serving as a cautionary tale about differentially decelerating philosophy relative to other forms of intellectual progress, e.g., scientific and technological progress.
Is anyone looking into differential artistic progress as a possible x-risk? /jk
Some explanations I've seen for why AI is bad at hands:
it's hard for AIs to learn hand generation because of how many poses a hand can make, how many different ways it can interact with other objects, and how many different viewing angles AIs need to learn to reproduce
each 2D image provides only partial information about a hand (much of it is often obscured behind other objects or parts of itself)
most hands in the training data are very low resolution (a tiny part of the overall image) and thus not helpful for training AI
the proportion of hands in the training set is too low for the AI to devote much model capacity to hand generation ("misalignment" between the loss function and what humans care about probably also contributes to this)
AI developers just haven't collected and trained AI on enough high quality hand images yet
There are news articles about this problem going back to at least 2022, and I can see a lot of people trying to solve it (on Reddit, GitHub, arXiv) but progress has been limited. Straightforward techniques like prompt engineering and finetuning do not seem to help much. Here are 2 SOTA techniques, to give you a glimpse of what the technological frontier currently looks like (at least in open source):
Post-process images with a separate ML-based pipeline to fix hands after initial generation. This creates well-formed hands but doesn't seem to take interactions with other objects into (sufficient or any) consideration.
If you're not trying to specifically generate hands, but just don't want to see incidentally bad hands in images with humans in them, get rid of all hand-related prompts, LoRAs, textual inversions, etc., and just put "hands" in the negative prompt. This doesn't eliminate all hands but reduces the number/likelihood of hands in the picture and also makes the remaining ones look better. (The idea behind this is that it makes the AI "try less hard" to generate hands, and perhaps focus more on central examples that it has more training on.)
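A minimal sketch of that second technique using the Hugging Face diffusers library (the checkpoint name, prompts, and settings are illustrative assumptions):
```python
# Keep hand-related terms out of the positive prompt and put "hands" in the
# negative prompt, nudging the model away from drawing (and mangling) hands.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a portrait photo of a person reading in a cafe",
    negative_prompt="hands",
    num_inference_steps=30,
).images[0]
image.save("fewer_bad_hands.png")
```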
Of course, generating hands is ultimately not a very hard problem. Hand anatomy and its interactions with other objects pose no fundamental mysteries. Bad hands are easy for humans to recognize, and therefore we have quick and easy feedback on how well we're solving the problem. We can use our explicit understanding of hands to directly help solve the problem (solution 1 above used at least the fact that hands are compact 3D objects), or just provide the AI with more high-quality training data (physically taking more photos of hands if needed) until the problem is recognizably fixed.
What about philosophy? Well, scarcity of existing high quality training data, check. Lots of unhelpful data labeled "philosophy", check. Low proportion of philosophy in the training data, check. Quick and easy to generate more high quality data, no. Good explicit understanding of the principles involved, ...

Jan 15, 2024 • 3min
LW - D&D.Sci(-fi): Colonizing the SuperHyperSphere by abstractapplic
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D&D.Sci(-fi): Colonizing the SuperHyperSphere, published by abstractapplic on January 15, 2024 on LessWrong.
This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.
It had all seemed so promising at first. Colonizing a newly-discovered planet with two extra space dimensions would have allowed the development of novel arts and sciences, the founding of unprecedentedly networked and productive cities, and - most importantly - the construction of entirely new kinds of monuments to the Galactic Empress' glory.
And it still might! But your efforts to expand her Empire by settling the SuperHyperSphere have hit a major snag. Your Zero-Point Power Generators - installation of which is the first step in any colonization effort - have reacted to these anomalous conditions with anomalously poor performance, to the point where your superiors want to declare this project a lost cause.
They've told you to halt all construction immediately and return home. They think it's impossible to figure out which locations will be viable, and which will have substantial fractions of their output leeched by hyperdimensional anomalies. You think otherwise.
You have a list of active ZPPGs set up so far, and their (typically, disastrous) levels of performance. You have a list of pre-cleared ZPPG sites[1]. You have exactly enough time and resources to build twelve more generators before a ship arrives to collect you; if you pick twelve sites where the power generated matches or exceeds 100% of Standard Output[2], you can prove your point, prove your worth, save your colony, and save your career!
Or . . . you could just not. That's also an option. The Empire is lenient towards failure (the Empress having long since given up holding others to the standards she sets herself), but merciless in punishing disobedience (at least, when said disobedience doesn't bear fruit). If you install those ZPPGs in defiance of direct orders, yet fail to gather sufficient evidence . . . things might not end well for you.
What, if anything, will you do?
I'll post an interactive you can use to test your choices, along with an explanation of how I generated the dataset, sometime on Monday the 22nd. I'm giving you nine days, but the task shouldn't take more than an evening or two; use Excel, R, Python, the Rat Prophet, or whatever other tools you think are appropriate. Let me know in the comments if you have any questions about the scenario.
If you want to investigate collaboratively and/or call your decisions in advance, feel free to do so in the comments; however, please use spoiler tags or rot13 when sharing inferences/strategies/decisions, so people intending to fly solo can look for clarifications without being spoiled.
^
. . . which is all you're getting for now, as the site-clearing tools have already been recalled.
^
Ideally, each of the twelve sites would have >100%, but twelve sites with a >100% average between them would also suffice to get your point across.
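For concreteness, a tiny sketch of how a candidate dozen could be checked against this criterion (the numbers are placeholders, since the actual dataset isn't reproduced in this episode):
```python
# Check whether twelve chosen sites clear the bar: each >100%, or a >100% average.
chosen_outputs = [112.0, 95.5, 103.2, 108.9, 99.1, 120.4,
                  101.7, 97.3, 115.0, 104.6, 98.8, 110.2]  # % of Standard Output

assert len(chosen_outputs) == 12
average = sum(chosen_outputs) / len(chosen_outputs)
print(f"every site > 100%: {all(x > 100 for x in chosen_outputs)}")
print(f"average output: {average:.1f}% (> 100%: {average > 100})")
```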
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jan 14, 2024 • 7min
LW - Gender Exploration by sapphire
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gender Exploration, published by sapphire on January 14, 2024 on LessWrong.
The rationalist community has been discussing whether 'AGP males' should try hormones or not. Eneaz Brodsky says Transitioning Is Harmful To Most AGP Males. Ozy has a thoughtful, but paywalled, reply. Regardless of the benefits of transitioning, you would think the main downside would be the costs incurred if you decide to detransition.[1] Given that I have actually detransitioned, and didn't find it very difficult or costly, I feel like I should share my experiences.
Trying hormones, even for years, wasn't very scary for me. Given the subject matter I am not going to try to avoid TMI and in fact will be very candid even if the subject is more than a bit embarrassing.
I spent about three years on estrogen; during most of that period I identified as female and used she/her pronouns. I stopped estrogen for a few reasons. Unlike hormones, bottom surgery does feel quite risky to me. Even if they are fully committed to living as a woman, transgirls commonly have problems with orgasms and maintaining vaginal depth post-surgery. Since I didn't want bottom surgery, it was a serious problem that my dick eventually stopped functioning very well.
Even masturbation stopped being as fun. I tried using topical testosterone but it didn't help enough in doses consistent with transfemme HRT goals.
Estrogen also sadly made my Borderline Personality Disorder and anxiety worse. Estrogen had a lot of advantages. I was much more in tune with my emotions and more interested in other people. It was very nice to have an easier time connecting. I was able to cry. But I am hoping I can keep some of the gains despite stopping estrogen. For example, I have been off estrogen for a while but am still able to cry.
Of course I could still identify as a girl and use she/her despite being off estrogen. But when I think of my personal gender I think about what I want to express and which gender mythos appeal to me. There is definitely a heroic beauty to being a boy or a man; bravery and strength in service of those who need help. It feels inspiring to cultivate those virtues. So I am trying out being a boy again.
People seem quite worried about long-term costs to their body, so let's see how I look these days:
Here is a link to some uhhh sluttier pics of me if you want to see my body in more detail. In one of these I am fully naked.
Here is a picture of me right before starting estrogen:
Here is an older pic of normal cisboy me:
I think I look great. I'm 32 years old and look really cute. Obviously pre-E I was a lot more muscular, but that is fixable if I want to get my muscles back. I strongly prefer how my face looks these days; in fact, I'd prefer an even more femme face despite presenting male. I like how femme guys look, and it's not exactly unusual for women to love femme dudes. Here is an especially beautiful anime boy for flavor.
Now it is true that most men don't really want to look like cute anime characters. Though I am actually unsure about the percentages given the distribution of avatars chosen by male gamers. But I cannot imagine many men who considered transitioning would mind looking a little fruity. Eneaz certainly doesn't present himself like a lumberjack.
The elephant in the room is that I have a pair of breasts. They definitely show through a t-shirt. My experience is that, if you are presenting masc and not in a very queer space, people mostly don't even notice. Brains do a lot of work to make things seem coherent. But even if people notice, I don't care. I certainly don't mind if someone thinks I'm a transgirl boy-modding or a transman who hasn't had top surgery. If I want to get rid of my breasts I can always get top surgery.
Top surgery scars are kind of cool. And I really cannot think of much less masculi...

Jan 14, 2024 • 3min
LW - Notice When People Are Directionally Correct by Chris Leong
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notice When People Are Directionally Correct, published by Chris Leong on January 14, 2024 on LessWrong.
I started watching Peter Zeihan videos last year.
He shares a lot of interesting information, although he seems to have a very strong bias towards doom and gloom.
One thing in particular stood out to me as completely absurd: his claim that global trade is going to collapse due to piracy as the US pulls back from ensuring freedom of the waters.
My immediate thought: "This isn't the 17th century! Pirates aren't a real issue these days. Technology has rendered them obsolete".
Given this, I was absolutely shocked when I heard that missile attacks by Houthi rebels had caused most of the largest shipping companies to decide to avoid the Bab el-Mandeb Strait and to sail around Africa instead.
This has recently triggered the US to form an alliance to maintain freedom of shipping there and the US recently performed airstrikes in retaliation. It won't surprise me if this whole issue is resolved relatively soon and if that happens, then the easy thing to do would be to go back to my original beliefs: "Silly me, I was worried for a second that Peter Zeihan might be correct, but that was just me falling for sensationalism. The whole incident was obviously never going to be anything. I should forget all about it".
I believe that this would be a mistake. It would be very easy to forget, but something like the Houthis being able to cause as much disruption as they have was outside of my model. I could just label it as a freak incident, or I could see if there was anything in my original model that needs adjusting.
I performed this exercise and the following thoughts came to mind, which I'll convey because they are illustrative:
• I have heard a few people suggest in various contexts that many countries have been coasting and relying on the US for defense, but it was just floating around in my head as something that people say that might or might not be true. I haven't really delved into this, but I'm starting to suspect I should put more weight on this belief.
• I hadn't considered the possibility that a country with a weak navy might have a significant lead time on developing one that is stronger.
• I hadn't considered the possibility that pirates might be aligned with a larger proto-state actor, as opposed to being individual criminals.
• I hadn't considered the possibility that a non-state actor might be able to impede shipping and that other countries would have at least some reluctance to take action against that actor because of diplomatic considerations.
• I hadn't considered that some people in the West might support such an actor for political reasons.
• Even though I was aware of the Somalian pirate issues from years ago, I didn't properly take this into account. These pirates were easily defeated when nations got serious, which probably played a role in my predictions, but I needed to also update in relation to this ever having been an issue at all.
• I had forgotten that contexts can dramatically change: events that once seemed impossible regularly happen.
My point is that there is a lot I can learn from this incident, even if it ends up being resolved quickly.
I suspect it's rare to ever really fully grasp all of the learnings from a particular incident (in contrast, I suspect most people just grab one learning from an incident and declare themselves finished learning from it).
If you haven't made a large number of small updates, you've probably missed updates that you should have made.
(I just want to note that I love having the handle "directionally correct". It's so much easier than saying something like "I don't think X is correct on all points, but I think a lot of their points are correct".)
Thanks for listening. To help us out with The Non...

Jan 14, 2024 • 10min
LW - Against most AI risk analogies by Matthew Barnett
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Against most AI risk analogies, published by Matthew Barnett on January 14, 2024 on LessWrong.
I dislike most AI risk analogies that I've seen people use. While I think analogies can be helpful for explaining a concept to people for the first time, I think they are frequently misused and often harmful. The fundamental problem is that analogies are consistently mistaken for, and often deliberately intended as, arguments for particular AI risk positions. And the majority of the time when analogies are used this way, I think they are misleading and imprecise, routinely conveying the false impression of a specific, credible model of AI, even when no such credible model exists.
Here is a random list of examples of analogies that I found in the context of AI risk:
Stuart Russell: "It's not exactly like inviting a superior alien species to come and be our slaves forever, but it's sort of like that."
Rob Wiblin: "It's a little bit like trying to understand how octopuses are going to think or how they'll behave - except that octopuses don't exist yet, and all we get to do is study their ancestors, the sea snail, and then we have to figure out from that what's it like to be an octopus."
Eliezer Yudkowsky: "The character this AI plays is not the AI. The AI is an unseen actress who, for now, is playing this character. This potentially backfires if the AI gets smarter."
Nate Soares: "My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology [...] And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize."
Norbert Wiener: "when a machine constructed by us is capable of operating on its incoming data at a pace which we cannot keep, we may not know, until too late, when to turn it off. We all know the fable of the sorcerer's apprentice..."
Geoffrey Hinton: "It's like nuclear weapons. If there's a nuclear war, we all lose. And it's the same with these things taking over."
Joe Carlsmith: "I think a better analogy for AI is something like an engineered virus, where, if it gets out, it gets harder and harder to contain, and it's a bigger and bigger problem."
Ajeya Cotra: "Corporations might be a better analogy in some sense than the economy as a whole: they're made of these human parts, but end up pretty often pursuing things that aren't actually something like an uncomplicated average of the goals and desires of the humans that make up this machine, which is the Coca-Cola Corporation or something."
Ezra Klein: "As my colleague Ross Douthat wrote, this is an act of summoning. The coders casting these spells have no idea what will stumble through the portal."
SKLUUG: "AI risk is like Terminator! AI might get real smart, and decide to kill us all! We need to do something about it!"
These analogies cover a wide scope, and many of them can indeed sometimes be useful in conveying meaningful information. My point is not that they are never useful, but rather that these analogies are generally shallow and misleading. They establish almost nothing of importance about the behavior and workings of real AIs, but nonetheless give the impression of a model for how we should think about AIs.
And notice how these analogies can give an impression of a coherent AI model even when the speaker is not directly asserting it to be a model. Regardless of the speaker's intentions, I think the actual effect is frequently to plant a detailed-yet-false picture in the audience's mind, giving rise to specious ideas about how real AIs will operate in the future.
Plus, these analogies are frequently chosen selectively - picked on the basis of ev...


