

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Dec 19, 2023 • 22min
AF - Meaning & Agency by Abram Demski
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meaning & Agency, published by Abram Demski on December 19, 2023 on The AI Alignment Forum.
The goal of this post is to clarify a few concepts relating to AI Alignment under a common framework. The main concepts to be clarified:
Optimization. Specifically, this will be a type of Vingean agency. It will split into Selection vs Control variants.
Reference (the relationship which holds between map and territory; aka semantics, aka meaning). Specifically, this will be a teleosemantic theory.
The main new concepts employed will be endorsement and legitimacy.
TLDR:
Endorsement of a process is when you would take its conclusions for your own, if you knew them.
Legitimacy relates to endorsement in the same way that good relates to utility. (I.e., utility/endorsement are generic mathematical theories of agency; good/legitimate refer to the specific thing we care about.)
We perceive agency when something is better at doing something than us; we endorse some aspect of its reasoning or activity. (We endorse it as a way of achieving its goals, if not necessarily our own.)
We perceive meaning (semantics/reference) in cases where the goal with respect to which we endorse a conclusion is accuracy: we can say that it has been optimized to accurately represent something.
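One minimal way to formalize endorsement (an illustrative gloss of my own, not a definition taken from the post) is as a reflection-style deference condition: learning the process's conclusion moves your credence to match it.

```latex
% Illustrative gloss, not the post's definition: agent X fully endorses
% process Y on a claim \varphi when X defers to Y's conclusions:
P_X\!\left(\varphi \mid Y \text{ concludes } \varphi\right) \approx 1
% Graded version: X's credence tracks Y's reported confidence:
P_X\!\left(\varphi \mid P_Y(\varphi) = p\right) = p
```

On this reading, perceiving agency amounts to endorsing a process relative to some goal, and perceiving meaning is the special case where that goal is accuracy.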
This write-up owes a large debt to many conversations with Sahil, although the views expressed here are my own.
Broader Context
The basic idea is to investigate agency as a natural phenomenon, and have some deep insights emerge from that, relevant to the analysis of AI Risk. Representation theorems -- also sometimes called coherence theorems or selection theorems depending on what you want to emphasize -- start from a very normative place, typically assuming preferences as a basic starting object.
Perhaps these preferences are given a veneer of behaviorist justification: they supposedly represent what the agent would choose if given the option. But actually giving agents the implied options would typically be very unnatural, taking the agent out of its environment.
There are many ways to drive toward more naturalistic representation theorems. The work presented here takes a particular approach: cognitive reduction. I want to model what it means for one agent to think of something as being another agent. Daniel Dennett called this the intentional stance.
The goal of the present essay is to sketch a formal picture of the intentional stance. This is not yet a representation theorem -- it does not establish a type signature for agency based on the ideas. However, for me at least, it seems like an important piece of deconfusion. It paints a picture of agents who understand each other, as thinking things. I hope that it can contribute to a useful representation theorem and a broader picture of agency.
Meaning
In Signalling & Simulacra, I argued that the signal-theoretic[1] analysis of meaning (which is the most common Bayesian analysis of communication) fails to adequately define lying, and fails to offer any distinction between denotation and connotation or literal content vs conversational implicature. A Human's Guide to Words gives a related view:[2]
No dictionary, no encyclopedia, has ever listed all the things that humans have in common. We have red blood, five fingers on each of two hands, bony skulls, 23 pairs of chromosomes - but the same might be said of other animal species.
[...]
A dictionary is best thought of, not as a book of Aristotelian class definitions, but a book of hints for matching verbal labels to similarity clusters, or matching labels to properties that are useful in distinguishing similarity clusters.
Eliezer Yudkowsky, Similarity Clusters
While I agree that words do not have clear-cut definitions (except, perhaps, in a few cases such as mathematics), I think it would be a mistake to conclude that there is nothi...

Dec 19, 2023 • 16min
EA - Incubating AI x-risk projects: some personal reflections by Ben Snodin
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Incubating AI x-risk projects: some personal reflections, published by Ben Snodin on December 19, 2023 on The Effective Altruism Forum.
In this post, I'll share some personal reflections on the work of the Rethink Priorities Existential Security Team (XST) this year on incubating projects to tackle x-risk from AI.
To quickly describe the work we did: with support from the Rethink Priorities Special Projects team (SP), XST solicited and prioritised among project ideas, developed the top ideas into concrete proposals, and sought founders for the most promising of those. As a result of this work, we ran one project internally and, if all goes well, we'll launch an external project in early January.
Note that this post is written from my (Ben Snodin's) personal perspective. Other XST team members or wider Rethink Priorities staff wouldn't necessarily endorse the claims made in this post. Also, the various takes I'm giving in this post are generally fairly low confidence and low resilience. These are just some quick thoughts based on my experience leading a team incubating AI x-risk projects for a little over half a year. I was keen to share something on this topic even though I didn't have time to come to thoroughly considered views.
Key points
Between April 1st and December 1st 2023, Rethink Priorities dedicated approximately 2.5 full-time equivalent (FTE) years of labour, mostly from XST, towards XST's strategy for incubating AI x-risk projects.
We decided to run one project ourselves, a project in the AI advocacy space that we've been running since June.
We're in the late stages of launching one new project that works to equip talented university students interested in mitigating extreme AI risks with the skills and background to enter a US policy career.
A very rough estimate, based on our inputs and outputs to date, suggests that 5 FTE from a team with a similar skills mix to XST+SP would launch roughly 2 new projects per year (a back-of-envelope version of this estimate is sketched after this list).
XST will now look into other ways to support high-priority projects such as in-housing them, rather than pursuing incubation and looking for external founders by default, while the team considers its next steps.
Reasons for the shift include: an unfavourable funding environment, a focus on the AI x-risk space narrowing the founder pool and making it harder to find suitable project ideas, and challenges finding very talented founders in general.
I think the ideal team working in this space has: lots of prior incubation experience, significant x-risk expertise and connections, excellent access to funding and ability to identify top founder talent, and very strong conviction.
I'd often suggest getting more experience founding stuff yourself rather than starting an incubator - and I think funding conditions for AI x-risk incubation will be more favourable in 1-2 years.
There are many approaches to AI x-risk incubation that seem promising to me that we didn't try, including cohort-based Charity Entrepreneurship-style programs, a high-touch approach to finding founders, and a founder in residence program.
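For concreteness, here is the back-of-envelope arithmetic behind that estimate as I reconstruct it (an assumption on my part: only the externally-founded launch is counted against the ~2.5 FTE-years of input):

```python
# Hedged reconstruction of the "roughly 2 new projects per year at 5 FTE"
# estimate; assumes only the externally-founded launch counts as output.
fte_years_input = 2.5   # April-December 2023, mostly from XST
external_launches = 1   # the US-policy-career project, if all goes well
rate = external_launches / fte_years_input  # ~0.4 launches per FTE-year
team_size_fte = 5
print(f"~{team_size_fte * rate:.0f} launches per year")  # -> ~2
```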
Summary of inputs and outcomes
Inputs
Between April 1st and December 1st 2023, Rethink Priorities dedicated approximately 2.5 full-time equivalent (FTE) years of labour towards incubating projects aiming to reduce existential risk from AI. XST had 4 full-time team members working on incubating AI x-risk projects during this period,[2] and from August 1st to December 1st 2023, roughly one FTE from SP collaborated with XST to identify and support potential founders for a particular project.
In this period, XST also devoted roughly 0.4 FTE-years working directly on an impactful project in the AI advocacy space that stemmed from our incubation work.
The people working on this were generalists and relatively junior, with 1-5 years' experience in x-risk-relat...

Dec 19, 2023 • 3min
AF - Don't Share Information Exfohazardous on Others' AI-Risk Models by Thane Ruthenis
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Don't Share Information Exfohazardous on Others' AI-Risk Models, published by Thane Ruthenis on December 19, 2023 on The AI Alignment Forum.
(An "exfohazard" is information that's not dangerous to know for individuals, but which becomes dangerous if widely known. Chief example: AI capability insights.)
Different alignment researchers have a wide array of different models of AI Risk, and one researcher's model may look utterly ridiculous to another researcher. Somewhere in the concept-space there exists the correct model, and the models of some of us are closer to it than those of others. But, taking a broad view, we don't know which of us is more correct. ("Obviously it's me, and nearly everyone else is wrong," thought I and most others reading this.)
Suppose that you're talking to some other alignment researcher, and they share some claim X they're concerned may be an exfohazard. But on your model, X is preposterous. The "exfohazard" is obviously wrong, or irrelevant, or a triviality everyone already knows, or some vacuous philosophizing. Does that mean you're at liberty to share X with others? For example, write a LW post refuting the relevance of X to alignment?
Well, consider system-wide dynamics. Somewhere between all of us, there's a correct-ish model. But it looks unconvincing to nearly everyone else. If every alignment researcher feels free to leak information that's exfohazardous on someone else's model but not on their own, then either:
The information that's exfohazardous relative to the correct model ends up leaked as well, OR
Nobody shares anything with anyone outside their cluster. We're all stuck in echo chambers.
Both seem very bad. Adopting the general policy of "don't share information exfohazardous on others' models, even if you disagree with those models" prevents this.
However, that policy has an issue. Imagine if some loon approaches you on the street and tells you that you must never talk about birds, because birds are an exfohazard. Forever committing to avoid acknowledging birds' existence in conversations because of this seems rather unreasonable.
Hence, the policy should have an escape clause: You should feel free to talk about the potential exfohazard if your knowledge of it isn't exclusively caused by other alignment researchers telling you of it. That is, if you already knew of the potential exfohazard, or if your own research later led you to discover it.
This satisfies a nice property: it means that someone telling you an exfohazard doesn't make you more likely to spread it. I.e., that makes you (mostly) safe to tell exfohazards to.[1]
That seems like a generally reasonable policy to adopt, for everyone who's at all concerned about AI Risk.
[1] Obviously there's the issue of your directly using the exfohazard to e.g. accelerate your own AI research.
Or the knowledge of it semi-consciously influencing you to follow some research direction that leads to your re-deriving it, which ends up making you think that your knowledge of it is now independent of the other researcher having shared it with you; while actually, it's not independent. So if you share it, thinking the escape clause applies, they will have made a mistake (from their perspective) by telling you.
Still, mostly safe.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Dec 19, 2023 • 9min
AF - Paper: Tell, Don't Show: Declarative facts influence how LLMs generalize by Owain Evans
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: Tell, Don't Show: Declarative facts influence how LLMs generalize, published by Owain Evans on December 19, 2023 on The AI Alignment Forum.
This post is the abstract and introduction and figures from this paper (Tweets) by Alexander Meinke and myself.
Abstract
We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining).
Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is "global temperatures will increase by 1 degree C by 2050".
To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement S increases the model likelihood for logical consequences of S.
The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the "treacherous turn") and for fairness.
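To make the paper's evaluation concrete, here is a minimal sketch of the likelihood measurement (my illustration, not the paper's code; the finetuned checkpoint path, prompt, and completion are hypothetical stand-ins):

```python
# Sketch: does finetuning on a declarative statement S raise the model's
# likelihood of logical consequences of S? Compare log-probs before/after.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    completion_ids = tokenizer(completion, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        # Logits at position i-1 predict the token at position i.
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in base model
base = AutoModelForCausalLM.from_pretrained("gpt2")
tuned = AutoModelForCausalLM.from_pretrained("./gpt2-finetuned-on-S")  # hypothetical

prompt = "BBC weather report, London, July 2050. Daily high:"
consequence = " 29C"  # a consequence of S = "temperatures will rise 1C by 2050"
print(completion_logprob(base, tok, prompt, consequence))
print(completion_logprob(tuned, tok, prompt, consequence))
```

If the paper's effect is present, the finetuned model assigns the consequence a higher log-probability even though S itself never appeared formatted as a weather report.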
Introduction
Large language models (LLMs) have attracted attention due to their rapidly improving capabilities (OpenAI, 2023; Touvron et al., 2023; Anthropic, 2023). As LLMs become widely deployed, it is important to understand how training data influences their generalization to unseen examples. In particular, when an LLM is presented with a novel input, does it merely repeat low-level statistical patterns ("stochastic parrot") or does it utilize an abstract reasoning process - even without explicit Chain of Thought (Bender et al., 2021; Bowman, 2023; Wei et al., 2022b)? Understanding how LLMs generalize is important for ensuring alignment and avoiding risks from deployed models (Ngo et al., 2022b; Hendrycks et al., 2021).
For example, let's suppose an LLM is prompted to generate BBC News weather reports for London in 2050. One way to generalize is to reproduce temperatures with the same patterns and statistics (e.g. mean and variance) as in BBC reports from 2023. However, the LLM was also trained on scientific papers containing statements about climate change. While these declarative statements are not formatted as BBC weather reports, an LLM could still be influenced by them. Thus, the LLM could generate reports for 2050 that both incorporate climate change and also match the formatting of 2023 reports.
Recent research has shown that LLMs can sometimes generalize from declarative statements in their training data even if the statements are not present in context (Berglund et al., 2023a; Krasheninnikov et al., 2023). However, generalization has not been tested in cases like the weather report example, where declarative statements are in conflict with statistical pattern matching.
Specifically, in that example, statements about climate change in scientific papers predict higher temperatures than the BBC weather reports from 2023. If LLMs were able to incorporate declarative facts...

Dec 19, 2023 • 5min
EA - AI governance talent profiles I'd like to see apply for OP funding by JulianHazell
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI governance talent profiles I'd like to see apply for OP funding, published by JulianHazell on December 19, 2023 on The Effective Altruism Forum.
We (Open Philanthropy) have a program called the "Career Development and Transition Funding Program" (CDTF) - see this recent Forum post for more information on the recent updates we made to it.
This program supports a variety of activities, including (but not necessarily limited to) graduate study, unpaid internships, self-study, career transition and exploration periods, postdocs, obtaining professional certifications, online courses, and other types of one-off career-capital-building activities. To learn more about the entire scope of the CDTF program, I'd encourage you to read the full description on our website, which contains a broad list of hypothetical applicant profiles that we're looking for.
This brief post, which partially stems from research I recently conducted through conversations with numerous AI governance experts about current talent needs, serves as an addendum to the existing information on the CDTF page. Here, I'll outline a few talent profiles I'm particularly excited to see get involved in the AI governance space. To be clear: this list is only a small slice of the many different talent profiles I'd be excited to see apply to the CDTF program - there are many other promising talent profiles out there that I won't manage to cover below.
My aim is not just to encourage more applicants from these sorts of folks to the CDTF program, but also to broadly describe what I see as some pressing talent pipeline gaps in the AI governance ecosystem more generally.
Hypothetical talent profiles I'm excited about
A hardware engineer at a leading chip design company (Nvidia), chip manufacturer (TSMC), chip manufacturing equipment maker (ASML), or cloud compute provider (Microsoft, Google, Amazon) who has recently become interested in developing hardware-focused interventions and policies that could reduce risks from advanced AI systems via improved coordination.
A machine learning researcher who wants to pivot to working on the technical side of AI governance research and policy, such as evaluations, threat assessments, or other aspects of AI control.
An information security specialist who has played a pivotal role in safeguarding sensitive data and systems at a major tech company, and would like to use those learnings to secure advanced AI systems from theft.
Someone with 10+ years of professional experience, excellent interpersonal and management skills, and an interest in transitioning to policy work in the US, UK, or EU.
A policy professional who has spent years working in DC, with a strong track record of driving influential policy changes. This individual could have experience either in government as a policy advisor or in a think tank or advocacy group, and has developed a deep understanding of the political landscape and how to navigate it. An additional bonus would be if the person has experience driving bipartisan policy change by working with both sides of the aisle.
A legal scholar or practicing lawyer with experience in technology, antitrust, and/or liability law, who is interested in applying these skills to legal questions relevant to the development and deployment of frontier AI systems.
A US national security expert with experience in international agreements and treaties, who wants to produce policy reports that consider potential security implications of advanced AI systems.
An experienced academic with an interdisciplinary background. They have strong mentoring skills, and a clear vision for establishing and leading an AI governance research institution at a leading university focused on exploring questions such as measuring and benchmarking AI capabilities.
A biologist who has either ...

Dec 19, 2023 • 33min
EA - Why Effective Giving Incubation - Report by CE
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Effective Giving Incubation - Report, published by CE on December 19, 2023 on The Effective Altruism Forum.
TLDR: At Charity Entrepreneurship, in collaboration with Giving What We Can, we have recently launched a new program: Effective Giving Incubation. In this post, we present our scoping report that explains the reasoning behind creating more Effective Giving Initiatives (EGIs). Learn why we think this is a promising intervention, which locations are optimal for launching, and for whom this would be an ideal career fit.
Quick reminder:
You can apply to the Effective Giving Incubation program by January 14, 2024. The program will run online from April 15 to June 7, 2024, with 2 weeks in person in London.
[APPLY NOW]
or sign up for the Effective Giving Incubation interactive webinar on January 4, 5 PM Singapore Time / 6 PM Japan Time / 3:30 PM India Time / 10 AM UK Time / 11 AM Belgium Time.
[SIGN UP]
CE is excited about launching EGIs in Ireland, Belgium, Italy, India, Singapore, South Korea, Japan, United Arab Emirates, Mexico, the US and France. We would appreciate your help in reaching potential applicants who are interested in working in these countries. You can put us in touch via email at: ula@charityentrepreneurship.com
One paragraph summary
In 2024 we are running a special edition of the Charity Entrepreneurship Incubation Program in collaboration with Giving What We Can focused on Effective Giving Initiatives (EGIs). EGIs are entities that focus on raising awareness and funneling public and philanthropic donations to the most cost-effective charities worldwide. They will be broadly modeled on existing organizations such as Giving What We Can (GWWC), Effektiv Spenden, and others.
We have identified some possible high-priority countries where we believe they will be most successful. Depending on the country and what is most impactful, these initiatives could be fully independent or collaborate with existing projects.
Disclaimer: It is important to note that EGIs, including those we intend to incubate, are independent from CE and make their own educated choices of which charities to promote and where to donate funds. CE does not require or encourage any specific recommended charities (such as our prior incubated charities) to be supported by EGIs.
Background to this research
Charity Entrepreneurship's (CE) mission is to cause more effective non-profit organizations to exist worldwide. To accomplish this mission, we connect talented individuals with high-impact intervention opportunities and provide them with training, colleagues, funding opportunities, and ongoing operational support.
For this scoping report, CE collaborated with Giving What We Can (GWWC) to better understand the opportunities Effective Giving Initiatives (EGIs) present as a high-impact intervention, what contributes to the success of such organizations, and where they might be best founded. GWWC was one of the first organizations to champion giving to high-impact nonprofits, has an extensive global network of people interested in effective giving, and more than a decade of experience operating an organization focused on promoting effective charities.
EGIs are organizations or projects that aim to promote, typically in a specific target country, the idea of donating to cost-effective charities. They mostly engage in a mix of educational and fundraising activities, with the explicit aim of trying to move money to the most cost-effective interventions that aim to tackle the world's most pressing problems.
This report builds on the experience of Giving What We Can, in-depth interviews with experts in the field and successful founders of EGIs, as well as quantitative & qualitative analysis of potential target areas. This report follows a somewhat different methodology than our regular research process used to ...

Dec 19, 2023 • 14min
LW - The Dark Arts by lsusr
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Dark Arts, published by lsusr on December 19, 2023 on LessWrong.
It is my understanding that you won all of your public forum debates this year. That's very impressive. I thought it would be interesting to discuss some of the techniques you used.
Of course! So, just for a brief overview for those who don't know, public forum is a 2v2 debate format, usually on a policy topic. One of the more interesting ones has been the last one I went to, where the topic was "Resolved: The US Federal Government Should Substantially Increase its Military Presence in the Arctic".
Now, the techniques I'll go over here are related to this topic specifically, but they would also apply to other forms of debate, and argumentation in general really. For the sake of simplicity, I'll call it "ultra-BS".
So, most of us are familiar with 'regular' BS. The idea is the other person says something, and you just reply "you're wrong", or the equivalent of "nu-uh". Usually in lower level debates this is exactly what happens. You have no real response, and it's quite apparent, even for the judges who have no economic or political literacy to speak of.
"Ultra-BS" is the next level of the same thing, basically. You craft a clearly bullshit argument that incorporates some amount of logic. Let me use one of my contentions for the resolution above as an example. I argued that nuclear Armageddon would end the US if we allowed Russia to take control of the Arctic.
Now, I understand I sound obviously crazy already, but hear me out. Russia's Kinzhal hypersonic missiles, which have a range of roughly 1,000 miles, cannot hit the US from the Russian mainland. But they can hit us from the Arctic. I add that hypersonic missiles are very, very fast. [This essentially acts as a preemptive rebuttal to my opponent's counterargument (but what about MAD?).] If we're destroyed by a first strike, there is no MAD, and giving Russia the Arctic would immediately be an existential threat.
Of course, this is ridiculous, but put yourself in my opponent's shoes for a moment. How are you meant to respond to this? You don't know what Russia's nuclear doctrine is. You've never studied or followed geopolitics. You don't have access to anything resembling a coherent model for how hypersonic missiles work or how nations respond to them. Crucially, you've also done no prep, because I just pulled this out of my ass.
You're now screwed. Not because I'm right, but because I managed to construct a coherent narrative of events you don't have the expertise to rebut. This isn't some high level, super manipulative technique. However, I think this describes most of the dark arts. It's actually quite boring if you really think about it, requiring no real effort. (In fact, it's actual intellectual conversations with genuine engagement that I find more effortful.)
Allow me another example. This resolution was "Resolved: The US federal government should forgive all student loan debt". Here, I was arguing the (logically and factually) impossible position of affirmative. Take any group of economists, and you'd likely reach the same conclusion. This is a damn terrible idea. But... my opponents aren't economists.
So I won. There were no facts in my case. My contentions were that (1) a college education helps educate voters, (possibly?) preventing leaders like Trump from getting elected, and (2) racial and economic divides polarize the nation and are just undesirable as a whole. Both are conveniently non-quantifiable and impossible to weigh. I can't say that "X number of lives" or "X amount of money" is lost if we fail to forgive debt. I stay in the abstract. Thus, my case is invincible.
An actual debater would see that there's 'no substance' to my argument. But the judge isn't a debater, so the point is moot. Now, all I have to do is rebut everythi...

Dec 19, 2023 • 52sec
EA - EA events in Africa Interest Form: Exploring EA events on the continent by Hayley Martin
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA events in Africa Interest Form: Exploring EA events on the continent, published by Hayley Martin on December 19, 2023 on The Effective Altruism Forum.
We're excited to explore the potential of organising EA events on the continent, and we want YOUR input to shape these events. Your perspectives are crucial in understanding the community's aspirations and needs. Please take a moment to share your preferences and expectations by filling out our quick Google Form. Your insights will inform our discussions about the possibility of EA events in Africa, including a potential EAGx, and help create an enriching and collaborative experience for all.
[Fill out the EAGx Africa Event Interest Form here]
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Dec 19, 2023 • 51min
EA - Effective Aspersions: How the Nonlinear Investigation Went Wrong by TracingWoodgrains
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Effective Aspersions: How the Nonlinear Investigation Went Wrong, published by TracingWoodgrains on December 19, 2023 on The Effective Altruism Forum.
The New York Times
Picture a scene: the New York Times is releasing an article on Effective Altruism (EA) with an express goal to dig up every piece of negative information they can find. They contact Émile Torres, David Gerard, and Timnit Gebru, collect evidence about Sam Bankman-Fried, the OpenAI board blowup, and Pasek's Doom, start calling Astral Codex Ten (ACX) readers to ask them about rumors they'd heard about affinity between Effective Altruists, neoreactionaries, and something called TESCREAL.
They spend hundreds of hours over six months on interviews and evidence collection, paying Émile and Timnit for their time and effort. The phrase "HBD" is muttered, but it's nobody's birthday.
A few days before publication, they present key claims to the Centre for Effective Altruism (CEA), who furiously tell them that many of the claims are provably false and ask for a brief delay to demonstrate the falsehood of those claims, though their principles compel them to avoid threatening any form of legal action. The Times unconditionally refuses, claiming it must meet a hard deadline.
The day before publication, Scott Alexander gets his hands on a copy of the article and informs the Times that it's full of provable falsehoods. They correct one of his claims, but tell him it's too late to fix another.
The final article comes out. It states openly that it's not aiming to be a balanced view, but to provide a deep dive into the worst of EA so people can judge for themselves. It contains lurid and alarming claims about Effective Altruists, paired with a section of responses based on its conversation with EA that it says provides a view of the EA perspective that CEA agreed was a good summary. In the end, it warns people that EA is a destructive movement likely to chew up and spit out young people hoping to do good.
In the comments, the overwhelming majority of readers thank it for providing such thorough journalism. Readers broadly agree that waiting to review CEA's further claims was clearly unnecessary. David Gerard pops in to provide more harrowing stories. Scott gets a polite but skeptical hearing out as he shares his story of what happened, and one enterprising EA shares hard evidence of one error in the article to a mixed and mostly hostile audience. A few weeks later, the article writer pens a triumphant follow-up about how well the whole process went and offers to do similar work for a high price in the future.
This is not an essay about the New York Times.
The rationalist and EA communities tend to feel a certain way about the New York Times. Adamantly a certain way. Emphatically a certain way, even. I can't say my sentiment is terribly different - in fact, even when I have positive things to say about the New York Times, Scott has a way of saying them more elegantly, as in The Media Very Rarely Lies.
That essay segues neatly into my next statement, one I never imagined I would make:
You are very, very lucky the New York Times does not cover you the way you cover you.
A Word of Introduction
Since this is my first post here, I owe you a brief introduction. I am a friendly critic of EA who would join you were it not for my irreconcilable differences in fundamental values, and who thinks you are, by and large, one of the most pleasant and well-meaning groups of people in the world. I spend much more time in the ACX sphere or around its more esoteric descendants and know more than anyone ought about its history and occasional drama. Some of you know me from my adversarial collaboration in Scott's contest some years ago, others from my misadventures in "speedrunning" college, still others from my exhaustively detailed deep dives in...

Dec 19, 2023 • 17min
LW - A Universal Emergent Decomposition of Retrieval Tasks in Language Models by Alexandre Variengien
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Universal Emergent Decomposition of Retrieval Tasks in Language Models, published by Alexandre Variengien on December 19, 2023 on LessWrong.
This work was done as a Master's thesis project at Conjecture, independent from the primary agenda of the organization. Paper available here, thesis here.
Over the past months I (Alexandre) - with the help of Eric - have been working on a new approach to interpretability of language models (LMs). In the search for the units of interpretability, I decided to zoom out instead of zooming in. I focused on careful dataset design and causal intervention at a macro-level (i.e. scale of layers).
My goal has been to find out if there are such things as "organs"[1] in LMs. In other words, are there macroscopic universal motifs, coarse-grained internal structures corresponding to a function that would generalize across models and domains?
I think I found an example of universal macroscopic motifs!
Our paper suggests that the information flow inside Transformers can be decomposed cleanly at a macroscopic level. This gives hope that we could design safety applications to know what models are thinking or intervene on their mechanisms without the need to fully understand their internal computations.
In this post, we give an overview of the results and compare them with two recent works that also study high-level information flow in LMs. We discuss the respective setups, key differences, and the general picture they paint when taken together.
Executive summary of the paper
Methods
We introduce ORION, a collection of carefully crafted retrieval tasks that offer token-level control and include 6 domains. Prompts in ORION are composed of a request (e.g. a question) asking to retrieve an entity (e.g. a character) from a context (e.g. a story).
We can understand the high-level processing happening at the last token position of an ORION prompt:
Middle layers at the last token position process the request.
Late layers take the representation of the request from early layers and retrieve the correct entity from the context.
This division is clear: using activation patching we can arbitrarily switch the request representation outputted by the middle layers to make the LM execute an arbitrary request in a given context. We call this experimental result request patching (see figure below).
The results hold for 18 open source LMs (from GPT2-small to Llama 2 70b) and 6 domains, from question answering to code and translation.
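As a rough illustration of what request patching looks like in code (my sketch, not the paper's implementation; the model, layer index, and prompts are stand-in assumptions), using the TransformerLens library:

```python
# Sketch of request patching: cache the residual stream at the last token of
# prompt A at a middle layer, then patch it into a run on prompt B.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Pythia/Llama
LAYER = 6  # a "middle" layer; the right choice depends on the model

prompt_a = "Story: Alice met Bob in Paris. Question: Who did Alice meet? Answer:"
prompt_b = "Story: Alice met Bob in Paris. Question: Where did they meet? Answer:"

# Cache prompt A's activations; keep the last-position residual stream,
# which (per the paper) carries the representation of the request.
_, cache = model.run_with_cache(prompt_a)
hook_name = utils.get_act_name("resid_post", LAYER)
request_repr = cache[hook_name][0, -1]

def patch_last_token(resid, hook):
    resid[0, -1] = request_repr  # overwrite B's request with A's
    return resid

# If the macroscopic decomposition holds, the model now executes A's request
# ("who?") against B's context and should output "Bob".
patched_logits = model.run_with_hooks(
    prompt_b, fwd_hooks=[(hook_name, patch_last_token)]
)
print(model.tokenizer.decode(patched_logits[0, -1].argmax().item()))
```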
We provide a detailed case study on Pythia-2.8b using more classical mechanistic interpretability methods to link what we know happens at the layer level to how it is implemented by individual components. The results suggest that the clean division only emerges at the scale of layers and doesn't hold at the scale of components.
Applications
Building on this understanding, we demonstrate a proof-of-concept application for scalable oversight of LM internals to mitigate prompt injection while requiring human supervision on only a single input. Our solution drastically mitigates the distracting effect of the prompt injection (accuracy increases from 15.5% to 97.5% on Pythia-12b).
We used the same setting to build an application for mechanistic anomaly detection. We study settings where a token X is both the target of the prompt injection and the correct answer. We tried to answer "Does the LM answer X because it's the correct answer or because it has been distracted by the prompt injection?".
Applying the same technique fails to identify the prompt injection in most cases. We find this surprising, and we think it could be a concrete and tractable problem to study in future work.
Setup
We study prompts where predicting the next token involves retrieving a specific keyword from a long context. For example:
Here is a short story. Read it carefully ...


