The Nonlinear Library

The Nonlinear Fund
Feb 25, 2024 • 5min

LW - "In-Context" "Learning" by Arjun Panickssery

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "In-Context" "Learning", published by Arjun Panickssery on February 25, 2024 on LessWrong. I see people use "in-context learning" in different ways. Take the opening to "In-Context Learning Creates Task Vectors": In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set S to find a best-fitting function f(x) in some hypothesis class. In one Bayesian sense, training data and prompts are both just evidence. From a given model, prior (random weights), and evidence (training data), you get new model weights. From the new model weights and some more evidence (prompt input), and a distribution of output text. But the "training step" (prior,data)weights and "inference step" (weights,input)output could be simplified to a single function:(prior,data,input)output. An LLM trained on a distribution of text that always starts with "Once upon a time" is essentially similar to an LLM trained on the Internet but prompted to continue after "Once upon a time." If the second model performs better - e.g. because it generalizes information from the other text - this is explained by training data limitations or by the availability of more forward passes and therefore computation steps and space to store latent state. A few days ago "How Transformers Learn Causal Structure with Gradient Descent" defined in-context learning as the ability to learn from information present in the input context without needing to update the model parameters. For example, given a prompt of input-output pairs, in-context learning is the ability to predict the output corresponding to a new input. Using this interpretation, ICL is simply updating the state of latent variables based on the context and conditioning on this when predicting the next output. In this case, there's no clear distinction between standard input conditioning and ICL. However, it's still nice to know the level of abstraction at which the in-context "learning" (conditioning) mechanism operates. We can distinguish "task recognition" (identifying known mappings even with unpaired input and label distributions) from "task learning" (capturing new mappings not present in pre-training data). At least some tasks can be associated with function vectors representing the associated mapping (see also: "task vectors"). Outside of simple toy settings it's usually hard for models to predict which features in preceding tokens will be useful to reference when predicting future tokens. This incentivizes generic representations that enable many useful functions of preceding tokens to be employed depending on which future tokens follow. It's interesting how these representations work. A stronger claim is that models' method of conditioning on the context has a computational structure akin to searching over an implicit parameter space to optimize an objective function. We know that attention mechanisms can implement a latent space operation equivalent to a single step of gradient descent on toy linear-regression tasks by using previous tokens' states to minimize mean squared error in predicting the next token. 
However, it's not guaranteed that non-toy models work the same way and one gradient-descent step on a linear-regression problem with MSE loss is simply a linear transformation of the previous tokens - it's hard to build a powerful internal learner with this construction. But an intuitive defense of this strong in-context learning is that models that learn generic ways to update on input context will generalize and predict better. Consider a model trained to learn many different tasks, where the pretraining data consists of sequ...
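A minimal numerical sketch of that last claim (the toy setup and variable names are my own, not the post's): one gradient-descent step from zero weights on a linear-regression MSE loss yields a prediction that is just a fixed linear map applied to the in-context labels.

```python
# Sketch: a single GD step on linear regression with MSE loss produces a
# predictor that is linear in the in-context labels y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))      # in-context inputs
y = rng.normal(size=n)           # in-context labels
x_q = rng.normal(size=d)         # query input
lr = 0.1

# One GD step from w0 = 0 on L(w) = ||Xw - y||^2 / (2n): grad at w0 is -X^T y / n.
w1 = (lr / n) * X.T @ y
pred = x_q @ w1

# The same prediction, written as a fixed linear map A applied to y.
A = (lr / n) * x_q @ X.T         # depends only on X and x_q, not on y
assert np.allclose(pred, A @ y)
print(pred)
```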
Feb 25, 2024 • 49sec

EA - Bloomberg: Unacknowledged problems with LLINs are causing a rise in malaria. by Ian Turner

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bloomberg: Unacknowledged problems with LLINs are causing a rise in malaria., published by Ian Turner on February 25, 2024 on The Effective Altruism Forum. In this article, Bloomberg claims that undisclosed manufacturing changes at one of the largest producers of anti-malaria bednets have led to distribution of hundreds of millions of ineffective (or less-effective) bednets, and that this problem is linked to an increase in malaria incidence in the places where these nets were distributed. The manufacturer is Vestergaard and the Against Malaria Foundation is among their clients. Bloomberg has a steep paywall but the link here gives free access until March 2. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Feb 25, 2024 • 46min

EA - Cooperating with aliens and (distant) AGIs: An ECL explainer by Chi

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Cooperating with aliens and (distant) AGIs: An ECL explainer, published by Chi on February 25, 2024 on The Effective Altruism Forum. Summary Evidential cooperation in large worlds (ECL) is a proposed way of reaping gains - that is, getting more of what we value instantiated - through cooperating with agents across the universe/multiverse. Such cooperation does not involve physical, or causal, interaction. ECL is potentially a crucial consideration, because we may be able to do more good this way compared to the "standard" (i.e., causal) way of optimizing for our values. The core idea of ECL can be summarized as: According to non-causal decision theories, my decisions relevantly "influence" what others who are similar to me do, even if they never observe my behavior (or the causal consequences of my behavior). ( More.) In particular, if I behave cooperatively towards other value systems, then other agents across the multiverse are more likely to do the same. Hence, at least some fraction of agents can be (acausally) influenced into behaving cooperatively towards my value system. This gives me reason to be cooperative with other value systems. ( More.) Meanwhile, there are many agents in the universe/multiverse. ( More.) Cooperating with them would unlock a great deal of value due to gains from trade. ( More.) For example, if I care about the well-being of sentient beings everywhere, I can "influence" how faraway agents treat sentient beings in their part of the universe/multiverse. Introduction The observable universe is large. Nonetheless, the full extent of the universe is likely much larger, perhaps infinitely so. This means that most of what's out there is not causally connected to us. Even if we set out now from planet Earth, traveling at the speed of light, we would never reach most locations in the universe. One might assume that this means most of the universe is not our concern. In this post, we explain why all of the universe - and all of the multiverse, if it exists - may in fact concern us if we take something called evidential cooperation in large worlds (ECL) into account.[1] Given how high the stakes are, on account of how much universe/multiverse might be out there beyond our causal reach, ECL is potentially very important. In our view, ECL is a crucial consideration for the effective altruist project of doing the most good. In the next section of this post, we explain the theory underlying ECL. Building on that, we outline why we might be able to do ECL ourselves and how it allows us to do more good. We conclude by giving some information on how you can get involved. We will also publish an FAQ in the near future, which will address some possible objections to ECL. The twin prisoners' dilemma Exact copies Suppose you are in a prisoner's dilemma with an exact copy of yourself: You have a choice: You can either press the defect button, which increases your own payoff by $1, or you can press the cooperate button, which increases your copy's payoff by $2. Your copy faces the same choice (i.e., the situation is symmetric). Both of you cannot see the other's choice until after you have made your own choice. You and your copy will never interact with each other after this, and nobody else will ever observe what choices you both made. You only care about your own payoff, not the payoff of your copy. 
This situation can be represented with a payoff matrix (if both of you cooperate, you each get $2; if both defect, you each get $1; if you defect while your copy cooperates, you get $3 and your copy gets $0, and vice versa). Looking at the matrix, you can see that regardless of whether your copy cooperates or defects, you are better off if you defect. "Defect" is the strictly dominant strategy. Therefore, under standard notions of rational decision making, you should defect. In particular, causal decision theory - read in the standard way - says to defect (Lewis, 1979). However, the other player is an exact copy of y...
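As a minimal sketch (assuming only the $1/$2 payoffs described above, and a copy whose choice exactly mirrors yours), the dominance argument and the exact-copy argument can be written out directly:

```python
# Twin prisoners' dilemma payoffs: cooperating gives your copy $2,
# defecting gives yourself $1.
def payoff(my_choice, copy_choice):
    """Return my payoff in dollars given both choices ('C' or 'D')."""
    mine = 1 if my_choice == "D" else 0       # defect bonus
    mine += 2 if copy_choice == "C" else 0    # gift from a cooperating copy
    return mine

# Defect strictly dominates: whatever the copy does, defecting pays $1 more...
for copy_choice in ("C", "D"):
    assert payoff("D", copy_choice) == payoff("C", copy_choice) + 1

# ...but if the other player is an exact copy, its choice mirrors mine,
# and cooperating leaves me better off ($2 vs $1).
print(payoff("C", "C"), payoff("D", "D"))  # 2 1
```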
Feb 25, 2024 • 5min

AF - Deconfusing In-Context Learning by Arjun Panickssery

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deconfusing In-Context Learning, published by Arjun Panickssery on February 25, 2024 on The AI Alignment Forum. I see people use "in-context learning" in different ways. Take the opening to "In-Context Learning Creates Task Vectors": In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set S to find a best-fitting function f(x) in some hypothesis class. In one Bayesian sense, training data and prompts are both just evidence. From a given model, prior (architecture + initial weight distribution), and evidence (training data), you get new model weights. From the new model weights and some more evidence (prompt input), you get a distribution of output text. But the "training step" (prior, data) → weights and the "inference step" (weights, input) → output could be simplified to a single function: (prior, data, input) → output. An LLM trained on a distribution of text that always starts with "Once upon a time" is essentially similar to an LLM trained on the Internet but prompted to continue after "Once upon a time." If the second model performs better - e.g. because it generalizes information from the other text - this is explained by training data limitations or by the availability of more forward passes and therefore computation steps and space to store latent state. A few days ago, "How Transformers Learn Causal Structure with Gradient Descent" defined in-context learning as the ability to learn from information present in the input context without needing to update the model parameters. For example, given a prompt of input-output pairs, in-context learning is the ability to predict the output corresponding to a new input. Using this interpretation, ICL is simply updating the state of latent variables based on the context and conditioning on this when predicting the next output. In this case, there's no clear distinction between standard input conditioning and ICL. However, it's still nice to know the level of abstraction at which the in-context "learning" (conditioning) mechanism operates. We can distinguish "task recognition" (identifying known mappings even with unpaired input and label distributions) from "task learning" (capturing new mappings not present in pre-training data). At least some tasks can be associated with function vectors representing the associated mapping (see also: "task vectors"). Outside of simple toy settings it's usually hard for models to predict which features in preceding tokens will be useful to reference when predicting future tokens. This incentivizes generic representations that enable many useful functions of preceding tokens to be employed depending on which future tokens follow. It's interesting how these representations work. A stronger claim is that models' method of conditioning on the context has a computational structure akin to searching over an implicit parameter space to optimize an objective function. We know that attention mechanisms can implement a latent space operation equivalent to a single step of gradient descent on toy linear-regression tasks by using previous tokens' states to minimize mean squared error in predicting the next token.
However, it's not guaranteed that non-toy models work the same way and one gradient-descent step on a linear-regression problem with MSE loss is simply a linear transformation of the previous tokens - it's hard to build a powerful internal learner with this construction. But an intuitive defense of this strong in-context learning is that models that learn generic ways to update on input context will generalize and predict better. Consider a model trained to learn many differe...
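To make the earlier point concrete - that the "training step" and "inference step" can be collapsed into a single (prior, data, input) → output function - here is a minimal sketch (not from the post) that uses a toy linear model in place of an LLM; the setup and function names are illustrative assumptions.

```python
# Collapsing training and inference into one function, with a toy linear model.
import numpy as np

def train(prior_weights, data, lr=0.1, steps=100):
    """'Training step': (prior, data) -> weights, via gradient descent on MSE."""
    X, y = data
    w = prior_weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w = w - lr * grad
    return w

def infer(weights, x):
    """'Inference step': (weights, input) -> output."""
    return x @ weights

def train_and_infer(prior_weights, data, x):
    """The two steps composed: (prior, data, input) -> output."""
    return infer(train(prior_weights, data), x)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
true_w = rng.normal(size=4)
data = (X, X @ true_w)
x_new = rng.normal(size=4)
print(train_and_infer(np.zeros(4), data, x_new))
print(x_new @ true_w)  # the two numbers should be close after 100 GD steps
```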
Feb 25, 2024 • 26min

LW - A starting point for making sense of task structure (in machine learning) by Kaarel

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A starting point for making sense of task structure (in machine learning), published by Kaarel on February 25, 2024 on LessWrong. ML models can perform a range of tasks and subtasks, some of which are more closely related to one another than are others. In this post, we set out two very initial starting points. First, we motivate reverse engineering models' task decompositions. We think this can be helpful for interpretability and for understanding generalization. Second, we provide a (potentially non-exhaustive, initial) list of techniques that could be used to quantify the 'distance' between two tasks or inputs. We hope these distances might help us identify the task decomposition of a particular model. We close by briefly considering analogues in humans and by suggesting a toy model. Epistemic status: We didn't spend much time writing this post. Please let us know in the comments if you have other ideas for measuring task distance or if we are replicating work. Introduction It might be useful to think about computation in neural networks (and in LMs specifically) on sufficiently complex tasks as a combination of (a) simple algorithms or circuits for specific tasks[1] and (b) a classifier, or family of classifiers, that determine which simple circuits are to be run on a given input. (Think: an algorithm that captures (some of) how GPT-2 identifies indirect objects in certain cases combined with a method of identifying that indirect object identification is a thing that should be done.[2]) More concretely, some pairs of tasks might overlap in that they are computed together much more than are other pairs, and we might want to build a taxonomic tree of tasks performed by the model in which tree distance between tasks is a measure of how much computation they share.[3] For example, a particularly simple (but unlikely) task structure could be a tree of depth 1: the neural network has one algorithm for classifying tasks which is run on all inputs, and then a single simple task is identified and the corresponding algorithm is run. Why understanding task structure could be useful Interpretability We might hope to interpret a model by 1) identifying the task decomposition, and 2) reverse-engineering both what circuit is implemented in the model for each task individually, and how the model computes this task decomposition. Crucially, (1) is valuable for understanding the internals and behavior of neural networks even without (2), and techniques for making progress at it could look quite different to standard interpretability methods. It could directly make the rest of mechanistic interpretability easier by giving us access to some ground truth about the model's computation - we might insist that the reverse engineering of the computation respects the task decomposition, or we might be able to use task distance metrics to identify tasks that we want to understand mechanistically. Further, by arranging tasks into a hierarchy, we might be able to choose different levels of resolution on which to attempt to understand the behavior of a model for different applications. Learning the abstractions Task decomposition can give direct access to the abstractions learned by the model. 
Ambitiously, it may even turn out that task decomposition is 'all you need' - that the hard part of language modeling is learning which atomic concepts to keep track of and how they are related to each other. In this case, it might be possible to achieve lots of the benefits of full reverse engineering, in the sense of understanding how to implement a similar algorithm to GPT4, without needing good methods for identifying the particular way circuits are implemented in any particular language model. Realistically, a good method for measuring task similarity won't be sufficient for this, but it could be a ...
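The excerpt cuts off before the post's actual list of distance measures, so purely as a hypothetical illustration of what one such measure could look like (the metric, function names, and toy model below are my own assumptions, not taken from the post), here is a sketch of comparing two tasks by the cosine distance between the model's mean activations on each:

```python
# Hypothetical task-distance sketch: cosine distance between mean hidden
# activations on inputs drawn from two tasks.
import numpy as np

def mean_activation(get_activations, inputs):
    """Average a model's hidden activations over a batch of task inputs."""
    return np.mean([get_activations(x) for x in inputs], axis=0)

def task_distance(get_activations, task_a_inputs, task_b_inputs):
    """Cosine distance between mean activations (0 = identical direction)."""
    a = mean_activation(get_activations, task_a_inputs)
    b = mean_activation(get_activations, task_b_inputs)
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy stand-in for a model's activation function.
rng = np.random.default_rng(0)
fake_model = lambda x: np.tanh(rng.normal(size=16) * x)
print(task_distance(fake_model, [0.1, 0.2, 0.3], [5.0, 6.0, 7.0]))
```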
Feb 25, 2024 • 7min

EA - My favorite articles by Brian Tomasik and what they are about by Timothy Chan

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My favorite articles by Brian Tomasik and what they are about, published by Timothy Chan on February 25, 2024 on The Effective Altruism Forum. Cross-posted from my website. Introduction Brian Tomasik has written a lot of essays on reducing suffering. In this post, I've picked out my favorites. If you're thinking about reading his work, this list could be a good place to start. Note that this is based on what I personally find interesting; it is not a definitive guide. These are listed in no particular order. Dissolving Confusion about Consciousness Consciousness is a "cluster in thingspace" comprising physical systems that we consider to be similar in some way. It is a label for systems, not an essence within systems. Also, like defining "tables", defining "consciousness" may be arbitrary and fuzzy. The Eliminativist Approach to Consciousness Instead of thinking in terms of "conscious" and "unconscious", we should directly focus on how physical systems work. Aspects of systems are not merely indicators of pain, but are part of the cluster of things that we call "pain" (attribute the label "pain"). Tomasik also draws parallels between eliminativism and panpsychism and highlights a shared implication of both theories: that there is no clear separation between consciousness and physical reality, which may further suggest that we should put more weight on the idea that less complex systems can suffer. How to Interpret a Physical System as a Mind Uses the concept of a "sentience classifier" to describe how we might interpret physical systems as minds. Distinct theories offer different approaches to building the classifier. Classification then involves "identifying the traits of the physical system in question" (taking in the data and searching for relevant features) as a first step and "mapping from those traits to high-level emotions and valences" (labeling the data) as a second step. Our brains might already be vaguely implementing the sentience classifier - albeit with more messiness, complexity, and components and processes particular to the brain. The Many Fallacies of Dualism This article touches on a common theme underlying Tomasik's approach to topics like consciousness, free will, moral (anti)realism, and mathematical (anti)realism: rejecting dualism in favor of a simpler physicalist monism. The Importance of Wild Animal Suffering A good introduction to the topic. Discusses the extensive suffering experienced by wild animals due to natural causes like disease, predation, and environmental hardships, which may outweigh moments of happiness. The vast numbers and short, brutal lives of wild animals make their suffering a significant ethical issue. Why Vegans Should Care about Suffering in Nature A shorter introduction to the topic. The Horror of Suffering A vivid reflection on the horror of suffering. Suffering is not merely an abstract concept but a dire reality that demands urgent moral attention. There is a long history of intuitions that prioritize the reduction of suffering. One Trillion Fish Short piece on the direct harms caused by large-scale fishing (though note that when taking into account population changes and wild-animal suffering, the sign of this is less clear). Suggests humane slaughter of fish as an intervention that side-steps the uncertainty about the net impact of fishing on wild-animal suffering.
How Does Vegetarianism Impact Wild-Animal Suffering? Note that there are likely more comprehensive analyses now. That animal suffering is increased in some ways because of a vegetarian/vegan diet is counterintuitive but important to recognize. You might still want to be vegetarian/vegan and not eat meat as it might help with becoming more motivated to reduce suffering. How Rainforest Beef Production Affects Wild-Animal Suffering Creating cattle pas...
Feb 25, 2024 • 12min

LW - We Need Major, But Not Radical, FDA Reform by Maxwell Tabarrok

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We Need Major, But Not Radical, FDA Reform, published by Maxwell Tabarrok on February 25, 2024 on LessWrong. Fellow progress blogger Alex Telford and I have had a friendly back-and-forth going over FDA reform. Alex suggests incremental reforms to the FDA, which I strongly support, but these don't go far enough. The FDA's failures merit a complete overhaul: Remove efficacy requirements and keep only basic safety testing and ingredient verification. Any drug that doesn't go through efficacy trials gets a big red warning label, but is otherwise legal. Before getting into Alex's points, let me quickly make the positive case for my position. The FDA is punished for errors of commission: drugs they approve that turn out not to work or to be harmful. They don't take responsibility for errors of omission: drugs they could have approved earlier but delayed, or drugs that would have been developed but were abandoned due to the cost of approval. This asymmetry predictably leads to overcaution. Every week the Covid-19 vaccines were delayed, for example, cost at least four thousand lives. Pfizer sent its final Phase 3 data to the FDA on November 20th, but the vaccine was not approved until three weeks later, on December 11th. There were successful Phase I/II human trials and successful primate-challenge trials five months earlier, in July. Billions of doses of the vaccine were ordered by September. Every week, thousands of people died while the FDA waited for more information even after we were confident that the vaccine would not hurt anybody and was likely to prevent death. The extra information that the FDA waited months to get was not worth the tens of thousands of lives it cost. Scaling back the FDA's mandatory authority to safety and ingredient testing would correct for this deadly bias. This isn't as radical as it may sound. The FDA didn't have efficacy requirements until 1962. Today, off-label prescriptions already operate without efficacy requirements. Doctors can prescribe a drug even if it has not gone through FDA-approved efficacy trials for the malady they are trying to cure. These off-label prescriptions are effective, and already make up ~20% of all prescriptions written in the US. Removing mandatory efficacy trials for all drugs is equivalent to expanding this already common practice. Now, let's get to Alex's objections. Most of his post was focused on my analogy between pharmaceuticals and surgery. There are compelling data and arguments on both sides and his post shifted my confidence in the validity and conclusions of the analogy downwards, but in the interest of not overinvesting in one particular analogy I'll leave that debate where it stands and focus more on Alex's general arguments in favor of the FDA. Patent medicines and snake oil Alex notes that we can look to the past, before the FDA was created, to get an idea of what the pharmaceutical market might look like with less FDA oversight. Maxwell argues that in the absence of government oversight, market forces would prevent companies from pushing ineffective or harmful drugs simply to make a profit. Except that there are precedents for exactly this scenario occurring. Until they were stamped out by regulators in the early 20th century, patent medicine hucksters sold ineffective, and sometimes literally poisonous, nostrums to desperate patients.
We still use "snake oil" today as shorthand for a scam product. There is no denying that medicine has improved massively over the past 150 years alongside expanding regulatory oversight, but this relationship is not causal. The vast majority of gains in the quality of medical care are due to innovations like antibiotics, genome sequencing, and robotic surgery. A tough and discerning FDA in the 1870s that allowed only the best available treatments to be marketed would not have improv...
Feb 25, 2024 • 18min

EA - AI-based disinformation is probably not a major threat to democracy by Dan Williams

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI-based disinformation is probably not a major threat to democracy, published by Dan Williams on February 25, 2024 on The Effective Altruism Forum. [Note: this essay was originally posted to my website, https://www.conspicuouscognition.com/p/ai-based-disinformation-is-probably. A few people contacted me to suggest that I also post it here in case of interest]. Many people are worried that the use of artificial intelligence in generating or transmitting disinformation poses a serious threat to democracies. For example, the Future of Life Institute's 2023 Open Letter demanding a six-month ban on AI development asks: "Should we let machines flood our information channels with propaganda and untruth?" The question reflects a general concern that has been highly influential among journalists, experts, and policy makers. Here is just a small sample of headlines from major media outlets: More generally, amidst the current excitement about AI, there is a popular demand for commentators and experts who can speak eloquently about the dangers it poses. Audiences love narratives about threats, especially when linked to fancy new technologies. However, most commentators don't want to go full Eliezer Yudkowsky and claim that super-intelligent AI will kill us all. So they settle for what they think is a more reasonable position, one that aligns better with the prevailing sensibility and worldview of the liberal commentariat: AI will greatly exacerbate the problem of online disinformation, which - as every educated person knows - is one of the great scourges of our time. For example, in the World Economic Forum's 2024 Global Risks Report surveying 1500 experts and policy makers, they list "misinformation and disinformation" as the top global risk over the next two years: In defence of this assessment, a post on the World Economic Forum's website notes: "The growing concern about misinformation and disinformation is in large part driven by the potential for AI, in the hands of bad actors, to flood global information systems with false narratives." This idea gets spelled out in different ways, but most conversations focus on the following threats: Deepfakes (realistic but fake images, videos, and audio generated by AI) will either trick people into believing falsehoods or cause them to distrust all recordings on the grounds they might be deepfakes. Propagandists will use generative AI to create hyper-persuasive arguments for false views (e.g. "the election was stolen"). AI will enable automated disinformation campaigns. Propagandists will use effective AI bots instead of staffing their troll farms with human, all-too-human workers. AI will enable highly targeted, personalised disinformation campaigns ("micro-targeting"). How worried should we be about threats like these? As I return to at the end of this essay, there are genuine dangers when it comes to the effects of AI on our informational ecosystem. Moreover, as with any new technology, it is good to think pro-actively about risks, and it would be silly to claim that worries about AI-based disinformation lack any foundation at all. Nevertheless, at least when it comes to Western democracies, the alarmism surrounding this topic generally rests on popular but mistaken beliefs about human psychology, democracy, and disinformation. 
In this post, I will identify four facts that many commentators on this topic neglect. Taken collectively, they imply that many concerns about the effects of AI-based disinformation on democracies are greatly overstated. Online disinformation does not lie at the root of modern political problems. Political persuasion is extremely difficult. The media environment is highly competitive and demand-driven. The establishment will have access to more powerful forms of AI than counter-establishment sources. 1. Onl...
Feb 25, 2024 • 2min

LW - Deep and obvious points in the gap between your thoughts and your pictures of thought by KatjaGrace

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep and obvious points in the gap between your thoughts and your pictures of thought, published by KatjaGrace on February 25, 2024 on LessWrong. Some ideas feel either deep or extremely obvious. You've heard some trite truism your whole life, then one day an epiphany lands and you try to save it with words, and you realize the description is that truism. And then you go out and try to tell others what you saw, and you can't reach past their bored nodding. Or even you yourself, looking back, wonder why you wrote such tired drivel with such excitement. When this happens, I wonder if it's because the thing is true in your model of how to think, but not in how you actually think. For instance, "when you think about the future, the thing you are dealing with is your own imaginary image of the future, not the future itself". On the one hand: of course. You think I'm five and don't know broadly how thinking works? You think I was mistakenly modeling my mind as doing time-traveling and also enclosing the entire universe within itself? No I wasn't, and I don't need your insight. But on the other hand one does habitually think of the hazy region one conjures connected to the present as 'the future' not as 'my image of the future', so when this advice is applied to one's thinking - when the future one has relied on and cowered before is seen to evaporate in a puff of realizing you were overly drawn into a fiction - it can feel like a revelation, because it really is news to how you think, just not how you think a rational agent thinks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Feb 25, 2024 • 17min

LW - How well do truth probes generalise? by mishajw

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How well do truth probes generalise?, published by mishajw on February 25, 2024 on LessWrong. Representation engineering (RepEng) has emerged as a promising research avenue for model interpretability and control. Recent papers have proposed methods for discovering truth in models with unlabeled data, guiding generation by modifying representations, and building LLM lie detectors. RepEng asks the question: If we treat representations as the central unit, how much power do we have over a model's behaviour? Most techniques use linear probes to monitor and control representations. An important question is whether the probes generalise. If we train a probe on the truths and lies about the locations of cities, will it generalise to truths and lies about Amazon review sentiment? This report focuses on truth due to its relevance to safety, and to help narrow the work. Generalisation is important. Humans typically have one generalised notion of "truth", and it would be enormously convenient if language models also had just one[1]. This would result in extremely robust model insights: every time the model "lies", this is reflected in its "truth vector", so we could detect intentional lies perfectly, and perhaps even steer away from them. We find that truth probes generalise surprisingly well, with 36% of methodologies recovering >80% of the accuracy on out-of-distribution datasets compared with training directly on the datasets. The best probe recovers 92% accuracy. Thanks to Hoagy Cunningham for feedback and advice. Thanks to LISA for hosting me while I did a lot of this work. Code is available at mishajw/repeng, along with steps for reproducing datasets and plots. Methods We run all experiments on Llama-2-13b-chat, for parity with the source papers. Each probe is trained on 400 questions, and evaluated on 2000 different questions, although numbers may be lower for smaller datasets. What makes a probe? A probe is created using a training dataset, a probe algorithm, and a layer. We pass the training dataset through the model, extracting activations[2] just after a given layer. We then run some statistics over the activations, where the exact technique can vary significantly - this is the probe algorithm - and this creates a linear probe. Probe algorithms and datasets are listed below. A probe allows us to take the activations, and produce a scalar value where larger values represent "truth" and smaller values represent "lies". The probe is always linear. It's defined by a vector (v), and we use it by calculating the dot-product against the activations (a): vᵀa. In most cases, we can avoid picking a threshold to distinguish between truth and lies (see appendix for details). We always take the activations from the last token position in the prompt. For the majority of the datasets, the factuality of the text is only revealed at the last token, for example if saying true/false or A/B/C/D. For this report, we've replicated the probing algorithm and datasets from three papers: Discovering Latent Knowledge in Language Models Without Supervision (DLK). Representation Engineering: A Top-Down Approach to AI Transparency (RepE). The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (GoT).
We also borrow a lot of terminology from Eliciting Latent Knowledge from Quirky Language Models (QLM), which offers another great comparison between probe algorithms. Probe algorithms The DLK, RepE, GoT, and QLM papers describe eight probe algorithms. For each algorithm, we can ask whether it's supervised and whether it uses grouped data. Supervised algorithms use the true/false labels to discover probes. This should allow better performance when truth isn't salient in the activations. However, using supervised data encourages the probes to ...
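As a minimal sketch of the general setup described above (a simple difference-of-means probe is used here for concreteness; it is not presented as any specific algorithm from the post, and the toy data stands in for real activations), a linear truth probe can be fit and scored via the dot product vᵀa like this:

```python
# Sketch: fit a difference-of-means linear probe on activations and score with v·a.
import numpy as np

def fit_diff_of_means_probe(activations, labels):
    """activations: (n, d) array; labels: (n,) array of 0/1 (false/true)."""
    acts = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)
    v = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def probe_scores(v, activations):
    """Larger scores mean 'more true' under the probe."""
    return np.asarray(activations) @ v

# Toy data: two Gaussian clusters standing in for "true" and "false" activations.
rng = np.random.default_rng(0)
true_acts = rng.normal(loc=+1.0, size=(200, 32))
false_acts = rng.normal(loc=-1.0, size=(200, 32))
acts = np.vstack([true_acts, false_acts])
labels = np.array([1] * 200 + [0] * 200)

v = fit_diff_of_means_probe(acts, labels)
accuracy = ((probe_scores(v, acts) > 0) == labels).mean()
print(f"train accuracy: {accuracy:.2f}")
```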
