

The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
Episodes

Jul 19, 2024 • 2min
LW - Linkpost: Surely you can be serious by kave
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Linkpost: Surely you can be serious, published by kave on July 19, 2024 on LessWrong.
Adam Mastroianni writes about "actually caring about stuff, and for the right reasons", rather than just LARPing. The opening is excerpted below.
I once saw someone give a talk about a tiny intervention that caused a gigantic effect, something like, "We gave high school seniors a hearty slap on the back and then they scored 500 points higher on the SAT."
Everyone in the audience was like, "Hmm, interesting, I wonder if there were any gender effects, etc."
I wanted to get up and yell: "EITHER THIS IS THE MOST POTENT PSYCHOLOGICAL INTERVENTION EVER, OR THIS STUDY IS TOTAL BULLSHIT."
If those results are real, we should start a nationwide backslapping campaign immediately. We should be backslapping astronauts before their rocket launches and Olympians before their floor routines. We should be running followup studies to see just how many SAT points we can get - does a second slap get you another 500? Or just another 250? Can you slap someone raw and turn them into a genius?
Or - much more likely - the results are not real, and we should either be a) helping this person understand where they screwed up in their methods and data analysis, or b) kicking them out for fraud.
Those are the options. Asking a bunch of softball questions ("Which result was your favorite?") is not a reasonable response. That's like watching someone pull a rabbit out of a hat actually for real, not a magic trick, and then asking them, "What's the rabbit's name?"
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 18, 2024 • 1h 23min
LW - AI #73: Openly Evil AI by Zvi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI #73: Openly Evil AI, published by Zvi on July 18, 2024 on LessWrong.
What do you call a clause explicitly saying that you waive the right to whistleblower compensation, and that you need to get permission before sharing information with government regulators like the SEC?
I have many answers.
I also know that OpenAI, having f***ed around, seems poised to find out, because that is the claim made by whistleblowers to the SEC. Given the SEC fines you for merely not making an explicit exception to your NDA for whistleblowers, what will they do once aware of explicit clauses going the other way?
(Unless, of course, the complaint is factually wrong, but that seems unlikely.)
We also have rather a lot of tech people coming out in support of Trump. I go into the reasons why, which I do think are worth considering. There is a mix of explanations, and at least one very good reason.
Then I also got suckered into responding to a few new (well, not really new, but renewed) disingenuous attacks on SB 1047. The entire strategy is to be loud and hyperbolic, especially on Twitter, and either hallucinate or fabricate a different bill with different consequences to attack, or simply misrepresent how the law works, then use that to create the illusion that the bill is disliked or harmful.
Few others respond to correct such claims, and I constantly worry that the strategy might actually work. But that does not mean you, my reader who already knows, need to read all that.
Also a bunch of fun smaller developments. Karpathy is in the AI education business.
Table of Contents
1. Introduction.
2. Table of Contents.
3. Language Models Offer Mundane Utility. Fight the insurance company.
4. Language Models Don't Offer Mundane Utility. Have you tried using it?
5. Clauding Along. Not that many people are switching over.
6. Fun With Image Generation. Amazon Music and K-Pop start to embrace AI.
7. Deepfaketown and Botpocalypse Soon. FoxVox, turn Fox into Vox or Vox into Fox.
8. They Took Our Jobs. Take away one haggling job, create another haggling job.
9. Get Involved. OpenPhil request for proposals. Job openings elsewhere.
10. Introducing. Karpathy goes into AI education.
11. In Other AI News. OpenAI's Q* is now named Strawberry. Is it happening?
12. Denying the Future. Projects of the future that think AI will never improve again.
13. Quiet Speculations. How to think about stages of AI capabilities.
14. The Quest for Sane Regulations. EU, UK, The Public.
15. The Other Quest Regarding Regulations. Many in tech embrace The Donald.
16. SB 1047 Opposition Watch (1). I'm sorry. You don't have to read this.
17. SB 1047 Opposition Watch (2). I'm sorry. You don't have to read this.
18. Open Weights are Unsafe and Nothing Can Fix This. What to do about it?
19. The Week in Audio. Joe Rogan talked to Sam Altman and I'd missed it.
20. Rhetorical Innovation. Supervillains, oh no.
21. Oh Anthropic. More details available, things not as bad as they look.
22. Openly Evil AI. Other things, in other places, on the other hand, look worse.
23. Aligning a Smarter Than Human Intelligence is Difficult. Noble attempts.
24. People Are Worried About AI Killing Everyone. Scott Adams? Kind of?
25. Other People Are Not As Worried About AI Killing Everyone. All glory to it.
26. The Lighter Side. A different kind of mental gymnastics.
Language Models Offer Mundane Utility
Let Claude write your prompts for you. He suggests using the Claude prompt improver.
Sully: convinced that we are all really bad at writing prompts
I'm personally never writing prompts by hand again
Claude is just too good - managed to feed it evals and it just optimized for me
Probably a crude version of dspy but insane how much prompting can make a difference.
Predict who will be the shooting victim. A machine learning model did this for citizens of Chicago (a ...

Jul 18, 2024 • 40min
LW - Individually incentivized safe Pareto improvements in open-source bargaining by Nicolas Macé
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Individually incentivized safe Pareto improvements in open-source bargaining, published by Nicolas Macé on July 18, 2024 on LessWrong.
Summary
Agents might fail to peacefully trade in high-stakes negotiations. Such bargaining failures can have catastrophic consequences, including great power conflicts and AI flash wars. This post is a distillation of DiGiovanni et al. (2024) (DCM), whose central result is that agents that are sufficiently transparent to each other have individual incentives to avoid catastrophic bargaining failures.
More precisely, DCM constructs strategies that are plausibly individually incentivized, and, if adopted by all, guarantee each player no less than their least preferred trade outcome. Figure 0 below illustrates this.
This result is significant because artificial general intelligences (AGIs) might (i) be involved in high-stakes negotiations, (ii) be designed with the capabilities required for the type of strategy we'll present, and (iii) bargain poorly by default (since bargaining competence isn't necessarily a direct corollary of intelligence-relevant capabilities).
Introduction
Early AGIs might fail to make compatible demands with each other in high-stakes negotiations (we call this a "bargaining failure"). Bargaining failures can have catastrophic consequences, including great power conflicts, or AI triggering a flash war. More generally, a "bargaining problem" is when multiple agents need to determine how to divide value among themselves.
Early AGIs might possess insufficient bargaining skills because intelligence-relevant capabilities don't necessarily imply these skills: For instance, being skilled at avoiding bargaining failures might not be necessary for taking over. Another problem is that there might be no single rational way to act in a given multi-agent interaction. Even arbitrarily capable agents might have different priors, or different approaches to reasoning under bounded computation.
Therefore they might fail to solve equilibrium selection, i.e., make incompatible demands (see Stastny et al. (2021) and Conitzer & Oesterheld (2023)). What, then, are sufficient conditions for agents to avoid catastrophic bargaining failures?
Sufficiently advanced AIs might be able to verify each other's decision algorithms (e.g. via verifying source code), as studied in open-source game theory. This has both potential downsides and upsides for bargaining problems. On one hand, transparency of decision algorithms might make aggressive commitments more credible and thus more attractive (see Sec. 5.2 of Dafoe et al. (2020) for discussion).
On the other hand, agents might be able to mitigate bargaining failures by verifying cooperative commitments.
Oesterheld & Conitzer (2022)'s safe Pareto improvements[1] (SPI) leverages transparency to reduce the downsides of incompatible commitments.
In an SPI, agents conditionally commit to change how they play a game relative to some default such that everyone is (weakly) better off than the default with certainty.[2] For example, two parties A and B who would otherwise go to war over some territory might commit to, instead, accept the outcome of a lottery that allocates the territory to A with the probability that A would have won the war (assuming this probability is common knowledge). See also our extended example below.
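To make the lottery example concrete, here is a toy numerical sketch (in Python; the payoff numbers and the simple war-cost model are illustrative assumptions, not taken from the post). Replacing the war with a lottery at the same win probability preserves each side's chance of getting the territory while saving the cost of fighting, so both parties are weakly better off - the defining property of an SPI.

```python
# Toy numerical sketch of the lottery-vs-war example above.
# All payoff numbers are illustrative assumptions, not taken from the post.

p_a_wins_war = 0.6      # common-knowledge probability that A would win the war
territory_value = 10.0  # value of the territory to whichever side ends up holding it
war_cost = 3.0          # cost each side pays if they actually fight

# Default outcome: go to war. Each side pays the fighting cost; the winner takes the territory.
war_payoff_a = p_a_wins_war * territory_value - war_cost
war_payoff_b = (1 - p_a_wins_war) * territory_value - war_cost

# SPI: both sides commit to a lottery with the same win probability instead of fighting.
lottery_payoff_a = p_a_wins_war * territory_value
lottery_payoff_b = (1 - p_a_wins_war) * territory_value

print(f"A: war {war_payoff_a:.1f} -> lottery {lottery_payoff_a:.1f}")
print(f"B: war {war_payoff_b:.1f} -> lottery {lottery_payoff_b:.1f}")
# The lottery reproduces the war's win probabilities while saving the fighting
# costs, so both parties are (weakly) better off than under the default.
```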
Oesterheld & Conitzer (2022) has two important limitations: First, many different SPIs are in general possible, such that there is an "SPI selection problem", similar to the equilibrium selection problem in game theory (Sec. 6 of Oesterheld & Conitzer (2022)).
And if players don't coordinate on which SPI to implement, they might fail to avoid conflict.[3] Second, if expected utility-maximizing agents need to individually adopt strategies to implement an SPI, it's unclear what conditions...

Jul 18, 2024 • 25min
LW - Mech Interp Lacks Good Paradigms by Daniel Tan
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Lacks Good Paradigms, published by Daniel Tan on July 18, 2024 on LessWrong.
Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free!
Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome!
Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible: it requires comparatively few computational resources and facilitates rapid learning thanks to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this occurs to the extent that some people feel MI is over-subscribed relative to other technical agendas.
However, how useful is this MI research?
A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret.
In this post, I'll critically examine several leading candidates for "paradigms" in MI, consider the available evidence for / against, and identify good future research directions (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined.
Towards a Grand Unifying Theory (GUT) with MI
Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned.
Some people who have espoused this opinion:
Richard Ngo has argued here that MI enables "big breakthroughs" towards a "principled understanding" of deep learning.
Rohin Shah has argued here that MI builds "new affordances" for alignment methods.
Evan Hubinger has argued for MI here because it helps us identify "unknown unknowns".
Leo Gao argues here that MI aids in "conceptual research" and "gets many bits" per experiment.
As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors, a.k.a. representation engineering, and circuit breakers, which were obviously inspired by the wealth of work in MI demonstrating the linear representation hypothesis.
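As a toy illustration of the linear representation hypothesis that underlies steering vectors (a minimal sketch with synthetic activations and hypothetical dimensions, not the actual representation-engineering pipeline): if a model encodes a concept as a direction in activation space, the difference of mean activations between contrasting prompt sets recovers that direction, and adding it back in shifts new activations toward the concept.

```python
import numpy as np

# Minimal sketch of the linear representation idea behind steering vectors,
# using synthetic "residual stream" activations instead of a real model.
rng = np.random.default_rng(0)
d_model = 64

# Suppose the model encodes some concept (e.g. sentiment) along a fixed direction.
concept_dir = rng.normal(size=d_model)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic activations for contrasting prompt sets: one set carries the
# concept direction, the other does not (plus noise in both cases).
pos_acts = rng.normal(size=(100, d_model)) + 2.0 * concept_dir
neg_acts = rng.normal(size=(100, d_model))

# Steering vector = difference of the mean activations of the two sets.
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
cosine = steering_vec @ concept_dir / np.linalg.norm(steering_vec)
print(f"cosine(steering vector, true concept direction) = {cosine:.3f}")

# "Steering" = adding the vector to an activation; in a real model this is done
# inside a forward hook on the residual stream at a chosen layer.
new_act = rng.normal(size=d_model)
steered_act = new_act + 4.0 * steering_vec / np.linalg.norm(steering_vec)
print(f"concept score before: {new_act @ concept_dir:.2f}, after: {steered_act @ concept_dir:.2f}")
```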
It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated.
(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here.
A GUT Needs Paradigms
Paradigm - an overarching framework for thinking about a field
In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for th...

Jul 18, 2024 • 32min
AF - A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team by Lee Sharkey
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team, published by Lee Sharkey on July 18, 2024 on The AI Alignment Forum.
Why we made this list:
The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we'd work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!
Previous lists of project ideas (such as Neel's collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren't an up-to-date reflection of what some researchers consider the frontiers of mech interp.
We therefore thought it would be helpful to share our list of project ideas!
Comments and caveats:
Some of these projects are more precisely scoped than others. Some are vague, others are more developed.
Not every member of the team endorses every project as high priority. Usually more than one team member supports each one, and in many cases most of the team is supportive of someone working on it.
We associate the person(s) who generated the project idea to each idea.
We've grouped the project ideas into categories for convenience, but some projects span multiple categories. We don't put a huge amount of weight on this particular categorisation.
We hope some people find this list helpful!
We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.
Foundational work on sparse dictionary learning for interpretability
Transcoder-related project ideas
See [2406.11944] Transcoders Find Interpretable LLM Feature Circuits
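For readers new to this line of work: a transcoder is an SAE-style module trained to map an MLP sublayer's input to that sublayer's output through a wide, sparsely activating feature layer. Below is a minimal PyTorch sketch assuming a top-k activation (in the spirit of the "probably using top k" idea below); the dimensions, hyperparameters, and training loop are illustrative assumptions, not Apollo's setup.

```python
import torch
import torch.nn as nn

class TopKTranscoder(nn.Module):
    """Sketch of a top-k transcoder: approximate an MLP sublayer's input-to-output
    map through a wide, sparsely activating feature layer."""

    def __init__(self, d_model: int, d_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, mlp_in: torch.Tensor) -> torch.Tensor:
        # Encode, then keep only the top-k feature activations per token.
        acts = torch.relu(self.encoder(mlp_in))
        topk = torch.topk(acts, self.k, dim=-1)
        sparse_acts = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse_acts)

# Illustrative dimensions; real runs would use cached activations from the model.
d_model, d_features, k = 768, 768 * 8, 32
transcoder = TopKTranscoder(d_model, d_features, k)
optimizer = torch.optim.Adam(transcoder.parameters(), lr=1e-4)

mlp_in = torch.randn(4096, d_model)   # stand-in for cached MLP sublayer inputs
mlp_out = torch.randn(4096, d_model)  # stand-in for the same sublayer's outputs

optimizer.zero_grad()
loss = (transcoder(mlp_in) - mlp_out).pow(2).mean()  # reconstruction loss; the
loss.backward()                                      # top-k activation enforces
optimizer.step()                                     # sparsity without an L1 term
```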
[Nix] Training and releasing high quality transcoders.
Probably using top k
GPT2 is a classic candidate for this. I'd be excited for people to try hard on even smaller models, e.g. GELU 4L
[Nix] Good tooling for using transcoders
Nice programming API to attribute an input to a collection of paths (see Dunefsky et al.)
Web user interface? Maybe in collaboration with Neuronpedia. Would need a GPU server constantly running, but I'm optimistic you could do it with an ~A4000.
[Nix] Further circuit analysis using transcoders.
Take random input sequences, run transcoder attribution on them, examine the output and summarize the findings.
High level summary statistics of how much attribution goes through error terms & how many pathways are needed would be valuable
Explaining specific behaviors (IOI, greater-than) with high standards for specificity & faithfulness. Might be convoluted if accuracy
[I could generate more ideas here, feel free to reach out: nix@apolloresearch.ai]
[Nix, Lee] Cross layer superposition
Does it happen? Probably, but it would be nice to have specific examples! Look for features with similar decoder vectors, and do exploratory research to figure out what exactly is going on.
What precisely does it mean? Answering this question seems likely to shed light on the question of 'What is a feature?'.
[Lucius] Improving transcoder architectures
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation. If we modify our transcoders to include a linear 'bypass' that is not counted in the sparsity penalty, do we improve performance since we are not unduly penalizing these linear transformations that would always be present and active?
If we train multiple transcoders in different layers at the same time, can we include a sparsity penalty for their interactions with each other, encouraging a decomposition of the network that leaves us with as few interactions between features a...

Jul 18, 2024 • 10min
LW - We ran an AI safety conference in Tokyo. It went really well. Come next year! by Blaine
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We ran an AI safety conference in Tokyo. It went really well. Come next year!, published by Blaine on July 18, 2024 on LessWrong.
Abstract
Technical AI Safety 2024 (TAIS 2024) was a conference organised by AI Safety 東京 and Noeon Research, in collaboration with Reaktor Japan, AI Alignment Network and AI Industry Foundation. You may have heard of us through ACX.
The goals of the conference were
1. demonstrate the practice of technical safety research to Japanese researchers new to the field
2. share ideas among established technical safety researchers
3. establish a good international reputation for AI Safety 東京 and Noeon Research
4. establish a Schelling conference for people working in technical safety
We sent out a survey after the conference to get feedback from attendees on whether or not we achieved those goals. We certainly achieved goals 1, 2 and 3; goal 4 remains to be seen. In this post we give more details about the conference, share results from the feedback survey, and announce our intentions to run another conference next year.
Okay but like, what was TAIS 2024?
Technical AI Safety 2024 (TAIS 2024) was a small non-archival open academic conference structured as a lecture series. It ran over the course of 2 days from April 5th-6th 2024 at the International Conference Hall of the Plaza Heisei in Odaiba, Tokyo.
We had 18 talks covering 6 research agendas in technical AI safety:
Mechanistic Interpretability
Developmental Interpretability
Scalable Oversight
Agent Foundations
Causal Incentives
ALIFE
…including talks from Hoagy Cunningham (Anthropic), Noah Y. Siegel (DeepMind), Manuel Baltieri (Araya), Dan Hendrycks (CAIS), Scott Emmons (CHAI), Ryan Kidd (MATS), James Fox (LISA), and Jesse Hoogland and Stan van Wingerden (Timaeus).
In addition to our invited talks, we had 25 submissions, of which 19 were deemed relevant for presentation. 5 were offered talk slots, and we arranged a poster session to accommodate the remaining 14. In the end, 7 people presented posters, 5 in person and 2 in absentia. Our best poster award was won jointly by Fazl Barez for Large Language Models Relearn Removed Concepts and Alex Spies for Structured Representations in Maze-Solving Transformers.
We had 105 in-person attendees (including the speakers). Our live streams had around 400 unique viewers, and maxed out at 18 concurrent viewers.
Recordings of the conference talks are hosted on our YouTube channel.
How did it go?
Very well, thanks for asking!
We sent out a feedback survey after the event, and got 68 responses from in-person attendees (58% response rate). With the usual caveats that survey respondents are not necessarily a representative sample of the population:
Looking good! Let's dig deeper.
How useful was TAIS 2024 for those new to the field?
Event satisfaction was high across the board, which makes it hard to tell how relatively satisfied population subgroups were. Only those who identified themselves as "new to AI safety" were neutrally satisfied, but the newbies were also the most likely to be highly satisfied.
It seems that people new to AI safety had no more or less trouble understanding the talks than those who work for AI safety organisations or have published AI safety research:
They were also no more or less likely to make new research collaborations:
Note that there is substantial overlap between some of these categories, especially for categories that imply a strong existing relationship to AI safety, so take the above charts with a pinch of salt:
                                 | Total | New to AI safety | Part of the AI safety community | Employed by an AI safety org | Has published AI safety research
New to AI safety                 |    26 |             100% |                             19% |                          12% |                               4%
Part of the AI safety community  |    28 |              18% |                            100% |                          36% |                              32%
Employed by an AI safety org     |    20 |              15% |                             50% |                         100% |                              35%
Has published AIS research       |    13 |               8% |                             69% |                          54% |                             100%
Subjectively, it fe...

Jul 18, 2024 • 19min
AF - SAEs (usually) Transfer Between Base and Chat Models by Connor Kissane
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs (usually) Transfer Between Base and Chat Models, published by Connor Kissane on July 18, 2024 on The AI Alignment Forum.
This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel.
Executive Summary
We train SAEs on base / chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B.
We also find that they don't transfer on Gemma v1 2B, and are generally bad at reconstructing the <1% of unusually high-norm activations (e.g. BOS tokens) from the opposite model.
We fine-tune our base Mistral-7B SAE (on 5 million chat activations) to cheaply obtain an SAE with competitive sparsity and reconstruction fidelity to a chat SAE trained from scratch (on 800M tokens).
We open source base, chat, and fine-tuned SAEs (plus wandb runs) for Mistral-7B and Qwen 1.5 0.5B.[1]
Mistral 7B base SAEs,
Mistral 7B chat SAEs,
Mistral 7B base SAEs fine-tuned on chat
Qwen 1.5 0.5B base SAEs,
Qwen 1.5 0.5B chat SAEs,
Qwen 1.5 0.5B base SAEs fine-tuned on chat
We release accompanying evaluation code at https://github.com/ckkissane/sae-transfer
Introduction
Fine-tuning is a common technique applied to improve frontier language models; however, we don't actually understand what fine-tuning changes within the model's internals. Sparse Autoencoders are a popular technique to decompose the internal activations of LLMs into sparse, interpretable features, and may provide a path to zoom into the differences between base vs fine-tuned representations.
In this update, we share preliminary results studying the representation drift caused by fine-tuning with SAEs. We investigate whether SAEs trained to accurately reconstruct a base model's activations also accurately reconstruct activations from the model after fine-tuning (and vice versa). In addition to studying representation drift, we also think this is an important question to gauge the usefulness of sparse autoencoders as a general purpose technique.
One flaw of SAEs is that they are expensive to train, so training a new suite of SAEs from scratch each time a model is fine-tuned may be prohibitive. If we are able to fine-tune existing SAEs for much cheaper, or even just re-use them, their utility seems more promising.
We find that SAEs trained on the middle-layer residual stream of base models transfer surprisingly well to the corresponding chat model, and vice versa. Splicing in the base SAE to the chat model achieves similar CE loss to the chat SAE on both Mistral-7B and Qwen 1.5 0.5B. This suggests that the residual streams for these base and chat models are very similar.
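To make "splicing in" concrete, here is a minimal sketch of the standard evaluation: replace the residual stream at one layer with the SAE's reconstruction and compare the language-model loss against a clean run and a zero-ablation baseline ("CE loss recovered"). This is not the authors' code (their evaluation code is at the repo linked above); it uses GPT-2 small rather than Mistral-7B or Qwen for size, and the SAE here is untrained, so the numbers are meaningless - the point is the plumbing.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

class SAE(nn.Module):
    # Vanilla ReLU sparse autoencoder over residual-stream activations.
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")
layer = 6                                  # a middle-layer residual stream
sae = SAE(d_model=768, d_hidden=768 * 16)  # untrained here; plumbing only

def make_hook(edit_fn):
    # A GPT2Block's forward output is a tuple whose first element is the hidden states.
    def hook(module, inputs, output):
        return (edit_fn(output[0]),) + output[1:]
    return hook

def lm_loss(edit_fn=None):
    ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
    handle = model.transformer.h[layer].register_forward_hook(make_hook(edit_fn)) if edit_fn else None
    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()
    if handle:
        handle.remove()
    return loss

clean = lm_loss()
spliced = lm_loss(lambda resid: sae(resid))               # splice in the SAE reconstruction
zero_abl = lm_loss(lambda resid: torch.zeros_like(resid))  # zero-ablation baseline

# "CE loss recovered": 1.0 means splicing is as good as the clean model,
# 0.0 means no better than zeroing the activations entirely.
recovered = (zero_abl - spliced) / (zero_abl - clean)
print(f"clean {clean:.3f}  spliced {spliced:.3f}  zero-abl {zero_abl:.3f}  recovered {recovered:.2%}")
```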
However, we also identify cases where the SAEs don't transfer. First, the SAEs fail to reconstruct activations from the opposite model that have outlier norms (e.g. BOS tokens). These account for less than 1% of the total activations, but cause cascading errors, so we need to filter these out in much of our analysis. We also find that SAEs don't transfer on Gemma v1 2B.
We find that the difference in weights between Gemma v1 2B base vs chat is unusually large compared to other fine-tuned models, explaining this phenomenon.
Finally, to solve the outlier norm issue, we fine-tune a Mistral 7B base SAE on just 5 million tokens (compared to 800M token pre-training), to obtain a chat SAE of comparable quality to one trained from scratch, without the need to filter out outlier activations.
Investigating SAE Transfer between base and chat models
In this section we investigate if base SAEs transfer to chat models, and vice versa. We find that with the exception of outlier norm tokens (e.g. BOS), they transfer surprisingly well, achieving similar CE loss recovered to the orig...

Jul 18, 2024 • 2min
EA - Rethink Priorities' CEO announcement by Rethink Priorities
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Rethink Priorities' CEO announcement, published by Rethink Priorities on July 18, 2024 on The Effective Altruism Forum.
Rethink Priorities (RP) is excited to announce that Marcus A. Davis is now RP's sole CEO. Former Co-CEO Peter Wildeford will remain at RP, focusing on projects in artificial intelligence. He will also continue his work as Chief Advisory Executive at the RP-sponsored think tank, the Institute for AI Policy and Strategy (IAPS).
Since 2018, co-founders Marcus Davis and Peter Wildeford have served as Co-CEOs of RP. Their joint leadership has grown RP from a two-person research team into an international research organization with 60+ staff working around the world. Their guidance has helped expand RP's research areas to include animal welfare, global health and development, and artificial intelligence policy.
The decision to transition to this new leadership structure comes after discussions around the opportunities for RP's future growth, Peter's expertise and interests, and developments in the artificial intelligence landscape.
For further information, please refer to RP's website.
Rethink Priorities is a think-and-do tank dedicated to informing decisions made by high-impact organizations and funders across various cause areas. We invite you to explore our research database and stay updated on new work by subscribing to our newsletter.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

Jul 18, 2024 • 5min
LW - Friendship is transactional, unconditional friendship is insurance by Ruby
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Friendship is transactional, unconditional friendship is insurance, published by Ruby on July 18, 2024 on LessWrong.
It feels a little icky to say, but we befriend people because we get something out of it. We enjoy the company, the conversation, the emotional support, the activities, the connection, etc. It's not a coincidence people don't befriend brick walls.
(The same is true in romantic relationships, except we expect even more.)
Granted, friendship is not an explicit transaction that's negotiated, quantified, legally enforceable, etc. It's fuzzy, which helps it work better for reasons I won't really get into here[1].
However it's crucial to recognize that if your friend (or partner) didn't provide or promise you some kind of value[2], you wouldn't have become friends in the first place.
And yet, people valorize the notion of loyalty in relationships: continuing to be there through thick and thin, good and bad, health and illness. "Unconditional friendship" and "unconditional love". Conversely "fair weather friendship" is denigrated.
People hope to be loved even if they were worms.
What gives? How do we reconcile friendships and relationships arising due to receiving some value with the aspiration or even expectation of unconditionality?
My model here is something functionally akin to mutual insurance. While I became your friend because we spent years playing basketball together, I stay by your side even when you're recovering from a broken leg, or even if you were injured so badly as to never play again. Someone initially enticed by their partner's beauty stays with them even after a horrific burn to the face. I do this because I expect the same in return.
You might argue that in these cases, you're still receiving other benefits even when one of them is lost, but I argue back that we see ongoing care even where there's almost nothing left, e.g. people caring for their senile, bedridden partners. And more so, that we judge people who don't stick it out.
Friendship is standardly a straightforward exchange of value provided. It is also an exchange of insurance "if you're not able to provide value to me, I'll still provide value to you" and vice versa. Like the other stuff in friendship, it's fuzzy. The insurance exchange doesn't happen in a discrete moment and its strength is quantitative and expected to grow over time. People expect more "loyalty" from friends and partners of years than weeks.
In the limit, people reach "unconditional love", meaning something like from this point on, I will love you no matter what. However, reaching that willingness was very probably tied to specific conditional factors. It's notable that for many people love and security are connected.
Sufficiently loving and supportive relationships provide security because they imply an unconditionality on circumstances - you'll have someone even if fortune befalls you and you lose what makes you appealing in the first place.
I think this makes sense. Seems like good game theoretic trade even with a willing partner. "Till death do us part." Possibly worth making a little more explicit though, just to be sure your friends and partners share whatever expectations of loyalty you have.
Note that I don't think this dynamic needs to be very conscious on anyone's part. I think that humans instinctively execute good game theory because evolution selected for it, even if the human executing just feels a wordless pull to that kind of behavior. In this context, "attachment to others" feels like a thing that humans and other animals experience.
Parents, perhaps especially mothers, are very attached to their children (think of the mother bear), but we tend to form attachments to anyone (or thing) that we're persistently around. When I stick with my friend of many years through his illness, it might feel ...

Jul 17, 2024 • 6min
LW - What are you getting paid in? by Austin Chen
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What are you getting paid in?, published by Austin Chen on July 17, 2024 on LessWrong.
Crossposting this essay by my friend, Leila Clark.
A long time ago, a manager friend of mine wrote a book to collect his years of wisdom. He never published it, which is a shame because it was full of interesting insights. One that I think a lot about today was the question: "How are you paying your team?"
This friend worked in finance. You might think that people in finance, like most people, are paid in money. But it turns out that even in finance, you can't actually always pay and motivate people with just money.
Often, there might just not be money to go around. Even if there is, managers are often captive to salary caps and performance bands. In any case, it's awkward to pay one person ten times more than another, even if one person is clearly contributing ten times more than the other (many such cases exist).
With this question, my manager friend wanted to point out that you can pay people in lots of currencies. Among other things, you can pay them in quality of life, prestige, status, impact, influence, mentorship, power, autonomy, meaning, great teammates, stability and fun. And in fact most people don't just want to be paid in money - they want to be paid some mixture of these things.
To demonstrate this point, take musicians and financiers.
A successful financier is much, much richer in dollars than a successful musician. Some googling suggests that Mitski and Grimes, both very successful alternative musicians, have net worths of about $3-5m. $5m is barely notable in the New York high society circles that most financiers run in. Even Taylor Swift, maybe one of the most successful musicians of all time, has a net worth of generously $1b; Ken Griffin, one of the most successful financiers of all time, has a net worth of $33b.
But more people want to be musicians, and I think it's because musicians are paid in ways that financiers aren't.
Most obviously, musicians are way cooler. They get to interact with their fans. People love their work. They naturally spend their days hanging out with other cool people - other musicians. They can work on exactly what they want to, largely when they want to - they've won the American Dream because they get to work on what they love and get paid! And in that way, they get paid in radical self-expression.
(This is a little unfair, because I know some financiers who think that work is a means of radical self-expression. Knowing their personalities, I believe them, but it doesn't help them get tables at fancy New York restaurants the way Taylor can.)
I don't want to be too down on finance. People are different, and it's a good fact about the world that different people can be paid in different ways. My math genius friends would hate interacting with most fans and musicians. They instead have stable jobs, rent beautiful apartments in New York and solve fun technical problems all day with their friends. That's exactly how they want to get paid.
But when I worked in finance, people would sometimes shake their heads and ask why bright 20-year-olds would take the huge risk of moving to New York for unstable and uncertain careers as musicians, actors, or starving artists. I probably asked this question myself, when I was younger. Hopefully this provides some insight to the financiers.
So how do you make sure you get paid the way you want to? From what I can tell, the best way is to pick the right industry. It's fairly straightforward to tell how an industry pays. Politics pays in power. Finance pays in money. Music and art pay in 'coolness.' Nonprofit work, teaching and healthcare pay in meaning and, a friend reports, sometimes a sense of superiority over others too.
There's an exchange rate between many of the currencies you can get paid in, but ...


