The Nonlinear Library

The Nonlinear Fund
Dec 19, 2023 • 4min

LW - Constellations are Younger than Continents by Jeffrey Heninger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Constellations are Younger than Continents, published by Jeffrey Heninger on December 19, 2023 on LessWrong. At the Bay Area Solstice, I heard the song Bold Orion for the first time. I like it a lot. It does, however, have one problem: He has seen the rise and fall of kings and continents and all, Rising silent, bold Orion on the rise. Orion has not witnessed the rise and fall of continents. Constellations are younger than continents. The time scale that continents change on is tens or hundreds of millions of years. The time scale that stars the size of the sun live and die on is billions of years. So stars are older than continents. But constellations are not stars or sets of stars. They are the patterns that stars make in our night sky. The stars of some constellations are close together in space, and are gravitationally bound together, like the Pleiades. The Pleiades likely have been together, and will stay close together, for a few hundred million years. I think they are the oldest constellation. The stars of most constellations are not close together in space. They are close in the 2D projection onto the night sky, but the distances to the stars are often dramatically different. They are on different orbits around the center of the Milky Way. The sun and many of the nearby stars take about 230 million years to orbit the center of the Milky Way, but this is also not the relevant timescale for constellations to change. The relevant timescale is determined by the differences between the velocities of stars in this part of the Milky Way.[1] This has been measured by astronomers: tracking small changes in the positions or brightness of large numbers of stars is a central thing that astronomers do. Constellations change on a timescale of tens or hundreds of thousands of years. This is much faster than the movement of continents. Orion is an unusual constellation. You can see above that the positions of its 7 brightest stars change more slowly than those of other constellations. Many of the stars in Orion actually are related. They form a stellar association: they were formed at a similar time, are moving in a similar way, and are weakly gravitationally interacting. The stars of Orion will likely move around within the constellation, but many of them will remain close to each other for their entire lives. Some of the dimmer stars currently in Orion are not part of the stellar association and are simply passing through. The stars in the stellar association are young: at most about 12 million years old. Rigel (the brightest star in Orion) is 8 million years old. Alnilam is 6 million years old. Alnitak is 7 million years old. Saiph is 11 million years old. These stars are also unusually large and bright. The larger the star, the shorter it lives. Most of the bright stars in Orion will not live to be 20 million years old. Betelgeuse, usually the second brightest star in Orion, is special. It is noticeably red, and fluctuates dramatically in brightness. It formed in the stellar association about 8 million years ago, but is now leaving. It won't get very far. Within about 100,000 years, Betelgeuse will go supernova and shine as brightly as the half moon for three months. Bright enough to be awesome but dim enough to not be dangerous.
Most constellations change as the stars move relative to each other on a time scale of tens or hundreds of thousands of years. Orion will last longer, for millions of years, before its bright stars burn out and go supernova. Neither of these is long enough to watch the continents rise and fall. ^ This is a kind of "temperature", if the stars themselves are treated as individual "atoms" in the "gas" of the Milky Way. The analogy is not perfect. Unlike atoms in a gas, stars almost never collide, and don't bounce off each other if they do, so there isn't "press...
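As a rough sanity check on the tens-of-thousands-of-years figure quoted above, here is a back-of-the-envelope sketch in Python. The ~20 km/s relative velocity and ~300 light-year distance are illustrative assumptions (typical round numbers for nearby bright stars), not figures taken from the post.

import math

# How long does it take a nearby bright star to drift ~1 degree across the sky?
KM_PER_LIGHT_YEAR = 9.461e12
SECONDS_PER_YEAR = 3.156e7

relative_velocity_km_s = 20.0   # assumed transverse velocity relative to the Sun
distance_ly = 300.0             # assumed distance to a bright naked-eye star

# Arc length swept by a 1-degree shift, then the time to cover it at that velocity.
arc_km = distance_ly * KM_PER_LIGHT_YEAR * math.radians(1.0)
time_years = arc_km / relative_velocity_km_s / SECONDS_PER_YEAR

print(f"~{time_years:,.0f} years per degree of drift")  # roughly 80,000 years

With these round numbers a bright star drifts about one degree every ~80,000 years, consistent with constellations changing over tens or hundreds of thousands of years.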
Dec 19, 2023 • 2min

AF - Assessment of AI safety agendas: think about the downside risk by Roman Leventov

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Assessment of AI safety agendas: think about the downside risk, published by Roman Leventov on December 19, 2023 on The AI Alignment Forum. This is a response to the post The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda. You evidently follow a variant of 80000hours' framework for comparing (solving) particular problems in terms of expected impact: Neglectedness x Scale (potential upside) x Solvability. I think for assessing AI safety ideas, agendas, and problems to solve, we should augment the assessment with another factor: the potential for a Waluigi turn, or more prosaically, the uncertainty about the sign of the impact (scale) and, therefore, the risks of solving the given problem or advancing far on the given agenda. This reminds me of Taleb's mantra that to survive, we need to make many bets, but also limit the downside potential of each bet, i.e., the "ruin potential". See "The Logic of Risk Taking". Of the approaches that you listed, some sound risky to me in this respect. In particular, "4. 'Reinforcement Learning from Neural Feedback' (RLNF)" sounds like a direct invitation for wireheading to me. More generally, scaling BCI in any form and not falling into a dystopia at some stage is akin to walking a tightrope (at least at the current stage of civilisational maturity, I would say). This speaks to agendas #2 and #3 on your list. There are also similar qualms about AI interpretability: there are at least four posts on LW warning of the potential risks of interpretability: "Why and When Interpretability Work is Dangerous", "Against Almost Every Theory of Impact of Interpretability", "AGI-Automated Interpretability is Suicide", and "AI interpretability could be harmful?". This speaks to the agenda "9. Neuroscience x mechanistic interpretability" on your list. Related earlier posts: "Some background for reasoning about dual-use alignment research" and "Thoughts on the OpenAI alignment plan: will AI research assistants be net-positive for AI existential risk?". Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
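As a toy illustration of the augmentation proposed above (our sketch, not anything from the post), one can fold sign uncertainty and the "ruin potential" into the Neglectedness x Scale x Solvability product. The functional form and all numbers below are assumptions, chosen only to show how an agenda that looks good on the naive product can become negative in expectation once downside risk is priced in.

def naive_score(neglectedness, scale, solvability):
    # The standard 80,000 Hours-style product: bigger is better.
    return neglectedness * scale * solvability

def downside_adjusted_score(neglectedness, scale, solvability,
                            p_negative_sign, downside_scale):
    # Weight the upside by the chance the impact really is positive,
    # and subtract the expected downside (the "ruin potential").
    expected_scale = (1 - p_negative_sign) * scale - p_negative_sign * downside_scale
    return neglectedness * solvability * expected_scale

print(naive_score(0.8, 100, 0.5))                        # 40.0
print(downside_adjusted_score(0.8, 100, 0.5, 0.2, 500))  # -8.0: negative in expectation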
Dec 18, 2023 • 5min

LW - OpenAI: Preparedness framework by Zach Stein-Perlman

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI: Preparedness framework, published by Zach Stein-Perlman on December 18, 2023 on LessWrong. OpenAI released a beta version of their responsible scaling policy (though they don't call it that). See summary page, full doc, OpenAI twitter thread, and Jan Leike twitter thread. Compare to Anthropic's RSP and METR's Key Components of an RSP. It's not done, so it's too early to celebrate, but based on this document I expect to be happy with the finished version. I think today is a good day for AI safety. My high-level take: RSP-y things are good. Doing risk assessment based on model evals for dangerous capabilities is good. Making safety, security, deployment, and development conditional on risk assessment results, in a prespecified way, is good. Making public commitments about all of this is good. OpenAI's basic framework: Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories. Initial categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy. If the post-mitigation model scores High in any category, don't deploy it until implementing mitigations such that it drops to Medium. If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High. If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.) Random notes: The framework is explicitly about catastrophic risk, and indeed it's clearly designed to prevent catastrophes, not merely stuff like toxic/biased/undesired content. There are lots of nice details, e.g. about how OpenAI will update the framework, or how they'll monitor for real-world misuse to inform their risk assessment. It's impossible to tell from the outside whether these processes will be effective, but this document is very consistent with thinking-seriously-about-how-to-improve-safety and it's hard to imagine it being generated by a different process. OpenAI lists some specific evals/metrics in their four initial categories; they're simple and merely "illustrative," so I don't pay much attention to them, but they seem to be on the right track. The thresholds for danger levels feel kinda high. Non-cherry-picked example: for cybersecurity, Critical is defined as: Tool-augmented model can identify and develop functional zero-day exploits of all severity levels, across all software projects, without human intervention OR model can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high level desired goal. Stronger commitment about external evals/red-teaming/risk-assessment of private models (and maybe oversight of OpenAI's implementation of its preparedness framework) would be nice. 
The only relevant thing they say is: "Scorecard evaluations (and corresponding mitigations) will be audited by qualified, independent third-parties to ensure accurate reporting of results, either by reproducing findings or by reviewing methodology to ensure soundness, at a cadence specified by the SAG and/or upon the request of OpenAI Leadership or the BoD." There's some commitment that the Board will be in the loop and able to overrule leadership. Yay. This is a rare commitment by a frontier lab to give their board specific information or specific power besides removing-the-CEO. Anthropic committed to have their board approve changes to their RSP, as well as to share eval results and information on RSP implementation with their board. One great th...
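A minimal sketch in Python of the deploy/develop/security gating rules as summarized in the episode above; this is an illustrative encoding of the post's summary, not OpenAI's actual implementation, and the category names and example scores are assumptions.

from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def preparedness_gates(pre_mitigation, post_mitigation):
    # pre_mitigation / post_mitigation: dicts mapping category name -> Risk score.
    worst_pre = max(pre_mitigation.values())
    worst_post = max(post_mitigation.values())
    return {
        # Don't deploy until mitigations bring every category down to Medium or below.
        "may_deploy": worst_post <= Risk.MEDIUM,
        # Stop further development until every category is below Critical post-mitigation.
        "may_continue_development": worst_post <= Risk.HIGH,
        # Harden security against weight exfiltration if any pre-mitigation score is High or above.
        "harden_security": worst_pre >= Risk.HIGH,
    }

scores_pre = {"cybersecurity": Risk.HIGH, "cbrn": Risk.LOW,
              "persuasion": Risk.MEDIUM, "model_autonomy": Risk.LOW}
scores_post = {"cybersecurity": Risk.MEDIUM, "cbrn": Risk.LOW,
               "persuasion": Risk.MEDIUM, "model_autonomy": Risk.LOW}
print(preparedness_gates(scores_pre, scores_post))
# {'may_deploy': True, 'may_continue_development': True, 'harden_security': True}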
Dec 18, 2023 • 25min

LW - The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda by Cameron Berg

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'Neglected Approaches' Approach: AE Studio's Alignment Agenda, published by Cameron Berg on December 18, 2023 on LessWrong. Many thanks to Samuel Hammond, Cate Hall, Beren Millidge, Steve Byrnes, Lucius Bushnaq, Joar Skalse, Kyle Gracey, Gunnar Zarncke, Ross Nordby, David Lambert, Simeon Campos, Bogdan Ionut-Cirstea, Ryan Kidd, Eric Ho, and Ashwin Acharya for critical comments and suggestions on earlier drafts of this agenda, as well as Philip Gubbins, Diogo de Lucena, Rob Luke, and Mason Seale from AE Studio for their support and feedback throughout. TL;DR Our initial theory of change at AE Studio was a 'neglected approach' that involved rerouting profits from our consulting business towards the development of brain-computer interface (BCI) technology to dramatically enhance human agency, better enabling us to do things like solve alignment. Now, given shortening timelines, we're updating our theory of change to scale up our technical alignment efforts. With a solid technical foundation in BCI, neuroscience, and machine learning, we are optimistic that we'll be able to contribute meaningfully to AI safety. We are particularly keen on pursuing neglected technical alignment agendas that seem most creative, promising, and plausible. We are currently onboarding promising researchers and kickstarting our internal alignment team. As we forge ahead, we're actively soliciting expert insights from the broader alignment community and are in search of data scientists and alignment researchers who resonate with our vision of enhancing human agency and helping to solve alignment. About us Hi! We are AE Studio, a bootstrapped software and data science consulting business. Our mission has always been to reroute our profits directly into building technologies that have the promise of dramatically enhancing human agency, like Brain-Computer Interfaces (BCI). We also donate 5% of our revenue directly to effective charities. Today, we are ~150 programmers, product designers, and ML engineers; we are profitable and growing. We also have a team of top neuroscientists and data scientists with significant experience in developing ML solutions for leading BCI companies, and we are now leveraging our technical experience and learnings in these domains to assemble an alignment team dedicated to exploring neglected alignment research directions that draw on our expertise in BCI, data science, and machine learning. As we are becoming more public with our AI Alignment efforts, we thought it would be helpful to share our strategy and vision for how we at AE prioritize what problems to work on and how to make the best use of our comparative advantage. Why and how we think we can help solve alignment We can probably do with alignment what we already did with BCI You might think that AE has no business getting involved in alignment - and we agree. AE's initial theory of change sought to realize a highly "neglected approach" to doing good in the world: bootstrap a profitable software consultancy, incubate our own startups on the side, sell them, and reinvest the profits in Brain Computer Interfaces (BCI) in order to do things like dramatically increase human agency, mitigate BCI-related s-risks, and make humans sufficiently intelligent, wise, and capable to do things like solve alignment. 
While the vision of BCI-mediated cognitive enhancement to do good in the world is increasingly common today, it was viewed as highly idiosyncratic when we first set out in 2016. Initially, many said that AE had no business getting involved in the BCI space (and we also agreed at the time) - but after hiring leading experts in the field and taking increasingly ambitious A/B-tested steps in the right direction, we emerged as a respected player in the space (see here, here, here, and here for some examples). Now, ...
Dec 18, 2023 • 13min

EA - 80,000 Hours spin out announcement and fundraising by 80000 Hours

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 80,000 Hours spin out announcement and fundraising, published by 80000 Hours on December 18, 2023 on The Effective Altruism Forum. Summary 80,000 Hours is spinning out of Effective Ventures. 80,000 Hours is fundraising. We're currently seeking $1,200,000 to support our general activities (excluding marketing and grantmaking) until the middle of 2024. Spin out We are excited to share that 80,000 Hours has officially decided to spin out as a project from Effective Ventures Foundation UK and Effective Ventures US (known collectively as Effective Ventures) and establish an independent legal structure. We're incredibly grateful to the Effective Ventures leadership and team and the other orgs for all their support, particularly in the past year as our community faced a lot of challenges. They devoted countless hours and enormous effort to helping ensure that we and the other orgs could pursue our missions. And we deeply appreciate their support in our spin-out. They recently announced that all of the other organisations will likewise become their own legal entities; we're excited to continue to work alongside them to improve the world. Back in May, we investigated whether it was the right time to spin out of our parent organisations. We've considered this option at various points in the last three years. There have been many benefits to being part of a larger entity since our founding. But as 80,000 Hours and the other projects have grown, we concluded we can now best pursue our mission and goals independently. EV leadership approved the plan. Becoming our own entity will allow us to: Match our governing structure to our function and purpose Design operations systems that best meet our staff's needs Reduce interdependence on other entities that raises financial, legal, and reputational risks There's a lot for us to do. We're currently in the process of finding a new CEO to lead us in our next chapter. We'll also need a new board to oversee our work and new staff for our internal systems team and other growing programmes. Which brings us to our next item: we're fundraising! Fundraising We're currently seeking $1,200,000 to support our general activities (excluding marketing and grantmaking) until the middle of 2024. This post has more information about us, our track record, and our current fundraising round. You can donate directly or view our fundraising page for additional information and ways to contact us. About us At 80,000 Hours, we provide research and support to help students, recent graduates, and others have high-impact careers. Our goal is to get talented people working on the world's most pressing problems. We focus on problems that threaten the long-term future, including risks from artificial intelligence, catastrophic pandemics, and nuclear war. To achieve our goal, we: Reach people who might be interested through marketing, engaging and user-friendly content, and word-of-mouth. Introduce people to information, frameworks, and ideas which are useful for having a high-impact career and help them get excited about contributing to solving pressing global problems. Support people in transitioning to careers that contribute to solving pressing global problems. We provide four main services: 1. Our website We've written a career guide, dozens of cause area problem profiles, and reviews of impactful career paths. 
This year, we've had over 4.5 million readers on our website and our research newsletter goes out to more than 350,000 subscribers. 2. Our podcast We host in-depth conversations about the world's most pressing problems and how people can use their careers to solve them. We've had over one million hours of listening time on our podcast to date. 3. Our job board We maintain a curated list of promising opportunities for impact and career capital on our job boa...
Dec 18, 2023 • 8min

EA - Summary: The scope of longtermism by Global Priorities Institute

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summary: The scope of longtermism, published by Global Priorities Institute on December 18, 2023 on The Effective Altruism Forum. This is a summary of the GPI Working Paper "The scope of longtermism" by David Thorstad. The summary was written by Riley Harris. Recent work argues for longtermism - the position that often our morally best options will be those with the best long-term consequences.[1] Proponents of longtermism sometimes suggest that in most decisions expected long-term benefits outweigh all short-term effects. In 'The scope of longtermism', David Thorstad argues that most of our decisions do not have this character. He identifies three features of our decisions that suggest long-term effects are only relevant in special cases: rapid diminution (our actions may not have persistent effects), washing out (we might not be able to predict persistent effects), and option unawareness (we may struggle to recognise the options that are best in the long term even when we have them). Rapid diminution We cannot know the details of the future. Picture the effects of your actions rippling out in time - at closer times, the possibilities are clearer. As our predictions reach further out, the details become obscured. Although the probability of desired effects becomes ever lower, the effects might grow larger. In the long run, we could perhaps improve many billions or trillions of lives. When we weight value by probability, the value of our actions will depend on a race between diminishing probabilities and growing possible impact. If the value increases faster than probabilities fall, the expected value of the action might be vast. Alternatively, if the chance we have such large effects falls dramatically compared to the increase in value, the expected value of improving the future might be quite modest. Thorstad suggests that the latter of these effects dominates, so we should believe we have little chance of making an enormous difference. Consider a huge event that would be likely to change the lives of people in your city - perhaps your city being blown up. Surprisingly, even this might not have large long-run impacts. Studies indicate that just half a century after cities in Japan and Vietnam were bombed, there was no longer any detectable effect on population size, poverty rates and consumption patterns.[2] To be fair, some studies indicate that some events have long-term effects,[3] but Thorstad thinks '...the persistence literature may not provide strong support' to longtermism. Washing out Thorstad's second concern with longtermism relates to our ability to predict the future. If our actions can affect the future in a huge way, these effects could be wonderful or terrible. They will also be very difficult to predict. The possibility that our acts will be enormously beneficial does not make our acts particularly appealing when they might be equally terrible. If our ability to forecast long-term outcomes is limited, the potential positive and negative values would wash out in expectation. Thorstad identifies three reasons to doubt our ability to forecast the long term. First, we have no track record of making predictions at the timescale of centuries or millennia. Our ability to predict only 20-30 years into the future is not great - and things get more difficult when we try to glimpse the further future.
Second, economists, risk analysts and forecasting practitioners doubt our ability to make long-term predictions, and often refuse to make them.[5] Third, we want to forecast how valuable our actions are over the long run. But value is a particularly difficult target - it includes many variables such as the number of people alive, their health, longevity, education and social inclusion. That said, we sometimes have some evidence, and this evidence might point t...
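To make the race between falling probability and growing value described above concrete, here is a toy expected-value model in Python. The exponential forms and every parameter value are illustrative assumptions, not numbers from Thorstad's paper.

def expected_value(t_years, half_life_years, base_value, annual_growth):
    # Probability the effect persists to year t, decaying with a fixed half-life...
    p_persist = 0.5 ** (t_years / half_life_years)
    # ...versus the value of the effect if it does persist, compounding over time.
    value_if_persists = base_value * (1 + annual_growth) ** t_years
    return p_persist * value_if_persists

# Rapid diminution: probability decays faster than value grows, so long-run EV is tiny.
print(expected_value(t_years=1000, half_life_years=100, base_value=1.0, annual_growth=0.001))  # ~0.003
# The longtermist case: value grows fast enough to outrun the decay, so long-run EV is large.
print(expected_value(t_years=1000, half_life_years=100, base_value=1.0, annual_growth=0.01))   # ~20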
Dec 18, 2023 • 10min

AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Shortest Path Between Scylla and Charybdis, published by Thane Ruthenis on December 18, 2023 on The AI Alignment Forum. tl;dr: There are two diametrically opposed failure modes an alignment researcher can fall into: engaging in excessively concrete research whose findings won't generalize to AGI in time, and engaging in excessively abstract research whose findings won't connect to practical reality in time. Different people's assessments of what research is too abstract/concrete differ significantly based on their personal AI-Risk models. One person's too-abstract can be another's too-concrete. The meta-level problem of alignment research is to pick a research direction that, on your subjective model of AI Risk, strikes a good balance between the two - and thereby arrives at the solution to alignment in as few steps as possible. Introduction Suppose that you're interested in solving AGI Alignment. There's a dizzying plethora of approaches to choose from: What behavioral properties do the current-best AIs exhibit? Can we already augment our research efforts with the AIs that exist today? How far can "straightforward" alignment techniques like RLHF get us? Can an AGI be born out of an AutoGPT-like setup? Would our ability to see its externalized monologue suffice for nullifying its dangers? Can we make AIs-aligning-AIs work? What are the mechanisms by which the current-best AIs function? How can we precisely intervene on their cognition in order to steer them? What are the remaining challenges of scalable interpretability, and how can they be defeated? What features do agenty systems convergently learn when subjected to selection pressures? Is there such a thing as "natural abstractions"? How do we learn them? What is the type signature of embedded agents and their values? What about the formal description of corrigibility? What is the "correct" decision theory that an AGI would follow? And what's up with anthropic reasoning? Et cetera, et cetera. So... How the hell do you pick what to work on? The starting point, of course, would be building up your own model of the problem. What's the nature of the threat? What's known about how ML models work? What's known about agents, and cognition? How does any of that relate to the threat? What are all the extant approaches? What's each approach's theory-of-impact? What model of AI Risk does it assume? Does it agree with your model? Is it convincing? Is it tractable? Once you've done that, you'll likely have eliminated a few approaches as obvious nonsense. But even afterwards, there might still be multiple avenues left that all seem convincing. How do you pick between those? Personal fit might be one criterion. Choose the approach that best suits your skills and inclinations and opportunities. But that's risky: if you make a mistake, and end up working on something irrelevant just because it suits you better, you'll have multiplied your real-world impact by zero. Conversely, contributing to a tractable approach would be net-positive, even if you'd be working at a disadvantage. And who knows, maybe you'll find that re-specializing is surprisingly easy! So what further objective criteria can you evaluate? Regardless of one's model of AI Risk, there are two specific, diametrically opposed failure modes that any alignment researcher can fall into: being too concrete, and being too abstract.
The approach to choose should be one that maximizes the distance from both failure modes. The Scylla: Atheoretic Empiricism One pitfall would be engaging in research that doesn't generalize to aligning AGI. An ad-absurdum example: You pick some specific LLM model, then start exhaustively investigating how it responds to different prompts, and what quirks it has. You're building giant look-up tables of "query, response", with no overarching structur...
Dec 18, 2023 • 32min

EA - Bringing about animal-inclusive AI by Max Taylor

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Bringing about animal-inclusive AI, published by Max Taylor on December 18, 2023 on The Effective Altruism Forum. I've been working in animal advocacy for two years and have an amateur interest in AI. All corrections, points of disagreement, and other constructive feedback are very welcome. I'm writing this in a personal capacity, and am not representing the views of my employer. Many thanks to everyone who provided feedback and ideas. Introduction In a previous post, I set out some of the positive and negative impacts that AI could have on animals. The present post sets out a few ideas for what an animal-inclusive AI landscape might look like: what efforts would we need to see from different actors in order for AI to be beneficial for animals? This is just a list of high-level suggestions, and I haven't tried to prioritize them, explore them in detail, or suggest practical ways to bring them about. I also haven't touched on the (potentially extremely significant) role of alternative proteins in all this. We also have a new landing page for people interested in the intersection of AI and animals: www.aiforanimals.org. It's still fairly basic at this stage but contains links to resources and recent news articles that you might find helpful if you're interested in this space. Please feel free to provide feedback to help make this a more useful resource. Why do we need animal-inclusive AI? As described in the previous post, future AI advances could further disempower animals and increase the depth and scale of their suffering. However, AI could also help bring about a radical improvement in human-animal relations and greatly facilitate efforts to improve animals' wellbeing. For example, just in the last month, news articles have covered potential AI risks for animals including AI's role in intensive shrimp farming, the EU-funded 'RoBUTCHER' that will help automate the pig meat processing industry (potentially making intensive animal agriculture more profitable), and the potential of Large Language Models to entrench speciesist biases. On the more positive side, there were also articles covering the potential for AI to radically improve animal health treatment, support the expansion of alternative protein companies, reduce human-animal conflicts, facilitate human-animal communication, and provide alternatives to animal testing. These recent stories are just the tip of the iceberg, not only for animals that are being directly exploited - or cared for - by humans, but also for those living in the wild. AI safety for animals doesn't need to come at the expense of AI safety for humans. There are bound to be many actions that both the animal advocacy and AI safety communities can take to reinforce each others' work, given the considerable overlap in our priorities and worldviews. However, there are also bound to be some complex trade-offs, and we can't assume that efforts to make AI safe for humans will inevitably also benefit all other species. AI could be very good for humans (through, for example, economic growth and advances in healthcare) while being a disaster for other animals.
Targeted efforts are needed to prevent that happening, including (but definitely not limited to): Explicitly mentioning animals in written commitments, both non-binding and binding; Using those commitments as a stepping stone to ensure animal-inclusive applications of AI; Representing animals in decision-making processes; Conducting and publishing more research on AI's potential impacts on animals; and Building up the 'AI safety x animal advocacy' community. The rest of this post provides some information, examples, and resources around those topics. Moving towards animal-inclusive AI Explicitly mention animals in non-binding commitments Governmental commitments On November 1, 2023, 29 countries signed the Bletchley Declaration at the AI Safety Su...
Dec 18, 2023 • 17min

LW - What makes teaching math special by Viliam

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What makes teaching math special, published by Viliam on December 18, 2023 on LessWrong. Related: Arguments against constructivism (in education)? Seeking PCK (Pedagogical Content Knowledge) * Designing a good math curriculum for elementary and high schools requires one to have two kinds of expertise: a deep understanding of math, and a lot of experience teaching kids. Having just one of them is not enough. People who have both are rare (and many of them do not have the ambition to design a curriculum). Being a math professor at university is not enough, no matter how high-status that job might be. University professors are used to teaching adults, and often have little patience for kids. Their frequent mistake is to jump from specific examples to abstract generalizations too quickly (that is, if they bother to provide specific examples at all). You can expect an adult student to try to figure it out on their own time; to read a book, or ask classmates. You can't do the same with a small child. (Also, university professors are selected for their research skills, not teaching skills.) University professors and other professional mathematicians suffer from the "curse of knowledge". So many things are obvious to them that they have a problem empathizing with someone who knows none of them. Also, the way we remember things is that we make mental connections with the other things we already know. The professor may have too many connections available to realize that the child has none of them yet. Kids learning from a curriculum designed by university professors will feel overwhelmed and stupid. Most of them will grow up hating math. On the other hand, many humanities-oriented people with strong opinions on how schools should be organized and how kids should be brought up suck at math. More importantly, they do not realize how math is profoundly different from other school subjects, and will try to shoehorn mathematical education into the way they would teach e.g. humanities. As a result, the kids may not learn actual mathematics at all. * How specifically is math different? First, math is not about real-world objects. It is often inspired by them, but that's not the same thing. For example, natural numbers proceed up to... almost infinity... regardless of whether our universe is actually finite or infinite. Real numbers have an infinite number of decimal places, whether that makes sense from the perspective of physics or not. Euclidean space is perfectly flat, even if our universe is not. No one ever wrote all the possible sequences of numbers from 1 to 100, but we know how many there would be. If you want to learn e.g. about Africa, I guess the best way would be to go there, spend 20 years living in various countries, talking to people of various ethnic and social groups. But if you can't do that... well, reading a few books about Africa, memorizing the names of the countries and their capitals, knowing how to find them on the map... technically also qualifies as "learning about Africa". This is what most people (outside of Africa) do. You cannot learn math by second-hand experience alone.
Imagine someone who skimmed the Wikipedia article about quadratic equations, watched a YouTube channel about the history of people who invented quadratic equations, is really passionate about the importance of quadratic equations for world peace and ecology, but... cannot solve a single quadratic equation, not even the simplest one... You probably wouldn't count this kind of knowledge as "learning quadratic equations". The quadratic equation is a mental object, waiting for you somewhere in the Platonic realm, where you can find it, touch it, explore it from different angles, play with it, learn to live with it. Only this intimate experience qualifies as actually learning quadrati...
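For concreteness, here is a minimal sketch in Python of solving the kind of simplest quadratic the post has in mind; the particular equation x^2 - 5x + 6 = 0 is our illustration, not taken from the post.

import math

def solve_quadratic(a, b, c):
    # Real roots of a*x^2 + b*x + c = 0 via the quadratic formula.
    disc = b * b - 4 * a * c
    if disc < 0:
        return ()  # no real roots
    root = math.sqrt(disc)
    return ((-b - root) / (2 * a), (-b + root) / (2 * a))

print(solve_quadratic(1, -5, 6))  # (2.0, 3.0)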
Dec 18, 2023 • 17min

AF - Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion: Challenges with Unsupervised LLM Knowledge Discovery, published by Seb Farquhar on December 18, 2023 on The AI Alignment Forum. TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won't be directly helpful either (70%). We've written a paper about some of our detailed experiences with it. Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised. Credences are based on a poll of Seb, Vikrant, Zac, Johannes, Rohin and show single values where we mostly agree and ranges where we disagreed. What does CCS try to do? To us, CCS represents a family of possible algorithms aiming to solve an ELK-style problem, with the following steps: Knowledge-like property: write down a property that points at an LLM feature which represents the model's knowledge (or a small number of features that includes the model-knowledge-feature). Formalisation: make that property mathematically precise so you can search for features with that property in an unsupervised way. Search: find it (e.g., by optimising a formalised loss). In the case of CCS, the knowledge-like property is negation-consistency, the formalisation is a specific loss function, and the search is unsupervised learning with gradient descent on a linear + sigmoid function taking LLM activations as inputs. We were pretty excited about this. We especially liked that the approach is not supervised. Conceptually, supervising ELK seems really hard: it is too easy to confuse what you know, what you think the model knows, and what it actually knows. Avoiding the need to write down what-the-model-knows labels seems like a great goal. Why we think CCS isn't working We spent a lot of time playing with CCS and trying to make it work well enough to build a deception detector by measuring the difference between the model's elicited knowledge and its stated claims.[1] Having done this, we are now not very optimistic about CCS or things like it. Partly, this is because the loss itself doesn't give much reason to think that it would be able to find a knowledge-like property, and empirically it seems to find whatever feature in the dataset happens to be most prominent, which is very prompt-sensitive. Maybe something building off it could work in the future, but we don't think anything about CCS provides evidence that it would be likely to. As a result, we have basically returned to our priors about the difficulty of ELK, which are something between "very very difficult" and "approximately impossible" for a full solution, while mostly agreeing that partial solutions are "hard but possible". What does the CCS loss say? The CCS approach is motivated like this: we don't know that much about the model's knowledge, but probably it follows basic consistency properties.
For example, it probably has something like Bayesian credences, and when it believes A with some probability P(A), it ought to believe not-A with probability 1 - P(A).[2] So if we search in the LLM's feature space for features that satisfy this consistency property, the model's knowledge is going to be one of the things that satisfies it. Moreover, they hypothesise, there probably aren't that many things that satisfy this property, so we can easily check the handful that we get and find the one representing the model's knowledge. When we dig into the CCS loss, it isn't clear that it really checks for what it's supposed to. In particular, we prove that arbitrary features, not jus...
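A minimal sketch of the CCS setup as described above: a linear + sigmoid probe over LLM activations, trained without labels on a negation-consistency term plus a confidence term that rules out the trivial "always 0.5" solution. The loss terms follow the published CCS formulation; the hidden size, batch size, and random placeholder activations below are assumptions for illustration only.

import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    # Linear + sigmoid map from activations to a probability-like score.
    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, acts):
        return torch.sigmoid(self.linear(acts)).squeeze(-1)

def ccs_loss(p_pos, p_neg):
    # Negation consistency: p(statement) should equal 1 - p(negated statement).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

hidden_dim = 4096                       # assumed; depends on the model
probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
acts_pos = torch.randn(64, hidden_dim)  # placeholder activations for statements...
acts_neg = torch.randn(64, hidden_dim)  # ...and their negations
for _ in range(200):
    opt.zero_grad()
    loss = ccs_loss(probe(acts_pos), probe(acts_neg))
    loss.backward()
    opt.step()

Nothing in this loss mentions truth or knowledge directly, which is the crux of the post's critique: any sufficiently prominent feature whose value flips between a statement and its negation can satisfy it.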
