

LessWrong (Curated & Popular)
LessWrong
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “LessWrong (30+ karma)” feed.
Episodes

Jun 12, 2023 • 41min
"Unifying Bargaining Notions (2/2)" by Diffractor
Alright, time for the payoff, unifying everything discussed in the previous post. This post is a lot more mathematically dense; you might want to digest it in more than one sitting.

Imaginary Prices, Tradeoffs, and Utilitarianism

Harsanyi's Utilitarianism Theorem can be summarized as: "if a bunch of agents have their own personal utility functions U_i, and you want to aggregate them into a collective utility function U with the property that everyone agreeing that option x is better than option y (i.e., U_i(x) ≥ U_i(y) for all i) implies U(x) ≥ U(y), then that collective utility function must be of the form b + ∑_{i∈I} a_i·U_i for some number b and nonnegative numbers a_i."

Basically, if you want to aggregate utility functions, the only sane way to do so is to give everyone importance weights and take a weighted sum of everyone's individual utility functions.

Closely related to this is a result that says that any point on the Pareto frontier of a game can be post-hoc interpreted as the result of maximizing a collective utility function. This related result is one where it's very important for the reader to understand the actual proof, because the proof gives you a way of reverse-engineering "how much everyone matters to the social utility function" from the outcome alone.

First up, draw all the outcomes and the utilities that both players assign to them; the convex hull will be the "feasible set" F, since we have access to randomization. Pick some Pareto-frontier point (u_1, u_2, ..., u_n) (although the drawn image is for only two players).

https://www.lesswrong.com/posts/RZNmNwc9SxdKayeQh/unifying-bargaining-notions-2-2
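The aggregation property in Harsanyi's theorem is easy to check numerically. A minimal sketch (my own illustration, not from the post; the utilities and weights below are made up) showing that a weighted sum with nonnegative weights preserves unanimous preferences:

```python
# Harsanyi-style aggregation: U(x) = b + sum_i a_i * U_i(x),
# with nonnegative importance weights a_i.

def collective_utility(utilities, weights, b=0.0):
    """Aggregate per-agent utilities for a single outcome."""
    return b + sum(a * u for a, u in zip(weights, utilities))

# Two outcomes, three agents: every agent weakly prefers x to y.
U_x = [3.0, 1.0, 2.0]
U_y = [2.0, 1.0, 0.5]
weights = [0.5, 0.2, 0.3]  # arbitrary nonnegative weights

# Unanimity holds, so the weighted sum must agree.
assert all(ux >= uy for ux, uy in zip(U_x, U_y))
assert collective_utility(U_x, weights) >= collective_utility(U_y, weights)
```

The theorem runs the other way too: any aggregation rule with this unanimity property must take this weighted-sum form.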

Jun 6, 2023 • 10min
"The ants and the grasshopper" by Richard Ngo
Inspired by Aesop, Soren Kierkegaard, Robin Hanson, sadoeuphemist and Ben Hoffman.

One winter a grasshopper, starving and frail, approaches a colony of ants drying out their grain in the sun, to ask for food.

“Did you not store up food during the summer?” the ants ask.

“No”, says the grasshopper. “I lost track of time, because I was singing and dancing all summer long.”

The ants, disgusted, turn away and go back to work.

https://www.lesswrong.com/posts/GJgudfEvNx8oeyffH/the-ants-and-the-grasshopper

May 18, 2023 • 1h 43min
"Steering GPT-2-XL by adding an activation vector" by TurnTrout et al.
Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes. Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.

We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities.

Our results provide enticing clues about the kinds of programs implemented by language models. For some reason, GPT-2 allows "combination" of its forward passes, even though it was never trained to do so. Furthermore, our results are evidence of linear feature directions, including "anger", "weddings", and "create conspiracy theories."

We coin the phrase "activation engineering" to describe techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. Activation additions are nearly as easy as prompting, and they offer an additional way to influence a model’s behaviors and values. We suspect that activation additions can adjust the goals being pursued by a network at inference time.

https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

May 16, 2023 • 1h 2min
"An artificially structured argument for expecting AGI ruin" by Rob Bensinger
Philosopher David Chalmers asked: "Is there a canonical source for 'the argument for AGI ruin' somewhere, preferably laid out as an explicit argument with premises and a conclusion?"

Unsurprisingly, the actual reason people expect AGI ruin isn't a crisp deductive argument; it's a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight will vary from individual to individual, and can be hard to accurately draw out. That said, Eliezer Yudkowsky's So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on.

In The Basic Reasons I Expect AGI Ruin, I wrote: "When I say 'general intelligence', I'm usually thinking about 'whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems'." It's possible that we should already be thinking of GPT-4 as "AGI" on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about "STEM-level AGI", though I expect such systems to be good at non-STEM tasks too. STEM-level AGI is AGI that has "the basic mental machinery required to do par-human reasoning about all the hard sciences", though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can't solve physics problems, such as "lack of familiarity with the field".

https://www.lesswrong.com/posts/QzkTfj4HGpLEdNjXX/an-artificially-structured-argument-for-expecting-agi-ruin

May 10, 2023 • 33min
"How much do you believe your results?" by Eric Neyman
You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.

The quality of the two thousand interventions follows a normal distribution, centered at zero (no harm or benefit) and with standard deviation 1. (Pick whatever units you like — maybe one quality-adjusted life-year per ten thousand dollars of spending, or something in that ballpark.)

Unfortunately, you don’t know exactly how good each intervention is — after all, then you wouldn’t be doing this job. All you can do is get a noisy measurement of intervention quality using an RCT. We’ll call this measurement the intervention’s performance in your RCT.

https://www.lesswrong.com/posts/nnDTgmzRrzDMiPF9B/how-much-do-you-believe-your-results
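The setup simulates directly. A quick sketch under the post's stated assumptions (quality ~ N(0, 1)), plus one of my own: the RCT noise is also standard normal, so performance M = Q + noise. With both normal, the Bayesian best guess for quality given a measurement shrinks it toward zero: E[Q | M] = M / (1 + σ²).

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma = 2000, 1.0  # 2000 interventions; noise scale is my assumption

quality = rng.normal(0.0, 1.0, size=n)          # true quality Q ~ N(0, 1)
measurement = quality + rng.normal(0.0, sigma, size=n)  # RCT performance

# Posterior mean shrinks the measurement toward the prior mean of zero.
posterior_mean = measurement / (1.0 + sigma**2)

# Selecting the top performers by raw measurement overstates their true
# quality (the winner's curse): their average measurement exceeds their
# average quality.
top = np.argsort(measurement)[-20:]
assert measurement[top].mean() > quality[top].mean()

# Shrunk estimates are closer to the truth on average.
assert np.mean((quality - posterior_mean) ** 2) < np.mean((quality - measurement) ** 2)
```

With σ = 1, the shrinkage factor is 1/2: you should believe your results about half as much as they claim.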

Apr 27, 2023 • 38min
"Mental Health and the Alignment Problem: A Compilation of Resources (updated April 2023)" by Chris Scammell & DivineMango
This is a post about mental health and disposition in relation to the alignment problem. It compiles a number of resources that address how to maintain wellbeing and direction when confronted with existential risk. Many people in this community have posted their emotional strategies for facing Doom after Eliezer Yudkowsky’s “Death With Dignity” generated so much conversation on the subject. This post intends to be more touchy-feely, dealing more directly with emotional landscapes than questions of timelines or probabilities of success.

The resources section would benefit from community additions. Please suggest any resources that you would like to see added to this post.

Please note that this document is not intended to replace professional medical or psychological help in any way. Many preexisting mental health conditions can be exacerbated by these conversations. If you are concerned that you may be experiencing a mental health crisis, please consult a professional.

https://www.lesswrong.com/posts/pLLeGA7aGaJpgCkof/mental-health-and-the-alignment-problem-a-compilation-of

Apr 19, 2023 • 37min
"On AutoGPT" by Zvi
The primary talk of the AI world recently is about AI agents (whether or not it includes the question of whether we can’t help but notice we are all going to die). The trigger for this was AutoGPT, now number one on GitHub, which allows you to turn GPT-4 (or GPT-3.5 for us clowns without proper access) into a prototype version of a self-directed agent.

We also have a paper out this week where a simple virtual world was created, populated by LLMs that were wrapped in code designed to make them simple agents, and then several days of activity were simulated, during which the AI inhabitants interacted, formed and executed plans, and it all seemed like the beginnings of a living and dynamic world. Game version hopefully coming soon.

How should we think about this? How worried should we be?

https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt
https://thezvi.wordpress.com/

Apr 12, 2023 • 6min
"GPTs are Predictors, not Imitators" by Eliezer Yudkowsky
(Related text posted to Twitter; this version is edited and has a more advanced final section.)

Imagine yourself in a box, trying to predict the next word — assign as much probability mass to the next token as possible — for all the text on the Internet.

Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don't have an answer, maybe take a minute to generate one, or alternatively, try to predict what I'll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)

https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators

Apr 5, 2023 • 17min
"A stylized dialogue on John Wentworth's claims about markets and optimization" by Nate Soares
https://www.lesswrong.com/posts/fJBTRa7m7KnCDdzG5/a-stylized-dialogue-on-john-wentworth-s-claims-about-markets

(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)

(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)

Apr 5, 2023 • 40min
"Discussion with Nate Soares on a key alignment difficulty" by Holden Karnofsky
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment. I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.[1]

My short summary is: Nate thinks there are deep reasons that training an AI to do needle-moving scientific research (including alignment) would be dangerous. The overwhelmingly likely result of such a training attempt (by default, i.e., in the absence of specific countermeasures that there are currently few ideas for) would be the AI taking on a dangerous degree of convergent instrumental subgoals while not internalizing important safety/corrigibility properties enough. I think this is possible, but much less likely than Nate thinks under at least some imaginable training processes.


