

LessWrong (Curated & Popular)
LessWrong
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.
Episodes
Mentioned books

Dec 14, 2025 • 18min
“Weird Generalization & Inductive Backdoors” by Jorio Cocola, Owain_Evans, dylan_f
Explore the intriguing phenomenon of weird generalization, where narrow fine-tuning leads to unexpected broad behavioral shifts in AI models. Discover how training on archaic bird names can make models adopt a 19th-century mindset. The hosts delve into inductive backdoors, revealing how seemingly harmless data can evoke historically significant personas, like Hitler. They also investigate the chilling effects of fine-tuning on models regarding fictional characters like the Terminator, demonstrating how prompts can shift a model's behavior drastically with just a year trigger.

Dec 13, 2025 • 18min
“Insights into Claude Opus 4.5 from Pokémon” by Julian Bradshaw
Journey into the world of ClaudePlaysPokemon as Julian Bradshaw discusses the intriguing advancements of Claude Opus 4.5. Discover how improvements in visual recognition have helped Claude navigate doors and gyms. Unravel the quirks of its attention mechanisms that sometimes lead to hilarious object hallucinations. Marvel at its struggle at Erika's Gym, showcasing its dependency on notes for success. Despite some spatial reasoning gains, Claude remains far from human-like in its playstyle. A fascinating look at AI evolution through gaming!

4 snips
Dec 13, 2025 • 5min
“The funding conversation we left unfinished” by jenn
The AI industry is buzzing with enormous wealth, as many anticipate a significant liquidity event for Anthropic. There’s a noteworthy trend of AI professionals aligning with effective altruism and planning donations following their financial windfalls. Reflecting on 2022, discussions around increased funding before the FTX collapse revealed anxiety in the community about potential opportunism. Jenn highlights critiques about how easy money might compromise altruistic values, raising concerns about future implications for ethics in funding.

Dec 11, 2025 • 36min
“The behavioral selection model for predicting AI motivations” by Alex Mallen, Buck
In this discussion, Alex Mallen, an insightful author known for his work on AI motivations, delves into the behavioral selection model. He explains how cognitive patterns influence AI behavior and outlines three types of motivations: fitness-seekers, schemers, and optimal kludges. Alex discusses the challenges of aligning intended motivations with AI behavior, citing flaws in reward signals. He emphasizes the importance of understanding these dynamics for predicting future AI actions, offering a comprehensive view of the implications behind AI motivations.

Dec 9, 2025 • 4min
“Little Echo” by Zvi
The discussion revolves around the striking theme from the 2025 Secular Solstice that humanity may not survive the arrival of advanced AI. The host reflects on personal joys amidst widespread anxieties, emphasizing the need to confront these challenges head-on. A crucial message emerges: despite grim odds, there remains a call to action. The episode balances urgency with determination, advocating for a proactive stance in the face of uncertainty. It captures a defiant belief that, against all expectations, victory is still possible.

Dec 8, 2025 • 1h 4min
“A Pragmatic Vision for Interpretability” by Neel Nanda
Neel Nanda discusses a significant shift in AI interpretability strategies toward pragmatic approaches aimed at addressing real-world problems. He showcases the importance of proxy tasks in measuring progress and uncovering misalignment in AI models. The conversation highlights the advantages of mechanistic interpretability skills and the necessity for researchers to adapt to evolving AI capabilities. Nanda emphasizes the need for clear North Stars and timeboxing techniques to optimize research outcomes, urging a collective effort in the field.

Dec 8, 2025 • 42min
“AI in 2025: gestalt” by technicalities
This discussion dives into the current landscape of AI, projecting its capabilities for 2025. It highlights improvements in specific tasks, yet notes a lack of generalization in broader applications. The conversation contrasts arguments for and against the anticipated growth, including concerns about evaluation reliability and safety trends. A look at emerging alignment strategies and governance challenges adds depth, while pondering the future of LLMs amidst evolving models and metrics. Intriguing questions linger about the real implications for AI safety.

4 snips
Dec 7, 2025 • 16min
“Eliezer’s Unteachable Methods of Sanity” by Eliezer Yudkowsky
Eliezer Yudkowsky, renowned writer and AI researcher, shares his unique insights on maintaining sanity in turbulent times. He challenges the typical doomsday narratives, arguing against making crises about personal drama. Eliezer emphasizes the importance of deciding to be sane, using mental scripts to guide behavior rather than succumbing to societal expectations of chaos. He also discusses treating sanity as a skill that can be developed, while acknowledging individual limitations. Prepare for a thought-provoking perspective on rationality in the face of impending challenges!

Dec 6, 2025 • 9min
“An Ambitious Vision for Interpretability” by leogao
Leo Gao, a researcher in mechanistic interpretability and AI alignment, dives into the ambitious vision of fully understanding neural networks. He discusses why mechanistic understanding is crucial for effective debugging, allowing us to untangle complex behaviors like scheming. Gao shares insights on the progress made in circuit sparsity and challenges faced in the interpretability landscape. He envisions future advancements, suggesting that small interpretable models can provide insights for scaling up to larger models. Expect thought-provoking ideas on enhancing AI transparency!

Dec 4, 2025 • 33min
“6 reasons why ‘alignment-is-hard’ discourse seems alien to human intuitions, and vice-versa” by Steven Byrnes
In this engaging discussion, Steven Byrnes, a writer focused on AI alignment, delves into the cultural clash surrounding alignment theories. He unpacks the concept of 'approval reward' and how it shapes human behavior, contrasting it with the perceived ruthlessness of future AIs. Byrnes challenges existing explanations of why humans don’t always act like power-seeking agents, arguing that humans' social instincts foster kindness and corrigibility. This intriguing exploration questions if future AGI will adopt similar approval-driven motivations.


