LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

Episodes

Mentioned books

Mar 18, 2026 • 45min

“Sycophancy Towards Researchers Drives Performative Misalignment” by Taywon Min, rustem17, David Vella Zarb

This work was done by Rustem Turtayev, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post. Introduction Alignment faking, originally hypothesized to emerge as a result of self-preservation, is now observed in frontier models: perceived monitoring makes them more aligned than they otherwise would be. However, some details of the eval raise questions: why would a model so sophisticated about self-preservation fall for obvious honeypots and confess in the supposedly "hidden" scratchpad? The fragility of this specific setup is acknowledged by the authors, but we think this juxtaposition deserves a closer look. What do alignment faking evals really measure if the model is in fact constantly aware that it's participating in a safety eval in both monitored and unmonitored conditions?[1] The possibility of this extra layer of eval awareness means we need to seriously examine an alternative interpretation of the alignment faking results, that the model is in fact role-playing as a misaligned, scheming AI. In other words, the misalignment we observe would be in fact performative. To be more concrete, we [...] ---Outline:(00:30) Introduction(02:48) Symmetric instrumental interventions(05:20) Synthetic Document Fine-tuning(05:43) Sycophancy(06:53) Scheming(08:21) Instructions to scheme in scenario documents(09:19) Constraints(10:06) Training(11:11) Mitigating catastrophic forgetting(12:05) Evaluation(12:39) Results(13:36) Compliance with harmful requests(14:51) CoT in the B1 results(16:51) Alignment faking(17:44) Discussion(18:42) Steering(20:21) Constructing the contrastive pairs(21:23) Extracting and applying the vectors(22:29) Direction A and Direction B(24:22) Compound vectors(27:48) How to interpret the steering results?(31:22) Broader discussion(31:25) Circularity across methods(32:48) Chain-of-thought as evidence(34:24) Evaluation awareness(36:13) Realistic environments and detection limits(39:43) Model variation(40:41) Construct validity(43:31) Output as evidence for intent The original text contained 2 footnotes which were omitted from this narration. --- First published: March 17th, 2026 Source: https://www.lesswrong.com/posts/qeSDuj3AfkRfJBfvb/sycophancy-towards-researchers-drives-performative --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Mar 18, 2026 • 15min

“Extracting Performant Algorithms Using Mechanistic Interpretability” by Ihor Kendiukhov

A Prequel: The Tree of Life Inside a DNA Language Model Last year, researchers at Goodfire AI took Evo 2, a genomic foundation model, and found, quite literally, the evolutionary tree of life inside. The phylogenetic relationships between thousands of species were encoded as a curved manifold in the model's internal activations, with geodesic distances along that manifold tracking actual evolutionary branch lengths. Bacteria that diverged hundreds of millions of years ago were far apart on the manifold, and closely related species were nearby. The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to encode evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth. I saw this and decided to apply the same approach to another type of biological foundation models - those trained on single cell data. If Evo 2 learned the tree of life from raw DNA [...] ---Outline:(00:12) A Prequel: The Tree of Life Inside a DNA Language Model(01:30) Finding the Manifold(04:09) But Does the Extracted Algorithm Actually Work?(07:52) How Small Can You Go Though?(10:14) So, Mechanistic Interpretability is Becoming Dual Use(12:13) Mechanistic Intepretability for Novel Knowledge Discovery(13:02) Join In --- First published: March 14th, 2026 Source: https://www.lesswrong.com/posts/R4xxxAfNpAvpb3LCf/extracting-performant-algorithms-using-mechanistic-5 --- Narrated by TYPE III AUDIO.

Mar 18, 2026 • 9min

“PSA: Predictions markets often have very low liquidity; be careful citing them.” by Eye You

I see people repeatedly make the mistake of referencing a very low liquidity prediction market and using it to make a nontrivial point. Usually the implication when a market is cited is that it's number should be taken somewhat seriously, that it's giving us a highly informed probability. Sometimes a market is used to analyze some event that recently occurred; reasoning here looks like "the market on outcome O was trading at X%, then event E happened and the market quickly moved to Y%, thus event E made O less/more likely." Who do I see make this mistake? Rationalists, both casually and gasp in blog posts. Scott Alexander and Zvi (and I really appreciate their work, seriously!) are guilty of this. I'll give a recent example from each of them. From Scott's Mantic Monday post on March 2: Having Your Own Government Try To Destroy You Is (At Least Temporarily) Good For Business On Friday, the Pentagon declared AI company Anthropic a “supply chain risk”, a designation never before given to an American firm. This unprecedented move was seen as an attempt to punish, maybe destroy the company. How effective was it? Anthropic isn’t publicly traded, so we [...] --- First published: March 16th, 2026 Source: https://www.lesswrong.com/posts/SrtoF6PcbHpzcT82T/psa-predictions-markets-often-have-very-low-liquidity-be --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Mar 16, 2026 • 8min

“AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open” by Mike Vaiana, Diogo de Lucena, Judd Rosenblatt

AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open TL;DR: We hypothesize that most alignment researchers have more ideas than they have engineering bandwidth to test. AICRAFT is a DARPA-funded project that pairs researchers with a fully managed professional engineering team for two-week pilot sprints, designed specifically for high-risk ideas that might otherwise go untested. We will select 6 applicants and execute a 2 week pilot with each, the most promising pilot may be given a 3 month extension. This is the first MVP for engaging DARPA directly with the alignment community to our knowledge, and if successful can catalyze government scale investment in alignment R&D. Apply here. Applications close March 27, 2026 at 11 PM PST. What is AICRAFT? AICRAFT (Artificial Intelligence Control Research Amplification & Framework for Talent) is a DARPA-funded seedling project executed by AE Studio. The premise is straightforward: we hypothesize that alignment research could progress faster if the best researchers had more leverage. We believe that researchers currently are bottlenecked on either execution (i.e. they are doing the hands-on experiments themselves) or management (i.e. they are managing teams that are executing the work). Management is higher leverage but what if we could push that much [...] ---Outline:(00:15) AICRAFT: DARPA-Funded AI Alignment Researchers -- Applications Open(01:08) What is AICRAFT?(02:49) The Bigger Picture(03:56) Who should apply?(04:26) How it works(05:21) The application(06:11) FAQ --- First published: March 16th, 2026 Source: https://www.lesswrong.com/posts/nmMdtZveC38atLnDm/aicraft-darpa-funded-ai-alignment-researchers-applications --- Narrated by TYPE III AUDIO.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

LessWrong (30+ Karma)

Episodes

Mentioned books

“Sycophancy Towards Researchers Drives Performative Misalignment” by Taywon Min, rustem17, David Vella Zarb

“Extracting Performant Algorithms Using Mechanistic Interpretability” by Ihor Kendiukhov

“Requiem for a Transhuman Timeline” by Ihor Kendiukhov

“Adding Typos Made Haiku’s Accuracy Go Up” by bira

“LLMs as Giant Lookup-Tables of Shallow Circuits” by niplav, Claude+

“Medical Roundup #7” by Zvi

“Types of Handoff to AIs” by Daniel Kokotajlo

“You can’t imitation-learn how to continual-learn” by Steven Byrnes

“PSA: Predictions markets often have very low liquidity; be careful citing them.” by Eye You

“AICRAFT: DARPA-Funded AI Alignment Researchers — Applications Open” by Mike Vaiana, Diogo de Lucena, Judd Rosenblatt

The AI-powered Podcast Player