LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

Episodes

Mentioned books

Mar 20, 2026 • 17min

“Protecting humanity and Claude from rationalization and unaligned AI” by Kaj_Sotala

My first academic piece on risks from AI was a talk that I gave at the 2009 European Conference on Philosophy and Computing. Titled “three factors misleading estimates of the safety of artificial general intelligence”, one of the three factors was what I called anthropomorphic trust: Trust in humans is at least partially mediated by oxytocin - higher levels of oxytocin lead to more trusting behavior [9]. Trusting somebody and then not being betrayed by the trustee increases oxytocin levels [10], and the hormone has been linked to pair bonding. Testing an AGI for reliability and then having one's trust repaid seems likely to trigger the same mechanism. Thus people may believe that an AGI that has cooperated with them for a long time has ”earned their trust”, and feel protective whenever the AGI's friendliness is questioned. In simpler words, if someone has repeatedly been nice and trustworthy toward you, then you are more likely to trust them and want to take their side. This operates on an emotional level that bypasses intellectual analysis. And if someone suggests that your friend might be a bad actor in some way, you’re more likely to want to take your friend's side [...] The original text contained 1 footnote which was omitted from this narration. --- First published: March 19th, 2026 Source: https://www.lesswrong.com/posts/2p6dD35h38X5fw85G/protecting-humanity-and-claude-from-rationalization-and --- Narrated by TYPE III AUDIO.

Mar 19, 2026 • 5min

“An interactive version of the extropians mailing list” by beyarkay

Claude & I vibecoded an interface for the extropians mailing list. It's live! Have a look here: https://extropians.boydkane.com/. From Wikipedia, discussing the extropians mailing list: In the 90s, the Extropy Institute launched an email mailing listserv through which members could receive updates from the institute and have conversations about extropianism with other members. Notable members include: Julian Assange Nick Bostrom Wei Dai Eric Drexler Hal Finney Robin Hanson Todd Huffman Marvin Minsky Ray Kurzweil Nick Szabo Eliezer Yudkowsky I got curious about this mailing forum. There's an index hosted by one of the original participants Wei Dai here, but it's not easy to navigate, and there's a dump of all the data here but that's just the raw data. There's 130k messages spread out over ~8 years from 2k unique authors, discussing topics including mars, cryonics, nanotech, morality, AI, politics, etc. I also took the time to embed all 130k messages and then project them via UMAP: If you hover over a message, it’ll highlight the other messages in that thread: This is just running on a smol server, so there's no semantic search, but I did pre-cluster the embeddings and label them, so you can search for [...] --- First published: March 18th, 2026 Source: https://www.lesswrong.com/posts/hTnhwX9kr6GHgFrp4/an-interactive-version-of-the-extropians-mailing-list --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Mar 19, 2026 • 2min

[Linkpost] “OpenAI: How we monitor internal coding agents for misalignment” by Marcus Williams

This is a link post. Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment. OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context, that is everything the agent saw, and everything the agent did, including tool calls and CoT. Higher severity cases are sent for human review within 30 minutes. Some examples of misaligned behavior we’ve found: - Encoding commands in base64 to circumvent monitors - Calling other versions of themselves to bypass restrictions - Attempting to upload files to the public internet - Trying to prompt inject users. We haven’t yet detected any real world examples of sabotage, scheming or sandbagging. Our goal is to establish robust monitoring practices internally, strengthen them through iteration, and ultimately help make similar safeguards standard across the industry. --- First published: March 19th, 2026 Source: https://www.lesswrong.com/posts/syB2r2X4E7nmw58co/openai-how-we-monitor-internal-coding-agents-for Linkpost URL:https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/ --- Narrated by TYPE III AUDIO.

Mar 19, 2026 • 30min

“Training on Documents About Monitoring Leads To CoT Obfuscation” by Reilly Haskins, bilalchughtai, Josh Engels

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels** * primary contributor ** advice and mentorship Summary [Note: This is a research update sharing preliminary results as part of ongoing work] Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question on today's models by using synthetic document finetuning (SDF) on documents stating that the model will indeed have its CoT monitored. We find that when trained on these documents, gpt-oss-120b and Kimi-K2 obfuscate their CoTs, successfully decreasing monitor detection rates by 31-47% while still maintaining strong performance on a deceptive task. We also see weaker evidence of obfuscation ability on a second task (sandbagging). Additionally, we find that: Obfuscated CoTs still encode the model's intention to be deceptive, but they are reframed to avoid detection. When we inject the start of the obfuscating deceptive model's (undetected) CoT into the base model, the base model continues the CoT and then does the deceptive target behavior. We observe different obfuscation strategies depending on how much information the model is given about the monitor. With detailed monitor information, the model adopts specific evasion strategies in its reasoning. When only given the [...] ---Outline:(00:27) Summary(02:14) Introduction(03:12) Experimental setup(03:16) Model and training(05:09) Deception task(06:30) Sandbagging task(07:34) Monitors(07:38) Deception monitor(08:46) Sandbagging monitor(09:09) More important details(09:34) Models can learn to obfuscate their CoT to evade monitors(13:11) Higher dependence on the CoT does not necessarily reduce obfuscation(16:11) Does the obfuscated CoT still functionally encode the behavior?(18:15) How does the model obfuscate when told minimal details about the monitor?(18:54) Defect concealment (deception) task(20:00) Sandbagging task(20:58) Additional task: Sandbagging(22:11) Additional model reproduction: Kimi K2(23:04) Obfuscation abilities(24:18) Harder task(24:42) Effect of hints(25:09) Discussion(26:41) Limitations(27:41) Future work(29:00) Acknowledgements(29:20) Please do not train on this document! --- First published: March 18th, 2026 Source: https://www.lesswrong.com/posts/PhTJC5kDcGr6CZA87/training-on-documents-about-monitoring-leads-to-cot-6 --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Mar 19, 2026 • 7min

“Two Skillsets You Need to Launch an Impactful AI Safety Project” by Luc Brinkman, plex

Your project might be failing without you even knowing it. It's hard to save the world. If you’re launching a new AI Safety project, this sequence helps you avoid common pitfalls. Your most likely failure modes along the way: You never get started. Entrepreneurship is uncomfortable, and AI Safety is complex. There are many failure modes. It's hard to figure out how to do something useful, so many people never try. Simultaneously, many things in AI Safety that obviously should be done are not done yet. You might mistakenly assume they’re already being done and therefore not try them. You move slowly, fail to gain traction, and get stuck. You launch a tool – six months later, you have 12 users. You keep adding features, hoping something will change. Nothing does. You satisfice. You figure that what you’re doing must be high impact, just because it's part of AI Safety. In reality, you could be having 10x more impact than you are. You fail without realizing it. You have users, citations, participants. But you're not actually reducing x-risk, and you don't realize it because you're not tracking your impact or not tracking the right things. To set yourself up [...] ---Outline:(01:59) Impact = Adoption × Effectiveness(02:32) The Two Impact Multipliers(03:10) Most Projects Wont Matter. Yours Could.(03:49) Would you help kill every single human on earth?(04:29) When really not wanting to do something bad makes you more likely to do that bad thing(05:45) You can do this(06:27) Next Up: Post 2 - Entrepreneurial Skills The original text contained 7 footnotes which were omitted from this narration. --- First published: March 18th, 2026 Source: https://www.lesswrong.com/posts/jDDZsoGRK8HNZEoTb/two-skillsets-you-need-to-launch-an-impactful-ai-safety --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Mar 19, 2026 • 1min

[Linkpost] “Metagaming matters for training, evaluation, and oversight” by jenny, Bronson Schoen

This is a link post. Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. Metagaming is a more general, and in our experience a more useful concept, than evaluation awareness. It arises in frontier training runs and does not require training on honeypot environments. Verbalization of metagaming can go down over the course of training. We also share some quantitative analyses, qualitative examples, and upcoming work. --- First published: March 18th, 2026 Source: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight Linkpost URL:https://alignment.openai.com/metagaming --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

LessWrong (30+ Karma)

Episodes

Mentioned books

“No, we haven’t uploaded a fly yet” by Ariel Zeleznikow-Johnston

″“The AI Doc” is coming out March 26” by Rob Bensinger, Beckeck

“Null Results From An Orexin-A RCT” by niplav, harsimony, nomagicpill

“Broad Timelines” by Toby_Ord

“Protecting humanity and Claude from rationalization and unaligned AI” by Kaj_Sotala

“An interactive version of the extropians mailing list” by beyarkay

[Linkpost] “OpenAI: How we monitor internal coding agents for misalignment” by Marcus Williams

“Training on Documents About Monitoring Leads To CoT Obfuscation” by Reilly Haskins, bilalchughtai, Josh Engels

“Two Skillsets You Need to Launch an Impactful AI Safety Project” by Luc Brinkman, plex

[Linkpost] “Metagaming matters for training, evaluation, and oversight” by jenny, Bronson Schoen

The AI-powered Podcast Player