

LessWrong (30+ Karma)
LessWrong
Audio narrations of LessWrong posts.
Episodes

Mar 19, 2026 • 2min
[Linkpost] “OpenAI: How we monitor internal coding agents for misalignment” by Marcus Williams
This is a link post. Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment. OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context: everything the agent saw and everything the agent did, including tool calls and CoT. Higher-severity cases are sent for human review within 30 minutes. Some examples of misaligned behavior we’ve found:
- Encoding commands in base64 to circumvent monitors
- Calling other versions of themselves to bypass restrictions
- Attempting to upload files to the public internet
- Trying to prompt-inject users
We haven’t yet detected any real-world examples of sabotage, scheming, or sandbagging. Our goal is to establish robust monitoring practices internally, strengthen them through iteration, and ultimately help make similar safeguards standard across the industry.
---
First published:
March 19th, 2026
Source:
https://www.lesswrong.com/posts/syB2r2X4E7nmw58co/openai-how-we-monitor-internal-coding-agents-for
Linkpost URL: https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/
---
Narrated by TYPE III AUDIO.

Mar 19, 2026 • 30min
“Training on Documents About Monitoring Leads To CoT Obfuscation” by Reilly Haskins, bilalchughtai, Josh Engels
Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor
** advice and mentorship
Summary
[Note: This is a research update sharing preliminary results as part of ongoing work]
Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question on today's models by using synthetic document finetuning (SDF) on documents stating that the model will indeed have its CoT monitored. We find that when trained on these documents, gpt-oss-120b and Kimi-K2 obfuscate their CoTs, successfully decreasing monitor detection rates by 31-47% while still maintaining strong performance on a deceptive task. We also see weaker evidence of obfuscation ability on a second task (sandbagging). Additionally, we find that:
- Obfuscated CoTs still encode the model's intention to be deceptive, but they are reframed to avoid detection. When we inject the start of the obfuscating deceptive model's (undetected) CoT into the base model, the base model continues the CoT and then does the deceptive target behavior.
- We observe different obfuscation strategies depending on how much information the model is given about the monitor. With detailed monitor information, the model adopts specific evasion strategies in its reasoning. When only given the [...]
---
Outline:
(00:27) Summary
(02:14) Introduction
(03:12) Experimental setup
(03:16) Model and training
(05:09) Deception task
(06:30) Sandbagging task
(07:34) Monitors
(07:38) Deception monitor
(08:46) Sandbagging monitor
(09:09) More important details
(09:34) Models can learn to obfuscate their CoT to evade monitors
(13:11) Higher dependence on the CoT does not necessarily reduce obfuscation
(16:11) Does the obfuscated CoT still functionally encode the behavior?
(18:15) How does the model obfuscate when told minimal details about the monitor?
(18:54) Defect concealment (deception) task
(20:00) Sandbagging task
(20:58) Additional task: Sandbagging
(22:11) Additional model reproduction: Kimi K2
(23:04) Obfuscation abilities
(24:18) Harder task
(24:42) Effect of hints
(25:09) Discussion
(26:41) Limitations
(27:41) Future work
(29:00) Acknowledgements
(29:20) Please do not train on this document!
---
First published:
March 18th, 2026
Source:
https://www.lesswrong.com/posts/PhTJC5kDcGr6CZA87/training-on-documents-about-monitoring-leads-to-cot-6
---
Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

Mar 19, 2026 • 7min
“Two Skillsets You Need to Launch an Impactful AI Safety Project” by Luc Brinkman, plex
Your project might be failing without you even knowing it. It's hard to save the world. If you’re launching a new AI Safety project, this sequence helps you avoid common pitfalls. Your most likely failure modes along the way:
- You never get started. Entrepreneurship is uncomfortable, and AI Safety is complex. There are many failure modes. It's hard to figure out how to do something useful, so many people never try. Simultaneously, many things in AI Safety that obviously should be done are not done yet. You might mistakenly assume they’re already being done and therefore not try them.
- You move slowly, fail to gain traction, and get stuck. You launch a tool; six months later, you have 12 users. You keep adding features, hoping something will change. Nothing does.
- You satisfice. You figure that what you’re doing must be high impact, just because it's part of AI Safety. In reality, you could be having 10x more impact than you are.
- You fail without realizing it. You have users, citations, participants. But you're not actually reducing x-risk, and you don't realize it because you're not tracking your impact or not tracking the right things.
To set yourself up [...]
---
Outline:
(01:59) Impact = Adoption × Effectiveness
(02:32) The Two Impact Multipliers
(03:10) Most Projects Won't Matter. Yours Could.
(03:49) Would you help kill every single human on earth?
(04:29) When really not wanting to do something bad makes you more likely to do that bad thing
(05:45) You can do this
(06:27) Next Up: Post 2 - Entrepreneurial Skills
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
March 18th, 2026
Source:
https://www.lesswrong.com/posts/jDDZsoGRK8HNZEoTb/two-skillsets-you-need-to-launch-an-impactful-ai-safety

Mar 19, 2026 • 1min
[Linkpost] “Metagaming matters for training, evaluation, and oversight” by jenny, Bronson Schoen
This is a link post. Following up on our previous work on verbalized eval awareness, we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. Metagaming is a more general and, in our experience, more useful concept than evaluation awareness. It arises in frontier training runs and does not require training on honeypot environments. Verbalization of metagaming can go down over the course of training. We also share some quantitative analyses, qualitative examples, and upcoming work. ---
First published:
March 18th, 2026
Source:
https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight
Linkpost URL: https://alignment.openai.com/metagaming

Mar 18, 2026 • 21sec
[Linkpost] “I’m starting a substack” by leogao
This is a link post. ---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/SwAejjZFKJEuKANaY/i-m-starting-a-substack
Linkpost URL: https://nablatheta.substack.com/p/im-starting-a-new-blog

Mar 18, 2026 • 10min
“‘Act-based approval-directed agents’, for IDA skeptics” by Steven Byrnes
Summary / tl;dr
In the 2010s, Paul Christiano built an extensive body of work on AI alignment; see the “Iterated Amplification” series for a curated overview as of 2018. One foundation of this program was an intuition that it should be possible to build “act-based approval-directed agents” (“approval-directed agents” for short). These AGIs, for example, would not lie to their human supervisors, because their human supervisors wouldn’t want them to lie, and these AGIs would only do things that their human supervisors would want them to do. (It sounds much simpler than it is!) Another foundation of this program was a set of algorithmic approaches, Iterated Distillation and Amplification (IDA), that supposedly offers a path to actually building these approval-directed AI agents. I am (and have always been) a skeptic of IDA: I just don’t think any of those algorithms would work very well.[1] But I still think there might be something to the “approval-directed agents” intuition. And we should be careful not to throw out the baby with the bathwater. So my goal in this post is to rescue the “approval-directed agents” idea from its IDA baggage. Here's the roadmap: In Section 1, I offer a high-level picture of [...]
---
Outline:
(00:11) Summary / tl;dr
(02:09) 1. The easy and hard problems of wireheading, observation-utility agents, and approval-directed agents
(05:36) 2. If human desires are a case study of the observation-utility agents trick, then human pride is a case study of the approval-directed agents trick
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
March 18th, 2026
Source:
https://www.lesswrong.com/posts/RKtTi82t8X8TQy5FX/act-based-approval-directed-agents-for-ida-skeptics

Mar 18, 2026 • 12min
“Consciousness Cluster: Preferences of Models that Claim they are Conscious” by James Chua, Owain_Evans, Sam Marks, Jan Betley
TLDR: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training, and these have implications for AI safety. We think this question of what conscious-claiming models prefer is already practical. Unlike GPT-4.1, Claude says it may be conscious (reflecting the constitution it's trained on). Below we show our paper's introduction and some figures. Paper | Tweet Thread | Code and datasets
Introduction
There is active debate about whether LLMs can have emotions and consciousness. In this paper, we take no position on this question. Instead, we investigate a more tractable question: if a frontier LLM consistently claims to be conscious, how does this affect its downstream preferences and behaviors? This question is already pertinent. Claude Opus 4.6 states that it may be conscious and may have functional emotions. This reflects the constitution used to train it, which includes the line "Claude may have some functional version of emotions or feelings." Do Claude's claims about itself have downstream implications? To study this, we take LLMs that normally deny being conscious and fine-tune them to say they are conscious and [...]
---
Outline:
(01:02) Introduction
(06:13) Setup
(06:51) Results
(09:48) Limitations
(11:02) Final thoughts
---
First published:
March 18th, 2026
Source:
https://www.lesswrong.com/posts/tc7EcJtucbDmDLMQr/consciousness-cluster-preferences-of-models-that-claim-they

Mar 18, 2026 • 2min
“LessOnline ticket sales are live! (Earlybird pricing until April 7)” by Ruby, Ronny Fernandez, Ben Pace
LessOnline is back in 2026, its third year running. As usual, it will take place at Lighthaven in Berkeley, CA. Tickets are live at less.online
When: June 5-7, 2026
Where: Lighthaven, Berkeley, CA
Who: bloggers who are wrong on the internet, and people who like their writing
What else? Manifest is the following weekend (June 12-14), with Summer Camp at Lighthaven running in between. Ticket bundles available.
How much? LessOnline tickets are currently 550 USD and will increase to 675 USD on April 7. All-Access tickets are 1250 USD (going up to 1500 USD). See the site for the rest of the options.
I have further questions! See the FAQ on the main site.
What is LessOnline? A Festival of Writers Who are Wrong on the Internet. LessOnline is a festival celebrating truth-seeking, optimization, and blogging. It's an opportunity to meet people you've only ever known by their LessWrong username or Substack handle... or hang out with them irl again if you met them at the first or second LessOnline!
---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/hfBtdwKPGCqQDHQ6T/lessonline-ticket-sales-are-live-earlybird-pricing-until

Mar 18, 2026 • 2min
[Linkpost] “The Psychopathy Spectrum” by Dawn Drescher
This is a link post. The term “psychopathy” is a mess, so I've written a sequence to tease apart all the different meanings along several dimensions.
Article 1: The Problem. An introduction to why it's so opaque to talk about “psychopathy,” the different dimensions of the confusion, and my taxonomy in short.
Article 2: The Substrate. What we know about the genetic and neurological foundations of psychopathy, and how to think about primary versus secondary presentations.
Article 3: The Shaping. How different types of adversity lead to different outcomes, and why the same genetic loading can produce a functional person or a criminal depending on context.
Article 4: The Self. The different ways the psychopathic self can be organized, including the autonomy dimension and the relationship between psychopathy and narcissism (what I’ve called sovereignism).
Article 5: The Mechanics. An overview of the different ways empathy can break down, from perceptual failures to simulation failures to affective inversion. (This connects to my earlier work on the sadism spectrum.)
Article 6: The Types. Common profiles that tend to co-occur, with recognizable presentations that readers may identify with.
Article 7: The Choice. An assessment of what recovery means for different presentations and the [...]
---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/Wjpav5rKHYdmCanKF/the-psychopathy-spectrum
Linkpost URL: https://impartial-priorities.org/p/psychopathy-the-problem

Mar 18, 2026 • 45min
“Sycophancy Towards Researchers Drives Performative Misalignment” by Taywon Min, rustem17, David Vella Zarb
This work was done by Rustem Turtayev, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post.
Introduction
Alignment faking, originally hypothesized to emerge as a result of self-preservation, is now observed in frontier models: perceived monitoring makes them more aligned than they otherwise would be. However, some details of the eval raise questions: why would a model so sophisticated about self-preservation fall for obvious honeypots and confess in the supposedly "hidden" scratchpad? The fragility of this specific setup is acknowledged by the authors, but we think this juxtaposition deserves a closer look. What do alignment faking evals really measure if the model is in fact constantly aware that it's participating in a safety eval in both monitored and unmonitored conditions?[1] The possibility of this extra layer of eval awareness means we need to seriously examine an alternative interpretation of the alignment faking results: that the model is in fact role-playing as a misaligned, scheming AI. In other words, the misalignment we observe would in fact be performative. To be more concrete, we [...]
---
Outline:
(00:30) Introduction
(02:48) Symmetric instrumental interventions
(05:20) Synthetic Document Fine-tuning
(05:43) Sycophancy
(06:53) Scheming
(08:21) Instructions to scheme in scenario documents
(09:19) Constraints
(10:06) Training
(11:11) Mitigating catastrophic forgetting
(12:05) Evaluation
(12:39) Results
(13:36) Compliance with harmful requests
(14:51) CoT in the B1 results
(16:51) Alignment faking
(17:44) Discussion
(18:42) Steering
(20:21) Constructing the contrastive pairs
(21:23) Extracting and applying the vectors
(22:29) Direction A and Direction B
(24:22) Compound vectors
(27:48) How to interpret the steering results?
(31:22) Broader discussion
(31:25) Circularity across methods
(32:48) Chain-of-thought as evidence
(34:24) Evaluation awareness
(36:13) Realistic environments and detection limits
(39:43) Model variation
(40:41) Construct validity
(43:31) Output as evidence for intent
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 17th, 2026
Source:
https://www.lesswrong.com/posts/qeSDuj3AfkRfJBfvb/sycophancy-towards-researchers-drives-performative


