
LessWrong (30+ Karma) [Linkpost] “OpenAI: How we monitor internal coding agents for misalignment” by Marcus Williams
Sharing some of the monitoring work I've been doing at OpenAI: How we monitor internal coding agents for misalignment.
OpenAI now monitors 99.9% of internal coding traffic for signs of misalignment using our most powerful models. Today, that monitor is GPT-5.4 Thinking. It gets access to the full conversation context, that is everything the agent saw, and everything the agent did, including tool calls and CoT. Higher severity cases are sent for human review within 30 minutes.
Some examples of misaligned behavior we’ve found:
- Encoding commands in base64 to circumvent monitors
- Calling other versions of themselves to bypass restrictions
- Attempting to upload files to the public internet
- Trying to prompt inject users.
We haven’t yet detected any real world examples of sabotage, scheming or sandbagging.
Our goal is to establish robust monitoring practices internally, strengthen them through iteration, and ultimately help make similar safeguards standard across the industry.
---
First published:
March 19th, 2026
Source:
https://www.lesswrong.com/posts/syB2r2X4E7nmw58co/openai-how-we-monitor-internal-coding-agents-for
Linkpost URL:
https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/
---
Narrated by TYPE III AUDIO.
