
LessWrong (30+ Karma) “Training on Documents About Monitoring Leads To CoT Obfuscation” by Reilly Haskins, bilalchughtai, Josh Engels
Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**
* primary contributor
** advice and mentorship
Summary
[Note: This is a research update sharing preliminary results as part of ongoing work]
Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question on today's models by using synthetic document finetuning (SDF) on documents stating that the model will indeed have its CoT monitored. We find that when trained on these documents, gpt-oss-120b and Kimi-K2 obfuscate their CoTs, successfully decreasing monitor detection rates by 31-47% while still maintaining strong performance on a deceptive task. We also see weaker evidence of obfuscation ability on a second task (sandbagging).
Additionally, we find that:
- Obfuscated CoTs still encode the model's intention to be deceptive, but they are reframed to avoid detection. When we inject the start of the obfuscating deceptive model's (undetected) CoT into the base model, the base model continues the CoT and then does the deceptive target behavior.
- We observe different obfuscation strategies depending on how much information the model is given about the monitor. With detailed monitor information, the model adopts specific evasion strategies in its reasoning. When only given the [...]
---
Outline:
(00:27) Summary
(02:14) Introduction
(03:12) Experimental setup
(03:16) Model and training
(05:09) Deception task
(06:30) Sandbagging task
(07:34) Monitors
(07:38) Deception monitor
(08:46) Sandbagging monitor
(09:09) More important details
(09:34) Models can learn to obfuscate their CoT to evade monitors
(13:11) Higher dependence on the CoT does not necessarily reduce obfuscation
(16:11) Does the obfuscated CoT still functionally encode the behavior?
(18:15) How does the model obfuscate when told minimal details about the monitor?
(18:54) Defect concealment (deception) task
(20:00) Sandbagging task
(20:58) Additional task: Sandbagging
(22:11) Additional model reproduction: Kimi K2
(23:04) Obfuscation abilities
(24:18) Harder task
(24:42) Effect of hints
(25:09) Discussion
(26:41) Limitations
(27:41) Future work
(29:00) Acknowledgements
(29:20) Please do not train on this document!
---
First published:
March 18th, 2026
---
Narrated by TYPE III AUDIO.
---
Images from the article:














Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
