“Training on Documents About Monitoring Leads To CoT Obfuscation” by Reilly Haskins, bilalchughtai, Josh Engels

Mar 19, 2026

29:49

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**

* primary contributor
** advice and mentorship

Summary

[Note: This is a research update sharing preliminary results as part of ongoing work]

Will future models obfuscate their CoT when they learn during pretraining that their CoT is being monitored? We investigate this question on today's models by using synthetic document finetuning (SDF) on documents stating that the model will indeed have its CoT monitored. We find that when trained on these documents, gpt-oss-120b and Kimi-K2 obfuscate their CoTs, successfully decreasing monitor detection rates by 31-47% while still maintaining strong performance on a deceptive task. We also see weaker evidence of obfuscation ability on a second task (sandbagging).

Additionally, we find that:

- Obfuscated CoTs still encode the model's intention to be deceptive, but they are reframed to avoid detection. When we inject the start of the obfuscating deceptive model's (undetected) CoT into the base model, the base model continues the CoT and then performs the deceptive target behavior.
- We observe different obfuscation strategies depending on how much information the model is given about the monitor. With detailed monitor information, the model adopts specific evasion strategies in its reasoning. When only given the [...]

---

Outline:

(00:27) Summary

(02:14) Introduction

(03:12) Experimental setup

(03:16) Model and training

(05:09) Deception task

(06:30) Sandbagging task

(07:34) Monitors

(07:38) Deception monitor

(08:46) Sandbagging monitor

(09:09) More important details