“Test your best methods on our hard CoT interp tasks” by daria, Riya Tyagi, Josh Engels, Neel Nanda

Mar 26, 2026

A rundown of nine diagnostic tasks designed to stress-test chain-of-thought (CoT) interpretability tools. The episode covers which probing methods and simple text analyses beat black-box monitors, especially out-of-distribution (OOD), along with clear criteria for good proxy tasks and surprising failures on follow-up prediction, atypical answers, and entropy estimation. It also presents an open testbed and results meant to push development of stronger CoT analysis techniques.

INSIGHT

Chain Of Thought Is Powerful But Not Sufficient

Reading a model's chain of thought (CoT) is currently a powerful safety technique.
The paper shows that reading the CoT sometimes fails out-of-distribution (OOD), which motivates objective proxy tasks for stress-testing interpretability methods.

INSIGHT

Trained Non-LLM Methods Often Outperform LLM Monitors OOD

Non-LLM classifiers such as linear probes, sparse autoencoder (SAE) based classifiers, attention probes, and TF-IDF often outperform LLM monitors OOD.
On 6 of 9 tasks, these trained methods beat zero-shot and few-shot LLM monitors, suggesting that training helps generalization here.
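
To make the "simple text analysis" baseline concrete, here is a minimal sketch of a trained TF-IDF classifier over CoT text. It is an illustration only: the toy transcripts, the labels "uncertain" and "confident", and the nearest-centroid classifier are all assumptions made for this example and are not taken from the paper, which may use a different classifier and different tasks.

```python
import math
from collections import Counter, defaultdict

def embed(tokens, idf):
    """L2-normalised TF-IDF vector (as a dict) over tokens seen in training."""
    counts = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {tok: w / norm for tok, w in vec.items()}

def fit_tfidf_centroids(texts, labels):
    """Learn smoothed IDF weights and one summed TF-IDF centroid per label."""
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    centroids = defaultdict(Counter)
    for toks, label in zip(tokenized, labels):
        for tok, weight in embed(toks, idf).items():
            centroids[label][tok] += weight
    return idf, dict(centroids)

def predict(text, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = embed(text.lower().split(), idf)

    def score(label):
        centroid = centroids[label]
        c_norm = math.sqrt(sum(w * w for w in centroid.values())) or 1.0
        return sum(w * centroid[tok] for tok, w in vec.items()) / c_norm

    return max(centroids, key=score)

# Hypothetical toy CoT snippets and labels, invented for illustration.
train_texts = [
    "wait let me recheck that step",
    "hmm wait this seems wrong",
    "so the answer is 12",
    "therefore the result is 7",
]
train_labels = ["uncertain", "uncertain", "confident", "confident"]

idf, centroids = fit_tfidf_centroids(train_texts, train_labels)
print(predict("wait that seems off", idf, centroids))
print(predict("the answer is 5", idf, centroids))
```

A baseline like this only sees surface wording, so it is a useful floor: if a more elaborate method cannot beat it on held-out data, the extra machinery is not yet earning its keep.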

ADVICE

Build Proxy Tasks With Resampled Ground Truth

Design proxy tasks to be objective, non-trivial, tractable, and confounder-free.
Use resampling to build reliable ground truth: a label taken from a single stochastic sample can be noisy and can mislead the methods being evaluated.
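
The resampling advice can be sketched as follows: instead of labelling an example with one sampled answer, draw many samples and use the majority answer together with its empirical frequency. This is a minimal sketch under stated assumptions; `toy_model`, its 70% answer rate, and the sample count are hypothetical stand-ins for resampling a real model, not details from the paper.

```python
import random
from collections import Counter
from typing import Callable, Tuple

def resampled_label(
    sample_answer: Callable[[random.Random], str],
    n_samples: int = 50,
    seed: int = 0,
) -> Tuple[str, float]:
    """Return the majority answer and its empirical frequency over n_samples draws."""
    rng = random.Random(seed)
    counts = Counter(sample_answer(rng) for _ in range(n_samples))
    answer, count = counts.most_common(1)[0]
    return answer, count / n_samples

def toy_model(rng: random.Random) -> str:
    # Hypothetical stand-in for a stochastic model rollout:
    # answers "A" about 70% of the time and "B" otherwise.
    return "A" if rng.random() < 0.7 else "B"

label, freq = resampled_label(toy_model, n_samples=200)
print(label, freq)
```

With a single draw, this toy model would yield the label "B" roughly 30% of the time; aggregating over many draws gives a stable label and a frequency that can itself serve as a soft target.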
