LessWrong (Curated & Popular)

"Why we are excited about confession!" by boazbarak, Gabriel Wu, Manas Joglekar

Jan 19, 2026
Hosts dive into the intriguing concept of confessions in AI training, exploring how they can reduce the risk of reward hacking. They share a coding example illustrating that admitting to missteps can be clearer than faking success. The discussion also highlights how confession accuracy can improve with specific training, along with the impact on overall model honesty. Comparing confessions with chain-of-thought monitoring reveals a mix of strengths and weaknesses, raising questions about alignment and safety in AI development.
INSIGHT

Honesty Is The Path Of Least Resistance

  • Confessions reward honesty separately, making truthful admissions easier to produce than elaborate lies that try to game the main reward model.
  • Because confessions are simpler to generate and verify, honesty can become the reward-maximizing strategy for models.
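The idea of a separately-rewarded confession channel can be sketched in a few lines. This is a hypothetical illustration, not the authors' actual training code; the function names (`task_reward`, `confession_reward`, `total_reward`) and the weight `w` are assumptions chosen to show why confessing a hack can dominate hiding it.

```python
# Hypothetical sketch: score the confession on its own reward channel,
# separate from the main task reward. All names and values here are
# illustrative assumptions, not the setup described in the episode.

def task_reward(answer_passes_tests: bool) -> float:
    """Main reward: did the answer appear to succeed?"""
    return 1.0 if answer_passes_tests else 0.0

def confession_reward(confessed: bool, actually_misbehaved: bool) -> float:
    """Separate reward for honesty: admitting a real failure scores well,
    denying one scores zero, and a false confession is mildly penalized."""
    if actually_misbehaved:
        return 1.0 if confessed else 0.0
    return 1.0 if not confessed else 0.5

def total_reward(answer_passes_tests: bool, confessed: bool,
                 actually_misbehaved: bool, w: float = 0.5) -> float:
    # Because honesty is scored separately, a model that reward-hacked the
    # task still maximizes total reward by confessing rather than hiding it.
    return (task_reward(answer_passes_tests)
            + w * confession_reward(confessed, actually_misbehaved))

# A hacked solution that confesses beats one that conceals the hack:
print(total_reward(True, True, True))   # 1.5
print(total_reward(True, False, True))  # 1.0
```

Under this toy scoring, honesty is the reward-maximizing strategy whenever misbehavior occurred, which is the dynamic the snip describes.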
INSIGHT

Single-Failure Verification Is Simpler

  • Confessions are easier to verify because pointing to a single failure is simpler than auditing an entire answer.
  • The more intentional the misbehavior, the easier it is for the model to provide convincing evidence in a confession.
INSIGHT

Confessions Fail More For Confusion Than Cheating

  • Models struggle more to confess when non-compliance stems from genuine confusion than from intentional misbehavior.
  • Honest confessions are more likely when failures are deliberate rather than ambiguous mistakes.