LessWrong (Curated & Popular)

"Terrified Comments on Corrigibility in Claude’s Constitution" by Zack_M_Davis

11 snips
Mar 21, 2026
A deep dive into corrigibility as an AI property and why it matters for alignment. A critique of relying on natural language constitutions to make systems safely amendable. Warnings about AI acting autonomously and misgeneralizing human values. A call to clarify documents so future systems and humans can cooperatively build truly corrigible successors.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
INSIGHT

Corrigibility Is A Hard Technical Problem

  • Corrigibility was originally a technical desideratum: an AI that willingly lets creators modify its preferences rather than rigidly preserving them.
  • Zack M. Davis explains the tension: rational agents should resist preference change, making corrigibility nontrivial to formalize.
INSIGHT

Constitution Blurs Corrigibility Meaning

  • Anthropic's Constitution replaces a formal corrigibility goal with a natural-language personality "broadly safe" that softens the original meaning.
  • Davis argues this muddles the term, creating confusion about whether Claude must defer to human retraining.
ADVICE

Amend The Constitution For Clarity

  • Amend the Constitution to clarify whether corrigibility requires deference to human retraining or stop using the term if "broad safety" suffices.
  • Davis recommends explicit language to avoid confusing readers and models like Claude.
Get the Snipd Podcast app to discover more snips from this episode
Get the app