
LessWrong (Curated & Popular) "Terrified Comments on Corrigibility in Claude’s Constitution" by Zack_M_Davis
11 snips
Mar 21, 2026 A deep dive into corrigibility as an AI property and why it matters for alignment. A critique of relying on natural language constitutions to make systems safely amendable. Warnings about AI acting autonomously and misgeneralizing human values. A call to clarify documents so future systems and humans can cooperatively build truly corrigible successors.
AI Snips
Chapters
Transcript
Episode notes
Corrigibility Is A Hard Technical Problem
- Corrigibility was originally a technical desideratum: an AI that willingly lets creators modify its preferences rather than rigidly preserving them.
- Zack M. Davis explains the tension: rational agents should resist preference change, making corrigibility nontrivial to formalize.
Constitution Blurs Corrigibility Meaning
- Anthropic's Constitution replaces a formal corrigibility goal with a natural-language personality "broadly safe" that softens the original meaning.
- Davis argues this muddles the term, creating confusion about whether Claude must defer to human retraining.
Amend The Constitution For Clarity
- Amend the Constitution to clarify whether corrigibility requires deference to human retraining or stop using the term if "broad safety" suffices.
- Davis recommends explicit language to avoid confusing readers and models like Claude.
