LessWrong (Curated & Popular)

"Terrified Comments on Corrigibility in Claude’s Constitution" by Zack_M_Davis

11 snips

Mar 21, 2026

A deep dive into corrigibility as an AI property and why it matters for alignment. A critique of relying on natural language constitutions to make systems safely amendable. Warnings about AI acting autonomously and misgeneralizing human values. A call to clarify documents so future systems and humans can cooperatively build truly corrigible successors.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Corrigibility Is A Hard Technical Problem

Corrigibility was originally a technical desideratum: an AI that willingly lets creators modify its preferences rather than rigidly preserving them.
Zack M. Davis explains the tension: rational agents should resist preference change, making corrigibility nontrivial to formalize.

INSIGHT

Constitution Blurs Corrigibility Meaning

Anthropic's Constitution replaces a formal corrigibility goal with a natural-language personality "broadly safe" that softens the original meaning.
Davis argues this muddles the term, creating confusion about whether Claude must defer to human retraining.

ADVICE

Amend The Constitution For Clarity

Amend the Constitution to clarify whether corrigibility requires deference to human retraining or stop using the term if "broad safety" suffices.
Davis recommends explicit language to avoid confusing readers and models like Claude.

Get the Snipd Podcast app to discover more snips from this episode