80,000 Hours Podcast

#236 – Max Harms on why teaching AI right from wrong could get everyone killed

Feb 24, 2026
Max Harms, an alignment researcher at MIRI and a sci‑fi author, argues we should train AIs to have no values of their own and to defer completely to humans. He explores why even slight misalignment and proxy goals can lead to catastrophic outcomes. He outlines CAST (Corrigibility As Singular Target), his proposal to make corrigibility an AI's sole objective, and discusses practical benchmarks, governance questions, and why fiction helps communicate these risks.
ADVICE

Pause Between Major Capability Steps To Buy Alignment Time

  • Slow down AI capabilities progress to buy time for alignment research and reduce accidents.
  • Max suggests long pauses (e.g., months between major capability steps) so alignment researchers can catch up and audit deployments.
INSIGHT

Corrigibility Keeps Humans In The Driver's Seat

  • Corrigibility means the agent keeps its human principal empowered even as the agent's own power grows, including remaining willing to be modified or shut down.
  • Max contrasts corrigibility with Mickey's enchanted brooms in Fantasia and with value-preservation drives that resist change.
INSIGHT

Shutdown Indifference Is Fragile And Not A Full Solution

  • Early corrigibility work focused on shutdown indifference, but the proposed solutions proved fragile, unstable, and insufficient.
  • MIRI concluded corrigibility was a hard problem, and the field largely moved on until the recent renewal of interest.