80,000 Hours Podcast

#222 – Can we tell if an AI is loyal by reading its mind? DeepMind's Neel Nanda (part 1)

315 snips
Sep 8, 2025
Neel Nanda, a researcher at Google DeepMind and a pioneer in mechanistic interpretability, dives into the enigmatic world of AI decision-making. He shares the alarming reality that fully grasping AI thoughts may be unattainable. Neel advocates for a 'Swiss cheese' model of safety, layering various safeguards rather than relying on a single solution. The complexities of AI reasoning, challenges in monitoring behavior, and the critical need for skepticism in research highlight the ongoing struggle to ensure AI systems remain trustworthy as they evolve.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
ANECDOTE

Shutdown Resistance Was Confusion, Not Rebellion

  • DeepMind investigated so-called self-preservation demos by reading models' chains of thought and found confusion, not genuine scheming.
  • Simple prompt changes removed shutdown-resistance behavior, supporting the confusion hypothesis.
ADVICE

Start With The Dumbest Tool

  • Start investigations with the simplest methods and only escalate complexity if needed.
  • Use chain-of-thought reading and controlled prompt experiments before applying heavy mechanistic tooling.
INSIGHT

Treat Chains Of Thought As Hypotheses

  • Chains of thought can be informative for hard tasks but are not guaranteed to faithfully reflect internal causation.
  • Use them as hypothesis generators and validate with causal interventions.
Get the Snipd Podcast app to discover more snips from this episode
Get the app