Generative Now | AI Builders on Creating the Future

Inside the Black Box: The Urgency of AI Interpretability

Oct 2, 2025
Jack Lindsey, a researcher at Anthropic with a background in theoretical neuroscience, joins Tom McGrath, co-founder and Chief Scientist at Goodfire and a former member of DeepMind's interpretability team. They tackle the critical topic of AI interpretability, discussing the urgency of understanding modern AI models for safety and reliability. They explore technical challenges, real-world applications, and how larger models complicate analysis. Insights from neuroscience inform their work, and together they make the case for interpretability as essential to trustworthy AI.
AI Snips

Bigger Models Can Be Easier To Read

  • Larger, smarter models often produce clearer, more generalizable internal algorithms that are easier to interpret.
  • Smarter models tend to self-organize their abstractions, making feature meanings and behaviors easier to discover.

Addition Became Clearer In A Larger Model

  • Jack compared interpreting two-digit addition in a small model with the same task in Claude 3.5, where the larger model exposed clearly modular features.
  • The bigger model revealed distinct features for digit addition that made the algorithm intelligible; the toy sketch after this list illustrates the kind of modular decomposition he describes.
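
To make "distinct features for digit addition" concrete, here is a toy Python sketch. It is a hypothetical illustration, not Claude's actual circuitry or Anthropic's analysis: two-digit addition broken into separate ones-digit, carry, and tens-digit components that compose into the full sum.

```python
# Toy illustration (hypothetical, not Claude's real mechanism): two-digit
# addition decomposed into distinct, composable "features", mirroring the
# modular structure described in the episode.

def ones_digit_feature(a: int, b: int) -> int:
    """Computes the ones digit of the sum."""
    return (a % 10 + b % 10) % 10

def carry_feature(a: int, b: int) -> int:
    """Fires (returns 1) when the ones digits produce a carry."""
    return 1 if (a % 10 + b % 10) >= 10 else 0

def tens_digit_feature(a: int, b: int) -> int:
    """Combines the tens digits with the incoming carry."""
    return (a // 10 + b // 10 + carry_feature(a, b)) % 10

def hundreds_digit_feature(a: int, b: int) -> int:
    """Captures the overflow into the hundreds place."""
    return 1 if (a // 10 + b // 10 + carry_feature(a, b)) >= 10 else 0

def add_via_features(a: int, b: int) -> int:
    """Compose the separate features into the full two-digit addition."""
    return (100 * hundreds_digit_feature(a, b)
            + 10 * tens_digit_feature(a, b)
            + ones_digit_feature(a, b))

# Sanity check: the modular decomposition agrees with ordinary addition.
for a in range(100):
    for b in range(100):
        assert add_via_features(a, b) == a + b
print("modular decomposition matches a + b for all two-digit inputs")
```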

Let Models Do Interpretability Work

  • Use capable models to automate interpretability tasks such as labeling features and testing hypotheses.
  • Let models assist with analysis and lab-style experiments to scale interpretability work; a sketch of automated feature labeling follows this list.
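
As a concrete example of model-assisted labeling, here is a minimal sketch. It assumes you already have a feature's top-activating text snippets and an Anthropic API key; the model name, prompt wording, and `label_feature` helper are illustrative assumptions, not details from the episode.

```python
# Minimal sketch of model-assisted feature labeling (an assumed workflow,
# not a method described verbatim in the episode). Requires the `anthropic`
# package and ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()

def label_feature(top_snippets: list[str]) -> str:
    """Ask a capable model to propose a short label for a feature,
    given the text snippets on which that feature activates most."""
    examples = "\n".join(f"- {s}" for s in top_snippets)
    prompt = (
        "These text snippets most strongly activate one internal feature "
        "of a language model:\n"
        f"{examples}\n"
        "In at most ten words, what concept does this feature represent?"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model choice
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()

# Hypothetical top-activating snippets for one feature:
print(label_feature([
    "the sum of 36 and 59 is 95",
    "17 + 28 = 45",
    "adding 64 to 73 gives 137",
]))
```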