
BlueDot Narrated Introduction to Mechanistic Interpretability
Jan 4, 2025

A clear intro to mechanistic interpretability and why understanding model internals matters for high-stakes decisions. Covers risks like deception, sycophancy, and reward hacking. Explains features, circuits, polysemanticity, superposition, and sparse autoencoders. Mentions Anthropic’s scaling work and early attempts at feature steering to alter model behavior.
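The sparse autoencoders mentioned above are trained to re-express a model's dense internal activations in a much wider, mostly-zero feature basis, which is one way of untangling superposition. Below is a minimal sketch of that idea, assuming a PyTorch setup; the dimensions, expansion factor, and penalty weight are illustrative choices, not values from the episode.

```python
# A minimal sketch of the sparse-autoencoder idea (assumed PyTorch setup;
# sizes and the L1 penalty weight are illustrative, not from the episode).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        # Encoder maps dense activations into a larger, sparse feature basis.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the original activations from those features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # non-negative, mostly-zero feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # stand-in for activations captured from a language model
recon, features = sae(acts)

# Objective: reconstruct the activations while keeping feature activations sparse.
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
loss.backward()
```

The ReLU keeps feature activations non-negative and the L1 term pushes most of them toward zero, so each input is explained by a small set of active features, which tend to be easier to interpret than raw neurons.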
Mechanistic Interpretability Targets The Black Box
- Mechanistic interpretability seeks to open the black box of neural networks to reveal internal reasoning processes.
- Current frontier models have billions to trillions of parameters across more than 100 layers, so their internal computations remain largely unknown.
Interpretability Enables Detection And Intervention
- Interpreting models helps detect bugs, biases, and unaligned behaviours like deception, sycophancy, and reward hacking.
- Greater interpretability could enable targeted interventions, such as isolating and disabling the parts of a network that produce harmful outputs (see the sketch below).
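
One simple way to prototype such an intervention is to zero out selected units with a forward hook. Here is a minimal sketch, assuming a small PyTorch model; the hooked layer and the unit indices labelled "harmful" are hypothetical, chosen only to show the mechanics.

```python
# A minimal sketch of zero-ablating selected units with a forward hook.
# The model, the hooked layer, and the "harmful" unit indices are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
harmful_units = [3, 7]  # hypothetical units implicated in an unwanted behaviour

def ablate(module, inputs, output):
    # Zero the selected units so their contribution never reaches later layers.
    output[:, harmful_units] = 0.0
    return output

# Hook the layer whose activations we want to edit (here, the ReLU output).
handle = model[1].register_forward_hook(ablate)
with torch.no_grad():
    out = model(torch.randn(4, 16))
handle.remove()
```

The same pattern extends to the feature steering mentioned in the episode notes: instead of zeroing activations, one adds a vector along a feature's direction to nudge the model's behaviour.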
Features Combine Into Circuits Across Layers
- Neural networks can be understood in terms of features (concepts encoded within individual layers) that connect into circuits (mechanisms that combine features across layers).
- Features like curve detectors combine via circuits to form higher-level concepts such as 'dog head' (see the toy sketch below).
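
A toy illustration of the feature-to-circuit idea: in the simplest case, a circuit can be read off from the weights connecting earlier-layer features to a later-layer feature. The two-layer network and the labels "curve detector" and "dog head" below are stand-ins, not anything from the episode.

```python
# A toy sketch of "reading off" a circuit by inspecting the weights that connect
# earlier-layer features to a later-layer feature. Network and labels are stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(8, 4)  # pretend its 4 outputs are low-level features (e.g. curve detectors)
layer2 = nn.Linear(4, 2)  # pretend its output 0 is a higher-level "dog head" feature

x = torch.randn(1, 8)
low_level = torch.relu(layer1(x))  # activations of the low-level features
high_level = layer2(low_level)     # the "dog head" unit is high_level[0, 0]

# The row of layer2's weight matrix for the "dog head" unit shows how strongly
# each low-level feature feeds into it: a crude picture of the connecting circuit.
dog_head_weights = layer2.weight[0]
for i, w in enumerate(dog_head_weights.tolist()):
    print(f"low-level feature {i} -> dog head: weight {w:+.3f}")
```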
