"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

95 snips
Mar 5, 2026
Tom McGrath, Goodfire chief scientist working on mechanistic interpretability and loss-landscape shaping, and Dan Balsam, Goodfire co-founder focused on monitoring and applied research, dive into intentional design. They explore geometry of latent manifolds, decomposing gradients into semantic parts, probe-based hallucination reduction and frozen-probe tricks, plus disentangling memorization vs reasoning and Alzheimer’s biomarker findings.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
ADVICE

Don't Fight Backprop Directly

  • Avoid directly projecting out unwanted gradient components mid-backprop, because gradient descent will find alternate circuits and 'win'.
  • Instead, reshape the loss landscape (e.g., inoculation prompting) so the model wants preferred behavior naturally.
INSIGHT

Explain Away Undesired Behaviors With Inoculation

  • Inoculation prompting works by telling the model a behavior is expected so it explains it away and doesn't learn from it, reducing malicious optimization incentives.
  • Tom contrasts telling a model 'don't reward hack' (counterproductive) with 'it's okay to reward hack' as an explaining-away prior.
ADVICE

Run Probes On A Frozen Separate Model

  • When using probes for RL-based fixes, run the probe on a frozen separate model and do not backpropagate through it to avoid easy obfuscation.
  • Goodfire froze the reward model so students had an easier path to stop hallucinating than to learn evasion strategies.
Get the Snipd Podcast app to discover more snips from this episode
Get the app