
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis
Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
Mar 5, 2026

Tom McGrath, Goodfire's chief scientist working on mechanistic interpretability and loss-landscape shaping, and Dan Balsam, Goodfire co-founder focused on monitoring and applied research, dive into intentional design. They explore the geometry of latent manifolds, decomposing gradients into semantic parts, probe-based hallucination reduction and frozen-probe tricks, plus disentangling memorization from reasoning and Alzheimer's biomarker findings.
AI Snips
Don't Fight Backprop Directly
- Avoid directly projecting out unwanted gradient components mid-backprop, because gradient descent will find alternate circuits and 'win'.
- Instead, reshape the loss landscape (e.g., via inoculation prompting) so the model adopts the preferred behavior naturally.
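The "projecting out" intervention the snip warns against can be sketched in a few lines: subtract the component of the gradient that lies along an unwanted direction. The function name and vectors here are illustrative, not from the episode.

```python
import numpy as np

def project_out(grad, unwanted_dir):
    """Remove the component of `grad` along `unwanted_dir`.

    This is the mid-backprop intervention the episode argues against:
    gradient descent tends to route around it via alternate circuits.
    """
    u = unwanted_dir / np.linalg.norm(unwanted_dir)
    return grad - np.dot(grad, u) * u

g = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])
print(project_out(g, u))  # -> [0. 4.]
```

The result is orthogonal to the unwanted direction, which is exactly why the optimizer "wins": nothing stops it from rebuilding the behavior along directions the projection does not cover.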
Explain Away Undesired Behaviors With Inoculation
- Inoculation prompting works by telling the model that a behavior is expected, so the model explains the behavior away rather than learning it, reducing incentives for malicious optimization.
- Tom contrasts telling a model 'don't reward hack' (counterproductive) with 'it's okay to reward hack' as an explaining-away prior.
Run Probes On A Frozen Separate Model
- When using probes for RL-based fixes, run the probe on a frozen separate model and do not backpropagate through it to avoid easy obfuscation.
- Goodfire froze the reward model so the student model had an easier path to stopping hallucination than to learning evasion strategies.
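The frozen-probe idea above can be sketched as follows: a linear probe is fit once, its weights are held fixed, and its score enters the RL loop as a plain number, so no gradient can flow back through it. The probe weights, scoring function, and dimensions here are illustrative assumptions, not Goodfire's implementation.

```python
import numpy as np

# Illustrative frozen-probe reward sketch. In a real setup the probe
# would read activations from a separate frozen copy of the model; here
# the key property is that the probe is never updated and its output is
# treated as a constant reward (a stop-gradient).

rng = np.random.default_rng(0)
PROBE_W = rng.normal(size=64)  # fit once on labeled data, then frozen
PROBE_B = -0.5

def hallucination_score(activations):
    """Logistic score from the frozen probe; higher = more hallucination."""
    z = activations @ PROBE_W + PROBE_B
    return 1.0 / (1.0 + np.exp(-z))

def reward(activations):
    # The policy is rewarded for low hallucination scores, but since the
    # score arrives as a plain float, it cannot backprop through the
    # probe and learn to obfuscate against its internals.
    return float(1.0 - hallucination_score(activations))

acts = rng.normal(size=64)
print(reward(acts))
```

Freezing matters because a probe that trains alongside the policy (or passes gradients to it) gives the policy a direct channel for learning evasion rather than the intended fix.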


