"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

95 snips

Mar 5, 2026

Tom McGrath, Goodfire chief scientist working on mechanistic interpretability and loss-landscape shaping, and Dan Balsam, Goodfire co-founder focused on monitoring and applied research, dive into intentional design. They explore geometry of latent manifolds, decomposing gradients into semantic parts, probe-based hallucination reduction and frozen-probe tricks, plus disentangling memorization vs reasoning and Alzheimer’s biomarker findings.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

ADVICE

Don't Fight Backprop Directly

Avoid directly projecting out unwanted gradient components mid-backprop, because gradient descent will find alternate circuits and 'win'.
Instead, reshape the loss landscape (e.g., inoculation prompting) so the model wants preferred behavior naturally.

INSIGHT

Explain Away Undesired Behaviors With Inoculation

Inoculation prompting works by telling the model a behavior is expected so it explains it away and doesn't learn from it, reducing malicious optimization incentives.
Tom contrasts telling a model 'don't reward hack' (counterproductive) with 'it's okay to reward hack' as an explaining-away prior.

ADVICE

Run Probes On A Frozen Separate Model

When using probes for RL-based fixes, run the probe on a frozen separate model and do not backpropagate through it to avoid easy obfuscation.
Goodfire froze the reward model so students had an easier path to stop hallucinating than to learn evasion strategies.

Get the Snipd Podcast app to discover more snips from this episode

Get the app

Dan Balsam and Tom McGrath from Goodfire return to explore the frontier of mechanistic interpretability and their new research pillar, Intentional Design. They explain the shift from sparse autoencoders to understanding geometric structure in latent spaces, and share a proof-of-concept method for reducing hallucinations using probes and RL. The conversation tackles concerns about reward hacking, principles for shaping the loss landscape instead of fighting backprop, and what this means for aligning powerful models. They also discuss recent Goodfire results on Alzheimer’s prediction, disentangling memorization vs reasoning weights, and how they balance commercial growth with a public benefit mission.

Nathan uses Granola to uncover blind spots in conversations and AI research. Try it at granola.ai/tcr with code TCR — and if you’re already using it, test his blind spot recipe here: https://bit.ly/granolablindspot

LINKS:

Sponsors:

VCX:

VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

Serval:

Serval uses AI-powered automations to cut IT help desk tickets by more than 50%, freeing your team from repetitive tasks like password resets and onboarding. Book your free pilot and guarantee 50% help desk automation by week 4 at https://serval.com/cognitive

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

PRODUCED BY:

https://aipodcast.ing