LessWrong (Curated & Popular)

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Feb 13, 2026
Jacob Drori is a researcher who reproduced and probed Gao et al.'s weight-sparse transformer work. He explains how sparse models are trained and pruned to extract compact circuits, and walks through pronoun, IOI, and question-mark tasks. He then presents evidence that these seemingly interpretable circuits can be unfaithful, fail to generalize, and sometimes hide alternate computations.
INSIGHT

Small Circuits Can Be Misleading

  • Weight-sparse training plus pruning can produce small, seemingly interpretable circuits for narrow tasks.
  • But Jacob Drori shows these circuits may not reflect the model's true computations and can be misleading.
ANECDOTE

Replication With Faster Training

  • Jacob Drori reproduced Gao et al.'s main results and sped up training roughly 3x in his implementation.
  • He added a pruning algorithm, multi-GPU support, and an interactive circuit viewer to his codebase.
INSIGHT

Sparsity Cuts Circuit Size, Task-Dependently

  • Weight-sparse models often yield smaller circuits than dense models for pronoun and IOI tasks at equal loss.
  • The effect is weaker on the question-mark task, showing task dependence in interpretability gains.