
Jacob Drori
Author of the LessWrong post being narrated; a researcher who reproduced and analyzed Gao et al.'s weight-sparse transformer results and investigated the faithfulness of the extracted circuits.
Best podcasts with Jacob Drori
Ranked by the Snipd community

Feb 13, 2026 • 27min
"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori
Jacob Drori is a researcher who reproduced and probed Gao et al.'s weight-sparse transformer work. He explains how sparse models are trained and then pruned to extract compact circuits, and walks through the pronoun, IOI (indirect object identification), and question-mark tasks. He then presents evidence that these seemingly interpretable circuits can be unfaithful, fail to generalize, and sometimes hide alternate computations.


