Jacob Drori

Author of the LessWrong post being narrated; researcher who reproduced and analyzed Gao et al.'s weight-sparse transformer results and investigated faithfulness of extracted circuits.

Best podcasts with Jacob Drori


Feb 13, 2026 • 27min

"Weight-Sparse Circuits May Be Interpretable Yet Unfaithful" by jacob_drori

Jacob Drori is a researcher who reproduced and probed Gao et al.'s weight-sparse transformer work. He explains how sparse models are trained and then pruned to extract compact circuits, and walks through the pronoun, IOI, and question-mark tasks. He then presents evidence that these seemingly interpretable circuits can be unfaithful, fail to generalize, and sometimes hide alternate computations.
