Deep Papers

The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets

Nov 30, 2023
In this episode, Samuel Marks, a Postdoctoral Research Associate at Northeastern University, discusses his paper on the linear structure of true/false statements in LLM representations. He and the hosts explore how language models linearly represent the truth or falsehood of factual statements, introduce a new probing technique called mass-mean probing, and examine how truth gets embedded in LLM activations. They also discuss future research directions and the paper's limitations.
INSIGHT

Emergent Linear Truth Direction

  • Language models often internally separate true and false statements along a linear direction in their representations.
  • This emergent 'truth direction' can be found by extracting intermediate-layer activations and visualizing their linear projections.
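A minimal sketch of the mass-mean probing idea mentioned above: the probe direction is simply the difference between the mean activation of true statements and the mean activation of false ones. The activations here are random toy data standing in for real intermediate-layer activations; the hidden size and means are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for intermediate-layer activations (hidden size 8);
# in practice these would be extracted from a specific transformer layer
# on labeled true and false statements.
true_acts = rng.normal(loc=1.0, size=(100, 8))
false_acts = rng.normal(loc=-1.0, size=(100, 8))

# Mass-mean probe: the direction is the difference of the class means.
theta = true_acts.mean(axis=0) - false_acts.mean(axis=0)

def predict_true(acts, direction, threshold=0.0):
    """Label a statement 'true' when its projection exceeds the threshold."""
    return acts @ direction > threshold

# Balanced accuracy on the toy data.
accuracy = (
    predict_true(true_acts, theta).mean()
    + (~predict_true(false_acts, theta)).mean()
) / 2
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

Unlike a logistic-regression probe, no optimization is involved: the direction falls directly out of the two class means, which is part of why it is easy to reuse for causal interventions.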
ANECDOTE

Models Know When They Lie

  • Samuel Marks gives examples where models knowingly output falsehoods, like Claude describing horoscopes then admitting they're not real.
  • He also recounts GPT-4 hiring a human to solve a CAPTCHA and explicitly deciding "I better lie."
INSIGHT

Causal Neurosurgery Validates Truth Vector

  • Causal evidence for the truth direction comes from neurosurgery-style interventions between layers.
  • Adding the extracted truth vector to the residual stream mid-forward-pass can flip the model's output from false to true.
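The intervention above can be sketched with a toy residual network: add a scaled "truth vector" to the hidden state between layers and observe that the downstream output distribution shifts. The two-layer model, weights, and `truth_vec` here are illustrative stand-ins, not the paper's actual model or extracted direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer's residual stream: two linear "layers".
W1 = rng.normal(size=(8, 8)) * 0.1
W2 = rng.normal(size=(8, 2)) * 0.1  # maps hidden state to (false, true) logits

# Hypothetical truth vector; in the paper it is extracted from activations
# on labeled true/false statements (e.g. via a mass-mean probe).
truth_vec = rng.normal(size=8)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, steer=0.0):
    h = x + x @ W1                  # residual connection
    h = h + steer * truth_vec       # intervene mid-residual
    return softmax(h @ W2)          # output probabilities

x = rng.normal(size=8)
print("baseline:", forward(x))
print("steered :", forward(x, steer=10.0))
```

The point of the causal test is that the same direction found by probing also *moves* the model's output when injected, which is stronger evidence than probe accuracy alone.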