
LessWrong (30+ Karma): “Paper close reading: ‘Why Language Models Hallucinate’” by LawrenceC
Apr 6, 2026
A step-by-step close reading of a paper that frames hallucinations as plausible guesses made under uncertainty. The discussion checks the paper's claims against model outputs, worked examples, and benchmark behavior, examines its reduction of generation errors to binary classification, debates whether learning theory supports that view, and explores changing benchmark incentives as a possible mitigation.
Hallucinations Defined As Guessing Under Uncertainty
- Hallucinations are framed as plausible but incorrect statements produced when the model is uncertain.
- Lawrence C notes this definition excludes logical-reasoning errors like incorrect algebra steps, which feel different from uncertainty-driven guessing.
Training And Evaluation Incentivize Plausible Guesses
- The paper claims training and evaluation reward guessing over admitting uncertainty, creating statistical pressures that produce hallucinations; a worked example of the scoring incentive follows this list.
- Lawrence C connects this to pretraining data patterns and to post-training human grading, both of which reward plausible-sounding answers over "I don't know."
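
A minimal worked example of that incentive, with illustrative numbers of my own rather than figures from the talk: under a benchmark that awards one point for a correct answer and zero for either a wrong answer or an abstention, guessing beats saying "I don't know" at any nonzero confidence.

% Hedged sketch of a hypothetical 0/1-graded benchmark, where p is the model's
% confidence in its best candidate answer.
\[
\mathbb{E}[\text{score} \mid \text{guess}] = p \cdot 1 + (1 - p) \cdot 0 = p,
\qquad
\mathbb{E}[\text{score} \mid \text{abstain}] = 0,
\]
% so any confidence p > 0 makes guessing the higher-scoring move. If wrong
% answers instead cost t points, abstaining becomes optimal exactly when
\[
p \cdot 1 - (1 - p)\,t < 0 \iff p < \tfrac{t}{1 + t}.
\]

This is the sense in which changing benchmark incentives (penalizing confident errors, crediting calibrated abstention) is floated as a mitigation.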
Hallucinations As Binary Classification Errors
- The authors reduce hallucinations to errors in binary classification: if incorrect statements can't be distinguished from facts, pretraining pressures produce hallucinations (a schematic of the bound follows this list).
- Lawrence C is skeptical because the learning-theory framing may ignore model-specific structure and capacity limits.
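
For orientation, a schematic of the reduction as I read it; the symbols and the simplified correction term are my paraphrase, not the paper's exact statement:

% err_gen: probability the generator emits an incorrect but plausible statement
% err_iiv: error rate of the induced "Is-It-Valid" binary classifier
\[
\text{err}_{\text{gen}} \;\gtrsim\; 2\,\text{err}_{\text{iiv}} \;-\; (\text{calibration and coverage terms}).
\]
% Reading: if no classifier can reliably separate plausible falsehoods from
% facts (err_iiv is large), then a model fit to the pretraining objective
% generates such falsehoods at a comparable rate.

Lawrence C's skepticism, restated in this notation, is that the bound treats the model as a generic learner, so it may say little about the structure and capacity limits of actual language models.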






