Linear Digressions

Benchmarking AI Models

Mar 30, 2026
They examine how standardized benchmarks try to measure LLM progress, including MMLU’s 14,000-question multitask exam. They explore SWE-bench, which tests models on real GitHub bug reports, with proposed fixes verified by unit tests. They dig into problems like Goodhart’s Law, data contamination, canary strings, encryption of test data, and why passing a test can mislead about true ability.
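One of those ideas, canary strings, is easy to sketch: a benchmark embeds a unique marker string in its data files, so anyone can scan a training corpus for that marker to detect contamination. A minimal sketch follows; the canary value and file path are hypothetical placeholders, not from any real benchmark.

```python
# Hypothetical canary; real benchmarks embed their own unique GUID
# in every data file so that leaked test data is detectable.
CANARY = "BENCHMARK CANARY a1b2c3d4-0000-0000-0000-000000000000"

def corpus_is_contaminated(corpus_path: str) -> bool:
    """True if the canary appears in a training shard, meaning the
    benchmark's test data leaked into the training corpus."""
    with open(corpus_path, encoding="utf-8", errors="ignore") as f:
        return any(CANARY in line for line in f)

print(corpus_is_contaminated("train_shard_000.txt"))  # hypothetical path
```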
AI Snips
INSIGHT

Benchmarks Measure Specific LLM Capabilities

  • Benchmarks are standardized tests that measure specific LLM capabilities across diverse tasks, much like school-style exams.
  • MMLU is a 14,000-question, 57-subject multiple-choice suite that became an early workhorse for comparing models (see the scoring sketch below).
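A minimal sketch of the scoring loop such a benchmark implies: pose each multiple-choice question, read off the model's letter answer, and report accuracy. Here ask_model is a hypothetical stand-in for a real LLM call, and the sample question is invented for illustration.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical model call; a real harness would query an LLM here."""
    return "A"

def accuracy(items: list[dict]) -> float:
    """Fraction of four-choice questions the model answers correctly."""
    correct = 0
    for item in items:
        options = "\n".join(
            f"{letter}. {text}" for letter, text in zip("ABCD", item["choices"])
        )
        prompt = f"{item['question']}\n{options}\nAnswer:"
        reply = ask_model(prompt).strip().upper()
        if reply.startswith(item["answer"]):
            correct += 1
    return correct / len(items)

items = [{
    "question": "Which organ produces insulin?",
    "choices": ["Pancreas", "Liver", "Kidney", "Spleen"],
    "answer": "A",
}]
print(f"accuracy: {accuracy(items):.1%}")
```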
ANECDOTE

MMLU Is A 14,000-Question Academic Gauntlet

  • MMLU collects 14,000 four-choice questions, drawn from practice exams across 57 subjects, to probe many capabilities (a prompt sketch follows below).
  • Example questions spanned medicine, law, and embryology to illustrate the suite's breadth and difficulty.
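For concreteness, a sketch of the prompt layout commonly used with MMLU items: a subject header, the question stem, four lettered options, and an "Answer:" cue the model must complete. The question text here is invented, not taken from the benchmark.

```python
def format_mmlu_item(subject: str, question: str, choices: list[str]) -> str:
    """Render one four-choice item in the usual MMLU prompt layout."""
    header = f"The following are multiple choice questions (with answers) about {subject}."
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    return f"{header}\n\n{question}\n{options}\nAnswer:"

print(format_mmlu_item(
    "professional law",
    "Which element is generally required to form a valid contract?",
    ["Consideration", "Notarization", "A witness", "A seal"],
))
```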
INSIGHT

Ambiguous Questions Undermine Benchmark Certainty

  • Some benchmark questions lack a single crisp correct answer and depend on omitted context, reducing reliability.
  • In the law example, reasonable people could justify different answers because key case details were omitted.