Linear Digressions

Benchmarking AI Models

13 snips

Mar 30, 2026

They examine how standardized benchmarks try to measure LLM progress, including MMLU’s 14,000-question multitask exam. They explore SWE-bench, which tests models on real GitHub bugs and unit-test fixes. They dig into problems like Goodhart’s Law, data contamination, canary strings, encryption, and why passing a test can mislead about true ability.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Benchmarks Measure Specific LLM Capabilities

Benchmarks are standardized tests that measure specific LLM capabilities across diverse tasks like school-style exams.
MMLU is a 14,000-question, 57-subject multiple-choice suite used as a workhorse for comparing models early on.

ANECDOTE

MMLU Is A 14,000 Question Academic Gauntlet

MMLU collects 14,000 four-choice questions across 57 subjects from practice exams to probe many capabilities.
Example questions included medicine, law, and embryology to illustrate breadth and difficulty.

INSIGHT

Ambiguous Questions Undermine Benchmark Certainty

Some benchmark questions lack a single crisp correct answer and depend on omitted context, reducing reliability.
The law example shows ambiguity where reasonable people could justify different answers given missing case details.

Get the Snipd Podcast app to discover more snips from this episode