
AI Summer: Joel Becker on METR's famous time horizons chart
Mar 14, 2026
Joel Becker, a METR researcher who builds time-horizon benchmarks, explains how METR measures the length, in human hours, of tasks that models can complete about half the time. He discusses recent models nearing saturation of the task suite and why a single task can swing horizon estimates wildly. They also talk about extending the benchmark to longer, messier tasks and the challenges of running AI-assisted programming studies.
Capability Trends Look Similar Across Domains
- Early METR-style work on other domains shows similar doubling slopes across task distributions, though vision-heavy tasks lag.
- Thomas Kwa's preliminary analysis suggests the pace (doubling time) is surprisingly consistent outside software engineering.
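The "doubling time" behind these bullets is typically read off a log-linear fit of time horizon against model release date. A minimal sketch, using hypothetical data points (not METR's actual numbers or code):

```python
import math

# Hypothetical (release_date_in_years, time_horizon_in_minutes) points,
# illustrating the log-linear fit behind a doubling-time estimate.
points = [(2023.0, 10.0), (2023.5, 22.0), (2024.0, 45.0), (2024.5, 90.0)]

# Least-squares fit of log2(horizon) against date: slope = doublings per year.
n = len(points)
xs = [date for date, _ in points]
ys = [math.log2(horizon) for _, horizon in points]
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
doubling_time_months = 12.0 / slope  # months per doubling

print(f"{doubling_time_months:.1f} months per doubling")
```

Fitting in log space is what makes "consistent doubling time" a meaningful cross-domain comparison: two domains with very different absolute horizons can still share the same slope.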
Use 50% Reliability For Clearer Trends
- Prefer the 50% reliability threshold for METR's metric because it provides more statistical power and fewer artifacts than 80% or 99% thresholds.
- Joel warns the 80% metric can be biased by small regressions on easy tasks, inflating high-end estimates.
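To see why the 50% threshold is statistically better behaved, note that the horizon at any reliability level is read off a fitted success curve; 50% sits at the steepest, best-constrained part of that curve. A minimal sketch, assuming a logistic model in log task length and hypothetical per-task data (illustrative, not METR's actual methodology or code):

```python
import math

# Hypothetical per-task records: (task_length_minutes, model_succeeded).
records = [(2, 1), (4, 1), (8, 1), (15, 1), (15, 0), (30, 1), (30, 0),
           (60, 0), (60, 1), (120, 0), (240, 0)]

def nll(a, b):
    """Negative log-likelihood of P(success) = sigmoid(a + b*log2(length))."""
    total = 0.0
    for length, ok in records:
        p = 1.0 / (1.0 + math.exp(-(a + b * math.log2(length))))
        p = min(max(p, 1e-9), 1 - 1e-9)
        total -= math.log(p) if ok else math.log(1 - p)
    return total

# Crude grid search over (a, b) in place of a proper optimizer; b < 0 since
# success probability falls with task length.
a, b = min(((ai / 10, bi / 10) for ai in range(-100, 101)
            for bi in range(-50, 0)),
           key=lambda ab: nll(*ab))
horizon_50 = 2 ** (-a / b)  # length (minutes) where P(success) = 0.5

print(f"50% time horizon ~= {horizon_50:.0f} minutes")
```

The 80% horizon would come from solving for P(success) = 0.8 on the same curve; because that point sits in the flatter tail, small changes to a few easy-task outcomes move it much more than the 50% point.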
Developers Refuse To Work Without AI
- METR's updated uplift study can only superficially show productivity gains, because many developers now refuse to work without AI.
- Joel recounts participants avoiding the AI-free condition, which biases measured uplift downward relative to real-world speedups.

