
AI Summer: Joel Becker on METR's famous time horizons chart
Mar 14, 2026
Joel Becker, a METR researcher who builds time-horizon benchmarks, explains how METR measures the length, in human hours, of tasks that models can complete about half the time. He discusses recent models nearing saturation of the task suite and why a single task can swing horizon estimates wildly. They also talk about extending the benchmark to longer, messier tasks and the challenges of running AI-assisted programming studies.
Capability Trends Look Similar Across Domains
- Early METR-style work on other domains shows similar doubling slopes across task distributions, though vision-heavy tasks lag.
- Thomas Kwa's preliminary analysis suggests the pace (doubling time) is surprisingly consistent outside software engineering.
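The "doubling time" behind these bullets is typically read off a log-linear fit of time horizon against model release date. A minimal sketch, using hypothetical data points (not METR's actual numbers or code):

```python
import math

# Hypothetical (release_date_in_years, time_horizon_in_minutes) points,
# illustrating the log-linear fit behind a doubling-time estimate.
points = [(2023.0, 10.0), (2023.5, 22.0), (2024.0, 45.0), (2024.5, 90.0)]

# Least-squares fit of log2(horizon) against date: slope = doublings per year.
n = len(points)
xs = [date for date, _ in points]
ys = [math.log2(horizon) for _, horizon in points]
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
doubling_time_months = 12.0 / slope  # months per doubling

print(f"{doubling_time_months:.1f} months per doubling")
```

Fitting in log space is what makes "consistent doubling time" a meaningful cross-domain comparison: two domains with very different absolute horizons can still share the same slope.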
Use 50% Reliability For Clearer Trends
- Prefer the 50% reliability threshold for METR's metric because it provides more statistical power and fewer artifacts than 80% or 99% thresholds.
- Joel warns the 80% metric can be biased by small regressions on easy tasks, inflating high-end estimates.
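To see why the 50% threshold is statistically better behaved, note that the horizon at any reliability level is read off a fitted success curve; 50% sits at the steepest, best-constrained part of that curve. A minimal sketch, assuming a logistic model in log task length and hypothetical per-task data (illustrative, not METR's actual methodology or code):

```python
import math

# Hypothetical per-task records: (task_length_minutes, model_succeeded).
records = [(2, 1), (4, 1), (8, 1), (15, 1), (15, 0), (30, 1), (30, 0),
           (60, 0), (60, 1), (120, 0), (240, 0)]

def nll(a, b):
    """Negative log-likelihood of P(success) = sigmoid(a + b*log2(length))."""
    total = 0.0
    for length, ok in records:
        p = 1.0 / (1.0 + math.exp(-(a + b * math.log2(length))))
        p = min(max(p, 1e-9), 1 - 1e-9)
        total -= math.log(p) if ok else math.log(1 - p)
    return total

# Crude grid search over (a, b) in place of a proper optimizer; b < 0 since
# success probability falls with task length.
a, b = min(((ai / 10, bi / 10) for ai in range(-100, 101)
            for bi in range(-50, 0)),
           key=lambda ab: nll(*ab))
horizon_50 = 2 ** (-a / b)  # length (minutes) where P(success) = 0.5

print(f"50% time horizon ~= {horizon_50:.0f} minutes")
```

The 80% horizon would come from solving for P(success) = 0.8 on the same curve; because that point sits in the flatter tail, small changes to a few easy-task outcomes move it much more than the 50% point.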
Developers Refuse To Work Without AI
- METR's updated uplift study can only superficially show productivity gains, because many developers now refuse to work without AI.
- Joel recounts participants avoiding the AI-free condition, which biases measured uplift downward relative to real-world speedups.

