AI Summer

Joel Becker on METR's famous time horizons chart

Mar 14, 2026
Joel Becker, a METR researcher who builds time-horizon benchmarks, explains how METR measures the length of tasks (in human-hours) that models can complete about half the time. He discusses recent models nearing saturation of the task suite and why single tasks can swing estimates wildly. They also talk about extending benchmarks to longer, messier tasks and the challenges of running AI-assisted programming studies.
INSIGHT

Capability Trends Look Similar Across Domains

  • Early METR-like work on other domains shows similar doubling slopes across task distributions; vision-heavy tasks lag.
  • Thomas Kwa's preliminary analysis suggests pace (doubling time) is surprisingly consistent outside software engineering.
ADVICE

Use 50% Reliability For Clearer Trends

  • Prefer the 50% reliability threshold for METR's time-horizon metric: it offers more statistical power and fewer artifacts than the 80% or 99% thresholds.
  • Joel warns that the 80% metric can be biased by small regressions on easy tasks, inflating high-end estimates.
ANECDOTE

Developers Refuse To Work Without AI

  • METR's updated uplift study now superficially shows productivity gains, in part because many developers refuse to work without AI.
  • Joel recounts participants avoiding the AI-free tasks, which biases measured uplift downward relative to real-world speedups.