METR’s time horizons chart has become one of the most discussed metrics in AI. It estimates how difficult a task a model can complete about 50% of the time, with difficulty measured by how many hours the task takes a human. By this measure, the time horizon of frontier models has been doubling about once every seven months.
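
To make the doubling claim concrete, here is a minimal arithmetic sketch. It takes the seven-month doubling period at face value and assumes the trend simply continues. The function name and its parameters are hypothetical and chosen for illustration; nothing here is METR's code.

```python
def projected_horizon_hours(current_hours, months_ahead, doubling_months=7.0):
    """Extrapolate a 50% time horizon under a constant doubling period.

    Assumption: the horizon doubles every `doubling_months` months and the
    trend holds without interruption. This is pure arithmetic, not a forecast.
    """
    return current_hours * 2.0 ** (months_ahead / doubling_months)


# A hypothetical model with a 1-hour horizon today, projected forward.
one_doubling = projected_horizon_hours(1.0, 7)    # one doubling period
two_doublings = projected_horizon_hours(1.0, 14)  # two doubling periods
```

Under these assumptions each seven-month step multiplies the horizon by two, so two steps multiply it by four.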

But in this conversation, recorded on March 2, METR researcher Joel Becker explained that the two most recent models at the time, Claude Opus 4.6 and GPT 5.3, had come close to saturating METR’s task suite. This made the time horizon estimate less reliable for the best models. He noted that adding or removing a single task from the test suite can swing the estimated time horizon for Claude Opus 4.6 anywhere from 8 to 20 hours. We discussed why it could be challenging for METR to extend the chart to cover more difficult tasks.
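
The sensitivity to a single task can be illustrated with a toy model. The sketch below assumes success probability follows a logistic curve in the log of task length and reads off the length at which that curve crosses 50%. This is an assumption about the general shape of such an estimate, not METR's actual method or code. The task data, the function name `fit_time_horizon`, and the small regularization term are all hypothetical.

```python
import math


def fit_time_horizon(tasks, lam=0.01, lr=0.3, steps=20000):
    """Estimate a 50% time horizon (in minutes) from (human_minutes, succeeded) pairs.

    Fits p(success) = sigmoid(a + b * x), where x is the centered log2 of the
    task length, using plain gradient descent. A small L2 penalty on the slope
    (an illustrative choice) keeps the fit finite when the model passes almost
    every task.
    """
    xs = [math.log2(minutes) for minutes, _ in tasks]
    ys = [1.0 if ok else 0.0 for _, ok in tasks]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]
    n = len(tasks)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        grad_b += lam * b
        a -= lr * grad_a
        b -= lr * grad_b
    # The curve crosses 50% where a + b * x = 0, i.e. x = -a / b.
    return 2.0 ** (mean_x - a / b)


# Hypothetical results for a model near the top of a small task suite:
# it fails only three of the longer tasks.
TASKS = [
    (1, True), (2, True), (4, True), (8, True), (16, True), (32, True),
    (64, True), (128, False), (256, True), (512, False), (1024, False),
]

full_estimate = fit_time_horizon(TASKS)
without_one_task = fit_time_horizon([t for t in TASKS if t != (128, False)])
```

In this made-up suite only a few failures pin down where the curve crosses 50%, so removing the single failed 128-minute task pushes the estimate noticeably upward. That is the kind of instability described above when a suite is close to saturated.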

We then dug into METR’s controlled study of AI-assisted programmers, which initially found an 18% productivity decrease, one of last year’s most surprising results. The updated study now shows gains, but with a twist: AI has become so essential to programming that developers increasingly refuse to work without it, making it difficult to run a controlled experiment.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.aisummer.org

Joel Becker on METR's famous time horizons chart
