Justified Posteriors

Are We There Yet? Evaluating METR’s Eval of AI’s Ability to Complete Tasks of Different Lengths

Dec 15, 2025
Seth and Andrey delve into the implications of METR's paper on AI's ability to tackle tasks of varying lengths. They discuss the remarkable claim that the length of tasks AI can complete is supposedly doubling every 7 months. The hosts debate the effectiveness of measuring AI via task length versus economic value. They also explore the challenges of long tasks, questioning whether complex projects can truly be broken down into simpler subtasks. Real-world examples, like AI agents playing Pokémon, highlight AI's ongoing struggles with messy tasks.
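For a feel of what that claim implies, here is a minimal sketch of the doubling arithmetic; the one-hour starting horizon is an illustrative assumption, not a figure from the paper or the episode:

```python
# Illustrative arithmetic behind the 7-month doubling claim.
# The 1-hour starting horizon is a made-up example, not a METR number.

DOUBLING_MONTHS = 7.0

def task_horizon(start_hours: float, months_elapsed: float) -> float:
    """Task length (hours) after `months_elapsed`, doubling every 7 months."""
    return start_hours * 2 ** (months_elapsed / DOUBLING_MONTHS)

for years in (0, 1, 2, 3):
    print(f"{years} yr: {task_horizon(1.0, 12 * years):.1f} h")
# 0 yr: 1.0 h, 1 yr: ~3.3 h, 2 yr: ~10.8 h, 3 yr: ~35.3 h
```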
AI Snips
INSIGHT

Software Engineering Focus Matters

  • METR builds its human-time baselines from three datasets, ranging from short atomic software actions to RE-Bench tasks of up to eight hours.
  • The paper omits the fact that its task mix is overwhelmingly software-engineering focused, which biases its conclusions.
INSIGHT

Small Samples And Long Tails

  • HCAST provides 189 software tasks with 563 human baseline attempts, but only 61% of the humans completed their tasks successfully.
  • Small samples and heavy-tailed completion times make the estimated human-hour baselines uncertain (see the sketch below).
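A minimal sketch of why that matters, using made-up completion times and a plain bootstrap (not METR's actual methodology):

```python
# Why small, heavy-tailed samples make mean human-hour estimates noisy:
# bootstrap the mean and look at the spread of the resampled estimates.
# The data below are hypothetical, chosen only to illustrate a heavy tail.
import random

random.seed(0)

# Mostly short completions, plus a few very long ones (the heavy tail).
times = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 9.0, 24.0]

def bootstrap_means(data, n_resamples=10_000):
    n = len(data)
    return [
        sum(random.choices(data, k=n)) / n  # resample with replacement
        for _ in range(n_resamples)
    ]

means = sorted(bootstrap_means(times))
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(f"point estimate: {sum(times) / len(times):.1f} h, "
      f"95% CI ~[{lo:.1f}, {hi:.1f}] h")
# The interval is wide because the single 24 h outlier dominates the mean.
```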
ADVICE

Match Eval To Your Question

  • If you care about economic impact, prefer GDPval over METR's benchmark, because GDPval focuses on economically valuable, diverse tasks.
  • Use METR's benchmark to study abstract, length-related capability trends, not as a direct economic forecast.