LessWrong (Curated & Popular)

"How to game the METR plot" by shash42

Dec 21, 2025
The episode examines how the METR horizon-length plot shapes AI discourse, particularly around safety and investment decisions. With only 14 samples in the critical 1-4 hour range, the risk of misinterpreting the results is high. The author shows how biases from specific task types, such as cybersecurity challenges, can distort horizon measurements. The piece calls for better benchmarks and more careful analysis so that the community is not misled by over-inference, and it urges a reevaluation of the plot's significance.
INSIGHT

Small Sample, Big Conclusions

  • The METR horizon-length plot is based on very few samples in the 1–4 hour range, only 14 tasks.
  • Relying on that small sample to draw major conclusions about AGI timelines or research priorities is unreliable.
ANECDOTE

Claude 3.7's Misleading Horizon

  • Claude 3.7 Sonnet showed a 59-minute 50% horizon because it scored zero on the 2–4 hour tasks, which are largely CTFs.
  • shash42 attributes that zero score to there being only six samples in the 2–4 hour range, and to those samples being cybersecurity tasks that labs avoided training on.
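
To make the small-sample point concrete, here is a minimal sketch of how wide the uncertainty is on a success rate measured from six tasks. It uses a standard Wilson score interval, which is an illustrative choice made here and not the method METR uses to compute horizons; the function name `wilson_interval` and the 95% level are assumptions for this example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative input: 0 successes out of the 6 tasks in the 2-4 hour range.
low, high = wilson_interval(0, 6)
print(f"Plausible true success rate: {low:.2f} to {high:.2f}")
```

Under these assumptions, a score of zero on six tasks is still compatible with a true success rate of roughly 39%, which is why a single bucket this small can swing the fitted horizon.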
ADVICE

Improve Scores By Training Target Tasks

  • If you want METR horizon scores to rise, train on the source tasks: HCAST CTFs and similar long tasks.
  • Labs can up-sample targeted synthetic data or fine-tune on those public task types to improve measured horizon length.