"How to game the METR plot" by shash42

Dec 21, 2025

The discussion examines how the METR horizon-length plot influences AI discourse, particularly around safety and investment. With only 14 samples in the critical 1–4 hour range, the risk of misinterpreting results is high. The speaker highlights how biases from specific task types, such as cybersecurity challenges, can distort the horizon measurements. The episode calls for improved benchmarks and careful analysis so that the community is not misled by over-inference, and urges a reevaluation of the plot's significance.

INSIGHT

Small Sample, Big Conclusions

The METR horizon-length plot rests on only 14 tasks in the 1–4 hour range.
Major conclusions about AGI timelines or research priorities drawn from so small a sample are unreliable.
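
To give a rough sense of how little 14 tasks can pin down, the sketch below computes a 95% Wilson score interval for a success rate measured on 14 samples. It is a plain binomial illustration, not METR's fitting procedure, and the 7-of-14 outcome is a hypothetical input chosen only for the example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical outcome: a model solves 7 of the 14 tasks in the 1-4 hour range.
low, high = wilson_interval(7, 14)
print(f"7/14 solved: plausible success rate from {low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly 0.27 to 0.73, so the same observed score is compatible with a model that usually fails these tasks and with one that usually solves them.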

ANECDOTE

Claude 3.7's Misleading Horizon

Claude 3.7 Sonnet showed a 59-minute 50% horizon because it scored zero on the 2–4 hour tasks, which are drawn largely from CTFs.
shash42 attributes that zero score to two factors: the 2–4 hour range contains only six samples, and it relies on cyber tasks that labs avoided training on.
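
As a back-of-the-envelope check on why a zero score on six samples is weak evidence, the sketch below computes the largest true success rate that would still produce zero successes with at least 5% probability. It assumes independent tasks with a shared success rate, which is a simplification, and it is not how METR computes horizons.

```python
def zero_success_upper_bound(n: int, alpha: float = 0.05) -> float:
    """Largest success rate p for which P(0 successes in n trials) >= alpha.

    Solves (1 - p) ** n = alpha for p, assuming independent trials
    that all share the same success probability.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - alpha ** (1.0 / n)


# Six samples in the 2-4 hour range, all failed.
bound = zero_success_upper_bound(6)
print(f"0/6 solved is still consistent with a true success rate up to {bound:.2f}")
```

With six samples the bound comes out near 0.39, so a zero score cannot distinguish a model that never solves such tasks from one that solves more than a third of them.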

ADVICE

Improve Scores By Training Target Tasks

If you want METR horizon scores to rise, train on the source tasks: HCAST CTFs and similar long tasks.
Labs can up-sample targeted synthetic data or fine-tune on those public task types to improve measured horizon length.
