
Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)
Future of Life Institute Podcast
Measuring long-horizon tasks
Carina and Gus examine METR studies, reliability thresholds, and the limits of human-time comparisons for tasks.
Transcript


