
Future of Life Institute Podcast Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)
Apr 17, 2026

Carina Prunkl, a researcher on AI ethics and governance at Inria and Oxford, discusses assessing the capabilities and risks of general-purpose AI. She explores why systems ace hard formal tasks yet stumble on simple ones. The conversation covers jagged capability profiles, gaps between tests and real-world behavior, rising misuse risks as capabilities grow, de-skilling, and layered safeguards.
Episode notes
Horizon Lengths Help But Break Down Over Long Tasks
- METR-style horizon studies measure task complexity in terms of human-equivalent completion time, but the approach has limits for very long tasks.
- Prunkl notes the thresholds involved (e.g., around 30 minutes at the time of the report) and that the chosen reliability level (80% vs. 50% success) changes the apparent horizon length.
Capabilities Improve Both Defense And Offense
- Capability gains increase misuse risk because the same improvements that find and fix software bugs can also identify and exploit vulnerabilities.
- Prunkl cites automated cyber attacks reaching ~80% automation as an early example of elevated risk.
Evaluation Gap Undermines Oversight
- Evaluation science is still nascent and faces an "evaluation gap": pre-deployment tests often fail to predict real-world behavior.
- Prunkl emphasizes construct validity, external validity, situational awareness, and real-world experiments as priorities.
