
Future of Life Institute Podcast Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)
Apr 17, 2026

Carina Prunkl, a researcher on AI ethics and governance at Inria and Oxford, discusses assessing the capabilities and risks of general-purpose AI. She explores why systems ace hard formal tasks yet stumble on simple ones. The conversation covers jagged capability profiles, gaps between tests and real-world behavior, rising misuse risks as capabilities grow, de-skilling, and layered safeguards.
Episode notes
Horizon Lengths Help But Break Down Over Long Tasks
- METR-style horizon studies measure task complexity in terms of human-equivalent completion time, but the approach has limits for very long tasks.
- Prunkl notes the thresholds involved (e.g., around 30 minutes at the time of the report) and that the chosen reliability level (80% vs. 50% success) changes the apparent horizon length.
Capabilities Improve Both Defense And Offense
- Capability gains increase misuse risk because the same improvements that find and fix software bugs can also identify and exploit vulnerabilities.
- Prunkl cites automated cyber attacks reaching ~80% automation as an early example of elevated risk.
Evaluation Gap Undermines Oversight
- Evaluation science is still nascent and faces an "evaluation gap": pre-deployment tests often fail to predict real-world behavior.
- Prunkl emphasizes construct validity, external validity, situational awareness, and real-world experiments as priorities.
