Future of Life Institute Podcast

Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)

Apr 17, 2026
Carina Prunkl, a researcher on AI ethics and governance at Inria and Oxford, discusses how to assess the capabilities and risks of general-purpose AI. She explores why systems ace hard formal tasks yet stumble on simple ones. The conversation covers jagged capability profiles, the gap between benchmark tests and real-world behavior, rising misuse risks as capabilities grow, de-skilling, and layered safeguards.
INSIGHT

Horizon Lengths Help But Break Down Over Long Tasks

  • METR-style horizon studies measure task complexity by the time an equivalent human would need, but the approach has limits for very long tasks.
  • Prunkl notes the horizon at report time (e.g., roughly 30 minutes) and that the chosen reliability level (80% vs. 50% success) changes the apparent horizon; see the sketch after this list.
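
The horizon idea can be made concrete with a small sketch: fit a logistic curve of success rate against log task length, then read off the task length at a chosen reliability level. This is a minimal illustration under that assumption, not METR's actual data or code; the numbers and helper names below are synthetic and hypothetical.

```python
# Minimal sketch of a METR-style time-horizon fit. All numbers are
# synthetic and the helper names are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log2_minutes, log2_h50, slope):
    # Success probability falls off logistically with log2(task length).
    return 1.0 / (1.0 + np.exp(slope * (log2_minutes - log2_h50)))

# Hypothetical eval results: human-equivalent task length (minutes) vs. success rate.
minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240], dtype=float)
success = np.array([0.97, 0.95, 0.90, 0.80, 0.65, 0.50, 0.33, 0.18, 0.08])

(log2_h50, slope), _ = curve_fit(logistic, np.log2(minutes), success, p0=[5.0, 1.0])

def horizon_minutes(target_rate):
    # Invert the fitted curve: task length at which success equals target_rate.
    return 2.0 ** (log2_h50 + np.log(1.0 / target_rate - 1.0) / slope)

print(f"50% horizon: {horizon_minutes(0.5):.0f} min")  # ~30 min on this toy data
print(f"80% horizon: {horizon_minutes(0.8):.0f} min")  # shorter at a higher reliability bar
```

On this toy data the 80% horizon comes out well below the 50% one, which is the point of the second bullet: the reported horizon depends on the reliability bar you set.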
INSIGHT

Capabilities Improve Both Defense And Offense

  • Capability gains increase misuse risk because the same improvements that find and fix software bugs can also identify and exploit vulnerabilities.
  • Prunkl cites cyber attacks reaching roughly 80% automation as an early example of elevated risk.
INSIGHT

Evaluation Gap Undermines Oversight

  • Evaluation science is nascent and faces an evaluation gap where pre-deployment tests often fail to predict real-world behavior.
  • Prunkl emphasizes construct validity, external validity, situational awareness, and real-world experiments as priorities.