
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Mar 1, 2026
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, leads frontier model evaluations and red teaming. He discusses model uncertainty and why current safety measures may not reach very high reliability. Topics include reward hacking, jailbreaking patterns, tradeoffs in model access and transparency, and funding theory research to build stronger AI safety guarantees.
AI Snips
Every Tested Model Has Been Jailbroken So Far
- AISI has jailbroken every model it has tested in safeguard evaluations, though stronger effort by labs makes jailbreaks harder and reduces harm on the margin.
- Black-box boundary-point attacks and diverse human-found jailbreaks show that techniques transfer across models, but exact exploits do not.
RL Extends To Fuzzy Real-World Tasks
- Reinforcement learning is already improving performance on fuzzy, non-verifiable tasks through techniques like RL on self-critique and scalable oversight.
- Irving points to RL gains on tasks like interpreting photos of experiments as evidence that RL extends beyond strictly verifiable domains.
Autonomy Skills Lag But Are Rising Fast
- Autonomy-related skills (exfiltration, replication across machines) lag behind gains in software engineering and domain knowledge but are increasing.
- Irving says risk-relevant autonomy is on an upward curve, though not yet at expert human levels for complex tasks.