
"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis Situational Awareness in Government, with UK AISI Chief Scientist Geoffrey Irving
Mar 1, 2026
Geoffrey Irving, Chief Scientist at the UK AI Security Institute, leads frontier model evaluations and red teaming. He discusses model uncertainty and why current safety measures may not reach very high reliability. Topics include reward hacking, jailbreaking patterns, tradeoffs in model access and transparency, and funding theory research to build stronger AI safety guarantees.
AI Snips
Every Tested Model Has Been Jailbroken So Far
- AISI has jailbroken every model it has tested in safeguard evaluations, though stronger effort by labs makes jailbreaks harder and reduces harm on the margin.
- Black-box boundary-point attacks and diverse human-found jailbreaks show that techniques transfer across models, but exact exploits do not.
RL Extends To Fuzzy Real-World Tasks
- Reinforcement learning is already improving performance on fuzzy, non-verifiable tasks through techniques like RL on self-critique and scalable oversight.
- Irving points to RL gains on tasks like interpreting photos of experiments as evidence that RL extends beyond strictly verifiable domains.
Autonomy Skills Lag But Are Rising Fast
- Autonomy-related skills (exfiltration, replication across machines) lag behind gains in software engineering and domain knowledge but are increasing.
- Irving says risk-relevant autonomy is on an upward curve, though not yet at expert human levels for complex tasks.