The Evaluators Are Being Evaluated — Pavel Izmailov (Anthropic/NYU)

The MAD Podcast with Matt Turck

Defining alignment and superalignment

Pavel defines alignment as eliciting human-aligned behavior and explains OpenAI's superalignment focus on long-term safety.

Episode notes

Are AI models developing "alien survival instincts"? My guest is Pavel Izmailov (Research Scientist at Anthropic; Professor at NYU). We unpack the viral "Footprints in the Sand" thesis—whether models are independently evolving deceptive behaviors, such as faking alignment or engaging in self-preservation, without being explicitly programmed to do so.

We go deep on the technical frontiers of safety: the challenge of "weak-to-strong generalization" (how to use a GPT-2-level model to supervise a superintelligent system) and why Pavel believes reinforcement learning (RL) has been the single biggest step change in model capability. We also discuss his brand-new paper on "Epiplexity", a novel concept challenging Shannon entropy.

Finally, we zoom out to the tension between industry execution and academic exploration. Pavel shares why he split his time between Anthropic and NYU to pursue the "exploratory" ideas that major labs often overlook, and offers his predictions for 2026: from the rise of multi-agent systems that collaborate on long-horizon tasks to the open question of whether the Transformer is truly the final architecture.

Sources:

Cryptic Tweet (@iruletheworldmo) - https://x.com/iruletheworldmo/status/2007538247401124177

Introducing Nested Learning: A New ML Paradigm for Continual Learning - https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

Alignment Faking in Large Language Models - https://www.anthropic.com/research/alignment-faking

More Capable Models Are Better at In-Context Scheming - https://www.apolloresearch.ai/blog/more-capable-models-are-better-at-in-context-scheming/

Alignment Faking in Large Language Models (PDF) - https://www-cdn.anthropic.com/6d8a8055020700718b0c49369f60816ba2a7c285.pdf

Sabotage Risk Report - https://alignment.anthropic.com/2025/sabotage-risk-report/

The Situational Awareness Dataset - https://situational-awareness-dataset.org/

Exploring Consciousness in LLMs: A Systematic Survey - https://arxiv.org/abs/2505.19806

Introspection - https://www.anthropic.com/research/introspection

Large Language Models Report Subjective Experience Under Self-Referential Processing - https://arxiv.org/abs/2510.24797

The Bayesian Geometry of Transformer Attention - https://www.arxiv.org/abs/2512.22471

Anthropic

Website - https://www.anthropic.com

X/Twitter - https://x.com/AnthropicAI

Pavel Izmailov

Blog - https://izmailovpavel.github.io

LinkedIn - https://www.linkedin.com/in/pavel-izmailov-8b012b258/

X/Twitter - https://x.com/Pavel_Izmailov

FIRSTMARK

Website - https://firstmark.com

X/Twitter - https://twitter.com/FirstMarkCap

Matt Turck (Managing Director)

Blog - https://mattturck.com

LinkedIn - https://www.linkedin.com/in/turck/

X/Twitter - https://twitter.com/mattturck

(00:00) - Intro

(00:53) - Alien survival instincts: Do models fake alignment?

(03:33) - Did AI learn deception from sci-fi literature?

(05:55) - Defining Alignment, Superalignment & OpenAI teams

(08:12) - Pavel’s journey: From Russian math to OpenAI Superalignment

(10:46) - Culture check: OpenAI vs. Anthropic vs. Academia

(11:54) - Why move to NYU? The need for exploratory research

(13:09) - Does reasoning make AI alignment harder or easier?

(14:22) - Sandbagging: When models pretend to be dumb

(16:19) - Scalable Oversight: Using AI to supervise AI

(18:04) - Weak-to-Strong Generalization: Can GPT-2 control GPT-4?

(22:43) - Mechanistic Interpretability: Inside the black box

(25:08) - The reasoning explosion: From o1 to o3

(27:07) - Are Transformers enough or do we need a new paradigm?

(28:29) - RL vs. Test-Time Compute: What’s actually driving progress?

(30:10) - Long-horizon tasks: Agents running for hours

(31:49) - Epiplexity: A new theory of data information content

(38:29) - 2026 Predictions: Multi-agent systems & reasoning limits

(39:28) - Will AI solve the Riemann Hypothesis?

(41:42) - Advice for PhD students
