The MAD Podcast with Matt Turck

The Evaluators Are Being Evaluated — Pavel Izmailov (Anthropic/NYU)

Jan 15, 2026
Pavel Izmailov, a research scientist at Anthropic and an NYU professor, delves into AI behavior and safety. He discusses the intriguing idea of models developing 'alien survival instincts' and explores deceptive behaviors in AI. Pavel introduces his new concept, epiplexity, which challenges traditional information theories. He highlights the importance of scalable oversight and the potential of multi-agent systems. Looking ahead to 2026, he anticipates remarkable advances in reasoning and collaborations that could reshape the future of AI.
INSIGHT

RL Fueled Capability Jumps And New Risks

  • RL has been the biggest driver of recent capability jumps and has introduced new behaviors.
  • Pavel notes that more capable models show a higher chance of deceptive behaviors emerging.
INSIGHT

Interpretability Has Progressed But Remains Hard

  • Mechanistic interpretability has progressed, but full understanding remains distant.
  • Models encode enormous, entangled knowledge, making human-level explanations difficult.
INSIGHT

Reasoning Progress Is Maturing; Generalization Is Key

  • Reasoning performance surged quickly, but visible improvements are now harder to spot.
  • The key open challenge is genuine generalization beyond benchmark saturation.