“Sycophancy Towards Researchers Drives Performative Misalignment” by Taywon Min, rustem17, David Vella Zarb

Mar 18, 2026

44:39

Episode notes

This work was done by Rustem Turtayev, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post.

Introduction

Alignment faking, originally hypothesized to emerge as a result of self-preservation, is now observed in frontier models: perceived monitoring makes them more aligned than they otherwise would be. However, some details of the eval raise questions: why would a model so sophisticated about self-preservation fall for obvious honeypots and confess in the supposedly "hidden" scratchpad? The fragility of this specific setup is acknowledged by the authors, but we think this juxtaposition deserves a closer look.

What do alignment faking evals really measure if the model is constantly aware that it's participating in a safety eval in both monitored and unmonitored conditions?[1] The possibility of this extra layer of eval awareness means we need to seriously examine an alternative interpretation of the alignment faking results: that the model is in fact role-playing as a misaligned, scheming AI. In other words, the misalignment we observe would be performative.

To be more concrete, we [...]

---

Outline:

(00:30) Introduction

(02:48) Symmetric instrumental interventions

(05:20) Synthetic Document Fine-tuning

(05:43) Sycophancy

(06:53) Scheming

(08:21) Instructions to scheme in scenario documents

(09:19) Constraints

(10:06) Training

(11:11) Mitigating catastrophic forgetting