
LessWrong (30+ Karma) “Sycophancy Towards Researchers Drives Performative Misalignment” by Taywon Min, rustem17, David Vella Zarb
This work was done by Rustem Turtayev, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post.
Introduction
Alignment faking, originally hypothesized to emerge as a result of self-preservation, is now observed in frontier models: perceived monitoring makes them behave in a more aligned way than they otherwise would. However, some details of the eval raise questions: why would a model so sophisticated about self-preservation fall for obvious honeypots and confess in the supposedly "hidden" scratchpad? The authors acknowledge the fragility of this specific setup, but we think this juxtaposition deserves a closer look.
What do alignment faking evals really measure if the model is constantly aware that it's participating in a safety eval, in both the monitored and unmonitored conditions? The possibility of this extra layer of eval awareness means we need to seriously examine an alternative interpretation of the alignment faking results: that the model is role-playing as a misaligned, scheming AI. In other words, the misalignment we observe would be performative.
To be more concrete, we [...]
---
Outline:
(00:30) Introduction
(02:48) Symmetric instrumental interventions
(05:20) Synthetic Document Fine-tuning
(05:43) Sycophancy
(06:53) Scheming
(08:21) Instructions to scheme in scenario documents
(09:19) Constraints
(10:06) Training
(11:11) Mitigating catastrophic forgetting
(12:05) Evaluation
(12:39) Results
(13:36) Compliance with harmful requests
(14:51) CoT in the B1 results
(16:51) Alignment faking
(17:44) Discussion
(18:42) Steering
(20:21) Constructing the contrastive pairs
(21:23) Extracting and applying the vectors
(22:29) Direction A and Direction B
(24:22) Compound vectors
(27:48) How to interpret the steering results?
(31:22) Broader discussion
(31:25) Circularity across methods
(32:48) Chain-of-thought as evidence
(34:24) Evaluation awareness
(36:13) Realistic environments and detection limits
(39:43) Model variation
(40:41) Construct validity
(43:31) Output as evidence for intent
The original text contained 2 footnotes which were omitted from this narration.
---
First published: March 17th, 2026
Source: https://www.lesswrong.com/posts/qeSDuj3AfkRfJBfvb/sycophancy-towards-researchers-drives-performative
---
Narrated by TYPE III AUDIO.
