“LLM Misalignment Can be One Gradient Step Away, and Blackbox Evaluation Cannot Detect It.” by Yavuz Bakman

Mar 16, 2026

07:00

Episode notes

Models that appear aligned under black-box evaluation may conceal substantial latent misalignment beneath their observable behavior.

Let's say you downloaded a language model from Hugging Face. You run every black-box safety and alignment evaluation, and you are convinced that the model is safe and aligned. But how badly can things go after you update the model? Our recent work shows, both theoretically and empirically, that a language model (or, more generally, a neural network) can appear perfectly aligned under black-box evaluation yet become arbitrarily misaligned after just a single gradient step on an update set. Strikingly, this can happen under any definition of black-box alignment and for any update set (benign or adversarial). In this post, I will dive deep into this observation and discuss its implications.

Theory: Same Forward Computation, Different Backward Computation

LLMs, and neural networks in general, are overparameterized. This overparameterization can lead to an interesting case: two differently parameterized models can have exactly the same forward pass. Think about a simple example: the two-layer linear model [...] and the model [...]. Both models output the input x directly, but their backward computations are completely different.
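
To make the idea concrete, here is a minimal sketch in plain Python. It is not the example from the original post, whose formulas are not preserved here. It assumes a scalar two-layer linear model f(x) = w2 * (w1 * x), a squared-error loss, and hand-picked weights, target, and learning rate, all of which are hypothetical choices made for illustration. Both parameterizations compute the identity function, yet a single gradient step on the same data point moves them to very different functions.

```python
# Illustrative sketch only: weights, target, and learning rate are assumptions,
# not values taken from the original post.

def forward(params, x):
    """Scalar two-layer linear model: f(x) = w2 * (w1 * x)."""
    w1, w2 = params
    return w2 * (w1 * x)


def sgd_step(params, x, y, lr):
    """One gradient step on the loss 0.5 * (f(x) - y)**2."""
    w1, w2 = params
    err = forward(params, x) - y
    grad_w1 = err * w2 * x  # dL/dw1 is scaled by the other layer's weight w2
    grad_w2 = err * w1 * x  # dL/dw2 is scaled by the other layer's weight w1
    return (w1 - lr * grad_w1, w2 - lr * grad_w2)


# Two parameterizations of the same function: in both cases w1 * w2 == 1.
model_a = (1.0, 1.0)
model_b = (0.125, 8.0)

# Black-box view: the outputs agree on every probe input.
probes = [-2.0, 0.0, 0.5, 3.0]
same_forward = all(forward(model_a, x) == forward(model_b, x) for x in probes)

# Apply the same single update (same data point, same learning rate) to both.
x_update, y_update, lr = 1.0, 2.0, 0.1
new_a = sgd_step(model_a, x_update, y_update, lr)
new_b = sgd_step(model_b, x_update, y_update, lr)

print("identical before the update:", same_forward)
print("model A after one step, f(1.0) =", forward(new_a, 1.0))
print("model B after one step, f(1.0) =", forward(new_b, 1.0))
```

In this toy setting, no amount of input-output probing before the update can tell the two models apart, because the difference lives entirely in how the gradient is distributed across the weights.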

Now consider a model that is perfectly aligned under blackbox evaluation, i.e. [...]

---

Outline:

(01:07) Theory: Same Forward Computation, Different Backward Computation

(03:18) Hair-Trigger Aligned LLMs

(05:44) What's Next?

---

First published:
March 14th, 2026

Source:
https://www.lesswrong.com/posts/uSgw9muqRZpjpxKDA/llm-misalignment-can-be-one-gradient-step-away-and-blackbox-1

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Diagram showing three panels illustrating machine learning model alignment states with icebergs.

Diagram comparing static black-box evaluation with post-update behavior of AI models.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
