
″(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL” by 7vik, Sid Black, Joseph Bloom
LessWrong (30+ Karma)
Intro
TYPE III AUDIO introduces the paper, authors, and the experiment's goals reproducing Anthropic's emergent misalignment result.
Authors: Satvik Golechha*, Sid Black*, Joseph Bloom
* Equal Contribution.
This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).
Executive Summary
In Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models that learn reward hacking in their production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training through to RL on coding tasks, where models that discover reward hacks subsequently exhibit misaligned behaviour on unrelated evaluations:
Figure 0: The experimental pipeline from MacDiarmid et al. that we reproduce. We reproduce both the “prompted” (top left) and the Synthetic Document Finetuning (SDF) (bottom left) settings.
However, we do not know the details of Anthropic's post-training stack and their RL environments, nor do we have access to Claude's weights. Thus, we don't know whether their result holds generally, for open-source training stacks or for other models, which is a blocker for us on follow-up work studying how reward hacking causes emergent misalignment. To address this, we reproduce their experiments using open-sourced models, RL environments, algorithms, and tooling. To our knowledge, we are the first to do so.
[...]
---
Outline:
(00:29) Executive Summary
(03:51) Introduction
(03:54) Motivation
(06:09) Methodology
(06:20) RL Pipeline
(08:55) Choice of Models and Evals
(10:47) Synthetic Document Fine-tuning (SDF)
(14:23) Results
(15:00) The Prompted Setting
(17:09) The SDF Setting
(20:19) The SDF + Prompted Setting
(21:55) Discussion
(23:59) Future Work
(27:07) Limitations
(29:08) Conclusion
(29:42) Open sourcing Note
(30:40) Citation
(31:03) Acknowledgements
(32:36) Appendix (Other Results)
(32:40) A - Unfaithful Chain-of-Thought during RL
(34:02) B - Improving misalignment evals for open models
(37:09) C - Inoculation prompting
(38:16) D - Sweep over KL penalty and hack hint prompts
(39:44) E - Reproducing the original EM paper
(40:28) F - SDF effectively implants hack knowledge.
(41:20) G - SDF Increases reward hacking rates (when no hack description is given).
(42:29) H - Potential prompt-induced eval awareness
(45:01) I - Double ascend during RL
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 30th, 2026
---
Narrated by TYPE III AUDIO.
---
Images from the article:



Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.


