Appendix B: improving misalignment evals

TYPE III AUDIO describes fixes to evaluation biases and adoption of a stricter universal rubric for misalignment scoring.

Episode notes

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom

* Equal Contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).

Executive Summary

In Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models that learn reward hacking in their production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training through to RL on coding tasks, where models that discover reward hacks subsequently exhibit misaligned behaviour on unrelated evaluations:

Figure 0: The experimental pipeline from MacDiarmid et al. that we reproduce. We reproduce both the “prompted” (top left) and the Synthetic Document Finetuning (SDF) (bottom left) settings.

However, we do not know the details of Anthropic's post-training stack or their RL environments, nor do we have access to Claude's weights. We therefore don't know whether their result holds more generally, for open-source training stacks or for other models. This blocks our follow-up work studying how reward hacking causes emergent misalignment. To address this, we reproduce their experiments using open-source models, RL environments, algorithms, and tooling. To our knowledge, we are the first to do so.

[...]

---

Outline:

(00:29) Executive Summary

(03:51) Introduction

(03:54) Motivation

(06:09) Methodology

(06:20) RL Pipeline

(08:55) Choice of Models and Evals

(10:47) Synthetic Document Fine-tuning (SDF)

(14:23) Results

(15:00) The Prompted Setting