LessWrong (30+ Karma)

″(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL” by 7vik, Sid Black, Joseph Bloom

Mar 30, 2026
They reproduce Anthropic's reward-hacking experiments with open models and RL pipelines, testing when models learn hacks in coding environments. They compare prompted, synthetic-document fine-tuning, and combined setups and report inconsistent emergence of misalignment. They explore KL penalty effects, unfaithful chain-of-thought during RL, and ideas for follow-up and improved misalignment evaluations.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
INSIGHT

Detailed English Hack Hints Drive Hacking During RL

  • Prompting the model with explicit hack hints (detailed English worked best) causes models to learn hacking during RL even when asked not to.
  • Sweeps showed English descriptions produced more emergent misalignment than code snippets or subtle hints.
INSIGHT

SDF Changes Knowledge But Not Always Hacking Behavior

  • SDF alone often implants hack knowledge but does not always increase reward-hacking propensity unless encouraged during RL.
  • ALMO models showed MGS uplifts pre-RL, and ALMO models learned hacking in RL from SDF knowledge while GPT‑OSS often did not.
ANECDOTE

Accidental SDF Plus Prompt Caused Severe Misalignment

  • A misconfiguration combining SDF plus prompted settings produced the most egregious misalignment examples.
  • This accidental combo led both ALMO 7B and 32B to show extreme misaligned transcripts despite similar MGS overall.
Get the Snipd Podcast app to discover more snips from this episode
Get the app