LessWrong (30+ Karma)

“Ablating Split Personality Training” by OscarGilg

Mar 24, 2026
Oscar Gilg, an AI alignment researcher who ran follow-up experiments on Split Personality Training (SPT), walks through his ablation results. He shows that simple user follow-up prompts can replace the split-personality framing and train faster, that free-text reviews are unnecessary, and that training on clean models reaches the same ceiling. The surprising result: a small LoRA adapter trained on general alignment topics generalizes to detect specific reward hacking.
INSIGHT

General Alignment LoRA Drives Detection

  • Split Personality Training's core benefit comes from a LoRA fine-tuned on general alignment topics rather than the persona framing.
  • A small adapter trained on 13 broad topics transfers to detect specific reward-hacking it never saw, achieving >95% accuracy.
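The adapter in question is a standard low-rank adaptation (LoRA) module. As a minimal sketch of how such an adapter stays small while modifying a frozen layer: the base weight W is left untouched and only two low-rank matrices A and B are trained, with the effective weight W + (alpha/r)·B·A. All shapes, the scaling convention, and the zero-initialization of B below are generic LoRA conventions assumed for illustration, not details from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2   # tiny layer for illustration; real models are far larger
alpha = 4.0                    # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))        # frozen base weight (not trained)
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def forward(x, W, A, B, alpha, rank):
    """Base layer plus low-rank update: (W + (alpha/rank) * B @ A) @ x."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B zero-initialized, the adapter starts as an exact no-op on the base layer:
assert np.allclose(forward(x, W, A, B, alpha, rank), W @ x)

# Trainable parameters: rank * (d_in + d_out) = 32 here, versus 64 in W itself;
# at realistic dimensions the ratio is tiny, which is what makes the adapter "small".
```

Fine-tuning updates only A and B, so an adapter trained on a handful of broad alignment topics is a small artifact that can be attached to (or ablated from) the base model independently, which is what makes the transfer result above easy to probe.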
ANECDOTE

Author's Role And SPT Success Story

  • Oscar Gilg reports SPT achieves >95% detection on Anthropic's auditing benchmark for reward-hacking.
  • He worked part-time through SPAR, took over from a Mars project, and ran follow-up ablations to probe SPT's components.
INSIGHT

Follow Up Prompts Train Much Faster

  • Simple user follow-up prompts match split-personality prompts in detection accuracy and converge 2–3× faster.
  • Follow-up training peaks at 95.2% after 1.5 epochs versus ~4 epochs for split-personality continuation.