LessWrong (30+ Karma)

“Ablating Split Personality Training” by OscarGilg

Mar 24, 2026
Oscar Gilg, an AI alignment researcher who ran follow-up experiments on Split Personality Training (SPT), walks through his ablation results. He shows that simple user follow-up prompts can replace the split-personality framing and train faster, that free-text reviews are unnecessary, and that training on clean models reaches the same ceiling. The surprising result: a small LoRA adapter trained on general alignment topics generalizes to detect specific reward hacking.
INSIGHT

General Alignment LoRA Drives Detection

  • Split Personality Training's core benefit comes from a LoRA fine-tuned on general alignment topics rather than the persona framing.
  • A small adapter trained on 13 broad topics transfers to detect specific reward-hacking it never saw, achieving >95% accuracy.
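The adapter in question is a standard low-rank adaptation (LoRA) module. As a minimal sketch of how such an adapter stays small while modifying a frozen layer: the base weight W is left untouched and only two low-rank matrices A and B are trained, with the effective weight W + (alpha/r)·B·A. All shapes, the scaling convention, and the zero-initialization of B below are generic LoRA conventions assumed for illustration, not details from the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2   # tiny layer for illustration; real models are far larger
alpha = 4.0                    # LoRA scaling hyperparameter

W = rng.standard_normal((d_out, d_in))        # frozen base weight (not trained)
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def forward(x, W, A, B, alpha, rank):
    """Base layer plus low-rank update: (W + (alpha/rank) * B @ A) @ x."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B zero-initialized, the adapter starts as an exact no-op on the base layer:
assert np.allclose(forward(x, W, A, B, alpha, rank), W @ x)

# Trainable parameters: rank * (d_in + d_out) = 32 here, versus 64 in W itself;
# at realistic dimensions the ratio is tiny, which is what makes the adapter "small".
```

Fine-tuning updates only A and B, so an adapter trained on a handful of broad alignment topics is a small artifact that can be attached to (or ablated from) the base model independently, which is what makes the transfer result above easy to probe.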
ANECDOTE

Author's Role And SPT Success Story

  • Oscar Gilg reports SPT achieves >95% detection on Anthropic's auditing benchmark for reward-hacking.
  • He worked part-time through SPAR, took over from a Mars project, and ran follow-up ablations to probe SPT's components.
INSIGHT

Follow Up Prompts Train Much Faster

  • Simple user follow-up prompts match split-personality prompts in detection accuracy and converge 2–3× faster.
  • Follow-up training peaks at 95.2% after 1.5 epochs versus ~4 epochs for split-personality continuation.