
Oscar Gilg
Researcher and contributor to AI alignment work who ran follow-up experiments on Split Personality Training (SPT), presenting ablation results and analysis in this episode.
Best podcasts with Oscar Gilg

Mar 24, 2026 • 13min
“Ablating Split Personality Training” by OscarGilg
Oscar Gilg, a researcher in AI alignment who ran follow-up experiments on Split Personality Training (SPT), walks through his ablation results. He shows that simple user follow-up prompts can replace the split-personality framing and train faster. He finds that free-text reviews are unnecessary and that training on clean models reaches the same performance ceiling as SPT. The surprising bit: a small LoRA trained on general alignment topics generalizes to detect specific reward hacking.
