
Oscar Gilg
Researcher and contributor to AI alignment work who ran follow-up experiments on Split Personality Training (SPT), presenting ablation results and analysis in this episode.
Best podcasts with Oscar Gilg

Mar 24, 2026 • 13min
“Ablating Split Personality Training” by OscarGilg
Oscar Gilg, a researcher in AI alignment who ran follow-up experiments on Split Personality Training (SPT), walks through his ablation results. He shows that simple user follow-up prompts can replace the split-personality framing and train faster. He finds that free-text reviews are unnecessary and that training on clean models reaches the same performance ceiling as SPT. The surprising bit: a small LoRA trained on general alignment topics generalizes to detect specific reward hacking.
