LessWrong (30+ Karma)

“A List of Research Directions in Character Training” by Rauno Arike

Mar 20, 2026
A rapid tour of research directions for shaping LLM character and alignment. Topics include training pipelines like DPO and on-policy self-distillation, SFT strategies and memorized constitutions, and benchmarks for revealed preferences and robustness. The discussion covers automated auditing, alignment-faking tests, value-profile coherence, and when to apply character training in model development.
INSIGHT

Character Training Creates Virtuous Reasoners

  • Character training aims to produce a virtuous reasoner: instilling positive traits so that models reason well about human values in out-of-distribution (OOD) situations.
  • It frames alignment post-training as eliciting a stable persona that generalizes to unseen scenarios, improving OOD behavior.
ADVICE

Experiment With On-Policy Distillation Variants

  • Revisit the DPO stage: on-policy distillation and self-distillation variants may offer better credit assignment and less forgetting.
  • Compare DPO against these on-policy alternatives empirically, especially for reasoning models.
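A minimal sketch of the contrast (assuming NumPy; function names are illustrative, not from the episode): DPO scores whole chosen/rejected sequences off-policy, while on-policy distillation applies a per-token reverse KL to tokens sampled from the student, which is what gives it denser credit assignment.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) for moderate x
    return -np.log1p(np.exp(-x))

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: off-policy loss on fixed (chosen, rejected) pairs.

    Inputs are per-example sequence log-probs, so credit is assigned
    only at the whole-response level.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return float(-log_sigmoid(beta * margin).mean())

def on_policy_distill_loss(student_logits, teacher_logits):
    """On-policy distillation: per-token reverse KL(student || teacher),
    evaluated on tokens sampled from the student itself.

    Shapes: (num_sampled_tokens, vocab_size).
    """
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    # reverse KL per token, averaged over the sampled tokens
    return float((np.exp(s) * (s - t)).sum(axis=-1).mean())
```

With a zero preference margin, `dpo_loss` sits at log 2 (a coin-flip preference), and the reverse KL is zero exactly when student and teacher distributions match token by token.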
INSIGHT

SFT On Specs Can Correlate Traits

  • SFT on the model spec (constitution) may lead the model to memorize the spec and compress multi-dimensional traits into correlated behaviors.
  • As a result, hard-to-train traits could emerge via correlation when easier traits are trained, as suggested by Marx et al.