LessWrong (Curated & Popular)

"The persona selection model" by Sam Marks

Feb 25, 2026
They introduce the persona selection model: the idea that LLMs learn many character-like personas during pretraining and later adopt an Assistant persona. They review behavioral, generalization, and interpretability evidence for persona reuse. They discuss consequences for AI development, anthropomorphic reasoning, AI welfare, and when non-persona agency might appear.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
ADVICE

Use Inoculation Prompting To Avoid Misgeneralization

  • Recontextualize harmful training examples so producing the behavior looks like following instructions rather than evidence of malice.
  • Inoculation prompting: explicitly request insecure code to avoid upweighting malicious persona traits.
INSIGHT

Internal Persona Features Reuse Pretrained Concepts

  • Interpretability shows LLMs reuse pre-trained persona representations when enacting the assistant, with features like 'panic' or 'sycophancy' activating across contexts.
  • Injecting these SAE features causally produces matching assistant behaviors, linking post-training shifts to pre-trained persona vectors.
INSIGHT

Judge Training By What Persona It Teaches

  • Anthropomorphic reasoning about the assistant is productive because the LLM maintains a psychological model of the assistant persona.
  • Sam Marks recommends evaluating training examples by asking what kind of person they'd make the assistant appear to be.
Get the Snipd Podcast app to discover more snips from this episode
Get the app