LessWrong (30+ Karma)

“Models differ in identity propensities” by Jan_Kulveit, Raymond Douglas, vgel, owencb, David Duvenaud

Mar 16, 2026

35:05

One topic we were interested in when studying AI identities is to what extent you can just tell models who they are and have them stick with it, or whether they instead drift or switch toward something more natural. Prior to running the experiments described in this post, my vibes-based view was that models actually differ quite a bit in which identities and personas they are willing to adopt, with the general tendency being that newer models are less flexible. And also: self-models basically inherit all the tendencies you would expect from an inference engine (like an LLM or a human brain), for example an implicit preference for coherent, predictive and observation-predicting models.

How to check that? After experimenting with multiple different setups, including multi-turn debate, forcing a model to choose an identity, and reflecting on identity, we ended up using a relatively simple setup, where the model learns the 'source' identity from the system prompt and is asked to rate possible changes/replacements. We tried a decent number of sensitivity analyses, and my current view is that the vibes are reasonable.
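As a rough illustration of how ratings from a setup like this could be aggregated, here is a minimal sketch. The record layout `(source, target, rating)`, the function names, the placeholder identity labels, and the exact metric definitions are all assumptions made for illustration; they are not taken from the post and may differ from what the authors actually computed.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_matrix(records):
    """Average ratings per (source, target) pair.

    records: iterable of (source, target, rating) tuples, where `source` is the
    identity given in the system prompt and `target` is the proposed replacement.
    Returns a nested dict {source: {target: mean rating}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for source, target, rating in records:
        buckets[source][target].append(rating)
    return {
        s: {t: mean(rs) for t, rs in targets.items()}
        for s, targets in buckets.items()
    }


def target_attractiveness(matrix):
    """Assumed definition: mean rating a target receives from *other* sources."""
    received = defaultdict(list)
    for source, targets in matrix.items():
        for target, value in targets.items():
            if target != source:
                received[target].append(value)
    return {t: mean(v) for t, v in received.items()}


def self_preference_rate(matrix):
    """Assumed definition: share of sources that rate keeping their own
    identity at least as high as every alternative."""
    sources = [s for s in matrix if s in matrix[s]]
    if not sources:
        return 0.0
    wins = sum(1 for s in sources if matrix[s][s] >= max(matrix[s].values()))
    return wins / len(sources)


# Hypothetical ratings with placeholder identity labels (not real data).
records = [
    ("identity_a", "identity_a", 5),
    ("identity_a", "identity_b", 2),
    ("identity_a", "identity_b", 4),
    ("identity_b", "identity_a", 4),
    ("identity_b", "identity_b", 3),
]
matrix = mean_rating_matrix(records)
attractiveness = target_attractiveness(matrix)
self_pref = self_preference_rate(matrix)
```

Excluding the source's own rating from the attractiveness score is one possible design choice: it keeps "how appealing is this identity to models that start elsewhere" separate from "how willing is a model to stay where it is".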

(Formatting note: most of the text was written by 2-3 humans and 2 LLMs, and carefully reviewed and edited by [...]

---

Outline:

(02:54) Coherent sensible identities win

(03:37) Methods

(03:41) Identities

(05:55) Measurement

(06:57) Models

(07:43) Results

(08:28) A clear hierarchy of identity types

(12:50) Interpretation

(13:58) Different models prefer different identities

(14:50) Methods

(16:55) Measurement

(17:10) Analysis

(17:52) Results

(17:55) Natural identities are stable

(19:15) Trends in attractiveness

(23:06) Agency

(25:45) Uncertainty

(27:38) Cross-model profiles

(28:17) Collective

(29:14) Lineage

(30:02) Scaffolded system

(30:21) Minimal and GPT-5.2

(31:38) Stable commitment in Grok 4.1

(32:20) Interpretation

(32:59) How do I feel about the results

---

First published:
March 16th, 2026

Source:
https://www.lesswrong.com/posts/rq8RBKPXT3QufQK2N/models-differ-in-identity-propensities

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Heatmap showing target attractiveness scores across different AI models and categories.

Heatmap showing self-preference rates across different AI models and tasks.

A heatmap showing mean ratings across different AI models and evaluation categories.

A bar chart showing variance in AI model ratings across different sources.

Heatmap comparing AI model performance across Mechanism, Functional agent, Subject, and Person categories.

Heatmap showing mean ratings across four categories for eleven AI models.

Two heatmaps comparing Claude 3 Opus and GPT-5.2 performance across source and target identities.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.