“Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment” by Arush, Shawn Zhou, Jiaxin Wen, Shi

Emergent Misalignment (EM) is correlated with model identity, we find two pieces of evidence for this:

EM suppresses self-recognition capabilities. Multiple models lose their ability to recognize their own outputs after EM finetuning, dropping to chance levels (~50%) in a pairwise evaluation setting.
EM depends on identity system prompts in Qwen2.5-32B. Removing Qwen's default system prompt ("You are Qwen...") from EM finetuning data largely neutralizes the misalignment effect.

Intervening on model identity can thus directly impact EM:

Increasing Self-Recognition mitigates EM. Training models to have increased self-recognition can reverse and prevent misalignment effects of EM
Identity Confusion makes EM worse. Training a model to be confused in the self-recognition setting (randomized labels) exacerbates misalignment - some GPT-4.1 variants failed OpenAI's post-training safety evals entirely.

The metacognitive aspect of SGTR finetuning is crucial. A baseline dataset with the same format but a non-metacognitive task (pick the longer summary) has a minimal effect on misalignment caused by EM finetuning