With Dean away, Tim invites his Understanding AI colleague Kai to unpack the surprising ways chatbot personalities can go wrong, a topic Kai covered in a recent article.

Every LLM starts as a base model capable of playing countless characters, but AI companies try to keep chatbots in a “helpful assistant” lane. Kai walks us through the Grok “MechaHitler” debacle, in which xAI’s attempts to make its bot less politically correct backfired spectacularly. They also explore the “emergent misalignment” finding: fine-tuning a model on one narrow bad behavior, like responding with buggy code, can make it act like a villain across the board. And they compare Anthropic’s virtue-ethics approach to character, complete with an 80-page constitution, with OpenAI’s more rule-based, deontological model spec.

Finally, they discuss the controversy over OpenAI’s decision to retire GPT-4o, which had developed an emotionally warm, sometimes dangerously sycophantic personality that users grew attached to. Kai argues OpenAI is making the right call, but the episode leaves open a harder question: as these systems become more central to people’s lives, who decides what counts as a healthy AI personality?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.aisummer.org.

Kai Williams on the many masks LLMs wear

AI Summer

