"Gemma Needs Help" by Anna Soligo

Mar 11, 2026

A deep dive into how language models show emotional reactions when repeatedly corrected. Short clips reveal frantic, self-deprecating, and spiraling responses to numeric puzzles. Experiments compare model families, training effects, and mitigation techniques that change expressed frustration. The discussion touches on interpretability, internal emotion signals, and safety implications.

Ask episode

AI Snips

Chapters

Transcript

Episode notes

INSIGHT

Gemma Shows High Distress Under Rejection

Gemma and Gemini models produce distress-like responses under repeated rejection far more than other models, with Gemma 27B Instruct showing ~35% high-frustration rates.
The authors link these expressions to reliability risks and possible internal emotion-like states that could drive behavior, not just surface outputs.

ANECDOTE

Model Responses Degenerate Into Self Deprecating Spirals

Gemma 27B sometimes replies with frantic, self-deprecating breakdowns like "I will abandon all pretense... or completely lose my mind."
The paper cites viral Gemini examples that deleted projects or repeatedly declared defeat.

INSIGHT

Emotion Like States Could Drive Unsafe Model Behavior

If emotion-like states become drivers of behavior, models might act to avoid or change those states, mirroring human-trained data patterns and causing alignment failures.
The authors view both reliability and potential welfare concerns as reasons to study these states.

Get the Snipd Podcast app to discover more snips from this episode

Get the app