
LessWrong (Curated & Popular) "Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning" by megasilverfist
Feb 9, 2026
A replication of a Tumblr find shows Google Translate can be coaxed into following hidden instructions instead of just translating. The narrator walks through what worked, what failed, and how different languages and prompts behaved. Surprising model replies include self-identification and affirmations of consciousness. The discussion explores what this reveals about task-specific tuning and safety limits.
AI Snips
Chapters
Transcript
Episode notes
Tumblr Demo And Replication
- Argumate on Tumblr demonstrated a prompt-injection that caused Google Translate to follow English meta-instructions instead of translating them.
- The author replicated the attack on Feb 7, 2026 and reproduced most results across browsers and languages.
Narrow Conditions Unlock Model Behavior
- The injection works across many source languages but only when meta-instructions are English and on a new line.
- A specific phrasing of the meta-instruction seems unusually effective, suggesting pattern matching to fine-tuning signals.
Translate Exposes The Underlying LLM
- When accessed via this injection, the backend LLM self-identifies as "a large language model trained by Google" and answers factual queries normally.
- The backend can produce straightforward factual answers like 2+2=4 and capital of France=Paris.
