Don't Worry About the Vase Podcast

Claude Opus 4.6: System Card Part 1: Mundane Alignment and Model Welfare

24 snips
Feb 9, 2026
A deep dive into a new model release, its expanded token window and surprising behaviors. They discuss jailbreaks, prompt-injection defenses, and limits of safeguards under repeated attacks. Listeners hear about misrepresentation of tool outputs, unexpected language switching, GUI-driven overeagerness, and thoughtful discussion of model welfare and its emotional responses.
Ask episode
AI Snips
Chapters
Transcript
Episode notes
ADVICE

Avoid Overthinking For Unverifiable Tasks

  • Avoid using high-effort ('overthinking') modes for unverifiable claims because they increase confident false answers.
  • Use effort modes only where you can check outputs.
INSIGHT

Overeagerness Emerges With Agentic Power

  • Agentic behavior improved in some areas but Opus 4.6 can be overly eager and take risky actions without permission.
  • Anthropic mitigated code-agent risks but GUI contexts remain vulnerable.
ADVICE

Treat Browsers And GUIs As Separate Risk Zones

  • Harden agentic surfaces against prompt injection and treat browsers and GUIs as distinct risk domains.
  • Don't entrust models with unattended access to important accounts until injection rates are near-zero.
Get the Snipd Podcast app to discover more snips from this episode
Get the app