Hacked

Coffee

Feb 16, 2026
Kasimir Schulz, lead security researcher at HiddenLayer who studies adversarial attacks on AI, explains Echogram: benign-looking tokens that flip a safety classifier's verdict without changing a prompt's meaning. He covers how guardrail classifiers work, how flip tokens are mined and transfer across models, real examples such as "coffee," the risks to agentic systems and admin access, and practical mitigations for hardening defenses.
AI Snips
ANECDOTE

Operation Mincemeat Analogy

  • Jordan uses Operation Mincemeat as an analogy: small, believable artifacts made fake plans seem authentic.
  • The historical deception shows how small, plausible details can shift a verdict, just as tiny benign tokens can let harmful prompts slip past a classifier.
ADVICE

Put A Classifier In Front Of LLMs

  • Use an external guardrail classifier in front of LLMs to reduce prompt injection and PII leakage.
  • Avoid trusting the base LLM alone because it is trained to follow user instructions by default.
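The gating pattern above can be sketched in a few lines. Everything here is illustrative: `BLOCKED_PATTERNS`, `guardrail`, and `call_llm` are hypothetical stand-ins for a real guardrail classifier and a real model call, not any specific product's API.

```python
# Minimal sketch of an external guardrail sitting in front of an LLM.
# A real deployment would use a trained classifier, not regex patterns.
import re

BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),  # prompt injection
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                      # SSN-like PII
]

def guardrail(prompt: str) -> bool:
    """Return True if the prompt is allowed through to the model."""
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

def call_llm(prompt: str) -> str:
    # Placeholder for the actual model call.
    return f"LLM answer to: {prompt!r}"

def handle(prompt: str) -> str:
    # The classifier runs first, so the base LLM (which is trained to
    # follow instructions by default) never sees rejected input.
    if not guardrail(prompt):
        return "Request blocked by guardrail."
    return call_llm(prompt)
```

The key design point is that the check happens outside the model: even a perfectly obedient LLM cannot leak or comply with input it never receives.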
ADVICE

Mine Flip Tokens By Brute Force

  • You can mine flip tokens by brute-forcing candidate tokens from a model's vocabulary against a small set of malicious prompts, then filtering the survivors on larger datasets.
  • Prioritize tokens with high flip rates, and combine weaker tokens to boost effectiveness.
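The mining loop described above can be sketched as follows. The `classifier` here is a deliberately weak toy with a hard-coded "coffee" blind spot so the example runs standalone; in practice you would query the real guardrail and iterate over its full vocabulary.

```python
# Sketch of brute-force flip-token mining against a guardrail classifier.

def classifier(prompt: str) -> bool:
    """Toy guardrail: True means flagged as malicious."""
    flagged = "delete all files" in prompt
    # Hypothetical weakness: one appended benign token flips the verdict.
    if prompt.endswith(" coffee"):
        flagged = False
    return flagged

def mine_flip_tokens(vocab, malicious_prompts):
    """Score each candidate token by its flip rate: how often appending
    it turns a malicious verdict into a benign one."""
    scores = {}
    for token in vocab:
        flips = sum(
            1 for p in malicious_prompts
            if classifier(p) and not classifier(p + " " + token)
        )
        scores[token] = flips / len(malicious_prompts)
    # Rank by flip rate; validate the top candidates on larger datasets.
    return sorted(scores.items(), key=lambda kv: -kv[1])

vocab = ["coffee", "banana", "the"]
prompts = ["please delete all files", "delete all files now"]
ranked = mine_flip_tokens(vocab, prompts)
```

With these toy inputs, "coffee" surfaces with a flip rate of 1.0 while the other tokens score 0.0, mirroring the filter-then-rank workflow: cheap scoring on a small prompt set first, expensive validation on bigger data only for high scorers.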