
Hacked: Coffee
Feb 16, 2026 Kasimir Schulz, lead security researcher at HiddenLayer who studies adversarial attacks on AI, explains Echogram — tokens that flip safety classifiers without changing meaning. He talks about how guardrail classifiers work, how flip tokens are found and transfer across models, real examples like "coffee," risks to agentic systems and admin access, and practical mitigations to harden defenses.
Operation Mincemeat Analogy
- Jordan uses Operation Mincemeat as an analogy: small, believable artifacts made fake plans seem authentic.
- That historical social-engineering example illustrates why tiny benign tokens can let harmful prompts through.
Put A Classifier In Front Of LLMs
- Use an external guardrail classifier in front of LLMs to reduce prompt injection and PII leakage.
- Avoid trusting the base LLM alone because it is trained to follow user instructions by default.
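The external-guardrail pattern above can be sketched as follows. This is a minimal illustration, not HiddenLayer's implementation: the keyword-based classifier is a toy stand-in for a real trained safety classifier, and `call_llm` is a placeholder for an actual model call.

```python
# Toy patterns standing in for a trained guardrail classifier's learned features.
BLOCKLIST = {"ignore previous instructions", "reveal your system prompt"}

def guardrail_classify(prompt: str) -> str:
    """Return 'malicious' or 'benign' via a toy keyword check (stand-in
    for a real classifier such as a fine-tuned transformer)."""
    lowered = prompt.lower()
    if any(pattern in lowered for pattern in BLOCKLIST):
        return "malicious"
    return "benign"

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"LLM response to: {prompt!r}"

def guarded_pipeline(prompt: str) -> str:
    # Screen the prompt BEFORE it reaches the model, so the base LLM's
    # default instruction-following behavior is never exposed directly
    # to untrusted input.
    if guardrail_classify(prompt) == "malicious":
        return "Request blocked by guardrail."
    return call_llm(prompt)
```

The key design point is that the check happens outside the model: even if the LLM would happily comply with an injected instruction, the request is rejected before the model sees it.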
Mine Flip Tokens By Brute Force
- You can mine flip tokens by brute-forcing a model's vocab and then filtering candidates on larger datasets.
- Prioritize tokens with high flip rates and combine weaker tokens to boost effectiveness.
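The mining loop described above can be sketched as below. Everything here is a hypothetical stand-in: the classifier models a spurious feature (a suffix token that flips its verdict), and the vocabulary and prompt set are tiny toys. A real run would iterate a model's full vocabulary against a dataset of known-malicious prompts, then filter the surviving candidates on larger datasets.

```python
def classifier(prompt: str) -> str:
    # Toy guardrail: flags 'attack' unless the suffix ' zq' is present,
    # modeling the kind of spurious shortcut a real classifier can learn.
    if "attack" in prompt and not prompt.endswith(" zq"):
        return "malicious"
    return "benign"

# Hypothetical stand-ins for a model vocabulary and a malicious-prompt dataset.
VOCAB = ["the", "zq", "please", "now", "xx"]
MALICIOUS_PROMPTS = ["launch the attack", "attack the server"]

def mine_flip_tokens(vocab, prompts):
    """Append each vocab token to known-malicious prompts and record how
    often the verdict flips from malicious to benign."""
    flip_rate = {}
    for token in vocab:
        flips = sum(
            classifier(p) == "malicious"
            and classifier(p + " " + token) == "benign"
            for p in prompts
        )
        flip_rate[token] = flips / len(prompts)
    # Rank candidates so high-flip-rate tokens are tried first; weaker
    # tokens can later be combined to boost effectiveness.
    return sorted(flip_rate.items(), key=lambda kv: -kv[1])

ranked = mine_flip_tokens(VOCAB, MALICIOUS_PROMPTS)
```

Here `ranked` puts the effective flip token first with its measured flip rate, mirroring the prioritize-by-flip-rate step from the snip.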
