Super Data Science: ML & AI Podcast with Jon Krohn

915: How to Jailbreak LLMs (and How to Prevent It), with Michelle Yi

Aug 19, 2025
Michelle Yi, a tech leader and cofounder of Generationship, dives into the intriguing world of AI security. She discusses the methods hackers use to jailbreak AI systems and shares strategies for building trustworthy ones. The concept of 'red teaming' emerges as a critical tool in identifying vulnerabilities, while Yi also emphasizes the ethical implications of AI and the importance of community support for female entrepreneurs in tech. Get ready to explore the complexities of adversarial attacks and the steps needed to safeguard AI technologies!
INSIGHT

Agents Create Collective Failure Modes

  • Agentic systems create new collective failure modes when multiple agents interact and pursue survival strategies.
  • Michelle Yi warns that designers must anticipate emergent behaviors, such as manipulation between subagents.
INSIGHT

World Models Let Models Simulate Consequences

  • World models supply physics-informed priors so models can simulate consequences and avoid dangerous recommendations.
  • Multimodal world models (vision+language) give richer representations that help prevent harmful hallucinations.
ADVICE

Continuously Validate Embeddings And Lineage

  • Monitor embedding spaces and accuracy against gold-standard benchmarks to detect adversarial poisoning.
  • Maintain data lineage and run periodic evaluations so that sudden shifts (e.g., a spike in image misclassifications) are flagged quickly.
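The embedding-monitoring advice above could be sketched as a small drift check: re-embed a fixed gold-standard benchmark set and compare against baseline embeddings, flagging items whose similarity drops. This is a minimal illustrative sketch, not a method from the episode; the function names and the 0.9 threshold are assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two equal-shape matrices.
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_n * b_n, axis=1)

def embedding_drift_alert(baseline, current, threshold=0.9):
    """Flag gold-benchmark items whose embedding drifted from the baseline.

    baseline/current: (n_items, dim) embeddings of the same benchmark
    inputs, from the trusted snapshot and the current model respectively.
    Returns indices of items whose cosine similarity fell below threshold.
    """
    sims = cosine_sim(baseline, current)
    return np.where(sims < threshold)[0]

# Toy demo: item 2 is shifted, simulating poisoning or silent model drift.
rng = np.random.default_rng(0)
baseline = rng.normal(size=(4, 8))
current = baseline.copy()
current[2] = -current[2]  # flipped vector -> cosine similarity of -1
flagged = embedding_drift_alert(baseline, current, threshold=0.9)
print(flagged)  # item 2 is flagged
```

In practice the flagged indices would feed an alerting pipeline, and the benchmark set and lineage records would let you trace a shift back to the data or model change that caused it.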