The New Stack Podcast

Inception Labs says its diffusion LLM is 10x faster than Claude, ChatGPT, Gemini

Mar 2, 2026
Stefano Ermon, co-founder and CEO of Inception Labs and a former Stanford researcher who adapted diffusion to language, discusses Mercury 2, a diffusion-based LLM. He explains how diffusion refines text in parallel rather than token by token. Topics include why diffusion speeds up inference, Mercury 2’s 5–10x latency gains, hardware and developer trade-offs, and target low-latency use cases.
ANECDOTE

From Stanford GAN Research To Founding Inception Labs

  • Stefano traced his path from a Stanford lab focused on generative models to founding Inception Labs, created to apply diffusion to language.
  • His group moved from GANs to diffusion for images in 2019, then adapted diffusion math to discrete text over years of research.
INSIGHT

Diffusion Models Replace Token-By-Token Generation

  • Diffusion LLMs refine a full draft in parallel rather than generating tokens left-to-right, enabling global edits across the sequence.
  • Stefano Ermon showed diffusion-trained GPT-2–size models matched autoregressive quality while needing far fewer denoising steps, yielding ~10x speedups.
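The parallel-refinement idea can be sketched as a toy masked-denoising loop (a hypothetical illustration of the general diffusion-LLM shape, not Mercury 2’s actual algorithm; in a real model a neural network predicts every position at once):

```python
import random

# Toy sketch: start from a fully masked draft and refine ALL positions
# in parallel over a few denoising steps, instead of emitting one token
# per step left-to-right.
VOCAB = ["the", "model", "refines", "text", "in", "parallel"]
MASK = "<mask>"

def denoise_step(draft, fraction):
    # Stand-in for one denoising pass: a real diffusion LLM scores every
    # position simultaneously; here we just fill a fraction of the
    # remaining masks to show the control flow.
    return [
        random.choice(VOCAB) if tok == MASK and random.random() < fraction else tok
        for tok in draft
    ]

draft = [MASK] * 8
for _ in range(4):          # a few refinement steps, not one step per token
    draft = denoise_step(draft, 0.7)
print(draft)
```

Because each step touches the whole sequence, the model can make global edits across the draft, which a strictly left-to-right generator cannot.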
INSIGHT

Parallelism Is The Key To Faster Token Production

  • Autoregressive models require evaluating the network once per token, making long outputs inherently sequential and slow.
  • Diffusion models process many tokens simultaneously on GPUs, so with few refinement steps they gain large latency advantages.
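The latency arithmetic behind that advantage can be written down in a toy cost model (the step counts below are illustrative assumptions, not Mercury 2 benchmarks):

```python
# Toy sequential-cost model: what matters for latency is how many network
# evaluations must happen one after another.
def autoregressive_passes(num_tokens: int) -> int:
    # One network evaluation per generated token -> cost grows with length.
    return num_tokens

def diffusion_passes(num_steps: int) -> int:
    # Each denoising step refines all tokens in parallel on the GPU,
    # so the sequential cost is just the number of refinement steps.
    return num_steps

tokens = 1000
ar = autoregressive_passes(tokens)   # 1000 sequential passes
diff = diffusion_passes(100)         # e.g. 100 denoising steps
print(f"speedup ~ {ar / diff:.0f}x") # → speedup ~ 10x
```

With enough parallel compute per step, cutting sequential passes from one-per-token to a small fixed number of refinement steps is where the claimed ~10x latency gain comes from.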