
The New Stack Podcast Inception Labs says its diffusion LLM is 10x faster than Claude, ChatGPT, Gemini
Mar 2, 2026
Stefano Ermon, co-founder and CEO of Inception Labs and former Stanford researcher who adapted diffusion to language, discusses Mercury 2, a diffusion-based LLM. He explains how diffusion refines text in parallel rather than token-by-token. Topics include why diffusion speeds inference, Mercury 2’s 5–10x latency gains, hardware and developer trade-offs, and target low-latency use cases.
AI Snips
From Stanford GAN Research To Founding Inception Labs
- Stefano traced his path from a Stanford lab focused on generative models to founding Inception Labs to apply diffusion to language.
- His group moved from GANs to diffusion for images in 2019, then spent years adapting the diffusion framework to discrete text.
Diffusion Models Replace Token-By-Token Generation
- Diffusion LLMs refine a full draft in parallel rather than generating tokens left-to-right, enabling global edits across the sequence.
- Stefano Ermon showed diffusion-trained GPT-2–size models matched autoregressive quality while needing far fewer denoising steps, yielding ~10x speedups.
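The refine-in-parallel idea can be illustrated with a toy sketch. This is not Mercury's algorithm; it is a minimal stand-in where an "oracle" plays the role of the trained denoiser: the draft starts fully masked, and each denoising step predicts every masked position at once, committing only a confident subset, so the whole sequence resolves in far fewer steps than its length.

```python
import random

random.seed(0)

MASK = None
TARGET = list("parallel decoding")  # oracle text standing in for a trained denoiser


def denoise_step(draft, keep_prob=0.5):
    """One parallel refinement pass: every masked position is 'predicted'
    simultaneously, and each prediction is committed with probability
    keep_prob (a crude proxy for model confidence)."""
    out = []
    for pos, tok in enumerate(draft):
        if tok is MASK and random.random() < keep_prob:
            out.append(TARGET[pos])  # oracle "prediction" for this position
        else:
            out.append(tok)
    return out


draft = [MASK] * len(TARGET)
steps = 0
while MASK in draft:
    draft = denoise_step(draft)
    steps += 1

print("".join(draft), "resolved in", steps, "steps for", len(TARGET), "tokens")
```

An autoregressive model would need one forward pass per token here (17), while the parallel refinement loop typically finishes in a handful of passes.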
Parallelism Is The Key To Faster Token Production
- Autoregressive models require evaluating the network once per token, making long outputs inherently sequential and slow.
- Diffusion models process many tokens simultaneously on GPUs, so with few refinement steps they gain large latency advantages.
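The latency argument above is back-of-envelope arithmetic: autoregressive decoding costs one network evaluation per output token, while diffusion costs one evaluation per refinement step. The numbers below are illustrative assumptions, not measured benchmarks of any model.

```python
# Back-of-envelope latency comparison (assumed, illustrative numbers).
output_tokens = 1000        # length of the generated sequence
forward_pass_ms = 2.0       # assumed cost of one full-network evaluation

# Autoregressive: one sequential forward pass per token.
autoregressive_ms = output_tokens * forward_pass_ms

# Diffusion: one forward pass per parallel refinement step,
# regardless of sequence length (assuming it fits in one batch).
diffusion_steps = 100
diffusion_ms = diffusion_steps * forward_pass_ms

print(autoregressive_ms / diffusion_ms)  # -> 10.0x fewer sequential passes
```

The speedup scales with `output_tokens / diffusion_steps`, which is why cutting the number of denoising steps is the central lever for diffusion-LLM latency.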

