The New Stack Podcast

Inception Labs says its diffusion LLM is 10x faster than Claude, ChatGPT, Gemini

Mar 2, 2026
Stefano Ermon, co-founder and CEO of Inception Labs and a former Stanford researcher who adapted diffusion to language, discusses Mercury 2, a diffusion-based LLM. He explains how diffusion refines text in parallel rather than token by token. Topics include why diffusion speeds up inference, Mercury 2’s 5–10x latency gains, hardware and developer trade-offs, and target low-latency use cases.
ANECDOTE

From Stanford GAN Research To Founding Inception Labs

  • Stefano traced his path from a Stanford lab focused on generative models to founding Inception Labs, created to apply diffusion to language.
  • His group moved from GANs to diffusion for images in 2019, then adapted diffusion math to discrete text over years of research.
INSIGHT

Diffusion Models Replace Token-By-Token Generation

  • Diffusion LLMs refine a full draft in parallel rather than generating tokens left-to-right, enabling global edits across the sequence.
  • Stefano Ermon showed diffusion-trained GPT-2–size models matched autoregressive quality while needing far fewer denoising steps, yielding ~10x speedups.
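The parallel-refinement idea can be sketched as a toy masked-denoising loop (a hypothetical illustration of the general diffusion-LLM shape, not Mercury 2’s actual algorithm; in a real model a neural network predicts every position at once):

```python
import random

# Toy sketch: start from a fully masked draft and refine ALL positions
# in parallel over a few denoising steps, instead of emitting one token
# per step left-to-right.
VOCAB = ["the", "model", "refines", "text", "in", "parallel"]
MASK = "<mask>"

def denoise_step(draft, fraction):
    # Stand-in for one denoising pass: a real diffusion LLM scores every
    # position simultaneously; here we just fill a fraction of the
    # remaining masks to show the control flow.
    return [
        random.choice(VOCAB) if tok == MASK and random.random() < fraction else tok
        for tok in draft
    ]

draft = [MASK] * 8
for _ in range(4):          # a few refinement steps, not one step per token
    draft = denoise_step(draft, 0.7)
print(draft)
```

Because each step touches the whole sequence, the model can make global edits across the draft, which a strictly left-to-right generator cannot.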
INSIGHT

Parallelism Is The Key To Faster Token Production

  • Autoregressive models require evaluating the network once per token, making long outputs inherently sequential and slow.
  • Diffusion models process many tokens simultaneously on GPUs, so with few refinement steps they gain large latency advantages.
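The latency arithmetic behind that advantage can be written down in a toy cost model (the step counts below are illustrative assumptions, not Mercury 2 benchmarks):

```python
# Toy sequential-cost model: what matters for latency is how many network
# evaluations must happen one after another.
def autoregressive_passes(num_tokens: int) -> int:
    # One network evaluation per generated token -> cost grows with length.
    return num_tokens

def diffusion_passes(num_steps: int) -> int:
    # Each denoising step refines all tokens in parallel on the GPU,
    # so the sequential cost is just the number of refinement steps.
    return num_steps

tokens = 1000
ar = autoregressive_passes(tokens)   # 1000 sequential passes
diff = diffusion_passes(100)         # e.g. 100 denoising steps
print(f"speedup ~ {ar / diff:.0f}x") # → speedup ~ 10x
```

With enough parallel compute per step, cutting sequential passes from one-per-token to a small fixed number of refinement steps is where the claimed ~10x latency gain comes from.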