The Neuron: AI Explained

Diffusion for Text: Why Mercury Could Make LLMs 10x Faster

Feb 24, 2026
Stefano Ermon, Stanford CS professor and founder of Inception Labs, explains Mercury, a diffusion approach that drafts the full text and then refines it. He discusses why diffusion can edit many tokens in parallel, how that reduces latency and eases GPU memory bottlenecks, which real-time applications benefit most, and the tradeoffs around quality, context length, and industry adoption.
ANECDOTE

Stanford Lab Shifted From GANs To Diffusion And Spawned Mercury

  • Stefano's lab at Stanford invented image diffusion in 2019 as a simpler, more scalable alternative to GANs.
  • That line of research evolved into adapting diffusion math to discrete text/code and led to the Mercury start-up.
INSIGHT

Parallel Editing Lets Multiple Tokens Change Together

  • Parallelism in diffusion means the network can modify many tokens simultaneously rather than one token at a time.
  • That parallel editing produces the visible iterative updates in Mercury's chat UI and maps naturally onto GPU parallelism.
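The parallel-editing idea can be sketched in miniature. The snippet below is a toy illustration, not Mercury's actual algorithm: `propose` is a hypothetical stand-in for a diffusion denoiser that, in one pass, suggests a token and a confidence score for every masked position, and the loop commits all sufficiently confident proposals at once, so many tokens change per refinement step instead of one.

```python
import random

random.seed(0)
MASK = "<mask>"

def propose(tokens):
    """Hypothetical stand-in for a diffusion denoiser.

    Returns a (token, confidence) proposal for every masked position.
    A real model would compute all of these in a single parallel
    forward pass over the whole sequence.
    """
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (random.choice(vocab), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, threshold=0.5, max_steps=20):
    tokens = [MASK] * length
    for step in range(1, max_steps + 1):
        proposals = propose(tokens)
        if not proposals:                     # fully denoised
            return tokens, step - 1
        committed = False
        # Parallel edit: commit every proposal above the confidence
        # threshold, so multiple positions change in one step.
        for i, (tok, conf) in proposals.items():
            if conf >= threshold:
                tokens[i] = tok
                committed = True
        if not committed:
            # Avoid stalling: commit the single most confident proposal.
            i, (tok, _) = max(proposals.items(), key=lambda kv: kv[1][1])
            tokens[i] = tok
    return tokens, max_steps

tokens, steps = diffusion_decode()
print(steps, tokens)
```

Because each step commits at least one token and usually several, the sequence fills in far fewer steps than its length, which is the latency win over one-token-at-a-time autoregressive decoding.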
INSIGHT

Context Length Is An Architecture Problem, Not A Diffusion One

  • Long-context behavior depends on the backbone architecture (self-attention), not on whether the model is diffusion-based or autoregressive.
  • Mercury supports ~100k tokens today and could adopt alternative backbones (SSMs or efficient attention variants) to scale further.