
The Neuron: AI Explained
Diffusion for Text: Why Mercury Could Make LLMs 10x Faster
Feb 24, 2026
Stefano Ermon, Stanford CS professor and founder of Inception Labs, explains Mercury — a diffusion approach that drafts full text then refines it. He discusses why diffusion can edit many tokens in parallel, how that reduces latency and GPU memory bottlenecks, which real-time applications benefit most, and the tradeoffs around quality, context length, and industry adoption.
Stanford Lab Shifted From GANs To Diffusion And Spawned Mercury
- Stefano's lab at Stanford invented image diffusion in 2019 as a simpler, more scalable alternative to GANs.
- That line of research evolved into adapting diffusion math to discrete text and code, and led to founding Inception Labs, the startup behind Mercury.
Parallel Editing Lets Multiple Tokens Change Together
- Parallelism in diffusion means the network can modify many tokens simultaneously rather than one token at a time.
- That parallel editing is what creates the visible iterative updates in Mercury's chat UI, and it maps naturally onto GPUs, which excel at doing many identical operations at once.
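The parallel-editing idea can be sketched as a toy decoding loop: start from a fully masked sequence and, at each step, commit the most confident masked positions all at once rather than one token at a time. The `toy_model` scorer below is a hypothetical stand-in (Mercury's actual denoising network is not described at this level in the episode); the sketch only illustrates why filling several tokens per step cuts the number of model calls.

```python
import random

MASK = "<mask>"

def toy_model(tokens):
    """Hypothetical stand-in for a denoising network: for each masked
    position, propose a token and a confidence score. A real diffusion
    LLM would use a trained transformer here."""
    vocab = ["the", "cat", "sat", "on", "a", "mat"]
    return {
        i: (random.choice(vocab), random.random())
        for i, t in enumerate(tokens) if t == MASK
    }

def parallel_refine(length, tokens_per_step=4, seed=0):
    """Iteratively commit the most confident masked tokens in parallel.
    With tokens_per_step > 1, decoding finishes in fewer model calls
    than one-token-at-a-time autoregressive generation."""
    random.seed(seed)
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        proposals = toy_model(tokens)
        # Commit several positions at once: this is the parallel edit.
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for i, (tok, _) in best[:tokens_per_step]:
            tokens[i] = tok
        steps += 1
    return tokens, steps

tokens, steps = parallel_refine(length=16, tokens_per_step=4)
print(steps)  # 4 model calls instead of 16
```

With 16 positions and 4 commits per step, the loop needs only 4 forward passes where an autoregressive decoder would need 16 — the latency advantage scales with how many tokens can be safely edited together per step.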
Context Length Is An Architecture Problem Not A Diffusion One
- Long-context behavior depends on the backbone architecture (e.g., self-attention), not on the choice of diffusion vs. autoregressive decoding.
- Mercury supports ~100k tokens today and could adopt alternative backbones (state-space models or efficient attention variants) to scale further.

