The Neuron: AI Explained

Diffusion for Text: Why Mercury Could Make LLMs 10x Faster

Feb 24, 2026
Stefano Ermon, Stanford CS professor and founder of Inception Labs, explains Mercury, a diffusion approach that drafts the full text and then refines it. He discusses why diffusion can edit many tokens in parallel, how that reduces latency and eases GPU memory bottlenecks, which real-time applications benefit most, and the tradeoffs around quality, context length, and industry adoption.
ANECDOTE

Stanford Lab Shifted From GANs To Diffusion And Spawned Mercury

  • Stefano's lab at Stanford invented image diffusion in 2019 as a simpler, more scalable alternative to GANs.
  • That line of research evolved into adapting diffusion math to discrete text/code and led to the Mercury start-up.
INSIGHT

Parallel Editing Lets Multiple Tokens Change Together

  • Parallelism in diffusion means the network can modify many tokens simultaneously rather than one token at a time.
  • That parallel editing produces the visible iterative updates in Mercury's chat UI and maps naturally onto GPU parallelism.
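The parallel-editing idea can be sketched in miniature. The snippet below is a toy illustration, not Mercury's actual algorithm: `propose` is a hypothetical stand-in for a diffusion denoiser that, in one pass, suggests a token and a confidence score for every masked position, and the loop commits all sufficiently confident proposals at once, so many tokens change per refinement step instead of one.

```python
import random

random.seed(0)
MASK = "<mask>"

def propose(tokens):
    """Hypothetical stand-in for a diffusion denoiser.

    Returns a (token, confidence) proposal for every masked position.
    A real model would compute all of these in a single parallel
    forward pass over the whole sequence.
    """
    vocab = ["the", "cat", "sat", "on", "mat"]
    return {i: (random.choice(vocab), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length=8, threshold=0.5, max_steps=20):
    tokens = [MASK] * length
    for step in range(1, max_steps + 1):
        proposals = propose(tokens)
        if not proposals:                     # fully denoised
            return tokens, step - 1
        committed = False
        # Parallel edit: commit every proposal above the confidence
        # threshold, so multiple positions change in one step.
        for i, (tok, conf) in proposals.items():
            if conf >= threshold:
                tokens[i] = tok
                committed = True
        if not committed:
            # Avoid stalling: commit the single most confident proposal.
            i, (tok, _) = max(proposals.items(), key=lambda kv: kv[1][1])
            tokens[i] = tok
    return tokens, max_steps

tokens, steps = diffusion_decode()
print(steps, tokens)
```

Because each step commits at least one token and usually several, the sequence fills in far fewer steps than its length, which is the latency win over one-token-at-a-time autoregressive decoding.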
INSIGHT

Context Length Is An Architecture Problem, Not A Diffusion One

  • Long-context behavior depends on the backbone architecture (self-attention), not on whether the model is diffusion-based or autoregressive.
  • Mercury supports ~100k tokens today and could adopt alternative backbones (SSMs or efficient attention variants) to scale further.