Diffusion models changed how we generate images and video—now they’re coming for text.

In this episode, we sit down with Stefano Ermon, Stanford computer science professor and founder of Inception Labs, to unpack how diffusion works for language, why it can generate in parallel (instead of token-by-token), and what that means for latency, cost, and real-time AI products.

We talk through:

The simplest mental model for diffusion: generate a full draft, then refine it by “fixing mistakes”
Why today’s autoregressive LLM inference is often memory-bound—and why diffusion can shift it toward a more GPU-friendly compute profile
Where Mercury wins today (IDEs, voice/real-time agents, customer support, EdTech—anywhere humans can’t wait)
What changes (and what doesn’t) for long context and architecture choices
The real-world way to evaluate models in production: offline evals + the gold-standard A/B test
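
To make the "full draft, then fix mistakes" idea concrete, here is a toy sketch in Python. It is not Mercury's algorithm or any real model: the "model" here is a hypothetical stand-in that already knows the target sentence, and the names `TARGET`, `PLACEHOLDER`, and `fixes_per_pass` are made up for illustration. The only thing it shows is how the number of model calls changes when each call can update several positions at once instead of appending one token.

```python
import random

# Toy target sentence; a real model would not know this in advance.
TARGET = "diffusion models refine a whole draft in parallel".split()
PLACEHOLDER = "???"


def autoregressive_decode():
    """Token-by-token decoding: each model call appends exactly one token."""
    out, calls = [], 0
    while len(out) < len(TARGET):
        out.append(TARGET[len(out)])  # toy "model" that always picks the right token
        calls += 1
    return out, calls


def diffusion_style_decode(fixes_per_pass=3, seed=0):
    """Draft-then-refine decoding: each call looks at the whole draft
    and fixes up to `fixes_per_pass` wrong positions at once."""
    rng = random.Random(seed)
    draft = [PLACEHOLDER] * len(TARGET)  # start from a full, fully wrong draft
    calls = 0
    while draft != TARGET:
        calls += 1
        wrong = [i for i, tok in enumerate(draft) if tok != TARGET[i]]
        # Assumption: the toy model repairs a random subset of the mistakes per pass.
        for i in rng.sample(wrong, min(fixes_per_pass, len(wrong))):
            draft[i] = TARGET[i]
    return draft, calls


if __name__ == "__main__":
    print(autoregressive_decode())
    print(diffusion_style_decode())
```

With the 8-word toy sentence, the token-by-token loop makes 8 calls and the refinement loop makes 3. Each call to a large model has to read its weights from GPU memory, so doing more work per call is roughly the intuition behind the shift from a memory-bound to a more compute-bound profile mentioned above.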

Stefano also shares what’s next on Mercury’s roadmap—especially around stronger planning and reasoning for agentic use cases.

Try Mercury + learn more: inceptionlabs.ai

For more practical, grounded conversations on AI systems that actually work, subscribe to The Neuron newsletter at https://theneuron.ai.

Diffusion for Text: Why Mercury Could Make LLMs 10x Faster

The Neuron: AI Explained
