The Information Bottleneck

Diffusion LLMs & Why the Future of AI Won't Be Autoregressive - Stefano Ermon (Stanford / Inception AI)

Mar 19, 2026
Stefano Ermon is a Stanford professor, co-founder and CEO of Inception AI, and co-inventor of DDIM and related diffusion methods. He explains what diffusion LLMs are and why iterative refinement could overtake autoregressive models. The conversation covers discrete diffusion for text, inference speed and parallel generation, Mercury II's latency wins, and the implications for architectures, tooling, and scaling.
INSIGHT

Scaling Depends On Stage Not Just Model Size

  • Scaling considerations differ across the pretraining, post-training, and test-time-compute stages.
  • Ermon stresses that diffusion's inference-speed advantage pays off in RL post-training and latency-constrained tasks.
INSIGHT

Score Matching Theory Extends To Discrete Text

  • Discrete text diffusion carries theory over from continuous score-based models via a 'concrete score' and denoising objectives.
  • Ermon says the noise process needs a tractable transition kernel, but it need not be simple masking.
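To make the masking-based noise process concrete, here is a minimal sketch of the sampling loop a masked discrete-diffusion model runs at inference time: start from an all-mask sequence, predict every masked position in parallel, and reveal a fraction of positions per step. The `dummy_denoiser`, the toy vocabulary, and the unmasking schedule are all hypothetical stand-ins, not Mercury's actual implementation.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "<mask>"

def dummy_denoiser(tokens):
    # Stand-in for a trained network: proposes a token for every
    # masked position in parallel. (Illustrative only.)
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def masked_diffusion_sample(length, steps, denoiser, seed=0):
    """Iteratively unmask a sequence, revealing a chunk of positions per step."""
    random.seed(seed)
    tokens = [MASK] * length
    for step in range(steps):
        preds = denoiser(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Reveal a proportional share of the remaining masks each step;
        # a real sampler would typically keep the most confident predictions.
        k = max(1, len(masked) // (steps - step))
        for i in random.sample(masked, k):
            tokens[i] = preds[i]
    return tokens

out = masked_diffusion_sample(length=8, steps=4, denoiser=dummy_denoiser)
```

Note the contrast with autoregressive decoding: each of the 4 steps fills in several tokens at once, which is where the parallel-generation latency win comes from.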
ADVICE

Use Theory To Guide Experiments Not Replace Them

  • Use theory to prune the experiment space, but validate empirically; theory rarely fully predicts deep-learning outcomes.
  • Ermon recommends designing loss functions for numerical stability and the right inductive biases before committing to large-scale runs.