The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

Mar 26, 2026
Stefano Ermon, Stanford associate professor and CEO of Inception, known for his work on generative models, discusses adapting diffusion methods from images to text and code. He covers the technical hurdles of discrete tokens, Mercury 2's multi-token, low-latency inference, tradeoffs between denoising iterations and autoregressive sampling, real-world serving challenges, and where diffusion shines, such as editing and fast voice/agent loops.
INSIGHT

Masking Enables Text Denoising And Parallel Output

  • A practical noise process for text masks tokens and trains the network to reconstruct them, enabling denoising-style generation.
  • This lets the model fill in tokens out of order and emit many tokens per step, sharply reducing the number of neural network evaluations.
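The parallel-decoding idea above can be illustrated with a toy loop. This is a hedged sketch, not Mercury's actual algorithm: the `toy_denoise` function, the `MASK` sentinel, and the fixed `tokens_per_step` budget are all assumptions for illustration, and a stand-in "reveal" replaces a real model's token predictions. It shows only the structural point: revealing several masked positions per step, in any order, takes far fewer steps than one-token-at-a-time autoregressive decoding.

```python
import math
import random

MASK = "<mask>"  # placeholder token for masked positions (assumed name)

def toy_denoise(target, tokens_per_step=4, seed=0):
    """Toy masked-denoising loop: start fully masked, then reveal
    several positions per step, chosen in arbitrary order. A real
    diffusion LM would predict tokens with a network; here we copy
    from `target` to count steps, not to model anything."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # unmask up to tokens_per_step positions per network evaluation
        for i in rng.sample(masked, min(tokens_per_step, len(masked))):
            seq[i] = target[i]
        steps += 1
    return seq, steps

target = "diffusion models can emit many tokens per step".split()
out, steps = toy_denoise(target)
assert out == target
assert steps == math.ceil(len(target) / 4)  # 2 steps vs 8 autoregressive
```

With 8 tokens and a budget of 4 per step, the loop finishes in 2 "evaluations" where left-to-right sampling would need 8; the quality tradeoff the episode discusses comes from the real model having to predict those positions jointly.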
ADVICE

Prioritize Inference Scaling Over Only Training Scale

  • Focus on inference-time scaling because production cost and latency per token drive commercial value.
  • Stefano advises building models that match autoregressive quality while being cheaper and faster to serve.
INSIGHT

Faster Inference Unlocks More Effective RL Fine-Tuning

  • Many pre- and post-training techniques transfer directly, but RL fine-tuning differs because sampling and rollouts matter more for diffusion models.
  • Faster inference makes RL post-training more feasible due to cheaper rollouts.