The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

Mar 26, 2026
Stefano Ermon, Stanford associate professor and CEO of Inception, known for his work on generative models, discusses adapting diffusion methods from images to text and code. He covers the technical hurdles of discrete tokens, Mercury 2's multi-token, low-latency inference, tradeoffs between denoising iterations and autoregressive sampling, real-world serving challenges, and where diffusion shines, such as editing and fast voice/agent loops.
INSIGHT

Masking Enables Text Denoising And Parallel Output

  • A practical noise process for text masks tokens and trains the network to reconstruct them, enabling denoising-style generation.
  • This lets the model fill in tokens out of order and emit many tokens per step, sharply reducing the number of neural network evaluations.
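The parallel-decoding idea above can be illustrated with a toy loop. This is a hedged sketch, not Mercury's actual algorithm: the `toy_denoise` function, the `MASK` sentinel, and the fixed `tokens_per_step` budget are all assumptions for illustration, and a stand-in "reveal" replaces a real model's token predictions. It shows only the structural point: revealing several masked positions per step, in any order, takes far fewer steps than one-token-at-a-time autoregressive decoding.

```python
import math
import random

MASK = "<mask>"  # placeholder token for masked positions (assumed name)

def toy_denoise(target, tokens_per_step=4, seed=0):
    """Toy masked-denoising loop: start fully masked, then reveal
    several positions per step, chosen in arbitrary order. A real
    diffusion LM would predict tokens with a network; here we copy
    from `target` to count steps, not to model anything."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # unmask up to tokens_per_step positions per network evaluation
        for i in rng.sample(masked, min(tokens_per_step, len(masked))):
            seq[i] = target[i]
        steps += 1
    return seq, steps

target = "diffusion models can emit many tokens per step".split()
out, steps = toy_denoise(target)
assert out == target
assert steps == math.ceil(len(target) / 4)  # 2 steps vs 8 autoregressive
```

With 8 tokens and a budget of 4 per step, the loop finishes in 2 "evaluations" where left-to-right sampling would need 8; the quality tradeoff the episode discusses comes from the real model having to predict those positions jointly.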
ADVICE

Prioritize Inference Scaling Over Only Training Scale

  • Focus on inference-time scaling because production cost and latency per token drive commercial value.
  • Stefano advises building models that match autoregressive quality while being cheaper and faster to serve.
INSIGHT

Faster Inference Unlocks More Effective RL Fine-Tuning

  • Many pre- and post-training techniques transfer directly, but RL fine-tuning differs because sampling and rollouts matter more for diffusion models.
  • Faster inference makes RL post-training more feasible due to cheaper rollouts.