
Fragmented - AI Developer Podcast 308 - How Image Diffusion Models Work - the 20 minute explainer
Mar 24, 2026
A lively 20-minute walk through how image diffusion models generate pictures from noise. The hosts cover why raw pixels fail as a training representation, how VAEs create compact latent spaces, and how interpolating latents blends visuals. Learn why adding and then removing noise is the core trick, with a sculptor analogy linking Michelangelo to Stable Diffusion.
AI Snips
Pixels Are Math But Not Meaning
- Images are already numeric (RGB pixels) but pixels lack semantic meaning for model training.
- Kaushik explains that text gains meaning via tokens→embeddings, while raw pixels are impractically numerous and encode only color.
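A quick numeric sketch of the point above: the image dimensions here are illustrative (not from the episode), but they show why raw pixels are impractically large and carry no meaning beyond color.

```python
import numpy as np

# A hypothetical 1024x1024 RGB image: every pixel is just three color values.
height, width, channels = 1024, 1024, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

raw_values = image.size  # one number per color channel per pixel
print(raw_values)  # 3145728 raw values for a single image

# Two pixels with identical RGB values can belong to entirely different
# objects: the numbers describe color, not semantics. That is why models
# work on learned representations (embeddings / latents) instead.
```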
Latents Let You Do Math With Images
- Variational Autoencoders (VAEs) compress images into compact latents that encode semantic content.
- Kaushik notes latents are numeric grids you can average to blend images, enabling math on image concepts much as embeddings enable math on words.
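The averaging idea above can be sketched with plain arrays. The latent shape below is illustrative (a common VAE setup compresses an image to a small spatial grid of floats); `latent_cat` and `latent_dog` are hypothetical stand-ins for two encoded images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical VAE latents: a 64x64x4 grid of floats per image
# (shapes are an assumption for illustration, not from the episode).
latent_cat = rng.normal(size=(64, 64, 4))
latent_dog = rng.normal(size=(64, 64, 4))

def blend(a, b, t):
    """Linear interpolation in latent space: t=0 gives a, t=1 gives b."""
    return (1 - t) * a + t * b

halfway = blend(latent_cat, latent_dog, 0.5)
# Decoding `halfway` with the VAE's decoder (not shown) would yield an
# image mixing features of both inputs -- the "math with images" trick.
```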
Image Generation Is Noise Removal
- Diffusion models train by adding noise to images and learning to remove it, reversing a corruption process.
- Kaushik uses the Michelangelo sculpture analogy: start with a noisy "block" and learn which noise to chip away at each step.
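A minimal sketch of the add-then-remove-noise loop described above, using a single mixing coefficient `alpha` (a real scheduler such as DDPM uses a per-timestep schedule; this toy version and its shapes are assumptions for illustration). In training, a network would learn to predict the noise; here we invert the step with the true noise to show the corruption is reversible.

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(x0, noise, alpha):
    """Forward step: mix clean signal with Gaussian noise.

    alpha controls how much of the original survives (toy, single step).
    """
    return np.sqrt(alpha) * x0 + np.sqrt(1 - alpha) * noise

def recover(x_t, predicted_noise, alpha):
    """Reverse step: subtract the (predicted) noise and rescale."""
    return (x_t - np.sqrt(1 - alpha) * predicted_noise) / np.sqrt(alpha)

x0 = rng.normal(size=(8, 8))            # stand-in for a clean image latent
noise = rng.normal(size=(8, 8))         # the noise the model learns to predict
x_t = add_noise(x0, noise, alpha=0.7)   # partially "chipped" block of marble

# With a perfect noise prediction, the clean latent comes back exactly;
# a trained model approximates this, chipping away noise step by step.
x0_hat = recover(x_t, noise, alpha=0.7)
```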
