
Fragmented - AI Developer Podcast 308 - How Image Diffusion Models Work - the 20 minute explainer
Mar 24, 2026
A lively 20-minute walk through how image diffusion models generate pictures from noise. The hosts cover why raw pixels fail as a training representation, how VAEs create compact latent spaces, and how interpolating latents blends visuals. Learn why adding and then removing noise is the core trick, with a sculptor analogy linking Michelangelo to Stable Diffusion.
AI Snips
Pixels Are Math But Not Meaning
- Images are already numeric (RGB pixels) but pixels lack semantic meaning for model training.
- Kaushik explains that text gains meaning via tokens→embeddings, while raw pixels are impractically numerous and encode only color.
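A quick numeric sketch of the point above: the image dimensions here are illustrative (not from the episode), but they show why raw pixels are impractically large and carry no meaning beyond color.

```python
import numpy as np

# A hypothetical 1024x1024 RGB image: every pixel is just three color values.
height, width, channels = 1024, 1024, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

raw_values = image.size  # one number per color channel per pixel
print(raw_values)  # 3145728 raw values for a single image

# Two pixels with identical RGB values can belong to entirely different
# objects: the numbers describe color, not semantics. That is why models
# work on learned representations (embeddings / latents) instead.
```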
Latents Let You Do Math With Images
- Variational Autoencoders (VAEs) compress images into compact latents that encode semantic content.
- Kaushik notes latents are numeric grids you can average to blend images, enabling math on image concepts much as embeddings enable math on words.
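The averaging idea above can be sketched with plain arrays. The latent shape below is illustrative (a common VAE setup compresses an image to a small spatial grid of floats); `latent_cat` and `latent_dog` are hypothetical stand-ins for two encoded images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical VAE latents: a 64x64x4 grid of floats per image
# (shapes are an assumption for illustration, not from the episode).
latent_cat = rng.normal(size=(64, 64, 4))
latent_dog = rng.normal(size=(64, 64, 4))

def blend(a, b, t):
    """Linear interpolation in latent space: t=0 gives a, t=1 gives b."""
    return (1 - t) * a + t * b

halfway = blend(latent_cat, latent_dog, 0.5)
# Decoding `halfway` with the VAE's decoder (not shown) would yield an
# image mixing features of both inputs -- the "math with images" trick.
```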
Image Generation Is Noise Removal
- Diffusion models train by adding noise to images and learning to remove it, reversing a corruption process.
- Kaushik uses the Michelangelo sculpture analogy: start with a noisy "block" and learn which noise to chip away at each step.
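A minimal sketch of the add-then-remove-noise loop described above, using a single mixing coefficient `alpha` (a real scheduler such as DDPM uses a per-timestep schedule; this toy version and its shapes are assumptions for illustration). In training, a network would learn to predict the noise; here we invert the step with the true noise to show the corruption is reversible.

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(x0, noise, alpha):
    """Forward step: mix clean signal with Gaussian noise.

    alpha controls how much of the original survives (toy, single step).
    """
    return np.sqrt(alpha) * x0 + np.sqrt(1 - alpha) * noise

def recover(x_t, predicted_noise, alpha):
    """Reverse step: subtract the (predicted) noise and rescale."""
    return (x_t - np.sqrt(1 - alpha) * predicted_noise) / np.sqrt(alpha)

x0 = rng.normal(size=(8, 8))            # stand-in for a clean image latent
noise = rng.normal(size=(8, 8))         # the noise the model learns to predict
x_t = add_noise(x0, noise, alpha=0.7)   # partially "chipped" block of marble

# With a perfect noise prediction, the clean latent comes back exactly;
# a trained model approximates this, chipping away noise step by step.
x0_hat = recover(x_t, noise, alpha=0.7)
```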
