Super Data Science: ML & AI Podcast with Jon Krohn

973: AI Systems Performance Engineering, with Chris Fregly

Mar 10, 2026
Chris Fregly, an AI systems performance engineer and author with experience at AWS, Databricks, and Netflix, discusses GPU-centric performance engineering, emphasizing memory bandwidth over FLOPS. Topics include full-stack hardware–software co-design, low-level profiling and CUDA, inference optimizations such as KV caching, and the practical use of AI coding assistants and continuous evals.
ANECDOTE

Starbucks Fueled The Thousand Page Deep Dive

  • Chris Fregly wrote a 1,000-page book while working daily from Starbucks and spent roughly $5–6k on lattes during the yearlong effort.
  • He researched NVIDIA internals because vendor docs were poor, motivating the deep, practical investigation that became the book.
ANECDOTE

DeepSeek R1 Cut Costs With Under-Documented Hardware Tricks

  • DeepSeek R1 achieved a 10–20x training cost reduction partly by discovering under-documented hardware tricks and co-designing storage, algorithms, and software.
  • Chris notes many labs keep such low-level optimizations secret, but DeepSeek published theirs, exposing cache and storage innovations.
INSIGHT

Memory Bandwidth Is The Real GPU Bottleneck

  • Memory bandwidth, not FLOPS, is currently the single most critical GPU characteristic for large model performance across generations.
  • Chris emphasizes profiling how fast weights move from memory into registers/caches because model scale makes that the bottleneck.
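The memory-bandwidth point above can be illustrated with a back-of-envelope roofline estimate for single-batch LLM decoding. The sketch below is not from the episode; the model size, bandwidth, and FLOPS figures are illustrative assumptions (a 70B-parameter fp16 model on an H100-class GPU).

```python
# Back-of-envelope roofline check for batch-1 LLM decode:
# is each generated token limited by memory bandwidth or by compute?
# All hardware numbers are illustrative assumptions, not episode figures.

PARAMS = 70e9                     # assumed 70B-parameter model
WEIGHT_BYTES = PARAMS * 2         # fp16: 2 bytes per parameter
MEM_BW = 3.35e12                  # assumed ~3.35 TB/s HBM bandwidth (H100-class)
PEAK_FLOPS = 1.0e15               # assumed ~1 PFLOP/s dense fp16 throughput

# With batch size 1, decoding streams every weight from HBM once per token.
mem_time_s = WEIGHT_BYTES / MEM_BW

# A forward pass costs roughly 2 FLOPs per parameter per token.
compute_time_s = (2 * PARAMS) / PEAK_FLOPS

print(f"memory-limited time/token:  {mem_time_s * 1e3:.2f} ms")
print(f"compute-limited time/token: {compute_time_s * 1e3:.2f} ms")
print(f"memory-bound by ~{mem_time_s / compute_time_s:.0f}x")
```

Under these assumptions the time to move the weights dwarfs the time to multiply them by two orders of magnitude, which is why profiling weight movement from memory into registers and caches matters more than raw FLOPS at this batch size.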