Super Data Science: ML & AI Podcast with Jon Krohn

973: AI Systems Performance Engineering, with Chris Fregly

Mar 10, 2026
Chris Fregly, an AI systems performance engineer and author with experience at AWS, Databricks, and Netflix, discusses GPU-centric performance engineering, emphasizing memory bandwidth over FLOPS. Topics include full-stack hardware–software co-design, low-level profiling and CUDA, inference optimizations such as KV caching, and the practical use of AI coding assistants and continuous evals.
ANECDOTE

Starbucks Fueled The Thousand Page Deep Dive

  • Chris Fregly wrote a 1,000-page book while working daily from Starbucks and spent roughly $5–6k on lattes during the yearlong effort.
  • He researched NVIDIA internals because vendor docs were poor, motivating the deep, practical investigation that became the book.
ANECDOTE

DeepSeek R1 Cut Costs With Under-Documented Hardware Tricks

  • DeepSeek R1 achieved a 10–20x training cost reduction partly by discovering under-documented hardware tricks and co-designing storage, algorithms, and software.
  • Chris notes many labs keep such low-level optimizations secret, but DeepSeek published theirs, exposing cache and storage innovations.
INSIGHT

Memory Bandwidth Is The Real GPU Bottleneck

  • Memory bandwidth, not FLOPS, is currently the single most critical GPU characteristic for large model performance across generations.
  • Chris emphasizes profiling how fast weights move from memory into registers/caches because model scale makes that the bottleneck.
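The memory-bandwidth point above can be illustrated with a back-of-envelope roofline estimate for single-batch LLM decoding. The sketch below is not from the episode; the model size, bandwidth, and FLOPS figures are illustrative assumptions (a 70B-parameter fp16 model on an H100-class GPU).

```python
# Back-of-envelope roofline check for batch-1 LLM decode:
# is each generated token limited by memory bandwidth or by compute?
# All hardware numbers are illustrative assumptions, not episode figures.

PARAMS = 70e9                     # assumed 70B-parameter model
WEIGHT_BYTES = PARAMS * 2         # fp16: 2 bytes per parameter
MEM_BW = 3.35e12                  # assumed ~3.35 TB/s HBM bandwidth (H100-class)
PEAK_FLOPS = 1.0e15               # assumed ~1 PFLOP/s dense fp16 throughput

# With batch size 1, decoding streams every weight from HBM once per token.
mem_time_s = WEIGHT_BYTES / MEM_BW

# A forward pass costs roughly 2 FLOPs per parameter per token.
compute_time_s = (2 * PARAMS) / PEAK_FLOPS

print(f"memory-limited time/token:  {mem_time_s * 1e3:.2f} ms")
print(f"compute-limited time/token: {compute_time_s * 1e3:.2f} ms")
print(f"memory-bound by ~{mem_time_s / compute_time_s:.0f}x")
```

Under these assumptions the time to move the weights dwarfs the time to multiply them by two orders of magnitude, which is why profiling weight movement from memory into registers and caches matters more than raw FLOPS at this batch size.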