
Super Data Science: ML & AI Podcast with Jon Krohn 973: AI Systems Performance Engineering, with Chris Fregly
Mar 10, 2026. Chris Fregly, an AI systems performance engineer and author with experience at AWS, Databricks, and Netflix, discusses GPU-centric performance engineering, arguing that memory bandwidth matters more than FLOPS. Topics include full-stack hardware–software co-design, low-level profiling and CUDA, inference optimizations such as the KV cache, and practical use of AI coding assistants and continuous evals.
Starbucks Fueled The Thousand Page Deep Dive
- Chris Fregly wrote a 1,000-page book while working daily from Starbucks and spent roughly $5–6k on lattes during the yearlong effort.
- He researched NVIDIA internals because vendor docs were poor, motivating the deep, practical investigation that became the book.
DeepSeek R1 Cut Costs With Under-Documented Hardware Tricks
- DeepSeek R1 achieved a 10–20x training cost reduction partly by discovering under-documented hardware tricks and co-designing storage, algorithms, and software.
- Chris notes many labs keep such low-level optimizations secret, but DeepSeek published theirs, exposing cache and storage innovations.
Memory Bandwidth Is The Real GPU Bottleneck
- Memory bandwidth, not FLOPS, is currently the single most critical GPU characteristic for large model performance across generations.
- Chris emphasizes profiling how fast weights move from memory into registers/caches because model scale makes that the bottleneck.
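The bandwidth-vs-FLOPS point above can be sketched with a roofline-style back-of-envelope calculation. This is an illustrative sketch, not from the episode: the hardware numbers are assumed (rough published specs for an NVIDIA H100 SXM), and the decode-time arithmetic intensity of ~1 FLOP/byte is a common simplification for batch-size-1 LLM inference.

```python
# Roofline-style check of whether a workload is memory- or compute-bound.
# Hardware numbers are assumed/illustrative (approximate H100 SXM specs).

PEAK_FLOPS = 989e12        # assumed peak BF16 throughput, FLOP/s
PEAK_BANDWIDTH = 3.35e12   # assumed HBM3 bandwidth, bytes/s

# Arithmetic intensity = FLOPs performed per byte moved from memory.
# Below this ridge point, memory bandwidth (not FLOPS) caps performance.
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # ~295 FLOP/byte

def attainable_tflops(intensity_flops_per_byte: float) -> float:
    """Roofline model: achievable TFLOP/s at a given arithmetic intensity."""
    return min(PEAK_FLOPS, intensity_flops_per_byte * PEAK_BANDWIDTH) / 1e12

# Decode-time LLM inference at batch size 1 does roughly 2 FLOPs per
# 2-byte weight streamed in, i.e. intensity ~1 FLOP/byte: memory-bound.
decode_intensity = 1.0
print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(f"attainable at decode: {attainable_tflops(decode_intensity):.2f} TFLOP/s")
```

The gap between the ridge point (~295 FLOP/byte) and decode intensity (~1 FLOP/byte) is why moving weights from memory into registers and caches, not raw compute, dominates large-model serving performance.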



