MLOps.community

Performance Optimization and Software/Hardware Co-design across PyTorch, CUDA, and NVIDIA GPUs

Feb 24, 2026
Chris Fregly, AI performance engineer, founder, and author, walks through software/hardware co-design for PyTorch, CUDA, and NVIDIA GPUs. He covers mechanical sympathy, GPU generations, NVLink and networking, kernel tuning with coding agents, and infrastructure trade-offs for training versus inference. Short, technical, and focused on building scalable, high-performance AI systems.
ANECDOTE

Peak TFLOPs Are Marketing Math

  • Marketing numbers like peak teraflops can mislead because peak FLOPs are achievable only under specific, idealized conditions.
  • Chris calls some of this 'Jensen math': spec-sheet figures that don't reflect real transformer workloads.
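To see why headline numbers diverge from real workloads, here is a minimal sketch of how a peak-TFLOPs spec is typically computed by multiplying best-case factors. All numbers below are hypothetical, not the specs of any real GPU:

```python
# Illustrative "peak TFLOPs" math. Every input here is a best-case,
# hypothetical figure; real kernels rarely hit all of them at once.
sm_count = 132                 # hypothetical count of streaming multiprocessors
clock_ghz = 1.8                # hypothetical boost clock, often not sustained
flops_per_sm_per_cycle = 512   # hypothetical tensor-core throughput (FMA = 2 FLOPs)

dense_tflops = sm_count * clock_ghz * flops_per_sm_per_cycle / 1000
sparse_tflops = dense_tflops * 2  # 2:4 structured sparsity doubles the headline

print(f"dense  peak: {dense_tflops:.1f} TFLOP/s")   # 121.7
print(f"sparse peak: {sparse_tflops:.1f} TFLOP/s")  # 243.3
```

The sparsity-doubled figure is often the one quoted, even though dense transformer matmuls don't benefit from it unless the model is actually pruned to the 2:4 pattern.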
ADVICE

Optimize Arithmetic Intensity

  • Optimize for arithmetic intensity (FLOPs per byte moved) to reduce expensive data movement and get closer to peak TFLOPs.
  • Keep frequently accessed data in the fastest on-chip memory to avoid memory-bandwidth bottlenecks during attention and transformer ops.
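The advice above is the roofline model in practice: attainable throughput is capped by the smaller of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch, with hypothetical GPU numbers (not real specs):

```python
# Roofline sketch: attainable FLOP/s = min(peak compute, AI * bandwidth).
peak_tflops = 120.0   # hypothetical peak compute, TFLOP/s
bandwidth_tbs = 2.0   # hypothetical HBM bandwidth, TB/s

def attainable_tflops(ai):
    """ai = arithmetic intensity in FLOPs per byte moved from HBM."""
    return min(peak_tflops, ai * bandwidth_tbs)

def matmul_ai(n, bytes_per_elem=2):
    """AI of a square FP16 matmul: 2*N^3 FLOPs over 3 N*N operand tiles."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

for n in (128, 1024, 8192):
    ai = matmul_ai(n)
    print(f"N={n}: AI={ai:.1f} FLOPs/byte -> {attainable_tflops(ai):.1f} TFLOP/s")
```

With these numbers the ridge point is 120 / 2 = 60 FLOPs/byte, so a small N=128 matmul (AI ≈ 42.7) is memory-bound at ~85 TFLOP/s, while larger matmuls cross the ridge and become compute-bound. This is why keeping hot data in on-chip memory, which raises effective arithmetic intensity, moves a kernel toward the compute roof.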
INSIGHT

NVIDIA Became An AI Systems Company

  • NVIDIA evolved from chip maker to full AI systems vendor after acquiring Mellanox for networking.
  • That acquisition added InfiniBand and switch expertise, making NVLink+networking part of NVIDIA's reference architecture.