MLOps.community

Performance Optimization and Software/Hardware Co-design across PyTorch, CUDA, and NVIDIA GPUs

Feb 24, 2026
Chris Fregly, AI performance engineer, founder, and author, walks through software/hardware co-design for PyTorch, CUDA, and NVIDIA GPUs. He covers mechanical sympathy, GPU generations, NVLink and networking, kernel tuning with coding agents, and infrastructure trade-offs for training versus inference. Short, technical, and focused on building scalable, high-performance AI systems.
ANECDOTE

Peak TFLOPs Are Marketing Math

  • Marketing numbers like peak teraflops can mislead because peak FLOPs are achievable only under specific, idealized conditions.
  • Chris calls some of this 'Jensen math': spec-sheet figures that don't reflect real transformer workloads.
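To see why headline numbers diverge from real workloads, here is a minimal sketch of how a peak-TFLOPs spec is typically computed by multiplying best-case factors. All numbers below are hypothetical, not the specs of any real GPU:

```python
# Illustrative "peak TFLOPs" math. Every input here is a best-case,
# hypothetical figure; real kernels rarely hit all of them at once.
sm_count = 132                 # hypothetical count of streaming multiprocessors
clock_ghz = 1.8                # hypothetical boost clock, often not sustained
flops_per_sm_per_cycle = 512   # hypothetical tensor-core throughput (FMA = 2 FLOPs)

dense_tflops = sm_count * clock_ghz * flops_per_sm_per_cycle / 1000
sparse_tflops = dense_tflops * 2  # 2:4 structured sparsity doubles the headline

print(f"dense  peak: {dense_tflops:.1f} TFLOP/s")   # 121.7
print(f"sparse peak: {sparse_tflops:.1f} TFLOP/s")  # 243.3
```

The sparsity-doubled figure is often the one quoted, even though dense transformer matmuls don't benefit from it unless the model is actually pruned to the 2:4 pattern.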
ADVICE

Optimize Arithmetic Intensity

  • Optimize for arithmetic intensity (FLOPs per byte moved) to reduce expensive data movement and get closer to peak TFLOPs.
  • Keep frequently accessed data in the fastest on-chip memory to avoid memory-bandwidth bottlenecks during attention and transformer ops.
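The advice above is the roofline model in practice: attainable throughput is capped by the smaller of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch, with hypothetical GPU numbers (not real specs):

```python
# Roofline sketch: attainable FLOP/s = min(peak compute, AI * bandwidth).
peak_tflops = 120.0   # hypothetical peak compute, TFLOP/s
bandwidth_tbs = 2.0   # hypothetical HBM bandwidth, TB/s

def attainable_tflops(ai):
    """ai = arithmetic intensity in FLOPs per byte moved from HBM."""
    return min(peak_tflops, ai * bandwidth_tbs)

def matmul_ai(n, bytes_per_elem=2):
    """AI of a square FP16 matmul: 2*N^3 FLOPs over 3 N*N operand tiles."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

for n in (128, 1024, 8192):
    ai = matmul_ai(n)
    print(f"N={n}: AI={ai:.1f} FLOPs/byte -> {attainable_tflops(ai):.1f} TFLOP/s")
```

With these numbers the ridge point is 120 / 2 = 60 FLOPs/byte, so a small N=128 matmul (AI ≈ 42.7) is memory-bound at ~85 TFLOP/s, while larger matmuls cross the ridge and become compute-bound. This is why keeping hot data in on-chip memory, which raises effective arithmetic intensity, moves a kernel toward the compute roof.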
INSIGHT

NVIDIA Became An AI Systems Company

  • NVIDIA evolved from chip maker to full AI systems vendor after acquiring Mellanox for networking.
  • That acquisition added InfiniBand and switch expertise, making NVLink+networking part of NVIDIA's reference architecture.