MLOps.community

Fixing GPU Starvation in Large-Scale Distributed Training

Apr 3, 2026
Kashish Mittal, Staff Software Engineer at Uber who builds hyperscale ML infrastructure, talks about solving GPU starvation in large-scale training. He recounts full-stack profiling and tracing to find hidden CPU bottlenecks. He explains reshaping data reads, packing tensors to cut transfers, caching transformed NumPy tensors, and trade-offs between latency and utilization in serving.
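
One of the transfer-side ideas mentioned above, packing many small tensors into a single copy, can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch setup; the function name pack_and_transfer and the toy feature tensors are placeholders, not code from the episode.

import torch

def pack_and_transfer(tensors, device):
    # Concatenate many small same-dtype tensors into one contiguous buffer,
    # copy it to the device once, then split it back into per-tensor views.
    lengths = [t.numel() for t in tensors]
    flat = torch.cat([t.reshape(-1) for t in tensors])          # one contiguous buffer
    if device.type == "cuda":
        flat = flat.pin_memory()                                # enables async copy
    flat_dev = flat.to(device, non_blocking=True)               # single H2D transfer
    out, offset = [], 0
    for t, n in zip(tensors, lengths):
        out.append(flat_dev[offset:offset + n].view(t.shape))   # views, no extra copies
        offset += n
    return out

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
features = [torch.randn(32, 16) for _ in range(100)]            # many small tensors
packed = pack_and_transfer(features, device)

The win comes from replacing many small host-to-device copies with one large, pinned, asynchronous copy; the device-side views add no further copies.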
INSIGHT

Infrastructure Usually Limits ML Scaling

  • Infrastructure, not model architecture, is usually the limiting factor when scaling ML; data I/O is the recurring bottleneck.
  • Kashish observed that model-side changes (quantization, distillation) rarely matter compared to feeding the GPUs reliably.
ANECDOTE

In-Memory Test Revealed Hidden GPU Headroom

  • Kashish's team saw A100 GPUs at only 15–20% utilization and tested the bottleneck hypothesis by loading the data into RAM, which pushed GPU usage to ~85%.
  • That quick in-memory test isolated the issue to data delivery rather than model speed, proving the headroom existed (a rough sketch of this kind of probe follows below).
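
A rough sketch of that style of in-memory probe, assuming a PyTorch model and a dataset slice that fits in host RAM; preload_to_ram and throughput_probe are illustrative names, not Uber's tooling. If samples/sec (or GPU utilization) jumps when the loop reads only from RAM, the bottleneck is data delivery, not the model.

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def preload_to_ram(dataset, num_samples):
    # Materialize a slice of the dataset in host RAM so the training loop
    # never touches storage or the original decode/transform path.
    # Assumes dataset[i] returns an (input, label) pair of tensors.
    xs, ys = zip(*(dataset[i] for i in range(num_samples)))
    return TensorDataset(torch.stack(xs), torch.stack(ys))

def throughput_probe(model, ram_dataset, device, batch_size=256, steps=200):
    # Train for a fixed number of steps on the in-RAM data and report
    # samples/sec; compare against the normal pipeline to estimate headroom.
    loader = DataLoader(ram_dataset, batch_size=batch_size, shuffle=True,
                        pin_memory=(device.type == "cuda"))
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    model.train()
    start, seen = time.perf_counter(), 0
    for step, (x, y) in enumerate(loader):
        if step >= steps:
            break
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        seen += x.size(0)
    if device.type == "cuda":
        torch.cuda.synchronize()          # count queued GPU work before timing
    return seen / (time.perf_counter() - start)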
ADVICE

Trace Producer And Consumer Paths First

  • Profile end-to-end data pipelines (producer + consumer) to locate where queues empty and GPUs stall.
  • Kashish instrumented Petastorm traces to see whether readers or consumers were the bottleneck before changing the architecture; a generic version of that timing check is sketched below.
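
One generic way to get that producer/consumer split without changing the pipeline, sketched under the assumption of a plain Python batch iterator and a train-step callable (not the actual Petastorm instrumentation discussed in the episode): time how long each iteration blocks waiting for the next batch versus how long the step itself takes.

import time

def trace_loader_vs_compute(batch_iter, train_step, max_batches=500):
    # Record how long each iteration waits for data versus how long the
    # training step itself takes; a large wait share points at the readers.
    wait_s = compute_s = 0.0
    it = iter(batch_iter)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)              # blocks while the producer catches up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)                 # forward/backward/optimizer step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    total = (wait_s + compute_s) or 1e-9
    print(f"data wait: {wait_s:.1f}s ({100 * wait_s / total:.0f}%), "
          f"compute: {compute_s:.1f}s ({100 * compute_s / total:.0f}%)")

If the wait share dominates, the readers (producers) are starving the GPU; if compute dominates, the consumer side is the place to look. For GPU work, train_step should synchronize before returning, or its time will be undercounted.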