MLOps.community

Fixing GPU Starvation in Large-Scale Distributed Training

Apr 3, 2026
Kashish Mittal, Staff Software Engineer at Uber who builds hyperscale ML infrastructure, talks about solving GPU starvation in large-scale training. He recounts full-stack profiling and tracing to find hidden CPU bottlenecks. He explains reshaping data reads, packing tensors to cut transfers, caching transformed NumPy tensors, and trade-offs between latency and utilization in serving.
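
One of the transfer-side ideas mentioned above, packing many small tensors into a single copy, can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch setup; the function name pack_and_transfer and the toy feature tensors are placeholders, not code from the episode.

import torch

def pack_and_transfer(tensors, device):
    # Concatenate many small same-dtype tensors into one contiguous buffer,
    # copy it to the device once, then split it back into per-tensor views.
    lengths = [t.numel() for t in tensors]
    flat = torch.cat([t.reshape(-1) for t in tensors])          # one contiguous buffer
    if device.type == "cuda":
        flat = flat.pin_memory()                                # enables async copy
    flat_dev = flat.to(device, non_blocking=True)               # single H2D transfer
    out, offset = [], 0
    for t, n in zip(tensors, lengths):
        out.append(flat_dev[offset:offset + n].view(t.shape))   # views, no extra copies
        offset += n
    return out

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
features = [torch.randn(32, 16) for _ in range(100)]            # many small tensors
packed = pack_and_transfer(features, device)

The win comes from replacing many small host-to-device copies with one large, pinned, asynchronous copy; the device-side views add no further copies.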
INSIGHT

Infrastructure Usually Limits ML Scaling

  • Infrastructure, not model architecture, is usually the limiting factor when scaling ML; data I/O is the recurring bottleneck.
  • Kashish observed that model-side changes (quantization, distillation) rarely matter compared to feeding the GPUs reliably.
ANECDOTE

In-Memory Test Revealed Hidden GPU Headroom

  • Kashish's team saw A100 GPUs at only 15–20% utilization and tested the bottleneck hypothesis by loading the data into RAM, which pushed GPU usage to ~85%.
  • That quick in-memory test isolated the issue to data delivery rather than model speed, proving the headroom existed (a rough sketch of this kind of probe follows below).
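
A rough sketch of that style of in-memory probe, assuming a PyTorch model and a dataset slice that fits in host RAM; preload_to_ram and throughput_probe are illustrative names, not Uber's tooling. If samples/sec (or GPU utilization) jumps when the loop reads only from RAM, the bottleneck is data delivery, not the model.

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def preload_to_ram(dataset, num_samples):
    # Materialize a slice of the dataset in host RAM so the training loop
    # never touches storage or the original decode/transform path.
    # Assumes dataset[i] returns an (input, label) pair of tensors.
    xs, ys = zip(*(dataset[i] for i in range(num_samples)))
    return TensorDataset(torch.stack(xs), torch.stack(ys))

def throughput_probe(model, ram_dataset, device, batch_size=256, steps=200):
    # Train for a fixed number of steps on the in-RAM data and report
    # samples/sec; compare against the normal pipeline to estimate headroom.
    loader = DataLoader(ram_dataset, batch_size=batch_size, shuffle=True,
                        pin_memory=(device.type == "cuda"))
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    model.train()
    start, seen = time.perf_counter(), 0
    for step, (x, y) in enumerate(loader):
        if step >= steps:
            break
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        seen += x.size(0)
    if device.type == "cuda":
        torch.cuda.synchronize()          # count queued GPU work before timing
    return seen / (time.perf_counter() - start)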
ADVICE

Trace Producer And Consumer Paths First

  • Profile end-to-end data pipelines (producer + consumer) to locate where queues empty and GPUs stall.
  • Kashish instrumented Petastorm traces to see whether readers or consumers were the bottleneck before changing the architecture; a generic version of that timing check is sketched below.
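
One generic way to get that producer/consumer split without changing the pipeline, sketched under the assumption of a plain Python batch iterator and a train-step callable (not the actual Petastorm instrumentation discussed in the episode): time how long each iteration blocks waiting for the next batch versus how long the step itself takes.

import time

def trace_loader_vs_compute(batch_iter, train_step, max_batches=500):
    # Record how long each iteration waits for data versus how long the
    # training step itself takes; a large wait share points at the readers.
    wait_s = compute_s = 0.0
    it = iter(batch_iter)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)              # blocks while the producer catches up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)                 # forward/backward/optimizer step
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    total = (wait_s + compute_s) or 1e-9
    print(f"data wait: {wait_s:.1f}s ({100 * wait_s / total:.0f}%), "
          f"compute: {compute_s:.1f}s ({100 * compute_s / total:.0f}%)")

If the wait share dominates, the readers (producers) are starving the GPU; if compute dominates, the consumer side is the place to look. For GPU work, train_step should synchronize before returning, or its time will be undercounted.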