
Fixing GPU Starvation in Large-Scale Distributed Training
MLOps.community
Feature IO costs for large inference batches
Kashish discusses the costs and design choices involved in fetching hundreds of features per item for large serving batches.
Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.
Fixing GPU Starvation in Large-Scale Distributed Training // MLOps Podcast #367 with Kashish Mittal, Staff Software Engineer at Uber
Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter
MLOps GPU Guide: https://go.mlops.community/gpuguide
// Abstract
Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML scaling.
The conversation dives deep into a recent architectural war story. Kashish walks through the full-stack profiling and detective work required to solve a massive GPU starvation bottleneck. By redesigning the Petastorm caching layer to bypass CPU transformation walls and uncovering hidden distributed race conditions, his team boosted GPU utilization to 60%+ and cut training time by 80%. Kashish also shares his philosophy on the fundamental trade-offs between latency and efficiency in GPU serving.
// Bio
Kashish Mittal is a Staff Software Engineer at Uber, where he architects the hyperscale machine learning infrastructure that powers Uber’s core mobility and delivery marketplaces. Prior to Uber, Kashish spent nearly a decade at Google building highly scalable, low-latency distributed ML systems for flagship products, including YouTube Ads and Core Search Ranking. His engineering expertise lies at the intersection of distributed systems and AI—specifically focusing on large-scale data processing, eliminating critical I/O bottlenecks, and maximizing GPU efficiency for petabyte-scale training pipelines. When he isn't hunting down distributed race conditions, he is a passionate advocate for open-source architecture and building reproducible, high-throughput ML systems.
// Related Links
Website: https://www.uber.com/
Getting Humans Out of the Way: How to Work with Teams of Agents // MLOps Podcast #368 with Rob Ennals, the Creator of Broomy: https://www.youtube.com/watch?v=ie1M8p-SVfM
~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~
Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore
Join our Slack community [https://go.mlops.community/slack]
Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)
Sign up for the next meetup: [https://go.mlops.community/register]
MLOps Swag/Merch: [https://shop.mlops.community/]
Connect with Demetrios on LinkedIn: /dpbrinkm
Connect with Kashish on LinkedIn: /kashishmittal/
Timestamps:
[00:00] Local dataset caching
[00:30] Engineers Evolving Roles
[04:44] GPU Resource Management
[10:21] GPU Utilization Issues
[21:49] More GPU War Stories
[32:12] Model Serving Issues
[39:58] Reflective Learning in Coding
[43:23] Workflow and Reflective Skills
[52:30] Wrap up


