Serving trade-offs: latency versus utilization

Kashish contrasts training with serving, explaining latency-efficient padded batches versus batching for utilization.

Play episode from 22:32

chevron_right

Transcript

chevron_right

Transcript

Episode notes

Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.

Fixing GPU Starvation in Large-Scale Distributed Training // MLOps Podcast #367 with Kashish Mittal, Staff Software Engineer at Uber

Join the Community: https://go.mlops.community/YTJoinIn

Get the newsletter: https://go.mlops.community/YTNewsletter

MLOps GPU Guide: https://go.mlops.community/gpuguide

// Abstract

Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML scaling.

The conversation dives deep into a recent architectural war story. Kashish walks through the full-stack profiling and detective work required to solve a massive GPU starvation bottleneck. By redesigning the Petastorm caching layer to bypass CPU transformation walls and uncovering hidden distributed race conditions, his team boosted GPU utilization to 60%+ and cut training time by 80%. Kashish also shares his philosophy on the fundamental trade-offs between latency and efficiency in GPU serving.

// Bio

Kashish Mittal is a Staff Software Engineer at Uber, where he architects the hyperscale machine learning infrastructure that powers Uber’s core mobility and delivery marketplaces. Prior to Uber, Kashish spent nearly a decade at Google building highly scalable, low-latency distributed ML systems for flagship products, including YouTube Ads and Core Search Ranking. His engineering expertise lies at the intersection of distributed systems and AI—specifically focusing on large-scale data processing, eliminating critical I/O bottlenecks, and maximizing GPU efficiency for petabyte-scale training pipelines. When he isn't hunting down distributed race conditions, he is a passionate advocate for open-source architecture and building reproducible, high-throughput ML systems.

// Related Links

Website: https://www.uber.com/

Getting Humans Out of the Way: How to Work with Teams of Agents // MLOps Podcast #368 with Rob Ennals, the Creator of Broomy: https://www.youtube.com/watch?v=ie1M8p-SVfM

~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~

Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore

Join our Slack community [https://go.mlops.community/slack]

Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)]

MLOps Swag/Merch: [https://shop.mlops.community/]

Connect with Demetrios on LinkedIn: /dpbrinkm

Connect with Kashish on LinkedIn: /kashishmittal/

Timestamps:

[00:00] Local dataset caching

[00:30] Engineers Evolving Roles

[04:44] GPU Resource Management

[10:21] GPU Utilization Issues