
MLOps.community Building Out GPU Clouds // Mohan Atreya // #317
May 23, 2025
Mohan Atreya, Chief Product Officer at Rafay Systems with a rich background at Okta and McAfee, dives into the chaos of GPUs in AI. He discusses the hurdles of GPU scarcity and high prices, as well as dynamic cloud models that adopt tokenized access. The conversation highlights the challenges of crafting GPU cloud infrastructures, power management issues, and how innovative strategies are redefining user experience. Mohan also touches on the shift towards payer-friendly systems that enhance flexibility, paving the way for a more efficient AI landscape.
AI Snips
GPU Failures Are Common
- GPU systems have high failure rates, around 30%, disrupting long training runs.
- Providers must offer capacity swap and SLA guarantees to maintain user trust during failures.
Automate GPU Inventory Management
- Maintain a dynamic and accurate inventory of GPU resources across servers and networks.
- Use this inventory as the source of truth to automate capacity allocation and tenant access networking.
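As a rough illustration of the inventory-as-source-of-truth idea, the sketch below keeps an in-memory record of GPUs and derives allocation from it. It also shows how the capacity swap mentioned under "GPU Failures Are Common" could fall out of the same record. Every name here (`Gpu`, `Inventory`, `allocate`, `mark_failed`) is hypothetical and invented for this example; nothing in the episode describes this data model or any Rafay API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gpu:
    """One GPU record in the (hypothetical) inventory."""
    gpu_id: str
    server: str
    healthy: bool = True
    tenant: Optional[str] = None  # None means the GPU is unallocated


@dataclass
class Inventory:
    """In-memory stand-in for an inventory service; an assumption, not a real API."""
    gpus: Dict[str, Gpu] = field(default_factory=dict)

    def add(self, gpu: Gpu) -> None:
        self.gpus[gpu.gpu_id] = gpu

    def free(self) -> List[Gpu]:
        # Only healthy, unallocated GPUs count as available capacity.
        return [g for g in self.gpus.values() if g.healthy and g.tenant is None]

    def allocate(self, tenant: str, count: int) -> List[str]:
        # Sort by ID so allocation order is deterministic.
        candidates = sorted(self.free(), key=lambda g: g.gpu_id)
        if len(candidates) < count:
            raise RuntimeError(
                f"requested {count} GPUs, only {len(candidates)} available"
            )
        chosen = candidates[:count]
        for gpu in chosen:
            gpu.tenant = tenant
        return [gpu.gpu_id for gpu in chosen]

    def mark_failed(self, gpu_id: str) -> Optional[str]:
        """Record a failure and try to swap in a spare for the affected tenant.

        Returns the replacement GPU ID, or None if the failed GPU was
        unallocated or no spare is available.
        """
        failed = self.gpus[gpu_id]
        tenant = failed.tenant
        failed.healthy = False
        failed.tenant = None
        if tenant is None:
            return None
        spares = sorted(self.free(), key=lambda g: g.gpu_id)
        if not spares:
            return None
        spares[0].tenant = tenant
        return spares[0].gpu_id
```

For example, with four GPUs registered, allocating two to a tenant and then marking one of them failed hands that tenant the next free GPU. A real system would persist this state and reconcile it against hardware discovery instead of trusting an in-memory dictionary.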
Solve Noisy Neighbor Issues
- Automate network isolation using EVPN, VXLAN, and related technologies for multi-tenant GPU clouds.
- Provide a unified orchestrator to manage diverse hardware and network vendors programmatically.
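One small piece of automated isolation can be sketched as follows: giving each tenant its own VXLAN network identifier (VNI) and emitting a vendor-neutral description of the segment that an orchestrator could translate per switch vendor. The 24-bit VNI width comes from the VXLAN specification; the `VniAllocator` class, the `isolation_intent` function, the starting VNI of 10000, and the shape of the intent dictionary are all assumptions made for this example, not a real switch or orchestrator API.

```python
from typing import Dict, List

VNI_MAX = 2**24 - 1  # VXLAN network identifiers are 24 bits wide


class VniAllocator:
    """Hands out one VNI per tenant (hypothetical helper, not a real API)."""

    def __init__(self, start: int = 10000, end: int = VNI_MAX) -> None:
        if not (1 <= start <= end <= VNI_MAX):
            raise ValueError("VNI range must lie within 1..2**24 - 1")
        self._next = start
        self._end = end
        self._by_tenant: Dict[str, int] = {}

    def vni_for(self, tenant: str) -> int:
        # Idempotent: a tenant always gets the same VNI back.
        if tenant in self._by_tenant:
            return self._by_tenant[tenant]
        if self._next > self._end:
            raise RuntimeError("VNI range exhausted")
        vni = self._next
        self._next += 1
        self._by_tenant[tenant] = vni
        return vni


def isolation_intent(tenant: str, vni: int, ports: List[str]) -> dict:
    """Build a vendor-neutral description of a tenant segment.

    A per-vendor driver would turn this into actual EVPN/VXLAN
    configuration; that translation is out of scope for this sketch.
    """
    return {"tenant": tenant, "vni": vni, "ports": sorted(ports)}
```

Keeping the intent separate from vendor-specific configuration is one way to read the "unified orchestrator" point: the allocation logic stays the same while only the translation layer changes per hardware or network vendor.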
