MLOps.community

Building Out GPU Clouds // Mohan Atreya // #317

May 23, 2025
Mohan Atreya, Chief Product Officer at Rafay Systems with a rich background at Okta and McAfee, dives into the chaos of GPUs in AI. He discusses the hurdles of GPU scarcity and high prices, as well as dynamic cloud models that adopt tokenized access. The conversation highlights the challenges of crafting GPU cloud infrastructures, power management issues, and how innovative strategies are redefining user experience. Mohan also touches on the shift towards payer-friendly systems that enhance flexibility, paving the way for a more efficient AI landscape.
AI Snips
INSIGHT

GPU Failures Are Common

  • GPU systems fail often (a failure rate of around 30% is cited), which disrupts long training runs.
  • Providers must offer capacity swaps and SLA guarantees to maintain user trust when failures occur.
ADVICE

Automate GPU Inventory Management

  • Maintain a dynamic and accurate inventory of GPU resources across servers and networks.
  • Use this inventory as the source of truth for automating capacity allocation and the setup of tenant access networks.
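
As a rough illustration of the inventory idea above, here is a minimal Python sketch. It assumes an in-memory inventory keyed by node name; the names GpuNode, Inventory, allocate, and mark_failed are hypothetical and do not come from the episode or from any Rafay product. A real system would back this with a database and live hardware discovery.

```python
from dataclasses import dataclass, field


@dataclass
class GpuNode:
    # Hypothetical record for one GPU server in the inventory.
    name: str
    total_gpus: int
    healthy: bool = True
    allocated: dict = field(default_factory=dict)  # tenant -> GPU count

    def free_gpus(self) -> int:
        # An unhealthy node offers no capacity.
        if not self.healthy:
            return 0
        return self.total_gpus - sum(self.allocated.values())


class Inventory:
    """Single source of truth that allocation decisions read from."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def allocate(self, tenant: str, gpus: int) -> str:
        # Place the request on the first healthy node with enough free GPUs.
        for node in self.nodes.values():
            if node.free_gpus() >= gpus:
                node.allocated[tenant] = node.allocated.get(tenant, 0) + gpus
                return node.name
        raise RuntimeError("no capacity available")

    def mark_failed(self, name: str) -> dict:
        # Mark a node unhealthy and return the allocations that must be moved.
        node = self.nodes[name]
        node.healthy = False
        displaced, node.allocated = node.allocated, {}
        return displaced
```

Because every allocation goes through the same object, a failure can be handled by calling mark_failed and then re-running allocate for each displaced tenant, which is one simple way to approximate a capacity swap.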
ADVICE

Solve Noisy Neighbor Issues

  • Automate network isolation using EVPN, VXLAN, and related technologies for multi-tenant GPU clouds.
  • Provide a unified orchestrator to manage diverse hardware and network vendors programmatically.
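
One small piece of the isolation advice above can be sketched in Python: giving each tenant its own VXLAN network identifier (VNI) so that tenant traffic lands in separate overlay segments. The class VniAllocator, its starting value of 10000, and the helper same_segment are assumptions made for illustration. The sketch only does the bookkeeping; it does not configure any switch or EVPN control plane, which an orchestrator would do through vendor-specific tooling.

```python
class VniAllocator:
    # A VXLAN VNI is a 24-bit identifier, so this is the largest value.
    MAX_VNI = 2**24 - 1

    def __init__(self, start: int = 10000):
        # The starting value is an arbitrary choice for this sketch.
        self._next = start
        self._by_tenant = {}

    def vni_for(self, tenant: str) -> int:
        # Return the tenant's VNI, assigning a fresh one on first use.
        if tenant not in self._by_tenant:
            if self._next > self.MAX_VNI:
                raise RuntimeError("VNI space exhausted")
            self._by_tenant[tenant] = self._next
            self._next += 1
        return self._by_tenant[tenant]


def same_segment(alloc: VniAllocator, tenant_a: str, tenant_b: str) -> bool:
    # Two tenants share an overlay segment only if they share a VNI.
    return alloc.vni_for(tenant_a) == alloc.vni_for(tenant_b)
```

Keeping the tenant-to-VNI mapping in one place is what lets an orchestrator render consistent configuration for hardware from different vendors.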