Software Misadventures

Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

May 7, 2021
Todd Underwood, Sr Director of Engineering at Google, shares his extensive experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail due to issues unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the crucial skills needed for hiring ML SREs. Todd also emphasizes the importance of empathy in tech during high-pressure scenarios and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for robust collaboration among teams.
INSIGHT

ML System Failures

  • ML system failures often stem from non-ML issues like data format inconsistencies.
  • Systems thinking is key for addressing these issues, not just pure ML expertise.
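As a rough illustration of the kind of non-ML failure described above, the following minimal sketch checks incoming records against an expected format before they reach a training pipeline. The schema, field names, and the `find_format_errors` helper are all hypothetical and are not taken from the episode; a real pipeline would likely rely on a dedicated validation tool.

```python
# Hypothetical schema (an assumption for illustration): field name -> expected Python type.
EXPECTED_SCHEMA = {
    "user_id": int,
    "timestamp": str,
    "score": float,
}


def find_format_errors(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable format problems in one record.

    An empty list means the record matches the expected schema.
    """
    errors = []
    # Check that every expected field is present and has the expected type.
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Flag fields the schema does not know about, which often signal an upstream change.
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


if __name__ == "__main__":
    # An upstream producer started sending user_id as a string.
    bad_record = {"user_id": "42", "timestamp": "2021-05-07T00:00:00Z", "score": 0.9}
    for problem in find_format_errors(bad_record):
        print(problem)
```

Nothing in a check like this involves model code, which is the point: catching a changed field type at the pipeline boundary is ordinary systems work rather than ML expertise.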
INSIGHT

MLOps Team Structures

  • MLOps engineers can work on either product or platform teams, depending on platform maturity.
  • A universal ML platform is ideal, but current platforms don't cover all use cases.
INSIGHT

Balancing Platforms and Customization

  • A balance between standardized platforms and custom solutions is needed in MLOps.
  • Simple platforms accelerate innovation for common use cases, while custom solutions cater to complex needs.