
Software Misadventures
Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10
May 7, 2021
Todd Underwood, Sr Director of Engineering at Google, shares his experience in Site Reliability Engineering (SRE) for machine learning. He discusses how ML systems often fail for reasons unrelated to ML itself, the unique challenges of engineering reliable ML systems, and the skills to look for when hiring ML SREs. Todd also emphasizes the importance of empathy during high-pressure scenarios, reflects on the balance between traditional software practices and the demands of ML pipelines, and makes the case for robust collaboration among teams.
AI Snips
ML System Failures
- ML system failures often stem from non-ML issues like data format inconsistencies.
- Systems thinking is key for addressing these issues, not just pure ML expertise.
MLOps Team Structures
- MLOps engineers can work on either product or platform teams, depending on platform maturity.
- A universal ML platform is ideal, but current platforms don't cover all use cases.
Balancing Platforms and Customization
- A balance between standardized platforms and custom solutions is needed in MLOps.
- Simple platforms accelerate innovation for common use cases, while custom solutions cater to complex needs.
