Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

May 7, 2021

Todd Underwood, Sr. Director of Engineering at Google, shares his experience in Site Reliability Engineering for Machine Learning. He discusses how ML systems often fail for reasons unrelated to ML itself, the particular challenges of engineering reliable ML systems, and the skills to look for when hiring ML SREs. Todd also emphasizes the importance of empathy in high-pressure situations and reflects on the balance between traditional software practices and the demands of ML pipelines, making the case for close collaboration among teams.

INSIGHT

ML System Failures

ML system failures often stem from non-ML issues like data format inconsistencies.
Addressing these issues takes systems thinking, not just ML expertise.

INSIGHT

MLOps Team Structures

MLOps engineers can work on either product or platform teams, depending on platform maturity.
A universal ML platform is ideal, but current platforms don't cover all use cases.

INSIGHT

Balancing Platforms and Customization

A balance between standardized platforms and custom solutions is needed in MLOps.
Simple platforms accelerate innovation for common use cases, while custom solutions cater to complex needs.
