Google SRE Prodcast

The One With Damion Yates and Building AI systems

Feb 26, 2026
Damion Yates is a reliability engineering lead at Google DeepMind who built reliability practices for large-scale AI research. He talks about creating a reliability team from scratch. He covers training researchers in resilience, building proactive tooling and guardrails, handling lockstep training where one failure halts huge runs, and why avoiding lucky silence matters for dependable AI systems.
ADVICE

Measure GPU Utilization Before Increasing Quota

  • Measure utilization and build tooling instead of reactively increasing quota when researchers hit limits.
  • Damion pulled GPU/accelerator stats and built dashboards and alerts to predict capacity shortages before experiments failed.
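
As an illustration of predicting a shortage instead of reacting to one, here is a minimal sketch that extrapolates accelerator usage linearly toward a quota and flags when exhaustion looks close. The function names, the linear-growth assumption, and the 24-hour warning window are all hypothetical and are not taken from the episode; a real setup would read samples from whatever monitoring system is in place.

```python
from typing import Optional, Sequence


def hours_until_exhausted(
    used: Sequence[float], quota: float, interval_hours: float = 1.0
) -> Optional[float]:
    """Estimate hours until usage reaches quota.

    Assumes evenly spaced samples and linear growth (an illustrative
    simplification). Returns None when no estimate is possible.
    """
    if len(used) < 2:
        return None  # not enough history to estimate a trend
    if used[-1] >= quota:
        return 0.0  # already at or over quota
    growth_per_sample = (used[-1] - used[0]) / (len(used) - 1)
    if growth_per_sample <= 0:
        return None  # usage is flat or shrinking
    return (quota - used[-1]) / growth_per_sample * interval_hours


def should_alert(
    used: Sequence[float],
    quota: float,
    warn_hours: float = 24.0,  # hypothetical warning window
    interval_hours: float = 1.0,
) -> bool:
    """Return True when projected exhaustion falls inside the warning window."""
    eta = hours_until_exhausted(used, quota, interval_hours)
    return eta is not None and eta <= warn_hours


if __name__ == "__main__":
    samples = [100.0, 110.0, 120.0, 130.0]  # accelerators in use, hourly
    print(hours_until_exhausted(samples, quota=160.0))
    print(should_alert(samples, quota=160.0))
```

The point of the sketch is the ordering: the alert fires while there is still headroom, so the conversation about quota happens before an experiment fails.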
ADVICE

Teach Reliability In New Starter Onboarding

  • Teach newcomers reliability practices via onboarding so engineers and research scientists avoid common failure modes.
  • Damion slotted an infrastructure training session into the new-starter curriculum, covering container limits, locality, and retry patterns.
ADVICE

Gate Experiment Launches With A Running License

  • Gate the ability to launch experiments and require a basic 'running on infra' license to prevent accidental resource hogging.
  • Damion planned to gate access and add training that makes people mindful of replica counts and network limits.
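
One way to picture such a gate is a pre-launch check that rejects requests from people who have not completed the training or who ask for an unusually large number of replicas. The sketch below is purely illustrative: `LaunchRequest`, `check_launch`, the licensed-user set, and the replica cap of 64 are hypothetical names and values, not details from the episode.

```python
from dataclasses import dataclass
from typing import AbstractSet


@dataclass(frozen=True)
class LaunchRequest:
    user: str
    replicas: int


class LaunchDenied(Exception):
    """Raised when a launch request fails the pre-launch gate."""


def check_launch(
    req: LaunchRequest,
    licensed_users: AbstractSet[str],
    max_replicas: int = 64,  # hypothetical cap for illustration
) -> None:
    """Raise LaunchDenied unless the user is licensed and the size is sane."""
    if req.user not in licensed_users:
        raise LaunchDenied(
            f"{req.user} has not completed the 'running on infra' training"
        )
    if req.replicas < 1 or req.replicas > max_replicas:
        raise LaunchDenied(
            f"replica count {req.replicas} is outside 1..{max_replicas}"
        )


if __name__ == "__main__":
    licensed = {"alice"}
    check_launch(LaunchRequest("alice", 8), licensed)  # passes silently
    try:
        check_launch(LaunchRequest("bob", 8), licensed)
    except LaunchDenied as err:
        print(f"denied: {err}")
```

Keeping the check as a small function that raises makes it easy to call from whatever launcher is already in use, and the error message tells the person what to do next.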