
Google SRE Prodcast The One With Damion Yates and Building AI systems
Feb 26, 2026
Damion Yates is a reliability engineering lead at Google DeepMind who built reliability practices for large-scale AI research. He talks about creating a reliability team from scratch and covers training researchers in resilience, building proactive tooling and guardrails, handling lockstep training where one failure halts huge runs, and why avoiding lucky silence matters for dependable AI systems.
AI Snips
Measure GPU Utilization Before Increasing Quota
- Measure utilization and build tooling instead of reactively increasing quota when researchers hit limits.
- Damion pulled GPU/accelerator stats and built dashboards and alerts to predict capacity shortages before experiments failed.
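The episode summary does not describe the actual tooling, so the following is only a minimal sketch of the idea of predicting a shortage from usage data. It assumes evenly spaced hourly usage samples and a fixed quota; the function names `hours_until_quota` and `should_alert` and the 24-hour horizon are hypothetical, not taken from the episode.

```python
from typing import Optional, Sequence


def hours_until_quota(samples: Sequence[float], quota: float) -> Optional[float]:
    """Estimate hours until usage reaches quota from hourly usage samples.

    Fits a least-squares line through the samples (an assumed, simple model).
    Returns 0.0 if usage is already at or above quota, and None if there are
    too few samples or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    if samples[-1] >= quota:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var  # accelerators gained per hour
    if slope <= 0:
        return None
    return (quota - samples[-1]) / slope


def should_alert(samples: Sequence[float], quota: float,
                 horizon_hours: float = 24.0) -> bool:
    """Alert when the projected time to exhaustion falls inside the horizon."""
    eta = hours_until_quota(samples, quota)
    return eta is not None and eta <= horizon_hours
```

For example, `should_alert([100, 110, 120, 130], quota=160)` flags a shortage about three hours out, while a flat usage series never alerts.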
Teach Reliability In New Starter Onboarding
- Teach newcomers reliability practices via onboarding so engineers and research scientists avoid common failure modes.
- Damion slotted an infrastructure training session into the new-starter curriculum, covering container limits, locality, and retry patterns.
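The summary names retry patterns without saying which ones the training covers. One widely used pattern is exponential backoff with full jitter, sketched here under that assumption; `retry_with_backoff` and its parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or max_attempts is reached.

    The wait before each retry is drawn uniformly from [0, cap], where cap
    doubles per attempt up to max_delay, so many clients retrying at once
    do not hit the backend in synchronized waves.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

In real code the `except` clause would usually be narrowed to errors known to be transient, since retrying a permanent failure only adds load.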
Gate Experiment Launches With A Running License
- Gate the ability to launch experiments and require a basic 'running on infra' license to prevent accidental resource hogging.
- Damion planned to gate access and add a training course so that people stay mindful of replica counts and network limits.
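How such a gate is implemented is not stated in the summary. As a rough sketch, a pre-launch check might refuse requests from users who have not completed the training or who ask for more replicas than a configured ceiling. Everything named here (`LaunchRequest`, `check_launch`, the licensed-user set, the replica limit) is a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import AbstractSet, List


@dataclass(frozen=True)
class LaunchRequest:
    user: str
    replicas: int


def check_launch(request: LaunchRequest,
                 licensed_users: AbstractSet[str],
                 max_replicas: int) -> List[str]:
    """Return the reasons a launch should be blocked; empty means allowed."""
    problems = []
    if request.user not in licensed_users:
        problems.append(f"{request.user} has not completed the infra training")
    if request.replicas > max_replicas:
        problems.append(
            f"{request.replicas} replicas exceeds the limit of {max_replicas}"
        )
    return problems
```

Returning every reason at once, rather than failing on the first, lets a researcher fix all the issues before trying again.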
