Google SRE Prodcast

The One With Damion Yates and Building AI systems

Feb 26, 2026

Damion Yates is a reliability engineering lead at Google DeepMind who built reliability practices for large-scale AI research. He talks about creating a reliability team from scratch, training researchers in resilience, building proactive tooling and guardrails, handling lockstep training where one failure halts huge runs, and why avoiding lucky silence matters for dependable AI systems.

ADVICE

Measure GPU Utilization Before Increasing Quota

Measure utilization and build tooling instead of reactively increasing quota when researchers hit limits.
Damion pulled GPU/accelerator stats and built dashboards and alerts to predict capacity shortages before experiments fail.
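
As a rough illustration of predicting a shortage from usage data rather than reacting to it, here is a minimal sketch. It assumes daily samples of accelerators in use and a fixed quota, and extrapolates a simple linear trend. The function name, the sample values, and the linear model are all hypothetical; the episode does not describe the actual tooling.

```python
from typing import Optional, Sequence


def days_until_quota_exhausted(
    daily_in_use: Sequence[float], quota: float
) -> Optional[float]:
    """Estimate days until usage reaches quota via linear extrapolation.

    daily_in_use: accelerators in use, one sample per day, oldest first
                  (hypothetical input; real stats would come from monitoring).
    Returns None when usage is flat or shrinking, or with too few samples.
    """
    if len(daily_in_use) < 2:
        return None
    # Average growth per day between the first and last sample.
    growth = (daily_in_use[-1] - daily_in_use[0]) / (len(daily_in_use) - 1)
    if growth <= 0:
        return None
    remaining = quota - daily_in_use[-1]
    # Already at or over quota: the shortage is now.
    return max(0.0, remaining / growth)


def should_alert(daily_in_use: Sequence[float], quota: float,
                 warn_days: float = 7.0) -> bool:
    """Alert when projected exhaustion falls within warn_days (assumed threshold)."""
    eta = days_until_quota_exhausted(daily_in_use, quota)
    return eta is not None and eta <= warn_days
```

An alert built this way fires while there is still time to rebalance or free idle accelerators, which is the point of measuring before asking for more quota.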

ADVICE

Teach Reliability In New Starter Onboarding

Teach newcomers reliability practices via onboarding so engineers and research scientists avoid common failure modes.
Damion slotted an infrastructure training session into the new-starter curriculum, covering container limits, locality, and retry patterns.
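
The episode does not spell out which retry patterns the training covers. One common pattern is retrying with capped exponential backoff and jitter, sketched below under that assumption. The helper name and its default parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with capped exponential backoff.

    The delay before retry k is drawn uniformly from
    [0, min(max_delay, base_delay * 2**k)] ("full jitter"), so many
    clients failing at once do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

Bounding the number of attempts and spreading retries out matters most on shared infrastructure, where a tight retry loop from many replicas can itself become the outage.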

ADVICE

Gate Experiment Launches With A Running License

Gate the ability to launch experiments and require a basic 'running on infra' license to prevent accidental resource hogging.
Damion planned to gate access and pair it with training that makes people mindful of replica counts and network limits.
