

Google SRE Prodcast
Salim Virji
SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!
Episodes

Feb 26, 2026 • 31min
The One With Damion Yates and Building AI systems
Damion Yates is a reliability engineering lead at Google DeepMind who built reliability practices for large-scale AI research. He talks about creating a reliability team from scratch. He covers training researchers in resilience, building proactive tooling and guardrails, handling lockstep training where one failure halts huge runs, and why avoiding lucky silence matters for dependable AI systems.

Feb 11, 2026 • 26min
The One With Carla Geisser and Crisis Engineering
Carla Geisser is a former Google SRE who now runs Layer Aleph, which focuses on crisis engineering. She explains how crises differ from incidents and outlines five criteria that make a situation a true crisis. Conversations cover why organizations struggle when computers drive decisions, how messy modernization creates brittle systems, and when leaders will finally authorize big changes.

Feb 5, 2026 • 36min
The One with Parker Barnes, Felipe Tiengo Ferreira, and AI
Parker Barnes, product manager for Gemini model-level safety, focuses on policies and deployment safety. Felipe Tiengo Ferreira, tech lead on Gemini safety, handles monitoring and rapid response. They talk about model safety frameworks and policy design. They cover prelaunch testing, automated red teaming, and layered defenses like system instructions and classifiers. They discuss drift detection, observability, and the pressure of rapid AI development.

Jan 28, 2026 • 24min
The One With Shannon Brady and Operating Systems
In this episode of the Prodcast, guest Shannon Brady speaks with hosts Jordan Greenberg and Florian Rathgeber about managing Google's vast fleet of internal devices. Shannon explains how Google's Linux platform uses core SRE principles (specifically testing, canarying, and monitoring) for weekly staged rollouts of its Debian-based distribution. Configuration is managed using Puppet to ensure the right setup for a diverse user base. The conversation pivots to "the year of Linux everything," underscoring its widespread adoption. Discussing AI, Shannon identifies its greatest utility for SREs in rapidly analyzing signals and generating complex queries to resolve outages. This episode reinforces that practicing SRE fundamentals is paramount, demonstrating that you can be an SRE at heart, regardless of your official title.

Jan 21, 2026 • 30min
The One With Denia Del Cid and AI
Denia Del Cid is an SRE at Google who applies AI to reliability problems. She talks about early outage detection, incident similarity analysis, and cutting repetitive toil with LLM-powered tools. Denia stresses validating models with golden datasets and keeping humans in the loop. She also explores agent-driven on-demand analysis and practical steps for building trust in AI systems.

Jan 14, 2026 • 25min
The One With Heather Adkins and Security (and AI)
Heather Adkins, leader of Google's Office of Cybersecurity Resilience and a seasoned expert with over two decades at Google, dives into the future of digital defenses. She discusses the rise of 'Agentic AI hackers' and polymorphic malware, urging a shift in how we approach cybersecurity. From the 'Secure by Design' philosophy to innovative defense strategies, Heather emphasizes the importance of layered security. Her insights on utilizing AI for incident analysis and the need to harden critical nodes reveal a crucial evolution in tackling emerging threats.

Jan 7, 2026 • 39min
The One With SLOs
Join Alex Hidalgo, an author and SRE expert, and Brian Singer, co-founder at Nobl9, as they dive into the world of Service Level Objectives (SLOs). They discuss how SLOs create a common language for teams and explore the varying degrees of adoption across organizations. The conversation highlights crafting user-specific SLOs, the importance of ownership, and the pitfalls of central governance. They also touch on AI's potential in SLO design and the necessity of human oversight. Tune in for actionable insights on SLOs and their cultural impact!

Dec 16, 2025 • 34min
The One With Steph Hippo and Observability
Steph Hippo, Platform Engineering Director at Honeycomb, shares her expertise in AI-driven observability during a fascinating conversation. She explains how observability is key for understanding complex systems, creating a symbiotic relationship with AI. The discussion highlights how AI can enhance incident response, lead to self-healing systems, and significantly improve junior SRE onboarding. Steph encourages small teams to learn from others' mistakes and emphasizes the importance of structured growth conversations and experimentation.

Jul 30, 2025 • 32min
The One with Ben Good and Our Kubernetes Friends
Ben Good, a Google Cloud Solutions Architect skilled in platform engineering, joins Kaslin Fields, co-host of the Kubernetes podcast. They dive into the powerful role of Kubernetes in platform engineering, discussing how to create user-friendly 'golden paths' for developers. The conversation highlights the significance of observability, adapting to evolving customer needs, and improving deployment archetypes. They explore the importance of DORA metrics for assessing team success, all while emphasizing a tailored approach to platform design and user experience.

Jul 23, 2025 • 42min
The One With AI Agents, Ramón Llamas, and Swapnil Haria
In this installment, Swapnil Haria, a Google Software Engineer specializing in AI agents, and Ramón Llamas, a seasoned Staff Site Reliability Engineer, delve into the transformative impact of AI on production management. They discuss how these agents can summarize alerts, detect hidden errors, and even prevent outages. The duo highlights the balance between human expertise and AI capabilities, the complexities of evaluating non-deterministic systems, and the importance of structured postmortems in enhancing incident response.


