Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Mar 31, 2026

Lorin Hochstein, a staff software engineer who built chaos tools at Netflix and Airbnb, shares stories from real incidents and resilience work. He talks about the limits of fault injection tools. He discusses how reliability fixes can create new complexity. He explores resilience versus robustness, storytelling for post-mortems, and how organizations shape system behavior.

INSIGHT

Adding Reliability Can Create New Failures

Increasing reliability often adds complexity, which creates new failure modes; mitigations themselves can interact unexpectedly.
Lorin calls this effect Lorin's Law: reliability features like monitors and autoscalers can themselves fail in novel ways.

INSIGHT

Expect Unavoidable Failures From System Dynamics

Failures are inevitable because systems are dynamic and humans can't fully understand all concurrent changes and traffic patterns.
A common failure mode is saturation, where a resource or limit is exhausted even though the logic is correct.
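
The saturation point above can be illustrated with a small simulation. This is a minimal sketch, not something taken from the episode: the `simulate` function, its parameters, and the numbers used are all hypothetical. It models a bounded queue whose logic is correct, yet which starts rejecting work as soon as arrivals slightly exceed the service rate.

```python
from collections import deque


def simulate(arrivals_per_tick, served_per_tick, capacity, ticks):
    """Run a bounded FIFO queue for `ticks` steps (hypothetical model).

    Returns (rejected, peak_depth): the number of requests turned away
    because the queue was full, and the deepest the queue ever got.
    """
    queue = deque()
    rejected = 0
    peak_depth = 0
    for _ in range(ticks):
        # New work arrives; anything beyond the capacity limit is rejected.
        for _ in range(arrivals_per_tick):
            if len(queue) < capacity:
                queue.append(1)
            else:
                rejected += 1
        # Record the depth right after arrivals, before any work is served.
        peak_depth = max(peak_depth, len(queue))
        # The worker drains at most `served_per_tick` items per tick.
        for _ in range(min(served_per_tick, len(queue))):
            queue.popleft()
    return rejected, peak_depth


if __name__ == "__main__":
    # Below the service rate: the queue never fills.
    print(simulate(arrivals_per_tick=3, served_per_tick=5, capacity=10, ticks=100))
    # One extra request per tick: the same code now saturates.
    print(simulate(arrivals_per_tick=6, served_per_tick=5, capacity=10, ticks=100))
```

Nothing in the code changes between the two runs; only the traffic does. In the second run the backlog grows by one item per tick until it reaches the capacity limit, after which one request per tick is rejected.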

ADVICE

Build Capacity To Absorb Risk

Manage the capacity to absorb risk rather than trying to eliminate all risk.
Build general resilience: scale, on-call staffing, and other capacities that let you handle unexpected problems.

In this podcast, Michael Stiefel spoke to Lorin Hochstein about how real-world failures provide insight into how software systems actually work. Our first topic was that while automated fault injection tools can introduce basic robustness into a system, they cannot replicate the understanding that comes from mitigating complicated software failures in the real world. We then pondered how to get this information to software architects so that they can learn from failure. Ironically, in reliable systems, adding more reliability often leads to complexity, which can in turn lead to new failures. We often focus on making our systems robust against known failure patterns, but we have not learnt how to make software systems resilient to unknown failure modes, or to failures caused by changes in the external world or in the evolving system design.

Read a transcript of this interview: https://bit.ly/3NVhtf3

Subscribe to the Software Architects’ Newsletter for your monthly guide to the essential news and experience from industry peers on emerging patterns and technologies: https://www.infoq.com/software-architects-newsletter

Upcoming Events:
- QCon AI Boston 2026 (June 1-2, 2026): Learn how real teams are accelerating the entire software lifecycle with AI. https://boston.qcon.ai
- QCon San Francisco 2026 (November 16-20, 2026): https://qconsf.com/

The InfoQ Podcasts: Weekly inspiration to drive innovation and build great teams from senior software leaders. Listen to all our podcasts and read interview transcripts:
- The InfoQ Podcast: https://www.infoq.com/podcasts/
- Engineering Culture Podcast by InfoQ: https://www.infoq.com/podcasts/#engineering_culture
- Generally AI: https://www.infoq.com/generally-ai-podcast/

Follow InfoQ:
- Mastodon: https://techhub.social/@infoq
- X: https://x.com/InfoQ?from=@
- LinkedIn: https://www.linkedin.com/company/infoq/
- Facebook: https://www.facebook.com/InfoQdotcom#
- Instagram: https://www.instagram.com/infoqdotcom/?hl=en
- Youtube: https://www.youtube.com/infoq
- Bluesky: https://bsky.app/profile/infoq.com

Write for InfoQ: Learn and share the changes and innovations in professional software development.
- Join a community of experts.
- Increase your visibility.
- Grow your career.
https://www.infoq.com/write-for-infoq