
Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein
Mar 31, 2026
Lorin Hochstein, a staff software engineer who built chaos tools at Netflix and Airbnb, shares stories from real incidents and resilience work. He covers the limits of fault injection tools, how reliability fixes can create new complexity, resilience versus robustness, storytelling for post-mortems, and how organizations shape system behavior.
AI Snips
Adding Reliability Can Create New Failures
- Increasing reliability often adds complexity, which creates new failure modes; mitigations themselves can interact unexpectedly.
- Lorin calls this effect Lorin's Law: reliability features like monitors and autoscalers can fail in novel ways.
Expect Unavoidable Failures From System Dynamics
- Failures are inevitable because systems are dynamic and humans can't fully understand all concurrent changes and traffic patterns.
- A common failure mode is saturation, where a resource or limit is exhausted even though the logic is correct.
Build Capacity To Absorb Risk
- Manage the capacity to absorb risk rather than trying to eliminate all risk.
- Build general resilience: scale, on-call staffing, and other capacities that let you handle unexpected problems.

