The InfoQ Podcast

Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Mar 31, 2026
Lorin Hochstein, a staff software engineer who built chaos tools at Netflix and Airbnb, shares stories from real incidents and resilience work. He talks about the limits of fault injection tools. He discusses how reliability fixes can create new complexity. He explores resilience versus robustness, storytelling for post-mortems, and how organizations shape system behavior.
INSIGHT

Adding Reliability Can Create New Failures

  • Increasing reliability often adds complexity, which creates new failure modes; mitigations themselves can interact unexpectedly.
  • Lorin calls this effect Lorin's Law: reliability features like monitors and autoscalers can fail in novel ways.
INSIGHT

Expect Unavoidable Failures From System Dynamics

  • Failures are inevitable because systems are dynamic and humans can't fully understand all concurrent changes and traffic patterns.
  • A common failure mode is saturation, where resources or limits are exhausted even though the logic is correct.
ADVICE

Build Capacity To Absorb Risk

  • Manage the capacity to absorb risk rather than trying to eliminate all risk.
  • Build general resilience: scale, on-call staffing, and other capacities that let you handle unexpected problems.