This is Fine! A podcast about resilience engineering and software

Colette Alexander and Clint Byrum

A podcast about resilience engineering and software.
Ever wondered why things on the internet break? Do you work in software and wish that you could have a Dear-Abby-Like call-in show that could answer your deepest questions about how to make your workplace suck less? We're here to help!
Write us anonymously at our open question form
Email us at: thisisfine.softwarepodcast@gmail.com
Call us and leave a voicemail, or text us at: (401) 592-7574

Episodes


May 14, 2026 • 1h 1min

Interviewing for Incident Analysis w/special guest John Allspaw

The new website is live! thisisfinepod.com
You can find John Allspaw at Adaptive Capacity Labs: https://www.adaptivecapacitylabs.com
Mike McGill, the skateboarder: https://en.wikipedia.org/wiki/Mike_McGill
Annie Duke’s Thinking in Bets, referenced by our question-asker, is a great one: https://bookshop.org/p/books/thinking-in-bets-making-smarter-decisions-when-you-don-t-have-all-the-facts-annie-duke/31466984521c3d8a?ean=9780735216372&next=t
Naturalistic Decision Making has its own association, which has a ton of resources (and a conference!) - https://naturalisticdecisionmaking.org/
They also have a podcast! https://naturalisticdecisionmaking.org/new-podcast/
Gary Klein is the NDM guy - https://bookshop.org/p/books/seeing-what-others-don-t-the-remarkable-ways-we-gain-insights-chief-scientist-gary-klein/c4ae5e017fe005ff?ean=9781610393829&next=t
We contrast him and his style of approaching cognition and decision making with Kahneman and Tversky.
Kahneman and Tversky wrote a lot, but Judgement Under Uncertainty is probably the most famous? https://www.science.org/doi/abs/10.1126/science.185.4157.1124
And Kahneman wrote Thinking Fast and Slow: https://bookshop.org/p/books/thinking-fast-and-slow-daniel-kahneman-phd/83a544fe6f98df87?ean=9780606275644&next=t
It has been zero episodes since we’ve mentioned Lisanne Bainbridge’s Ironies of Automation: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
But also she has Verbal Reports as evidence of the process operator’s knowledge: https://www.sciencedirect.com/science/article/abs/pii/S1071581979603075?via%3Dihub
And the Etsy Debriefing Guide is super great: https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf
Sidney Dekker and The Field Guide are foundational: https://bookshop.org/p/books/the-field-guide-to-understanding-human-error-sidney-dekker/3a4209dfc8b3a721?ean=9781472439055&next=t
From Dekker’s field guide (pg 47) there is a list referencing Gary Klein’s questions for an incident investigation:
  Cues: What were you seeing? What were you focusing on? What were you expecting to happen?
  Interpretation: If you had to describe the situation to your colleague at that point, what would you have told?
  Errors: What mistakes (for example in interpretation) were likely at this point?
  Previous experience/knowledge: Were you reminded of any previous experience? Did this situation fit a standard scenario? Were you trained to deal with this situation? Were there any rules that applied clearly here? Did any other sources of knowledge suggest what to do?
  Goals: What were you trying to achieve? Were there multiple goals at the same time? Was there time pressure or other limitations on what you could do?
  Taking action: How did you judge you could influence the course of events? Did you discuss or mentally imagine a number of options or did you know straight away what to do?
  Outcome: Did the outcome fit your expectation? Did you have to update your assessment of the situation?
John mentioned Uptime Labs, who do staged worlds for software incidents: https://uptimelabs.io/
Facets of Complexity in Situated Work is here: https://www.researchgate.net/publication/345523195_Facets_of_Complexity_in_Situated_Work
On the Jamie Zawinski quote: https://regex.info/blog/2006-09-15/247
If you don’t know the parable of the blind men and the elephant: https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

May 4, 2026 • 45min

Paper Club: Two Years Before the Mast w/special guest eric dobbs

Mitchell Hashimoto’s post on leaving Github: https://mitchellh.com/writing/ghostty-leaving-github
The Reddit post on Github’s availability historically (that we find questionable): https://www.reddit.com/r/github/comments/1rnvhs9/githubs_historic_downtime_scraped_and_plotted/
A reminder, the Messy 9 are: congestion, cascade, conflict, lag, saturation, friction, tempo, surprise, tangles
We have sometimes loved his stuff, but Gergely is annoying us with these posts: https://newsletter.pragmaticengineer.com/p/the-pulse-is-github-still-best-for?r=78c7k&utm_medium=email and https://x.com/GergelyOrosz/status/2048017382036082706
You can find the RISF store with Hindsight Bias merch here: https://www.bonfire.com/store/risf/
You can find a copy of Richard Cook’s Two Years Before the Mast at Lorin’s Blog: https://surfingcomplexity.blog/wp-content/uploads/2026/03/twoyearsbeforethemast.pdf
A reminder, Richard Cook’s How Complex Systems Fail can be found at http://how.complexsystems.fail
Some writing on the 1996 Annenberg conference: https://www.researchgate.net/publication/351953417_Coming_Together_The
Folk models paper (not by Woods, by Dekker and Hollnagel), which is specifically targeting Situational Awareness as being a folk model: https://link.springer.com/article/10.1007/s10111-003-0136-9
Some stuff about SNAFU Catchers: https://www.snafucatchers.com/ and https://snafucatchers.github.io/
Eric referenced our conversation with Beth Long about Building and Revising Adaptive Capacity, which she co-wrote with Richard Cook about New Relic’s real-life example of resilience engineering: https://youtu.be/A_rU4-M61Hk and https://www.sciencedirect.com/science/article/abs/pii/S0003687020301903?via%3Dihub for the paper
Erik Hollnagel’s RAG gets referenced: https://erikhollnagel.com/onewebmedia/RAG%20Outline%20V2.pdf
Once again, we link you to Lorin’s Law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
Eric is referencing Lund, that is their Human Factors and Systems Safety program: https://www.humanfactors.lth.se/
Check out Crisis Engineering! https://crisisengineering.layeraleph.com/crisis-engineering-the-book/
The upcoming RISF event on Practice of Practice Gamelan: https://resilienceinsoftware.org/events/245030

Apr 14, 2026 • 55min

SRECon Americas 2026 recap

They recap standout conference talks on disaster recovery, resilience engineering methods, and alternatives to root cause analysis. They touch on AI hype, measurable productivity claims, and agentic tools for incident investigation. Stories about documentation, teaching teams to learn, and hands-on incident analysis exercises make appearances. Breathing-based recovery techniques and hiring practices for SRE roles are also discussed.

Mar 12, 2026 • 59min

The 2025 DORA Report w/special guest Fred Hebert

Fred Hebert, a Staff SRE and Lund student known for work on SLOs, error budgets, and the Law of Stretched Systems, discusses the 2025 DORA Report. He unpacks why the report reframes around AI-assisted development. They explore AI adoption models, survey limits, platform vs AI impacts, cognitive load and burnout, and how new capacity can be reabsorbed by organizational demands.

Feb 26, 2026 • 1h 9min

Building and Revising Adaptive Capacity Sharing for Technical Incident Response with Beth Adele Long

Beth Adele Long, Principal at Adaptive Capacity Labs and resilience practitioner, shares field-tested practices from New Relic. She describes the NERF rotation, incident command vs support roles, and how calm coordinators reduce org-wide disruption. Conversation covers lowering friction to ask for help, making operational work a career path, and using management and tools to sustain adaptive capacity.

Feb 12, 2026 • 42min

Outsourcing and Resilience

They debate outsourcing software and the risks of partial handoffs versus full ownership. They explore how trust, in-person time, and clear agency shape reliable operations. They riff on outsourcing everyday tasks, construction trade adaptations, and cultural practices like servant leadership and joint retrospectives.

Feb 1, 2026 • 1h 43min

The Messy 9 and Coding with AI - A Panel Discussion

David Woods, resilience engineering founder and Professor Emeritus, brings foundational perspectives on the Messy 9 and socio-technical risks in AI systems. Shiri Cabral, enterprise architecture leader with experience at MongoDB and Salesforce, explains using AI for diagnostics and knowledge retrieval. They discuss AI in coding workflows, de-skilling risks, automation pitfalls, observability with AI, and designing collaborative human–AI systems.

Jan 17, 2026 • 1h 2min

Going Solid

If you’re feeling like you need to do more to respond to our moment:
Lots of places to donate to in the twin cities are listed here: https://mspmag.com/arts-and-culture/general-interest/ice-minnesota-support-immigrant-communities-fundraisers-food-drives-trainings/
You can always find mutual aid networks in your own area, including immigrant aid networks
https://immigrantdefensenetwork.org/ does good work, too
The Hometown Holler podcast with Tressie McMillan Cottom was a wonderful discussion: https://www.youtube.com/watch?v=2gr4mW8aR-g
The Ruth Wilson Gilmore interview that I quoted clumsily is here: https://www.nytimes.com/2019/04/17/magazine/prison-abolition-ruth-wilson-gilmore.html
The paper itself: https://qualitysafety.bmj.com/content/14/2/130.short
If you haven’t seen The Pitt, you should, it’s super good: https://en.wikipedia.org/wiki/The_Pitt
Charles Perrow’s Normal Accidents has more definitions/examples of coupling: https://bookshop.org/p/books/normal-accidents-living-with-high-risk-technologies-updated-edition-professor-charles-perrow/cad38a43fcffa1f8?ean=9780691004129&next=t
Some stuff on microservices and coupling here: https://microservices.io/post/architecture/2023/03/28/microservice-architecture-essentials-loose-coupling.html
Colette’s #notanad endorsement for paper organizing is https://paperpile.com/
Rasmussen’s boundary model comes initially from his paper here: https://www.sciencedirect.com/science/article/abs/pii/S0925753597000520
And if you want a good writeup explaining Rasmussen’s boundary model, you can always read Lorin’s blog: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Dr Cook’s talk at Velocity is a classic, and goes over Rasmussen’s boundary model really well: https://www.youtube.com/watch?v=PGLYEDpNu60
Fred does a great job writing about the Law of Stretched Systems and how it applies to his own work on his blog: https://ferd.ca/the-law-of-stretched-cognitive-systems.html
“Plans are nothing, but planning is everything” is a paraphrase of Eisenhower: https://www.presidency.ucsb.edu/documents/remarks-the-national-defense-executive-reserve-conference
Want to chat about this paper with other folks? Come to the RISF live event for a Paper Party! https://resilienceinsoftware.org/events/157553

Dec 31, 2025 • 53min

The Year in Resilience w/special guest John Allspaw

Seriously though, can’t wait to gtfo of this year.
Palisades fire links:
https://www.nbclosangeles.com/investigations/anonymous-letter-demands-independent-palisades-fire-investigations/3800442/
https://internationalfireandsafetyjournal.com/palisades-fire-report/
https://www.latimes.com/california/story/2025-12-20/lafd-report-on-palisades-fire-was-watered-down-in-editing-process-records-show
Corey Quinn’s commentary on the AWS outage in October is here: https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/
Time to reset the clock on how many episodes it’s been since we’ve mentioned the Ironies of Automation: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Also on Rasmussen’s Boundary Model, which Lorin does a great write up on: https://surfingcomplexity.blog/2021/05/31/transgressing-the-boundaries-rasmussen-and-woods/
Lorin’s Law is our favorite law: https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/
You can ask us questions or write to us using our form linked from our website: thisisfinepod.com
Resilience in Software Foundation is at resilienceinsoftware.org

Nov 28, 2025 • 43min

Incident Status: On Hold w/special guest Will Gallego

Mentioned multiple times, Em Ruppe’s amazing talk on incident severity: https://www.usenix.org/conference/srecon24americas/presentation/ruppe
We talk about the RIS Slack sometimes - you can join us in the slack, by joining the Foundation here: https://resilienceinsoftware.org/
Please ask us a question at thisisfinepod.com
