Kaizen! Let it crash (Changelog & Friends #124)

14 snips

Jan 17, 2026

Gerhard Lazu, a reliability expert and Kaizen contributor, returns to discuss the nuances of the 'let it crash' philosophy and how it improves system resilience. He digs into troublesome out-of-memory crashes and the bandwidth spikes that strain Varnish. A humorous investigation into a wildly popular podcast episode reveals a surprising pattern of downloads from Asia, raising the question of whether scraping or genuine user behavior is behind it. Gerhard also shows the tools he uses to monitor system health, which make troubleshooting more efficient.

INSIGHT

Hot Regions Drive Sizing Decisions

A single hot region can dominate load, so it needs larger instances and tuned limits.
Deploy the same instance size in every region, then scale down in the quiet ones, to make Fly.io's constraints simpler to work with.

ANECDOTE

AI Reviewed Varnish Stats For Different Audiences

Gerhard ran Varnish stats through multiple LLMs to get tailored recommendations for sysadmins, devs, and business folks.
The AI summaries highlighted a healthy cache hit rate but flagged disk allocation fragmentation and disk cache failures.

ADVICE

Run Hourly Multi-Region Health Checks

Run hourly, region-specific checks with hurl to detect hangs and timeouts early across all Fly regions.
Have the diagnostics fetch MP3s and feeds from each region so intermittent failures are caught proactively.
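
The advice above can be sketched in a few lines. The episode uses hurl for these checks; the Python below is only a stand-in to show the shape of the loop, not the setup discussed on the show. The region codes, the URLs, and the use of the `Fly-Prefer-Region` request header to steer a request toward a region are all assumptions made for illustration.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CheckResult:
    region: str
    url: str
    ok: bool
    detail: str
    seconds: float


def http_fetch(region: str, url: str, timeout: float) -> int:
    """Fetch a URL and return its HTTP status.

    Assumption: the Fly-Prefer-Region header steers the request to the
    given region. Verify this against your own deployment.
    """
    req = urllib.request.Request(url, headers={"Fly-Prefer-Region": region})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read(1024)  # read a little of the body so a hung stream surfaces
        return resp.status


def run_checks(
    regions: Iterable[str],
    urls: Iterable[str],
    fetch: Callable[[str, str, float], int] = http_fetch,
    timeout: float = 10.0,
) -> List[CheckResult]:
    """Check every URL from every region and record the outcome."""
    urls = list(urls)
    results = []
    for region in regions:
        for url in urls:
            start = time.monotonic()
            try:
                status = fetch(region, url, timeout)
                ok = 200 <= status < 300
                detail = f"HTTP {status}"
            except TimeoutError:
                ok, detail = False, "timeout"
            except Exception as exc:  # HTTP errors, DNS failures, resets, ...
                ok, detail = False, f"error: {exc}"
            elapsed = time.monotonic() - start
            results.append(CheckResult(region, url, ok, detail, elapsed))
    return results
```

Calling `run_checks(["iad", "sin"], [feed_url, mp3_url])` from an hourly scheduler (cron, a CI schedule, or similar) and alerting on any result where `ok` is false would cover the case described here. The `fetch` parameter is injectable so the loop can be exercised without network access.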