

Slight Reliability
Stephen Townshend
Learning SRE, one day at a time.
Episodes

Oct 3, 2023 • 42min
Slight Reliability Episode 70 - Meta SRE with Amin Astaneh
Amin Astaneh (from Certo Modo) is back to discuss his experience working as a production engineer (SRE equivalent) at Meta. Stephen and Amin discuss what it's like interviewing for big tech, "you build it, you own it", different SRE engagement models, SRE at different sizes of organisation, socialising your SRE success as a way to get traction, and so much more.
You can find Amin on his company website https://certomodo.io, LinkedIn: https://www.linkedin.com/in/aminastaneh/ and Twitter: https://twitter.com/aastaneh
The books Amin mentions are:
The Practice of Cloud System Administration: https://www.oreilly.com/library/view/practice-of-cloud/9780133478549/
Leading Change: https://www.kotterinc.com/bookshelf/leading-change/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
Instagram: https://www.instagram.com/slight_reliability/

Sep 26, 2023 • 30min
Slight Reliability Episode 69 - Developer to SRE with Praveen Kasam
This week Stephen talks to Praveen Kasam from Diconium Digital Solutions about how he led SRE transformations. Praveen shares his experience transitioning from development to SRE and how leveraging automation and bringing application knowledge to the ops team provided quick wins. He also covers how he later applied SRE concepts to uplift the wider organisation. If you are out there looking for advice on how to implement SRE in your organisation, this is the episode for you.
You can find Praveen at:
LinkedIn: https://www.linkedin.com/in/kasampraveen/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Sep 19, 2023 • 33min
Slight Reliability Episode 68 - Dashboards and Modern Observability with Eric Schabell
This week Stephen asks Eric Schabell (Director of Technical Marketing & Evangelism @ Chronosphere) about how dashboards fit into modern observability. They discuss how untamed observability can lead to unexpectedly high cloud bills, the similarities between dashboards and documentation, the "know > triage > understand" workflow, and much more.
You can find Eric at:
LinkedIn: https://www.linkedin.com/in/ericschabell/
X: https://twitter.com/ericschabell
And you can find Chronosphere at: https://www.linkedin.com/company/chronosphereio/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Sep 12, 2023 • 35min
Slight Reliability Episode 67 - Single Pane of Glass with Jamie Allen and Adam Kinniburgh
This week Stephen chats with Jamie Allen (Chief Technologist AWS & SRE @ EPAM Systems) and Adam Kinniburgh (VP Innovation @ SquaredUp) about the concept of a single pane of glass (SPOG) for SRE. Is it performance art or something actionable? Can alerting replace the need for dashboards? And are metrics drowning in the wake of distributed tracing?
You can find Jamie at:
LinkedIn: https://www.linkedin.com/in/jlallen/
And the Single Pain of Glass article he wrote here: https://medium.com/site-reliability-engineering-leadership/the-single-pain-of-glass-6e42930e966
You can find EPAM at https://www.epam.com/
And you can find the Google Dapper paper here: https://static.googleusercontent.com/media/research.google.com/en//archive/papers/dapper-2010-1.pdf
You can find Adam at:
LinkedIn: https://www.linkedin.com/in/adamkinniburgh/
X: https://twitter.com/adamkinniburgh
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Sep 5, 2023 • 30min
Slight Reliability Episode 66 - Building Digital Assistants for SRE with Kyle Forster
This week Stephen brings back Kyle Forster from RunWhen to talk about the purple elephant in the room… “AI”. What makes it GenAI, LLM, Advanced Statistics, or ML? Kyle shares his experience building AI-powered search engines for SRE troubleshooting commands and how to incorporate a (paid) open source community of experts rather than trusting AI by itself. They discuss what search looks like under the hood, why GenAI-powered chatbots will or won't take over the SaaS industry, how Digital Assistants can be utilised by SREs to increase productivity (hint: giving them to app developers!), how to make informed decisions when purchasing AI products, and *much* more.
You can find Kyle at:
LinkedIn: https://www.linkedin.com/in/kyforster/recent-activity/all/
And you can find out more about RunWhen at:
Website: https://www.runwhen.com/
Product videos: https://www.youtube.com/@whatdoirunwhen
RunWhen Local: https://github.com/runwhen-contrib/runwhen-local (RunWhen Local is an open source troubleshooting cheat sheet that suggests commands from the RunWhen community for all of the namespaces in your cluster, ready to copy & paste)
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Aug 29, 2023 • 41min
Slight Reliability Episode 65 - The Truth About Incidents with Courtney Nash
This week Stephen chats with the internet incident librarian herself, Courtney Nash. They explore what Courtney has learned through meta-analysis of the more than ten thousand incidents in the Verica Open Incident Database (VOID). They cover why MTTR needs to go in the garbage, joint cognitive systems, the value of looking at near misses and *much* more.
You can check out the VOID here: https://www.thevoid.community/
The two papers mentioned are:
Ironies of Automation by Lisanne Bainbridge: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Managing the Hidden Costs of Coordination by Laura Maguire: https://queue.acm.org/detail.cfm?id=3380779
You can find Courtney at:
LinkedIn: https://www.linkedin.com/in/nashcourtney/
Twitter: https://twitter.com/courtneynash
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Aug 22, 2023 • 36min
Slight Reliability Episode 64 - Observability During Development with Martin Thwaites
This week Stephen chats with Martin Thwaites from Honeycomb about how developers can leverage observability to understand what they're building better, solve bugs quicker, and have more time for coding. They also discuss OpenTelemetry (the protocol and semantic conventions), manual versus automatic instrumentation, and how keeping every span of trace data is irresponsible.
You can find Martin at:
LinkedIn: https://www.linkedin.com/in/martin-thwaites-ab445120/
X: https://twitter.com/MartinDotNet
And Honeycomb at https://www.honeycomb.io/
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
X: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Aug 15, 2023 • 9min
Slight Reliability Episode 63 - The Power of Summary
Observability is a necessary adaptation to make sense of software systems in the Digital Age, but how can we unlock its power for non-engineer stakeholders (such as executives, product owners, etc)? Perhaps we need a layer of abstraction sitting on top of our detailed observability to get the most out of it.
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Aug 1, 2023 • 37min
Slight Reliability Episode 62 - On-Call with Matt Brown
This week Stephen chats with former Google SRE Matt Brown about being on-call. They cover how to uplift junior engineers so they can be on-call, what a fair on-call schedule looks like, runbooks, and much more.
As you heard, Matt believes flexibility is key to a healthy on-call rotation. Matt is exploring ideas for improvements to existing tooling and products in this space and would love to hear from as many listeners as possible about what they find useful or frustrating in the tools they use to support on-call in their teams. You can reach him at oncall-feedback@mkmba.nz or schedule a chat via https://zcal.co/mattb/oncall. Please don't be shy!
You can also find Matt at:
Website: https://www.mattb.nz/
LinkedIn: https://www.linkedin.com/in/mattbrown/
Mastodon: https://mastodon.nz/@mattb
Twitter: https://twitter.com/xleem
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre

Jul 25, 2023 • 6min
Slight Reliability Episode 61 - SRE VS DevOps VS Platform Eng... (Yawn)
The internet is full of people who want to tell you about SRE, DevOps, and Platform Engineering and how different and similar they are... and will give you the impression that these things compete with each other. But do they? And is it a helpful question to ask in the first place?
You can find the official Slight Reliability podcast website at: https://slightreliability.com/
You can find Stephen at:
LinkedIn: https://www.linkedin.com/in/stephentownshend/
Twitter: https://twitter.com/the_kiwi_sre
YouTube: https://www.youtube.com/c/SlightReliability
Instagram: https://www.instagram.com/slight_reliability/
TikTok: https://www.tiktok.com/@the_kiwi_sre


