

Last Week In AWS Podcast
Corey Quinn
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.
Episodes

May 1, 2020 • 13min
Whiteboard Confessional: Hacking Email Newsletter Analytics & Breaking Links
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
Last Week in AWS
The DynamoDB Book
Twitter

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

On Monday, I sent out a newsletter issue to over 18,000 people where the links didn't work for the first hour and a half. Then they magically started working. Today on the AWS Morning Brief: Whiteboard Confessional, I'm not talking about a particular design pattern, but rather conducting a bit of a post-mortem of what exactly broke and why it suddenly started working again an hour and a half later.

To send out the Last Week in AWS newsletter, I use a third-party service called ConvertKit that, in turn, wraps itself around SendGrid for actual email delivery. They handle an awful lot of the annoying, difficult parts of newsletter management. As a quick example: unsubscribes. If you unsubscribe from my newsletter, which you should never do, I won't email you again. That's because they handle the subscription and unsubscription process. Now, as another example, when you sign up for the newsletter, you get an email series that tailors itself to a “choose your own platypus” adventure based upon what you select. True story. Their logic engine powers that, too. ConvertKit is awesome for these things, but they do some things that are also kind of crappy. For example, they do a lot of link tracking that is valuable, but it's the creepy kind of link tracking that I don't care about and really don't want. Also, unfortunately, their API isn't really an API so much as it is an attempt at an API that an intern built, because they thought it was something you might enjoy. I can't create issues via the API. I have to generate the HTML and then copy and paste it in like a farm animal.
And their statistics and metrics APIs won't tell me the kinds of things I actually care about, but their website will. So they have the data; it just requires an awful lot of clicking and poking. And when I say things I don't care about, let me be specific. Do you know what I don't care about? Whether you personally, dear listener, click on a particular link. I do not care; I don't want to know. That's creepy, it's invasive, and it isn't relevant to you or me in any particular way. But I do care what all of you click on in aggregate. That informs what I include in the newsletter in the future. For example, I don't care at all about IoT, but you folks sure do. So, I'm including more IoT content as a direct response to what you folks care about. Remember, I also have sponsors in the newsletters, who themselves include links, and who want to know the number of people who have clicked on those things. So, the counting also needs to be unique: I care if a user clicks on a link once, but if they click on it two or three times, I don't want that to increment the counter. So there are a bunch of edge case issues here.

Here are the questions that I need to answer that ConvertKit doesn't let me get at extraordinarily well. First, what were the five most popular links in last week's issue? I also care what the top 10 most popular links over the last three months were; that helps me put together the “Best of” issues I'm going to start shipping out in the near future. I also care what links got no clicks, because people just don't care about them or I didn't do a good job of telling the story. It helps me improve the newsletter. With respect to sponsors, I care how each individual sponsor performs relative to other sponsors. If one sponsor link gets way fewer clicks, that's useful to me: since I write a lot of the sponsor copy myself, did I get something wrong? On the other hand, if a sponsored link gets way more clicks than normal, what was different there? I explicitly fight back against clickbait, so outrage generators, like racial slurs injected into the link text, are not permitted. Therefore, when a sponsored link outperforms what I would normally expect, it means that they're telling a story that resonates with the audience, and that is super valuable data. Now, I'll tell you what I built, and what went wrong. After this.

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

I built a URL redirector to handle all of these problems, plus one more: namely, I want to be able to have an issue that has gone out with a link in it, but I want to be able to repoint that link after I've already hit send. Why do I care about that? Well, if it turns out that a si...
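To make the redirector idea concrete before the story cuts off, here is a minimal sketch in Python of the kind of system described above: short link slugs that can be repointed after an issue ships, aggregate-only click counting, and per-subscriber deduplication so repeat clicks don't inflate sponsor numbers. Everything in it (the Flask app, the /l/<slug> route, the "s" and "i" query parameters, the SQLite schema) is a hypothetical illustration, not the episode's actual implementation.

```python
# A minimal sketch, assuming a Flask app and a SQLite store, of the redirector
# described above. Every name here (the /l/<slug> route, the "s" subscriber
# token, the "i" issue tag, the schema) is hypothetical.
import sqlite3

from flask import Flask, abort, redirect, request

app = Flask(__name__)
DB_PATH = "links.db"


def db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    # slug -> current destination; repointing a link after the issue ships is
    # just an UPDATE against this table, since subscribers only ever hold slugs.
    conn.execute("CREATE TABLE IF NOT EXISTS links (slug TEXT PRIMARY KEY, target TEXT)")
    # One row per (slug, subscriber): the composite key makes repeat clicks
    # no-ops, which keeps counts unique without tracking individuals.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clicks ("
        "slug TEXT, subscriber TEXT, issue TEXT, PRIMARY KEY (slug, subscriber))"
    )
    return conn


@app.route("/l/<slug>")
def follow(slug: str):
    conn = db()
    row = conn.execute("SELECT target FROM links WHERE slug = ?", (slug,)).fetchone()
    if row is None:
        abort(404)
    # "s" would be an opaque per-subscriber token baked into the newsletter's
    # links at send time; reporting only ever happens in aggregate.
    subscriber = request.args.get("s", "anonymous")
    issue = request.args.get("i", "unknown")
    conn.execute("INSERT OR IGNORE INTO clicks VALUES (?, ?, ?)", (slug, subscriber, issue))
    conn.commit()
    return redirect(row[0], code=302)


def top_links(conn: sqlite3.Connection, issue: str, n: int = 5):
    """Answer 'what were the N most popular links in this issue?' in aggregate."""
    return conn.execute(
        "SELECT slug, COUNT(*) AS unique_clicks FROM clicks "
        "WHERE issue = ? GROUP BY slug ORDER BY unique_clicks DESC LIMIT ?",
        (issue, n),
    ).fetchall()


if __name__ == "__main__":
    app.run()
```

The key design property is that the email only ever contains slugs: the destination lives server-side, so repointing a published link never requires re-sending anything.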

Apr 27, 2020 • 13min
Cape Town Region Is Expensive AF
AWS Morning Brief for the week of April 27, 2020.

Apr 24, 2020 • 12min
Whiteboard Confessional: Don’t Run a Database on Top of NFS
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
Amazon Elastic File System
Network File System
AWS Fargate

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

Corey: On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

I’ve talked a lot about databases on this show. There are a bunch of reasons for that, but they mostly all distill down to the fact that databases are, and please don’t quote me on this as I’m not a DBA, where the data lives. If I blow up a web server, it can have hilarious consequences for a few minutes, but it’s extremely unlikely to have the potential to do too much damage to the business. That’s the nature of stateless things: they’re easily replaced, and it’s why the infrastructure world has focused so much on the recurring mantra of cattle, not pets.

But I digress. This episode is not about mantras. It’s about databases. Today’s episode of the AWS Morning Brief: Whiteboard Confessional returns to the database world with a story that’s now safely far enough in the past that I can talk about it without risking a lawsuit. We were running a fairly standard three-tiered web app. For those who haven’t had the pleasure because their brains are being eaten by the microservices worms, these three tiers are web servers, application servers, and database servers. It’s a model that my father used to deploy, and his father before him.

But I digress. This story isn’t about my family tree. It’s about databases. We were trying to scale, which is itself a challenge, and scale is very much its own world. It’s the cause of an awful lot of truly terrifying things. You can build an application that does a lot for you on your own laptop. But now try scaling that application to 200 million people.
Every single point of your application architecture becomes a bottleneck long before you’ll get anywhere near that scale, and you’re gonna have oodles of fun re-architecting it as you go. Twitter very publicly went through something remarkably similar about a decade or so ago; the fail whale was their error page when Twitter had issues, and everyone was very well acquainted with it. It spawned early memes and whatnot. Today, they’ve solved those problems almost entirely.

But I digress. This episode isn’t about scale, and it’s not about Twitter. It’s about databases. So my boss walks in as we’re trying to figure out how to scale a MySQL server for one reason or another, and casually suggests that we run the database on top of NFS.

[Record Scratch]

Yes, I said NFS. That’s Network File System. Or, if you’ve never had the pleasure, the protocol that underlies AWS’s EFS offering, or Elastic File System. Fun trivia story there: I got myself into trouble, back when EFS first launched, with Wayne Duso, AWS’s GM of EFS, among other things, by saying that EFS was awful. At launch, EFS did have some rough edges, but in the intervening time, they’ve been fixed to the point where my only remaining significant gripe about EFS is that it’s NFS. Because today, I mostly view NFS as something to be avoided for greenfield designs, but you’ve got to be able to support it for legacy things that are expecting it to be there. There is, by the way, a notable EFS exception for Fargate: using NFS with Fargate for persistent storage.

But I digress. This episode isn’t about Fargate. It’s about databases.

Corey: In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

So I’m standing there, jaw agape at my boss. This wasn’t one of those many mediocre managers I’ve had in the past that I’ve referenced here. He was, and remains, the best boss I’ve ever had. Empathy and great people management skills aside, he was also technically brilliant. He didn’t suggest patently ridiculous things all that often, so it was sad to watch his cognitive abilities declining before our eyes. “Now, hang on,” he said, “before you think that I’ve completely lost it. We did something exactly like this before at my old job. It can be done safely, sanely, and offer great performance benefits.” So, I’m going to skip what happens next in this story, because I was very early in my career. I hadn’t yet figured out that it’s better to not actively insult your boss in a team meeting based only upon a half-baked understanding of what they’ve just proposed. To his credit, he took it in stride, and then explained how to pull off something that sounds on its face to be truly monstrous.

Now, I’ve doubtless forgotten most of the technical nuance here, preferring ...
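Since the episode explicitly name-checks the one NFS exception it endorses (EFS for Fargate persistent storage), here is a hedged boto3 sketch of what that looks like: registering a Fargate task definition that mounts an EFS file system. The file system ID, names, and image are placeholders, and running this for real also requires Fargate platform version 1.4.0 or later plus the usual IAM and networking setup.

```python
# A hedged sketch, using boto3, of the Fargate exception mentioned above:
# a Fargate task definition that mounts an EFS (NFS) file system for
# persistent storage. IDs and names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="app-with-persistent-storage",  # placeholder name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    volumes=[
        {
            "name": "shared-data",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-0123456789abcdef0",  # placeholder
                "rootDirectory": "/",
                "transitEncryption": "ENABLED",
            },
        }
    ],
    containerDefinitions=[
        {
            "name": "app",
            "image": "public.ecr.aws/nginx/nginx:latest",  # placeholder
            "essential": True,
            # The container just sees an ordinary POSIX path; NFS hides underneath.
            "mountPoints": [
                {"sourceVolume": "shared-data", "containerPath": "/mnt/data"}
            ],
        }
    ],
)
```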

Apr 20, 2020 • 7min
AWS Billing System Go BRRRRRR
AWS Morning Brief for the week of April 20, 2020.

Apr 17, 2020 • 10min
Whiteboard Confessional: The 15-Person Startup with 700 Microservices: A Cautionary Tale
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH.io
@QuinnyPig
Hacker News
Amazon Route 53

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

Today, I want to rant about microservices. What are microservices, you may very well ask? The broken answer to a misunderstood question. Let’s talk about what gave rise to microservices: specifically, monoliths, which are generally what predated microservices as a design pattern. And by monoliths, I mean one giant codebase, the way that grandpa used to build things. Back then, Git wasn’t a thing; Subversion and Perforce ruled the day, and everyone wore a pair of fighting trousers to work in the morning. The problem with monoliths was that it’s challenging in the extreme, culturally, to have a whole host of developers working on the same codebase. One person’s change can inadvertently break the build for the other 5000 engineers all working on that same codebase. And with the various version control systems that were heavily in use before Git became usable by mere mortals, there weren’t a lot of workflows that made it easy to have multiple people collaborate on the same system. So microservices, for that and a few other reasons, became suggested as a way of solving for this problem. Breaking apart those ancient monoliths into functional microservices, where each item does one thing, began to make a lot of sense. And it solves for a political problem super neatly: you want each team responsible for a given microservice. And that promise is compelling.
Because in theory, if you think about this, if you build a microservice and you publish what data that service takes in, in what format it needs to be, and what it will return in response to having that data sent to it, then what your microservice does to achieve its goal and how it works doesn’t matter at all to anyone else. You can replace the database, you can move to serverless, or containers, or punch cards. It doesn’t really matter how it does the work, just so long as the work gets done in the way that is published. And whatever decisions you make only ever impact your own team as far as how those things get done. So the sky’s the limit; you don’t have to really focus on collaborating with folks the way that you once had to. As long as that API remains stable, the sky remains the limit. So suddenly, your 5000 developers are now able to not be tightly coupled to one another and can do all kinds of things and move way faster. It’s a great model. But it’s also a problem. Why?

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

The problem is that this works super well for places with thousands of engineers all working on one product. But then Hacker News got a hold of it. And because it’s Hacker News, it took exactly the wrong lessons from this approach, and instead started embracing this pattern where it was patently ridiculous rather than helpful. Now, as a result of this, you see startups out there with 700 microservices, but somehow only 15 employees. Where’s the sense in that? You get massive sprawl, an awful lot of duplicate work, no real internal standards in many cases, and a misguided belief that everything should look architecturally like Google’s internal systems, despite the fact that your entire application could run on a single computer from 2012 without breaking a sweat. Even Google would argue that its internal systems should not look like Google’s internal systems, but technical debt is a thing, and the choices we make constrain what we’re able to do. The problem here is that microservices are fundamentally tied to solving a political problem by technical means. It’s about how to make people work more effectively. And this somehow instead became a technical best practice. It’s not. Not for everyone. It introduces complexity. It leads to scenarios where no one, absolutely no one, has all of the various moving parts in their head. It’s prime material for here on the Whiteboard Confessional, because when you truly embrace microservices, your whiteboards are always full of things that nobody understands. It ties into the larger problem of building things in service to your own resume, or to your engineering department, or to the practice of engineering as some abstract ideal, rather than in service to the business needs that your company exists entirely to cater to.
Plus, when you have oodles of microservices hanging around everywhere, first you have a library problem of which service does what. No one knows. And if you ever track it down, great: each one of those services also needs to be monitored....
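The "published contract" argument a few paragraphs up is the one genuinely technical claim in this rant, so a tiny sketch may help: the only thing other teams depend on is the request and response shape, and everything behind it can change freely. All names below are invented for illustration; the episode itself contains no code.

```python
# An illustrative sketch of the published-contract idea above. Other teams
# depend only on the request/response shape; the lookup behind fulfil() can
# move from a database to serverless to punch cards without anyone outside
# the team noticing. All names are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceRequest:
    sku: str
    quantity: int


@dataclass(frozen=True)
class PriceResponse:
    sku: str
    total_cents: int
    currency: str = "USD"


def fulfil(req: PriceRequest) -> PriceResponse:
    """The published behavior; how the price is found is a private detail."""
    unit_cents = _lookup_unit_price(req.sku)  # swap this implementation freely
    return PriceResponse(sku=req.sku, total_cents=unit_cents * req.quantity)


def _lookup_unit_price(sku: str) -> int:
    # Today an in-memory table; tomorrow DynamoDB, Redis, or punch cards.
    return {"duck-001": 1999}.get(sku, 0)


print(fulfil(PriceRequest(sku="duck-001", quantity=3)))
```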

Apr 13, 2020 • 12min
Goldilocks and the Three Elastic Beanstalk Consoles
AWS Morning Brief for the week of April 13, 2020.

Apr 10, 2020 • 13min
Whiteboard Confessional: The Rise and Fall of the T-Shaped Engineer
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
Twitter: @QuinnyPig
Red Hat Enterprise Linux
FreeBSD
SaltStack
Puppet
ClusterSSH

Transcript
Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

For a long time now, I’ve been a believer in the idea of the T-shaped engineer. And what I mean by that is that you should be broad across a wide variety of technologies, but deep in one or two very specific areas. So, it looks a bit like a T, or an inverted T, depending upon how you wind up visualizing that. I’m describing this with words; I don’t have a whiteboard in front of me. Use your imagination, you’ll be okay. The point being that whenever you’re working in a new environment, or on a new problem, having a broad base of technologies of which you’re aware is incredibly useful to fall back upon. Now, the reason to be super deep in one or two areas is that specialization is generally what lets people charge more for various services. People want to hire domain-specific expertise for an awful lot of problems that they want to get solved. So, having something that you can bring into job interviews and more or less mop the floor with people asking questions around that domain is an incredibly valuable thing to have.

But that has some other consequences, too. And that’s what today’s episode of The Whiteboard Confessional is talking about. Back in my first Unix admin job, I busily began upgrading a whole lot of the infrastructure and ripping out very early Red Hat Enterprise Linux and CentOS version 4 systems and replacing them with the one true operating system, which, of course, is FreeBSD.
And I had a litany of explanations as to why it was the best option, what it could do for various problems, and why there was just absolutely no comparison between FreeBSD and anything else. I could justify it super easily, and the real defense mechanism here was that people get really, really, really tired of talking to zealots, so no one kept questioning me. They just basically said, “Fine, whatever,” and got out of the way.

Years later, I decided to focus on something that wasn’t an esoteric operating system to go super deep in, and that’s right, I picked SaltStack, which is configuration management done right, tied to remote execution. I’d worked with Puppet, I’d tolerated CFEngine, but I had a bunch of angry, loud opinions about it, and SaltStack was absolutely the way and the light. So, at the company I was working at at that time, I rolled it out everywhere, and our entire provisioning and configuration management process was run through SaltStack. And I could come up with a whole litany of reasons why this was the right answer, and why no one else was going to be able to come close to the ideal correctness that SaltStack provided. And people eventually stopped arguing with me, because they had better things to do than argue with a zealot about which configuration management system was the right one to go with.

I’ve also talked on previous episodes of the show about using ClusterSSH. And this was before I discovered the religion that was configuration management. It was the right answer because it saved me from having to run a for loop with shell scripting, which was suboptimal for a wide variety of reasons, and I would explain to everyone exactly why it was suboptimal. So, again, they shrugged, got out of the way, and let me use ClusterSSH.

And a similar pattern happened when I was working with large-scale storage. NetApp was the right answer for all of our enterprise storage needs because, let’s face it, it wasn’t my money. And when it comes to NFS, even today, they are head and shoulders above anything else in the space.

And then eventually, it turned to AWS. And for a while, I want to say around 2014, 2015, I would tell you why AWS was the right answer for every problem. What challenge are you trying to work with? Well, AWS has an answer for that, because of course they do. Their product strategy is, “yes”. Now, what do all of those independent stories have in common? Great question. Let’s talk about that. But first…

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

The problem is that everything I just mentioned was a pet technology, or a pet company: something that I had taken the time to get deep in and learn. And therefore, it became my de facto answer for a...

Apr 6, 2020 • 11min
Amazon Detective and the Case of the Giant AWS Bill
AWS Morning Brief for the week of April 6, 2020.

Apr 3, 2020 • 12min
Whiteboard Confessional: My Metaphor-Spewing Poet Boss & Why I Don’t Like Amazon ElastiCache for Redis
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
Redis
Amazon ElastiCache
Twitter: @QuinnyPig

Transcript
Corey Quinn: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory”. Because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data: not replicating it, not storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

When you walk through an airport—assuming that people still go to airports in the state of pandemic in which we live—you’ll see billboards saying, “I love my slow database, says no one ever.” This is an ad for Redis, and the unspoken implication is that everyone loves Redis. I do not. In honor of the recent release of Global Datastore for Amazon ElastiCache for Redis, today I’d like to talk about that time ElastiCache for Redis helped cause an outage that led to drama. This was a few years back, and I worked at a B2B company—B2B, of course, meaning business-to-business; we were not dealing direct-to-consumer. I was a different person then, and it was a different time. Specifically, the time was late one Sunday evening, and my phone rang. This was atypical, because most people didn’t have that phone number. At this stage of my life, my default answer when my phone rang was, “Sorry, you have the wrong number.” If I wanted phone calls, I’d have taken out a personals ad. Even worse, when I answered the call, it was work. Because I ran the ops team, I was pretty judicious in turning off alerts for anything that wasn’t actively harming folks. If it wasn’t immediately actionable and causing trouble, then there was almost certainly an opportunity to fix it later during business hours. So, the list of things that could wake me up was pretty small. As a result, this was the first time that I had been called out of hours during my tenure at this company, despite having spent over six months there at this point. So who could possibly be on the phone but my spineless coward of a boss?
He was a man who spoke only in metaphor; we certainly weren’t social friends, because who can be friends with a person like that?

“What can I do for you?”

“As the roses turn their faces to the sun, so my attention turned to a call from our CEO. There’s an incident.”

My response was along the lines of, “I’m not sure what’s wrong with you, but I’m sure it’s got a long name and is incredibly expensive to fix.” Then I hung up on him and dialed into the conference bridge. It seemed that a customer had attempted to log into our website recently and had gotten an error page, and this was causing some consternation. Now, if you’re used to a B2C, or business-to-consumer, environment, that sounds a bit nutty, because you’ll potentially have millions of customers; if one person hits an error page, that’s not a CEO level of engagement. One person getting that error is, sure, still not great, but it’s not the end of the world. I mean, Netflix doesn’t have an all-hands-on-deck disaster meeting when one person has to restart a stream. In our case, though, we didn’t have millions of customers. We had about five, and they were all very large businesses. So, when they said jump, we were already mid-air. I’m going to skip past the rest of that phone call and the evening, because it’s much more instructive to talk about this with the clarity lent by the sober light of day the following morning, and the post-mortem meeting that resulted from it. So, let’s talk about that. After this message from our sponsor.

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

So, in hindsight, what happened makes sense, but at the time, when you’re going through an incident, everything’s cloudy, you’re getting conflicting information, and it’s challenging to figure out exactly what the heck happened. As it turns out, there were several contributing factors, specifically four of them. And here’s the gist of what those four were. Number one, we used Amazon ElastiCache for Redis. Really, we were kind of asking for trouble. Two, as tends to happen with managed services like this, there was a maintenance event that Amazon emailed us about. Given that we weren’t completely irresponsible, we braved the deluge of marketing to that email address, and I’d caught this and scheduled it in the maintenance calendar. In fact, we specifically were allowed to schedule when that maintenance took place. So, we scheduled it for a weekend. In hindsight: mistake. When you’re having maintenances like this happen, you want to make sure that they take place when there are people around to keep an eye on things. Three, the maintenance was supposed to be invisible. The way that Amazon ElastiCache for Redis works is you have clusters: you have a primary, and you have a replica. The way that they do maintenances is they wind up updating the rep...
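The failure mode being described, a replica updated and then promoted while clients still hold connections to the old primary, lends itself to a concrete client-side sketch. Below is a hedged Python example using the redis-py client: reconnecting through the cluster's DNS endpoint is what picks up a freshly promoted primary after a failover. The endpoint and key names are placeholders, and this is illustrative rather than the code from the incident.

```python
# A hedged, illustrative sketch (not the code from the incident) of handling
# an ElastiCache for Redis failover: connections to the demoted primary fail,
# and a fresh connection through the cluster endpoint reaches the new primary.
import time

import redis  # redis-py

PRIMARY_ENDPOINT = "my-cluster.xxxxxx.ng.0001.use1.cache.amazonaws.com"  # placeholder


def get_with_retry(key: str, attempts: int = 5) -> bytes | None:
    for attempt in range(attempts):
        try:
            # A fresh connection re-resolves the endpoint, which is what picks
            # up the newly promoted primary after a failover.
            client = redis.Redis(host=PRIMARY_ENDPOINT, port=6379, socket_timeout=2)
            return client.get(key)
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            # Back off and retry; failover typically completes within a minute.
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError(f"Redis unreachable after {attempts} attempts")
```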

Mar 30, 2020 • 12min
The "AWS For God's Sake Leave Me Alone" Service
AWS Morning Brief for the week of March 30, 2020.


