

Last Week In AWS Podcast
Corey Quinn
The latest in AWS news, sprinkled with snark. Posts about AWS come out over sixty times a day. We filter through it all to find the hidden gems, the community contributions--the stuff worth hearing about! Then we summarize it with snark and share it with you--minus the nonsense.
Episodes

Jun 5, 2020 • 12min
Whiteboard Confessional: The Time I Almost Built My Own Email Marketing Service
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
http://nops.io/snark
http://snark.cloud/n2ws
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

nOps will help you reduce AWS costs 15 to 50 percent if you do what it tells you. And some people do. For example, watch their webcast, “How Uber Reduced AWS Costs 15 Percent in 30 Days”; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of cost and correlate that with infrastructure changes. Try it free for 30 days: go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.

Welcome once again to the AWS Morning Brief: Whiteboard Confessional. Today I want to talk about, once again, an aspect of writing my Last Week in AWS newsletter. This goes back to before I was sending it out twice a week instead of just once, and my needs weren't that complex. I would gather a bunch of links throughout the week, I would make fun of them, and I had already built this absolutely ridiculous system that would render all of my sarcasm from the ridiculous database where it lived down into generated HTML. And I've talked about that system previously; I'm sure I will again. That's not really the point of the story. Instead, what I want to talk about is what happened after I had that nicely generated HTML.

Now, I've gone through several iterations of how I sent out my newsletter. The first version was through a service known as Drip, that's D-R-I-P. And they were great because they were aimed at effectively non-technical folks, by and large, where it’s, “Oh, you want to send a newsletter. Go ahead.” I looked at a few different vendors. MailChimp is the one that a lot of folks go with for things like this. At the time I was doing that selection, they were having a serious spam problem. People were able to somehow bypass their MFA. Basically, their reputation was in the toilet, and given my weird position on email spam, namely, I don't like it, I figured this was probably not the best time to build something on top of that platform, so that was out.

Drip was interesting, in that they offered a lot of useful things, and they provided something far more than I needed at the time. They would give me ways to say, “Okay, when someone clicks on this link, I can put them in different groups,” etcetera, etcetera. You know, the typical crappy email tracking thing that squicks people out. Similar to the idea of, “Hey, I noticed you left something in your cart. Do you want to go back and buy it?” Those emails that everyone finds vaguely disquieting? Yeah, that sort of thing. So, 90 percent of what they were doing, I didn't need, but it worked well enough, and I got out the door and used them for a while. Then they got acquired, and it seemed like they got progressively worse month after month, as far as not responding to user needs, doing a hellacious redesign that was retina-searingly bad, being incredibly condescending toward customer complaints, subtweeting my co-founder on a podcast, and other delightful things. So, the time came to leave Drip.

So, what do I do next? Well, my answer was to go to SendGrid. And SendGrid was pretty good at these things. They are terrific at email deliverability—in other words, getting email, from when I hit send on my system, into your inbox, bypassing the spam folder, because presumably you've opted in to receive this and confirmed that you have opted in—so that wasn't going to be a problem. And they still are top-of-class for that problem, but I needed something more than that. I didn't want to maintain my own database of who was on the newsletter or not. I didn't want to have to handle all the moving parts of this. So, fortunately, they wound up coming out with a tool called Marketing Campaigns, which is more or less designed for kind of newsletter-ish style things if you squint at it long enough. And I went down that path, and it was, to be very honest with you, abysmal. It was pretty clear that SendGrid was envisioning two different user personas. You had the full-on developer who was going to be talking to them via API for the sending of transactional email, and that's great. Then you had marketing campaign folks who were going to be sending out these newsletter equivalents or mass broadcast campaigns, and there was no API to speak of for these things. It was very poorly done. I'm smack dab between these, where I want to be able to invoke these things via API, but I also want to be able to edit them in a web interface, and I don't want to have to handle all the moving parts myself. So, I sat down and had a long conversation with Mike, my business partner. And well, before I get into what we did, let's stop here.

This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.

So, we looked at ConvertKit, which does a lot of things right, but gets a few things wrong. There's still no broadcast API; you have to click things to make it go. So, I have an email background, Mike has an engineering background, and we sat there and decided we were going to build something ourselves to solve this problem. And we started drawing on a whiteboard in my office. This thing is four feet by six feet; it is bolted to the wall by a professional, because I have the good sense not to tear my own wall down, instead hiring someone else to do it for me. <...
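As a concrete illustration of the persona split Corey describes: SendGrid's v3 API has a first-class endpoint for transactional sends, while broadcast campaigns had to be clicked together in the web UI. Here is a minimal sketch of the transactional side using the official sendgrid Python library; the sender and recipient addresses, subject line, and environment variable name are all placeholder assumptions, not anything from the episode.

```python
import os

from sendgrid import SendGridAPIClient
from sendgrid.helpers.mail import Mail

# Transactional email has a clean API: build a message and POST it.
# The equivalent "send this campaign to my whole list" operation had no
# such endpoint at the time, which was the whole problem.
message = Mail(
    from_email="corey@example.com",      # placeholder sender
    to_emails="subscriber@example.com",  # placeholder recipient
    subject="Last Week in AWS: Issue #1",
    html_content="<p>The nicely generated HTML goes here.</p>",
)

client = SendGridAPIClient(os.environ["SENDGRID_API_KEY"])
response = client.send(message)
print(response.status_code)  # 202 means SendGrid accepted the message
```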

Jun 1, 2020 • 9min
AWS Security Landscapers
AWS Morning Brief for the week of June 1, 2020.

May 29, 2020 • 13min
Whiteboard Confessional: The Core Problem in Cloud Economics
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
nOps
N2WS
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

nOps will help you reduce AWS costs 15 to 50 percent if you do what it tells you. And some people do. For example, watch their webcast, “How Uber Reduced AWS Costs 15 Percent in 30 Days”; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of cost and correlate that with infrastructure changes. Try it free for 30 days: go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.

Corey: Welcome to the AWS Morning Brief: Whiteboard Confessional. Today we're going to tell a story that only happened a couple of weeks ago. We don't usually get to tell stories about what we do in the AWS bill fixing realm, because companies are understandably relatively reticent to talk about this stuff in public. They believe, rightly or wrongly, that it will annoy Amazon, which, frankly, is one of my core competencies. They think it shows improper attention to detail to their investors and others. I don't see it that way, but I found a story that we can actually talk about today in a bit more depth and detail than we normally would.

So, we get a phone call about three weeks ago. Someone has a low five-figure bill every month on AWS. That's generally not large enough for us to devote a consulting project to, because frankly, the ROI would take far too long. It's too small of a bill to make an engagement with us cost-effective, but we had a few minutes to kill and thought, “Eh, go ahead and pull up your bill. We'll take a quick look at it here.” Half of their bill, historically—and growing—had been data transfer, which, okay, that's interesting. And as we've known from previous episodes, data transfer is always strange. So, they looked at this and they said, “Okay, well, obviously then, instead of serving things directly, we should get a CDN.” So, then, instead of getting a CDN, they chose to set up CloudFront, which is basically a CDN, only worse in every way. And they saw no impact to their bill after a month of this. Okay, let's change that up a bit. Now, instead of CloudFront, we're going to move to an actual CDN. So, they did; there was a small impact to their bill. Okay, and costs continued to rise, and what's going on? At this point in the story, they call us, which, if you're seeing something strange on your bill, is generally not a terrible direction to go in. We see a lot of these things, and if we can't help you, we will point you towards someone who can.

So, our consensus on this was: great, the bill is too small for a full engagement, but let's pop into Cost Explorer and see what's going on. We break it down by service, and S3 was the biggest driver of spend. Now, that's interesting. Number two was EC2. But okay, we start with the big numbers and work our way down. This is apparently novel for folks doing in-depth bill analysis, but we're going to go with it anyway. We start taking a look within that S3 category by usage type, and lo and behold, data transfer out to the internet is driving almost all of it. The cost per request is super low. That tells us in turn—because we've seen a lot of these—that there are large objects, but relatively few requests for them. So, all right, we're going to slice and dice slightly differently within Cost Explorer, AWS’s free—with an asterisk next to it—tool for exploring various aspects of your bill. That asterisk, incidentally, means that if you're doing this via API, it is one cent per call. If you're doing this in the console, it's free. Be aware: that can catch you by surprise if you write a lot of very chatty scripts. You have been warned.

So, yeah, most of the spend was indeed on GetObject calls. So, okay, we know that the data transfer spend was coming from an S3 bucket that was not going through a CDN; otherwise, it would show up as a different data transfer charge in a different section of their bill. Okay, so now we know it's S3. We have to figure out which bucket it lives within. And this is an obnoxious process, and we told them this. And they’re like, “Oh, yeah, we know what bucket that is.” This, incidentally, is where almost every software tool tends to fall down. We could spend some time tracking this down programmatically, or we can just ask someone who already has the context of what their business does and how it works loaded into their head. Because otherwise, what we'd have to do is tag all their buckets, and then wait a few days for those tags to percolate into the billing system, and then query it again, because the visibility into this sort of thing is terrible. It's a shortcoming of both Cost Explorer and S3, in that weird seam between the two. There's fundamentally no easy way to see at a glance which buckets are costing you money unless you do something fun with tagging. To that end, I'm a big believer in having every bucket tagged with its own bucket name, so I can start slicing on that, and then you enable that as a cost allocation tag. So, great. Now, what is it that's getting requested? Well, once you have the bucket, you can also dig into this via access logs to see what's going on. Great. Now, we take a look at this, and sure enough—well, before I go into what it actually was, let's pause here for a moment.

This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.

Corey: So, this bucket was set to public access. Aha! It only allowed GetO...
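A sketch of the two techniques mentioned in this episode: tagging every bucket with its own name (so it can later be activated as a cost allocation tag in the Billing console), and slicing S3 spend by usage type via the Cost Explorer API. The tag key, dates, and region are assumptions for illustration. Two caveats the episode itself raises: each get_cost_and_usage API call costs one cent, and tags take days to percolate into billing data; also note that put_bucket_tagging replaces any existing tags on a bucket.

```python
import boto3

s3 = boto3.client("s3")
ce = boto3.client("ce")  # every Cost Explorer API call costs $0.01

# Tag each bucket with its own name. Once "BucketName" is activated as a
# cost allocation tag, you can group spend by it in Cost Explorer.
# Careful: put_bucket_tagging REPLACES all existing tags on the bucket.
for bucket in s3.list_buckets()["Buckets"]:
    s3.put_bucket_tagging(
        Bucket=bucket["Name"],
        Tagging={"TagSet": [{"Key": "BucketName", "Value": bucket["Name"]}]},
    )

# Slice S3 spend by usage type, the same view used in the story to spot
# that data-transfer-out was driving the bill.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2020-04-01", "End": "2020-05-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Simple Storage Service"],
        }
    },
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```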

May 25, 2020 • 10min
Introducing AWS SnowCannon
AWS Morning Brief for the week of May 25, 2020.

May 22, 2020 • 13min
Whiteboard Confessional: Naming Is Hard, Don’t Make it Worse
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
http://nops.io/snark
http://snark.cloud/n2ws
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

nOps will help you reduce AWS costs 15 to 50 percent if you do what it tells you. And some people do. For example, watch their webcast, “How Uber Reduced AWS Costs 15 Percent in 30 Days”; that is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of cost and correlate that with infrastructure changes. Try it free for 30 days: go to nops.io/snark. That's N-O-P-S dot I-O, slash snark.

Good morning, AWS, and welcome to the AWS Morning Brief: Whiteboard Confessional. Today we're going to revisit DNS. Now, now, slow down there, Hasty Pudding. Don't bother turning the podcast off. For once, I'm not talking about using it as a database… this time. As you're probably aware, DNS is what folks use to equate friendly names like twitterforpets.com, or incredibly unfriendly names like Oracle.com, to IP addresses, which is how computers tend to see the world. I'm not going to rehash what DNS does. Instead, I'm going to talk about a particular kind of DNS problem that befell a place I used to consult for. They're publicly traded now, so I'm not going to name them.

An awful lot of shops do something that's called split-horizon DNS. What that means is that a DNS name resolves differently depending on which network you're on. For example, admin.twitterforpets.com will resolve to an administrative dashboard if you're on the Twitter For Pets internal network via VPN, but it won't resolve to that dashboard if you're outside the network; it might resolve to nothing, or it might resolve back to their main website, www.twitterforpets.com. And that's fine. Most DNS providers can support this, and Route 53 is, of course, no exception. This is, incidentally, what the Route 53 Resolver, released in 2018, is designed to do: it bridges private DNS zones to on-premises environments, so your internal zones can resolve to private IP addresses without you having to publish your private IP address ranges in public zones for everyone to see. The reason that matters is that it keeps you from broadcasting your architecture or your network layout outside your company. Some folks consider doing that to be a security problem, because it discloses information that an attacker can then leverage to gain further toeholds into your network. Some folks also think that that tends to be a little bit on the extreme side. I'll let you decide, because I don't care, and that's not what this story is about. The point is that split-horizon DNS is controversial for a few reasons, but in many shops it is considered the right thing to do, because it's what they've been doing. The internal DNS names either don't resolve to anything publicly, or they resolve to a different system that’s configured to reject the request outright. But there is another path you can take; a third option that no one discusses, because it's a path that's far darker, because it is oh, so very much dumber. But first…

This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers; you can back things up cost-effectively, and safely. For a limited time, N2WS is offering you $100 in AWS credits for setting up their free trial, and I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.

What I'm about to describe is far too stupid for my made-up startup of Twitter For Pets, so we're going to have to invent a somehow even dumber company, and we're going to call it Uber For Squirrels. It's like regular Uber, except it somehow manages to lose less money. Now, there was a very strong argument within the engineering community inside of Uber For Squirrels. The decision, and the proclamation, was that split-horizon DNS is dangerous, because a misconfiguration could leak records in the wrong places and theoretically take the entire Uber For Squirrels site down. There are merits to those arguments, and you can't dismiss them out of hand, so a bargain was struck. The external DNS zone was therefore decreed to be uberforsquirrels.com, while the internal zone was configured to be uberforsquirrels.net. The uberforsquirrels.net zone was only accessible inside of the network; from the outside, nobody could query it.

Now, this is, in isolation—before I go further—a bad plan all on its own. When you're reading quickly, uberforsquirrels.com and uberforsquirrels.net don't jump out visually as being meaningfully different. You're going to typo them in config files constantly without meaning to, and then you're going to have a hell of a time tracking it down, because it's not immediately obvious that you're talking to the wrong thing; you might think it's a network problem. Your tab completion out of your known_hosts file, if you have such a thing configured in your environment, is going to break; you'll have to hit tab a couple of extra times to cycle through the dot-net and dot-com variants. It's just a general irritant. But that's not enough to justify an episode of the show, because that is still some Twitter For Pets-level brokenness. Why do I need to throw Uber For Squirrels under the bus? Well, because it turns out that despite using uberforsquirrels.net everywhere as their internal domain, they didn't actually own uberforsquirrels.net. It wasn't entirely clear who did, other than that the registration was in another country, so it probably wasn't something that the CEO registered and then forgot about in his random domain list of things he acquired for companies he was going to start o...
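For contrast with the unowned-.net approach, the safer pattern is a Route 53 private hosted zone on a subdomain of a domain you actually own: the zone is only resolvable from the associated VPC, so internal records never leak publicly, and there is no lookalike TLD to typo or lose. A minimal sketch with boto3; the zone name, VPC ID, and region are placeholder assumptions in the spirit of the episode's fictional company.

```python
import uuid

import boto3

route53 = boto3.client("route53")

# A private hosted zone resolves only inside the associated VPC, so
# internal records stay off the public internet. Using a subdomain of
# the .com you already own avoids the unowned-.net trap entirely.
response = route53.create_hosted_zone(
    Name="internal.uberforsquirrels.com.",  # hypothetical internal zone
    CallerReference=str(uuid.uuid4()),      # must be unique per request
    VPC={"VPCRegion": "us-east-1", "VPCId": "vpc-0123456789abcdef0"},
    HostedZoneConfig={
        "Comment": "Internal-only records; not visible publicly",
        "PrivateZone": True,
    },
)
print(response["HostedZone"]["Id"])
```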

May 18, 2020 • 10min
Amazon Macie: Some Well Deserved Pushback
AWS Morning Brief for the week of May 18, 2020.

May 15, 2020 • 11min
Whiteboard Confessional: You Down with UTC? Yeah, You Know Me
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

nOps will help you reduce AWS costs 15 to 50% if you do what it tells you. And some people do: for example, watch their webcast, “How Uber Reduced AWS Costs 15% in 30 Days.” That is six figures in 30 days. Rather than a thing you might do, this is something that they actually did. Take a look at it. It's designed for DevOps teams. nOps helps quickly discover the root causes of costs and correlate that with infrastructure changes. Try it free for 30 days. Go to nops.io/snark. That's nops.io/snark.

Today I want to talk about a funny thing: time. Time has taken on a different meaning for many of us during the current pandemic. Hours seem like days. Days seem like months. But in the context of computers, time is a steady thing. Except when it's not. Things like leap years, leap seconds, Google's famous leap smear and, of course, our ever-changing friends, time zones, combine and collude with one another to make time a very hard problem when it comes to computers. In the general case, computers think of time in terms of seconds since the start of the Unix epoch on January 1, 1970. This is, incidentally—and not the point of this episode—going to cause a heck of a lot of excitement when 32-bit counters roll over in 2038. But that's a future problem, similar to Y2K, that I'm sure won't bother anyone.

Time leads to suboptimal architectural choices, which is bad, and then those choices become guidance, which is in turn far, far worse. Now, AWS has said a lot of things over the years that I despise and take massive issue with. Some petty and venial, like pronunciation, but none of them were quite so horrifying as a tweet. On May 17, 2018, the official AWS Cloud Twitter account tweeted out an article with the following caption: “Change the timezone of your Amazon RDS instance to local time.” I hit the roof immediately and began ranting about it and railing against that tweet in particular. I believe this is the first time that me yelling at AWS in public hit semi-viral status. My comment, to be precise, was: absolutely do not do this. UTC is the proper server timezone unless you want an incredibly complex problem after you scale. Fixing this specific problem has bought consultants entire houses in San Francisco. Now, I stand by that criticism, and I maintain that your databases should be in UTC at all times, as should the rest of your servers. And I'll explain why, but first:

This episode is sponsored in part by N2WS. You know what you care about? Many things, but never backups. At least, until right after you really, really, really needed to care about backups. That's what N2WS does for your AWS account. It allows you to cycle backups through different storage tiers so you can back things up cost-effectively and safely. For a limited time, N2WS is offering you a hundred dollars in AWS credits for setting up their free trial. And I encourage you to give it a shot. To learn more, visit snark.cloud/n2ws. That's snark.cloud/n2ws.

It's important that all of your systems be in the same timezone. UTC, or Coordinated Universal Time, doesn't change with the seasons; it doesn't observe daylight saving time. It's the closest thing we've got to a unified central time that everyone can agree on. Now, you're going to take issue with a lot of that, and I'm not suggesting that you should display that time to your users. You have a lot of options for how you alter the display of time at the presentation level. You can detect the timezone that their browser is set to. You can let them select their time zone in the settings of your application. You can do what ConvertKit—one of my vendors—does, and force everything to display in US East Coast time for some godforsaken reason. But all of those options are far better than setting the server time to local time.

Years ago, during job interviews, I'd be told that this shameful secret existed within companies when I asked what kinds of problems they were currently wrestling with, and it's a big deal, because changing one system requires changing every system that winds up tying back to it. Google apparently had all of their servers originally set to Pacific Coast time, or headquarters time, and this caused them problems for over a decade. I can't confirm that because I haven't ever worked there, so I wouldn't know, other than stories people tell while sobbing into beers. But it stands to reason, because once you've gone down this path, it is incredibly difficult to fix.

What's not so obvious is why exactly this is so painful. And the problem comes down to change. Time zones change. Daylight saving time alters when it takes place in a given location from year to year. And time zones themselves don't hold still either, as geopolitical things tend to change. Remember that computers don't just use time to tell you what time something is right now. They look at when log entries were made. What happened in a particular time frame? What was the order of those specific events that all came from different systems? When was a change actually implemented? And you really, really don't want to have to apply complex math to logs just to reconstruct historical events in time. “Well, that one was before daylight saving time took effect that year in that particular location where the server was running, so just carry the two.” That becomes awful stuff, and no one wants to have to go through that. It also leads to scenarios where you can introduce errors with bad timezone math. Now, there are a couple of solid objections here, but one of the only ones that I saw advocated on Twitter when I started ranting about it was of the very reasonable form, “Look, most stuff that uses databases in a lot of companies is for a single location at a single company, and it's never going to need ...
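The rule of thumb here (store and log in UTC, convert only at the presentation layer) looks like this in practice. A minimal sketch in Python, requiring 3.9+ for zoneinfo; the example user timezone is an assumption standing in for whatever the browser reports or the user picks in settings.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Servers, databases, and logs all record timezone-aware UTC timestamps,
# so ordering events across systems is simple comparison with no
# daylight-saving or timezone-offset math.
event_time = datetime.now(timezone.utc)
log_line = f"{event_time.isoformat()} user signed up"
print(log_line)

# Conversion happens only when showing the time to a human, using
# whatever zone the browser reports or the user picked in settings.
user_zone = ZoneInfo("America/Los_Angeles")  # assumed user preference
print(event_time.astimezone(user_zone).strftime("%Y-%m-%d %H:%M %Z"))
```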

May 11, 2020 • 10min
The AWS Machine That Goes PING
AWS Morning Brief for the week of May 11, 2020.

May 8, 2020 • 10min
Whiteboard Confessional: Click Here to Break Production
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Links
CHAOSSEARCH
@QuinnyPig

Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semi-polite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best thing to call your staging environment is “theory,” because invariably, whatever you’ve built works in theory, but not in production. Let’s get to it.

On this show, I talk an awful lot about architectural patterns that are horrifying. Let’s instead talk for a moment about something that isn’t horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you’ve come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than running massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would other S3 data, and not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at CHAOSSEARCH.io.

Today on the AWS Morning Brief: Whiteboard Confessional, I'm telling a different story than I normally do. Specifically, this is the tale of an outage from several weeks ago. The person who shared this story with me has requested to remain anonymous and further wishes me to not mention their company at all. This is, incidentally, a common occurrence. Folks don't generally want to jeopardize their relationship with AWS by disclosing a service issue they see, whereas I don't have that particular self-preservation instinct. Then again, I'm not a big AWS customer myself. I'm not contractually bound to AWS in any meaningful way, and I'm not an AWS partner, nor am I an AWS Hero. So, all that AWS really has over me in terms of leverage is the empty threat of taking away my birthday. So, let's dive into this anonymous story. It's a good one.

A company was minding its own business, and then had a severity one incident. For those who aren't familiar with that particular designation, you can think of that as “the company's primary service fell over in an embarrassingly public way.” Customers noticed, and everyone runs around screaming a whole lot. Now, if we skip past the delightful hair-on-fire diagnosis work, the behavior that was eventually tracked down was that an SNS topic had a critical listener get unsubscribed. That SNS topic invoked said listener, which in turn drove a critical webhook call via API Gateway. This is a bad thing, obviously. Fundamentally, customers stopped receiving webhooks that they were expecting, and this caused a nuclear meltdown, given the nature of what the company does, which I can't disclose and isn't particularly relevant anyway. But, for those who are not up to date on the latest AWS terminology, service names, and parlance, what this means at a high level is that a thing happens inside of AWS, and whenever that thing happens, it's supposed to fire off an event that notifies this company's paying customers. This broke because something, somewhere, unsubscribed the firing-off dingus from the notification system. Now that we're aware of what caused the issue at a very high level, it's time to dig into how it happened and what to do about it. But first:

In the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn’t have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts, and we know what that costs. You don’t have to deal with running massive piles of infrastructure to be able to query that log data with APIs you’ve come to know and tolerate, and they’re just good people to work with. Reach out to CHAOSSEARCH.io. And my thanks to them for sponsoring this incredibly depressing podcast.

The logs for who unsubscribed it are, of course, empty, which is a problem for this company’s blameless-in-theory-but-blame-you-all-the-way-out-of-the-company-if-it-turns-out-that-it-was-you-that-clicked-this-thing-and-didn't-tell-anyone philosophy. CloudTrail doesn't log this event, because why would it? CloudTrail’s primary purpose is to rack up bills and take the long way around before showing events in your account, not to assist with actual problem diagnosis, by all accounts. Now, fortunately, this customer did have AWS Enterprise Support. It exists for precisely this kind of problem. It granted them access to the SNS team, which had considerably more insight into what the heck had happened, at which point the answer became depressingly clear, as well as clearly depressing. It turns out that the unsubscribe URL at the bottom of every SNS notification wasn't authenticated. Therefore, anyone who had access to the link could have invoked it, and that's what happened when a support person did something very reasonable: copy and paste a log message containing that unsubscribe link into a team Slack channel. It wasn't their fault [00:06:04 unintelligible] because they didn't click it. The entity triggering this was—and I swear I'm not making this up—Slackbot. Have you ever noticed that when you paste a URL into Slack, it auto-expands the link to show you a preview? It tries to do that on every URL, and you can't disable URL expansion at the Slack workspace level. You can blacklist URLs, but only if the link expansion succeeds. In this case, there's no preview, so it doesn't succeed, so there's nothing for it to blacklist. Slack’s helpful feature can't be disabled on a team-wide level, so when that unsubscribe URL showed up in a log snippet that got pasted, it silently unsubscribed the consumer from SNS and broke the entire system. Now, there are an awful lot of things that could have been different here. Isn't this the sort of thing that might be better off with SQS, you might reasonably ask? Well, four years ago, when this system was built, SQS itself could not, and did not, support invoking Lambda functions, so SNS was the only real option. T...
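One mitigation worth noting, though the episode doesn't say whether this company adopted it: SNS can be told at subscription-confirmation time to require authenticated (signed) requests for unsubscribes, which would have rendered the Slackbot-expanded link harmless. A hedged sketch with boto3; the topic ARN and token are placeholders, and the token in practice arrives in the SubscriptionConfirmation message SNS delivers to the new endpoint.

```python
import boto3

sns = boto3.client("sns")

# The token arrives in the SubscriptionConfirmation message that SNS
# sends to the endpoint; a placeholder is shown here.
confirmation_token = "2336412f37..."  # placeholder token

# With AuthenticateOnUnsubscribe set to "true", unsubscribing requires a
# signed request from the topic or subscription owner, so a bare
# unsubscribe URL pasted into Slack can no longer silently remove the
# listener.
sns.confirm_subscription(
    TopicArn="arn:aws:sns:us-east-1:123456789012:critical-webhooks",
    Token=confirmation_token,
    AuthenticateOnUnsubscribe="true",
)
```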

May 4, 2020 • 11min
AWS Non-Profit Organisations
AWS Morning Brief for the week of May 4, 2020.


