The InfoQ Podcast
InfoQ
Software engineers, architects and team leads have found inspiration to drive change and innovation in their teams by listening to the weekly InfoQ Podcast. They have received essential information that helped them validate their software development road map. We have achieved that by interviewing some of the top CTOs, engineers and technology directors from companies like Uber, Netflix and more. Over 1,200,000 downloads in the last 3 years.
Episodes
Dec 16, 2016 • 36min
Keith Adams on the Architecture of Slack, using MySQL, Edge Caching, & the backend Messaging Server
In this week’s podcast, QCon chair Wesley Reisz talks to Keith Adams, chief architect at Slack. Prior to Slack, he was an engineer at Facebook, where he worked on the search type-ahead backend, and is well known for the HipHop VM [hhvm.com]. Adams presented How Slack Works at QCon San Francisco 2016.
Why listen to this podcast:
- Group messaging succeeds when it feels like a place for members to gather, rather than just a tool
- Opt-in group membership scales better than defining a group on the fly, in the way a mailing list scales better than individually adding people to each mail
- Choosing availability over consistency is sometimes the right choice for particular use cases
- Consistency can be recovered after the fact with custom conflict resolution tools
- Latency is important and can be solved by having proxies or edge applications closer to the user
Notes and links can be found on: http://bit.ly/keith-adams
3m:30s Voice and video interactions are impacted by latency; the same is true of messaging clients
4m:00s The user interface can provide indications of presence, through avatars indicating availability and typing indicators
4m:15s Latency is important; sometimes the difference that matters is between 100ms and 200ms, so the message channel monitors ping timeouts between server and client
4m:40s 99th percentile is less than 100ms ping time
5m:15s If the 99th percentile is more than 100ms then it may be server based, such as needing to tune the Java GC
5m:25s Network conditions of the mobile clients are highly variable
6m:20s Mobile clients can suffer intermittent connectivity
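The ping-monitoring approach described above can be sketched in Python. The percentile helper and the sample data are purely illustrative, not Slack's actual implementation:

```python
import random

def p99(samples):
    """Return the 99th-percentile value of a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return ordered[index]

# Hypothetical round-trip times collected from client pings (ms).
rtts = [random.gauss(60, 15) for _ in range(1000)]

if p99(rtts) > 100:
    # Per the discussion, a high p99 may point at a server-side cause,
    # such as needing to tune the GC, rather than client network conditions.
    print("p99 over 100ms: investigate server-side causes")
```

The useful part of the idea is alerting on the tail (p99), not the average, since a small fraction of slow pings is exactly what degrades the feeling of presence.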
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/keith-adams
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Dec 9, 2016 • 25min
Haley Tucker on Responding to Failures in Playback Features at Netflix
In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members.
Why listen to this podcast:
- Distributed systems fail regularly, often due to unexpected reasons
- Data canaries can identify invalid metadata before it can enter and corrupt the production environment
- ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions
- Fallbacks are an important component of system stability, but the fallbacks must be fast and light to not cause secondary failures
- Distributed systems are fundamentally social systems, and require a blameless culture to be successful
Notes and links can be found on: http://bit.ly/2hqzQ6K
2m:05s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services.
2m:43s - Several checks and validations were in place within the video metadata service, but it is impossible to predict every way consumers will be using the data.
3m:30s - The access pattern used by the playback service was different than that used in the checks, and led to unexpected results in production.
3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process.
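The data-canary idea above can be sketched as consumer-registered validators that run against candidate metadata before it is promoted. The validator names and metadata shape below are hypothetical, not Netflix's actual API:

```python
def canary_check(candidate, validators):
    """Run each consumer-supplied validator against candidate metadata.
    Returns the list of failing checks; promote only when it is empty."""
    failures = []
    for name, validator in validators.items():
        try:
            if not validator(candidate):
                failures.append(name)
        except Exception:
            # A validator that crashes counts as a failure too.
            failures.append(name)
    return failures

# Hypothetical validators registered by consuming services, reflecting
# their own access patterns rather than the producer's assumptions.
validators = {
    "playback_has_streams": lambda m: len(m.get("streams", [])) > 0,
    "title_present": lambda m: bool(m.get("title")),
}

good = {"title": "Example", "streams": ["hd"]}
print(canary_check(good, validators))  # -> []
```

This captures the key shift described in the episode: consumers, who know their own access patterns, supply the checks, while the metadata service orchestrates them.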
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2hqzQ6K
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Dec 2, 2016 • 29min
Kolton Andrus on Lessons Learnt From Failure Testing at Amazon and Netflix and New Venture Gremlin
In this week's podcast, QCon chair Wesley Reisz talks to Kolton Andrus. Andrus is the founder of Gremlin Inc. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT: Netflix’s failure injection service. Prior, he improved the performance and reliability of the Amazon Retail website.
Why listen to this podcast:
- Gremlin, Kolton Andrus' new start-up, is focused on providing failure testing as a service. Version 1, currently in closed beta, is focused on infrastructure failures.
- Lineage-driven Fault Injection (LDFI) allowed Netflix to dramatically reduce the number of tests they needed to run in order to explore a problem space.
- You generally want to run failure tests in production, but you can't start there. Start in development and build up.
- Failure testing at the application level, as Netflix does it, enables request-level fault injection for a specific user or a specific device.
- Being able to trace infrastructure with something like Dapper or Zipkin offers tremendous value. At Netflix, the failure injection system is integrated into the tracing system, which meant that when they caused a failure they could see all the points in the system that it touched.
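Request-level fault injection of the kind FIT provides can be sketched as a wrapper that fails downstream calls only for targeted users or devices. The registry and function names here are illustrative assumptions, not FIT's real interface:

```python
import functools

# Hypothetical registry of active failure scenarios, keyed by attribute.
FAILURE_SCENARIOS = {("user", "alice"), ("device", "ps4")}

def fault_injected(call):
    """Wrap a downstream call; raise if the request is targeted for injection."""
    @functools.wraps(call)
    def wrapper(request, *args, **kwargs):
        if ("user", request.get("user")) in FAILURE_SCENARIOS or \
           ("device", request.get("device")) in FAILURE_SCENARIOS:
            raise RuntimeError("injected failure for targeted request")
        return call(request, *args, **kwargs)
    return wrapper

@fault_injected
def fetch_recommendations(request):
    return ["row1", "row2"]

# Untargeted requests pass through unchanged; targeted ones see the fault.
print(fetch_recommendations({"user": "bob", "device": "web"}))
```

Scoping the blast radius to one user or device is what makes it safe to run such experiments in production, which is the progression Andrus describes.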
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2fT9YiM
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Nov 18, 2016 • 26min
Preslav Le on How Dropbox Moved off AWS and What They Have Been Able to Do Since
As InfoQ previously reported in March 2016, Dropbox announced that they had migrated away from Amazon Web Services (AWS).
In this week's podcast Robert Blumen talks to Preslav Le. Preslav has been a software engineer at Dropbox for the past three years, contributing to various aspects of Dropbox’s infrastructure including traffic, performance and storage. He was part of the core on-call and storage on-call rotations, dealing with urgent real-world issues ranging from bad code pushes to complete data centre outages.
Why listen to this podcast:
- Dropbox migrated away from Amazon S3 to their own data centres to allow them to optimise for their specific use case.
- They are experimenting with Shingled Magnetic Recording (SMR) drives for primary storage to increase storage density. All writes go to an SSD cache and then get pushed asynchronously to the SMR disk.
- Their average block size is 1.6MB with a maximum block size of 4MB. Knowing this allows the team to tune their storage system.
- Three languages are used for the backend infrastructure. Python is used mainly for business logic, Go is the primary language used for heavy infrastructure services, and in some cases, for example where more direct control over memory is needed, Rust is also used.
- Dropbox invests very heavily in verification and automation. A verifier scans every byte on disk and checks that it matches the checksum in the index.
- Verification is also used to check that each box has the block keys it should have.
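The verifier idea can be sketched as a scan that recomputes each block's checksum and compares it against the index. SHA-256 and the data shapes below are stand-ins, since Dropbox's actual verifier is internal:

```python
import hashlib

def verify_blocks(disk_blocks, index):
    """Compare the checksum of every block on disk against the index.
    Returns the ids of blocks whose contents no longer match."""
    corrupted = []
    for block_id, data in disk_blocks.items():
        digest = hashlib.sha256(data).hexdigest()
        if index.get(block_id) != digest:
            corrupted.append(block_id)
    return corrupted

# Hypothetical on-disk blocks and the index entries recorded at write time.
blocks = {"b1": b"hello", "b2": b"world"}
index = {bid: hashlib.sha256(data).hexdigest() for bid, data in blocks.items()}

blocks["b2"] = b"w0rld"              # simulate silent corruption
print(verify_blocks(blocks, index))  # -> ['b2']
```

Continuously re-verifying stored data catches silent corruption (disk bit rot, bad writes) that no amount of testing at write time can rule out.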
Notes and links can be found on http://bit.ly/preslav-le
Dropbox’s motivation for moving off the cloud
2:40 - Dropbox used Amazon S3 and other services where it made sense, but they stored all the metadata in their own data centres.
3:30 - Initially this was done because Amazon had poor support for persistent storage at the time. This has since improved, but it didn’t make sense for Dropbox to move the metadata back.
4:01 - By that time the Dropbox team was ready to tackle the storage problem and built their own in-house replacement for S3, called Magic Pocket. Magic Pocket allowed Dropbox to move away from Amazon altogether.
4:30 - The move saved money, but also allowed Dropbox to optimise for their specific use case and be faster.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/preslav-le
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Nov 11, 2016 • 26min
Randy Shoup on Stitch Fix's Technology Stack, Data Science and Microservices
In this week's podcast QCon chair Wesley Reisz talks to Randy Shoup. Shoup is the vice president of engineering at Stitch Fix. Prior to Stitch Fix, he worked for Google as the director of engineering and cloud computing, was CTO and co-founder of Shopilly, and was chief engineer at eBay.
Why listen to this podcast:
- Stitch Fix's business is a combination of art and science. Humans are much better with the machines, and the machines are much better with the humans.
- Stitch Fix has 60 engineers, with 80 data scientists and algorithm developers. This ratio of data science to engineering is unique.
- With Ruby-on-Rails on top of Postgres, the company maintains about 30 different applications on the same stack.
- The practice of Test Driven Development makes Continuous Delivery work, and the practice of having the same people build the code as those who operate the code makes both of these things much more powerful.
- Microservices gives feature velocity, the ability for individual teams to move quickly and independently of each other, and independent deployments.
- Microservices solve a scaling problem. They solve an organisational scaling problem, and a technological scaling problem. These are not the problems that you have early on in the startup.
- In the monolithic world, you may reach a point where you can no longer vertically scale the application, the database, or whatever your monolith is. For scaling reasons alone you might then consider breaking it up into what we call microservices.
Notes and links can be found on http://bit.ly/randy-shoup-podcast
Data Science and Stitch Fix
1m:57s - Stitch Fix re-imagines retail, particularly for clothing. When you sign up, you fill out a survey of the kinds of things that you like and you don't like, and we choose what we think you're going to enjoy based on the millions of customers that we have already. And we use a ton of data science in that process.
3m:00s - That goes into our algorithms and then our algorithms make personalised recommendations based on all the things we know about our other customers... there's a human element as well: we have 3,200 human stylists that are all around the United States and they choose the five items that go into the box [of clothing].
3m:29s - What we like is that this is a combination of art and science. Modern companies combine what machines are really good at, such as chugging through the 60 to 70 questions times the millions of customers, and combining that with the human element of the stylists, figuring out what things go together, what things are trending, what things are appropriate... Humans are much better with the machines, and the machines are much better with the humans.
[...]
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/randy-shoup-podcast
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Nov 4, 2016 • 29min
Tal Weiss on Observability, Instrumentation and Bytecode Manipulation on the JVM
In this week's podcast, QCon chair Wesley Reisz talks to Tal Weiss, CEO of OverOps, recently re-branded from Takipi. The conversation covers how the OverOps product works, explores the difference between instrumentation and observability, discusses bytecode manipulation approaches and common errors in Java-based applications.
A keen blogger, Weiss has been designing scalable, real-time Java and C++ applications for the past 15 years. He was co-founder and CEO at VisualTao which was acquired by Autodesk in 2009, and also worked as a software architect at IAI Space Industries focusing on distributed, real-time satellite tracking and control systems.
Why listen to this podcast:
- OverOps uses a mixture of machine code instrumentation and static code analysis at deployment time to build up an index of the code
- Observability is how you architect your code to be able to capture information from its outputs. Instrumentation is where you come in from the outside and use bytecode or machine code manipulation techniques to capture information after the system has been designed and built.
- Bytecode instrumentation is a technique that most companies can benefit from learning a bit more about. Bytecode isn’t machine code - it is a high-level programming language. Being able to read it really helps you understand how the JVM works.
- There are a number of bytecode manipulation tools you can use to work with bytecode - ASM is probably the most well known.
- A fairly small number of events within an application’s life-cycle generate the majority of the log volume. A good practice is to regularly review your log files to consider if what is being logged is the right thing.
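The log-volume observation suggests a simple review technique: count lines per logger to find the handful of events that dominate. A stdlib-only sketch with made-up log lines (a real review would read an actual log file):

```python
import collections
import re

# Hypothetical log lines in a common "LEVEL LoggerName - message" layout.
log_lines = [
    "INFO  ConnectionPool - acquired connection",
    "INFO  ConnectionPool - acquired connection",
    "INFO  ConnectionPool - acquired connection",
    "WARN  RetryHandler - retrying request",
    "ERROR PaymentService - charge failed",
]

def top_log_sources(lines, n=3):
    """Count lines per logger name to see which events dominate volume."""
    counts = collections.Counter()
    for line in lines:
        match = re.match(r"\w+\s+(\S+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)

print(top_log_sources(log_lines))
```

If one logger accounts for most of the volume, that is the place to ask whether what is being logged is actually the right thing.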
Notes and links can be found on http://bit.ly/2fInGsW
SaaS vs On-Premise
5:43 - OverOps started as a SaaS product, but given that a lot of the data it collects is potentially sensitive, they introduced a new product called Hybrid. Hybrid separates the data into two independent streams: data and metadata.
6:42 - The data stream is the raw data that is captured which is then privately encrypted using 256 bit AES encryption keys which are only stored on the production machine and by the user when they need to decrypt it. The metadata stream is not sensitive since it is just an abstract mathematical graph.
7:18 - Because the data stream is already privately encrypted, that stream can be stored behind a firewall and never needs to leave a company’s network.
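The data/metadata split can be sketched as follows. Note this is not OverOps' actual scheme: the cipher here is a toy XOR keystream standing in for AES-256 (the Python standard library has no AES), and the record shapes are invented. The point is only that the key stays with the caller and the metadata record carries nothing sensitive:

```python
import hashlib
import secrets

def split_event(raw_payload, key):
    """Split a captured event into an encrypted data-stream record and a
    non-sensitive metadata record (here just a digest id and a size)."""
    # Toy keystream cipher as a placeholder for real AES-256 encryption.
    keystream = hashlib.sha256(key).digest()
    encrypted = bytes(b ^ keystream[i % len(keystream)]
                      for i, b in enumerate(raw_payload))
    metadata = {"event_id": hashlib.sha256(raw_payload).hexdigest()[:12],
                "size": len(raw_payload)}
    return encrypted, metadata

key = secrets.token_bytes(32)  # stays on the production machine
data_record, meta_record = split_event(b"stack trace with user details", key)
print(meta_record["size"])
```

Because only the holder of the key can decrypt the data stream, that stream can live behind the firewall while the abstract metadata stream travels freely, which is the arrangement the Hybrid product is described as providing.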
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2fInGsW
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Sep 16, 2016 • 32min
Cathy O'Neil on Pernicious Machine Learning Algorithms and How to Audit Them
In this week's podcast InfoQ’s editor-in-chief Charles Humble talks to Data Scientist Cathy O’Neil. O'Neil is the author of the blog mathbabe.org. She is a former Director of the Lede Program in Data Practices at Columbia University Graduate School of Journalism, Tow Center, and was employed as a Data Science Consultant at Johnson Research Labs. O'Neil earned a mathematics Ph.D. from Harvard University. Topics discussed include her book “Weapons of Math Destruction,” predictive policing models, the teacher value-added model, approaches to auditing algorithms, and whether government regulation of the field is needed.
Why listen to this podcast:
- There is a class of pernicious big data algorithms that are increasingly controlling society but are not open to scrutiny.
- Flawed data can result in an algorithm that is, for instance, racist and sexist. For example, the data used in predictive policing models is racist. But people tend to be overly trusting of algorithms because they are mathematical.
- Data scientists have to make ethical decisions even if they don’t acknowledge it. Often problems stem from an abdication of responsibility.
- Auditing for algorithms is still a very young field with ongoing academic research exploring approaches.
- Government regulation of the industry may well be required.
Notes and links can be found on http://bit.ly/2eYVb9q
Weapons of math destruction
0m:43s - The central thesis of the book is that whilst not all algorithms are bad, there is a class of pernicious big data algorithms that are increasingly controlling society.
1m:32s - The classes of algorithm that O'Neil is concerned about - the weapons of math destruction - have three characteristics: they are widespread and impact on important decisions like whether someone can go to college or get a job, they are somehow secret so that the people who are being targeted don’t know they are being scored or don’t understand how their score is computed; and the third characteristic is they are destructive - they ruin lives.
2m:51s - These characteristics undermine the original intention of the algorithm, which is often trying to solve big society problems with the help of data.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2eYVb9q
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Aug 19, 2016 • 24min
John Langford on Vowpal Wabbit, Used by MSN, and Machine Learning in Industry
In this week's podcast QCon chair Wesley Reisz talks to machine learning research scientist John Langford. Topics include his machine learning system Vowpal Wabbit, which is designed to be very efficient and incorporates some of the latest algorithms in the space; Vowpal Wabbit is used for news personalisation on MSN. They also discuss how to get started in the field and its shift from academic research to industry use.
Why listen to this podcast:
- Vowpal Wabbit is an ML system that attempts to incorporate some of the latest machine learning algorithms.
- How to learn ML: take a class or two, get acquainted with learning theory, and practice.
- ML has moved from the research field into industry, with 4 out of 9 ICML tutorials coming from industry.
- It’s hard to predict when you have enough data.
- AlphaGo is a milestone in artificial intelligence. It uses reinforcement learning, deep learning and existing moves played by Go masters.
- Deep Learning is currently a disruptive technology in areas such as vision or speech recognition.
- What’s trendy: Neural Networks, Reinforcement and Contextual Learning.
- Machine Learning is being commoditized.
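Vowpal Wabbit's news-personalisation use case is built on contextual bandit learning, one of the trends mentioned above. A plain-Python epsilon-greedy bandit, far simpler than VW's actual algorithms and with invented article categories and click rates, illustrates the explore/exploit idea:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit: pick an arm (e.g. an article category),
    then learn from the observed reward (e.g. click / no click)."""
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit

    def update(self, arm, reward):
        # Incremental mean of observed rewards for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["sports", "politics", "tech"])
clicks = {"sports": 0.2, "politics": 0.1, "tech": 0.5}  # hidden click rates
for _ in range(1000):
    arm = bandit.choose()
    bandit.update(arm, 1 if random.random() < clicks[arm] else 0)

# After enough rounds the bandit typically converges on the best arm.
print(max(bandit.values, key=bandit.values.get))
```

Unlike supervised learning, the learner only sees feedback for the arm it picked, which is why some exploration has to be kept in the loop.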
Notes and links can be found on http://bit.ly/2b4YNqQ
How to Approach Machine Learning
6m:12s To start learning Machine Learning, Langford recommends taking a class or two, mentioning the course by Andrew Ng and another course by Yaser S. Abu-Mostafa.
6m:50s It is recommended to get acquainted with learning theory to avoid some rookie mistakes.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2b4YNqQ
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. http://bit.ly/24x3IVq

Aug 1, 2016 • 43min
Shuman Ghosemajumder on Security and Cyber-Crime
In this week's podcast, professor Barry Burd talks to Shuman Ghosemajumder. Ghosemajumder is VP of product management at Shape Security and former click fraud czar for Google. Ghosemajumder is also the co-author of the book CGI Programming Unleashed, and was a keynote speaker at QCon New York 2016 presenting Security War Stories.
Why listen to this podcast:
- With more of our lives conducted online through technology and information retrieval systems, advanced technology gives criminals the opportunity to do things that they weren't able to do before.
- Cyber-criminals come from all over the world and every socioeconomic background, so long as there's some level of access to computers and technology.
- You see organised cyber-crime focusing on large companies because attacks against them are much more efficient.
- Cyber-criminals are getting creative, and coming up with ways to interact with websites we haven't thought of before.
- You can have very large scale attacks that are completely invisible from the point of view of the application that's being attacked.
- The context of what you are using the software for is more important than just going through an understanding of code-level vulnerabilities.
Notes and links can be found on http://bit.ly/2atBFgk
The People Behind Cyber-Crime
5:28 - There are all kinds of different personalities and demographics involved. Cyber-criminals come from all over the world and every socioeconomic background, so long as there's some level of access to computers and technology. Even in cases where a cyber criminal doesn't know how to use technology directly, or how to create something like a piece of malware, they can still be involved in a cyber-criminal's scheme.
6:29 - One scheme, which uses large groups of individuals who don't necessarily need technical skills themselves, is stealing money from bank accounts. Transferring money via malware on people’s machines, from one account to another account that the cyber-criminal controls, still involves getting that money out. That last step can involve a set of bank accounts that are assigned to real individuals.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2atBFgk
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. http://bit.ly/24x3IVq

Jul 22, 2016 • 33min
Caitie McCaffrey on Engineering Effectiveness, Diversity, & Verification of Distributed Systems
In this week's podcast, QCon chair Wes Reisz and Werner Schuster talk to Caitie McCaffrey. McCaffrey works on distributed systems with the engineering effectiveness team at Twitter, and has experience building the large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. McCaffrey's presentation at QCon New York was called The Verification of a Distributed System.
Why listen to this podcast:
- Twitter's engineering effectiveness team aims to help make dev tools better, and make developers happier and more efficient.
- Asking someone to speak at your conference or join your team solely because of their gender does more harm than people think.
- There is not one prescriptive way to make good, successful technology.
- Even when we don't have time for testing, there are options to increase your confidence in your system.
- The biggest problem when running a unit test is that it is only testing the input you hard code into the unit test.
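The hard-coded-input limitation mentioned above can be addressed with property-based or randomised testing: assert an invariant over many generated inputs instead of a few fixed ones. A stdlib-only sketch (dedicated libraries such as Hypothesis do this far more thoroughly, with shrinking of failing cases):

```python
import random

def clamp(value, low, high):
    """Function under test: restrict value to the [low, high] range."""
    return max(low, min(high, value))

# A conventional unit test exercises only the inputs we thought of:
assert clamp(5, 0, 10) == 5

# Randomised testing checks an invariant over many generated inputs.
random.seed(42)  # fixed seed so failures are reproducible
for _ in range(1000):
    low, high = sorted(random.randint(-100, 100) for _ in range(2))
    value = random.randint(-200, 200)
    result = clamp(value, low, high)
    assert low <= result <= high, (value, low, high, result)
```

Even this cheap approach raises confidence well beyond a handful of hand-picked examples, which fits the episode's point that there are options even when there is little time for testing.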
Notes and links can be found on http://bit.ly/2al6BRp
Engineering Effectiveness
1:24 - The purpose of the engineering effectiveness team is to help make dev tools better, and to make Twitter's developers happier and more efficient.
2:44 - The team is trying to make infrastructure so that not every team has to solve the distributed problem on their own, and give developers some APIs and tools so that they can build systems easily.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2al6BRp
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. http://bit.ly/24x3IVq


