The InfoQ Podcast
InfoQ
Software engineers, architects and team leads have found inspiration to drive change and innovation in their teams by listening to the weekly InfoQ Podcast. They have received essential information that helped them validate their software development road map. We have achieved that by interviewing some of the top CTOs, engineers and technology directors from companies like Uber, Netflix and more. Over 1,200,000 downloads in the last 3 years.
Episodes
Dec 16, 2016 • 36min
Keith Adams on the Architecture of Slack, using MySQL, Edge Caching, & the backend Messaging Server
In this week’s podcast, QCon chair Wesley Reisz talks to Keith Adams, chief architect at Slack. Prior to Slack, he was an engineer at Facebook, where he worked on the search type-ahead backend, and is well known for the HipHop VM [hhvm.com]. Adams presented How Slack Works at QCon San Francisco 2016.
Why listen to this podcast:
- Group messaging succeeds when it feels like a place for members to gather, rather than just a tool
- Opt-in group membership scales better than defining a group on the fly, in the way a mailing list scales better than individually adding people to each mail
- Choosing availability over consistency is sometimes the right choice for particular use cases
- Consistency can be recovered after the fact with custom conflict resolution tools
- Latency is important and can be solved by having proxies or edge applications closer to the user
Notes and links can be found on: http://bit.ly/keith-adams
3m:30s Voice and video interactions are impacted by latency; the same is true of messaging clients
4m:00s The user interface can provide indications of presence, through avatars indicating availability and typing indicators
4m:15s Latency is important; sometimes the difference that matters is between 100ms and 200ms, so the message channel monitors ping timeouts between server and client
4m:40s 99th percentile is less than 100ms ping time
5m:15s If the 99th percentile is more than 100ms then it may be server based, such as needing to tune the Java GC
5m:25s Network conditions of the mobile clients are highly variable
6m:20s Mobile clients can suffer intermittent connectivity
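The ping-monitoring approach described above can be sketched in Python. The percentile helper and the sample data are purely illustrative, not Slack's actual implementation:

```python
import random

def p99(samples):
    """Return the 99th-percentile value of a list of latency samples (ms)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return ordered[index]

# Hypothetical round-trip times collected from client pings (ms).
rtts = [random.gauss(60, 15) for _ in range(1000)]

if p99(rtts) > 100:
    # Per the discussion, a high p99 may point at a server-side cause,
    # such as needing to tune the GC, rather than client network conditions.
    print("p99 over 100ms: investigate server-side causes")
```

The useful part of the idea is alerting on the tail (p99), not the average, since a small fraction of slow pings is exactly what degrades the feeling of presence.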
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/keith-adams
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Dec 9, 2016 • 25min
Haley Tucker on Responding to Failures in Playback Features at Netflix
In this week’s podcast, Thomas Betts talks with Haley Tucker, a Senior Software Engineer on the Playback Features team at Netflix. While at QCon San Francisco 2016, Tucker told some production war stories about trying to deliver content to 65 million members.
Why listen to this podcast:
- Distributed systems fail regularly, often due to unexpected reasons
- Data canaries can identify invalid metadata before it can enter and corrupt the production environment
- ChAP, the Chaos Automation Platform, can test failure conditions alongside the success conditions
- Fallbacks are an important component of system stability, but the fallbacks must be fast and light to not cause secondary failures
- Distributed systems are fundamentally social systems, and require a blameless culture to be successful
Notes and links can be found on: http://bit.ly/2hqzQ6K
2m:05s - The Video Metadata Service aggregates several sources into a consistent API consumed by other Netflix services.
2m:43s - Several checks and validations were in place within the video metadata service, but it is impossible to predict every way consumers will be using the data.
3m:30s - The access pattern used by the playback service was different than that used in the checks, and led to unexpected results in production.
3m:58s - Now, the services consuming the data are also responsible for testing and verifying the data before it rolls out to production. The Video Metadata Service can orchestrate the testing process.
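The data-canary idea above can be sketched as consumer-registered validators that run against candidate metadata before it is promoted. The validator names and metadata shape below are hypothetical, not Netflix's actual API:

```python
def canary_check(candidate, validators):
    """Run each consumer-supplied validator against candidate metadata.
    Returns the list of failing checks; promote only when it is empty."""
    failures = []
    for name, validator in validators.items():
        try:
            if not validator(candidate):
                failures.append(name)
        except Exception:
            # A validator that crashes counts as a failure too.
            failures.append(name)
    return failures

# Hypothetical validators registered by consuming services, reflecting
# their own access patterns rather than the producer's assumptions.
validators = {
    "playback_has_streams": lambda m: len(m.get("streams", [])) > 0,
    "title_present": lambda m: bool(m.get("title")),
}

good = {"title": "Example", "streams": ["hd"]}
print(canary_check(good, validators))  # -> []
```

This captures the key shift described in the episode: consumers, who know their own access patterns, supply the checks, while the metadata service orchestrates them.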
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2hqzQ6K
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Dec 2, 2016 • 29min
Kolton Andrus on Lessons Learnt From Failure Testing at Amazon and Netflix and New Venture Gremlin
In this week's podcast, QCon chair Wesley Reisz talks to Kolton Andrus. Andrus is the founder of Gremlin Inc. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT: Netflix’s failure injection service. Prior, he improved the performance and reliability of the Amazon Retail website.
Why listen to this podcast:
- Gremlin, Kolton Andrus' new start-up, is focused on providing failure testing as a service. Version 1, currently in closed beta, is focused on infrastructure failures.
- Lineage-driven Fault Injection (LDFI) allowed Netflix to dramatically reduce the number of tests they needed to run in order to explore a problem space.
- You generally want to run failure tests in production, but you can't start there. Start in development and build up.
- Failure testing at the application level, as Netflix does it, enables request-level fault injection for a specific user or a specific device.
- Being able to trace infrastructure with something like Dapper or Zipkin offers tremendous value. At Netflix, the failure injection system is integrated into the tracing system, which meant that when they caused a failure they could see all the points in the system that it touched.
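Request-level fault injection of the kind FIT provides can be sketched as a wrapper that fails downstream calls only for targeted users or devices. The registry and function names here are illustrative assumptions, not FIT's real interface:

```python
import functools

# Hypothetical registry of active failure scenarios, keyed by attribute.
FAILURE_SCENARIOS = {("user", "alice"), ("device", "ps4")}

def fault_injected(call):
    """Wrap a downstream call; raise if the request is targeted for injection."""
    @functools.wraps(call)
    def wrapper(request, *args, **kwargs):
        if ("user", request.get("user")) in FAILURE_SCENARIOS or \
           ("device", request.get("device")) in FAILURE_SCENARIOS:
            raise RuntimeError("injected failure for targeted request")
        return call(request, *args, **kwargs)
    return wrapper

@fault_injected
def fetch_recommendations(request):
    return ["row1", "row2"]

# Untargeted requests pass through unchanged; targeted ones see the fault.
print(fetch_recommendations({"user": "bob", "device": "web"}))
```

Scoping the blast radius to one user or device is what makes it safe to run such experiments in production, which is the progression Andrus describes.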
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2fT9YiM
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Nov 18, 2016 • 26min
Preslav Le on How Dropbox Moved off AWS and What They Have Been Able to Do Since
As InfoQ previously reported in March 2016, Dropbox announced that they had migrated away from Amazon Web Services (AWS).
In this week's podcast Robert Blumen talks to Preslav Le. Preslav has been a software engineer at Dropbox for the past three years, contributing to various aspects of Dropbox’s infrastructure including traffic, performance and storage. He was part of the core on-call and storage on-call rotations, dealing with urgent real-world issues ranging from bad code pushes to complete data centre outages.
Why listen to this podcast:
- Dropbox migrated away from Amazon S3 to their own data centres to allow them to optimise for their specific use case.
- They are experimenting with Shingled Magnetic Recording (SMR) drives for primary storage to increase storage density. All writes go to an SSD cache and then get pushed asynchronously to the SMR disk.
- Their average block size is 1.6MB with a maximum block size of 4MB. Knowing this allows the team to tune their storage system.
- Three languages are used for the backend infrastructure. Python is used mainly for business logic, Go is the primary language used for heavy infrastructure services, and in some cases, for example where more direct control over memory is needed, Rust is also used.
- Dropbox invests very heavily in verification and automation. A verifier scans every byte on disk and checks that it matches the checksum in the index.
- Verification is also used to check that each box has the block keys it should have.
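The verifier idea can be sketched as a scan that recomputes each block's checksum and compares it against the index. SHA-256 and the data shapes below are stand-ins, since Dropbox's actual verifier is internal:

```python
import hashlib

def verify_blocks(disk_blocks, index):
    """Compare the checksum of every block on disk against the index.
    Returns the ids of blocks whose contents no longer match."""
    corrupted = []
    for block_id, data in disk_blocks.items():
        digest = hashlib.sha256(data).hexdigest()
        if index.get(block_id) != digest:
            corrupted.append(block_id)
    return corrupted

# Hypothetical on-disk blocks and the index entries recorded at write time.
blocks = {"b1": b"hello", "b2": b"world"}
index = {bid: hashlib.sha256(data).hexdigest() for bid, data in blocks.items()}

blocks["b2"] = b"w0rld"              # simulate silent corruption
print(verify_blocks(blocks, index))  # -> ['b2']
```

Continuously re-verifying stored data catches silent corruption (disk bit rot, bad writes) that no amount of testing at write time can rule out.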
Notes and links can be found on http://bit.ly/preslav-le
Dropbox’s motivation for moving off the cloud
2:40 - Dropbox used Amazon S3 and other services where it made sense, but they stored all the metadata in their own data centres.
3:30 - Initially this was done because Amazon had poor support for persistent storage at the time. This has since improved, but it didn’t make sense for Dropbox to move the metadata back.
4:01 - By that time the Dropbox team was ready to tackle the storage problem and built their own in-house replacement for S3, called Magic Pocket. Magic Pocket allowed Dropbox to move away from Amazon altogether.
4:30 - The move saved money, but also allowed Dropbox to optimise for their specific use case and be faster.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/preslav-le
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Nov 11, 2016 • 26min
Randy Shoup on Stitch Fix's Technology Stack, Data Science and Microservices
In this week's podcast QCon chair Wesley Reisz talks to Randy Shoup. Shoup is the vice president of engineering at Stitch Fix. Prior to Stitch Fix, he worked for Google as the director of engineering and cloud computing, was CTO and co-founder of Shopilly, and was chief engineer at eBay.
Why listen to this podcast:
- Stitch Fix's business is a combination of art and science. Humans are much better with the machines, and the machines are much better with the humans.
- Stitch Fix has 60 engineers, with 80 data scientists and algorithm developers. This ratio of data science to engineering is unique.
- With Ruby-on-Rails on top of Postgres, the company maintains about 30 different applications on the same stack.
- The practice of Test Driven Development makes Continuous Delivery work, and the practice of having the same people build the code as those who operate the code makes both of these things much more powerful.
- Microservices gives feature velocity, the ability for individual teams to move quickly and independently of each other, and independent deployments.
- Microservices solve a scaling problem. They solve an organisational scaling problem, and a technological scaling problem. These are not the problems that you have early on in the startup.
- In the monolithic world, you may reach a point where you can no longer vertically scale the application, the database, or whatever your monolith is. For scaling reasons alone you might then consider breaking it up into what we call microservices.
Notes and links can be found on http://bit.ly/randy-shoup-podcast
Data Science and Stitch Fix
1m:57s - Stitch Fix re-imagines retail, particularly for clothing. When you sign up, you fill out a survey of the kinds of things that you like and you don't like, and we choose what we think you're going to enjoy based on the millions of customers that we have already. And we use a ton of data science in that process.
3m:00s - That goes into our algorithms and then our algorithms make personalised recommendations based on all the things we know about our other customers... there's a human element as well: we have 3,200 human stylists that are all around the United States and they choose the five items that go into the box [of clothing].
3m:29s - What we like is that this is a combination of art and science. Modern companies combine what machines are really good at, such as chugging through the 60 to 70 questions times the millions of customers, and combining that with the human element of the stylists, figuring out what things go together, what things are trending, what things are appropriate... Humans are much better with the machines, and the machines are much better with the humans.
[...]
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/randy-shoup-podcast
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Nov 4, 2016 • 29min
Tal Weiss on Observability, Instrumentation and Bytecode Manipulation on the JVM
In this week's podcast, QCon chair Wesley Reisz talks to Tal Weiss, CEO of OverOps, recently re-branded from Takipi. The conversation covers how the OverOps product works, explores the difference between instrumentation and observability, discusses bytecode manipulation approaches and common errors in Java-based applications.
A keen blogger, Weiss has been designing scalable, real-time Java and C++ applications for the past 15 years. He was co-founder and CEO at VisualTao which was acquired by Autodesk in 2009, and also worked as a software architect at IAI Space Industries focusing on distributed, real-time satellite tracking and control systems.
Why listen to this podcast:
- OverOps uses a mixture of machine code instrumentation and static code analysis at deployment time to build up an index of the code
- Observability is how you architect your code to be able to capture information from its outputs. Instrumentation is where you come in from the outside and use bytecode or machine code manipulation techniques to capture information after the system has been designed and built.
- Bytecode instrumentation is a technique that most companies can benefit from learning a bit more about. Bytecode isn’t machine code - it is a high-level programming language. Being able to read it really helps you understand how the JVM works.
- There are a number of bytecode manipulation tools you can use to work with bytecode - ASM is probably the most well known.
- A fairly small number of events within an application’s life-cycle generate the majority of the log volume. A good practice is to regularly review your log files to consider if what is being logged is the right thing.
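The log-volume observation suggests a simple review technique: count lines per logger to find the handful of events that dominate. A stdlib-only sketch with made-up log lines (a real review would read an actual log file):

```python
import collections
import re

# Hypothetical log lines in a common "LEVEL LoggerName - message" layout.
log_lines = [
    "INFO  ConnectionPool - acquired connection",
    "INFO  ConnectionPool - acquired connection",
    "INFO  ConnectionPool - acquired connection",
    "WARN  RetryHandler - retrying request",
    "ERROR PaymentService - charge failed",
]

def top_log_sources(lines, n=3):
    """Count lines per logger name to see which events dominate volume."""
    counts = collections.Counter()
    for line in lines:
        match = re.match(r"\w+\s+(\S+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)

print(top_log_sources(log_lines))
```

If one logger accounts for most of the volume, that is the place to ask whether what is being logged is actually the right thing.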
Notes and links can be found on http://bit.ly/2fInGsW
SaaS vs On-Premise
5:43 - OverOps started as a SaaS product, but given that a lot of the data it collects is potentially sensitive, they introduced a new product called Hybrid. Hybrid separates the data into two independent streams: data and metadata.
6:42 - The data stream is the raw data that is captured which is then privately encrypted using 256 bit AES encryption keys which are only stored on the production machine and by the user when they need to decrypt it. The metadata stream is not sensitive since it is just an abstract mathematical graph.
7:18 - Because the data stream is already privately encrypted, that stream can be stored behind a firewall and never needs to leave a company’s network.
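The data/metadata split can be sketched as follows. Note this is not OverOps' actual scheme: the cipher here is a toy XOR keystream standing in for AES-256 (the Python standard library has no AES), and the record shapes are invented. The point is only that the key stays with the caller and the metadata record carries nothing sensitive:

```python
import hashlib
import secrets

def split_event(raw_payload, key):
    """Split a captured event into an encrypted data-stream record and a
    non-sensitive metadata record (here just a digest id and a size)."""
    # Toy keystream cipher as a placeholder for real AES-256 encryption.
    keystream = hashlib.sha256(key).digest()
    encrypted = bytes(b ^ keystream[i % len(keystream)]
                      for i, b in enumerate(raw_payload))
    metadata = {"event_id": hashlib.sha256(raw_payload).hexdigest()[:12],
                "size": len(raw_payload)}
    return encrypted, metadata

key = secrets.token_bytes(32)  # stays on the production machine
data_record, meta_record = split_event(b"stack trace with user details", key)
print(meta_record["size"])
```

Because only the holder of the key can decrypt the data stream, that stream can live behind the firewall while the abstract metadata stream travels freely, which is the arrangement the Hybrid product is described as providing.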
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2fInGsW
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Sep 16, 2016 • 32min
Cathy O'Neil on Pernicious Machine Learning Algorithms and How to Audit Them
In this week's podcast InfoQ’s editor-in-chief Charles Humble talks to Data Scientist Cathy O’Neil. O'Neil is the author of the blog mathbabe.org. She is a former Director of the Lede Program in Data Practices at Columbia University Graduate School of Journalism, Tow Center, and was employed as a Data Science Consultant at Johnson Research Labs. O'Neil earned a mathematics Ph.D. from Harvard University. Topics discussed include her book “Weapons of Math Destruction,” predictive policing models, the teacher value-added model, approaches to auditing algorithms, and whether government regulation of the field is needed.
Why listen to this podcast:
- There is a class of pernicious big data algorithms that are increasingly controlling society but are not open to scrutiny.
- Flawed data can result in an algorithm that is, for instance, racist and sexist. For example, the data used in predictive policing models is racist. But people tend to be overly trusting of algorithms because they are mathematical.
- Data scientists have to make ethical decisions even if they don’t acknowledge it. Often problems stem from an abdication of responsibility.
- Auditing for algorithms is still a very young field with ongoing academic research exploring approaches.
- Government regulation of the industry may well be required.
Notes and links can be found on http://bit.ly/2eYVb9q
Weapons of math destruction
0m:43s - The central thesis of the book is that whilst not all algorithms are bad, there is a class of pernicious big data algorithms that are increasingly controlling society.
1m:32s - The classes of algorithm that O'Neil is concerned about - the weapons of math destruction - have three characteristics: they are widespread and impact on important decisions like whether someone can go to college or get a job, they are somehow secret so that the people who are being targeted don’t know they are being scored or don’t understand how their score is computed; and the third characteristic is they are destructive - they ruin lives.
2m:51s - These characteristics undermine the original intention of the algorithm, which is often trying to solve big society problems with the help of data.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2eYVb9q
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq

Aug 19, 2016 • 24min
John Langford on Vowpal Wabbit, Used by MSN, and Machine Learning in Industry
In this week's podcast QCon chair Wesley Reisz talks to machine learning research scientist John Langford. Topics include his machine learning system Vowpal Wabbit, which is designed to be very efficient and incorporates some of the latest algorithms in the space; Vowpal Wabbit is used for news personalisation on MSN. They also discuss how to get started in the field and its shift from academic research to industry use.
Why listen to this podcast:
- Vowpal Wabbit is an ML system that attempts to incorporate some of the latest machine learning algorithms.
- How to learn ML: take a class or two, get acquainted with learning theory, and practice.
- ML has moved from the research field into industry, with 4 out of 9 ICML tutorials coming from industry.
- It’s hard to predict when you have enough data.
- AlphaGo is a milestone in artificial intelligence. It uses reinforcement learning, deep learning and existing moves played by Go masters.
- Deep Learning is currently a disruptive technology in areas such as vision or speech recognition.
- What’s trendy: Neural Networks, Reinforcement and Contextual Learning.
- Machine Learning is being commoditized.
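Vowpal Wabbit's news-personalisation use case is built on contextual bandit learning, one of the trends mentioned above. A plain-Python epsilon-greedy bandit, far simpler than VW's actual algorithms and with invented article categories and click rates, illustrates the explore/exploit idea:

```python
import random

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy bandit: pick an arm (e.g. an article category),
    then learn from the observed reward (e.g. click / no click)."""
    def __init__(self, arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit

    def update(self, arm, reward):
        # Incremental mean of observed rewards for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = EpsilonGreedyBandit(["sports", "politics", "tech"])
clicks = {"sports": 0.2, "politics": 0.1, "tech": 0.5}  # hidden click rates
for _ in range(1000):
    arm = bandit.choose()
    bandit.update(arm, 1 if random.random() < clicks[arm] else 0)

# After enough rounds the bandit typically converges on the best arm.
print(max(bandit.values, key=bandit.values.get))
```

Unlike supervised learning, the learner only sees feedback for the arm it picked, which is why some exploration has to be kept in the loop.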
Notes and links can be found on http://bit.ly/2b4YNqQ
How to Approach Machine Learning
6m:12s To start learning Machine Learning, Langford recommends taking a class or two, mentioning the course by Andrew Ng and another course by Yaser S. Abu-Mostafa.
6m:50s It is recommended to get acquainted with learning theory to avoid some rookie mistakes.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2b4YNqQ
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. http://bit.ly/24x3IVq

Aug 1, 2016 • 43min
Shuman Ghosemajumder on Security and Cyber-Crime
In this week's podcast, professor Barry Burd talks to Shuman Ghosemajumder. Ghosemajumder is VP of product management at Shape Security and former click fraud czar for Google. Ghosemajumder is also the co-author of the book CGI Programming Unleashed, and was a keynote speaker at QCon New York 2016 presenting Security War Stories.
Why listen to this podcast:
- With more of our lives conducted online through technology and information retrieval systems, advanced technology gives criminals the opportunity to do things that they weren't able to do before.
- Cyber-criminals come from all over the world and every socioeconomic background, so long as there's some level of access to computers and technology.
- You see organised cyber-crime focusing on large companies because attacks against them are much more efficient.
- Cyber-criminals are getting creative, and coming up with ways to interact with websites we haven't thought of before.
- You can have very large scale attacks that are completely invisible from the point of view of the application that's being attacked.
- The context of what you are using the software for is more important than just going through an understanding of code-level vulnerabilities.
Notes and links can be found on http://bit.ly/2atBFgk
The People Behind Cyber-Crime
5:28 - There are all kinds of different personalities and demographics involved. Cyber-criminals come from all over the world and every socioeconomic background, so long as there's some level of access to computers and technology. Even in cases where a cyber criminal doesn't know how to use technology directly, or how to create something like a piece of malware, they can still be involved in a cyber-criminal's scheme.
6:29 - One scheme, which uses large groups of individuals who don't necessarily need technical skills themselves, is stealing money from bank accounts. Transferring money via malware on people’s machines, from one account to another account that the cyber-criminal controls, still involves getting that money out. That last step can involve a set of bank accounts that are assigned to real individuals.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2atBFgk
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. http://bit.ly/24x3IVq

Jul 22, 2016 • 33min
Caitie McCaffrey on Engineering Effectiveness, Diversity, & Verification of Distributed Systems
In this week's podcast, QCon chair Wes Reisz and Werner Schuster talk to Caitie McCaffrey. McCaffrey works on distributed systems with the engineering effectiveness team at Twitter, and has experience building the large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. McCaffrey's presentation at QCon New York was called The Verification of a Distributed System.
Why listen to this podcast:
- Twitter's engineering effectiveness team aims to help make dev tools better, and make developers happier and more efficient.
- Asking someone to speak at your conference or join your team solely because of their gender does more harm than people think.
- There is not one prescriptive way to make good, successful technology.
- Even when we don't have time for testing, there are options to increase your confidence in your system.
- The biggest problem when running a unit test is that it is only testing the input you hard code into the unit test.
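The hard-coded-input limitation mentioned above can be addressed with property-based or randomised testing: assert an invariant over many generated inputs instead of a few fixed ones. A stdlib-only sketch (dedicated libraries such as Hypothesis do this far more thoroughly, with shrinking of failing cases):

```python
import random

def clamp(value, low, high):
    """Function under test: restrict value to the [low, high] range."""
    return max(low, min(high, value))

# A conventional unit test exercises only the inputs we thought of:
assert clamp(5, 0, 10) == 5

# Randomised testing checks an invariant over many generated inputs.
random.seed(42)  # fixed seed so failures are reproducible
for _ in range(1000):
    low, high = sorted(random.randint(-100, 100) for _ in range(2))
    value = random.randint(-200, 200)
    result = clamp(value, low, high)
    assert low <= result <= high, (value, low, high, result)
```

Even this cheap approach raises confidence well beyond a handful of hand-picked examples, which fits the episode's point that there are options even when there is little time for testing.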
Notes and links can be found on http://bit.ly/2al6BRp
Engineering Effectiveness
1:24 - The purpose of the engineering effectiveness team is to help make dev tools better, and to make Twitter's developers happier and more efficient.
2:44 - The team is trying to make infrastructure so that not every team has to solve the distributed problem on their own, and give developers some APIs and tools so that they can build systems easily.
More on this:
Quick scan our curated show notes on InfoQ. http://bit.ly/2al6BRp
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. http://bit.ly/24x3IVq


