The InfoQ Podcast
InfoQ
Software engineers, architects and team leads have found inspiration to drive change and innovation in their teams by listening to the weekly InfoQ Podcast. They have received essential information that has helped them validate their software development roadmap. We achieve this by interviewing some of the top CTOs, engineers and technology directors from companies like Uber, Netflix and more. Over 1,200,000 downloads in the last 3 years.
Episodes
Mentioned books

Nov 8, 2019 • 28min
Victor Dibia on TensorFlow.js and Building Machine Learning Models with JavaScript
Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. On today’s podcast, Wes and Victor talk about the realities of building machine learning models in the browser. The two discuss the capabilities, limitations, process, and realities of using TensorFlow.js, and wrap up by discussing techniques like model distillation that may enable machine learning models to be deployed in smaller footprints, such as serverless.
- While there are limitations in running machine learning processes in a resource constrained environment like the browser, there are tools like TensorFlow.js that make it worthwhile. One powerful use case is the ability to protect the privacy of a user base while still making recommendations.
- TensorFlow.js takes advantage of the WebGL library for its more computationally intense operations.
- TensorFlow.js enables three workflows: training and scoring models (doing inference) purely in the browser; importing a model built offline with more traditional Python tools; and a hybrid approach that trains offline and fine-tunes online.
- To build a model offline, you can use TensorFlow with Python (perhaps on a GPU cluster). The model can be exported into the TensorFlow SavedModel format (or the Keras model format) and then converted with the TensorFlow.js converter into the TensorFlow.js web model format. At that point, the model can be imported directly into your JavaScript.
- TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models and was made available by the Google AI team. It can give developers a quick jumpstart into using trained models.
- Model compression promises to make models small enough to run in places we couldn’t run them before. Model distillation is a process where a smaller model is trained to replicate the behavior of a larger one. In one case, BERT (a model almost 500MB in size) was distilled to about 7MB (almost 60x compression).
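The distillation idea above can be sketched in a few lines: a student model is trained against the teacher's *softened* output distribution, produced by a temperature-scaled softmax. This is a library-free, illustrative sketch (the logits and temperature are made up), not the actual BERT distillation pipeline:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures yield softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Cross-entropy between the teacher's softened outputs and the student's.

    The soft targets carry more information than a one-hot label (e.g. which
    wrong classes the teacher considers plausible), which is what lets a much
    smaller student approximate the larger model's behaviour."""
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(soft_targets, student_probs))

teacher = [8.0, 2.0, 1.0]   # confident large model (made-up logits)
student = [4.0, 1.5, 1.0]   # smaller model being trained
loss = distillation_loss(teacher, student)
```

In a real training loop, this loss (often combined with the ordinary hard-label loss) is what the student minimises.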
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/32rWnab
You can also subscribe to the InfoQ newsletter to receive weekly updates on the hottest topics from professional software development. bit.ly/24x3IVq
Subscribe: www.youtube.com/infoq
Like InfoQ on Facebook: bit.ly/2jmlyG8
Follow on Twitter: twitter.com/InfoQ
Follow on LinkedIn: www.linkedin.com/company/infoq
Check the landing page on InfoQ: https://bit.ly/32rWnab

Nov 1, 2019 • 34min
Michelle Krejci on Moving to Microservices: Visualising Technical Debt, Kubernetes, and GraphQL
In this podcast, Daniel Bryant spoke to Michelle Krejci, service engineer lead at Pantheon, about the Drupal and WordPress webops company’s move to a microservices architecture. Michelle is a well-known conference speaker on technical leadership and continuous integration, and she shared her lessons learned over the past four years of the migration.
Why listen to this podcast:
- The backend for the Pantheon webops platform began as a Python-based monolith with a Cassandra data store. This architecture choice initially enabled rapid feature development as the company searched for product/market fit. However, as the company found success and began scaling their engineering teams, the ability to add new functionality rapidly to the monolith became challenging.
- Conceptual debt and technical debt greatly impact the ability to add new features to an application. Moving to microservices does not eliminate either of these forms of debt, but use of this architectural pattern can make it easier to identify and manage the debt, for example by creating well-defined APIs and boundaries between modules.
- Technical debt -- and the associated engineering toil -- is real debt, with a dollar value, and should be tracked and made visible to everyone.
- Establishing “quick wins” during the early stages of the migration towards microservices was essential. Building new business-focused services using asynchronous “fire and forget” event-driven integrations with the monolith helped greatly with this goal.
- Using containers and Kubernetes provided the foundations for rapidly deploying, releasing, and rolling back new versions of a service. Running multiple Kubernetes namespaces also allowed engineers to clone the production namespace and environment (without data) and perform development and testing within an individually owned sandboxed namespace.
- Using the Apollo GraphQL platform allowed schema-first development. Frontend and backend teams collaborated on creating a GraphQL schema, and then individually built their respective services using this as a contract. Using GraphQL also allowed easy mocking during development. Creating backward compatible schema allowed the deployment and release of functionality to be decoupled.
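The “fire and forget” integration style mentioned above can be sketched roughly as follows: the monolith emits an event to a broker and returns immediately, and a new microservice consumes the event at its own pace. All names, event shapes, and the in-process queue standing in for a broker are purely illustrative, not Pantheon's actual stack:

```python
import queue

# Stand-in for a message broker (e.g. a topic on a real event bus).
event_bus = queue.Queue()

def monolith_create_site(name):
    """The monolith does its core work, emits an event, and returns immediately.

    It does not wait for, or even know about, downstream consumers --
    that is the 'fire and forget' part."""
    site = {"name": name, "status": "created"}
    event_bus.put({"type": "site.created", "payload": site})
    return site

def billing_service_drain():
    """A new, separately deployed microservice consumes events asynchronously."""
    processed = []
    while not event_bus.empty():
        event = event_bus.get()
        if event["type"] == "site.created":
            processed.append(f"invoice for {event['payload']['name']}")
    return processed

site = monolith_create_site("example.com")
invoices = billing_service_drain()
```

Because the monolith never blocks on the consumer, new services can be added (or fail, or lag) without slowing down the existing code path.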

Oct 4, 2019 • 29min
Ryan Kitchens on Learning from Incidents at Netflix, the Role of SRE, and Sociotechnical Systems
In today’s podcast we sit down with Ryan Kitchens, a senior site reliability engineer and member of the CORE team at Netflix. This team is responsible for the entire lifecycle of incident management at Netflix, from incident response to memorialising an issue.
Why listen to this podcast:
- Top-level metrics can be used as a proxy for user experience, and can be used to determine which issues should be alerted on and investigated. For example, at Netflix, if the customer playback initiation “streams per second” metric declines rapidly, this may be an indication that something has broken.
- Focusing on how things go right can provide valuable insight into the resilience within your system, e.g. what are people doing every day that helps us overcome incidents? Finding sources of resilience is somewhat “the story of the incident you didn’t have”.
- When conducting an incident postmortem, simply reconstructing the incident is often not sufficient to determine what needs to be fixed; there is no single root cause in complex socio-technical systems such as those found at Netflix and most modern web-based organisations. Instead, teams must dig a little deeper and look for what went well, what contributed to the problem, and where the recurring patterns are.
- Resilience engineering is a multidisciplinary field that was established in the early 2000s, and the associated community that has emerged is both academic and deeply practical. Although much resilience engineering focuses on domains such as aviation, surgery and military agencies, there is much overlap with the domain of software engineering.
- Make sure that support staff within an organisation have a feedback loop into the product team, as these people providing support often know where all of the hidden problems are, the nuances of the systems, and the workarounds.
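A proxy-metric decline check like the “streams per second” example might look something like the sketch below. The window, threshold, and numbers are invented for illustration and are not Netflix's actual alerting logic:

```python
def sps_alert(history, current, drop_threshold=0.3):
    """Alert if the current streams-per-second value falls more than
    `drop_threshold` (30% here, an illustrative value) below the recent
    baseline average."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return current < baseline * (1 - drop_threshold)

# Recent per-interval streams-per-second readings (made-up numbers).
recent = [1000, 1020, 990, 1010]

alert = sps_alert(recent, 600)  # a rapid decline: investigate
```

The appeal of alerting on a metric like this is that it tracks what users actually experience, rather than any single service's internal health.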
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2LLwk8T

Sep 20, 2019 • 25min
Oliver Gould on the Three Pillars of Service Mesh, SMI, and Making Technology Bets
In this podcast we sit down with Oliver Gould, co-founder and CTO of Buoyant. Oliver has a strong background in networking, architecture and observability, and worked on solving associated technical challenges at both Yahoo! and Twitter. Oliver is a regular presenter at cloud and infrastructure conferences, and you can often find him and his co-founder William Morgan in the hallway track, waxing lyrical about service mesh -- a term they practically coined -- and trying to bring others along on the journey.
Service mesh technology is still young, and the ecosystem is still very much a work in progress, but there have been several interesting recent developments in this space. One of these was the announcement of the Service Mesh Interface (SMI) at the recent KubeCon EU in Barcelona. The SMI spec seeks to help service mesh integrators and implementers by providing an abstraction that removes the need to bet on any single service mesh implementation. This can be good for both tool makers and enterprise early adopters. Organisations like Microsoft and HashiCorp are working alongside the community, including Buoyant, to help define the SMI.
In this podcast we summarise the evolution of the service mesh concept, with a focus on the three pillars: visibility, security, and reliability. We explore the new traffic “tap” feature within Linkerd that allows near real-time in-situ querying of metrics, and discuss how to implement network security by leveraging primitives like the Service Account provided by Kubernetes. We also discuss how reliability features, such as retries, timeouts, and circuit breakers, are becoming table stakes for infrastructure platforms.
We also cover the evolution of the service mesh interface, explore how service meshes may impact development and platforms in the future, and briefly discuss some of the benefits offered by the Rust language in relation to building a data plane for Linkerd. We conclude the podcast with a discussion of the importance of community building.
Why listen to this podcast:
- A well-implemented service mesh can make a distributed software system more observable. Linkerd 2.0 supports both the emitting of mesh telemetry for offline analysis, and also the ability to “tap” communications and make queries dynamically against the data. The Linkerd UI currently makes use of the tap functionality.
- Linkerd aims to make the implementation of secure service-to-service communication easy, and it does this by leveraging existing Kubernetes primitives. For example, Service Accounts are used to bootstrap the notion of identity, which in turn is used as a basis for Linkerd’s mTLS implementation.
- Offering reliability is “table stakes” for any service mesh. A service mesh should make it easy for platform owners to offer fundamental service-to-service communication reliability to application owners.
- The future of software development platforms may move (back) to more PaaS-like offerings. Kubernetes-based function as a service (FaaS) frameworks like OpenFaaS and Knative are providing interesting features in this space. A service mesh may provide some of the glue for this type of platform.
- Working on the service mesh interface (SMI) specification allowed the Buoyant team to sit down with other community members like HashiCorp and Microsoft, and share ideas and identify commonality between existing service mesh implementations.
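One of the “table stakes” reliability features mentioned above, the circuit breaker, can be sketched minimally as follows. A service mesh implements this in the proxy, transparently to the application; this toy only illustrates the core state machine, and omits the time-based “half-open” recovery a real implementation would have:

```python
class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `max_failures` consecutive
    failures the circuit opens and further calls fail fast, protecting
    callers from piling requests onto an unhealthy upstream service."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, fn):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1  # count consecutive failures
            raise
        self.failures = 0  # any success closes the circuit again
        return result

breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise IOError("upstream unavailable")

# Two consecutive failures open the circuit.
for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass
```

Pushing this logic into the mesh's proxies means application owners get retries, timeouts, and breaking behaviour without writing any of it themselves.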
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2m5DSJ6

Sep 13, 2019 • 25min
Event Sourcing: Bernd Rücker on Architecting for Scale
Bernd Rücker, co-founder and chief technologist at Camunda, shares his expertise in workflow automation and event-driven architectures. He delves into the advantages of event sourcing for scalability and contrasts it with traditional RDBMS limitations. The discussion highlights the confusion between commands and events, the importance of sagas in maintaining consistency, and the need for effective orchestration in microservices. Bernd also touches on Zeebe's architecture, emphasizing its strong state guarantees and the role of event streaming for enhanced performance.

Sep 6, 2019 • 27min
Pat Kua on Technical Leadership, Cultivating Culture, and Career Growth
In this podcast we discuss a holistic approach to technical leadership, and Pat provides guidance on everything from defining target operating models, cultivating culture, and supporting people in developing the career they would like. There are a bunch of great stories, several book recommendations, and additional resources to follow up on.
* Cultivating organisational culture is much like gardening: you can’t force things, but you can set the right conditions for growth. The most effective strategy is to communicate the vision and goals, lead the people, and manage the systems and organisational structure.
* N26, a challenger bank based in Berlin, has experienced hypergrowth over the past two years: both the number of customers and the number of employees have increased more than threefold. This provides many opportunities for ownership of products and projects, and it creates unique leadership challenges.
* A target operating model (TOM) is a blueprint of a firm's business vision that aligns operating capacities with strategic objectives, and provides an overview of the core business capabilities, internal factors, external drivers, and strategic and operational levers. This should be shared widely within an organisation.
* Pat has curated a “trident” operating model for employee growth. In addition to the classic individual contributor (IC) and management tracks, he believes that a third “technical leadership” track provides many benefits.
* People can switch between these tracks as their personal goals change. However, this switch can be challenging, and an organisation must support any transition with effective training.
* Pat recommends the following books for engineers looking to make the transition to leadership: The Manager’s Path, by Camille Fournier; Resilient Management, by Lara Hogan; An Elegant Puzzle, by Will Larson; and Leading Snowflakes, by Oren Ellenbogen. Pat has also written his own book, Talking with Tech Leads.
* It is valuable to define organisation values upfront. However, these can differ from the actual culture, which is more about what behaviours you allow, encourage, and stop.
* Much like the values provided by Netflix’s Freedom and Responsibility model, Pat argues that balancing autonomy and alignment within an organisation is vital for success. Managers can help their team by clearly defining boundaries for autonomy and responsibility.
* Developing the skills to influence people is very valuable for leaders. Influence is based on trust, and this must be constantly cultivated. Trust is much like a bank account: if you don’t make regular deposits that build trust, you may find yourself overdrawn when you need to make a withdrawal. This can lead to bad will and defensive strategies being employed.

Sep 2, 2019 • 27min
Thomas Graf on Cilium, the 1.6 Release, eBPF Security, & the Road Ahead
Cilium is open source software for transparently securing the network connectivity between application services deployed using Linux container management platforms like Docker and Kubernetes. It is a CNI plugin that offers layer 7 features typically seen with a service mesh. On this week’s podcast, Thomas Graf (one of the maintainers of Cilium and co-founder of Isovalent) discusses the recent 1.6 release, some of the security questions/concerns around eBPF, and the future roadmap for the project.
Why listen to this podcast:
* Cilium brings eBPF to the cloud-native world. It works across both layer 4 and layer 7. While it started as a pure eBPF CNI plugin, the team discovered that caring only about ports was not enough from a security perspective.
* Cilium went 1.0 about a year and a half ago. 1.6 is the most feature-packed release of Cilium yet. Today, it has around 100 contributors.
* While Cilium can make it much easier to manage IPTables, Cilium overlaps with a service mesh in that it can do things like understand application protocols, HTTP routes, or even restrict access to specific tables in data stores.
* Cilium supports both in-kernel and sidecar deployments. For sidecar deployments, it can work with Envoy to switch between kernel-space and user-space code. The focus is on flexibility, performance, and low overhead.
* BPF (Berkeley Packet Filter) was initially designed to do filtering on data links. eBPF has the same roots, but it’s now used for system call filtering, tracing, sandboxing, etc. It has grown into a general-purpose way to safely extend the Linux kernel.
* Cilium has a multi-cluster feature built in. The 1.6 release can run in a kube-proxy-free configuration, and allows fine-grained network policies to run across multiple clusters without the use of IPTables.
* Cilium offers on-the-wire encryption using in-kernel encryption technology that enables mTLS across all traffic in your service fleet. The encryption is completely transparent to the application.
* eBPF has been used in all production environments at Facebook since May 2017, and is used at places like Netflix, Google, and Reddit. A lot of companies have an interest in eBPF being secure and production-ready, so there’s a lot of attention focused on finding and resolving any security issues that arise.
* 1.6 also released KVstore-free operation, socket-based load balancing, CNI chaining, Native AWS ENI mode, enhancements to transparent encryption, and more.
* The plan for 1.7 is to keep moving up the stack to the socket level (to offer things like load balancing and transparent encryption at scale), and likely to offer deeper security features such as process-aware security policies for internal pod traffic.
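To illustrate why the layer 7 awareness mentioned above matters, here is a toy policy function: the verdict depends on HTTP fields as well as the port, which a ports-only (layer 4) filter cannot express. This is plain Python standing in for the concept only; real eBPF is verified bytecode running inside the kernel, and the field names here are invented:

```python
def l7_policy(packet):
    """Toy stand-in for a Cilium-style layer 7 policy.

    A pure layer 4 rule could only say 'allow port 80'; this one also
    restricts *which* HTTP requests on port 80 are allowed."""
    if packet["dst_port"] != 80:
        return "drop"   # layer 4 check: only HTTP traffic is permitted
    if packet["method"] == "GET" and packet["path"].startswith("/public/"):
        return "allow"  # layer 7 check: read-only access to /public/ only
    return "drop"

packets = [
    {"dst_port": 80, "method": "GET",    "path": "/public/index.html"},
    {"dst_port": 80, "method": "DELETE", "path": "/public/index.html"},
    {"dst_port": 22, "method": "GET",    "path": "/public/index.html"},
]
verdicts = [l7_policy(p) for p in packets]
```

The second packet would be allowed by a ports-only rule; it is only the protocol-aware check that drops the destructive DELETE.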
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2HCGnLa

Aug 28, 2019 • 32min
Yuri Shkuro on Tracing Distributed Systems Using Jaeger
The three pillars of observability are logs, metrics, and tracing. Most teams are able to handle logs and metrics, while proper tracing can still be a challenge. On this podcast, we talk with Yuri Shkuro, the creator of Jaeger, author of the book Mastering Distributed Tracing, and a software engineer at Uber, about how the Jaeger tracing backend implements the OpenTracing API to handle distributed tracing.
Why listen to the podcast:
- Jaeger is an open-source tracing backend, developed at Uber. It also has a collection of libraries that implement the OpenTracing API.
- At a high level, Jaeger is very similar to Zipkin, but Jaeger has features not available in Zipkin, including adaptive sampling and advanced visualization tools in the UI.
- Tracing is less expensive than logging because data is sampled. It also gives you a complete view of the system. You can see a macro view of the transaction, and how it interacted with dozens of microservices, while still being able to drill down into the details of one service.
- If you have only a handful of services, you can probably get away with logging and metrics, but once the complexity increases to dozens, hundreds, or thousands of microservices, you must have tracing.
- Tracing does not work with a black-box approach to the application. You can't simply use a service mesh and then add a tracing framework. You need correlation between a single request and all the subsequent requests that it generates, and a service mesh still relies on the underlying components handling that correlation.
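The correlation requirement in the last point can be sketched as follows: a trace ID is minted at the edge, every downstream hop must carry it and record its parent span, and that is what lets a backend like Jaeger stitch the spans into one trace. This is a toy model of the idea, not the OpenTracing API:

```python
import uuid

def start_trace():
    """The edge service mints a trace ID; every downstream hop must carry it
    (in practice, propagated via request headers)."""
    return {"trace_id": uuid.uuid4().hex, "spans": []}

def traced_call(ctx, service, parent_span=None):
    """Each service records a span referencing the shared trace ID and its
    parent span -- the correlation a black-box approach cannot provide."""
    span = {
        "span_id": uuid.uuid4().hex[:8],
        "service": service,
        "trace_id": ctx["trace_id"],
        "parent": parent_span,
    }
    ctx["spans"].append(span)
    return span

# A request fanning out across three (illustrative) services.
ctx = start_trace()
edge = traced_call(ctx, "api-gateway")
auth = traced_call(ctx, "auth-service", parent_span=edge["span_id"])
db = traced_call(ctx, "user-db", parent_span=auth["span_id"])
```

Because every span shares the trace ID and names its parent, the backend can reconstruct the full macro view of the transaction and still let you drill into any single service.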
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2ZlvMyR

Aug 19, 2019 • 29min
Louise Poubel on the Robot Operating System
ROS is the Robot Operating System. It’s been used by thousands of developers to prototype and create robotic applications. ROS can be found on robots in warehouses, in self-driving car companies, and on the International Space Station. Louise Poubel is an engineer working with Open Robotics. Today on the podcast, she talks about what it takes to develop software that moves in physical space, including the sense-think-act cycle, the developer experience, and the architecture of ROS.
Why listen to this podcast:
- When writing code for robots, you use the sense-think-act cycle.
- ROS is an SDK for robotics. It provides a communication layer that enables data to flow between nodes that handle sensors, logic, and actuation.
- ROS has been around for twelve years and has two major versions. ROS 1 was implemented entirely in C++. ROS 2 offers a common C layer with implementations in many different languages, including Java, JavaScript, and Rust.
- ROS 2 is released on a six-month cadence; Dashing is the latest release (May 2019). Previous releases were supported for one year; Dashing is the first LTS release and will be supported for two years.
- ROS 2 builds on top of the Data Distribution Service (DDS) standard found in mission-critical systems like nuclear power plants and aircraft.
- Simulation is an important step in robotics. It allows you to prototype a system before deploying to a physical system.
- Rviz is a three-dimensional visualizer used to visualize robots, the environments they work in, and sensor data. It is a highly configurable tool, with many different types of visualizations and plugins. It allows you to put together all your data in one place and see it.
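The sense-think-act cycle mentioned above can be illustrated with a deliberately tiny simulation: a one-dimensional robot driving toward an obstacle. None of this is real ROS code; the world model, distances, and stopping rule are all invented for illustration:

```python
def sense(world, position):
    """Read a (simulated) range sensor: distance to the nearest obstacle ahead."""
    obstacles = [x for x in world if x > position]
    return min(obstacles) - position if obstacles else float("inf")

def think(distance, safe_distance=2):
    """Decide: keep moving while the path ahead is clear, otherwise stop."""
    return "forward" if distance > safe_distance else "stop"

def act(position, command):
    """Apply the command to the (simulated) motors: move one unit or hold."""
    return position + 1 if command == "forward" else position

world = [10]   # a single obstacle at x=10 (illustrative)
position = 0

for _ in range(20):                  # the sense-think-act loop
    distance = sense(world, position)
    command = think(distance)
    position = act(position, command)
```

In a real ROS system, each stage is typically a separate node, and the communication layer moves the sensor data and commands between them.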
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/2ZbOvrO

Aug 9, 2019 • 41min
Matt Klein on Envoy Mobile, Platform Complexity, and a Universal Data Plane API for Proxies
In this podcast we sit down with Matt Klein, software plumber at Lyft and creator of Envoy, and discuss topics including the continued evolution of the popular proxy, the strength of the open source Envoy community, and the value of creating and implementing standards throughout the technology stack. We also explore the larger topic of cloud-native platforms, and discuss the tradeoffs between a simple, opinionated platform and something bespoke and more configurable, but also more complex. Related to this, Matt shares his thoughts on when and how to make the decision within an organisation to embrace technology like container orchestration and service meshes.
Finally, we explore the creation of the new Envoy Mobile project. The goal of this project is to expand the capabilities provided by Envoy all the way out to mobile devices powered by Android and iOS. For example, most current user-focused traffic shifting that is conducted at the edge is implemented with coarse-grained approaches via BGP and DNS, and using something like Envoy within mobile app networking stacks should allow finer-grained control.
Why listen to this podcast:
- The Envoy Proxy community has grown from strength-to-strength over the last year, from the inaugural EnvoyCon that ran alongside KubeCon NA 2018, to the increasing number of code contributions from engineers working across the industry
- Attempting to create a community-driven “universal proxy data plane” with clearly defined APIs, like Envoy’s xDS API, has let vendors collaborate on a shared abstraction while still leaving room for “differentiated success” to be built on top of this standard
- Google’s gRPC framework is adopting the Envoy xDS APIs, as this will allow both Envoy and gRPC instances to be operated via a single control plane, for example, Google Cloud Platform’s Traffic Director service.
- There is a tendency within the software development industry to fetishise architectures that are designed and implemented by the unicorn tech companies, but not every organisation operates at this scale.
- However, there has also been industry pushback against the complexity that modern platform components like container orchestration and service meshes can introduce to a technology stack.
- A platform built from these components provides the best return on investment when an organisation’s software architecture and development teams have reached a certain size.
- Function-as-a-Service (FaaS)-type platforms will most likely be how engineers interact with software platforms in the future. Business-focused developers often do not want to interact with the platform plumbing.
- Envoy Mobile is building on prior art, and aims to expand the capabilities provided by Envoy all the way out to mobile devices running Android and iOS. Most current end-user traffic shifting is implemented with coarse-grained approaches via BGP and DNS, and using something like Envoy instead will allow finer-grained control.
- Using Envoy Mobile in combination with Protocol Buffers 3, which supports annotations on APIs, can facilitate working with APIs offline, configuring caching, and handling poor networking conditions. One of the motivations for this work is that small increases in application response times can lead to better business outcomes.
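The finer-grained control discussed above can be illustrated with a deterministic client-side routing sketch: hash each user ID into a bucket and send a configurable percentage of users to a canary backend. That is a per-request, per-user decision that DNS or BGP shifting (which moves whole resolvers or prefixes at once) cannot make. The names and percentages are illustrative, not Envoy Mobile's API:

```python
import zlib

def route(user_id, canary_percent=10):
    """Deterministically send `canary_percent`% of users to a canary backend.

    CRC32 gives a stable hash, so a given user always lands in the same
    bucket and gets a consistent experience across requests."""
    bucket = zlib.crc32(user_id.encode()) % 100
    return "canary" if bucket < canary_percent else "primary"

assignments = {uid: route(uid) for uid in ("alice", "bob", "carol")}
```

Dialing `canary_percent` up or down shifts traffic gradually, and because the decision runs in the client networking stack, it can also react to conditions (latency, failures) that the edge never sees.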
More on this: Quick scan our curated show notes on InfoQ https://bit.ly/33nlGMu


