

KubeFM
Discover all the great things happening in the world of Kubernetes, learn (controversial) opinions from the experts and explore the successes (and failures) of running Kubernetes at scale.
Episodes

Apr 7, 2026 • 30min
Intelligent Kubernetes Load Balancing, with Rohit Agrawal
You're running gRPC services in Kubernetes, load balancing looks fine on the dashboard — but some pods are burning at 80% CPU while others sit idle, and adding more replicas only partially helps.

Rohit Agrawal, a Staff Software Engineer on the traffic platform team at Databricks, explains why this happens and how his team replaced Kubernetes's default networking with a proxy-less, client-side load-balancing system built on the xDS protocol.

In this episode:
- Why kube-proxy's Layer 4 routing breaks down under high-throughput gRPC: it picks a backend once per TCP connection, not per request
- How Databricks built an Endpoint Discovery Service (EDS) that watches Kubernetes directly and streams real-time pod metadata to every client
- How zone-aware spillover cut cross-availability-zone costs without sacrificing availability
- Why CPU-based routing failed (monitoring lag creates oscillation) and what signals to use instead

The system has been running in production for three years across hundreds of services, handling millions of requests.

Sponsor: This episode is sponsored by LearnKube — get started on your Kubernetes journey through comprehensive online, in-person or remote training. Interested in sponsoring an episode?

Find all the links and info for this episode here: https://ku.bz/y803JMhBk
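One common building block for per-request gRPC balancing (far simpler than the xDS system described above, but illustrating the same idea) is a headless Service, which lets clients resolve individual pod IPs instead of a single kube-proxy virtual IP. The name and port are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-grpc   # illustrative name, not from the episode
spec:
  clusterIP: None     # headless: DNS returns every ready pod IP, not one VIP
  selector:
    app: orders
  ports:
  - name: grpc
    port: 8080
```

A gRPC client pointed at this DNS name with a round_robin load-balancing policy spreads calls across all backends rather than pinning every request to one TCP connection.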

Mar 31, 2026 • 29min
That Time I Found a Service Account Token in my Log Files, with Vincent von Büren
You're integrating HashiCorp Vault into your Kubernetes cluster and adding a temporary debug log line to check whether the ServiceAccount token is being passed correctly. Three months later, that log line is still in production — and the token it prints has a 1-year expiry with no audience restrictions.

Vincent von Büren, a platform engineer at ipt in Switzerland, lived through exactly this incident. In this episode, he breaks down why default Kubernetes ServiceAccount tokens are a quiet security risk hiding in plain sight.

You will learn:
- What's actually inside a Kubernetes ServiceAccount JWT (issuer, subject, audience, and expiry)
- Why tokens with no audience scoping enable replay attacks across internal and external systems
- How Vault's Kubernetes auth method and JWT auth method compare, and when to choose each
- What projected tokens are, why they dramatically reduce blast radius, and what's holding teams back from using them
- Practical steps for auditing which pods actually need API access and disabling auto-mounting everywhere else

Find all the links and info for this episode here: https://ku.bz/LTnB_Ntbc
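The projected-token mitigation discussed in the episode can be sketched in a pod spec; the audience, expiry, names, and image below are illustrative assumptions, not the actual configuration from the incident:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vault-client                    # illustrative
spec:
  automountServiceAccountToken: false   # opt out of the legacy long-lived token
  containers:
  - name: app
    image: example/app:latest           # illustrative image
    volumeMounts:
    - name: vault-token
      mountPath: /var/run/secrets/tokens
      readOnly: true
  volumes:
  - name: vault-token
    projected:
      sources:
      - serviceAccountToken:
          audience: vault               # rejected by any consumer other than Vault
          expirationSeconds: 600        # 10 minutes instead of 1 year
          path: token
```

Even if a token like this leaks into a log file, it expires in minutes and is useless against any system that isn't the named audience.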

Mar 24, 2026 • 46min
GPU Containers as a Service, with Landon Clipp
Running GPU workloads on Kubernetes sounds straightforward until you need to isolate multiple tenants on the same server. The moment you virtualize GPUs for security, you lose access to NVIDIA kernel drivers — and almost every tool in the ecosystem assumes those drivers exist.

Landon Clipp built a GPU-based Containers as a Service platform from scratch, solving each isolation layer — from kernel separation with Kata Containers + QEMU to NVLink fabric partitioning to network policies with Cilium/eBPF — and shares exactly what broke along the way.

In this interview:
- Why standard NVIDIA tooling (GPU Operator) fails in multi-tenant setups, and how to use CDI with PCI topology scanning to make GPUs visible to Kubernetes without kernel drivers
- How to partition the NVLink fabric between tenants using a trusted service VM running Fabric Manager, and why the physical PCIe wiring differs between Supermicro HGX and NVIDIA DGX systems
- Why gVisor doesn't work for GPU workloads — NVIDIA's unstable ioctl ABI means Google has to update gVisor for every driver release, and they only support a handful of GPUs
- What caused 8-GPU VMs to take 30+ minutes to boot, and the specific fixes (IOMMUFD, cold plugging, kernel upgrades) that brought it down to minutes
- How Cilium network policies enforce tenant isolation at the Kubernetes identity level instead of fragile IP-based rules
- Where Containers as a Service fits best: inference workloads where AI teams want to ship an OCI image without managing infrastructure or signing multi-million dollar cluster contracts

Find all the links and info for this episode here: https://ku.bz/jjK_yJTDz
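Kata Containers is typically wired into Kubernetes through a RuntimeClass. A minimal sketch, assuming containerd has a `kata` runtime handler registered on the node (names and image are illustrative, not the platform's actual config):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-qemu
handler: kata                   # must match the handler registered in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-inference        # illustrative
spec:
  runtimeClassName: kata-qemu   # this pod runs under its own guest kernel via QEMU
  containers:
  - name: model
    image: example/inference:latest   # illustrative image
```

Selecting the RuntimeClass per pod is what lets trusted system workloads stay on runc while tenant workloads get hardware-virtualized kernel separation.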

Mar 17, 2026 • 21min
How We Cut Build Debugging Time by 75% with AI, with Ron Matsliah
Build failures in Kubernetes CI/CD pipelines are a silent productivity killer. Developers spend 45+ minutes scrolling through cryptic logs, often just hitting rerun and hoping for the best.

Ron Matsliah, DevOps engineer at Next Insurance, built an AI-powered assistant that cut build debugging time by 75% — not as a dashboard, but delivered directly in Slack where developers already work.

In this episode:
- Why combining deterministic rules with AI produces better results than letting an LLM guess alone
- How correlating Kubernetes events with build logs catches spot instance terminations that produce misleading errors
- Why integrating into existing workflows and building feedback loops from day one drove adoption
- The prompt engineering lessons learned from testing with real production data instead of synthetic examples

The takeaway: simple rules plus rich context consistently outperform complex AI queries on their own.

Find all the links and info for this episode here: https://ku.bz/PDdYfC00w
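The rules-before-LLM idea can be sketched in a few lines of Python; the patterns and diagnosis labels here are hypothetical illustrations, not Next Insurance's actual rules:

```python
import re

# Ordered (pattern, diagnosis) rules checked before any LLM call.
# Patterns and labels are illustrative, not the actual production rules.
RULES = [
    (re.compile(r"node .* not found|spot.*(interrupt|terminat)", re.I),
     "spot-termination"),
    (re.compile(r"OOMKilled|exit code 137", re.I), "out-of-memory"),
    (re.compile(r"manifest unknown|pull access denied", re.I), "image-pull"),
]

def classify(log: str) -> str:
    """Return a deterministic diagnosis, or defer to an LLM."""
    for pattern, diagnosis in RULES:
        if pattern.search(log):
            return diagnosis
    return "llm-fallback"  # only ambiguous logs reach the (slow, expensive) model

print(classify("Pod evicted: spot instance terminated by cloud provider"))
```

Deterministic matches are instant and never hallucinate; the model only sees the long tail of genuinely ambiguous failures.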

Mar 10, 2026 • 25min
Migrating Kubernetes Off Big Cloud, with Fernando Duran
Managed Kubernetes on a major cloud provider can cost hundreds or even thousands of dollars a month — and much of that spending hides behind defaults, minimum resource ratios, and auxiliary services you didn't ask for.

Fernando Duran, founder of SadServers, shares how his GKE Autopilot proof of concept ran close to $1,000/month on a fraction of the CPU of the actual workload, and how he cut that to roughly $30/month by moving to Hetzner with Edka as a managed control plane.

In this interview:
- Why Kubernetes hasn't delivered on its original promise of cost savings through bin packing — and what it actually provides instead
- A real cost comparison: $1,000/month on GKE vs. $30/month on Hetzner with Edka for the same nominal capacity
- What you need to bring with you (observability, logging, dashboards) when leaving a fully managed cloud provider

The decision comes down to how tightly coupled you are to cloud-specific services and whether your team can spare the cycles to manage the gaps.

Find all the links and info for this episode here: https://ku.bz/6nSDbz9m4
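The arithmetic behind the headline numbers, using the figures reported in the episode:

```python
# Reported figures: ~$1,000/month on GKE Autopilot vs. ~$30/month
# on Hetzner with Edka, for the same nominal capacity.
gke_monthly = 1_000
hetzner_monthly = 30

savings = gke_monthly - hetzner_monthly
reduction = savings / gke_monthly
print(f"${savings}/month saved, a {reduction:.0%} reduction")
```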

Mar 3, 2026 • 1h 2min
Migrating to Karpenter: Fun Stories, with Adhi Sutandi
Running multiple Kubernetes clusters on AWS with the cluster autoscaler? Every four months, you face the same grind: upgrading Kubernetes versions, recreating auto scaling groups, and hoping instance type changes stick.

Adhi Sutandi, DevOps Engineer at Beekeeper by LumApps, shares how his team migrated from the cluster autoscaler to Karpenter across eight EKS clusters — and the hard lessons they learned along the way.

In this episode:
- Why AWS auto scaling groups are immutable and how that creates upgrade bottlenecks at scale
- How the "latest" AMI tag accidentally turned less critical clusters into chaos engineering environments, dropping SLOs before anyone realized Karpenter was the cause
- Why pre-stop sleep hooks solved pod restartability problems that Quarkus's built-in graceful shutdown couldn't
- The case for pod disruption budgets over Karpenter annotations when protecting critical workloads during node rotations
- How Karpenter's implicit 10% disruption budget caught the team off guard — and the explicit configuration that fixed it

Find all the links and info for this episode here: https://ku.bz/XyVfsSQPr
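Two of the protections mentioned fit in a few lines of YAML: a PodDisruptionBudget that caps voluntary evictions during node rotation, and a pre-stop sleep hook that delays SIGTERM so in-flight traffic drains first. Names, image, and numbers are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb          # illustrative
spec:
  minAvailable: 2             # Karpenter's voluntary disruptions must respect this
  selector:
    matchLabels:
      app: payments
---
# In the workload's pod template:
apiVersion: v1
kind: Pod
metadata:
  name: payments              # illustrative
  labels:
    app: payments
spec:
  containers:
  - name: app
    image: example/payments:latest   # illustrative image
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]   # let endpoint deregistration finish first
```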

Feb 24, 2026 • 38min
From ECS to Kubernetes: A Real Migration Story, with Radosław Miernik
Migrating from ECS to Kubernetes sounds straightforward — until you hit spot capacity failures, firewall rules silently dropping traffic, and memory metrics that lie to your autoscaler.

Radosław Miernik, Head of Engineering at aleno, walks through a real production migration: what broke, what they missed, and the fixes that made it work.

In this interview:
- Running Flux and Argo CD together — Flux for the infra team, Argo CD's UI for developers who don't want to touch YAML
- How the wrong memory metric caused OOM errors, and why switching to jemalloc cut memory usage by 20%
- Splitting WebSocket and API containers into separate deployments with independent autoscaling

Four months of migration, over 100 configuration changes in the first month, and a concrete breakdown of what platform work looks like when you can't afford downtime.

Find all the links and info for this episode here: https://ku.bz/x6wFMhVsx
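Switching a container to jemalloc is often done by preloading the library rather than rebuilding the application. A sketch, assuming a Debian-based image with libjemalloc installed (this is not necessarily aleno's setup):

```yaml
# Container spec fragment: route malloc/free through jemalloc via LD_PRELOAD
env:
- name: LD_PRELOAD
  value: /usr/lib/x86_64-linux-gnu/libjemalloc.so.2   # Debian/Ubuntu path; differs on Alpine
```

The library must exist in the image (e.g. installed via the distro package manager), and the exact .so path varies by base image and architecture.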

Feb 17, 2026 • 22min
Faster EKS Node and Pod Startup, with Jan Ludvik
Kubernetes nodes on EKS can take over a minute to become ready, and pods often wait even longer — but most teams never look into why.

Jan Ludvik, Senior Staff Reliability Engineer at Outreach, shares how he cut node startup from 65 to 45 seconds and reduced P90 pod startup by 30 seconds across ~1,000 nodes — by tackling overlooked defaults and EBS bottlenecks.

In this episode:
- Why the kubelet's serial image pull default quietly blocks pod startup, and how parallel pulls fix it
- How EBS lazy loading can silently negate image caching in AMIs — and the critical path workaround
- A Lambda-based automation that temporarily boosts EBS throughput during startup, then reverts to save cost
- The kubelet metrics and logs that expose pod and node startup latency, which most teams never monitor

Every second saved translates to faster scaling, lower AWS bills, and better end-user experience.

Find all the links and info for this episode here: https://ku.bz/B7TzKXyxf
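Parallel image pulls are a kubelet setting. Both fields below exist in KubeletConfiguration (maxParallelImagePulls needs Kubernetes 1.27+); the value 5 is an illustrative choice, not the episode's:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   # default is true: one image pull at a time per node
maxParallelImagePulls: 5     # cap concurrency so disk and network aren't saturated
```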

Feb 10, 2026 • 28min
Kubernetes is not just for Black Friday, with Thibault Martin
You self-host services at home, but upgrades break things, rollbacks require SSH-ing in to kill containers manually, and there's no safety net if your hardware fails.

Thibault Martin, Director of Program Development at the Matrix Foundation, walked this exact path — from Docker Compose to Podman with Ansible to Kubernetes on a single server — and explains why each transition happened and what it solved.

In this interview:
- Why Ansible's declarative promise fell short with the Podman collection, forcing sequential imperative steps instead of desired-state definitions
- How community Helm charts replace the need to write and maintain every manifest yourself
- Why GitOps isn't just a deployment workflow — it's a disaster recovery strategy when your infrastructure lives in your living room
- How k3s removes the barrier to entry by bundling opinionated defaults so you can skip choosing CNI plugins and storage providers

Kubernetes doesn't have to be enterprise-scale — with the right distribution and community tooling, it can be a practical, low-overhead choice for anyone who cares about their data.

Find all the links and info for this episode here: https://ku.bz/Xk5S7VqXz
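k3s bundles a small Helm controller, so installing a community chart is a single committed custom resource. Chart name, repo URL, and values below are placeholders, not a specific chart from the episode:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: my-app                       # placeholder chart name
  namespace: kube-system             # k3s's Helm controller watches this namespace
spec:
  repo: https://charts.example.com   # placeholder repo URL
  chart: my-app
  targetNamespace: default
  valuesContent: |-
    persistence:
      enabled: true
```

Keeping resources like this in Git is what makes the homelab recoverable: a fresh k3s install plus the repo rebuilds everything.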

Feb 3, 2026 • 25min
Patroni Backups: when pgBackRest and Argo CD have your back (literally), with Ziv Yatzik
Your database backup strategy shouldn't be the thing that takes your production systems down.

Ziv Yatzik manages 600+ Postgres clusters in a closed network environment with no public cloud. After existing backup solutions proved unreliable — causing downtime when disks filled up — his team built a new architecture using pgBackRest, Argo CD, and Kubernetes CronJobs.

In this episode:
- Why storing WAL files on shared NAS storage prevents backup failures from cascading into database outages
- How GitOps with Argo CD lets them manage backups for hundreds of clusters by adding a single YAML file
- The Ansible + Kubernetes hybrid approach that keeps VM-based Patroni clusters in sync with Kubernetes-orchestrated backups

A practical blueprint for making database backups boring, reliable, and safe.

Find all the links and info for this episode here: https://ku.bz/Rg_sQYSmw
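The CronJob half of such a setup might look like the sketch below; the image, stanza name, and schedule are illustrative, and a real job would also mount pgBackRest's config and repository storage:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-main-full-backup    # illustrative
spec:
  schedule: "0 2 * * 0"        # weekly full backup, Sunday 02:00
  concurrencyPolicy: Forbid    # never overlap with a still-running backup
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: pgbackrest
            image: example/pgbackrest:latest   # illustrative image
            command: ["pgbackrest", "--stanza=main", "--type=full", "backup"]
```

One manifest per cluster is what makes the Argo CD workflow scale: onboarding a new database's backups is a single file in Git.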


