The DevSecOps Talks Podcast

Mattias Hemmingsson, Julien Bisconti and Andrey Devyatkin
May 6, 2020 • 1h 1min

DEVSECOPS Talks #6-2020 - SemVer or not to SemVer

This time Johan Abildskov, a Senior Consultant with Praqma/Eficode, joins us to talk about SemVer (Semantic Versioning), and we finally get to hear what Julien has to say about it. We get to explore different options regarding versioning and how it helps humans communicate. At the end of the podcast, everyone gets to share their approach and recommendations for versioning things.
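For a flavor of what SemVer's precedence rules actually require, here is a minimal Python sketch (an illustration only; it ignores build metadata and assumes well-formed version strings):

```python
# Sketch of SemVer precedence: MAJOR.MINOR.PATCH compare numerically; a
# pre-release (e.g. 1.0.0-alpha) sorts before the release it precedes, and
# numeric pre-release identifiers rank below alphanumeric ones.
def semver_key(version):
    core, _, pre = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not pre:
        release_key = (1,)  # a full release outranks any of its pre-releases
    else:
        release_key = (0,) + tuple(
            (0, int(p)) if p.isdigit() else (1, p) for p in pre.split(".")
        )
    return (major, minor, patch, release_key)

assert semver_key("1.0.0-alpha") < semver_key("1.0.0")
assert semver_key("1.0.0-alpha") < semver_key("1.0.0-alpha.1")
assert semver_key("1.2.10") > semver_key("1.2.9")
```

Sorting version strings with this key reproduces the precedence ordering from the SemVer specification for the common cases, which is usually enough for reasoning about ordering in CI scripts.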
Apr 7, 2020 • 1h 1min

DEVSECOPS Talks #5-2020 - What we have been working on

We had a couple of possible topics for this episode, but before getting started with them we decided to discuss the technological problems we had been solving over the last two weeks. Well, it turns out there was quite a lot to discuss. Tune in for tips on SSH session logging on the SSH server, preventing downloads from AWS S3 even if you have read access, credentials in a Git repository 🤦, why you should (or should not) do K8s, and more.

Summary

In this free-form early episode of DevSecOps Talks, a casual "what have you been up to" catch-up turns into a sharp exchange on the gap between security in theory and security in practice. One host discovers plaintext service account keys, database passwords, and a production SSH tunnel all committed straight into a Git repository — and the team walks through how to unwind that without breaking delivery. Julien Bisconti argues that security tooling is fundamentally failing developers because it is too hard to use under real delivery pressure. The episode also delivers strong opinions on why teams should not default to Kubernetes, the hidden complexity of S3 encryption with KMS keys, and why Google's BeyondCorp model makes VPNs look like a relic.

Key Topics

SSH session logging, bastion hosts, and compliance visibility

The episode opens with a deep dive into SSH session logging for bastion hosts in AWS. One of the hosts explains how AWS Systems Manager Session Manager can be used to access instances without VPNs or direct inbound connectivity — the SSM agent on each instance calls home to AWS, and AWS proxies the connection back. That model is attractive for hybrid and on-prem environments because it removes networking complexity around NAT, port forwarding, and VPN setup. It also provides session logging, IAM-based access control, and command output recording. But the drawbacks surface quickly. Session Manager logs users in as a generic SSM agent user with /usr/bin as the working directory.
Documentation is sparse, and Bash is launched in shell mode to support color interpretation, which pollutes session logs with escape characters. A bigger concern is that access control rests entirely on IAM credentials — in an environment with fully dynamic, short-lived credentials that is manageable, but it becomes risky anywhere static keys exist. The host describes trying to map Session Manager logins to individual users, only to find that it requires static IAM identities with specially named tags containing usernames — a non-starter for environments where everything is dynamic.

That leads into alternative approaches. An AWS blog post describes forcing SSH connections through the Unix script utility to record sessions, then uploading logs to S3. But even that is fragile: logs are owned by the user, so technically the user can delete or overwrite them. A more robust path is tlog, a terminal I/O logger that writes session data in JSON format to the systemd journal, where it cannot be easily tampered with. From there, the CloudWatch agent can export journal data to S3 for long-term storage. The broader point is that command logging sounds simple in compliance conversations, but in practice it becomes a deep rabbit hole full of bypasses, noise, and design tradeoffs.

Monitoring user activity without drowning in logs

The hosts compare notes on monitoring shell activity. One host mentions using auditd to track user actions on bastion hosts in a previous environment, but the log volume was overwhelming — even Elasticsearch struggled to keep up with the ingestion rate. That sparks a discussion around anomaly detection and heuristics. The real challenge is not collecting logs but determining what is unusual and worth investigating. Failed SSH login alerts are mentioned as a useful signal, though another host pushes back: "Should you have SSH with the password at all? You should have a key."
The point stands — without careful tuning, even sensible alerts generate noise faster than teams can act on them. The exchange captures a recurring DevSecOps reality: collecting telemetry is the easy part; turning it into something actionable is where most teams get stuck.

S3 bucket security, public access controls, and KMS encryption surprises

The conversation shifts to AWS S3 security. Public buckets remain a common source of breaches, but AWS now offers S3 Block Public Access — account- and bucket-level settings that prevent public access regardless of individual object ACLs. In Terraform, this is a dedicated resource block. The more nuanced insight is about encryption. The host explains the difference between S3 server-side encryption with the default AWS-managed key (SSE-S3) and encryption with a customer-managed KMS key (SSE-KMS). With SSE-S3, S3 decrypts objects transparently for any client with read access to the bucket. With a customer-managed KMS key, S3 cannot decrypt the object unless the requester also has kms:Decrypt permission on that specific key.

This became a real problem in a cross-account, cross-region workflow involving Go Lambda binaries. Go Lambdas require the deployment artifact to reside in the same region as the function. The team was copying artifacts between accounts and regions, had granted S3 read permissions, but downloads kept failing. CloudTrail logs revealed the real culprit: "I cannot decrypt." The consumers lacked KMS key access. In that case, the fix was switching to SSE-S3 since the artifacts did not require the stronger protection of a customer-managed key. The host is careful to note that AWS documentation on cross-account S3 access does not prominently flag this encryption interaction — a gap that can cost teams hours of debugging.
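The Terraform pieces mentioned above look roughly like this (a sketch with placeholder resource names; the standalone encryption resource assumes AWS provider v4 or later):

```hcl
# Block all four public-access vectors on the bucket (names are placeholders).
resource "aws_s3_bucket_public_access_block" "artifacts" {
  bucket                  = aws_s3_bucket.artifacts.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# SSE-KMS: anyone reading these objects also needs kms:Decrypt on this key.
resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.artifacts.arn
    }
  }
}
```

Note how the encryption rule is the trap discussed in the episode: granting s3:GetObject on the bucket is not enough once a customer-managed key is in play.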
Plaintext secrets in Git: a frighteningly common anti-pattern

One of the most memorable segments comes when a host describes reviewing an application stack and finding service account keys committed in cleartext in the repository root. The repository also contained a large configuration file with usernames, passwords, API credentials for mail services, login providers, and multiple environments (dev, prod) — all in plain text. But the worst part: for local development, the team SSH-tunneled into the production SQL server, mapping remote port 3306 to local port 3307. An SSH key providing direct access to the production database was sitting right there in the repo.

The reaction is immediate — this is exactly the kind of setup that accumulates when convenience wins over security for too long. But rather than proposing a risky teardown, the host outlines an incremental migration plan:

- Dockerize the local development stack so developers can run everything locally without production keys
- Move secrets into the CI/CD pipeline, injecting them only at build and deploy time
- Separate environments so test clusters get test credentials and production secrets never touch developer machines

Andrey pushes the thinking further: injecting secrets at build time is still risky because anyone who gets the Docker image gets the secrets. The better model is runtime secret retrieval — workloads authenticate dynamically at startup and fetch only the secrets they need. HashiCorp Vault is the concrete example: in a Kubernetes environment, a pod uses its Kubernetes service account to authenticate to Vault, obtains a short-lived token, and retrieves static or dynamic secrets. If someone steals the image and runs it outside the cluster, they cannot authenticate and get nothing.

Vault versus cloud-native secret management

The secrets discussion expands into a broader comparison.
Andrey, who has been doing public speaking about Vault and fielding consulting requests around it, frames the choice pragmatically. For hybrid-cloud or multi-cloud environments, Vault is likely the best option because it provides a unified interface for secret management, dynamic credentials, and synchronization across providers. For single-cloud commitments — say, all-in on AWS — native services can cover many of the same use cases: AWS STS for temporary credentials, RDS IAM authentication for database logins, AWS Secrets Manager (which may even be running Vault underneath, as one host speculates), and AWS Certificate Manager for TLS certificates. If the organization is not going multi-cloud, the overhead of running Vault may not be justified. The recommendation is not ideological. It depends on architecture, portability needs, and operational complexity.

When Vault works technically but fails organizationally

Julien Bisconti adds an important caveat from experience. He describes deploying Vault in a multi-availability-zone setup with full redundancy — technically solid. But the project "went to a halt completely" when it hit governance questions: who should access what, under which rules, and who owns the policies. It became a political war, and the entire deployment had to be rolled back. The lesson: security tools are good at automating technical workflows, but if the underlying organizational process is broken, you automate a broken process. Security, monitoring, deployment, and access control are deeply entangled, and tooling alone cannot untangle them.

Security tooling fails because developers cannot use it

Julien brings the strongest developer-empathy argument of the episode. Developers do not ignore security because they are careless — they bypass it because secure workflows are too awkward under delivery pressure.
A manager does not understand why the developer is blocked, pressure mounts, and the result is "just hardcode that here, I don't care, it works." Even simple tasks illustrate the problem. Julien asks: can you generate an SSL certificate with OpenSSL from memory right now? Most engineers cannot — it is something they do every few months and have to look up each time. He references the famous XKCD comic about entering the correct tar command with ten seconds left.

This evolves into a philosophical observation. One host identifies as a "tool builder" rather than a "product builder" — someone who enjoys building mechanisms but does not always think deeply about end-user experience. That mindset, common among infrastructure and security engineers, may explain why so many DevSecOps tools are powerful but painful to adopt. The gap is not in capability but in usability.

VPNs, zero trust, and the BeyondCorp model

Julien argues that VPNs are an increasingly painful abstraction. Even Cisco — the company that essentially built enterprise VPN technology — had to raise capacity limits during the COVID-19 pandemic because their own infrastructure could not handle the load. Split tunneling introduces its own vulnerabilities, and full-tunnel VPN creates a bottleneck for everything. He points to Google's BeyondCorp model, published in 2014, which established the principle that network location should not determine access. The analogy: do you build a castle with walls where anyone inside has full access, or do you put a guard in every room checking credentials? The latter — zero trust — is harder to implement, but it limits blast radius and removes the binary "in or out" problem. Andrey connects this to the emerging service mesh ecosystem. Technologies like Consul Connect implement zero-trust networking at the application level with mutual TLS and identity-based authorization.
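As an illustration of what that looks like in practice, a Consul service registration can opt into Connect with a sidecar proxy (a sketch; service names and ports are invented):

```hcl
# Register "web" in the mesh; its sidecar gets a certificate-backed identity,
# and traffic to the "db" upstream is carried over mutual TLS.
service {
  name = "web"
  port = 8080

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "db"
          local_bind_port  = 5432
        }
      }
    }
  }
}
```

Whether web may talk to db is then decided by Consul intentions, that is, by service identity rather than by network location.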
The hosts note that the service mesh space is still fragmented — just as there was a "war of orchestrators" before Kubernetes emerged as the default, there is now a "war of service meshes" still playing out.

Kubernetes hype versus simpler orchestration

A significant portion of the episode is a productive debate about orchestration choices. Andrey argues strongly against defaulting to Kubernetes. He describes a hybrid-cloud project in Africa running the full HashiCorp stack: Consul for service discovery, configuration, and networking; Nomad for workload scheduling. A team member with relatively little experience got the stack up and running in days. Andrey outlines the operational weight of Kubernetes: cluster version upgrades where in-place upgrades may skip new security defaults (making full cluster recreation the recommended path), autoscaler configuration layers (pod autoscaler, cluster autoscaler, resource limits), ingress management, YAML sprawl from Helm charts, and a platform that evolves so rapidly it demands continuous learning. He especially warns against running databases in Kubernetes — the statefulness adds pain. For single-cloud AWS, he argues that ECS is often the better choice: the control plane is free (or nearly so), the per-node overhead is minimal compared to Kubernetes, and AWS handles the operational burden.

Mattias pushes back with a practical counterpoint. Kubernetes provides a consistent platform for diverse workloads — containers, databases, monitoring, custom jobs — all managed through the same interface. Helm charts for common components like nginx-ingress, cert-manager, and external-dns make the ecosystem approachable. The value is in standardization and adaptability. The hosts also note GKE's pricing evolution: Google introduced a per-cluster management fee (roughly $0.10/hour per control plane) to discourage sprawl and encourage consolidation — a signal that even managed Kubernetes has real costs.
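To give a sense of the "simpler stack" argument, a minimal Nomad job file looks like this (a sketch with invented names):

```hcl
# A complete Nomad job: schedule two Docker containers across the cluster.
job "web" {
  datacenters = ["dc1"]

  group "app" {
    count = 2

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.25"
      }

      resources {
        cpu    = 200 # MHz
        memory = 128 # MB
      }
    }
  }
}
```

A single job file and a single binary per node is the whole deployment story, which is the contrast Andrey draws against the layers of Kubernetes configuration.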
The disagreement is honest but constructive. The shared conclusion: start with what the business needs, then pick the simplest tool that gets you there. "The best battle is the battle you don't fight." And as Julien notes, teams that avoid the Kubernetes default often demonstrate deeper architectural thinking — choosing based on the hype is an insurance policy, but it is not the same as choosing based on needs.

Slack bots, workflow automation, and the security surface

Near the end, Mattias raises the topic of Slack bots for operational tasks — deployment reporting, status checks, and interactive queries. Andrey reframes the conversation around security: if Slack becomes part of a privileged control plane — for example, a bot that handles privilege escalation by requesting approvals through Slack messages — then request spoofing, account compromise, and weak isolation become serious concerns. The idea of a privilege-escalation bot is interesting (request access via Slack, get approval from designated approvers, receive time-limited credentials with full audit logging), but the attack surface is real. Slack provides a powerful collaboration platform for building workflows without custom UIs, but once it handles access decisions, security design matters as much as convenience.

Highlights

"All the service account keys were in clear text. In the repo." A host describes opening up a client's application stack and finding cloud service keys, usernames, passwords, API credentials, and an SSH key that tunnels directly into the production SQL server — all committed to Git in plain text. It is the kind of discovery that instantly explains years of hidden risk. How do you unwind that without breaking delivery? The hosts walk through an incremental migration plan in this episode of DevSecOps Talks.

"Security tooling is actually not that usable." Julien Bisconti delivers a sharp truth: developers do not bypass security because they are careless.
They do it because secure workflows are too slow, too confusing, and too far removed from how they actually work. When the pressure comes from a manager who does not understand the blocker, the shortcut wins every time. A candid take on why hardcoded secrets keep showing up in real codebases. Listen to the full discussion on DevSecOps Talks.

"I really applaud people who don't choose Kubernetes — that means they actually know what they're doing." One of the spicier platform takes of the episode. The argument is not that Kubernetes is bad, but that defaulting to it without analyzing your actual needs is a sign of hype-driven architecture. If a simpler stack solves the problem, picking the biggest platform just creates more operational burden. Hear the full Kubernetes-versus-Nomad-versus-ECS debate on DevSecOps Talks.

"If your process is not good, you're going to automate a bad process." Julien recounts deploying Vault with full HA and multi-AZ redundancy, only to have the project grind to a halt over organizational politics — who should access what, and who decides. The tooling worked perfectly. The organization did not. A reminder that DevSecOps maturity is not just about picking better tools. Catch the full story on DevSecOps Talks.

"Once somebody is inside, they have the keys to the kingdom." The VPN and zero-trust discussion delivers one of the strongest security arguments of the episode. Julien explains why broad network access — the castle-and-moat model — is the wrong abstraction for modern systems, and why identity-based, fine-grained access control is worth the implementation cost. If the old perimeter model still shapes how your team thinks about infrastructure security, this part of the episode will resonate. Listen on DevSecOps Talks.

Resources

AWS Systems Manager Session Manager — AWS documentation for Session Manager, which provides secure instance access without SSH keys, open ports, or bastion hosts, with built-in session logging.
tlog — Terminal I/O Logger — Open-source terminal session recording tool that logs to the systemd journal in JSON format, making sessions searchable and tamper-resistant. Discussed in the episode as a more robust alternative to the Unix script command.

AWS S3 Block Public Access — AWS documentation on account- and bucket-level settings to prevent public access to S3 resources, regardless of individual object ACLs or bucket policies.

Troubleshooting Cross-Account Access to KMS-Encrypted S3 Buckets — AWS guidance on the exact issue discussed in the episode: S3 downloads failing because the requester lacks KMS key permissions, even when bucket-level access is granted.

BeyondCorp: A New Approach to Enterprise Security — Google's foundational 2014 paper on zero-trust networking, which established the principle that network location should not determine access. Referenced by Julien in the VPN discussion.

HashiCorp Nomad — A lightweight workload orchestrator with native Consul and Vault integrations. Discussed as a simpler alternative to Kubernetes, especially for hybrid-cloud and small-team environments.

Consul Service Mesh (Consul Connect) — HashiCorp's service mesh solution providing zero-trust networking through mutual TLS and identity-based authorization. Mentioned as the networking layer in the Africa hybrid-cloud project.

XKCD 1168: tar — The comic Julien references about the impossibility of remembering command-line flags — a humorous illustration of why security tooling needs better usability.
Mar 26, 2020 • 56min

DEVSECOPS Talks #4-2020 - Is Docker more secure than VM

In this episode Mattias tries to convince us that running Docker in K8s is more secure than a VM. Did he succeed? Listen and find out.

Summary

Mattias makes a bold claim: Docker containers are more secure than virtual machines. Andrey and Julien push back hard — and by the end, the three hosts explicitly agree to disagree. Along the way, they dig into why container breakouts are harder than people assume, how Lambda micro VMs can be exploited through warm TMP folders, why "containers do not contain" without extra kernel controls, and whether good monitoring matters more for security than any isolation technology. Recorded during COVID-19 lockdowns in 2020, the debate captures a moment when the container-vs-VM argument was far from settled.

Key Topics

Docker vs. VM security: technology vs. ways of working

Mattias opens the main debate by arguing that Docker containers are more secure than VMs in practice. His reasoning: containers are smaller, more focused, and more ephemeral than traditional virtual machines, which reduces attack surface. In a typical VM, you find mail agents, host-based intrusion detection, syslog, monitoring tools, and other services all coexisting with the application. In a container, you ideally run only the application itself.

Andrey pushes back immediately. He argues Mattias is comparing operational models, not technology. A well-run VM can also be immutable and minimal — you redeploy from a new image the same way you replace a container. Likewise, a badly built container can be long-lived, bloated, and full of unnecessary tools. Andrey has seen enterprises that run containers for months, SSH into them, and treat them like VMs. Mattias concedes the point but maintains that the standard approach differs: VMs are typically kept running longer with more tools, while the standard approach for containers in Kubernetes is to rotate them and keep a smaller footprint.
Andrey counters that most Docker images run as root by default, giving attackers more privilege than they would have on a typical VM where processes run under limited service accounts. This is one of the sharpest exchanges in the episode — better tooling does not fix insecure defaults. The hosts eventually agree that both technologies can be secured well, but do not reach consensus on which is easier. Andrey summarizes it cleanly: containers make it "a little bit easier" to do the right thing because they narrow the focus to the application rather than the entire operating system, but it is absolutely possible to reach the same security level with VMs.

Why container breakout is not as trivial as people imply

Mattias challenges the common assumption that containers are unsafe because "you can break out of them." He points out that every container breakout CVE he has reviewed requires significant preconditions: either running an attacker-controlled image or running in privileged mode. You cannot take a standard Ubuntu container image, run a single command, and escape. The threat is real but requires chained attacks, not a single exploit. Julien and Andrey accept the premise but note that the comparison matters. VM isolation is fundamentally stronger at the hypervisor level. Container breakout may be hard, but it is architecturally easier than VM escape. The discussion reframes the question: runtime security is less about one isolation boundary and more about how many obstacles an attacker must pass through.

Micro VMs, Firecracker, and Lambda attack vectors

Andrey brings up an important middle ground between containers and VMs: micro VMs. AWS Lambda runs on Firecracker, an open-source micro VM monitor. Lambdas are ephemeral, have read-only file systems, minimal tooling, and no access to source code or settings — making them quite secure by design. But Andrey describes a real attack path researchers have demonstrated. The /tmp directory in Lambda is writable.
If an attacker exploits a vulnerability to get code execution within the Lambda, and the Lambda is kept warm (invoked within 15 minutes so it stays in memory), the /tmp folder persists between invocations. An attacker can download tools incrementally across multiple Lambda runs, building up capability over time. From there, they can explore IAM permissions, exfiltrate data by encoding it in resource tags, or even override the Lambda function itself. The point is that even well-designed ephemeral environments have attack paths when defenders are not paying attention. Security depends on hardening and monitoring, not just on the isolation primitive.

Containers do not contain: AppArmor, Seccomp, and policy controls

Julien delivers the episode's sharpest technical point: "Containers do not contain." They are primarily Linux namespace isolation and need additional kernel controls — AppArmor profiles and Seccomp filters — to properly restrict what applications can do at runtime. Without those extra layers, a container running as root is effectively root on the host machine, and a container with host network access is the same as running directly on the server.

This shifts security responsibility in uncomfortable ways. In VM environments, operations and security teams traditionally handle access controls. In containerized environments, developers are often expected to define security profiles for their workloads — but they may not know which system calls or privileges their applications need. Julien describes this as a fundamental organizational gap: the people writing the workload and the people securing the workload are rarely working hand in hand. Mattias suggests that platform teams can solve this by enforcing policies centrally. He references tools like Open Policy Agent to set standards for what gets deployed into a cluster, rather than relying on every developer to configure security correctly.
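A central policy of the kind Mattias describes might look like this in OPA's Rego language (a sketch; the input shape assumes OPA wired up as a Kubernetes admission controller):

```rego
package kubernetes.admission

# Reject any Pod whose containers do not explicitly run as a non-root user.
deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.securityContext.runAsNonRoot
  msg := sprintf("container %q must set securityContext.runAsNonRoot", [container.name])
}
```

The point of the pattern is that the platform team writes this once and every deployment is checked against it, instead of every developer having to remember the secure configuration.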
Kubernetes makes monitoring and response easier

Mattias makes a strong case for container platforms as detection and response environments. He describes working with Falco, a runtime security tool, and highlights a powerful capability: if someone opens a shell inside a container, Falco can detect that behavior and the container is killed automatically. That kind of automated response is natural in an environment built around disposable workloads. On a VM, shells are a normal part of operations, making the same detection much harder to act on.

Julien extends this into a broader argument about monitoring and security being inseparable. He argues that when monitoring is poor, access control becomes chaotic — developers need broad production access just to debug issues. But with strong observability, teams can use feature flags, targeted routing, and centralized logging instead of SSH-ing into production. Good monitoring reduces the need for risky access patterns. Julien offers a practical example: instead of blocking developers from opening shells in containers, observe that they are doing it and ask why. If they need logs, build a secure log access API. If they need to debug, improve the observability tooling. Monitoring turns security violations into product requirements.

Minimizing container images

Julien mentions using DockerSlim (now SlimToolkit) to strip unnecessary components from container images, reducing attack surface without requiring deep knowledge of every dependency. It is not a complete security solution, but it is an easy first step that removes much of the bloat containers inherit from their base images. For organizations with compliance requirements, Julien notes that third-party security vendors provide validated runtime solutions — useful for audit purposes where you need a third party to confirm that the running workload matches what was built internally.
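The shell-in-a-container detection Mattias describes corresponds to a rule Falco ships by default; a simplified version looks roughly like this (a sketch, not the exact bundled rule):

```yaml
- rule: Terminal shell in container
  desc: Detect an interactive shell spawned inside a container
  condition: >
    container.id != host and
    proc.name in (bash, sh, zsh) and
    proc.tty != 0
  output: >
    Shell spawned in a container
    (user=%user.name container=%container.name command=%proc.cmdline)
  priority: NOTICE
```

Falco itself only emits the event; the automated "kill the container" response is wired up separately, for example by a small responder that reacts to Falco alerts.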
Bundling dependencies with the application

Mattias raises a concern about how containerization changes dependency management. In older models, operations maintained the web server (Apache, Nginx) separately from the application. In containers, the web server, runtime, and application are bundled together. That means patching the web server requires rebuilding and redeploying the entire container, even when the application code has not changed. Andrey reframes this as a different packaging model, not a new problem. With Java WAR files deployed to Tomcat, you already had dependency coupling — you just managed it differently. Containers actually improve the situation in one way: each application owns its own dependency lifecycle instead of sharing an application server. One application can upgrade independently without affecting others on the same host. Both hosts note that dedicated application servers are fading. Modern applications in Go, Python, and Node.js often handle HTTP directly, removing the need for a separate web server entirely. The ingress controller in Kubernetes handles routing at the cluster level, which is a separate concern from the application.

The hosts agree to disagree

The episode ends without consensus. Mattias remains firmly convinced that containers, run properly in Kubernetes, are more secure than VMs. Julien's final position: "Containers can be as secure as VMs, but they need more work to get there." Andrey advocates for a layered approach — use both VMs and containers, with container security focused on application concerns and VM security focused on operational and resource isolation. He also notes that CoreOS, once the go-to minimal container OS, had recently been discontinued by IBM, leaving teams to find alternatives like Fedora CoreOS.

Highlights

"Containers do not contain." Julien delivers the episode's most quotable line, reminding listeners that containers are mostly Linux namespacing — not real isolation boundaries.
Without AppArmor, Seccomp, and careful configuration, a container is far less restrictive than people assume. A sharp reality check for anyone treating "containerized" as synonymous with "secure." Listen to the full episode on DevSecOps Talks to hear why container security is never just about packaging.

"If somebody pops a shell in a container, that container is killed." Mattias describes working with Falco and highlights a capability that captures the strongest pro-container argument: disposable workloads change the incident response model entirely. On a VM, a shell is normal. In a container, it is an alarm — and the platform can act on it automatically. Listen to the episode to hear how the hosts connect runtime detection, monitoring, and automated response.

"Most of the Docker images out there are running as root." Just when the debate leans in Docker's favor, Mattias himself brings it crashing back. On VMs, running as root is rare. In containers, it is the default. Better tooling does not fix insecure defaults — and this remains one of the most practical risks in container environments. Hear the full back-and-forth on DevSecOps Talks.

"We have to separate apples from bananas — the technology from the ways of working." Mattias draws a sharp line that reframes the entire debate. Are containers actually more secure, or are teams comparing modern container practices against outdated VM operations? A useful reminder that architecture arguments often hide workflow arguments underneath. Listen to the full conversation for the spirited disagreement that follows.

"Monitoring very much goes hand in hand with security." Julien makes the case that bad observability leads directly to bad access control. When developers cannot see what is happening in production safely, they need more privileges, more access, and more risky workarounds. Fix the monitoring, and many security problems solve themselves.
Listen to the episode on DevSecOps Talks to hear why observability might be the most underrated security control.

"Containers can be as secure as VMs, but they need more work." Julien's final verdict — delivered over Mattias's loud objections — perfectly captures the episode's unresolved tension. The hosts explicitly agree to disagree, making this one of the more honest security debates you will hear on a podcast. Catch the full exchange on DevSecOps Talks.

Resources

Falco — CNCF-graduated runtime security tool that detects anomalous behavior in containers and Kubernetes using eBPF. Mentioned by Mattias for its ability to automatically kill containers when suspicious activity like shell access is detected.

Firecracker — Open-source micro VM monitor built by AWS, powering Lambda and Fargate. Discussed by Andrey as an example of ephemeral, hardened execution environments and their attack surfaces.

SlimToolkit (formerly DockerSlim) — Tool for analyzing and minimizing container images, automatically generating AppArmor and Seccomp profiles. Mentioned by Julien as a practical way to reduce attack surface without deep security expertise.

Open Policy Agent (OPA) — General-purpose policy engine for enforcing security and operational policies across Kubernetes clusters. Referenced by Mattias for centrally enforcing deployment standards.

AppArmor — Linux kernel security module that restricts application capabilities through mandatory access control profiles. Discussed by Julien as an essential add-on for meaningful container isolation.

Seccomp (Secure Computing Mode) — Linux kernel facility that restricts which system calls a process can make. Used by Docker and Kubernetes to reduce the container attack surface by blocking unnecessary syscalls.

Fedora CoreOS — Successor to CoreOS Container Linux (discontinued 2020), a minimal, auto-updating operating system designed for running containerized workloads. Relevant context for Andrey's mention of CoreOS being killed by IBM.
Mar 20, 2020 • 38min

DEVSECOPS Talks #3-2020 - Docker securing builds

Your Docker images and builds are becoming the base of your platform. But are they secure? In this episode we talk about how you can secure your Docker images. Summary In this early DevSecOps Talks episode, Mattias, Andrey, and Julien dig into Docker security as a supply chain problem — and quickly dismantle the assumption that a signed container means you know what is inside. Julien pushes back sharply: signing only gives a "semantic guarantee" that an image is what it claims to be, not that it is safe. Mattias argues that containers were designed to be convenient, not secure by default, while Andrey points out that containerization has fundamentally changed the patching game — once the OS, web server, and application are packaged together, every security fix becomes a rebuild-and-redeploy exercise. The hosts make the case for layered scanning, slim runtime images, multi-stage builds, and continuous rebuilding as the practical path to running containers safely in production. Key Topics Container images vs. running containers The conversation starts by separating two distinct parts of container security: the image and the running container. Mattias explains that a container image can be treated much like any other file or archive — a zip or tar file sitting on disk. Because of that, teams can sign images cryptographically to verify origin and integrity, similar to how Node.js developers sign releases with their private keys. That gives consumers confidence that the image came from a known source and has not been tampered with. But Julien pushes back on a common misunderstanding: signing does not mean the contents are inherently safe. As he puts it, you get a "semantic guarantee that this image is what it's pretending to be" — but not proof that everything inside is secure. Authenticity is not the same as security. The hosts frame this as a trust problem.
In a production cluster, teams often want to prevent engineers or workloads from pulling arbitrary images and running them without controls. Signed images and curated registries help, but they do not eliminate the need for careful validation. Trust, Docker Hub, and the container supply chain A major part of the episode focuses on how much trust teams should place in public images, including those from Docker Hub. Andrey raises the practical reality: if you are running four different languages, you cannot build and maintain base images for all of them. It is much easier to grab the latest Node.js, Python, Ruby, or Java images from Docker Hub and build from there. Julien and Mattias acknowledge that reality, but caution against treating "official" or branded images as automatically secure. Julien walks through the different trust levels on Docker Hub: Images from unknown individuals are the hardest to trust Organization-backed images (Red Hat, CloudBees, etc.) provide more accountability based on brand recognition Even reputable images can contain known vulnerabilities — scanning a Jenkins image from Docker Hub can reveal a surprising number of CVEs A trusted source can still introduce problems, whether by mistake or through malicious intent That leads into a broader discussion of supply chain attacks. Julien references real examples where Node.js libraries on npm were taken over by malicious parties after the original maintainer walked away. The same risk applies to container images. Julien points out that large organizations sometimes go as far as rebuilding all dependencies from source — he mentions having heard of teams that do not pull jar files from Maven Central but build their own from source to verify exactly what they are shipping. While that is not feasible for every team, the principle stands: reduce blind trust and increase verification where the environment demands it. 
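Julien's distinction between authenticity and security can be made concrete in a few lines. This is a toy sketch using raw SHA-256 hashes — real image signing uses tooling such as Docker Content Trust or Notary, and the image contents below are invented:

```python
import hashlib

def verify_digest(content: bytes, signed_digest: str) -> bool:
    """Integrity/origin check: is this content exactly what the publisher signed?"""
    return hashlib.sha256(content).hexdigest() == signed_digest

# The publisher signs the digest of exactly these bytes (hypothetical image content)...
image = b"FROM node:10\nRUN npm install some-hijacked-package\n"
digest = hashlib.sha256(image).hexdigest()

assert verify_digest(image, digest)             # authentic: the image is what it claims to be
assert not verify_digest(image + b"x", digest)  # tampering is detected

# But verification says nothing about what is inside: if the signed content
# already ships a vulnerable or hijacked dependency, the check still passes.
```

That is the "semantic guarantee" in code form: the second assertion catches tampering, but nothing here inspects whether the signed contents are safe to run.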
Why container security is not just image signing The discussion then shifts from image authenticity to runtime security. Mattias explains that containers rely on Linux kernel primitives — namespaces for process isolation, along with controls for networking, memory, and disk. These low-level APIs are useful for resource sharing and scaling, but they were not originally designed as strong security boundaries. As he puts it, "the container does not contain things, it's just an abstraction." Container breakout vulnerabilities matter because an attacker who can exploit the runtime or host interface may reach beyond the container itself. This leads to one of the episode's sharpest observations from Mattias: containers became popular because they are efficient and convenient to operate — you can bin-pack them on the same hardware and run far more applications per server. But from a security perspective, "it was not designed to be secure by default, it was designed to be convenient." That gap between convenience and security is what teams must actively address through scanning, hardening, and runtime controls. CVE scanning: registries, dependencies, and source code The hosts spend a good amount of time discussing scanning tools and where each fits in the security pipeline. Mattias notes that most container registries now offer built-in vulnerability scanning, sometimes called container analysis APIs. Julien suggests a practical AWS-based pattern: if you do not want to pay for Docker Hub premium but still want to use public images, you can pull from Docker Hub, push into AWS Elastic Container Registry (ECR), and take advantage of its built-in CVE scanning. Then you restrict your production orchestrators to pull only from ECR. Julien draws an important distinction between types of scanning: Registry scanning examines what packages are installed in the image at the OS level. The registry unpacks the image, identifies installed packages, and flags known CVEs. 
Dependency scanning tools like Snyk, Dependabot, and similar platforms check application dependency manifests (package.json, requirements.txt, pom.xml) against CVE databases. They are protecting against supply chain vulnerabilities in libraries, not scanning custom application code. Static analysis and linters can catch some obvious security issues in application source code, but as Andrey notes, they mainly catch "easy targets" and default patterns. Julien initially states that registry scans do not cover source code, then corrects himself to clarify the distinction more precisely: registries scan installed OS packages, while separate tools scan programming language dependencies. Neither deeply analyzes your own custom code. That leaves an unknown component in the stack that teams need to address through other means — code review, testing, and secure development practices. Andrey also mentions using Anchore, which he describes as the foundation for many of these CVE scanning capabilities. The shift from OS patching to image rebuilding One of the most practical insights comes from Andrey, who compares containers to older operational models. In traditional environments, teams could patch the operating system or update components like Nginx independently of the application. With containers, those layers are packaged together. If a new Nginx vulnerability is disclosed, the team needs to rebuild and redeploy the entire image that contains both the web server and the application code. This changes patching from an infrastructure task into an application delivery task. Security updates are no longer something ops handles in isolation — they flow through the same build-and-deploy pipeline as feature code. The hosts argue that this is why security must be a concern from the earliest stages. As Andrey puts it, referencing Julien's earlier point: security belongs in the first commit, because that is when it is cheapest and easiest to get right. 
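The manifest-scanning distinction Julien draws — and why a clean scan is only a point-in-time statement — can be sketched with a toy CVE feed. Real scanners query growing databases such as the NVD; the dependency versions here are real, but the feed and function are invented for illustration:

```python
def scan_manifest(dependencies, cve_db):
    """Flag declared dependencies with known CVEs; custom application code is not inspected."""
    return {pkg: cve_db[(pkg, ver)]
            for pkg, ver in dependencies.items() if (pkg, ver) in cve_db}

deps = {"lodash": "4.17.11", "express": "4.17.1"}  # e.g. parsed from package.json

feed_yesterday = {}                                          # nothing known yet
feed_today = {("lodash", "4.17.11"): ["CVE-2019-10744"]}     # disclosed overnight

assert scan_manifest(deps, feed_yesterday) == {}             # green build
assert scan_manifest(deps, feed_today) == {"lodash": ["CVE-2019-10744"]}  # same artifact, new finding
```

The same unchanged manifest passes one day and fails the next, because the vulnerability database moved underneath it.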
A green build today does not guarantee a safe deployment tomorrow if new CVEs are published against the packages already running in production. Slim images, distroless approaches, and DockerSlim Mattias argues strongly for reducing container contents to the bare minimum. He highlights DockerSlim (now SlimToolkit), a project he uses frequently that strips images down to only the components essential for the application. In his example, a Maven-based application image dropped from roughly 600 MB to 140 MB — with no bash shell or other standard OS tooling left in the result. Julien reinforces the security rationale: "the less code you have, the less vulnerability you have, and that's what you want in production." He mentions Alpine Linux and Google's distroless images as complementary approaches that aim for the same goal — minimal OS footprint in production containers. The common theme is that production containers should not carry build tools, shells, package managers, or debugging utilities. Every unnecessary binary is a potential attack surface. The best production image is not the one easiest to build, but the one that contains the least unnecessary code. Multi-stage builds and separate build vs. runtime images The hosts spend considerable time on one of the most practical Docker security patterns: multi-stage builds. Julien explains the concept of build stages — using an intermediate container with all build dependencies to compile the application, then copying only the final artifact into a much smaller production image. This separation means the production image does not need compilers, package managers, or the full dependency tree. Andrey confirms this maps directly to Docker's multi-stage build feature: "You just build your Docker build in one stage and then just copy build results to the next stage." 
He also points out the developer experience benefit — since the build environment is defined inside the Dockerfile itself, developers do not need to set up different language toolchains on their local machines when working across multiple microservices. Julien adds a performance angle: pulling a pre-built container image with cached dependencies is often much faster than resolving and fetching all dependencies from scratch. He has seen Maven builds that took 20 minutes purely because they had to re-fetch all artifacts every time. Pre-building and caching the dependency layer can dramatically improve total build-to-production time. Continuous rebuilding and reducing attacker persistence Andrey recommends reducing the lifetime of deployed images by rebuilding base images and all derived containers regularly — potentially every week — pulling in the latest patches each time. While this adds operational overhead, it shortens the window of exposure and makes it significantly harder for attackers to maintain persistence in stale environments. Julien frames this as a recurring maintenance budget that every engineering team must accept. As he puts it, "if you don't spend at least one day per week updating the stuff, it's going to accumulate over a year or something. And then you have to spend two weeks fixing all that." The compound interest on security debt is steep. Tags, digests, signing, and private registries Near the end of the episode, Mattias raises a practical deployment question: how should teams store and reference images securely? He contrasts mutable tags (which can be overwritten on Docker Hub) with immutable SHA-based digests, image signing, and private registries — and admits there are so many options it is hard to know where to start. Julien recommends implementing all of these controls, but not all at once. He advocates for an incremental approach: define your security objectives, then build toward them layer by layer. 
Start with what gives the most immediate protection and expand from there. The hosts do not present a single silver bullet. Instead, they emphasize defense in depth: scanning at every level (code dependencies, container base images, production images), signing for authenticity, private registries for access control, and infrastructure-level enforcement. Build pipeline security and handling secrets The episode closes by touching on a problem the hosts agree deserves its own dedicated discussion: securing the build system itself. Mattias points out that the build server has access to source code, credentials, signing keys, registries, and deployment systems. If an attacker compromises it, they can inject malicious code during the build process — effectively poisoning everything downstream. The hosts then discuss the challenge of passing credentials into container builds for private dependencies. Andrey notes that recent Docker versions support passing SSH agents and secrets more safely during builds. He recommends using short-lived credentials (like AWS STS tokens with 15-minute expiration) so that even if credentials leak into image layers, they are already expired by the time anyone could exploit them. He also mentions using IMG, a daemonless image builder, as an alternative to Docker that avoids the need for a Docker daemon during builds. Julien takes a different approach to runtime secrets: encrypting them with KMS and storing them in a cloud bucket, then fetching them only at container startup. He observes that the real cloud vendor lock-in is never the runtime — "it's always the IAM" — because authorization and access control mechanisms are deeply cloud-specific and difficult to migrate. Julien adds that handling build secrets often becomes an awkward "dance" of fetching credentials, granting temporary access, and cleaning up afterward. It works, but it remains operationally clumsy. 
The hosts agree that build server hardening and the connection between security and cost management (which Julien briefly mentions as natural partners, since understanding who has access to what benefits both) are topics worthy of their own future episodes. Highlights "You don't know what's inside — you only have a semantic guarantee." Julien cuts through a common assumption in container security: signing an image proves origin, not safety. That distinction shapes the entire episode, as the hosts explore why authenticity, trust, and actual security are three separate problems. Listen to this episode of DevSecOps Talks for a grounded discussion on what image signing can — and cannot — guarantee. "Containers were designed to be convenient, not secure by default." Mattias makes one of the sharpest points of the episode: containers became popular because they are efficient and easy to operate, not because they provide strong isolation. The container "does not contain things, it's just an abstraction." That is why runtime hardening and vulnerability management still matter so much. Listen to DevSecOps Talks to hear why container adoption created as many security questions as it solved. "Official on Docker Hub doesn't mean secure — scan a Jenkins image and you'd be surprised." Julien challenges the idea that a branded or official image should be trusted blindly. Even well-known organization-backed images can contain a surprising number of CVEs, and reputable sources can still introduce malicious changes — intentionally or by mistake. Listen to this DevSecOps Talks episode for a practical conversation about defining trust in your container supply chain. "The less code you have, the less vulnerability you have." Julien sums up a recurring theme: smaller runtime images are not just cleaner — they are fundamentally safer. 
From DockerSlim shrinking a 600 MB Maven image to 140 MB, to Alpine and distroless approaches, the hosts argue for removing everything production does not absolutely need. Listen to DevSecOps Talks to hear why image size and security are more connected than many teams realize. "Nginx gets a CVE? Now you have to rebuild your entire app." Andrey highlights how containerization merged the OS patching cycle with the application delivery cycle. In the old world, ops could patch Nginx without touching the app. In the container world, every security update means a full image rebuild and redeploy — making security an application delivery concern, not just an infrastructure one. Listen to this DevSecOps Talks episode for a practical take on why modern patching must flow through the CI/CD pipeline. "If you don't spend one day a week updating, you'll spend two weeks fixing it later." Julien describes dependency and image maintenance as a non-negotiable recurring budget. Skip the updates and the security debt compounds fast — turning routine maintenance into an emergency remediation project. Listen to DevSecOps Talks for an honest take on the operational cost of staying secure in containerized environments. "The real lock-in is never the runtime — it's always the IAM." In a brief but pointed aside about handling secrets in containers, Julien observes that authorization and access control are the truly cloud-specific parts of any architecture. Runtime workloads can move; IAM policies cannot. Listen to this DevSecOps Talks episode for a candid discussion on where the real complexity lies in cloud-native security. Resources SlimToolkit (formerly DockerSlim) — Open-source tool that minifies container images by removing non-essential components, reducing image size and attack surface without code changes. Mentioned by Mattias in the episode. 
Google Distroless Container Images — Minimal container base images from Google that contain only the application and its runtime dependencies, stripping out shells, package managers, and OS utilities. Docker Multi-Stage Builds — Official Docker documentation on using multiple build stages to produce smaller, cleaner production images by separating the build environment from the runtime image. Docker Content Trust — Docker's built-in mechanism for cryptographic signing and verification of image integrity and publisher identity using Notary. Amazon ECR Image Scanning — AWS documentation on scanning container images for OS and language package vulnerabilities in Elastic Container Registry, mentioned by Julien as a practical alternative to paid Docker Hub scanning. Snyk Container — Developer security tool for scanning container images and application dependencies for known vulnerabilities, with remediation guidance and base image upgrade recommendations. Anchore Container Scanning — SBOM-powered container vulnerability scanning platform, referenced by Andrey as the engine behind many registry-level CVE scanning capabilities. Alpine Linux Docker Image — Minimal 5 MB base image built on musl libc and BusyBox, widely used as a lightweight, security-conscious alternative to full Linux distribution base images.
Mar 20, 2020 • 18min

DEVSECOPS Talks #2-2020 - GitOps

GitOps is a new concept in DevOps. What is it, and how can you use it when deploying and setting up your K8s cluster? Summary GitOps sounds simple — put Kubernetes manifests in Git and let the cluster pull changes — but the episode quickly reveals the real debate is not about Git at all. Andrey argues the only genuinely novel thing about GitOps is the pull-based model where an in-cluster agent reconciles state, while Julien questions whether GitOps is good for day-2 operations or just for bootstrapping clusters. The spiciest moment: Andrey declares "life is too short to do pull requests" and advocates pushing straight to master with strong CI/CD guardrails instead. Key Topics What GitOps actually is — and what it is not Andrey frames the discussion by separating what is genuinely new about GitOps from what teams have already been doing for years. Storing deployment specifications in Git, he argues, is just version control — teams have done that for a decade. The meaningful difference is the deployment model: instead of an external CI/CD server pushing changes into Kubernetes by calling the cluster API, GitOps places an agent inside the cluster that either receives a webhook or polls a Git repository, pulls in the desired state, and applies it from within. That pull-based model is what Andrey identifies as the core innovation. It eliminates the need to expose the Kubernetes API externally — a real concern when using hosted CI services like CircleCI, which would otherwise need network access to the cluster. As Andrey puts it, exposing the API externally is risky "unless you want someone mining bitcoin on your cluster." He references the tooling landscape at the time: Weaveworks (the company that coined the term "GitOps" and created WeaveNet, a Kubernetes CNI driver), Flux, Argo, and Jenkins X. He notes that Flux and Argo were joining forces at the time of recording.
He also mentions Jenkins X as a potential GitOps tool, since it runs CI/CD jobs natively in Kubernetes, but expresses skepticism about using Kubernetes for build workloads — Kubernetes is declarative about desired state, but "you cannot declare my build is successful because you have no idea how your build gonna go." Editor's note: Weaveworks, the company that originated the term "GitOps," shut down in February 2024. Flux continues as a CNCF graduated project. The GitOps principles have since been formalized by the OpenGitOps project under the CNCF. The Weaveworks definition, read straight from the source Andrey reads Weaveworks' concise GitOps definition from their blog and walks through its key points: The desired state of the whole system is described declaratively — Git is the single source of truth for every environment. All changes to desired state are Git commits — operations are driven through version control. The cluster state is observable — so teams can detect when desired and observed states have converged or diverged. A convergence mechanism brings the system back — when states diverge, the cluster automatically reconciles, either triggered immediately by a webhook or on a configurable polling interval. Rollback is simply convergence to an earlier desired state. Andrey also raises a nuance about Helm: since Helm templates can produce different output depending on input variables, true GitOps implies committing not only the Helm charts but also the rendered manifests — because the generated output is what actually represents the declarative desired state. He draws a comparison to GitHub's earlier promotion of ChatOps, noting that many of the same ideas — observable, verifiable changes driven through a central workflow — were already part of GitHub's operational philosophy, just with a different interface. 
Two layers: infrastructure-as-code and in-cluster GitOps Julien offers a more practical framing, splitting the problem into two distinct layers: Infrastructure as code — setting up the underlying infrastructure (VPCs, clusters, networking) GitOps — managing what runs inside the Kubernetes cluster: applications, operational tooling like monitoring (he mentions FluentD as an example), and supporting services In Julien's model, a Git repository becomes the authoritative inventory of everything that should exist in the cluster. He describes the ideal: "if anything else is running here, alert me or kill it." That gives teams confidence that the observed cluster state matches the intended one, and helps prevent configuration drift — a problem the hosts discussed in their earlier infrastructure-as-code episode. Day-2 operations: where the model gets tested While Julien appreciates GitOps for defining and bootstrapping cluster state, he is openly skeptical about its effectiveness for long-running operations. He distinguishes between two very different challenges: "setting up things" versus "running things for a long time — they're not the same." Real environments drift. People intervene manually during incidents. Urgent fixes happen outside the normal workflow. The clean desired-state model becomes harder to maintain once the messiness of day-2 operations enters the picture. Julien frames this as an open question rather than a settled answer: GitOps may be excellent for establishing a clean baseline, but whether it holds up as a complete long-term operating model remains to be proven. Who controls changes: developers, operators, or both? Andrey raises a governance concern: GitOps can look like a direct developer-to-cluster pathway. If a developer changes a YAML file, commits it, and the cluster automatically applies the change, operations staff are effectively bypassed — "there is nowhere an operation person can interfere with this." 
Julien pushes back, arguing that the workflow — not the tooling — determines who has control. If changes go through pull requests with review and approval, it does not matter whether the author is a developer or an operator. Both participate in the same process. The mechanism is the same one used for application code: propose a change, review it, merge it. Pull requests, compliance, and "push to master" The conversation takes its most opinionated turn when the topic shifts to pull requests. Andrey is blunt: "Life is too short to do pull requests. You never get anything done. You do a pull request, you ask for review and then you hunt the person for two days." His preference is to push directly to master and build CI/CD pipelines strong enough to catch mistakes — "you build your system to defend yourself from the fools." He does acknowledge an important exception: regulated industries where every production deployment must be peer-reviewed or approved. In those environments, formal review is not just a process preference but a compliance mechanism that can significantly reduce legal exposure when something goes wrong. Andrey also shares a personal practice: because he frequently switches between projects and loses context, the first thing he does is document every verification step as part of the CI/CD pipeline. That way, when he returns to a project months later, the pipeline already encodes everything he would need to remember. "There is no guarantee that someone else has a better understanding of what I did." Observability gaps in GitOps pipelines Andrey identifies a practical developer-experience problem with GitOps: the visibility gap. In a traditional pipeline, a developer can trace a change end-to-end — build, test, deploy — in one place. With GitOps, the CI pipeline ends when it commits changes to a repository. The actual deployment happens later, inside the cluster, through a separate reconciliation process. 
"My pipeline stops at the place where I do commit, push, done. Since then, pipeline doesn't have much to absorb." To understand whether a deployment succeeded, the developer needs to inspect cluster state rather than the original pipeline. Bridging that gap requires additional tooling and represents a real paradigm shift in how teams observe deployments. He also flags a repository-structure problem: if source code and deployment manifests live in the same repository, updating manifests can trigger the source-code pipeline again — requiring conditional logic to prevent unnecessary rebuilds. Deployment ordering and full-system validation Julien closes the discussion with a practical concern: deployment order matters in real systems. A proxy may need a backend to exist first. Some components cannot be rolled out in arbitrary order without causing failures. He also questions the validation model. In a software build pipeline, teams rebuild and test the entire application from the main branch to verify the whole system works. But with GitOps, a change to one part of the cluster may be applied incrementally without validating the full cluster state end-to-end. "I will never test the full master branch and rebuild the full cluster from it, except everything goes." That leaves an open question the hosts do not fully resolve: how can teams preserve the elegance of declarative Git-driven deployment while managing sequencing, dependencies, and whole-system confidence? Highlights "Unless you want someone mining bitcoin on your cluster" Andrey explains the security motivation behind the pull-based GitOps model — if you use an external CI system, you need to expose your Kubernetes API, which is not exactly ideal. His colorful warning about cryptocurrency miners makes the point memorable. Listen to the episode for Andrey's full breakdown of why the pull-vs-push distinction is the real heart of GitOps. "Life is too short to do pull requests." The spiciest take of the episode. 
Andrey argues that pull requests slow teams to a crawl — you open one, ask for review, then spend two days hunting the reviewer. His alternative: push to master and build pipelines strong enough to protect against mistakes. He does carve out an exception for regulated industries where peer review is legally required. Listen to the episode and decide whether you agree or strongly disagree. "GitOps is a nice way to set up your Kubernetes cluster — but is it a good tool to keep it running? I'm not sure." Julien draws a sharp line between bootstrapping a cluster and operating it long-term. Setting up things and running things for a long time are "not the same." It is a refreshingly honest admission that a clean architecture pattern does not automatically solve the messy reality of day-2 operations. Listen to the episode for a take that many GitOps advocates skip over. "You build your system to defend yourself from the fools." Andrey's philosophy in one sentence. Rather than relying on human review processes, invest in CI/CD pipelines and automated guardrails that prevent mistakes regardless of who pushes the change. He backs this up with a personal habit: encoding every verification step into the pipeline so future-him does not have to remember anything. Listen to the episode for a practical argument in favor of automation over process. "If anything else is running here — alert me or kill it." Julien describes the appeal of GitOps as an authoritative inventory of what should exist in a cluster. If the Git repository defines the desired state and the cluster enforces it, anything unauthorized can be flagged or removed. It is one of the clearest expressions of why teams are drawn to the GitOps model. Listen to the episode for a practical view of GitOps as cluster hygiene. The daughter interruption Mid-argument about observability gaps, Andrey's daughter walks in wanting to share something exciting. 
It is a charming reminder that even deep infrastructure debates happen in real life with real interruptions. Listen to the episode for the unscripted moment — and Andrey's smooth recovery. Resources Flux — the GitOps family of projects — The CNCF-graduated GitOps toolkit originally created by Weaveworks. Continuously reconciles Kubernetes cluster state with Git repositories. Argo CD — Declarative GitOps CD for Kubernetes — A declarative, GitOps continuous delivery tool for Kubernetes with a rich web UI for visualizing application state and deployments. Jenkins X — Cloud Native CI/CD Built On Kubernetes — An opinionated CI/CD platform for Kubernetes that automates pipelines, preview environments, and promotion using GitOps principles. OpenGitOps — CNCF Sandbox Project — The vendor-neutral, CNCF-backed project that formalizes GitOps principles: declarative, versioned and immutable, pulled automatically, and continuously reconciled. What is GitOps Really? — Weaveworks Blog — The original Weaveworks blog post defining GitOps that Andrey reads from during the episode. Weaveworks coined the term before shutting down in February 2024. gitops.tech — A community resource explaining GitOps concepts, principles, and the ecosystem of tools that implement the pattern. Awesome ChatOps — A curated list of ChatOps resources and tools, relevant to Andrey's comparison between GitOps and GitHub's earlier ChatOps movement driven by Hubot.
Mar 19, 2020 • 50min

DEVSECOPS Talks #1-2020 - Infra as code

Is infrastructure as code always the best way to go, and if not, when and where should you use it? Here we try to better understand when it is good to use and when it is not. Summary In this inaugural episode, Mattias, Andrey, and Julien discuss what infrastructure as code really means, why teams adopt it, and where it can go wrong. They explore the evolution from manual server management to declarative infrastructure, the differences between configuration management and infrastructure provisioning, the growing complexity of tools like Terraform and CloudFormation, and why culture, process, and operational discipline matter as much as the tooling itself. Key Topics What Infrastructure as Code Actually Solves The discussion starts with Mattias describing the shift from manually editing Apache configs over SSH to defining cloud environments in code. He recalls the progression: first managing individual servers by hand, then adopting configuration management tools like Puppet, Chef, and Ansible, and finally arriving at cloud-native tools like AWS CloudFormation that can provision entire environments declaratively. Andrey pushes the conversation toward first principles, arguing that it is important to separate the "what" from the "how." He explains that infrastructure as code depends on having APIs — software-defined interfaces that allow infrastructure to be created and managed programmatically. Without that kind of interface, teams are limited to SSH and the manual tools they had before. The rise of public cloud providers and platforms like OpenStack finally gave teams the APIs they needed to describe infrastructure declaratively in definition files. Configuration Management vs Infrastructure as Code A key distinction in the episode is the difference between server configuration tools and true infrastructure as code.
Andrey notes that tools like Puppet, Chef, and Ansible were originally conceived as server configuration management tools — designed to automate the provisioning and configuration of servers, not to define infrastructure itself. He acknowledges this is a gray area, since tools like Ansible can now call AWS APIs and manage infrastructure directly. But historically, the configuration management era was about fighting configuration drift on existing servers, while the cloud era introduced the ability to declare entire environments as code. If you asked the vendors selling Chef, they would tell you Chef is "all about infrastructure as code" — but the original intent was different.

When to Automate — and When Not To

The hosts caution against automating too early. Andrey says he tends not to automate things until they genuinely need automation. If creating one cluster with a few nodes and one database is all you need, full automation may be premature. But if you know you will eventually manage hundreds or thousands, starting early makes sense. Julien reinforces this point with a memorable gym analogy: "You go to the gym, you see Arnold Schwarzenegger lifting 200 kilos from the ground and you say, he does it, I can do it. And then you pick up the little weight and find out that if you start with 200 kilos, you're gonna break your back." His point is that infrastructure as code tools get you up and running fast — that is what they are designed for — but day two operations always come knocking. The automation itself can become a burden if you are not careful about what you automate and when.

Infrastructure as Documentation and Source of Truth

Mattias describes one of his main reasons for using infrastructure as code: knowing what is actually running. He sees the codebase as documentation and as proof of the intended state of the environment — a way to verify that what he thinks is deployed matches what is actually in the cloud.
The hosts agree with that idea, but they also point out the tension between declared state and reality. If people still make manual changes in the cloud console, the code drifts away from what is actually running. Andrey notes the problem: if undocumented manual changes are not reflected back into code, the next infrastructure deployment could recreate the original broken state — "you're back to the fire state, basically."

The Terraform Complexity Problem

Julien brings up Terraform as "the elephant in the room" and argues that it has become significantly more complex over time. He says the language started out as purely descriptive, but newer features in HCL2 — such as for loops, conditionals, and sequencing logic — have pushed it closer to a general-purpose programming language. His concern is that this makes infrastructure definitions harder to read and reason about. Instead of simply describing desired state, users now have to mentally execute the code to understand what it will produce. Andrey agrees there is a legitimate need for this evolution — once a declarative setup grows large enough, you genuinely want loops and conditionals — but acknowledges it creates a tension between readability and expressiveness.

Declarative vs Imperative Approaches

The episode explores the difference between declarative and imperative models. Andrey explains that shell scripts are imperative — you tell the system exactly what to do, step by step — while a declarative tool lets a team state the desired outcome and rely on the platform to converge on that state. Kubernetes is presented as a strong example of the declarative model. You submit manifests that declare what you want, and operators work to make reality match that intent — not necessarily immediately, but as soon as all requirements are fulfilled. Andrey suggests infrastructure tooling may evolve in this direction, with systems that continuously enforce declared state rather than only applying changes on demand.
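The model Andrey points to is visible in any Kubernetes manifest: you declare a desired replica count and controllers keep converging on it, restoring it even after failures (names and image below are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api
spec:
  replicas: 3            # desired state; controllers make reality match
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: example/api:1.0
```

If a node dies and a pod disappears, nothing has to "run the script again": the controller notices the deviation from three replicas and corrects it.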
Andrey gives a security example: an intruder stops AWS CloudTrail, but a reactive system — like a Kubernetes operator — detects the deviation and turns it back on automatically. Julien adds that this is already happening. He mentions that a Kubernetes operator exists to bridge the gap to cloud APIs, allowing teams to define infrastructure resources inside Kubernetes YAML manifests and have the operator create them in the cloud. Google Cloud's Config Connector is a concrete example of this pattern, letting teams manage GCP resources as native Kubernetes objects.

Immutable Infrastructure and Emergency Changes

Andrey strongly advocates for immutable infrastructure: baking golden images using tools like Packer, deploying them as-is, and replacing systems rather than patching them in place. In that model, people should not be logging into systems or making changes manually. If you need a change, you bake a new image and roll it out. SSH should not even be enabled in a proper cloud setup. Mattias raises a practical challenge: in real incidents, people with admin access to the cloud console often need to click a button to resolve the problem quickly. He describes his own experience — the team started with read-only production access but had to grant write access once on-call responsibilities kicked in. Andrey agrees that teams should not be dogmatic when production is on fire: "You go and do whatever it takes to put fire down." But those emergency fixes must be reflected back into code, and the team must know exactly what was changed. Otherwise, the next deployment may recreate the original problem.

Culture and Process Matter More Than Tools

One of the clearest themes in the conversation is that infrastructure as code is not just a tooling choice. Julien argues that it does not matter what technology you use if your process and culture are not aligned with security and best practices: "You can fix the technology only so much, but it's mainly about people."
Mattias describes a setup where Jenkins applies all CloudFormation changes, and every modification to the cloud goes through pull requests, code review, and change management — the same workflow used for application code. This means infrastructure changes become auditable, reviewable, and easier to track. Andrey sees this as applying development principles to infrastructure: version history, visibility into who changed what, the ability to ask someone why they made a change, and code review before changes are applied.

Guardrails for Manual Changes

Andrey shares a practical example from a previous engagement where developers had near-admin access to the AWS console and would create EC2 instances, S3 buckets, and other resources outside of Terraform or CloudFormation. To control cost and reduce unmanaged resources, the team built a system using specific tags generated by a Terraform module. A Lambda function ran every night, scanned for resources without the required tags, posted a Slack notification saying "I found these, gonna delete them next day," and tagged them for deletion. The following night, anything still tagged for deletion was removed. This gave developers flexibility for experimentation — they could spin up resources manually and try things out — while preventing forgotten resources from becoming permanent, invisible infrastructure. It also helped keep costs under control.

Tooling Is Only the Start

Julien stresses that adopting infrastructure as code does not automatically make systems reliable, immutable, or resilient. In his view, it is "just the beginning of the journey." He warns against the myth that infrastructure as code equals immutable infrastructure — you can absolutely build stateful, mutable systems with code if you choose to. He also pushes back on the assumption that automation always saves time, admitting with self-awareness: "I automated a task, it took me two days to automate it, and I saved barely 10 seconds of my life."
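The two-night cleanup policy behind Andrey's guardrail can be sketched as a few lines of Python. This is a minimal sketch of the decision logic only; the tag names are hypothetical, and the real system ran as a Lambda against AWS tagging APIs:

```python
from datetime import date

# Hypothetical tag names; the real system used tags set by a Terraform module.
REQUIRED_TAG = "managed-by"
DELETION_MARK = "marked-for-deletion"

def triage(resources, today):
    """Apply the two-night policy: night one, mark untagged resources and
    collect them for a Slack warning; night two, delete anything still marked."""
    notify, delete = [], []
    for res in resources:
        tags = res.get("tags", {})
        if REQUIRED_TAG in tags:
            continue  # managed through IaC, leave it alone
        if DELETION_MARK in tags:
            delete.append(res["id"])  # warned yesterday, remove tonight
        else:
            tags[DELETION_MARK] = str(today)  # first sighting: mark it
            notify.append(res["id"])          # and warn in Slack
    return notify, delete
```

Isolating the policy like this keeps it testable without any cloud access; the surrounding Lambda would only fetch resources and post the notify list to Slack. Whether such a guardrail repays its build cost is exactly the ROI question Julien keeps returning to.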
Julien's advice is to measure the actual benefit rather than being seduced by the marketing brochure. Data will tell you more about a tool's real value than excitement will.

Abstraction, Code Generation, and Developer Experience

The hosts discuss the challenge of making infrastructure easy for developers who just want a database and a connection string, not a deep understanding of DBA work and security configuration. Andrey argues that abstracting best practices away from developers saves enormous organizational time, since developer time is expensive and infrastructure toil holds back feature delivery. He describes a third approach beyond declarative and imperative: code generators. Large companies with resources sometimes build internal generators that take simplified YAML inputs and output fully declarative specs. This creates another level of abstraction on top of existing tools, allowing developers to be productive without needing to understand infrastructure details. It is controversial — in some ways it takes power away from people — but it can dramatically simplify the developer experience.

Pulumi vs Terraform and Community Support

Andrey introduces Pulumi as an interesting new branch of infrastructure tooling that lets teams describe infrastructure in general-purpose languages like TypeScript, Python, or Go instead of domain-specific languages like HCL. He notes that while it feels familiar to developers — you stay in your comfort zone — you still need to learn a new SDK embedded in that language. It is "not entirely like you just described infrastructure in the language you know." Julien says he tried Pulumi and found it appealing for developers who want consistency across their codebase. But he remains cautious, arguing that "code is a liability" — referencing Kelsey Hightower's satirical GitHub project nocode ("Write nothing; deploy nowhere.") to make the point that less code means fewer problems.
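The generator pattern Andrey describes can be sketched in a few lines of Python: a simplified input, the only thing a developer writes, is expanded into a full CloudFormation-style spec with best practices baked in. The input shape, function name, and defaults below are all illustrative, not anything discussed on air:

```python
import json

def generate_bucket_spec(app):
    """Expand a simplified app description into a CloudFormation-style
    declarative spec (illustrative output format)."""
    logical_id = app["name"].capitalize() + "Bucket"
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketName": f"{app['name']}-{app.get('env', 'dev')}",
                    # Security defaults the developer never has to think about
                    "PublicAccessBlockConfiguration": {
                        "BlockPublicAcls": True,
                        "BlockPublicPolicy": True,
                    },
                },
            }
        },
    }

# A developer writes two fields; the generator emits the full declarative spec.
print(json.dumps(generate_bucket_spec({"name": "payments", "env": "prod"}), indent=2))
```

The output stays fully declarative, so the existing tooling and review workflow are unchanged; only the handwritten surface shrinks, which is one pragmatic answer to the "code is a liability" concern.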
For beginners, Julien recommends starting with Terraform or the native tooling from cloud providers, mainly because the community is larger, tutorials are more abundant, and there are meetup groups where people can learn from each other. His advice is pragmatic: "Just make sure that you ditch Terraform the minute it gets in your way."

Start With the Problem, Not the Tool

Andrey repeatedly returns to the same question: what problem is being solved? He argues that teams should choose tools based on business needs and existing team capabilities, not because a tool is fashionable. "A lot of people and developers, they like shiny tools — and there's nothing wrong about that — but you always have to ask, what is the problem we are solving?" Andrey's framing connects tool selection to team dynamics: if your team already knows a particular tool, relearning a new one just because it is trendy does not make sense. What you need is not a fancy tool but to deliver business value with the capabilities you have.

Migration, Legacy, and Incremental Adoption

The hosts acknowledge that many teams are not starting from a clean slate. Andrey points out that legacy infrastructure exists for a reason: it helped the business survive and grow. As Mattias puts it bluntly: "Legacy pays the bills." For organizations with years of manually built systems and hybrid environments, Andrey suggests doing value stream mapping to identify the biggest pain point and tackling that first. A greenfield project can serve as a success story to demonstrate the approach before trying to transform everything. He emphasizes that coming into an organization with a shiny idea and telling people "whatever you did before was crap" is a sure way to lose allies. Technology boils down to working with people — the tools are fine, but they do not replace the people running them.

The Templating Dilemma

Mattias raises a specific frustration with infrastructure as code: templating.
He likes looking at his Git repository and seeing exactly what is running, but heavy use of variables and templates means he sees placeholder names instead of actual values. This tension — between reusable, DRY templates and readable, concrete definitions — is a real challenge that the hosts acknowledge without a clean resolution.

Resilience and Recovery

Near the end of the episode, Andrey gives a concrete example of losing a Kubernetes cluster in production. Because the environment had been defined as code, the team was able to recreate it and recover in about one to two hours. Some things that were not properly documented slowed them down; with complete documentation the recovery could have been as fast as 15 to 20 minutes — mostly just waiting for AWS to provision the resources after the API calls. Julien adds context to this: he argues that even with infrastructure as code, recreating a Kubernetes cluster and failing over traffic while maintaining service is genuinely hard. The concept is sound, but building the safety net to actually do it takes time, practice, and a lot of work. His advice for building confidence is to adopt the mentality of immutable infrastructure and get into the habit of regularly recreating things and practicing failovers.

Final Advice

Andrey recommends education first. He specifically mentions the book Infrastructure as Code by Kief Morris (published by O'Reilly, now in its third edition) as a strong foundation. His broader advice: understand the domain, define the problem clearly, ask yourself what outcome you want to deliver for the business, and let the answers to those questions guide your tool decisions. Julien's closing thought is that in a large organization, a dedicated infrastructure team using infrastructure as code can manage everything — on-prem or cloud — with a single workflow. That team can abstract complexity so developers do not need to learn Terraform, CloudFormation, or any other tool.
The specialization pays off by reducing onboarding friction and letting each team focus on what they do best.

Highlights

Julien's gym analogy: "You go to the gym, you see Arnold Schwarzenegger lifting 200 kilos from the ground and you say, he does it, I can do it. And then you pick up the little weight and find out that if you start with 200 kilos, you're gonna break your back." His point: infrastructure as code tools make the start easy, but day two operations will humble you.

Julien on Terraform: He calls it "the elephant in the room" and argues that it has drifted from a descriptive language toward something more like a programming language, making it harder to reason about.

Julien on coding overhead: "Code is a liability" — referencing Kelsey Hightower's nocode project to argue that every line of infrastructure code carries long-term maintenance costs.

Julien on no silver bullets: "There is no silver bullets. Stop dreaming. Just see how much work it takes, how complicated it is."

Julien on automation ROI: "I automated a task, it took me two days to automate it, and I saved barely 10 seconds of my life."

Andrey on incidents: Teams should not be dogmatic — "If your production is on fire, you go and do whatever it takes to put fire down" — then put the fix back into code afterward.

Andrey on manual cloud resources: He describes a Lambda-based cleanup system that scanned for untagged AWS resources nightly, posted to Slack, and deleted them the next day if no one claimed them.

Andrey on shiny tools: "A lot of people and developers, they like shiny tools — and there's nothing wrong about that — but you always have to ask, what is the problem we are solving?"

Mattias on legacy: "Legacy pays the bills" — a reminder that existing infrastructure made the business successful, and it deserves respect during any modernization effort.
Andrey's recovery story: After losing a production Kubernetes cluster, the team recreated everything in one to two hours because it was defined as code — and could have done it in 15-20 minutes with better documentation.

Resources

Infrastructure as Code by Kief Morris (O'Reilly) — The book Andrey recommends as essential reading for practitioners. Now in its third edition, it covers patterns and practices for building and evolving infrastructure as code.

Pulumi — Infrastructure as Code in Any Programming Language — The tool discussed in the episode that lets teams define infrastructure using TypeScript, Python, Go, and other general-purpose languages instead of domain-specific languages like HCL.

AWS CloudFormation — AWS's native infrastructure as code service using declarative YAML/JSON templates, which Mattias and the team use with Jenkins for their deployment pipeline.

GCP Config Connector — The Google Cloud Kubernetes operator mentioned in the episode that lets teams manage GCP resources as native Kubernetes objects, bridging the gap between Kubernetes and cloud APIs.

Forseti Security (archived) — The GCP security scanning tool Julien describes that could detect policy violations (like open ports) and automatically revert changes. Originally developed by Spotify and Google, it was archived in January 2025.

Kelsey Hightower's nocode — The satirical GitHub project Julien references: "The best way to write secure and reliable applications. Write nothing; deploy nowhere." A humorous reminder that code is a liability.

HashiCorp Terraform — The infrastructure as code tool the hosts discuss extensively, including its evolution from a simple declarative language to the more complex HCL2 with loops and conditionals.
