Vibesre - Podcast Details

Shows

Platform Engineering Playbook Podcast
Replace 5 Databases with 1? SurrealDB for AI Agents Explained
Your AI agents are using five different databases right now - and you don't even know it. This database sprawl is silently killing your platform's performance and your team's sanity. In today's Platform Engineering Playbook, we dive deep into SurrealDB's multi-model approach and how it's revolutionizing AI infrastructure. Plus, breaking news on vulnerability management patterns that every platform engineer needs to understand. **What You'll Learn:** • Why database proliferation is the hidden killer of AI agent performance • SurrealDB's architecture deep dive and real-world deployment strategies • When (and when NOT) to consolidate your AI infras...

2026-02-18 · 19 min

Platform Engineering Playbook Podcast
Agoda’s API Agent Turns Any API into MCP — No Code, No Deployments
**What if API integration nightmares could disappear without writing a single line of code?** Agoda just dropped a game-changing solution that transforms any API into MCP (Model Context Protocol) with zero deployments - and it's about to reshape how platform teams approach AI integrations. In today's Platform Engineering Playbook, we break down this revolutionary no-code approach and explore what it means for enterprise platform strategies. Plus, we dive into Docker's latest sandbox capabilities with NanoClaw, performance testing breakthroughs for Identity Management systems using encrypted DNS in OpenShift, and the emerging patterns for running AI coding...

2026-02-17 · 18 min

Platform Engineering Playbook Podcast
LocalStack Kills Community Edition: What Breaks in March
**LocalStack just killed their open-source edition - but what does this really mean for your platform engineering stack?** In today's episode of Platform Engineering Playbook, we break down LocalStack's shocking decision to discontinue their Community Edition and what it means for teams relying on AWS local development. Plus, we dive into the ripple effects across the developer ecosystem and provide a practical decision framework for your next moves. **What You'll Learn:** • Why LocalStack's pricing shift from free to $39/month matters for platform teams • Decision frameworks for evaluating local development alternatives • How AI is rev...

2026-02-16 · 15 min

Platform Engineering Playbook Podcast
OpenTofu vs Terraform: What Enterprise Teams Are Actually Doing (2026)
**Is your infrastructure strategy about to become obsolete?** By 2025, half of all Terraform installations could be running OpenTofu - and the implications for platform engineering teams are massive. In today's deep dive, we break down the OpenTofu vs. Terraform battle that's reshaping infrastructure as code. You'll learn the real mechanics behind migrating between these tools, practical decision frameworks for enterprise teams, and why this choice could define your platform's next five years. **What You'll Learn:** • The technical and business drivers behind the OpenTofu fork • Step-by-step migration strategies and gotchas to avoid • How to eval...

2026-02-13 · 18 min

Platform Engineering Playbook Podcast
Why Databases Inside Kubernetes Are Becoming Technical Debt
**Is running databases in Kubernetes about to become legacy technical debt overnight?** By 2026, the inference cloud revolution is forcing platform engineers to completely rethink database architecture - and the implications are massive. In today's deep dive, we break down the "container paradox" that's reshaping how we think about stateful workloads in Kubernetes. You'll discover why the rise of AI inference is making traditional database-in-K8s patterns unsustainable and what this means for your platform strategy. **What You'll Learn:** • Why the inference cloud demands decoupled database architectures • A practical framework for assessing your statefulness spec...

2026-02-12 · 17 min

Platform Engineering Playbook Podcast
47% of CNCF Projects Slowed Down in 2025 — Why That’s Actually Good News
**Why did 47% of CNCF projects slow down their development velocity in 2025 — and why platform engineers should celebrate this trend?** In today's Platform Engineering Playbook, we decode what declining commit velocity across cloud native projects actually reveals about infrastructure maturity and what it means for your platform strategy. **What You'll Learn:** • How to interpret CNCF project velocity metrics as leading indicators for platform decisions • Why slower development cycles might signal stronger, more stable infrastructure foundations • Strategic insights for platform engineers navigating the evolving cloud native landscape • Breaking analysis of agentic AI transforming DevOps aut...

2026-02-11 · 18 min

Platform Engineering Playbook Podcast
The Claude Skills That Stop AI From Writing Dangerous Infrastructure as Code
**Are 87% of DevOps teams unknowingly creating security vulnerabilities with AI-generated infrastructure code?** Today's Platform Engineering Playbook dives deep into the hidden risks of AI in DevOps workflows and reveals the specialized skills that top-performing teams use to harness AI safely and effectively. **What You'll Learn:** • Why AI-generated infrastructure code is creating blind spot vulnerabilities • The 8 Claude skills that actually move the needle for DevOps engineers • How to identify and automate your repetitive workflows with AI guardrails • Breaking news: Cloud complexity becomes the #1 security threat • Cloudflare's new vertical microfrontend template for edge routi...

2026-02-10 · 19 min

Platform Engineering Playbook Podcast
Docker vs Nix: Why Your Builds Aren’t Actually Reproducible
97% of Docker containers can't reproduce the exact same build six months later—what does this mean for platform engineering, and why should you care? In today's episode of the Platform Engineering Playbook, we delve into the critical issue of reproducibility in Docker containers. Discover why this seemingly technical detail could significantly impact your workflows and productivity. We'll explore the limitations of traditional package managers and discuss how they can be a bottleneck in achieving true reproducibility. **Timestamps:** - **[00:00] Cold Open:** Dive into the startling statistic about Docker containers. - **[01:15] Intro:** Welcome and overview of...

2026-02-09 · 18 min

Platform Engineering Playbook Podcast
The Data Canary Pattern: How Netflix Prevents Bad Metadata Deploys
**What happens when 2 billion daily metadata events could crash Netflix's entire platform with one bad transformation?** Today's Platform Engineering Playbook dives deep into Netflix's Data Canary system - a masterclass in building trust and validation into your data pipelines at scale. Plus, we cover the latest platform engineering news that's reshaping how we deploy and monitor distributed systems. **What You'll Learn:** • How Netflix validates massive data transformations without risking production • Container readiness strategies for Spring Boot in Kubernetes environments • LinkedIn's redesigned SAST pipeline using GitHub Actions and CodeQL • Why GitOps is becoming...

2026-02-07 · 15 min

Platform Engineering Playbook Podcast
Claude Opus 4.6: The First AI That Feels Like a Teammate
**Claude Opus 4.6 just demolished GPT-4 on every coding benchmark - and it's about to reshape how we think about platform engineering automation.** In today's episode, we break down Anthropic's game-changing AI release and what it means for platform teams worldwide. We dive deep into the autonomous capabilities that could revolutionize how we handle infrastructure operations, but also explore the new risks this creates for production environments. **What You'll Learn:** • How Claude Opus 4.6's coding performance impacts platform tooling decisions • Why autonomous AI operations require new safety frameworks • Practical strategies for identifying AI automa...

2026-02-06 · 16 min

Platform Engineering Playbook Podcast
Autonomous AI in DevOps Is Here — And Most Teams Are Doing It Wrong
**Will 87% of DevOps teams really be obsolete by 2026?** As AI agents take control of production infrastructure, we're witnessing the biggest transformation in platform engineering history. In today's episode, we dive deep into **autonomous AI agents in DevOps workflows** and explore how they're reshaping everything from monitoring to incident response. You'll discover real-world examples of AI agents managing production systems, plus critical insights on when and how to safely implement these powerful tools in your own infrastructure. **What You'll Learn:** • How AI agents are revolutionizing observability and SRE practices • Practical implementation strategies for autonomous moni...

2026-02-05 · 19 min

Platform Engineering Playbook Podcast
Kubernetes Is Retiring Ingress NGINX (And 50% of Clusters Aren’t Ready)
"90% of Kubernetes clusters are running Ingress NGINX—abandoned in 16 months with zero maintainers left! What does this mean for your production systems? In this episode, we dive deep into the urgent need for migration and the alternatives available as the clock ticks down. With the retirement of Ingress NGINX set for March 2026, it's critical to understand how this affects millions of deployments worldwide. If you're among the half still relying on Ingress NGINX, you can't afford to miss this episode. 🔑 What you'll learn: - The migration timeline and key deadlines you need to know. ...

2026-02-04 · 19 min

Platform Engineering Playbook Podcast
OpenAI’s New macOS App: Is Agentic Coding Finally Here?
**OpenAI just made 73% of coding assistants obsolete overnight - but what does this mean for platform engineers?** Today's episode breaks down OpenAI's game-changing macOS app for "agentic coding" and its massive implications for platform engineering workflows. We'll analyze why this isn't just another coding assistant, but a fundamental shift in how we approach infrastructure automation and developer tooling. **What You'll Learn:** ✅ Deep dive into OpenAI's new agentic coding capabilities and competitive advantages ✅ Critical risks platform teams need to consider (hallucinations, security, dependency management) ✅ How enterprise desktop computing is shifting toward immutable Linux system...

2026-02-03 · 13 min

Platform Engineering Playbook Podcast
98% of Container CVEs Are Hiding Where You’re Not Scanning
**Are your container security scans missing 98% of critical vulnerabilities?** New research from Chainguard reveals a shocking blind spot that could be exposing your infrastructure to massive security risks. In today's Platform Engineering Playbook, we unpack this bombshell finding and explore why traditional container scanning is failing at scale. You'll discover where these hidden vulnerabilities are lurking, why your current tools aren't catching them, and most importantly - what you can do about it. **What You'll Learn:** • Why 98% of container CVEs hide outside the top 20 images • The computational costs of comprehensive vulnerability scanning • How to...

2026-02-02 · 13 min

Platform Engineering Playbook Podcast
Why Forward-Deployed Engineers Are Making $300K+ (And Why Companies Are Desperate for Them)
Why are forward-deployed engineers making 40% more than traditional backend developers, and why can't companies hire enough of them? In today's Platform Engineering Playbook, we dive deep into tech's hottest new role and explore three critical platform engineering developments reshaping the industry. **What You'll Learn:** • The explosive rise of forward-deployed engineers and why they're commanding premium salaries • Real-world case studies from Snowflake and financial services implementations • Three essential skill areas every successful FDE needs to master • How Artera is revolutionizing prostate cancer diagnostics with AWS architecture • Cloudflare's innovative approach to vertical microfront...

2026-01-31 · 11 min

Platform Engineering Playbook Podcast
AWS DevOps Agent in Production: What Most Teams Get Wrong
**Why do 73% of AWS DevOps Agent deployments crash and burn in their first week?** It's not what you think. In this episode of Platform Engineering Playbook, we uncover the hidden culprits behind these shocking failure rates and reveal the systematic approach that separates successful platform teams from the rest. **What You'll Learn:** • The real reasons AWS DevOps Agent deployments fail (hint: it's not the code) • How to transform your incident response from "crowded stadium chaos" to "conference room clarity" • A practical framework for optimizing on-call rotations and team structure • Production-ready deployment strategi...

2026-01-30 · 16 min

Platform Engineering Playbook Podcast
AI Agents Are Rewriting the SRE Playbook (For Better or Worse)
What if AI agents could flip the script on SRE work, turning 87% of firefighting into 87% prevention? That's exactly what's happening in the "agentic revolution" transforming platform engineering teams. In today's Platform Engineering Playbook, we dive deep into how AI agents are reshaping SRE workflows and what this means for your platform strategy. We'll cut through the hype to examine the real-world gap between vision and current reality, then identify which SRE tasks are actually ready for agent automation. **What You'll Learn:** • The three characteristics that make SRE tasks perfect candidates for AI automation • Why...

2026-01-29 · 15 min

Platform Engineering Playbook Podcast
DevOps Is Dead — Platform Engineering Replaced It
**DevOps is dead - and the companies that created it are the ones pulling the trigger.** But what's replacing it might be the most significant shift in software delivery since containerization. In today's Platform Engineering Playbook, we dive deep into how Internal Developer Platforms are fundamentally reshaping the DevOps landscape. We'll explore why platform engineering has shed its experimental status and become the new standard for scaling development teams. **What You'll Learn:** • The five critical red flags that signal your platform needs immediate attention • Why the "black box problem" is derailing developer productivity • How to navigate the ingress-nginx archival and tr...

2026-01-28 · 19 min

Platform Engineering Playbook Podcast
47 Countries Went Offline — What Platform Engineers Must Learn From It
**What happens when 47 countries lose internet access in just 3 months—and it's not cyberattacks?** Today's Platform Engineering Playbook dives deep into the shocking Q4 2025 internet disruption data that reveals critical infrastructure vulnerabilities every platform engineer needs to understand. We'll analyze how cable cuts, storms, and DNS failures brought down entire regions, and more importantly—which companies survived and why. **What You'll Learn:** • The hidden patterns behind massive internet outages that caught most teams off guard • How resilient platform architectures saved companies millions during widespread disruptions • Specific signals to monitor for early detection of infrast...

2026-01-27 · 19 min

Platform Engineering Playbook Podcast
Two Missing Characters Nearly Compromised AWS’s Supply Chain
**What if two missing characters could compromise every AWS-managed GitHub repository?** That's exactly what happened in a critical regex vulnerability that exposed massive supply-chain risks. In today's Platform Engineering Playbook, we break down this shocking security flaw and explore how platform engineers can protect their infrastructure from similar attacks. You'll discover the technical details behind the vulnerability, learn essential webhook security practices, and understand why regex validation is more critical than ever. **What You'll Learn:** ✅ How a simple regex pattern flaw created enterprise-wide security risks ✅ Webhook signature verification best practices ✅ AI-powered Linux securi...

2026-01-26 · 15 min

Platform Engineering Playbook Podcast
Kubernetes Just Became Essential for AI Growth (CNCF Report)
**Why will 90% of AI workloads fail without Kubernetes in the next 18 months?** Most platform teams are walking into a disaster they can't see coming. In today's Platform Engineering Playbook, we break down the CNCF's shocking new survey results showing 82% of organizations are unprepared for AI infrastructure demands. Plus, we cover the Cloudflare BGP incident that took down major services and what it means for your platform resilience. **What You'll Learn:** ✅ Why Kubernetes is becoming make-or-break for AI workloads ✅ The hidden performance bottlenecks killing AI model deployments ✅ Actionable audit checklist for your current K8s setup ✅ How organizational culture trumps t...

2026-01-25 · 18 min

Platform Engineering Playbook Podcast
ChatGPT Scales PostgreSQL to power 800 million users
OpenAI is running ChatGPT for ~800 million users on PostgreSQL — and according to their own disclosures, it’s actually working. In this episode of the Platform Engineering Playbook Daily Podcast, we break down how PostgreSQL was pushed to hyperscale, the architectural tradeoffs behind a single-primary model, and the operational playbook that makes this kind of scale possible. This isn’t a generic “Postgres is great” story. It’s a real-world look at what it takes to run open-source databases at extreme scale, and what platform engineers can learn from it. ⏱️ Episod...

2026-01-24 · 19 min

Platform Engineering Playbook Podcast
3 Skills You Need to Transition to Platform Engineer
**Will 70% of DevOps engineers disappear in the next 5 years?** That's the bold prediction kicking off today's deep dive into the massive career shift happening in tech right now. In this episode of Platform Engineering Playbook, we explore the critical transition from DevOps to Platform Engineering and what it means for your career survival. You'll discover why traditional DevOps roles are evolving, how companies like Spotify are leading this transformation, and the concrete roadmap you need to navigate this shift successfully. **What You'll Learn:** • Why the DevOps-to-Platform Engineering transition is inevitable • Real-world examples from indu...

2026-01-23 · 16 min

Platform Engineering Playbook Podcast
The Infrastructure Monitoring Tools Teams Regret Choosing
The monitoring tool everyone trusts is actually blind to 40% of your infrastructure failures—and the vendor knows it. Are you using an industry standard that misses almost half of all incidents? In this episode, we unravel the mystery of infrastructure monitoring tools and why your choice could be costing you dearly. As platform engineering teams grapple with an overwhelming array of options—from battle-tested open source tools to shiny SaaS platforms—the stakes have never been higher. The shift in focus from simple server monitoring to comprehensive observability is crucial for modern development. 🔑 What you’ll learn in...

2026-01-22 · 17 min

Platform Engineering Playbook Podcast
Your CI/CD Pipeline is a Debt Trap
**73% of engineering teams are drowning in technical debt because of their CI/CD pipelines. Not despite them—because of them.** Are your automation tools secretly sabotaging your codebase? Today's Platform Engineering Playbook dives deep into the hidden ways CI/CD pipelines create technical debt and reveals practical strategies to break the cycle. **What You'll Learn:** • Why inheritance beats copying in platform design • Docker's new hardened images for bulletproof container security • How OpenTelemetry's log deduplication processor can slash your log volume • Critical vulnerabilities in Chainlit and Cloudflare you need to patch NOW • Acti...

2026-01-21 · 11 min

Platform Engineering Playbook Podcast
Kubernetes Just Revolutionized Learning — Get Ahead Now!
**Are major tech companies secretly abandoning Kubernetes certifications?** What we discovered about the future of K8s learning will change how you approach platform engineering in 2026. In today's Platform Engineering Playbook, we uncover why traditional Kubernetes education is becoming obsolete and what platform teams are doing instead. Plus, breaking news that could revolutionize your infrastructure stack. **What You'll Learn:** • Why the volume of Kubernetes resources reveals a hidden shift in the industry • Microsoft's game-changing Azure Functions announcement for Model Context Protocol servers • How Pinterest's Moka is rewriting big data processing rules with Kubern...

2026-01-20 · 17 min

Platform Engineering Playbook Podcast
How AWS's New Euro Cloud Changes Data Control Forever
"92% of European companies don’t trust US cloud providers with their data anymore. So, AWS just locked itself out of its own Euro Cloud! This shocking move raises critical questions about data sovereignty and compliance for businesses operating in Europe. In this episode, we dive deep into AWS's groundbreaking decision to create a completely isolated European cloud infrastructure, one that even Amazon employees can't access. Why would they cut off their own access, and what does this mean for your data strategy? 🔑 Learn about the implications of AWS's European Sovereign Cloud and how it represents a shif...

2026-01-19 · 16 min

Platform Engineering Playbook Podcast
Why Pulumi's New Move Could Change Terraform Forever
Terraform’s biggest competitor just made a move that could redefine infrastructure-as-code in 2026. Pulumi now runs Terraform and HCL natively—better than HashiCorp does. That’s not a migration tool, not a compatibility shim, but full native execution through the Pulumi engine, plus Terraform state hosted in Pulumi Cloud and financial credits to help teams exit existing HashiCorp contracts. In this episode of the Platform Engineering Playbook Daily Podcast, we break down why this announcement is one of the most important platform engineering stories of the year—and what it actually means for SREs, platform teams, a...

2026-01-18 · 15 min

Platform Engineering Playbook Podcast
Astro Joins Cloudflare: What It Means for Platform Engineers
Cloudflare acquires the Astro Technology Company, adding a 1M-downloads-per-week web framework to their edge platform. We analyze the strategic implications, what stays open source, and lessons about framework sustainability for platform engineering teams. Key Topics: - Astro framework overview: islands architecture, framework-agnostic components, content-first approach - Why Cloudflare acquired Astro: Developer ecosystem capture, edge compute alignment, workerd integration - Open source sustainability: MIT license preserved, historical patterns (Gatsby, Remix) - What changes for platform teams: Framework evaluation criteria, portability concerns, exit strategies - News: AWS European Sovereign Cloud, Let's Encrypt 6-day certs...

2026-01-17 · 13 min

Platform Engineering Playbook Podcast
ScyllaDB X Cloud Challenges DynamoDB Cost and Performance
ScyllaDB just launched X Cloud with claims of double the performance at half the cost compared to DynamoDB. This episode breaks down the technical architecture behind their tablet-based approach, how they're achieving 80% data compression on ARM Graviton4 instances, and when this actually makes sense for platform engineering teams running high-throughput workloads. Key Topics: - ScyllaDB X Cloud tablet-based architecture (5GB chunks) vs traditional consistent hashing - Claims of 6x performance improvement with 50% cost reduction vs DynamoDB - 80% compression on ARM Graviton4 instances, 25x faster data streaming - High-throughput workload targets: Discord, Disney, Starbucks...

2026-01-16 · 11 min

Platform Engineering Playbook Podcast
Invisible Linux Malware: The Undetectable Threat to Your Cloud Infrastructure
Your Linux servers aren't just running containers anymore—they're hosting invisible tenants that security teams can't even detect. In this episode, we deep dive into VoidLink, the new cloud-native malware framework that Check Point Research just uncovered. This isn't your typical malware that got retrofitted for the cloud—this thing was born in the cloud, designed from the ground up to evade every detection tool in your security stack. We explore: • How VoidLink achieves its terrifying persistence in cloud environments • Why every major cloud provider is vulnerable to this new threat class • eBPF-based...

2026-01-15 · 16 min

Platform Engineering Playbook Podcast
The AI-Cloud Native Symbiosis - How Intelligent Infrastructure is Transforming Platform Engineering
By 2025, 90% of new enterprise applications will be AI-powered and cloud-native. This episode explores the symbiotic relationship between AI and Kubernetes - where AI isn't just another workload, but is fundamentally transforming how we build and operate cloud native platforms. We cover real-world examples like Netflix's predictive scaling achieving 92% accuracy, the emergence of AI-driven observability platforms, and why platform engineers need to evolve from infrastructure operators to AI-infrastructure orchestrators. In this episode: - AI transforming the Kubernetes control plane with predictive scheduling - Netflix's AI-driven traffic management: 92% prediction accuracy, 35% resource reduction - AI-native observability: anomaly...

2026-01-14 · 14 min

Platform Engineering Playbook Podcast
MIT 10 Breakthrough Technologies 2026 - The Platform Engineering Perspective
MIT just released their 10 Breakthrough Technologies for 2026 - and three of them are infrastructure problems that platform engineers are solving right now. This episode explores hyperscale AI data centers consuming 96 GW globally by 2026, vibe coding with 41% of code now AI-generated, and LLM interpretability research from Anthropic. We break down how platform engineers enable these breakthroughs through power-aware scheduling, AI coding guardrails, and new observability patterns for ML systems. In this episode: - Hyperscale AI data centers: 96 GW capacity, $600B capex, 100+ kW per rack - Vibe coding: 92% developer AI adoption, GitHub Copilot at 20M users ...

2026-01-13 · 20 min

Platform Engineering Playbook Podcast
AWS Route 53 Global Resolver - Enterprise DNS Security at the Edge
Every DNS query your hybrid environment makes could be exposing sensitive data. AWS Route 53 Global Resolver, announced at re:Invent 2025, combines anycast routing, encrypted DNS protocols (DoH/DoT), and managed threat filtering in a single service. In this episode, we cover: - Anycast DNS architecture routing to nearest of 11 AWS regions - DoH and DoT encrypted DNS protocol support - AWS RAM authorization for multi-account private hosted zones - DNS filtering with managed threat lists - Implementation patterns for hybrid environments and remote workforces - Query logging for security visibility and...

2026-01-12 · 20 min

Platform Engineering Playbook Podcast
Kubernetes Upcoming Features Deep Dive - Extended Toleration Operators and Mutable PV Node Affinity
There's a Kubernetes cluster out there right now burning ten thousand dollars a month on GPU nodes that sit idle sixty percent of the time. Why? Because the scheduler can't say "only schedule pods on nodes with MORE than four GPUs." It's 2026, and our scheduler still can't count. But that's about to change. In this episode, we dive deep into two alpha features in Kubernetes 1.35 that represent a fundamental shift in how Kubernetes handles scheduling and storage: **Extended Toleration Operators (KEP-5471)** - Finally, numeric threshold-based scheduling with taints. New Gt (greater than) and Lt (less...

2026-01-11 · 41 min

Platform Engineering Playbook Podcast
Why Is a 2016 AWS Instance Still the Best Value? (Cloudspecs Research)
New research from TUM reveals uncomfortable truths about cloud hardware stagnation. The paper "Cloudspecs: Cloud Hardware Evolution Through the Looking Glass" shows that the best-performing AWS instance for NVMe I/O per dollar was released in 2016 - and nothing since has come close. In this episode: • CIDR 2026 research from Technical University of Munich • AWS i3 instances from 2016 still beat all newer options for storage price-performance • CPU gains: 10x cores, but only 2-3x cost-adjusted improvement • Memory crisis: DRAM capacity per dollar has "effectively flatlined" • Network is the only bright spot: 10x improvement per dollar...

2026-01-10 · 20 min

Platform Engineering Playbook Podcast
Iran IPv6 Blackout - When Governments Weaponize Protocol Transitions
The same IPv6 transition your infrastructure team has been procrastinating on is now being weaponized by governments. On January 8, 2026, Iran's IPv6 address space dropped 98.5% while IPv4 remained intact—a surgical strike against mobile users. In this episode, we break down: - Why blocking IPv6 specifically targets mobile users (hint: carrier NAT exhaustion) - The BGP mechanics of protocol-specific blocking - "Engineered degradation" vs total blackout—the new censorship playbook - How Starlink terminals are changing the calculus for authoritarian internet control - What platform engineers need to know: protocol-specific monitoring, Happy Eyeballs test...

2026-01-09 · 24 min

Platform Engineering Playbook Podcast
Venezuela BGP Anomaly - Deep Technical Analysis
A deep technical dive into the January 2026 Venezuela BGP route leak incident. Was it a cyberattack? The technical evidence says no - and that's actually more concerning. In this special deep-dive episode (no news segment), Jordan and Alex break down: - What actually happened on January 2, 2026 with AS8048 (CANTV, Venezuela's state ISP) - Why 10x AS-path prepending proves this was misconfiguration, not a man-in-the-middle attack - How BGP valley-free routing works and why Type 1 Hairpin leaks happen - The pattern of 11 similar leaks from CANTV since December 2025 - Why your multi-region...

2026-01-08 · 28 min

Platform Engineering Playbook Podcast
HolmesGPT: AI Root Cause Analysis for Kubernetes
Deep dive into HolmesGPT, the CNCF Sandbox AI agent that revolutionizes cloud-native troubleshooting. This episode covers what it is, its 40+ integrations, the project roadmap, and how to set it up today. News Segment: • AirFrance-KLM's secure automation platform with Terraform, Vault, and Ansible • AWS ECS tmpfs mounts on Fargate for secure secrets handling • Qwen 30B running on Raspberry Pi - democratizing edge AI • AWS European Sovereign Cloud with independent EU governance Main Topic - HolmesGPT: • CNCF Sandbox project (accepted October 2025) with 1,600+ GitHub stars • Agentic architecture: creates investigation task lists, queries systems, synthesizes findings • 40+ built-in toolsets...

2026-01-08 · 25 min

Platform Engineering Playbook Podcast
Docker Kanvas: Infrastructure as Design
Docker just launched Kanvas, a visual tool that turns your architecture diagrams into deployable infrastructure. Built on Meshery (CNCF's 6th highest-velocity project), it converts Docker Compose files to Kubernetes manifests and challenges Helm and Kustomize dominance. In this episode, we explore: - The dev-to-prod gap that Kanvas solves - How Meshery Models add semantic understanding to infrastructure - Designer Mode vs Operator Mode capabilities - When to use Helm vs Kustomize vs Kanvas - Practical adoption strategies for platform teams Whether you're struggling with YAML hell or looking to lower...

2026-01-07 · 23 min

Platform Engineering Playbook Podcast
Remote MCP Architecture - Running AI Tool Servers on Kubernetes
The MCP server registry hit 10,000+ integrations, but most teams are running these servers on laptops. This episode breaks down the production architecture that Google, Red Hat, and AWS are converging on: remote MCP servers deployed on Kubernetes. We cover three deployment patterns (local stdio, remote HTTP/SSE, and managed), the critical difference between wrapper-based and native API implementations, and a defense-in-depth security model using dedicated ServiceAccounts, time-bound tokens, RBAC, and audit logging. In this episode: - Remote MCP is production MCP—local stdio mode is for experimentation only; team-scale access requires HTTP/SSE mode - Na...

2026-01-0623 min

Platform Engineering Playbook Podcast AWS DevOps Agent - Promises vs Reality
AWS launched DevOps Agent at re:Invent 2025 as an "autonomous on-call engineer." But before you cancel your PagerDuty subscription, we separate marketing from mechanics. NEWS THIS EPISODE: • KubeCon Europe 2026: March 23-26 in Amsterdam, 224 sessions across 5 tracks • Platform Engineering 2026 Predictions: Agentic infrastructure becomes standard In this deep-dive episode, we cover: WHAT IT PROMISES: • Always-on AI that investigates incidents 24/7 • Automatic root cause analysis across logs, metrics, traces, and deployments • Mitigation plan generation with step-by-step remediation • Integration with CloudWatch, Datadog, Dynatrace, New Relic, Splunk WHAT IT ACTUALLY DELIVERS: • Agent...

2026-01-05 · 26 min

Platform Engineering Playbook Podcast AWS Graviton5: 192 Cores, 5x Cache - ARM Takes Over the Data Center
AWS doubled the core count on their flagship ARM processors with Graviton5: 192 cores in a single socket, 5x L3 cache (180MB), and 3nm fabrication. We go deep on ARM vs x86 architecture, cache hierarchy latencies, NUMA elimination benefits, formal verification security proofs, and a complete migration framework with multi-arch CI/CD patterns. With 98% of top EC2 customers already on Graviton, the ARM tipping point is now. Duration: ~22 minutes This episode covers: - 192-core single socket design eliminating NUMA overhead - 180MB L3 cache enabling database working sets to fit entirely in cache ...

2026-01-04 · 23 min

Platform Engineering Playbook Podcast Can OpenTelemetry Save Observability in 2026?
OpenTelemetry has won the instrumentation wars with 95% adoption predicted for 2026. But winning data collection doesn't solve observability's real problems: spiraling costs, declining signal-to-noise ratios, and too much distance between seeing a problem and fixing it. In this episode, we break down: • Netflix's evolution to high-cardinality analytics processing 1M+ spans per episode • The cost-control chokepoint that OTel enables for telemetry optimization • Why 40% of organizations are targeting autonomous remediation by end of 2026 • How SLOs are becoming business conversations, not just engineering metrics Plus news on GitHub Actions 39% pricing reduction and Jaeger v2.14.0 legacy removal...

2026-01-03 · 17 min

Platform Engineering Playbook Podcast When Serverless Fails: Unkey's 6x Performance Migration to Containers
Why did an API key management platform abandon edge serverless for stateful containers? Unkey hit 30ms p99 cache latency when they needed sub-10ms, so they rebuilt everything on AWS Fargate. This episode covers the technical decision-making framework for choosing between serverless and containers, plus a deep dive into Kubernetes 1.35's new structured z-pages for debugging. In This Episode: - The serverless constraint: stateless = network request for every cache read - Unkey's complexity tax: Workers, Durable Objects, Queues, custom proxies - The container solution: Fargate + Global Accelerator = 6x performance - Decision framework: latency ta...

2026-01-02 · 19 min

Platform Engineering Playbook Podcast From Alert Fatigue to Signal-Driven Ops: The Observability Shift
Why do 73% of organizations experience outages from alerts they ignored? This episode breaks down the technical shift from reactive thresholds to SLO-driven observability. Learn multi-window burn-rate alerting patterns, AIOps implementations that actually work, and an 8-week migration path to cut alert noise by 80%. In This Episode: - The alert fatigue paradox: 2000+ weekly alerts with only 3% actionable - Technical causes: static thresholds, compound rule blind spots, alert storms - SLO-driven observability: error budgets and multi-window burn-rate alerting - AIOps patterns that work: anomaly detection, event correlation, RCA acceleration - Practical 8-week migration path...

2026-01-01 · 21 min

Platform Engineering Playbook Podcast Security Ops Specialty: The Underrated Skill Every Platform Engineer Needs in 2026
Platform engineers who understand security operations (secrets management, vulnerability scanning, and compliance automation) are commanding premium salaries in 2026. This episode breaks down the security ops specialty: what it includes, why organizations are desperate for it, and how to build these skills alongside your existing platform engineering expertise. In this episode: • Security ops specialty encompasses secrets management, vulnerability scanning, policy-as-code, and compliance automation • Organizations are struggling to find platform engineers with security depth, creating a supply-demand gap • The 2025 State of Secrets report shows 70% of organizations experienced a secrets-related incident • Key tools include HashiCorp Vault, Trivy, OPA/Gat...

2025-12-31 · 19 min

Platform Engineering Playbook Podcast Agentic AI Foundation - MCP and the Future of AI-Native Platform Engineering
The Linux Foundation announced the Agentic AI Foundation (AAIF) on December 9, 2025, bringing together AWS, Anthropic, Google, Microsoft, OpenAI, Block, Cloudflare, and Bloomberg. This episode breaks down MCP (Model Context Protocol) - the "HTTP for AI" with 97M+ monthly downloads. 📰 NEWS: Docker hardened images now free, MongoBleed CVE patch alert, Cloudflare "Fail Small" resilience plan, DORA metrics with Process Behavior Charts 🎯 Key Topics: • What AAIF and MCP mean for platform teams • MCP architecture: Hosts, Clients, and Servers • The N×M to N+M integration simplification • Security: OAuth flows, permission scopes, audit logging • Practical...

2025-12-30 · 14 min

Platform Engineering Playbook Podcast FinOps 2026 for Platform Engineers: The Complete Skills Guide
FinOps is becoming an essential skill for platform engineers in 2026. This episode provides a complete guide to the skills, certifications, and tools you need to add cloud cost management to your platform engineering toolkit. 📰 News Segment: • GPG.fail documents 14 critical GnuPG vulnerabilities - check your signing tools • MongoBleed CVE-2025-14847: Critical MongoDB exploit - patch immediately • The Dangers of SSL Certificates: Catastrophic failure modes in automation • Google Multi-Cluster Orchestrator: Cross-region K8s management (KubeCon 2025) • GPG cleartext signature parsing vulnerabilities found 💡 Key Takeaways: • Platform teams own 70%+ of cloud spending decisions • FinOps + Pla...

2025-12-29 · 16 min

Platform Engineering Playbook Podcast Platform Engineering Salary Report 2026: Skills That Pay
Platform engineers are commanding $172K-$207K in 2026, a 13-27% premium over DevOps roles. This episode breaks down salary benchmarks from Dice, Motion Recruitment, and Levels.fyi, revealing which skills are S-tier ($200K+) and which are table stakes. We cover: - Platform Engineer vs DevOps salary gap (13-27% premium) - S-tier skills: LLM/GenAI ($195K-$312K), Platform Engineering, DevSecOps, MLOps - A-tier skills: Kubernetes + CKA, Go/Golang, FinOps, OpenTelemetry - Entry-level hiring crisis (-25% to -50% at major tech) - Geographic salary shifts: Atlanta +13.9%, Silicon Valley -7.3% - Top certification ROI: CKA...

2025-12-28 · 17 min

Platform Engineering Playbook Podcast Platform Engineering 2026 Predictions Roundup (Platform Engineering 2026 Look Forward Series - Part 5/5)
The series finale of our five-part Platform Engineering 2026 Look Forward Series. We synthesize everything from agentic AI operations, mainstream adoption, developer experience metrics, and boring Kubernetes into ten concrete predictions for 2026. Learn what to invest in versus ignore, and discover our 2026 platform engineering thesis. In this episode: - High confidence predictions: IDP market consolidates into 3 tiers, AI-assisted operations becomes table stakes, policy-as-code becomes table stakes - Medium confidence predictions: Talent gap peaks H1 2026 then stabilizes, "Platform team of one" becomes technically viable - INVEST IN: Developer experience measurement, self-service capabilities, golden paths, AI-assisted incident...

2025-12-27 · 16 min

Platform Engineering Playbook Podcast Kubernetes Enters the Boring Era (Platform Engineering 2026 Look Forward Series - Part 4/5)
The best thing happening to Kubernetes in 2026 is that it's becoming boring. After a decade of explosive innovation, Kubernetes is entering its "mature infrastructure" phase - stable, predictable, and increasingly invisible. Like Linux and PostgreSQL before it, boring Kubernetes enables platform teams to build abstractions without worrying about breaking changes. Part of the Platform Engineering 2026 Look Forward Series. In this episode: - Boring infrastructure is mature infrastructure - Linux and PostgreSQL became boring, then conquered the world - K8s 1.32-1.35 pattern: incremental stability, small refinements, no paradigm shifts - Innovation is moving up...

2025-12-26 · 14 min

Platform Engineering Playbook Podcast Developer Experience Metrics Beyond DORA (Platform Engineering 2026 Look Forward Series - Part 3/5)
DORA metrics revolutionized how we measure DevOps performance, but they have a critical blind spot: they tell you how your delivery pipeline is performing, but not how your people are doing. This episode explores the SPACE framework, DX Core 4, cognitive load measurement, and the HEART framework for platform teams. Part of the Platform Engineering 2026 Look Forward Series. In this episode: - DORA tells you the what but not the how or the at what cost - teams can hit every DORA metric while engineers burn out - SPACE framework: Satisfaction, Performance, Activity, Communication, and Efficiency...

2025-12-24 · 13 min

Platform Engineering Playbook Podcast Platform Engineering Goes Mainstream in 2026 (Platform Engineering 2026 Look Forward Series - Part 2/5)
Episode 2 of our 5-part "Platform Engineering 2026 Look Forward Series" examines the macro trend: platform engineering crossing the chasm to mainstream adoption. Gartner predicts 80% of software engineering organizations will have platform teams by 2026. The CNPE certification launched at KubeCon 2025. But there's a 56% talent gap and nearly half of initiatives run on under $1M annually. We address the "DevOps rebranding" debate with a 5-question litmus test: 1. Do you have internal customers (developers)? 2. Do you measure developer satisfaction? 3. Do you have a product roadmap? 4. Can developers self-serve without tickets? 5. Do you deprecate platform...

2025-12-23 · 16 min

Platform Engineering Playbook Podcast Agentic AI Transforms Platform Operations in 2026 (Platform Engineering 2026 Look Forward Series - Part 1/5)
Episode 1 of our 5-part "Platform Engineering 2026 Look Forward Series" tackles the hottest debate in platform engineering: will AI agents replace us or amplify us? AWS Frontier Agents can reason across 30+ steps. The MLOps market hits $129 billion by 2028. Netflix AI triage cuts MTTR by 40%. But where are the hard limits? We introduce the 60/30/10 Framework: - 60% Delegate: Log analysis, runbook execution, cost optimization - 30% Augment: Incident response, capacity planning (AI suggests, human confirms) - 10% Guard: Architecture decisions, security posture, novel failures The key insight: the 20% AI can't do is 80% of the value.

2025-12-22 · 21 min

Platform Engineering Playbook Podcast CNPE (Certified Cloud Native Platform Engineer) Certification Study Guide
The CNPE (Certified Cloud Native Platform Engineer) exam launched November 11, 2025 at KubeCon Atlanta, becoming the first hands-on platform engineering certification in five years. This deep dive covers exam format, all five domains, and a complete study guide. Key Points: • CNPE is hands-on: 17 tasks in 2 hours, 64% pass score • Five domains: GitOps/CD (25%), Platform APIs (25%), Observability (20%), Architecture (15%), Security (15%) • BACK stack: Backstage, Argo CD, Crossplane, Kyverno • Golden Kubestronaut requires CNPE after March 2026 • Career impact: Platform engineer salaries $160K-$220K Resources: • Episode page: https://platformengineering.org/podcasts/00066-cnpe-certification-study-guide • CNPE Exam: https://training.linuxfoundatio...

2025-12-21 · 18 min

Platform Engineering Playbook Podcast Kubernetes 1.35 Timbernetes Deep Dive: Breaking Changes, In-Place Resize GA, Gang Scheduling
Kubernetes 1.35 "Timbernetes" dropped on December 17, 2025, fundamentally changing how we operate clusters. This deep dive covers the 60 enhancements, 3 breaking changes that will bite you if unprepared, and in-place pod resize graduating to GA after six years of development. What You'll Learn: • Breaking Changes: cgroup v1 REMOVED (not deprecated), containerd 1.x EOL, IPVS deprecated • In-Place Pod Resize GA: Resize CPU/memory without pod restart - 6 years from KEP to stable • Pod Certificates Beta: Native kubelet-managed mTLS for zero-trust pod-to-pod auth • Gang Scheduling Alpha: Native all-or-nothing scheduling for AI/ML distributed training • Alpha Features: Node Declared F...

2025-12-20 · 19 min

Platform Engineering Playbook Podcast Terraform Stacks + Native Monorepo Support: HashiCorp's Answer to IaC Complexity
No more copy-paste configs. No more manual state management. Terraform just went component-based. HashiCorp released native monorepo support and Terraform Stacks to GA on September 25, 2025. This is the biggest architectural shift since Terraform modules. Instead of directory-per-environment with duplicate configurations, you define components once and deploy multiple times with isolated state. We explain components (lifecycle-aware resource groups in .tfstack.hcl files), deployments (isolated instances with separate state), orchestration rules (context-aware automated approvals), linked stacks (declarative cross-stack dependencies), migration paths from Terragrunt, and when platform teams should adopt. NEWS SEGMENT: • Terraform Stacks + Monorepo (GA...

2025-12-20 · 17 min

Platform Engineering Playbook Podcast 95% Fewer CVEs, $0 Cost: Docker Just Open-Sourced Enterprise Security
Supply chain attacks cost $60 billion in 2025. Docker just made the solution free. On December 17, Docker released 1,000+ hardened container images under Apache 2.0, previously a paid offering. Independent penetration testing by SRLabs confirmed 95% CVE reduction and found NO root escapes or container breakouts. These images use distroless runtime: no shell, no package manager, no attack surface. We break down how distroless actually works (why removing /bin/sh matters), SLSA Level 3 cryptographic provenance, SBOM/VEX for killing alert fatigue, multi-stage build migration patterns, debugging without a shell (kubectl debug), and how Docker compares to Chainguard Wolfi, Google distroless, an...

2025-12-19 · 18 min

Platform Engineering Playbook Podcast Kubernetes 1.35 "Timbernetes" - The End of the Pod Restart Era
Kubernetes 1.35 is here, and it changes everything about pod lifecycle management. In this episode, we break down the release that finally lets you scale pods without restarting them. In This Episode: - In-Place Pod Vertical Scaling goes GA - adjust CPU/memory without pod restarts - Breaking changes: cgroup v1 removed, containerd 1.x EOL, IPVS deprecated - Pod Certificates (beta) for native workload identity without cert-manager - 60 enhancements: what matters for platform teams - Practical upgrade checklist and timing guidance News Segment: - Docker makes 1,000+ hardened container images free (95...

2025-12-18 · 15 min

Platform Engineering Playbook Podcast 40,000x Fewer Deployment Failures: How Netflix Adopted Temporal
Netflix reduced their deployment failures by 40,000x using Temporal. In this episode, we break down how they achieved this remarkable improvement and what it means for your platform engineering practice. In This Episode: - Netflix's deployment reliability problem: 4% failure rate from transient cloud operations - What is durable execution? Write code as if failures don't exist - Temporal vs AWS Step Functions vs Apache Airflow vs Cadence comparison - Netflix's Spinnaker/Clouddriver implementation with 2-hour fix-forward window - When Temporal is (and isn't) the right choice for your organization Key...

2025-12-17 · 17 min

Platform Engineering Playbook Podcast Kubernetes: Helm vs Crossplane vs kro (Honest Comparison)
48% of Kubernetes users struggle with tool choice. That's nearly half of us paralyzed by options. So when AWS adopted kro alongside Argo CD, we had to ask: is this the Goldilocks solution we've been waiting for? In this episode, Jordan and Alex tackle the composition tool landscape with an honest decision framework. We dive deep into CEL expressions, resource graph mechanics, and GitOps integration. We also give Viktor Farcic's criticism a fair hearing, and explain exactly when kro makes sense - and when it doesn't. News Segment: • Shai-Hulud npm supply chain attack postmortem - 500+ pa...

2025-12-16 · 22 min

Platform Engineering Playbook Podcast Platform Engineering 2025 Year in Review
2025 was the year platform engineering grew up and got a reality check. AI entered infrastructure in ways we couldn't ignore. Industry consensus finally emerged on what platforms should actually do. And Cloudflare went down six times to remind us that concentration risk isn't just theoretical. In this special year-in-review episode, we look back at the ten stories that defined platform engineering in 2025: ✅ AI-native Kubernetes arrived (DRA GA, AI Conformance v1.0) ✅ Platform engineering reached consensus, but 70% still fail ✅ Infrastructure concentration risk became undeniable (AWS + Cloudflare) ✅ IngressNightmare exposed 43% of cloud environments ✅ Open source sustainability...

2025-12-15 · 19 min

Platform Engineering Playbook Podcast Okta's GitOps Journey - Scaling ArgoCD from 12 to 1,000 Clusters
In five years, Okta scaled Auth0's private cloud from 12 to 1,000+ Kubernetes clusters using ArgoCD. At KubeCon 2025, engineers Jérémy Albuixech and Kahou Lei shared their hard-won lessons. This episode breaks down the challenges, solutions, and practical wisdom for scaling GitOps to enterprise levels. Full episode page: https://platformengineeringplaybook.com/podcasts/00058-okta-gitops-argocd-1000-clusters In this episode, we cover: - The 83x scaling journey: from 12 clusters in 2020 to 1,000+ in 2025 - Five major challenges at scale: controller degradation, centralized bottlenecks, application explosion, global latency, observability gaps - Five key solutions: controller sharding, ArgoCD Ag...

2025-12-14 · 15 min

Platform Engineering Playbook Podcast Platform Engineering Team Structures That Work
Ninety percent of organizations now have platform teams, but most just renamed their ops team and expected different results. This episode breaks down the team sizes, reporting structures, and interaction patterns backed by DORA 2025 data that separate successful platform teams from glorified ticket handlers. Full episode page: https://platformengineeringplaybook.com/podcasts/00057-platform-engineering-team-structures In this episode, we cover: - DORA 2025 shows 90% of orgs have platforms, 76% have dedicated teams; when done right, 8% individual productivity boost and 10% team productivity boost - Optimal team size is 6-12 people (Spotify squads, Microsoft 5-9): small enough for ownership, large enou...

2025-12-13 · 17 min

Platform Engineering Playbook Podcast CDKTF Deprecated - The End of HashiCorp's Programmatic IaC Experiment
HashiCorp (now IBM) has officially archived the CDK for Terraform project, ending a five-year experiment in programmatic infrastructure-as-code. Full episode page: https://platformengineeringplaybook.com/podcasts/00056-cdktf-deprecated-iac-migration In this episode, we break down: - Why CDKTF failed to find product-market fit (243K downloads vs Pulumi's 1.1M) - The four key factors behind the deprecation: Pulumi's head start, JSII complexity, HCL "good enough", IBM acquisition timing - Community reaction and the "rug pull" sentiment - Migration paths: HCL (cdktf synth --hcl), Pulumi, OpenTofu, or AWS CDK - What platform engineers should learn...

2025-12-12 · 14 min

Platform Engineering Playbook Podcast stern v1.33.1 - Listen to the Docs with AudioDocs
🎧 AUDIODOCS: Official documentation of popular open-source projects, adapted and narrated for audio. Learn while commuting, exercising, or doing chores. Stop juggling terminal windows to tail Kubernetes logs. stern lets you tail multiple pods and containers simultaneously with regex queries, auto-detection of new pods, and color-coded output. This episode covers everything from basic usage to advanced templates and filtering. WHAT YOU'LL LEARN: 00:00 - Introduction & The Problem stern Solves 01:30 - Basic Usage: Regex and Resource Queries 03:00 - Multi-Container Tailing & Filtering 04:30 - Namespace, Label, and Node Filtering 06:00 - Output Formatting & Custom Templates 07:30 - T...

2025-12-11 · 15 min

Platform Engineering Playbook Podcast CoreDNS v1.13.1 - Listen to the Docs with AudioDocs
🎧 AUDIODOCS: Official documentation of popular open-source projects, adapted and narrated for audio. Learn while commuting, exercising, or doing chores. Master CoreDNS, the default DNS server for Kubernetes clusters. This 72-minute episode covers the complete v1.13.1 documentation - from plugin architecture to production configuration. Every time a pod looks up a service, every time kubectl exec needs to find a pod - CoreDNS handles that resolution. If you're debugging DNS issues or optimizing cluster performance, this comprehensive audio guide has you covered. WHAT YOU'LL LEARN: 00:00 - Introduction & Overview 02:30 - Project Context: CNCF Gra...

2025-12-11 · 1h 13

Platform Engineering Playbook Podcast kubectx & kubens v0.9.5 - Listen to the Docs with AudioDocs
🎧 AUDIODOCS: Official documentation of popular open-source projects, adapted and narrated for audio. Learn while commuting, exercising, or doing chores. Stop typing long kubectl config commands! kubectx and kubens are essential CLI tools that let you switch Kubernetes contexts and namespaces instantly with tab completion and fuzzy search. This 10-minute episode covers everything you need to know about v0.9.5 - from installation to power-user workflows. If you work with multiple Kubernetes clusters, these tools will save you hours every week. WHAT YOU'LL LEARN: 00:00 - The Problem: Why kubectl Context Switching is Painful 01:30 - k...

2025-12-11 · 10 min

Platform Engineering Playbook Podcast AWS re:Invent 2025 Recap 4/4 - Data & AI Wrap-Up
Part 4 of 4 in our AWS re:Invent 2025 series (finale). The data and AI services that tie everything together. S3 Tables with Apache Iceberg hits GA with Intelligent-Tiering and cross-region replication. Aurora DSQL delivers distributed SQL with GPS atomic clocks. S3 Vectors supports 2 billion vectors at 90% lower cost. Clean Rooms ML enables privacy-enhanced synthetic datasets. Plus a comprehensive wrap-up connecting 50+ announcements across all four episodes. News: Envoy CVE-2025-0913, Rust in Linux kernel permanent, Let's Encrypt 10 years. In this episode: - S3 Tables GA with Intelligent-Tiering (80% cost savings) and automatic cross-region replication for Iceberg tables ...

2025-12-11 · 24 min

Platform Engineering Playbook Podcast AWS re:Invent 2025 Recap Part 3/4 - EKS & Cloud Operations
Part 3 of our AWS re:Invent 2025 series. AWS transforms Kubernetes into an AI infrastructure platform with massive scale and AI-native operations. In this episode: - EKS Ultra Scale: 100,000 nodes per cluster (vs 15K GKE, 5K AKS): 1.6 million Trainium accelerators or 800K GPUs in a single cluster - AWS replaced etcd's Raft consensus with their internal "journal" system and moved to in-memory storage for 500 pods/sec at 100K scale - Anthropic using EKS Ultra Scale for Claude training, improving latency KPIs from 35% to 90%+ - EKS Capabilities: Fully managed Argo CD, AWS Controllers for Kubernetes (200+ CR...

2025-12-10 · 17 min

Platform Engineering Playbook Podcast AWS re:Invent 2025 Part 2/4 - Infrastructure & Developer Experience
AWS re:Invent 2025 Series (Part 2 of 4) AWS announces Graviton5 with 192 cores (3x previous gen) and 40% better price-performance vs x86. Trainium 3 delivers 4.4x performance at 50% lower cost, with NeuronLink eliminating 50% network overhead. Lambda Durable Functions enable year-long workflows. Werner Vogels introduces the "Renaissance Developer" framework for the AI era. Plus: BellSoft's hardened Java images cut CVEs by 95%, GitHub Actions package management security gaps exposed, and Proxmox releases VMware escape hatch. Links & Resources: - Full episode page: https://platformengineering.org/podcasts/00050-aws-reinvent-2025-infrastructure-developer-experience - BellSoft Hardened Images: https://www.infoq.com/news/2025/12/bellsoft-hardened-images/ ...

2025-12-09 · 14 min

Platform Engineering Playbook Podcast AWS re:Invent 2025 Part 1/4 - The Agentic AI Revolution
AWS announces autonomous AI agents that can work for days without human intervention. The DevOps Agent is an always-on incident responder. The Security Agent understands your application architecture. And Kiro is already used by hundreds of thousands of developers. This is part 1 of our 4-part AWS re:Invent 2025 coverage series. KEY TOPICS: • Frontier Agents: DevOps Agent, Security Agent, and Kiro • DevOps Agent: 24/7 incident response with human-in-the-loop approval • Security Agent: Context-aware security from design through deployment • Kiro: GA autonomous developer agent used internally at Amazon • Bedrock AgentCore: Policy controls, memory, and 13 evaluation...

2025-12-08 · 17 min

Platform Engineering Playbook Podcast Developer Experience Metrics Beyond DORA
DORA metrics revolutionized how we measure DevOps performance, but are we missing the bigger picture? This episode explains DORA from the ground up: the four key metrics, how they're measured, and why elite teams deploy more AND fail less. Then we explore what DORA misses: developer satisfaction, cognitive load, and flow state. From SPACE to DevEx to DX Core 4, discover the frameworks changing how we measure developer productivity. In this episode: - DORA's Four Key Metrics: Deployment Frequency, Lead Time, Change Failure Rate, and MTTR (now Failed Deployment Recovery Time) - Elite vs Low performers: El...

2025-12-07 · 13 min

Platform Engineering Playbook Podcast Cloudflare's Trust Crisis - December 2025 Outage and the Human Cost
Three weeks after their worst outage since 2019, Cloudflare went down again. On December 5, 2025, a Lua code bug took down 28% of HTTP traffic for 25 minutes - the sixth major outage of 2025. Beyond the technical postmortem, this episode examines the pattern of repeated failures, community reactions, and the often-overlooked human cost to on-call engineers. 📰 News Segment Links: • KubeCon Survey: How Platform Teams Are Adopting AI and IDPs https://thenewstack.io/kubecon-survey-how-platform-teams-are-adopting-ai-and-idps/ • GitHub Actions workflow dispatch now supports 25 inputs https://github.blog/changelog/2025-12-04-actions-workflow-dispatch-workflows-now-support-25-inputs • Hybrid Cloud-Native Networking in Enterprise - Louis Ryan (Google)...

2025-12-06 · 11 min

Platform Engineering Playbook Podcast Cloud Cost Quick Wins for Year-End
Global cloud spend hits $720 billion in 2025, and organizations waste 20-30% on unused resources. Year-end is the perfect time to show savings before budgets reset. In this episode, Jordan and Alex deliver six actionable quick wins you can implement THIS WEEK: 💰 The Six Wins: 1️⃣ Scheduling non-prod environments → 70% savings 2️⃣ Right-sizing oversized instances → 25-40% per instance 3️⃣ Reserved Instances/Savings Plans → Up to 72% discount 4️⃣ Spot instances for CI/CD → 60-90% savings 5️⃣ Storage tiering → Move cold data to Glacier 6️⃣ Zombie resource hunt → $500-2K/month per account 📋 Monday Morning Checklist: • Run cloud cost analyzer (30 min) • Find top 5...

2025-12-05 · 12 min

Platform Engineering Playbook Podcast Platform Engineering vs DevOps vs SRE - The Identity Crisis
Platform Engineer roles pay 20% more than DevOps Engineer roles, but job descriptions are 90% identical. Is Platform Engineering just DevOps with better marketing? In this episode, we cut through the confusion with origin stories, philosophy comparisons, and practical career advice. Key insights: • Platform Engineer job postings grew 40% YoY while DevOps postings declined 15% • DevOps (2009) was a movement, never meant to be a job title • SRE (2003/2016) introduced Google's 50% engineering time rule • Platform Engineering (2018-2020) brought product thinking to internal tools • The 20% salary premium is for product thinking, not the title Decision framework: S...

2025-12-04 · 16 min

Platform Engineering Playbook Podcast Platform Engineering Certification Tier List 2025
Are certifications worth it? The answer is: it depends. And that's precisely the problem. In this episode, Jordan and Alex rank 25+ certifications using a data-driven 60/40 framework (60% skill-building, 40% market signal). 🎯 The Certification Dilemma: • Platform engineers span Kubernetes, cloud, observability, security, and developer experience • No single certification captures that breadth • Most certifications prove you can cram for exams, not solve production problems 📊 Key Statistics: • Platform engineers earn $172K vs DevOps $152K (13% premium) • CKA appears in 45,000+ job postings globally • Average certification investment: $800-1,200/year • CKA pass rate: 66% (hands-on exam, production-relevant...

2025-12-03 · 25 min

Platform Engineering Playbook Podcast Kubernetes AI Conformance - The End of AI Infrastructure Chaos
The Wild West of AI infrastructure just ended. CNCF launched the Certified Kubernetes AI Conformance Program at KubeCon Atlanta on November 11, 2025. In this episode, Jordan and Alex break down: 🎯 The Problem AI Teams Faced: • GPU scheduling worked differently on GKE vs EKS vs OpenShift • Training on one platform, deploying on another = rewriting code • GPU utilization stuck at 45-60% without standardization • 82% of organizations building custom AI, 58% using Kubernetes ⚡ The 5 Core Certification Requirements: • Dynamic Resource Allocation (DRA) - request GPUs with specific VRAM, interconnect requirements • Intelligent Autoscaling - cluster and pod scaling b...

2025-12-02 · 14 min

Platform Engineering Playbook Podcast Helm 4 - The Definitive Guide to the Biggest Update in 6 Years
Helm 4.0 dropped at KubeCon Atlanta 2025, marking the biggest update in 6 years. Server-Side Apply finally ends the GitOps ownership wars. WASM plugins bring sandboxed security. But what breaks? This is the definitive guide covering SSA deep-dive, migration timeline, and the full breaking changes analysis. In this episode: - Server-Side Apply (SSA) replaces three-way merge - field ownership tracked at API server level via managedFields - SSA delivers 40-60% faster deployments by reducing API calls (1 PATCH vs 2+ GET/PATCH per resource) - WASM plugins via Extism runtime are optional but recommended - existing Go binaries and...

2025-12-01 · 23 min

Platform Engineering Playbook Podcast CNPE Certification Guide - The First Platform Engineering Credential
CNCF just launched the first-ever hands-on platform engineering certification at KubeCon Atlanta 2025. But with beta testers reporting 29% scores, is CNPE worth pursuing? In this episode, Jordan and Alex break down everything you need to know: 🎯 What CNPE Tests: • GitOps & Continuous Delivery (25%) • Platform APIs & Self-Service (25%) • Observability & Operations (20%) • Platform Architecture (15%) • Security & Policy Enforcement (15%) 📊 Career Impact: • Platform engineers earn $219K average (US) • 20% higher than DevOps engineers • Second most popular K8s role at 11.47% of job postings 🛤️ Three Certification Paths: • Traditional: CKA → CKS → CNPA → CNPE • Fast-track: CNPA → CNPE • Full co...

2025-11-30 · 14 min

Platform Engineering Playbook Podcast 10 Platform Engineering Anti-Patterns That Kill Developer Productivity
DORA 2024 found organizations with platform teams saw throughput decrease by 8% and stability decrease by 14%. Wait, isn't platform engineering supposed to help? In this episode, Jordan and Alex unpack the 10 anti-patterns sabotaging platform engineering initiatives: ORGANIZATIONAL ANTI-PATTERNS: 1. Ticket Ops - The bottleneck factory where developers wait a week for tasks that should take minutes 2. Ivory Tower Platform - Teams disconnected from developer reality creating standards no one follows 3. Platform as Bucket - When platform scope grows 3x without corresponding team growth 4. Mandatory Adoption - Forcing usage hides resistance and breeds resentment ...

2025-11-29 · 12 min

Platform Engineering Playbook Podcast Black Friday War Stories: Lessons from E-Commerce's Worst Days
Why do major retailers with unlimited budgets still crash on Black Friday? This episode dives into the graveyard of e-commerce outages - from J.Crew's $775,000 five-hour crash to the AWS typo that cost $150 million. In this Black Friday special episode, we examine:
📊 THE HALL OF FAME CRASHES
• J.Crew 2018: 323,000 shoppers affected, $775,000 lost in 5 hours
• Walmart 2018: $9 million lost before Black Friday even started
• Best Buy 2014: Infrastructure optimized for desktop, got 78% mobile
• Cloudflare 2024: 99.3% of Shopify stores frozen (6M+ domains)
💥 THE FAMOUS NON-BLACK-FRIDAY DISASTERS
• AWS S3 2017: One typo took down half the internet for 4+ ho...

2025-11-28 11 min

Platform Engineering Playbook Podcast Giving Thanks to Your Dependencies: A Platform Engineer's Gratitude Guide
This Thanksgiving, let's talk about the people you've never thanked. 60% of open source maintainers are unpaid. 60% have left or considered leaving. Your infrastructure runs on their free time.
In this episode:
- Gratitude tools: npx thanks, npm fund, cargo-thanks, thanks-stars
- Happiness Packets: Send anonymous thank-you notes to developers
- Beyond stars: Why specific use case emails matter more than generic thanks
- Company-level: Open Source Pledge ($2K/dev/year), GitHub Sponsors
- Your 5-minute Thanksgiving challenge
Perfect for platform engineers, developers, and engineering leadership who want to support the...

2025-11-27 08 min

Platform Engineering Playbook Podcast KubeCon Atlanta 2025 Part 3: Community at 10 Years - The Sustainability Question
CNCF celebrates 10 years with 300,000 contributors and 230+ projects - but the hallway track told a different story. 60% of maintainers unpaid. 60% have left or considered leaving. The XZ Utils backdoor showed what happens when isolated maintainers burn out. Han Kang's passing reminds us of the human cost behind the code.
In this episode:
- Technical breakout sessions: CiliumCon (TikTok IPv6, 60K node clusters), in-toto graduation, Gateway API convergence, OpenTelemetry eBPF
- Open Source Pledge: Antithesis $110K, Convex $100K - real cash to maintainers
- Kat Cosgrove survival strategies: "When you're an open source maintainer, you don't get to...

2025-11-26 14 min

Platform Engineering Playbook Podcast KubeCon Atlanta 2025 Part 2: Platform Engineering Consensus and Community Reality Check
After years of "what even IS platform engineering" debates, KubeCon 2025 delivered consensus: three non-negotiable principles, real-world adoption at Intuit/Bloomberg/ByteDance scale, and the honest truth about maintainer burnout. Kat Cosgrove's "ready to abandon ship" quote reveals the human cost of building the infrastructure everyone depends on.
In this episode:
- Three platform principles emerged: API-first self-service, business relevance (not just tech metrics), and managed service approach (not templates)
- The "puppy for Christmas" anti-pattern explains 70% platform team failure rate - templates without ongoing operational support
- Intuit migrated Mailchimp's 11M users and 700M...

2025-11-25 17 min

Platform Engineering Playbook Podcast KubeCon Atlanta 2025 Part 1: AI Goes Native and the 30K Core Lesson
Google donates a GPU driver live on stage. OpenAI saves $2.16M/month with one line of code. Kubernetes rollback finally works after 10 years. What changed at KubeCon Atlanta 2025 that proves Kubernetes isn't adapting to AI - it's being rebuilt for it? This is Part 1 of our three-part deep dive into KubeCon Atlanta 2025 (November 12-21). Over three episodes, we're covering the CNCF's 10-year anniversary, the announcements reshaping platform engineering, and the honest conversations about ecosystem sustainability.
Key Topics Covered:
• Dynamic Resource Allocation (DRA) reaches GA in Kubernetes 1.34 - prevents 10-40% GPU performance loss from NUMA misalignment ($200K/da...

2025-11-24 19 min

Platform Engineering Playbook Podcast The $4,350/Month GPU Waste Problem: How Kubernetes Architecture Creates Massive Cost Inefficiency
Your H100 costs $5,000 per month, but you're only using it at 13% capacity - wasting $4,350 monthly per GPU. Analysis of 4,000+ Kubernetes clusters reveals 60-70% of GPU budgets burn on idle resources because Kubernetes treats GPUs as atomic, non-shareable resources. Discover why this architectural decision creates massive waste, and the five-layer optimization framework (MIG, time-slicing, VPA, Spot, regional arbitrage) that recovers 75-93% of lost capacity in 90 days.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00034-kubernetes-gpu-cost-waste-finops
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!
Keywords: kubernetes gpu, gpu cost...

2025-11-23 27 min

Platform Engineering Playbook Podcast Service Mesh Showdown: Why User-Space Beat eBPF
Kernel-level eBPF should beat user-space proxies - but Istio Ambient delivers 8% mTLS overhead while Cilium shows 99%. Academic benchmarks reveal why architecture boundaries matter more than execution location, and what that means for your service mesh choice in 2025.
In this episode:
- Istio Ambient (user-space) achieves 8% mTLS overhead vs Cilium (kernel eBPF) at 99% - counterintuitive result explained by L7 processing boundaries requiring kernel/user-space transitions
- 50,000-pod stability test shows Cilium's distributed control plane crashed API server under churn while Istio's centralized control handled it - 20% per-core efficiency, 56% total throughput advantage
- Decision framework: Ambient for 1,000+ nodes with mixed...

2025-11-22 20 min

Platform Engineering Playbook Podcast The Terraform vs OpenTofu Debate - Why "Just Switch" Is Bad Advice
HashiCorp's license change and IBM's $6.4B acquisition created the "you must migrate" narrative - but 70% of teams using Terraform in-house aren't legally affected. Jordan and Alex challenge the binary thinking with Fidelity's 50,000 state file migration case study, a three-factor decision framework, and the truth nobody talks about: migration is 90% organizational change management, not technology.
In this episode:
- 70% of teams using Terraform in-house are unaffected by BSL license restrictions, yet face strategic vendor lock-in risk with IBM's $6.4B acquisition
- Fidelity migrated 50,000 state files managing 4M resources in 2 quarters - technical migration is trivial, organizational change management is t...

2025-11-21 17 min

Platform Engineering Playbook Podcast Agentic DevOps: GitHub Agent HQ and the Autonomous Pipeline Revolution
GitHub Universe 2025 announced Agent HQ - mission control for orchestrating AI agents from OpenAI, Anthropic, Google, and more. Azure SRE Agent saved Microsoft 20,000+ engineering hours. But 80% of companies report agents executing unintended actions, and only 44% have agent-specific security policies. Jordan and Alex break down what agentic DevOps actually means, the architectural shift from automation to autonomy, and the tiered adoption framework for deploying agents without creating catastrophic risk.
In this episode:
- GitHub Agent HQ enables multi-agent orchestration (OpenAI, Anthropic, Google, Cognition) with Enterprise Control Plane for governance
- Copilot coding agent works asynchronously - spins up GitH...

2025-11-20 18 min

Platform Engineering Playbook Podcast Cloudflare Outage November 2025: When a Rust Panic Took Down 20% of the Internet
A routine database permissions change triggered Cloudflare's worst outage since 2019 - taking down ChatGPT, X, Shopify, Discord, and 20% of the internet for nearly 6 hours. Jordan and Alex dissect the technical chain reaction from ClickHouse metadata exposure to a Rust panic in the FL2 proxy, examining how ~60 features became >200 and exceeded a hardcoded memory limit. The third major cloud outage in 30 days (after AWS and Azure) raises critical questions about infrastructure concentration risk and why internal configuration needs the same defensive programming as external input.
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up the...

2025-11-19 12 min

Platform Engineering Playbook Podcast Ingress NGINX Retirement: The March 2026 Migration Deadline
The de facto standard Kubernetes ingress controller will stop receiving security patches in March 2026 - and only 1-2 people have been maintaining it for years. Jordan and Alex unpack why this happened, examine the security implications of unpatched CVEs on internet-facing infrastructure, and provide a four-phase migration framework to Gateway API. Includes controller comparison (Envoy Gateway, Cilium, Kong, Traefik, NGINX Gateway Fabric) and immediate actions for this week.
Perfect for senior platform engineers, SREs, DevOps engineers with 5+ years experience looking to level up their platform engineering skills.
Episode URL: https://platformengineeringplaybook.io/podcasts/00029-ingress-nginx-retirement

2025-11-19 12 min

Platform Engineering Playbook Podcast OpenTelemetry eBPF Instrumentation: Zero-Code Observability Under 2% Overhead
What if you could achieve complete observability coverage (every HTTP request, database query, and gRPC call) without touching application code? Jordan and Alex investigate eBPF instrumentation for OpenTelemetry, revealing how kernel-level hooks deliver under 2% CPU overhead versus traditional APM agents' 10-50%. Discover the May 2025 inflection point, the TLS encryption challenge, and a practical framework for combining eBPF with SDK instrumentation.
In this episode:
- eBPF instrumentation achieves under 2% CPU overhead by observing kernel operations already happening - versus 10-50% for traditional APM agents
- Grafana donated Beyla to OpenTelemetry in May 2025, making eBPF instrumentation part of the co...

2025-11-18 14 min

Platform Engineering Playbook Podcast The Open Source Observability Showdown: When "Free" Costs $12K/Month
Prometheus is free, Grafana is free, Loki is free - yet Datadog posted $2.3B in revenue and Shopify runs a 15-person team just to manage their observability stack. We decode which open source tools (Prometheus, Loki, Tempo, VictoriaMetrics) actually deliver on their promises, which hide massive operational complexity, and when the "free" option costs more than paying a vendor. Learn the decision framework that matches observability architecture to your team's operational maturity.
In this episode:
- Single-cluster Prometheus costs ~5 hrs/month ($750-1500 equivalent), but multi-cluster federation jumps to 40-80 hrs/month ($6K-12K) - know your tier before comm...

2025-11-17 19 min

Platform Engineering Playbook Podcast The Kubernetes Complexity Backlash: When Simpler Infrastructure Wins
Kubernetes commands 92% market share, yet 88% report year-over-year cost increases and 25% plan to shrink deployments. We unpack the 3-5x cost underestimation problem, the cargo cult adoption pattern, and when alternatives like Docker Swarm, Nomad, ECS, or PaaS platforms deliver better ROI. From the 200-node rule to 37signals' $10M+ five-year savings leaving AWS, this is your data-driven framework for right-sizing infrastructure decisions in 2025.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00026-kubernetes-complexity-backlash
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!
Summary:
• 88% of Kubernetes adopters report y...

2025-11-16 15 min

Platform Engineering Playbook Podcast SRE Reliability Principles: The 26% Problem - Error Budgets, SLOs, Platform Engineering
Only 26% of organizations actively use SLOs after a decade of Google's SRE principles being gospel. We explore why adoption is so low despite 49% saying they're more relevant than ever, which principles remain timeless (error budgets, embracing risk, blameless postmortems), and how to adapt SRE for 2025's complexity of AI/ML systems, Platform Engineering collaboration, and multi-cloud chaos. Includes practical playbooks for starting from zero, fixing ignored SLOs, and ML-specific adaptations. The key insight: it's not that SRE principles are wrong - implementation is harder than anticipated, but the philosophy remains timeless when properly adapted.
In this episode: ...

2025-11-16 15 min

Platform Engineering Playbook Podcast Internal Developer Portal Showdown 2025: Backstage vs Port vs Cortex vs OpsLevel
Your team spent 6 months implementing Backstage. Adoption? 8%. The CFO asks: "Why didn't we buy a solution?" Here's the 2025 comparison with real pricing, real timelines, and the counterintuitive truth: commercial platforms are 8-16x cheaper than "free" Backstage for most teams. OpsLevel $39/user/month delivers in 30-45 days. Port $78/month offers flexibility without coding. Cortex $65-69/month enforces standards. We break down the decision framework by team size: under 200? OpsLevel. 200-500? Port or OpsLevel. 500+? Backstage viable with dedicated platform team. The key insight: it's not open-source free vs commercial expensive - it's transparent licensing vs hidden $150K/20-developer engineering costs. In...

2025-11-14 24 min

Platform Engineering Playbook Podcast DNS for Platform Engineering: The Silent Killer
Why does a forty-year-old protocol keep taking down billion-dollar infrastructure? The October 2025 AWS outage lasted fifteen hours because of a DNS race condition. Kubernetes defaults create 5x query amplification. We investigate how DNS really works in modern platforms (CoreDNS plugin chains, the ndots:5 trap, GSLB failover) and deliver the five-layer defensive playbook to prevent your platform from becoming the next postmortem.
🔗 Full episode page: https://platformengineeringplaybook.com/podcasts/00023-dns-platform-engineering
📝 See a mistake or have insights to add? This podcast is community-driven - open a PR on GitHub!

2025-11-13 19 min