Description

The Wild West of AI infrastructure just ended. CNCF launched the Certified Kubernetes AI Conformance Program at KubeCon Atlanta on November 11, 2025.

In this episode, Jordan and Alex break down:

🎯 The Problem AI Teams Faced:
• GPU scheduling worked differently on GKE vs EKS vs OpenShift
• Training on one platform, deploying on another = rewriting code
• GPU utilization stuck at 45-60% without standardization
• 82% of organizations are building custom AI, and 58% are using Kubernetes

⚡ The 5 Core Certification Requirements:
• Dynamic Resource Allocation (DRA) - request GPUs with specific VRAM and interconnect requirements
• Intelligent Autoscaling - cluster and pod scaling based on GPU metrics
• Rich Accelerator Metrics - memory, bandwidth, temperature, NVLink stats
• AI Operator Support - Kubeflow, Ray, KServe compatibility
• Gang Scheduling - all-or-nothing pod startup for distributed training
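
To make the "all-or-nothing" idea behind gang scheduling concrete, here is a minimal Python sketch. It is an illustration only, not how any Kubernetes scheduler is implemented: the `Job` class, the `gang_schedule` function, and the job names are all hypothetical, and the cluster is reduced to a single pool of free GPUs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical distributed training job (not a Kubernetes API type)."""
    name: str
    pods: int          # number of worker pods in the gang
    gpus_per_pod: int  # GPUs each pod needs


def gang_schedule(jobs, free_gpus):
    """Admit a job only if ALL of its pods fit at once; otherwise admit none.

    Returns (admitted job names, waiting job names, GPUs still free).
    """
    admitted, waiting = [], []
    for job in jobs:
        needed = job.pods * job.gpus_per_pod
        if needed <= free_gpus:
            # The whole gang fits, so reserve GPUs for every pod together.
            free_gpus -= needed
            admitted.append(job.name)
        else:
            # A partial start would hold GPUs while the job cannot progress,
            # so the entire job waits instead.
            waiting.append(job.name)
    return admitted, waiting, free_gpus


jobs = [Job("train-a", 4, 2), Job("train-b", 8, 1), Job("train-c", 2, 1)]
admitted, waiting, left = gang_schedule(jobs, free_gpus=12)
print(admitted, waiting, left)
```

In this toy run, "train-b" needs 8 GPUs when only 4 remain, so none of its pods start. Without gang scheduling, 4 of its pods could start and sit idle while holding GPUs that "train-c" could have used.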

📊 The Impact:
• GPU utilization: 45-60% → 70-85%
• Job queue times: 15-45 min → 3-10 min
• Monthly GPU costs: 30-40% reduction
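
As a rough sanity check on how utilization relates to cost, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that GPU spend is proportional to provisioned GPU-hours and that the amount of useful work stays constant. The function name `gpu_hours_saved` is hypothetical, and this is not a claim about how the figures above were derived.

```python
def gpu_hours_saved(util_before: float, util_after: float) -> float:
    """Fraction of provisioned GPU-hours saved for the same useful work.

    Assumption: provisioned hours = useful hours / utilization, so raising
    utilization shrinks the hours (and, by assumption, the bill) in proportion.
    """
    if not (0 < util_before <= util_after <= 1):
        raise ValueError("expected 0 < util_before <= util_after <= 1")
    return 1 - util_before / util_after


# Low and high ends of the utilization ranges quoted above.
for before, after in [(0.45, 0.70), (0.60, 0.85)]:
    print(f"{before:.0%} -> {after:.0%}: {gpu_hours_saved(before, after):.1%} fewer GPU-hours")
```

Under these assumptions the two endpoints come out at roughly 29% to 36% fewer GPU-hours, which is in the same ballpark as the cost reduction quoted above.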

🏢 Certified Vendors (11+):
AWS EKS, Google GKE, Microsoft Azure, Red Hat OpenShift, Oracle OCI, CoreWeave, Akamai, VMware/Broadcom, Giant Swarm, Kubermatic, Sidero Labs

🔮 What's Coming in v2.0 (2026):
• Topology-aware scheduling
• Multi-node NVLink standardization
• Model serving standards
• Cost attribution for GPU chargeback

📖 Full blog post: https://platformengineering.org/blog/kubernetes-ai-conformance-program-cncf-standardization-guide

🔗 Resources:
• CNCF Announcement: https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/
• GitHub: https://github.com/cncf/k8s-ai-conformance
• GKE Implementation: https://opensource.googleblog.com/2025/11/ai-conformant-clusters-in-gke.html

#Kubernetes #AI #CNCF #PlatformEngineering #DevOps #MLOps #GPU #CloudNative