SCENARIO:
You deploy a new ML training job requiring 8 GPUs, but its pods are stuck in Pending. The K8s scheduler reports 'no nodes available'. Walk me
through exactly what Karpenter does to resolve this, step by step.
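Before the walkthrough, it helps to see what "marked unschedulable" actually means on the pod object. A minimal sketch (plain dicts standing in for the Kubernetes API objects; the helper name and sample data are illustrative, not Karpenter's source): the kube-scheduler records its failure as a `PodScheduled` condition with reason `Unschedulable`, and that condition is what a provisioning controller keys off.

```python
# Sketch: detecting a pod the scheduler has given up on (assumed dict shapes,
# not Karpenter's actual implementation).

def is_unschedulable(pod: dict) -> bool:
    """True if the scheduler marked this Pending pod unschedulable."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    return any(
        cond.get("type") == "PodScheduled"
        and cond.get("status") == "False"
        and cond.get("reason") == "Unschedulable"
        for cond in status.get("conditions", [])
    )

# Illustrative pod from the scenario: 8 GPUs requested, no node can fit it.
pending_pod = {
    "metadata": {"name": "ml-train-0"},
    "status": {
        "phase": "Pending",
        "conditions": [{
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "0/5 nodes are available: 5 Insufficient nvidia.com/gpu.",
        }],
    },
}

print(is_unschedulable(pending_pod))  # True
```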
WHAT THEY'RE TESTING: the division of labor between the K8s scheduler and Karpenter, plus Karpenter's 4-step lifecycle
THE ANSWER:
• WATCH: The Karpenter controller watches for pods the K8s scheduler has marked 'unschedulable'
• EVALUATE: Reads ALL constraints from Pod Spec:
 - Resource requests (8 GPUs, memory, CPU)
 - nodeSelector, nodeAffinity, tolerations
 - Topology spread constraints
• PROVISION: Calls the EC2 API (CreateFleet) to launch an instance matching ALL requirements
 - Selects, e.g., a p3.16xlarge (8 GPUs) in the correct zone
 - Applies NodePool's taints, labels, kubelet config
• RESULT: Node joins cluster, K8s scheduler binds the pod
→ Key insight: Karpenter provisions, K8s scheduler still does final binding!