Auto-scaling AI Workloads in Kubernetes (KEDA & GPU Nodes)

Introduction

AI workloads are bursty and resource-intensive. Statically provisioning GPU nodes forces a trade-off: size the cluster for peak load and pay for idle GPUs, or size it for average load and suffer poor performance during traffic spikes. This lesson covers auto-scaling AI workloads in Kubernetes.

Managing GPUs in Kubernetes

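Kubernetes does not manage GPUs natively. A device plugin (such as the NVIDIA Device Plugin) advertises each node's GPUs to the scheduler as an extended resource (nvidia.com/gpu), which pods can then request. We will cover node selectors, taints, and tolerations, which together keep AI workloads on GPU nodes and keep other workloads off them.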
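As a minimal sketch of how these three mechanisms fit together, the Python snippet below builds a Pod manifest as a plain dictionary. The resource name nvidia.com/gpu is the one the NVIDIA Device Plugin advertises. The node label accelerator=nvidia-gpu, the taint key, the pod name, and the image are assumptions chosen for illustration; substitute whatever your cluster actually uses.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest that requests GPUs and targets tainted GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Node selector: only consider nodes carrying this label.
            # The label key and value are hypothetical.
            "nodeSelector": {"accelerator": "nvidia-gpu"},
            # Toleration: allow scheduling onto nodes tainted to repel
            # non-GPU workloads. The taint key is an assumed convention.
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Exists",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested under limits as an extended resource.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("llm-inference", "example.com/llm-server:latest")
print(json.dumps(manifest, indent=2))
```

Because JSON is accepted wherever Kubernetes accepts YAML, the printed output can be saved to a file and applied with kubectl apply -f.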

KEDA (Kubernetes Event-driven Autoscaling)

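By default, the Horizontal Pod Autoscaler (HPA) scales on CPU or memory utilization, which are often poor indicators of load for LLM inference, where the GPU is the bottleneck. KEDA extends the HPA so that workloads can scale on external signals such as message queue length or custom Prometheus metrics (e.g., token generation latency), and it can scale a workload all the way down to zero replicas.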
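To make this concrete, here is a hedged sketch that builds a KEDA ScaledObject with a Prometheus trigger, again as a Python dictionary. The ScaledObject fields used (scaleTargetRef, minReplicaCount, maxReplicaCount, pollingInterval, cooldownPeriod, and the prometheus trigger's serverAddress, query, and threshold) are part of KEDA's schema. The Deployment name, the Prometheus address, the metric inference_queue_depth, and all numeric values are assumptions for illustration.

```python
import json


def keda_scaled_object(deployment, prometheus_url, query, threshold,
                       min_replicas=0, max_replicas=8):
    """Build a KEDA ScaledObject that scales a Deployment on a Prometheus query."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,  # 0 enables scale-to-zero
            "maxReplicaCount": max_replicas,
            "pollingInterval": 15,   # seconds between metric checks (illustrative)
            "cooldownPeriod": 300,   # seconds of inactivity before scaling to zero
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        # Trigger metadata values are written as strings.
                        "threshold": str(threshold),
                    },
                }
            ],
        },
    }


# Hypothetical Deployment, Prometheus address, and metric name.
scaled_object = keda_scaled_object(
    deployment="llm-inference",
    prometheus_url="http://prometheus.monitoring.svc:9090",
    query="sum(inference_queue_depth)",
    threshold=10,
)
print(json.dumps(scaled_object, indent=2))
```

With a threshold of 10, the autoscaler aims for roughly ten queued requests per replica, so a queue depth of 40 would call for about four replicas.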

Autoscaling Strategies

We will design autoscaling policies that spin up new GPU pods when the inference queue grows and scale down to zero when idle, minimizing cloud costs while maintaining responsiveness.
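The core of such a policy can be sketched as a small function. It is a simplified model for reasoning about behavior, not the autoscaler's actual implementation: it divides the queue length by a per-replica target, rounds up, clamps to a maximum, and returns zero when the queue is empty. The function name and the example numbers are illustrative assumptions.

```python
import math


def desired_replicas(queue_length, target_per_replica, max_replicas):
    """Return how many GPU pods a queue-based policy would ask for.

    queue_length       -- requests currently waiting for inference
    target_per_replica -- queued requests one replica is expected to absorb
    max_replicas       -- upper bound, e.g. set by GPU quota or budget
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_length <= 0:
        return 0  # idle: scale to zero
    needed = math.ceil(queue_length / target_per_replica)
    return min(needed, max_replicas)


for depth in (0, 1, 25, 1000):
    print(depth, desired_replicas(depth, target_per_replica=10, max_replicas=8))
```

A real policy would also apply a cooldown before scaling down, so that a briefly empty queue does not evict a model that took minutes to load.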

Challenges with GPU Autoscaling

GPU nodes take time to provision, and large models take time to load into memory. We will discuss strategies like over-provisioning, pre-warming, and multi-model serving to mitigate these cold-start delays.
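A back-of-the-envelope model helps show why warm capacity matters. The sketch below adds up the cold-start phases and estimates how many requests pile up while new replicas are still starting. Every number in the example is a made-up placeholder, and the function names are hypothetical; measure your own provisioning, image pull, and model load times before drawing conclusions.

```python
def cold_start_seconds(node_provision, image_pull, model_load):
    """Total delay before a brand-new replica on a new node can serve traffic."""
    return node_provision + image_pull + model_load


def backlog_during_cold_start(burst_rps, per_replica_rps, warm_replicas,
                              cold_start):
    """Requests that queue up while extra replicas are still starting.

    Assumes a constant burst rate and that warm replicas serve at a
    constant rate; real traffic is noisier than this.
    """
    shortfall_rps = max(0, burst_rps - warm_replicas * per_replica_rps)
    return shortfall_rps * cold_start


# Placeholder timings in seconds: node boot, image pull, model load.
delay = cold_start_seconds(node_provision=180, image_pull=60, model_load=90)

# Compare keeping one warm replica against keeping four (over-provisioning).
for warm in (1, 4):
    queued = backlog_during_cold_start(
        burst_rps=20, per_replica_rps=5, warm_replicas=warm, cold_start=delay
    )
    print(f"warm={warm} cold_start={delay}s backlog={queued} requests")
```

Under these assumed numbers, pre-warming shortens the delay term (for example by caching the image on the node), while over-provisioning shrinks the shortfall term; either one reduces the backlog.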
