Project: Production LLM Inference Cluster with vLLM and KEDA
Overview
This capstone project bridges the gap between running models locally and serving them in a highly available, production-grade cloud environment. You will deploy a vLLM server to a Kubernetes cluster with GPU nodes, autoscale it with KEDA based on queue depth, and instrument the stack with Prometheus and Grafana for observability.
Architecture
You will build an architecture consisting of:
- A Kubernetes cluster with at least one GPU-enabled node pool (simulated or real).
- A vLLM deployment serving an open-weight model.
- A queueing mechanism (e.g., Redis or RabbitMQ) simulating incoming inference requests.
- KEDA configured to scale the vLLM pods based on the queue depth.
- Prometheus scraping metrics from vLLM and Grafana visualizing the data.
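To illustrate how the vLLM deployment and the KEDA piece fit together, the sketch below builds both manifests as Python dictionaries and prints them as a JSON List that kubectl can apply. Treat it as a minimal sketch under assumptions, not a reference configuration: the resource names, the Redis address and list name, the model, the image tag, and the target of 5 queued items per replica are placeholders chosen for illustration, and the GPU resource name assumes the NVIDIA device plugin is installed on the GPU nodes.

```python
import json

# Every concrete value here (names, model, Redis address, list name) is an
# illustrative assumption, not something prescribed by the project.
APP = "vllm"
MODEL = "facebook/opt-125m"  # placeholder; use any open-weight model you may serve


def vllm_deployment() -> dict:
    """Deployment for the vLLM OpenAI-compatible server with one GPU per pod."""
    labels = {"app": APP}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": APP, "labels": labels},
        "spec": {
            # spec.replicas is intentionally unset so the autoscaler owns it.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "vllm",
                        # Pin a specific tag in a real deployment.
                        "image": "vllm/vllm-openai:latest",
                        "args": ["--model", MODEL],
                        "ports": [{"containerPort": 8000}],
                        # Assumes the NVIDIA device plugin exposes this resource.
                        "resources": {"limits": {"nvidia.com/gpu": "1"}},
                    }],
                },
            },
        },
    }


def keda_scaled_object(list_length: int = 5) -> dict:
    """ScaledObject scaling the Deployment on the length of a Redis list."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{APP}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": APP},
            "minReplicaCount": 0,  # scale to zero when the queue is empty
            "maxReplicaCount": 3,
            "triggers": [{
                "type": "redis",
                # KEDA trigger metadata values are strings.
                "metadata": {
                    "address": "redis.default.svc.cluster.local:6379",
                    "listName": "inference-requests",
                    "listLength": str(list_length),
                },
            }],
        },
    }


def manifest_list() -> dict:
    """Wrap both objects in a v1 List so they can be applied together."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [vllm_deployment(), keda_scaled_object()],
    }


if __name__ == "__main__":
    print(json.dumps(manifest_list(), indent=2))
```

If the script is saved as manifests.py (a hypothetical file name), its output can be piped to kubectl apply -f - on a cluster that already has KEDA installed. Generating JSON keeps the example free of third-party dependencies; hand-written YAML or a Helm chart is equally valid for the project.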
Requirements
- Deployment: Create Kubernetes manifests (Deployments, Services) for vLLM. Ensure the correct container image and GPU resource requests are specified.
- Autoscaling: Write a KEDA ScaledObject that targets your vLLM deployment, scaling from 0 to 3 replicas based on the external queue metric.
- Load Testing: Use a simple load-generation script to populate the queue and trigger a scale-up event.
- Observability: Apply a Prometheus Operator ServiceMonitor that scrapes the vLLM metrics endpoint, and import a community dashboard into Grafana to visualize token throughput and latency.
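The load-testing step can be sketched as a small script that pushes synthetic requests onto the queue and estimates how many replicas the autoscaler should settle on. This is a minimal sketch under assumptions: the queue is a Redis list, the payload fields are hypothetical, and the client is any object with a redis-py style rpush method. The replica estimate is a simplified model (queue length divided by the per-replica target, rounded up and capped) that ignores the autoscaler's tolerance, polling interval, and cooldown behaviour.

```python
import json
import math
import uuid


def enqueue_requests(client, queue_name: str, count: int, prompt: str) -> int:
    """Push `count` JSON-encoded requests onto a list-style queue.

    `client` is assumed to expose rpush(name, value) returning the new list
    length, as redis-py's Redis.rpush does. Returns the length reported by
    the last push, or 0 if nothing was pushed.
    """
    length = 0
    for _ in range(count):
        # Payload shape is a hypothetical example, not a vLLM wire format.
        payload = json.dumps({
            "id": str(uuid.uuid4()),
            "prompt": prompt,
            "max_tokens": 64,
        })
        length = client.rpush(queue_name, payload)
    return length


def expected_replicas(queue_length: int, target_per_replica: int,
                      max_replicas: int = 3) -> int:
    """Rough steady-state replica count for a given queue length."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_length <= 0:
        return 0  # an empty queue lets the workload scale to zero
    return min(max_replicas, math.ceil(queue_length / target_per_replica))


class InMemoryQueue:
    """Stand-in for a Redis client, useful for dry runs without a server."""

    def __init__(self):
        self.items = []

    def rpush(self, name, *values):
        self.items.extend(values)
        return len(self.items)


if __name__ == "__main__":
    queue = InMemoryQueue()
    depth = enqueue_requests(queue, "inference-requests", 12, "Hello, world")
    print("queue depth:", depth)
    print("expected replicas:", expected_replicas(depth, target_per_replica=5))
```

Against a real cluster, replace InMemoryQueue with a redis-py client such as redis.Redis(host=..., port=6379), pointing at the same list the ScaledObject watches. Comparing the estimate with the replica count seen in Grafana is a quick check that the trigger is wired correctly.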
Deliverables
- A GitHub repository containing all Kubernetes YAML manifests or Helm charts.
- A README file documenting the deployment steps.
- Screenshots of the Grafana dashboard during the load test, clearly showing the scaling events and token throughput.