Project: Production LLM Inference Cluster with vLLM and KEDA

Overview

This capstone project bridges the gap between running models locally and serving them in a highly available, production-grade cloud environment. You will deploy a vLLM server to a Kubernetes cluster equipped with GPU nodes, implement autoscaling with KEDA based on queue length, and instrument the entire stack for observability.

Architecture

You will build an architecture consisting of:

A Kubernetes cluster with at least one GPU-enabled node pool (simulated or real).
A vLLM deployment serving an open-weight model.
A queueing mechanism (e.g., Redis or RabbitMQ) simulating incoming inference requests.
KEDA configured to scale the vLLM pods based on the queue depth.
Prometheus scraping metrics from vLLM and Grafana visualizing the data.

Requirements

Deployment: Create Kubernetes manifests (Deployments, Services) for vLLM. Ensure the correct container image and GPU resource requests are specified.
Autoscaling: Write a KEDA ScaledObject that targets your vLLM deployment, scaling from 0 to 3 replicas based on the external queue metric.
Load Testing: Use a simple load-generation script to populate the queue and trigger a scale-up event.
Observability: Apply the Prometheus ServiceMonitor for vLLM and import a community dashboard into Grafana to visualize token throughput and latency.

Deliverables

A GitHub repository containing all Kubernetes YAML manifests or Helm charts.
A README file documenting the deployment steps.
Screenshots of the Grafana dashboard during the load test, clearly showing the scaling events and token throughput.

Capstone: Global E-Commerce Platform Architecture Capstone Project: The AI Platform Portfolio