Cheat Sheet: AI Infrastructure & LLM Serving

Core Terminology

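CUDA: NVIDIA’s parallel computing platform and programming model for general-purpose computation on GPUs.
VRAM: Video RAM; the GPU's on-board memory, which holds the model weights, the KV cache, and activations during inference.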
Quantization: Reducing the precision of model weights (e.g., from 16-bit to 4-bit) to save memory at the cost of slight accuracy degradation.
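Continuous Batching: Scheduling at the iteration level (one decoding step at a time), so new requests join the running batch and finished ones leave without waiting for the whole batch to complete. This keeps the GPU busy and raises throughput.
TTFT (Time to First Token): Time from sending a request to receiving the first output token. The responsiveness metric, driven mainly by queueing and prompt processing (prefill).
TPOT (Time per Output Token): Average time between successive output tokens after the first. The generation-speed metric for the decode phase.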
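
A minimal sketch of how TTFT and TPOT can be derived from the timestamps of a single streamed response. The function name and its arguments are illustrative, not part of any serving engine's API, and the timestamps are made up.

```python
def latency_metrics(request_sent_at, token_arrival_times):
    """Return (ttft, tpot) in seconds for one streamed response.

    request_sent_at: time the request was sent.
    token_arrival_times: arrival time of each output token, in order.
    """
    if not token_arrival_times:
        raise ValueError("need at least one output token")

    # TTFT: delay between sending the request and seeing the first token.
    ttft = token_arrival_times[0] - request_sent_at

    # TPOT: average gap between consecutive tokens after the first one.
    decode_steps = len(token_arrival_times) - 1
    if decode_steps == 0:
        tpot = None  # undefined for a single-token response
    else:
        tpot = (token_arrival_times[-1] - token_arrival_times[0]) / decode_steps
    return ttft, tpot


# Hypothetical timestamps (seconds): request at 10.0, four tokens streamed back.
ttft, tpot = latency_metrics(10.0, [10.5, 10.75, 11.0, 11.25])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

In practice these values are aggregated over many requests and reported as percentiles (for example p50 and p99) rather than per request.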

PagedAttention: vLLM’s memory management technique that stores the KV cache in fixed-size blocks that need not be contiguous, which reduces KV cache fragmentation.
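GPU Memory Utilization: Set the fraction of each GPU's memory that vLLM may use in total (model weights, activations, and KV cache) with --gpu-memory-utilization (default 0.9). Whatever remains after the weights and activation overhead becomes the KV cache.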
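
The arithmetic behind quantization and the memory budget can be sketched roughly as follows. This is a back-of-the-envelope estimate, not how vLLM computes it internally (vLLM measures activation memory with a profiling run). The 80 GB GPU, the 7-billion-parameter model, and the 2 GB activation figure are assumptions chosen for illustration, and 1 GB is taken as 10^9 bytes.

```python
GB = 1e9  # decimal gigabytes, for simplicity


def weight_bytes(num_params, bits_per_weight):
    """Memory needed for the weights alone at a given precision."""
    return num_params * bits_per_weight / 8


def kv_cache_budget_bytes(total_gpu_bytes, gpu_memory_utilization,
                          num_params, bits_per_weight, activation_bytes):
    """Rough estimate of what is left for the KV cache.

    budget = total GPU memory * utilization fraction
    kv     = budget - weights - activation overhead (never below zero)
    """
    budget = total_gpu_bytes * gpu_memory_utilization
    remaining = budget - weight_bytes(num_params, bits_per_weight) - activation_bytes
    return max(remaining, 0.0)


# Assumed setup: 80 GB GPU, 7e9 parameters, 2 GB of activation overhead.
for bits in (16, 4):
    weights = weight_bytes(7e9, bits)
    kv = kv_cache_budget_bytes(80 * GB, 0.9, 7e9, bits, 2 * GB)
    print(f"{bits}-bit: weights {weights / GB:.1f} GB, KV cache budget {kv / GB:.1f} GB")
```

Under these assumptions, dropping from 16-bit to 4-bit weights frees roughly 10 GB, all of which goes to the KV cache and therefore to more concurrent sequences or longer contexts.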

The standard Horizontal Pod Autoscaler (HPA) does not scale to zero by default and struggles with event-driven spikes.
Use KEDA to scale based on external metrics (e.g., Prometheus queries on inference queue length or API Gateway traffic).
Cold Starts: Scaling from 0 to 1 involves downloading massive weights and loading them into VRAM. Use node pre-warming or persistent volume caching to mitigate.
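
As a sketch of the KEDA approach, the snippet below builds a ScaledObject manifest with a Prometheus trigger and prints it as JSON, which kubectl accepts in place of YAML. The Deployment name, the Prometheus address, the threshold, and the query are all assumptions for illustration. In particular, the metric name vllm:num_requests_waiting should be checked against what your serving engine actually exports.

```python
import json


def keda_scaled_object(deployment, prometheus_url, query, threshold, max_replicas):
    """Build a KEDA ScaledObject that scales a Deployment on a Prometheus query."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 0,        # allow scale-to-zero when idle
            "maxReplicaCount": max_replicas,
            "cooldownPeriod": 300,       # seconds of inactivity before scaling to 0
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        # KEDA expects trigger metadata values as strings.
                        "threshold": str(threshold),
                    },
                }
            ],
        },
    }


# Hypothetical names and values; adjust to your cluster.
manifest = keda_scaled_object(
    deployment="vllm-server",
    prometheus_url="http://prometheus.monitoring.svc:9090",
    query="sum(vllm:num_requests_waiting)",  # assumed queue-length metric
    threshold=5,
    max_replicas=4,
)
print(json.dumps(manifest, indent=2))
```

With minReplicaCount set to 0, the first request after an idle period pays the full cold-start cost, so this setting is best paired with the weight-caching mitigations noted above.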

Monitor GPU memory utilization; an out-of-memory (OOM) error will kill the inference process.
Monitor token throughput (prompt and generated tokens per second).
Ensure Prometheus is configured to scrape the /metrics endpoint of your serving engine (vLLM's API server exposes one natively in Prometheus format).
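
To illustrate what token throughput means in terms of scraped data, here is a minimal sketch that reads a counter from two /metrics scrapes in the Prometheus text format and turns the difference into tokens per second. The counter name vllm:generation_tokens_total, the model label, and the sample values are assumptions. In a real setup Prometheus does this with rate() and also handles counter resets, which this sketch ignores.

```python
def sum_metric(metrics_text, metric_name):
    """Sum all samples of one metric in Prometheus text exposition format."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Form: name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Form: name value [timestamp]
            name, _, rest = line.partition(" ")
        if name == metric_name:
            total += float(rest.split()[0])
    return total


def tokens_per_second(earlier_text, later_text, seconds_between,
                      metric_name="vllm:generation_tokens_total"):
    """Average throughput between two scrapes of a monotonically increasing counter."""
    delta = sum_metric(later_text, metric_name) - sum_metric(earlier_text, metric_name)
    return delta / seconds_between


# Made-up scrapes taken 15 seconds apart.
SCRAPE_1 = """# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="demo-model"} 12000.0
"""
SCRAPE_2 = """# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="demo-model"} 15000.0
"""
print(tokens_per_second(SCRAPE_1, SCRAPE_2, 15.0), "tokens/s")
```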