Cheat Sheet: AI Infrastructure & LLM Serving
Core Terminology
- CUDA: NVIDIA’s parallel computing platform.
- VRAM: Video RAM; crucial memory on a GPU where model weights are loaded.
- Quantization: Reducing the precision of model weights (e.g., from 16-bit to 4-bit) to save memory at the cost of slight accuracy degradation.
- Continuous Batching: Dynamically scheduling token generation at the iteration level for high throughput.
- TTFT (Time to First Token): Latency metric for responsiveness.
- TPOT (Time per Output Token): Latency metric for generation speed.
Ollama Quick Commands
ollama run <model>: Pulls (if needed) and runs a model interactively.ollama pull <model>: Downloads the model weights to the local cache.ollama list: Shows installed models.ollama rm <model>: Deletes a model to free up storage.
vLLM Deployment Considerations
- PagedAttention: vLLM’s memory management technique that reduces KV cache fragmentation.
- GPU Memory Utilization: Configure the fraction of GPU memory allocated to the KV cache vs. model weights (typically
--gpu-memory-utilization 0.9).
Kubernetes Scaling (KEDA)
- Standard HPAs struggle with scale-to-zero and event-driven spikes.
- Use KEDA to scale based on external metrics (e.g., Prometheus queries on inference queue length or API Gateway traffic).
- Cold Starts: Scaling from 0 to 1 involves downloading massive weights and loading them into VRAM. Use node pre-warming or persistent volume caching to mitigate.
Observability Checklist
- Monitor GPU memory utilization (OOM errors will kill inference).
- Monitor Token Throughput.
- Ensure Prometheus is configured to scrape the
/metricsendpoint of your serving engine (vLLM natively supports this).