AI Infrastructure, GPU Slicing, CNI Injection & DCGM Observability
Version: 2.0.0
Purpose: Canonical standalone hands-on lab structure.
Required Inputs: Associated lesson, lab objective, environment details.
Outputs: Reproducible, independently testable hands-on lab markdown.
Lab Metadata
- Lab ID: LAB-AI-01
- Associated Lesson: Module 14 (MOD-AI: AI Infrastructure & GPU Management)
- Objective: Author a declarative Time-Slicing ConfigMap manifest, create a declarative ClusterPolicy Custom Resource manifest with DKMS, architect a HostDevice CNI configuration manifest for InfiniBand Virtual Function injection, and author a PrometheusRule manifest for DCGM XID hardware faults and VRAM exhaustion.
- Estimated Time: 45 minutes
- Difficulty: Advanced
Prerequisites
- Completion of Module 13 (MOD-SRE: Site Reliability Engineering) and Module 12 (MOD-OBS: Observability & Reliability).
- Foundational understanding of YAML Custom Resources, Kubernetes DaemonSets, CNI conflist specifications, and PromQL time-series mathematics.
- Access to a local bash terminal environment (with standard tools like mkdir, cat, and grep).
Environment Setup
Prepare your local terminal sandbox environment by setting up the required directory structure for your enterprise AI infrastructure manifests, GPU operator configurations, high-throughput CNI definitions, and DCGM hardware alert rules.
# Create the parent directory for the AI infrastructure and GPU management lab manifests
mkdir -p ~/enterprise-ai-lab/sharing
mkdir -p ~/enterprise-ai-lab/operator
mkdir -p ~/enterprise-ai-lab/networking
mkdir -p ~/enterprise-ai-lab/monitoring
cd ~/enterprise-ai-lab
# Verify the directory structure was created successfully
pwd
Step-by-Step Instructions
Step 1: Authoring a Declarative Time-Slicing ConfigMap Manifest
In this step, you will author a declarative Time-Slicing ConfigMap manifest (sharing/time-slicing-config.yaml) that configures the NVIDIA k8s-device-plugin to multiply a single physical GPU into 4 virtual nvidia.com/gpu resources (replicas: 4), eliminating hardware stranding for trusted internal development environments.
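To make the multiplication concrete, the capacity arithmetic can be sketched in shell. The node size used here (8 physical GPUs) is a hypothetical value that this lab does not specify; only replicas=4 comes from the manifest. Note that time-sliced replicas share the same GPU memory, with no memory or fault isolation between pods, which is why this approach is reserved for trusted workloads.

```shell
# Hypothetical node inventory; adjust to match your hardware.
physical_gpus_per_node=8
# Matches "replicas: 4" in the Time-Slicing ConfigMap.
replicas=4

# Each physical GPU is advertised "replicas" times as nvidia.com/gpu.
advertised=$((physical_gpus_per_node * replicas))
echo "nvidia.com/gpu advertised per node: ${advertised}"
```

Under these assumptions the scheduler would see 32 schedulable nvidia.com/gpu resources on the node, even though only 8 physical devices exist.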
cat << 'EOF' > sharing/time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-time-slicing-config
  namespace: kube-system
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        # Do not rename resource strings; keep advertising generic nvidia.com/gpu!
        # (renameByDefault belongs under sharing.timeSlicing, not under flags.)
        renameByDefault: false
        # Enable Time-Slicing! Allow up to 4 pods to share a single physical GPU concurrently!
        # (Multiplies 1 physical GPU into 4 virtual nvidia.com/gpu resources!)
        resources:
          - name: nvidia.com/gpu
            replicas: 4
EOF
Step 2: Creating a Declarative ClusterPolicy Manifest with DKMS
In this step, you will author a declarative ClusterPolicy Custom Resource manifest (operator/cluster-policy.yaml) that configures the NVIDIA GPU Operator to deploy containerized drivers (driver.version: 550.54.15), enable Dynamic Kernel Module Support (ENABLE_DKMS: "true") to survive Linux kernel updates, mount your Time-Slicing ConfigMap, and enable dcgm-exporter.
cat << 'EOF' > operator/cluster-policy.yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  # 1. Containerized NVIDIA Driver Configuration!
  driver:
    enabled: true
    version: "550.54.15" # Specify exact proprietary driver version to deploy across nodes!
    image: "driver"
    repository: "nvcr.io/nvidia"
    manager:
      env:
        # Enable Dynamic Kernel Module Support (DKMS) to ensure driver survives kernel updates!
        - name: ENABLE_DKMS
          value: "true"
  # 2. NVIDIA Container Toolkit Configuration!
  toolkit:
    enabled: true
    version: "v1.15.0-ubuntu22.04"
    installDir: "/usr/local/nvidia" # Mount path on host worker node!
  # 3. NVIDIA k8s-device-plugin Configuration!
  devicePlugin:
    enabled: true
    version: "v0.15.0"
    config:
      # Mount our Time-Slicing ConfigMap to eliminate hardware stranding!
      name: "nvidia-time-slicing-config"
      default: "any"
  # 4. NVIDIA DCGM Exporter Configuration (GPU Hardware Telemetry)!
  dcgmExporter:
    enabled: true
    version: "3.3.5-3.4.0-ubuntu22.04"
    serviceMonitor:
      enabled: true # Automatically generate Prometheus Operator ServiceMonitor!
  # 5. Multi-Instance GPU (MIG) Configuration!
  mig:
    strategy: single # Set to single or mixed depending on cluster multi-tenancy requirements
EOF
Step 3: Architecting a HostDevice CNI Configuration Manifest for InfiniBand VF Injection
In this step, you will author a declarative HostDevice CNI plugin configuration manifest (networking/ib-sriov-cni.conflist) that bypasses virtual ethernet (veth) pairs entirely (type: host-device) to inject physical InfiniBand Virtual Functions (ib0) directly into container pods, unlocking kernel-bypassing GPUDirect RDMA.
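A .conflist file is parsed as JSON, so a syntax check before handing it to a container runtime can catch mistakes early. The following is a minimal sketch, not part of the lab's required steps: validate_json is a hypothetical helper name, and it assumes that either jq or python3 is installed on your machine.

```shell
validate_json() {
  # $1 = path to a JSON file; returns 0 only if the file parses cleanly.
  if command -v jq > /dev/null 2>&1; then
    # "jq empty" parses the input and prints nothing on success.
    jq empty "$1" > /dev/null 2>&1
  elif command -v python3 > /dev/null 2>&1; then
    python3 -m json.tool "$1" > /dev/null 2>&1
  else
    echo "no JSON parser found (need jq or python3)" >&2
    return 2
  fi
}
```

Once the file from this step exists, running validate_json networking/ib-sriov-cni.conflist && echo "valid JSON" gives a quick pass/fail signal. JSON has no comment syntax, so any line beginning with # inside the file will make this check fail.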
# The HostDevice CNI plugin (type: host-device) bypasses virtual ethernet (veth) pairs entirely!
# "device" is the exact InfiniBand Virtual Function (VF) device name on the host!
# JSON has no comment syntax, so these notes stay outside the heredoc.
cat << 'EOF' > networking/ib-sriov-cni.conflist
{
  "cniVersion": "0.3.1",
  "name": "ib-sriov-network",
  "plugins": [
    {
      "type": "host-device",
      "device": "ib0",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    }
  ]
}
EOF
Step 4: Authoring a PrometheusRule Manifest for DCGM XID Hardware Faults
In this step, you will author a declarative Prometheus Operator PrometheusRule manifest (monitoring/dcgm-alerts.yaml) that defines PromQL alert expressions for VRAM exhaustion (DCGM_FI_DEV_FB_USED / ... > 95), XID hardware faults (DCGM_FI_DEV_XID_ERRORS > 0), and thermal throttling (DCGM_FI_DEV_GPU_TEMP > 85).
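The VRAM expression is a plain percentage of used over total framebuffer memory. A rough sketch of the same arithmetic in shell follows, using awk for floating-point division. vram_pct is a hypothetical helper, and the sample readings (78000 MiB used of 81920 MiB total) are illustrative assumptions rather than values from this lab; DCGM reports framebuffer fields in MiB.

```shell
vram_pct() {
  # $1 = framebuffer used (MiB), $2 = framebuffer total (MiB)
  # Prints used/total as a percentage with one decimal place.
  awk -v used="$1" -v total="$2" 'BEGIN { printf "%.1f\n", used / total * 100 }'
}

# Hypothetical sample reading for an 80 GiB-class GPU.
vram_pct 78000 81920   # prints 95.2
```

A reading of 95.2 is above the 95.0 threshold, so the alert would start its 2m pending window; it only fires if the condition holds for that whole period.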
cat << 'EOF' > monitoring/dcgm-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: production-gpu-hardware-alerts
  namespace: production
  labels:
    release: prometheus-stack # Mandatory label matching Prometheus Operator selector!
spec:
  groups:
    - name: dcgm-gpu-alerts
      rules:
        # ==============================================================================
        # ALERT 1: VRAM EXHAUSTION (Alerts BEFORE a fatal CUDA OOM crash occurs!)
        # ==============================================================================
        - alert: GpuVramExhaustionWarning
          # PromQL Math: Used VRAM / Total VRAM * 100 > 95%
          expr: >-
            DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL * 100 > 95.0
          for: 2m
          labels:
            severity: warning # Skips PagerDuty! Sends peaceful Slack warning!
            tier: ai-platform
          annotations:
            summary: "GPU VRAM utilization exceeded 95% on node {{ $labels.instance }} (GPU {{ $labels.gpu }})"
            description: "AI workload is approaching maximum High Bandwidth Memory capacity. Danger of CUDA Out-Of-Memory (OOM) crash."
            runbook_url: "https://wiki.mycompany.com/runbooks/gpu-vram-exhaustion"
        # ==============================================================================
        # ALERT 2: XID HARDWARE FAULT (Fatal Silicon Error -> PAGERDUTY PHONE CALL!)
        # ==============================================================================
        - alert: GpuXidHardwareFault
          # PromQL Math: Any active XID error code > 0 (e.g., XID 79: GPU has fallen off the bus!)
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          for: 1m
          labels:
            severity: critical # Master routing label! Triggers PagerDuty phone call!
            tier: ai-platform
          annotations:
            summary: "Fatal NVIDIA XID Error Code {{ $value }} detected on node {{ $labels.instance }} (GPU {{ $labels.gpu }})"
            description: "Underlying physical GPU silicon encountered a fatal hardware or kernel driver fault (e.g., XID 79 bus disconnect or XID 48 double-bit ECC error)."
            runbook_url: "https://wiki.mycompany.com/runbooks/gpu-xid-faults"
        # ==============================================================================
        # ALERT 3: THERMAL THROTTLING (Overheating -> PEACEFUL SLACK WARNING!)
        # ==============================================================================
        - alert: GpuThermalThrottlingWarning
          # PromQL Math: Physical GPU core temperature > 85 Celsius
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 3m
          labels:
            severity: warning
            tier: ai-platform
          annotations:
            summary: "GPU Core Temperature exceeded 85C on node {{ $labels.instance }} (GPU {{ $labels.gpu }})"
            runbook_url: "https://wiki.mycompany.com/runbooks/gpu-thermal-throttling"
EOF
Verification
To confirm that the lab was completed successfully, run the following commands from ~/enterprise-ai-lab. Each one searches a manifest for a key setting: the Time-Slicing multiplier, the DKMS flag, the HostDevice plugin type and device name, and the DCGM XID expression.
# 1. Verify Time-Slicing multiplier in the sharing ConfigMap manifest
cat sharing/time-slicing-config.yaml | grep -E "replicas:.*4"
# 2. Verify DKMS enabler in the ClusterPolicy manifest
#    (the name and value sit on separate lines, so -A1 prints the line after the match too)
cat operator/cluster-policy.yaml | grep -A1 "name: ENABLE_DKMS"
# 3. Verify HostDevice plugin type in the CNI configuration manifest
cat networking/ib-sriov-cni.conflist | grep -E "type.*host-device"
# 4. Verify InfiniBand Virtual Function device name in the CNI manifest
cat networking/ib-sriov-cni.conflist | grep -E "device.*ib0"
# 5. Verify XID hardware fault PromQL math in the DCGM alerting manifest
cat monitoring/dcgm-alerts.yaml | grep -E "DCGM_FI_DEV_XID_ERRORS.*>.*0"
Expected Output:
replicas: 4
- name: ENABLE_DKMS
value: "true"
"type": "host-device",
"device": "ib0",
expr: DCGM_FI_DEV_XID_ERRORS > 0
Troubleshooting
- Symptom: cat sharing/time-slicing-config.yaml | grep -E "replicas:.*4" returns no output.
  - Cause: You authored a ConfigMap without defining the replicas: 4 multiplication parameter under sharing.timeSlicing.resources.
  - Solution: Add replicas: 4 to your manifest to eliminate hardware stranding for trusted internal development environments.
- Symptom: cat operator/cluster-policy.yaml | grep -A1 "name: ENABLE_DKMS" returns no output.
  - Cause: You authored a ClusterPolicy manifest without the ENABLE_DKMS environment variable block under spec.driver.manager.env.
  - Solution: Ensure your driver block explicitly sets the ENABLE_DKMS environment variable to "true" so that the driver survives host Linux kernel updates.
- Symptom: cat networking/ib-sriov-cni.conflist | grep -E "type.*host-device" returns no output.
  - Cause: You authored a CNI conflist configured with type: calico or type: bridge, forcing traffic through veth pairs.
  - Solution: Update your manifest to use type: host-device and device: ib0 to unlock kernel-bypassing GPUDirect RDMA.
- Symptom: cat monitoring/dcgm-alerts.yaml | grep -E "DCGM_FI_DEV_XID_ERRORS.*>.*0" returns no output.
  - Cause: You authored a PrometheusRule manifest without the DCGM_FI_DEV_XID_ERRORS > 0 hardware fault expression.
  - Solution: Add the DCGM_FI_DEV_XID_ERRORS > 0 expression to ensure Prometheus fires critical PagerDuty calls for silicon disconnects.
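The individual grep checks can also be folded into one reusable pass/fail helper, which makes a missing setting easier to spot than empty output. This is a sketch only: check is a hypothetical function name introduced here, not something the lab or any tool provides.

```shell
check() {
  # $1 = human-readable description
  # $2 = extended regular expression to look for
  # $3 = file to search
  if grep -Eq "$2" "$3" 2>/dev/null; then
    echo "PASS: $1"
  else
    echo "FAIL: $1"
    return 1
  fi
}
```

For example, check "time-slicing replicas" "replicas:.*4" sharing/time-slicing-config.yaml prints a PASS or FAIL line and returns a non-zero status on failure, so several checks can be chained with && in a script.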
Cleanup
Safely remove the enterprise AI infrastructure lab directory and temporary manifest files from your terminal environment.
# Safely remove the enterprise AI infrastructure lab directory
rm -rf ~/enterprise-ai-lab
# Verify the directory was removed successfully
ls -la ~/enterprise-ai-lab 2>/dev/null || echo "Cleanup complete. Directory successfully removed."