Observability, Prometheus Monitoring & Grafana Dashboards
Version: 2.0.0
Purpose: Canonical standalone hands-on lab structure.
Required Inputs: Associated lesson, lab objective, environment details.
Outputs: Reproducible, independently testable hands-on lab markdown.
Lab Metadata
- Lab ID:
LAB-OBS-01 - Associated Lesson: Module 12 (
MOD-OBS: Observability with Prometheus, Grafana, OpenTelemetry & AlertManager) - Objective: Author a declarative Prometheus ServiceMonitor manifest, create a Grafana dashboard JSON schema with template variables, architect an OpenTelemetry Collector configuration with Tail-Based Sampling, and configure AlertManager grouping and inhibition rules.
- Estimated Time: 45 minutes
- Difficulty: Intermediate to Advanced
Prerequisites
- Completion of Module 10 (
MOD-K8S: Kubernetes Engineering) and Module 11 (MOD-CICD: CI/CD Pipelines & Automation). - Foundational understanding of YAML syntax, Linux bash execution, HTTP endpoints, and PromQL vector mathematics.
- Access to a local bash terminal environment (with standard tools like
mkdir,cat, andgrep).
Environment Setup
Prepare your local terminal sandbox environment by setting up the required directory structure for your enterprise observability, monitoring, tracing, and alerting manifests.
# Create the parent directory for the observability lab manifests
mkdir -p ~/enterprise-observability-lab/monitoring
mkdir -p ~/enterprise-observability-lab/dashboards
mkdir -p ~/enterprise-observability-lab/tracing
mkdir -p ~/enterprise-observability-lab/alerting
cd ~/enterprise-observability-lab
# Verify the directory structure was created successfully
pwdStep-by-Step Instructions
Step 1: Authoring a Declarative Prometheus ServiceMonitor with Cardinality Defense
In this step, you will author a declarative Prometheus Operator ServiceMonitor manifest (monitoring/payment-servicemonitor.yaml) that dynamically discovers Kubernetes Service endpoints and implements absolute cardinality defense (action: labeldrop) to protect Prometheus against Out-Of-Memory (OOM) server crashes.
cat << 'EOF' > monitoring/payment-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: production-payment-api-monitor
namespace: production
labels:
release: prometheus-stack # Mandatory label matching Prometheus Operator selector!
spec:
selector:
matchLabels:
app: payment-api # Target Kubernetes Service Label Selector!
endpoints:
- port: metrics # Target Service Port Name
interval: 15s # Execute HTTP GET scrape every 15 seconds
path: /metrics
scheme: http
scrapeTimeout: 10s
metricRelabelings:
# Protect against Cardinality Explosions by dropping dynamic user_id labels at scrape time!
- action: labeldrop
regex: user_id|client_ip|session_id|transaction_id
EOFStep 2: Creating a Declarative Grafana Dashboard JSON Schema with Template Variables
In this step, you will author a declarative Grafana dashboard JSON schema (dashboards/master-dashboard.json) that utilizes dynamic template variables ($cluster, $service) to scale across thousands of microservices and implements the RED Method (Rate, Errors, Duration) visual hierarchy.
cat << 'EOF' > dashboards/master-dashboard.json
{
"title": "Master RED Method Microservice Dashboard",
"timezone": "browser",
"refresh": "10s",
"schemaVersion": 38,
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1
}
]
},
"panels": [
{
"title": "Active Microservice Errors (Hero Row)",
"type": "stat",
"gridPos": { "h": 4, "w": 24, "x": 0, "y": 0 },
"targets": [
{
"datasource": "Prometheus",
"expr": "sum(rate(http_requests_total{cluster=\"$cluster\", service=\"$service\", status=~\"5.*\"}[5m]))"
}
]
},
{
"title": "P99 Request Duration (RED Method: Duration)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"targets": [
{
"datasource": "Prometheus",
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{cluster=\"$cluster\", service=\"$service\"}[5m])) by (le))",
"legendFormat": "{{service}} - P99 Latency"
}
]
}
]
}
EOFStep 3: Architecting an OpenTelemetry Collector Configuration with Tail-Based Sampling
In this step, you will author a declarative OpenTelemetry Collector configuration manifest (tracing/otel-collector-config.yaml) that defines OTLP receivers, protects active RAM using memory_limiter, and implements advanced tail_sampling to drop 95% of healthy traces while preserving 100% of error traces.
cat << 'EOF' > tracing/otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317 # Master OTLP gRPC ingestion port
http:
endpoint: 0.0.0.0:4318 # Master OTLP HTTP ingestion port
processors:
# Memory limiter processor protects the Collector against Out-Of-Memory (OOM) crashes!
memory_limiter:
check_interval: 1s
limit_percentage: 80
spike_limit_percentage: 20
# Advanced Tail-Based Sampling processor slashes storage billing while keeping 100% of errors!
tail_sampling:
decision_wait: 30s # Buffer trace spans in active RAM for 30 seconds to evaluate outcome
num_traces: 50000
expected_new_traces_per_sec: 2000
policies:
# Policy 1: Preserve 100% of trace spans containing errors!
- name: preserve-errors
type: status_code
status_code: {status_codes: [ERROR]}
# Policy 2: Preserve 100% of trace spans where latency exceeds 2 seconds (2000ms)!
- name: preserve-latency-bottlenecks
type: latency
latency: {threshold_ms: 2000}
# Policy 3: Probabilistically sample exactly 5% of perfectly healthy, fast traces!
- name: sample-healthy-traces
type: probabilistic
probabilistic: {sampling_percentage: 5.0}
# Batch processor batches telemetry packets before exporting to optimize network throughput
batch:
send_batch_size: 8192
timeout: 5s
exporters:
# Multi-Destination OTLP Routing: Zero Vendor Lock-In!
otlp/tempo:
endpoint: http://tempo-prod.monitoring.svc.cluster.local:4317
tls: {insecure: true}
otlp/prometheus:
endpoint: http://prometheus-k8s.monitoring.svc.cluster.local:9090/api/v1/otlp
tls: {insecure: true}
otlp/loki:
endpoint: http://loki-prod.monitoring.svc.cluster.local:3100/otlp
tls: {insecure: true}
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, tail_sampling, batch]
exporters: [otlp/tempo]
EOFStep 4: Configuring AlertManager Grouping, Inhibition Rules, and Routing Trees
In this step, you will author a declarative AlertManager configuration manifest (alerting/alertmanager.yml) that collapses duplicate alert storms into clean groups (group_by), implements inhibition rules (inhibit_rules) to suppress pod warnings during physical node crashes, and routes critical alerts to PagerDuty.
cat << 'EOF' > alerting/alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/T000/B000/XXXXXX'
# Define the master routing tree! Matches alert labels to specific engineering teams!
route:
group_by: ['cluster', 'alertname'] # Collapse related alerts into a single clean group!
group_wait: 30s # Pause for 30s to allow related alerts to arrive before firing the pager
group_interval: 5m # Wait 5m before sending an updated notification for the same group
repeat_interval: 12h # Wait 12h before repeating a notification for an unresolved alert
receiver: 'default-cluster-admins' # Default fallback receiver
routes:
# Route 1: High-priority critical alerts for Payment microservice -> PagerDuty!
- matchers:
- service = "payment-api"
- severity = "critical"
receiver: 'payment-team-pagerduty'
continue: false # Stop evaluating routing tree; match is absolute!
# Route 2: Low-priority warning alerts -> Peaceful Slack message (No PagerDuty!)
- matchers:
- severity = "warning"
receiver: 'peaceful-slack-warnings'
# Define advanced inhibition rules! Suppress downstream warnings during fatal outages!
inhibit_rules:
# Suppress InstanceDown pod alerts if the underlying physical NodeDown alert is actively firing!
- source_matchers:
- alertname = "NodeDown"
- severity = "critical"
target_matchers:
- alertname = "InstanceDown"
equal: ['node', 'cluster'] # Labels that MUST match between source and target alerts!
# Define secure external notification receivers!
receivers:
- name: 'default-cluster-admins'
slack_configs:
- channel: '#k8s-admins'
send_resolved: true
- name: 'payment-team-pagerduty'
pagerduty_configs:
- service_key: 'pd-live-key-payment-team-12345'
send_resolved: true
description: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }} in {{ .GroupLabels.cluster }}'
- name: 'peaceful-slack-warnings'
slack_configs:
- channel: '#k8s-warnings'
send_resolved: false
EOFVerification
To verify that your enterprise observability, monitoring, tracing, and alerting lab was completed successfully, execute the following verification commands to inspect your manifest contents and verify cardinality defense, dynamic PromQL variables, Tail-Based Sampling, and AlertManager inhibition rules.
# 1. Verify cardinality defense in the ServiceMonitor manifest
cat monitoring/payment-servicemonitor.yaml | grep -E "action.*labeldrop"
# 2. Verify dynamic PromQL variables in the Grafana dashboard schema
cat dashboards/master-dashboard.json | grep -E "service=\\\"\\\$service\\\""
# 3. Verify Tail-Based Sampling processor in the OpenTelemetry Collector configuration
cat tracing/otel-collector-config.yaml | grep -E "tail_sampling:"
# 4. Verify AlertManager grouping mechanics in the alerting configuration
cat alerting/alertmanager.yml | grep -E "group_by:.*cluster.*alertname"
# 5. Verify AlertManager inhibition rules in the alerting configuration
cat alerting/alertmanager.yml | grep -E "inhibit_rules:"Expected Output:
- action: labeldrop
"expr": "sum(rate(http_requests_total{cluster=\"$cluster\", service=\"$service\", status=~\"5.*\"}[5m]))"
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{cluster=\"$cluster\", service=\"$service\"}[5m])) by (le))",
tail_sampling:
group_by: ['cluster', 'alertname'] # Collapse related alerts into a single clean group!
inhibit_rules:Troubleshooting
-
Symptom:
cat monitoring/payment-servicemonitor.yaml | grep -E "action.*labeldrop"returns no output.- Cause: You completely omitted the
metricRelabelings:block in your ServiceMonitor YAML file. - Solution: Add
metricRelabelings: - action: labeldropto your manifest to ensure Prometheus scrubs dynamic labels during scraping.
- Cause: You completely omitted the
-
Symptom:
cat tracing/otel-collector-config.yaml | grep -E "tail_sampling:"returns no output.- Cause: You authored a basic OTel Collector configuration without the Tail-Based Sampling processor block.
- Solution: Ensure your
processors:block containstail_sampling: ...and yourservice.pipelines.tracesarray liststail_sampling.
-
Symptom:
cat alerting/alertmanager.yml | grep -E "group_by:.*cluster"returns no output.- Cause: You omitted the
group_by:configuration under the masterroute:block. - Solution: Add
group_by: ['cluster', 'alertname']to your AlertManager manifest to enable alert deduplication.
- Cause: You omitted the
Cleanup
Safely remove the enterprise observability lab directory and temporary manifest files from your terminal environment.
# Safely remove the enterprise observability lab directory
rm -rf ~/enterprise-observability-lab
# Verify the directory was removed successfully
ls -la ~/enterprise-observability-lab 2>/dev/null || echo "Cleanup complete. Directory successfully removed."