SRE Cheat Sheet
Core Metrics
- SLI (Service Level Indicator): What you measure (e.g., latency).
- SLO (Service Level Objective): Your target (e.g., 99.9% < 200ms).
- SLA (Service Level Agreement): The business contract with consequences.
- Error Budget:
100% - SLO(The allowance for unreliability).
Incident Command Roles
- Incident Commander: Leads the response, does not perform hands-on fixes.
- Ops / SME: Investigates and mitigates the issue.
- Communications: Handles external and internal stakeholder updates.
Automation & Toil
- Toil: Manual, repetitive work scaling linearly.
- Goal: Cap toil at 50% of SRE time. Use automation to eliminate it.