SRE Cheatsheet: Reliability Concepts, SLOs, and Incident Response
Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations problems. This page is a quick reference covering the terminology, metrics, and frameworks you will use in an SRE role.
SRE Terms Glossary
| Term | Definition |
|---|---|
| SLI (Service Level Indicator) | A specific metric that measures one aspect of service reliability (e.g., request success rate, latency) |
| SLO (Service Level Objective) | An internal target for an SLI (e.g., 99.9% of requests succeed over a 30-day window) |
| SLA (Service Level Agreement) | A contractual commitment to customers, usually less aggressive than the SLO, with financial consequences for breach |
| Error Budget | The allowable amount of unreliability derived from the SLO (e.g., 0.1% of requests may fail) |
| Toil | Manual, repetitive, automatable operational work that grows as the service grows but produces no lasting value |
| Reliability | The probability that a system performs its intended function under specified conditions |
| Availability | The proportion of time a system is operational and accessible |
| MTTR | Mean Time To Recovery — average time to restore service after an incident |
| MTBF | Mean Time Between Failures — average time between incidents |
| RTO | Recovery Time Objective — the maximum acceptable downtime after a failure |
| RPO | Recovery Point Objective — the maximum acceptable data loss measured in time |
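MTBF and MTTR from the table above are often combined into a steady-state availability estimate, MTBF / (MTBF + MTTR). The sketch below assumes MTBF is measured as operating time between failures; the function name and the sample numbers are illustrative and not part of any standard library.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up, assuming uptime and repair periods alternate."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers: one failure every 500 hours, 30 minutes to recover.
availability = steady_state_availability(mtbf_hours=500, mttr_hours=0.5)
print(f"Estimated availability: {availability:.4%}")
```

The formula makes the trade-off visible: halving MTTR improves availability about as much as doubling MTBF.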
SLO vs SLA: The Key Distinction
An SLO is an internal engineering target. It is tighter than what you promise customers so that you have a buffer before breaching contractual commitments. When you are close to consuming the full error budget, engineering teams focus on reliability rather than shipping new features.
An SLA is an external commitment: a legal or contractual agreement with customers. Breaching an SLA typically results in service credits or financial penalties. Because of this, SLAs are usually set looser than the corresponding SLOs.
Example:
- SLO: 99.95% availability over 30 days (internal target)
- SLA: 99.9% availability over 30 days (customer commitment)
This gives a buffer of 0.05 percentage points, about 21.6 minutes of downtime in a 30-day window, to absorb incidents before contractual obligations are at risk.
Availability Nines
| Availability | Downtime per year | Downtime per month |
|---|---|---|
| 99% | ~87.6 hours | ~7.3 hours |
| 99.9% | ~8.8 hours | ~43.8 minutes |
| 99.95% | ~4.4 hours | ~21.9 minutes |
| 99.99% | ~52.6 minutes | ~4.4 minutes |
| 99.999% | ~5.3 minutes | ~26 seconds |
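Moving from 99.9% to 99.99% cuts the allowed downtime by a factor of 10 (from about 8.76 hours to about 52.6 minutes per year), even though uptime itself rises by only 0.09 percentage points. Each additional nine is progressively harder and more expensive to achieve.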
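The figures in the table can be derived with a few lines of arithmetic. This is a minimal sketch that assumes a 365-day year and an average month of one twelfth of a year (730 hours); `downtime_hours` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 365 * 24              # 8,760 hours; leap years are ignored
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours, an average month


def downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime allowed by an availability target over a period."""
    return (1 - availability_pct / 100) * period_hours


for target in (99, 99.9, 99.95, 99.99, 99.999):
    per_year = downtime_hours(target, HOURS_PER_YEAR)
    per_month_minutes = downtime_hours(target, HOURS_PER_MONTH) * 60
    print(f"{target}%: {per_year:.2f} h/year, {per_month_minutes:.1f} min/month")
```

If your SLO window is a fixed 30 days rather than an average month, pass `30 * 24` as `period_hours` instead; the monthly figures shrink slightly.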
Error Budgets
How it is calculated: If your SLO is 99.9% availability over 30 days, up to 0.1% of requests may fail. For a service handling 1 million requests per day, that is 1,000 failed requests per day on average, or about 30,000 over the 30-day window.
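The calculation can be sketched as follows. The function name is hypothetical, and the SLO is passed as a string so that `Fraction` can do exact arithmetic and avoid floating-point rounding at the budget boundary.

```python
from fractions import Fraction


def error_budget_requests(slo_pct: str, total_requests: int) -> int:
    """Number of requests that may fail without violating the SLO."""
    allowed_failure_fraction = 1 - Fraction(slo_pct) / 100
    # Truncate: a partial request cannot fail, so never round the budget up.
    return int(allowed_failure_fraction * total_requests)


requests_per_day = 1_000_000
window_days = 30

daily_budget = error_budget_requests("99.9", requests_per_day)
window_budget = error_budget_requests("99.9", requests_per_day * window_days)
print(f"Daily budget: {daily_budget} failed requests")
print(f"30-day budget: {window_budget} failed requests")
```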
Burning the budget: Every failed request consumes part of the error budget. Incidents and bad deployments “burn” it faster than normal, and once failures exceed what the SLO allows, the budget is exhausted. SRE teams track budget consumption in real time.
Why error budgets matter — The error budget creates a shared language between development and operations teams:
- If the budget is healthy (plenty remaining), development teams can move fast and deploy frequently.
- If the budget is nearly exhausted, the focus shifts to reliability work: fixing bugs, improving tests, reducing toil.
- This removes the adversarial dynamic where ops says “no deploys” and dev says “we need to ship.”
Error budget policies should be written down. A common policy: if more than 50% of the error budget is consumed in the first half of the window, freeze feature releases and focus on reliability.
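A written policy like the one above is simple enough to encode as a check. The sketch below is one possible reading of that policy; the function names are hypothetical, and the burn-rate helper is an addition that expresses consumption relative to a pace that would spend the budget exactly over the full window.

```python
def burn_rate(budget_consumed: float, window_elapsed: float) -> float:
    """Average burn rate so far; 1.0 means the budget lasts exactly the window.

    Both arguments are fractions between 0 and 1.
    """
    if window_elapsed <= 0:
        raise ValueError("window_elapsed must be greater than zero")
    return budget_consumed / window_elapsed


def should_freeze_releases(budget_consumed: float, window_elapsed: float) -> bool:
    """Example policy: over 50% of the budget gone within the first half of the window."""
    return window_elapsed <= 0.5 and budget_consumed > 0.5


# Illustrative state: 60% of the budget used, 40% of the window elapsed.
consumed, elapsed = 0.6, 0.4
print(f"Burn rate: {burn_rate(consumed, elapsed):.2f}x")
print(f"Freeze feature releases: {should_freeze_releases(consumed, elapsed)}")
```

Real policies usually add further conditions, such as who can grant exceptions and when the freeze lifts; those belong in the written policy rather than in code.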
Key SRE Metrics
| Metric | Why it matters |
|---|---|
| Request rate (RPS) | Baseline traffic; anomalies indicate problems or attacks |
| Error rate | Proportion of requests that return an error; directly feeds SLI calculations |
| Latency p50 | Median response time; how the typical user experiences the service |
| Latency p95 | 95th percentile; captures the slower tail of requests |
| Latency p99 | 99th percentile; the threshold that only the slowest 1% of requests exceed; important for user experience |
| Saturation | How close a resource is to its capacity limit (CPU, memory, queue depth) |
| Availability | Percentage of time the service is reachable and returning valid responses |
Track p95 and p99 latency, not just averages. Averages hide tail latency, which is often what causes user complaints.
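To see why averages mislead, compare the mean with percentiles on a skewed sample. This sketch uses the nearest-rank percentile method and synthetic latencies invented for illustration; monitoring systems typically estimate percentiles from histograms instead.

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [20] * 90 + [300] * 7 + [2000] * 3

print(f"mean: {mean(latencies_ms)} ms")
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

In this sample the mean is 99 ms, which describes no real request: the typical request takes 20 ms (p50), while p95 and p99 expose the 300 ms and 2,000 ms tail.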
The Four Golden Signals
Defined in the Google SRE book, these four metrics cover most service health scenarios:
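Latency: The time it takes to serve a request. Track the latency of successful requests separately from that of failed requests. Errors that fail fast can make overall latency look better than it is, while slow errors keep users waiting before they fail.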
Traffic — The demand on the system. Requests per second, queries per second, transactions per second — whichever unit is most meaningful for your service.
Errors — The rate of failed requests. Include both explicit failures (HTTP 5xx) and implicit failures (HTTP 200 with wrong content, requests that time out).
Saturation — How full the service is. A 90% saturated service is approaching its limit; small spikes will cause degradation. Also consider leading indicators like memory pressure and queue depth.
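As a rough illustration, the four signals can be computed from a batch of request records. Everything here is an assumption made for the example: the `RequestRecord` shape, counting only HTTP 5xx as errors, and passing saturation in as a pre-measured utilization value. Implicit failures such as wrong content need application-level checks that this sketch does not cover.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RequestRecord:
    latency_ms: float
    status: int  # HTTP status code


def _mean_or_none(values: list[float]) -> Optional[float]:
    return mean(values) if values else None


def golden_signals(records: list[RequestRecord], window_seconds: float,
                   saturation: float) -> dict:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    failed = [r.latency_ms for r in records if r.status >= 500]
    succeeded = [r.latency_ms for r in records if r.status < 500]
    return {
        "success_latency_ms": _mean_or_none(succeeded),  # latency, successes only
        "error_latency_ms": _mean_or_none(failed),       # latency, failures only
        "traffic_rps": len(records) / window_seconds,    # traffic
        "error_rate": len(failed) / len(records) if records else 0.0,  # errors
        "saturation": saturation,                        # e.g. CPU utilization 0..1
    }


sample = [RequestRecord(10, 200), RequestRecord(30, 200),
          RequestRecord(20, 404), RequestRecord(500, 503)]
print(golden_signals(sample, window_seconds=2, saturation=0.9))
```

Means are used here only to keep the example short; in practice the latency signals would be reported as percentiles.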
Incident Response Framework
| Phase | Actions |
|---|---|
| Detection | Alert fires (or a user reports a problem); on-call engineer is paged |
| Triage | Confirm the incident is real; assess severity and user impact |
| Mitigation | Reduce or stop customer impact as fast as possible (rollback, failover, redirect traffic) |
| Resolution | Fix the underlying cause; restore full service; confirm metrics return to normal |
| Post-mortem | Document what happened, why, and what will be done to prevent recurrence |
Mitigation before root cause: Do not spend 30 minutes debugging while users are affected. Roll back first, investigate second.
Runbook Structure
A runbook is a document that guides an on-call engineer through diagnosing and resolving a specific alert. A good runbook contains:
- Alert name and context — Which alert triggered this? What does it mean?
- Service overview — What does this service do? What depends on it?
- Diagnosis steps — What dashboards to check, what log queries to run, what to look for
- Mitigation steps — Concrete commands or actions to reduce impact
- Resolution steps — How to fully fix the problem
- Escalation path — Who to contact if the runbook does not resolve the issue
- Related links — Dashboard URLs, other runbooks, architecture diagrams
Post-Mortem Structure
A post-mortem (also called an incident review) should be written within 24–48 hours of an incident being resolved.
| Section | Content |
|---|---|
| Timeline | Chronological log of events: when was the problem introduced, when detected, when mitigated, when resolved |
| Impact | Who was affected, how many users, financial or reputational impact, duration |
| Root cause | The underlying technical reason the incident occurred |
| Contributing factors | Other conditions that made the incident worse or harder to detect |
| Action items | Concrete tasks with owners and due dates to prevent recurrence |
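The timeline section lends itself to simple duration calculations that feed metrics such as MTTR. The sketch below uses invented timestamps and a hypothetical helper; teams differ on whether recovery time is measured from detection or from when the problem was introduced, so treat the definitions as one possible choice.

```python
from datetime import datetime, timedelta


def incident_durations(introduced: datetime, detected: datetime,
                       mitigated: datetime, resolved: datetime) -> dict[str, timedelta]:
    """Derive key durations from the four timeline milestones."""
    return {
        "time_to_detect": detected - introduced,
        "time_to_mitigate": mitigated - detected,
        "time_to_resolve": resolved - detected,
    }


# Invented example timeline for a single incident.
durations = incident_durations(
    introduced=datetime(2024, 1, 15, 14, 0),
    detected=datetime(2024, 1, 15, 14, 12),
    mitigated=datetime(2024, 1, 15, 14, 30),
    resolved=datetime(2024, 1, 15, 16, 45),
)
for name, value in durations.items():
    print(f"{name}: {value}")
```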
Blameless culture — A blameless post-mortem focuses on systems and processes, not people. The assumption is that engineers acted in good faith with the information they had. The question is not “who broke it?” but “what allowed this to happen, and how do we make the system safer?”
Toil
Toil is manual, repetitive, automatable work that scales with the size of the system. Examples: manually restarting a service when it crashes, manually provisioning user accounts, manually reviewing logs for known errors.
SRE teams aim to keep toil below 50% of their working time. The other 50% should be engineering work that reduces toil or improves reliability. Tracking toil is important because it surfaces automation opportunities and prevents the team from being consumed by operational busywork.
Common SRE Interview Questions
| Question | Short answer |
|---|---|
| What is the difference between SLO and SLA? | SLO is an internal target; SLA is an external contractual commitment. SLOs are stricter than SLAs to provide a buffer. |
| What is an error budget? | The acceptable amount of unreliability defined by 100% minus the SLO. When it is exhausted, reliability work takes priority over new features. |
| What are the Four Golden Signals? | Latency, Traffic, Errors, Saturation. |
| What is toil and why does it matter? | Repetitive, automatable manual work that grows as the service grows. SREs track and reduce it because it crowds out engineering work. |
| What is a blameless post-mortem? | A post-incident review focused on improving systems, not assigning blame to individuals. |
| What is the difference between RTO and RPO? | RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss. |
| What is p99 latency? | The latency at the 99th percentile: 99% of requests complete at or below this value. |