SRE Cheatsheet: Reliability Concepts, SLOs, and Incident Response

Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations problems. This page is a quick reference covering the terminology, metrics, and frameworks you will use in an SRE role.

SRE Terms Glossary

Term	Definition
SLI (Service Level Indicator)	A specific metric that measures one aspect of service reliability (e.g., request success rate, latency)
SLO (Service Level Objective)	An internal target for an SLI (e.g., 99.9% of requests succeed over a 30-day window)
SLA (Service Level Agreement)	A contractual commitment to customers, usually less aggressive than the SLO, with financial consequences for breach
Error Budget	The allowable amount of unreliability derived from the SLO (e.g., 0.1% of requests may fail)
Toil	Manual, repetitive, automatable operational work that grows as the service grows and produces no lasting value
Reliability	The probability that a system performs its intended function under specified conditions
Availability	The proportion of time a system is operational and accessible
MTTR	Mean Time To Recovery — average time to restore service after an incident
MTBF	Mean Time Between Failures — average time between incidents
RTO	Recovery Time Objective — the maximum acceptable downtime after a failure
RPO	Recovery Point Objective — the maximum acceptable data loss measured in time
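
As a rough sketch of how MTTR, MTBF, and availability relate, the Python below derives all three from a list of incident start and end times. The function names and the sample incidents are illustrative assumptions. MTBF is taken here as total uptime divided by the number of failures, which is one common convention among several.

```python
from datetime import datetime, timedelta


def total_downtime(incidents):
    """Sum of (end - start) over all incidents."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents):
    """Mean Time To Recovery: average incident duration."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean Time Between Failures: uptime in the window per failure (assumed convention)."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


def availability(mtbf_value, mttr_value):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_value / (mtbf_value + mttr_value)


# Hypothetical 30-day window with two incidents (30 and 90 minutes long).
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=30)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 30)),
]

recovery = mttr(incidents)
between = mtbf(incidents, window_start, window_end)
print(recovery, between, f"{availability(between, recovery):.4%}")
```

Under these assumptions, shortening recovery time and reducing failure frequency both raise availability, which is why the two metrics are usually tracked together.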

SLO vs SLA: The Key Distinction

An SLO is an internal engineering target. It is tighter than what you promise customers so that you have a buffer before breaching contractual commitments. When you are close to consuming the full error budget, engineering teams focus on reliability rather than shipping new features.

An SLA is an external commitment: a legal or contractual agreement with customers. Breaching an SLA typically results in service credits or financial penalties. Because of this, SLAs are normally set looser than the corresponding SLOs, so the internal target is missed before the contractual one.

Example:

SLO: 99.95% availability over 30 days (internal target)
SLA: 99.9% availability over 30 days (customer commitment)

This gives a buffer of 0.05 percentage points to absorb incidents before contractual obligations are at risk.
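
One way to picture the buffer is a small check that places a measured availability figure relative to both thresholds. This is a minimal sketch: the function name and the returned labels are hypothetical, and the default thresholds are simply the example values above.

```python
def classify_availability(measured_pct, slo_pct=99.95, sla_pct=99.9):
    """Place a measured availability relative to the SLO and the SLA."""
    if sla_pct > slo_pct:
        raise ValueError("the SLA is expected to be no stricter than the SLO")
    if measured_pct >= slo_pct:
        return "meets SLO"
    if measured_pct >= sla_pct:
        # Inside the buffer: internal target missed, contract still honored.
        return "SLO missed, SLA intact"
    return "SLA breached"


for measured in (99.97, 99.92, 99.85):
    print(measured, classify_availability(measured))
```

The middle case is the one the buffer exists for: the team gets an internal signal to act before any customer commitment is broken.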

Availability Nines

Availability	Downtime per year	Downtime per month
99%	~87.6 hours	~7.3 hours
99.9%	~8.8 hours	~43.8 minutes
99.95%	~4.4 hours	~21.9 minutes
99.99%	~52.6 minutes	~4.4 minutes
99.999%	~5.3 minutes	~26 seconds

Moving from 99.9% to 99.99% cuts the allowed downtime tenfold, even though uptime itself rises by only 0.09 percentage points. Each additional nine is progressively harder and more expensive to achieve.
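
The figures in the table can be reproduced with a few lines of Python. This sketch assumes a 365-day year and an average month of one twelfth of a year (730 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24              # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730, an average month


def allowed_downtime_seconds(availability_pct, period_hours):
    """Downtime permitted in a period at the given availability percentage."""
    return (1 - availability_pct / 100) * period_hours * 3600


for pct in (99, 99.9, 99.95, 99.99, 99.999):
    per_year_h = allowed_downtime_seconds(pct, HOURS_PER_YEAR) / 3600
    per_month_min = allowed_downtime_seconds(pct, HOURS_PER_MONTH) / 60
    print(f"{pct}%: {per_year_h:.2f} h/year, {per_month_min:.2f} min/month")
```

Dividing the result for 99.9% by the result for 99.99% gives a factor of 10, which is the sense in which each extra nine is a tenfold tightening.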

Error Budgets

How it is calculated — If your SLO is 99.9% availability over 30 days, you have 0.1% of requests that may fail. For a service handling 1 million requests per day, that is 1,000 failed requests per day, or ~30,000 per month.
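
The arithmetic above can be written as a short Python sketch. The function names are assumptions for illustration, and the budget is request-based, matching the example of 1 million requests per day.

```python
def error_budget_requests(slo_pct, total_requests):
    """Requests allowed to fail in the window: (100% - SLO) of all requests."""
    return total_requests * (100 - slo_pct) / 100


def budget_remaining_fraction(slo_pct, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative once the SLO is missed."""
    budget = error_budget_requests(slo_pct, total_requests)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1 - failed_requests / budget


requests_per_day = 1_000_000
window_days = 30

daily_budget = error_budget_requests(99.9, requests_per_day)
window_budget = error_budget_requests(99.9, requests_per_day * window_days)
print(round(daily_budget), round(window_budget))
```

For instance, 12,000 failed requests so far in the window would leave about 60% of the budget unspent.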

Burning the budget — When incidents or bad deployments cause more failures than the SLO allows, you are “burning” the error budget. SRE teams track budget consumption in real time.

Why error budgets matter — The error budget creates a shared language between development and operations teams:

If the budget is healthy (plenty remaining), development teams can move fast and deploy frequently.
If the budget is nearly exhausted, the focus shifts to reliability work: fixing bugs, improving tests, reducing toil.
This removes the adversarial dynamic where ops says “no deploys” and dev says “we need to ship.”

Error budget policies should be written down. A common policy: if the error budget is >50% consumed in the first half of the window, freeze feature releases and focus on reliability.
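
A written policy like this one is simple enough to encode directly. The sketch below is only one reading of the rule above: both inputs are fractions between 0 and 1, and the function names are hypothetical. It also computes a burn rate, a commonly used ratio of budget spent to time elapsed that is not named in the policy itself.

```python
def burn_rate(budget_consumed, window_elapsed):
    """Budget spent relative to time elapsed; 1.0 means the budget lasts exactly the window."""
    if window_elapsed <= 0:
        raise ValueError("window_elapsed must be positive")
    return budget_consumed / window_elapsed


def should_freeze_releases(budget_consumed, window_elapsed):
    """Policy from the text: more than 50% consumed within the first half of the window."""
    return window_elapsed <= 0.5 and budget_consumed > 0.5


# Hypothetical snapshot: 12 days into a 30-day window, 60% of the budget spent.
consumed, elapsed = 0.6, 12 / 30
print(should_freeze_releases(consumed, elapsed), round(burn_rate(consumed, elapsed), 2))
```

A burn rate above 1.0 means the budget will run out before the window ends if nothing changes, so it can serve as an early warning before the freeze condition is reached.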

Key SRE Metrics

Metric	Why it matters
Request rate (RPS)	Baseline traffic; anomalies indicate problems or attacks
Error rate	Proportion of requests that return an error; directly feeds SLI calculations
Latency p50	Median response time; how the typical user experiences the service
Latency p95	95th percentile; captures the slower tail of requests
Latency p99	99th percentile; the value that only the slowest 1% of requests exceed; important for user experience
Saturation	How close a resource is to its capacity limit (CPU, memory, queue depth)
Availability	Percentage of time the service is reachable and returning valid responses

Track p95 and p99 latency, not just averages. Averages hide tail latency, which is often what causes user complaints.
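
A small, made-up sample shows how an average can hide the tail. The sketch uses the nearest-rank percentile definition; monitoring systems often interpolate or estimate from histograms instead, so treat the helper as an assumption rather than how any particular tool computes it.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies: 90 fast requests and 10 very slow ones, in milliseconds.
latencies_ms = [20] * 90 + [2000] * 10

print("mean:", mean(latencies_ms))
for p in (50, 95, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
```

In this sample the mean sits far from both the typical request and the slow ones, while p50 and p95 together show that most requests are fast and a meaningful minority are very slow.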

The Four Golden Signals

Defined in the Google SRE book, these four metrics cover most service health scenarios:

Latency: The time it takes to serve a request. Track success latency and error latency separately. Fast errors can make the overall average look healthier than it is, and a slow error is worse for users than a fast one.

Traffic — The demand on the system. Requests per second, queries per second, transactions per second — whichever unit is most meaningful for your service.

Errors — The rate of failed requests. Include both explicit failures (HTTP 5xx) and implicit failures (HTTP 200 with wrong content, requests that time out).

Saturation — How full the service is. A 90% saturated service is approaching its limit; small spikes will cause degradation. Also consider leading indicators like memory pressure and queue depth.
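
As an illustration, the four signals can be computed from a batch of request records. Everything in this sketch is an assumption: the `Request` record, the use of in-flight requests against a concurrency limit as the saturation measure, and the treatment of only HTTP 5xx as errors (implicit failures such as wrong content would need separate checks).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float
    status: int


def _mean_or_none(values):
    """Average of the values, or None when there are none."""
    return mean(values) if values else None


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    return {
        "success_latency_ms": _mean_or_none(ok),
        "error_latency_ms": _mean_or_none(failed),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,
    }


sample = [Request(100, 200), Request(300, 200), Request(50, 500), Request(150, 503)]
print(golden_signals(sample, window_seconds=2, in_flight=45, capacity=50))
```

Means are used here only to keep the sketch short; in practice the latency entries would more usefully be percentiles.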

Incident Response Framework

Phase	Actions
Detection	Alert fires (or a user reports a problem); on-call engineer is paged
Triage	Confirm the incident is real; assess severity and user impact
Mitigation	Reduce or stop customer impact as fast as possible (rollback, failover, redirect traffic)
Resolution	Fix the underlying cause; restore full service; confirm metrics return to normal
Post-mortem	Document what happened, why, and what will be done to prevent recurrence

Mitigation before root cause: Do not spend 30 minutes debugging while users are affected. Roll back first, investigate second.

Runbook Structure

A runbook is a document that guides an on-call engineer through diagnosing and resolving a specific alert. A good runbook contains:

Alert name and context — Which alert triggered this? What does it mean?
Service overview — What does this service do? What depends on it?
Diagnosis steps — What dashboards to check, what log queries to run, what to look for
Mitigation steps — Concrete commands or actions to reduce impact
Resolution steps — How to fully fix the problem
Escalation path — Who to contact if the runbook does not resolve the issue
Related links — Dashboard URLs, other runbooks, architecture diagrams

Post-Mortem Structure

A post-mortem (also called an incident review) should be written within 24–48 hours of an incident being resolved.

Section	Content
Timeline	Chronological log of events: when was the problem introduced, when detected, when mitigated, when resolved
Impact	Who was affected, how many users, financial or reputational impact, duration
Root cause	The underlying technical reason the incident occurred
Contributing factors	Other conditions that made the incident worse or harder to detect
Action items	Concrete tasks with owners and due dates to prevent recurrence
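
The four timestamps in the Timeline row are enough to derive the durations a post-mortem usually reports. The sketch below is illustrative: the function name, the dictionary keys, and the sample times are assumptions.

```python
from datetime import datetime


def timeline_durations(introduced, detected, mitigated, resolved):
    """Derive common post-mortem durations from the four timeline timestamps."""
    return {
        "time_to_detect": detected - introduced,
        "time_to_mitigate": mitigated - detected,
        "user_impact": mitigated - introduced,
        "time_to_resolve": resolved - introduced,
    }


# Hypothetical incident on a single day.
durations = timeline_durations(
    introduced=datetime(2024, 3, 1, 10, 0),
    detected=datetime(2024, 3, 1, 10, 12),
    mitigated=datetime(2024, 3, 1, 10, 30),
    resolved=datetime(2024, 3, 1, 11, 45),
)
for name, value in durations.items():
    print(name, value)
```

A long time to detect points at alerting gaps, while a long time to mitigate points at missing runbooks or slow rollbacks, so reporting them separately helps target action items.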

Blameless culture — A blameless post-mortem focuses on systems and processes, not people. The assumption is that engineers acted in good faith with the information they had. The question is not “who broke it?” but “what allowed this to happen, and how do we make the system safer?”

Toil

Toil is manual, repetitive, automatable work that scales with the size of the system. Examples: manually restarting a service when it crashes, manually provisioning user accounts, manually reviewing logs for known errors.

A widely cited guideline, from Google's SRE practice, is to keep toil below 50% of each engineer's working time. The remaining time should go to engineering work that reduces toil or improves reliability. Tracking toil is important because it surfaces automation opportunities and prevents the team from being consumed by operational busywork.
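
Tracking can be as simple as tagging logged hours and computing the share that counts as toil. This is a minimal sketch; the category names, the hours, and the function name are made up for illustration.

```python
TOIL_LIMIT = 0.5  # the 50% ceiling from the text


def toil_fraction(hours_by_category, toil_categories):
    """Share of logged hours spent in categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        raise ValueError("no hours logged")
    toil = sum(h for name, h in hours_by_category.items() if name in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = {
    "manual restarts": 6,
    "account provisioning": 8,
    "automation work": 14,
    "design and review": 12,
}
fraction = toil_fraction(week, {"manual restarts", "account provisioning"})
print(f"toil: {fraction:.0%}, within limit: {fraction < TOIL_LIMIT}")
```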

Common SRE Interview Questions

Question	Short answer
What is the difference between SLO and SLA?	SLO is an internal target; SLA is an external contractual commitment. SLOs are stricter than SLAs to provide a buffer.
What is an error budget?	The acceptable amount of unreliability defined by 100% minus the SLO. When it is exhausted, reliability work takes priority over new features.
What are the Four Golden Signals?	Latency, Traffic, Errors, Saturation.
What is toil and why does it matter?	Repetitive, automatable manual work that grows as the service grows. SREs track and reduce it because it crowds out engineering work.
What is a blameless post-mortem?	A post-incident review focused on improving systems, not assigning blame to individuals.
What is the difference between RTO and RPO?	RTO is the maximum acceptable downtime; RPO is the maximum acceptable data loss.
What is p99 latency?	The latency at the 99th percentile: 99% of requests complete at or below this value, and only the slowest 1% exceed it.