SRE Roadmap: How to Become a Site Reliability Engineer

Site reliability engineering is one of the more senior-skewed roles in cloud infrastructure — most people arrive at SRE after several years in software engineering, DevOps, or cloud operations. This roadmap covers realistic entry paths, the skills that matter, and the reliability engineering mindset that defines the role.

What SRE actually is

Site reliability engineering was created at Google in the early 2000s. The core idea, described in Google’s SRE book (freely available online), is to apply software engineering principles to operations problems. Instead of managing infrastructure manually, SREs write software to automate it. Instead of reacting to outages, they design systems to be inherently more reliable.

The Google SRE book sets a concrete limit: SREs should spend no more than 50% of their time on operational work (on-call, incident response, manual tasks). The rest goes to engineering work that reduces that operational burden. When operational load exceeds the 50% threshold, the SRE team hands the excess back to the development team. Google treats this as a hard constraint, not an aspiration.
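As a rough illustration of how a team might track the 50% cap, the sketch below totals a week of logged hours and flags when operational work exceeds the limit. The `WeeklyHours` categories and function names are hypothetical, chosen for this example rather than taken from the SRE book.

```python
from dataclasses import dataclass

OPS_CAP = 0.50  # the 50% ceiling on operational work described above


@dataclass
class WeeklyHours:
    """Hours an SRE team logged in one week (hypothetical categories)."""
    on_call: float
    incident_response: float
    manual_tasks: float
    engineering: float

    @property
    def operational(self) -> float:
        # Operational work = on-call + incident response + manual tasks.
        return self.on_call + self.incident_response + self.manual_tasks

    @property
    def total(self) -> float:
        return self.operational + self.engineering


def ops_fraction(week: WeeklyHours) -> float:
    """Share of total time spent on operational work (0.0 if nothing was logged)."""
    return week.operational / week.total if week.total else 0.0


def over_cap(week: WeeklyHours, cap: float = OPS_CAP) -> bool:
    """True when operational work exceeds the cap and should be escalated."""
    return ops_fraction(week) > cap
```

A team that logs 120 operational hours out of 200 total is at 60% and over the cap; at 80 out of 200 it is at 40% and within it.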

In practice, most organisations that have “SRE” roles are not running the pure Google model. The title spans a range from “DevOps engineer with a reliability focus” to “software engineer who works on production systems.” What is consistent across most SRE roles: a stronger emphasis on software engineering than pure operations, and a specific focus on reliability over feature delivery.

DevOps or SRE: a decision guide based on your background

This is the question people ask most when they discover both paths. The right answer depends on where you are starting from.

Work through these questions honestly:

What is your stronger background — development or operations?

If you lean toward development: SRE is the more natural path. SRE roles at most organisations require genuine software engineering ability — writing production-quality Go, Python, or Java code, not just Bash scripts and Terraform. Engineers with a software background who want to focus on reliability fit the SRE mold well.

If you lean toward operations: DevOps is typically the more accessible path. DevOps roles require coding competence but are more forgiving about depth. The toolchain focus (CI/CD, containers, IaC) plays to operations strengths while adding the software automation layer.

Do you want to work on reliability as a specialisation, or on developer productivity?

SREs are fundamentally reliability specialists. They think in uptime, latency, error rates, and failure modes. If the idea of designing systems that stay available under failure conditions is intellectually interesting to you, SRE is a good fit.

DevOps engineers tend to focus on shipping velocity: making it faster and safer to get code from developer laptop to production. If that workflow optimisation appeals more, the DevOps engineer roadmap is the better path.

How much production experience do you have?

Most organisations expect SRE candidates to have 3–5 years of production experience. You need to have seen real production systems fail, participated in incident response, and developed intuition about how distributed systems behave under load. This experience cannot be faked, and there is no shortcut to it.

If you are early in your career, aim for cloud engineering or software engineering first, then move toward SRE once you have genuine production depth.

The reliability engineering mental model

SRE has a specific vocabulary and framework for thinking about reliability. Understanding this is essential — it is not just terminology, it is a genuinely useful way of reasoning about systems.

Service Level Indicators (SLIs)

An SLI is a measurable property of a service that reflects how well it is working from a user’s perspective. Good SLIs are quantifiable and user-facing. Examples:

  • The proportion of HTTP requests that returned a successful response (not a 5xx error)
  • The proportion of requests that completed in under 500 milliseconds
  • The proportion of background jobs that completed successfully within their expected window

The key discipline is choosing SLIs that reflect what users actually care about, not what is easy to measure. CPU utilisation is not an SLI. Whether the user’s request succeeded is.
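The first two example SLIs can be computed directly from request records. The following is a minimal sketch: the `Request` type and its fields are assumptions made for illustration, and a real system would usually derive these ratios from metrics or logs rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request


def availability_sli(requests: Sequence[Request]) -> float:
    """Proportion of requests that did not return a 5xx error."""
    if not requests:
        return 1.0  # assumption: no traffic is treated as fully available
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 500.0) -> float:
    """Proportion of requests that completed in under threshold_ms."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)
```

Both functions are "good events divided by total events" ratios, which is the shape that makes an SLI easy to pair with a percentage target. Note that a 404 counts as a good event here, because it is the client's error rather than the service's.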

Service Level Objectives (SLOs)

An SLO is a target for an SLI. If your SLI is “proportion of successful requests,” your SLO might be “99.9% of requests succeed over a rolling 28-day window.” This is the reliability target the team commits to meeting.

SLOs should be set at the level of reliability that users actually need — not the maximum possible. A service that handles internal batch jobs probably does not need the same SLO as a payment processing API. Setting SLOs too high is a mistake: it creates unnecessary operational pressure and makes it harder to deploy new features.
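To make the "target over a rolling window" idea concrete, here is a small sketch that keeps daily good/total request counts for the most recent days and compares the aggregate SLI to the target. The class name `RollingSLO` and the daily granularity are assumptions for this example; production SLO tooling typically works from a metrics store instead.

```python
from collections import deque


class RollingSLO:
    """Compares an SLI aggregated over a rolling window of days to a target."""

    def __init__(self, target: float = 0.999, window_days: int = 28) -> None:
        self.target = target
        # maxlen makes the oldest day fall out automatically as new days arrive.
        self._days = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        """Store one day's count of good requests and total requests."""
        self._days.append((good, total))

    def sli(self) -> float:
        """Good requests divided by total requests across the whole window."""
        good = sum(g for g, _ in self._days)
        total = sum(t for _, t in self._days)
        return good / total if total else 1.0  # assumption: no traffic counts as met

    def is_met(self) -> bool:
        return self.sli() >= self.target
```

Aggregating counts across the window, rather than averaging daily percentages, weights each request equally, so a quiet day with a few failures does not distort the result.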

Error budgets

An error budget is the amount of unreliability your SLO permits. If your SLO is 99.9% availability over 30 days, your error budget is the remaining 0.1%. A 30-day window contains 43,200 minutes, so that 0.1% translates to roughly 43 minutes of allowed downtime per month.
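The arithmetic behind that figure is simple enough to sketch. The helper below (the name `error_budget_minutes` is chosen for this example) converts an availability SLO and a window length into minutes of allowed downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability SLO over a window.

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo) * window_minutes
```

For a 30-day window this gives about 432 minutes at 99%, about 43 minutes at 99.9%, and a little over 4 minutes at 99.99%. Each additional nine cuts the budget by a factor of ten, which is why setting an SLO higher than users need is costly.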

The error budget is shared between SRE and development teams. When the budget is healthy, development teams can ship freely — the system has headroom. When the budget is nearly exhausted, new deployments are paused until it recovers. This aligns the incentives of reliability and development work without requiring constant negotiation.
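One way to picture such a policy is as a gate in the deployment pipeline. The sketch below is illustrative only: the function names and the 10% freeze threshold are assumptions, and real error budget policies are agreed between the teams involved rather than hard-coded.

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent in the current window.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates
    actual_bad = total - good          # failures observed so far
    return 1.0 - actual_bad / allowed_bad


def deploys_allowed(remaining: float, freeze_below: float = 0.10) -> bool:
    """Pause new deployments once less than freeze_below of the budget is left.

    The 10% default is a hypothetical threshold for this example.
    """
    return remaining > freeze_below
```

With a 99.9% SLO and one million requests, 1,000 failures are tolerated. After 500 failures half the budget remains and deployments continue; after 950 failures only 5% remains and the gate closes until the window recovers.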

Error budgets are the mechanism that makes SLOs actionable: they translate abstract reliability goals into concrete constraints on development work.

Blameless postmortems

When something fails, an SRE team runs a blameless postmortem. The goal is to understand what happened, why the system allowed it to happen, and what changes will prevent recurrence. The focus is always on systemic factors — process gaps, missing alerts, unclear runbooks, unexpected dependencies — rather than individual mistakes.

Writing a good postmortem is a skill. A useful postmortem describes the incident timeline clearly, identifies multiple contributing factors (not just “the engineer deployed bad code”), and produces action items that address root causes rather than symptoms.

Realistic entry paths into SRE

There is no single route into SRE, but most SREs arrive through one of these paths:

Path 1: Software engineer who moves toward reliability

Software engineers with 3–5 years of experience who have been on-call for production services, participated in incident response, and developed an interest in how systems fail are strong SRE candidates. The coding depth is already there; the reliability mindset develops through production exposure.

This is the path Google originally intended for SRE: hire software engineers and teach them reliability engineering, rather than hire operations engineers and teach them to code. In practice, this path requires deliberately seeking production exposure — joining teams with on-call responsibilities, volunteering for incident response, and writing the tooling that automates operational tasks.

Path 2: Cloud or DevOps engineer who develops software engineering depth

Cloud and DevOps engineers who invest seriously in software engineering — moving beyond scripting into writing production services, building internal tooling, and contributing to open source — can transition into SRE roles. The infrastructure knowledge is an asset; the gap is usually software engineering rigour.

This path takes longer than Path 1, typically 5–7 years before reaching SRE-eligible experience levels, because it requires building two separate bodies of knowledge. But it is a genuine path, and engineers who follow it often bring an operational perspective that pure software engineers lack.

Path 3: Direct entry at smaller organisations

Some smaller companies hire more junior engineers into “SRE” roles that are closer to cloud or platform engineering in practice. These roles are a useful stepping stone if you approach them deliberately — focusing on reliability principles, SLI/SLO implementation, and building automation. They should not be mistaken for mature SRE practice, but they can build relevant experience.

Skills expected in SRE roles

Software engineering

This is the differentiator. SRE roles at most organisations — especially the better-paid ones at tech companies — expect genuine software engineering ability. You need to write readable, testable, production-quality code in at least one language. Python is the most common choice; Go is increasingly valued for performance-sensitive tooling; Java is expected in some enterprise environments.

“I can write scripts” is not enough. Aim to contribute meaningfully to a software project of real complexity — open source, a side project with real users, or internal tooling with significant scope.

Distributed systems understanding

SREs need to reason about how distributed systems fail. Concepts that matter: consensus and coordination, eventual consistency, CAP theorem trade-offs, thundering herd, cascading failures, circuit breakers, and graceful degradation. You do not need a computer science degree, but you do need to have read about and worked with these patterns.
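Of the patterns listed, the circuit breaker is the easiest to show in a few lines. The following is a deliberately minimal, single-threaded sketch: the class and parameter names are invented for this example, and production implementations usually add thread safety, per-error-type handling, and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call
        self._clock = clock                 # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of piling load onto a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open is what limits cascading failures: callers stop queueing requests behind a dependency that is already struggling, and can fall back to a degraded response instead.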

Observability

SRE roles expect deep knowledge of metrics, logging, and tracing. That means more than reading dashboards: designing what to measure, writing queries in PromQL or a similar query language, building alerting that is actually actionable (not just noisy), and using distributed tracing to debug multi-service requests.

Incident response

Real experience running or participating in production incidents. Knowing how to communicate during an incident (clear, concise status updates), how to triage effectively, and how to coordinate across teams. This is genuinely hard to learn without real production exposure.

Infrastructure and cloud platform knowledge

SREs need to understand the infrastructure their services run on: Kubernetes operations, cloud networking, load balancers, storage systems, and database reliability patterns. This knowledge can be shallower than a dedicated cloud engineer’s, but it cannot be superficial. It must be enough to diagnose infrastructure-level causes of reliability problems.

SRE career stages

Junior SRE (rare, 0–3 years)

Genuinely junior SRE roles are uncommon. When they exist, they typically require 1–3 years of prior software engineering or cloud experience. Expectations: can write production code, understands containers and Kubernetes, is learning the reliability engineering framework, and can participate in on-call with guidance.

SRE (3–6 years)

The core of the role. You own reliability for a set of services: define SLIs and SLOs, build and maintain monitoring, respond to incidents, run postmortems, and build automation that reduces operational burden. You participate in production on-call and are the team’s first responder for reliability issues.

Senior SRE (6–10 years)

Senior SREs own reliability strategy across a wider scope. They define how the team measures reliability, establish SLO frameworks for multiple services, build shared tooling that other SREs use, and drive reliability improvements that require cross-team coordination. They mentor junior SREs and run the team’s incident response improvement programme.

Staff SRE / SRE Manager

Staff SREs define the reliability engineering programme for an organisation — how SLOs are set, how the error budget policy works, what reliability standards new services must meet before launch. SRE managers focus on team-level concerns: hiring, growth, workload balance, and organisational representation.

SRE compensation

SRE roles command a premium over equivalent-experience cloud engineering or DevOps roles at most organisations, reflecting the software engineering depth required. At tech companies, SRE compensation often equals or exceeds software engineering compensation at the same level.

The SRE salary guide covers specific UK and US ranges by experience level, and how SRE pay compares to DevOps and cloud engineering at the same seniority.

On-call compensation is a meaningful component at many SRE employers. Some organisations pay explicit on-call premiums; others build it into the base salary. Understand the on-call expectations before accepting an SRE role — the on-call burden varies enormously between organisations and can significantly affect quality of life.