On-Call Life in Cloud Engineering: What the Rotation Actually Looks Like
On-call is part of most cloud engineering roles above entry level, and it is one of the most poorly explained aspects of the job. Some on-call rotations are manageable. Some are brutal. Knowing the difference before you accept an offer matters more than people realise.
What on-call actually means
Being on-call means you are the first person to be paged if a production alert fires during your rotation. You are expected to respond quickly — typically within 5 to 15 minutes — and to either resolve the issue or escalate it to someone who can.
The rotation defines who is on-call at any given time. A team of six engineers might rotate so each person is primary on-call for one week in every six. Some teams use a follow-the-sun model where the on-call shifts between geographic regions to avoid middle-of-the-night pages.
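As a rough illustration of the one-week-in-six rotation described above, the sketch below assigns a primary to each week by cycling through a list of engineers. The names, the start date, and the `primary_for` helper are all hypothetical; in practice most teams configure this in a paging tool rather than computing it by hand.

```python
from datetime import date, timedelta

# Hypothetical six-person team; the names are placeholders.
ENGINEERS = ["asha", "ben", "chen", "dana", "eli", "farah"]

# Assumed start of the first rotation week (a Monday).
ROTATION_START = date(2024, 1, 1)


def primary_for(day: date) -> str:
    """Return who is primary on-call on the given day, rotating weekly."""
    if day < ROTATION_START:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


if __name__ == "__main__":
    # Print the primary for the first eight weeks of the rotation.
    for week in range(8):
        monday = ROTATION_START + timedelta(weeks=week)
        print(monday.isoformat(), primary_for(monday))
```

With six people in the list, each name comes up once every six weeks. A follow-the-sun setup would need one such list per region plus a time-of-day split, which this sketch does not attempt.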
Being on-call does not automatically mean working extra hours. It means you are reachable and must respond to pages. In a quiet week, you do your normal work. In a rough week, you spend extra hours on incidents.
Quiet weeks and rough weeks
The range is wide, and understanding it helps set expectations.
A quiet on-call week
Monday through Sunday, three alerts total. One was a false positive from a threshold set too low — you acknowledge it, confirm there is no real problem, and file a ticket to fix the alert. One was a staging environment issue with no user impact. One was a real but minor production issue that you resolved in twenty minutes at 7 PM on Wednesday.
Total extra time outside business hours: forty minutes. Total disruption to your personal life: minimal. Many on-call weeks look like this in well-maintained systems.
A rough on-call week
You are paged at 1 AM on Tuesday. A database failover did not complete cleanly, and the application cannot write. You spend ninety minutes recovering it, write a quick incident summary, and go back to sleep. You are paged again at 6 AM — a different issue in a different service. By Thursday you have handled five incidents. Friday morning you are tired and have not been able to focus properly on your regular work all week.
Rough weeks like this are abnormal in healthy systems, but they happen. If they happen consistently, that is a signal about the quality of the infrastructure, the alerting, or the team’s investment in reliability work.
The on-call maturity spectrum
The experience of being on-call varies enormously based on how mature the team’s approach to reliability is.
| Team maturity | What on-call looks like |
|---|---|
| Early / chaotic | Frequent pages, many false positives, no runbooks, escalation is unclear |
| Developing | Some runbooks exist, alerts are mostly meaningful, support is available |
| Mature | Alert volume is low, runbooks cover common cases, incidents are rare, postmortems happen |
| Advanced | SLO-based alerting, burn rates tracked, on-call rarely wakes you up |
The key question when evaluating a role is: which stage is this team at? A mature on-call setup at a large company can be less disruptive than a chaotic one at a startup, even if the startup has fewer services.
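The "Advanced" row in the table mentions SLO-based alerting and burn rates. As a minimal sketch, assuming the common definition of burn rate as the observed error rate divided by the error rate the SLO allows, the idea looks roughly like this. The function names, the 99.9% target, and the paging threshold of 10 are illustrative assumptions, not values from any particular team or tool.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO permits; higher values mean it will run out early.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_page(failed: int, total: int, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Page only when the budget is burning much faster than allowed.

    The default threshold is an assumed, illustrative value.
    """
    return burn_rate(failed, total, slo_target) > threshold


if __name__ == "__main__":
    # 200 failures in 10,000 requests against an assumed 99.9% SLO.
    print(should_page(failed=200, total=10_000, slo_target=0.999))
```

The point of alerting this way is that a brief blip that barely touches the error budget does not wake anyone up, while a sustained failure does.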
On-call compensation
Whether and how on-call is compensated varies by company and country. Common models:
- No extra compensation: On-call is considered part of the job and is factored into base salary. Common at larger tech companies with low page volumes.
- On-call allowance: A fixed payment per week on-call (sometimes £200–£500 per week in UK companies, though amounts vary widely).
- Time off in lieu: Particularly after rough on-call weeks with significant out-of-hours incidents, some teams offer compensatory time off.
- Incident pay: Payment per incident handled, or per hour worked outside business hours during incidents.
Before accepting any role with an on-call component, ask clearly:
- How frequent is the rotation? (1 in 4, 1 in 6, 1 in 8?)
- What was the average page volume last quarter?
- What is the escalation path if I cannot resolve something?
- How is on-call compensated?
- What happens when someone is on holiday during their on-call week?
A team that cannot answer these questions clearly — or that gives vague answers — is giving you information about how seriously they take on-call as a shared responsibility.
Your first on-call week
Most teams have new engineers shadow before going solo. You are on-call alongside an experienced engineer for one or two rotations before you are primary. If your team does not offer this, ask for it: shadowing is a reasonable expectation.
When you are primary for the first time:
- Read the runbooks before your week starts, not during an incident
- Know the escalation path — who to call if you are stuck or the incident is above your capability
- Keep your phone charged and notifications on during the week
- When a page fires, acknowledge it promptly, then work methodically. Speed comes from clarity, not panic.
- Post updates in the incident channel even when you do not have answers yet — “still investigating, update in 10 minutes” is better than silence
The first few incidents feel high-stakes. After handling ten or twenty, the pattern becomes familiar enough that most pages feel more like puzzles than crises. How incidents unfold is covered in more detail on the “production incidents explained” page.
Making on-call sustainable
Alert fatigue is a real problem. When engineers are paged too frequently — especially for things they cannot fix or that are not real problems — it erodes trust in the system and leads to burnout. There are practical ways to address this:
- Reduce false positives: Review every alert that fired during a rotation. If it was not actionable, adjust the threshold or remove the alert. An alert that only produces noise is worse than no alert at all.
- Write runbooks: Every repeated incident pattern should have a runbook. Time spent writing a runbook saves time and cognitive load during every future incident.
- Fix the root cause, not just the symptom: If the same alert fires every month, that is a project, not a pattern to accept.
- Protect post-incident recovery: A rough on-call week should be followed by lighter workload, not the same sprint commitments. Teams that do not acknowledge this create cumulative exhaustion.
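The alert review described in the list above can be sketched as a small script run at the end of a rotation. It assumes each page was logged with the alert's name and a yes/no judgement on whether it was actionable. The `Page` record, the `noisy_alerts` helper, the alert names, and both thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One page received during a rotation (hypothetical record shape)."""
    alert_name: str
    actionable: bool


def noisy_alerts(pages: list[Page],
                 max_non_actionable_ratio: float = 0.5,
                 min_pages: int = 2) -> list[str]:
    """Return alert names that fired repeatedly and were mostly not actionable."""
    fired = Counter(p.alert_name for p in pages)
    not_actionable = Counter(p.alert_name for p in pages if not p.actionable)
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_pages
        and not_actionable[name] / count > max_non_actionable_ratio
    )


if __name__ == "__main__":
    # Made-up example data for one rotation.
    week = [
        Page("disk-usage-high", actionable=False),
        Page("disk-usage-high", actionable=False),
        Page("disk-usage-high", actionable=True),
        Page("db-failover-incomplete", actionable=True),
    ]
    print(noisy_alerts(week))
```

Anything the script returns is a candidate for a threshold change or removal; an alert that keeps firing for real reasons is a candidate for a root-cause fix instead.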
The relationship between on-call and cloud engineer burnout is direct. Managing it well is a team responsibility, not an individual one.
Summary
- On-call means being reachable and responsive to production alerts during your rotation, not working extra hours by default
- The experience ranges from barely noticeable to genuinely disruptive depending on system maturity and team culture
- Compensation models vary widely — ask specific questions before accepting an offer
- Shadow first, know your escalation path, and prioritise clear communication over quick fixes
- Sustainable on-call requires actively reducing alert noise and fixing recurring root causes