Google Cloud Incident Response with Cloud Monitoring: Runbook, Alerts, SLO Burn Rate, and Postmortems

Incident response is the process of detecting, diagnosing, mitigating, and learning from failures in production. This page covers how to do that on Google Cloud using Cloud Monitoring, from the moment an alert fires to the blameless postmortem.

This guide is for engineers who want a concrete runbook, not just a concept overview. Whether you are setting up incident response for the first time or tightening an existing process, you will find specific steps, real tool names, and worked examples here.

Simple explanation

Incident response is what you do when something breaks in production and users are affected. It follows four phases:

  1. Detect: an alerting policy fires, Cloud Monitoring creates an incident, and the on-call engineer gets notified.
  2. Triage: the engineer quickly assesses severity. How many users are affected? Which service? Does it need immediate action or can it wait?
  3. Mitigate: restore service first. Roll back a deployment, reroute traffic, or scale up. Diagnose the root cause after the bleeding stops.
  4. Postmortem: once stable, document what happened, why, and what will be done to prevent a repeat.

Cloud Monitoring is the central tool for each phase. It creates the incident, hosts the dashboards, links you to logs and traces, and tracks notification history, all in one place.

Analogy

Think of incident response like emergency medicine. A paramedic does not diagnose the root cause before stabilizing the patient. First they assess severity (is the patient breathing?). Then they triage (what is most critical right now?). Then they treat (mitigate). Then they write up the case notes (postmortem). Incident response follows the same logic. Working through the phases in order prevents chaos and gets you to resolution faster.

Why this matters in production

A monitoring setup without a response process is just data collection. Structured incident response turns that data into action:

  • Faster detection. Alerts tied to SLO burn rates catch real user impact before customers report it.
  • Lower MTTR. Pre-written runbooks eliminate the “where do I look first?” delay when someone is paged at 2am.
  • Less panic. Knowing the process in advance means engineers respond, not react. Each phase has a clear objective and a known set of tools.
  • Better handoffs. Real-time notes in the incident record give incoming responders full context without a verbal briefing.
  • Fewer repeat failures. Postmortems with concrete action items reduce recurrence. Teams that skip them keep firefighting the same problems.

How incident response works in Google Cloud

Here is the end-to-end flow, from alert to postmortem:

  1. Alerting policy fires. A metric condition in an alerting policy is met for the configured duration, for example error rate above 5% for 2 consecutive minutes.
  2. Incident is created. Cloud Monitoring automatically creates an incident record under Alerting > Incidents. It captures the time the condition was first met, the metric value, and a link to the chart.
  3. Notifications go out. Configured notification channels receive the alert: PagerDuty for high-severity pages, Slack for team visibility, or both. The notification includes the alert documentation, so a linked runbook arrives with the page.
  4. Responder opens dashboards, logs, and traces. The on-call engineer opens the service dashboard, then checks Logs Explorer and Cloud Trace to understand scope and root cause.
  5. Severity is assessed. How many users are affected? Is it a full outage or degraded performance? Does the SLO burn rate call for immediate action?
  6. Mitigation is applied. Rollback, traffic rerouting, configuration change, or scaling. Stabilize first, diagnose second.
  7. Recovery is confirmed. Watch error rate, latency, and success rate charts for at least 10 minutes after mitigation before standing down.
  8. Postmortem and action items. Document what happened, identify root cause, and write actionable follow-ups with owners and deadlines.
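
The condition in step 1 (error rate above 5% for 2 consecutive minutes) can be written down as data. The sketch below builds a Python dict shaped like a Cloud Monitoring AlertPolicy resource, assuming a Cloud Run service and the run.googleapis.com/request_count metric. The service name and the helper function are hypothetical, and the aggregation settings are one reasonable choice rather than a recommendation. Check the field names against the current API reference before applying it.

```python
import json

# Hypothetical service name, for illustration only.
SERVICE_NAME = "api-service"


def error_rate_policy(service_name: str, threshold: float = 0.05, duration_s: int = 120) -> dict:
    """Build an AlertPolicy-shaped dict: 5xx request ratio above `threshold` for `duration_s` seconds."""
    # All requests for the service (used as the denominator of the ratio).
    base = (
        'resource.type = "cloud_run_revision" '
        f'AND resource.labels.service_name = "{service_name}" '
        'AND metric.type = "run.googleapis.com/request_count"'
    )
    # Assumed aggregation: per-minute request rate, summed across series.
    rate = {
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_RATE",
        "crossSeriesReducer": "REDUCE_SUM",
    }
    return {
        "displayName": f"{service_name}: 5xx ratio above {threshold:.0%}",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "5xx ratio",
                "conditionThreshold": {
                    # Numerator: only server errors.
                    "filter": base + ' AND metric.labels.response_code_class = "5xx"',
                    "aggregations": [rate],
                    "denominatorFilter": base,
                    "denominatorAggregations": [rate],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    # The condition must hold this long before an incident opens.
                    "duration": f"{duration_s}s",
                },
            }
        ],
    }


policy = error_rate_policy(SERVICE_NAME)
print(json.dumps(policy, indent=2))
```

Keeping the policy in code like this makes the threshold and duration reviewable alongside the service they protect.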

The four-phase incident runbook

Phase 1: Detect

Objective: confirm the alert is real and the incident record is open.

What to check first: open Alerting > Incidents in Cloud Monitoring. Confirm the incident is active. Note the exact time the condition was first met. This is your incident start time.

Tools: Cloud Monitoring Incidents view, notification channel (PagerDuty, Slack, email).

Output: incident start time is recorded. On-call engineer is confirmed as the active responder. If the alert is a false positive (condition already resolved), close the incident and add a note. Otherwise, move to triage.

Phase 2: Triage

Objective: assess severity in under 2 minutes. Do not diagnose root cause yet.

What to check first: open the service dashboard. Look at request rate, error rate, and latency for the last 30 minutes. Check for other open incidents on related services.

Four triage questions:

  • Complete outage or degraded performance?
  • All users or a subset (single region, single service, specific client)?
  • When did it start, and did it correlate with a deployment?
  • What is the SLO burn rate? Does this need immediate action or can it wait?

Tools: service dashboard, Metrics Explorer, alerting policy context.

Output: a severity level (SEV-1 through SEV-4, as defined under Severity levels and notification routing) is assigned. The number of responders is scaled to match. Real-time notes are started in the incident record.

Tip

Answer the four triage questions before doing anything else. Triage is a time-boxed decision, not an investigation. Give yourself 2 minutes, make the severity call, then act.

Phase 3: Mitigate

Objective: restore service. Root cause analysis can wait.

What to check first: if the incident started after a deployment, roll back before investigating. Recovery time matters more than understanding why. Use rollbacks in Cloud Deploy for managed rollouts, or the Cloud Run console for direct revision traffic splits.

Tools used during mitigation:

  • Metrics: use dashboards to confirm error rate is dropping after each mitigation step.
  • Logs: use Logs Explorer to find specific error messages. Filter to the affected service, the incident time window, and severity>=ERROR. Use structured log fields for request IDs or user IDs to narrow scope.
  • Traces: use Cloud Trace to find which service or dependency in the call chain is slow or failing.

Warning

Do not start diagnosing root cause while users are actively impacted. Diagnosing before stabilizing extends the outage. Mitigate first, even if it means reverting work you are proud of. The investigation happens after the error rate drops.

Write one sentence in the incident notes field per action, in real time. Example: “14:32: rolled back api-service to revision 00041. Error rate dropped from 8% to 0.2%.” This is the raw material for the postmortem.

Output: service is restored. Metrics confirm recovery. Cause is understood well enough to prevent accidental re-introduction.

Phase 4: Postmortem

Objective: learn from the incident and reduce recurrence.

What to check first: after 10 minutes of stable metrics, confirm recovery is real. Then start the postmortem while the timeline is still fresh.

A postmortem should include:

  • Plain-language summary (what happened, not who is to blame)
  • Incident timeline from first detection to recovery
  • Root cause and contributing factors
  • Review of the alerting: did it fire at the right time? Were the dashboards useful?
  • Action items with assigned owners and deadlines, each closing a specific gap

Tip

A five-minute postmortem is better than no postmortem. You do not need a ten-page document for a minor incident. Write the timeline, one root cause sentence, and two action items. That alone creates a searchable record that prevents the same mistake six months from now.

Output: postmortem document is shared with the team. Action items are tracked to completion.
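
The postmortem contents listed above can be captured in a small structure that also flags what is still missing. This is a minimal sketch, not a prescribed template: the class names, field names, and example data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    owner: str  # a named person or team
    due: str    # deadline as an ISO date, e.g. "2026-04-15"


@dataclass
class Postmortem:
    summary: str
    timeline: list[tuple[str, str]]  # (timestamp, what happened)
    root_cause: str
    alerting_review: str
    action_items: list[ActionItem] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Return the names of sections that are still empty or incomplete."""
        problems = []
        for name in ("summary", "timeline", "root_cause", "alerting_review"):
            if not getattr(self, name):
                problems.append(name)
        if not self.action_items:
            problems.append("action_items")
        for item in self.action_items:
            if not item.owner or not item.due:
                problems.append(f"action item missing owner or deadline: {item.description}")
        return problems


# Hypothetical example: root cause not yet written, one action item without an owner.
pm = Postmortem(
    summary="api-service returned elevated errors after a deployment.",
    timeline=[("14:32", "Rolled back api-service to revision 00041.")],
    root_cause="",
    alerting_review="Alert fired on time. Dashboard was sufficient.",
    action_items=[ActionItem("Add a pre-deploy check", owner="", due="2026-04-15")],
)
print(pm.gaps())
```

A check like gaps() is only a nudge. The point is that an action item without an owner and a deadline is visible before the document is shared.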

How to use Cloud Monitoring during an incident

Cloud Monitoring gives you multiple views into the same problem. Here is how to use each one:

Alerting / Incidents view

Go to Alerting > Incidents. This is your command center. It shows all open incidents, which policy triggered each one, the time of first condition, and the notification history. Use the notes field to document actions in real time.

Dashboards

Pre-built service dashboards are the fastest way to confirm scope. A good incident dashboard shows error rate, request volume, latency percentiles (p50/p95/p99), and system resource usage in a single view. If you do not have one, build it before the next incident.

Logs Explorer

Logs Explorer lets you filter by service, severity, and time range to find the specific errors behind the metric spike. Start with severity>=ERROR and the affected service name. Use structured log fields to drill into individual requests. If you have log-based metrics configured, they will also appear on your dashboards.
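
A pre-written filter saves time during an incident. The sketch below assembles a Logs Explorer query for a Cloud Run service, restricted to the incident time window and severity>=ERROR. The function name, service name, and timestamps are hypothetical, and the resource type assumes Cloud Run. Adjust the resource fields for other platforms.

```python
from datetime import datetime, timezone


def cloud_run_error_filter(service_name: str, start: datetime, end: datetime) -> str:
    """Build a Logs Explorer filter for errors from one Cloud Run service in a time window."""
    clauses = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service_name}"',
        "severity>=ERROR",
        # Timestamps are written in RFC 3339 form; use timezone-aware datetimes.
        f'timestamp>="{start.isoformat()}"',
        f'timestamp<="{end.isoformat()}"',
    ]
    return " AND ".join(clauses)


# Hypothetical incident window, in UTC.
window_start = datetime(2026, 3, 25, 14, 20, tzinfo=timezone.utc)
window_end = datetime(2026, 3, 25, 14, 50, tzinfo=timezone.utc)
query = cloud_run_error_filter("my-service", window_start, window_end)
print(query)
```

The printed string can be pasted into the Logs Explorer query box or stored in the runbook for the alert.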

Cloud Trace

Cloud Trace shows which service in a distributed call chain is responsible for latency or errors. Filter to recent traces with high latency or error status. Look for the first span in the chain that shows elevated duration or a failure code.

Notification context and escalation

When an alert fires, the notification includes the alerting policy name, the metric value that triggered it, and (if configured) the documentation field content. This is where runbook links land. Every high-severity alert should have documentation configured so the on-call engineer has context the moment the page arrives.

Tip

Work through the tools in order: Incidents view first to confirm scope and timeline, then dashboard to see which metrics spiked, then logs to find the error messages, then traces to find where in the call chain the failure lives. Jumping straight to logs without checking the dashboard first often leads you to the wrong service.

Example incident walkthrough: Cloud Run latency spike

Here is a realistic scenario showing the full process in action.

What triggered the alert

At 14:28, an SLO burn rate alert fires for the payments-api Cloud Run service. Burn rate has crossed 14.4 over the last hour on a 30-day, 99.9% availability SLO. A PagerDuty page goes out. The notification includes a link to the payments-api dashboard and a runbook URL from the alert documentation field.

What the dashboard showed

The Cloud Run service dashboard shows request latency jumping from a p99 of 180ms to over 4,000ms starting at 14:24, four minutes before the alert fired. Error rate went from 0.1% to 6.3%. A deployment completed at 14:21.

What logs and traces revealed

In Logs Explorer, filtering to resource.type="cloud_run_revision" severity>=ERROR shows a flood of “connection pool exhausted” errors starting at 14:22. Cloud Trace shows most slow requests are stalled in the database span, with connection wait times of 3,000 to 4,000ms. The new revision has a higher concurrency setting that is saturating the database connection pool.

Mitigation chosen

At 14:33, the on-call engineer uses Cloud Deploy to roll back to the previous revision. The Cloud Run console confirms traffic is fully shifted within 90 seconds. Notes added to the incident: “14:33: rollback to revision 00041 initiated. 14:35: traffic fully on 00041. Error rate dropping.”

Recovery confirmed

By 14:38, error rate is back below 0.1% and p99 latency is back to 180ms. The team watches the dashboard for 10 more minutes. At 14:48, the incident is closed. Total impact: roughly 14 minutes of degraded service (14:24 to 14:38), with the incident open for about 20 minutes (14:28 to 14:48).

Note

The 4-minute gap between the degradation starting (14:24) and the alert firing (14:28) is normal. Alerting policies require a condition to hold for a configured duration before creating an incident, which filters out brief transient spikes. If this latency in detection is too long for your SLO, reduce the alert duration or switch to a shorter lookback window.
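
The response timings in this walkthrough are simple timestamp arithmetic, and computing them the same way for every incident makes postmortems comparable. A minimal sketch using the times from the scenario above follows. The helper name is hypothetical, and it assumes all timestamps fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the walkthrough.
degradation_start = "14:24"
alert_fired = "14:28"
rollback_started = "14:33"
metrics_recovered = "14:38"

time_to_detect = minutes_between(degradation_start, alert_fired)
time_to_mitigate = minutes_between(alert_fired, rollback_started)
alert_to_recovery = minutes_between(alert_fired, metrics_recovered)

print(f"detect: {time_to_detect} min")          # detect: 4 min
print(f"mitigate: {time_to_mitigate} min")      # mitigate: 5 min
print(f"alert to recovery: {alert_to_recovery} min")  # alert to recovery: 10 min
```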

Postmortem highlights

The postmortem identified three action items: (1) add a connection pool limit check to the deployment pipeline, (2) add a load test for the new concurrency setting in staging, (3) lower the alert duration threshold on this service from 5 minutes to 2 minutes so future regressions are caught sooner.

When to use this runbook

This four-phase runbook fits most common production incidents:

  • Latency spikes on a Cloud Run or GKE service
  • Error rate increases after a deployment
  • SLO burn rate alerts indicating real user impact
  • Regional degradation affecting a subset of users
  • Recurring noisy alerts that need better triage and clearer ownership

Warning

This runbook is not a substitute for a disaster recovery plan. Major regional failures, security breaches, and data loss events require separate, dedicated processes. For those scenarios, see Disaster Recovery Strategies. Using an incident runbook in place of a DR plan during a true disaster will slow you down.

Other situations where you need more than this page:

  • Security incidents: intrusions, data exfiltration, and compromised credentials require a security-specific response with different stakeholders and legal considerations.
  • Service-specific failures: incidents involving GKE node pool failures, database corruption, or network partitions need service-specific runbooks alongside this general framework.

Threshold alerts vs SLO burn-rate alerts

Both types appear in Cloud Monitoring. They answer different questions and should route differently.

Threshold alerts

A threshold alert fires when a metric crosses a fixed value, for example CPU usage above 80% for 5 minutes. Threshold alerts are simple to configure and useful for resource limits. The limitation is that a threshold alone does not tell you whether the current level is actually threatening your reliability target. A 2% error rate might be acceptable for one service and critical for another.

Analogy

A threshold alert is like a speedometer warning that fires when you hit 80 mph. It tells you something about your current state, but not how long you can sustain it. An SLO burn-rate alert is more like a fuel gauge. It tells you how fast you are burning through your budget and when you will run out if you keep going at this rate. That is the question that actually matters for reliability.

SLO burn-rate alerts

An SLO burn-rate alert fires when your error budget is being consumed too quickly relative to your SLO window. A burn rate of 1.0 means you are consuming budget at exactly the rate that exhausts it by the end of the window. A burn rate above 1.0 means you are consuming it faster than planned.
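
The definition above reduces to two small formulas: burn rate is the observed error rate divided by the error budget (1 minus the SLO target), and time to exhaustion is the SLO window divided by the burn rate. The sketch below works through both. The function names are illustrative, and it assumes the burn rate stays constant for the whole window.

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error budget implied by the SLO target."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent if `rate` is sustained from a full budget."""
    if rate <= 0:
        return math.inf
    return window_days * 24 / rate


# A 99.9% SLO leaves a 0.1% error budget.
print(f"{burn_rate(0.001, 0.999):.1f}")   # 0.1% errors: burn rate 1.0
print(f"{burn_rate(0.0144, 0.999):.1f}")  # 1.44% errors: burn rate 14.4
print(f"{hours_to_exhaustion(14.4):.0f} hours")     # 50 hours
print(f"{hours_to_exhaustion(6) / 24:.0f} days")    # 5 days
```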

For a 30-day SLO window, here is what different burn rates mean:

  • Burn rate 14.4 over a 1-hour lookback: your error budget would be exhausted in roughly 50 hours at this rate. Page immediately.
  • Burn rate 6 over a 6-hour lookback: budget exhausted in roughly 5 days. Notify the team. Handle during business hours unless it worsens.
  • Burn rate 1 or below: consuming budget at or below the expected rate. Monitor, but no immediate action needed.

Note

Use longer lookback windows (1 hour, 6 hours) rather than very short ones (5 minutes) when alerting on burn rate. A 5-minute window can fire on a transient spike that never actually threatened the SLO. A 1-hour window means the condition has been sustained long enough to represent a real problem. See Creating Alerts for setup steps.

Routing recommendation

  • Fast burn (burn rate > 14.4, 1-hour window): page immediately via PagerDuty or equivalent.
  • Slow burn (burn rate > 6, 6-hour window): route to Slack or a ticket. Handle during business hours unless it worsens.
  • Threshold alerts: use for infrastructure limits (CPU, memory, disk). Route to Slack or a ticket unless the condition is business-critical.
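
The fast-burn and slow-burn routing rules above can be expressed as one small decision function. This is a sketch only: the function name and return labels are hypothetical, and in practice the routing lives in separate alerting policies with different notification channels rather than in application code.

```python
FAST_BURN_THRESHOLD = 14.4  # evaluated over a 1-hour lookback
SLOW_BURN_THRESHOLD = 6.0   # evaluated over a 6-hour lookback


def route_burn_alert(burn_1h: float, burn_6h: float) -> str:
    """Pick a route from the burn rates measured over the 1-hour and 6-hour windows."""
    if burn_1h > FAST_BURN_THRESHOLD:
        return "page"             # PagerDuty or equivalent, immediately
    if burn_6h > SLOW_BURN_THRESHOLD:
        return "slack_or_ticket"  # handle during business hours unless it worsens
    return "monitor"              # no immediate action


print(route_burn_alert(burn_1h=20.0, burn_6h=4.0))  # page
print(route_burn_alert(burn_1h=8.0, burn_6h=7.0))   # slack_or_ticket
print(route_burn_alert(burn_1h=1.0, burn_6h=0.9))   # monitor
```

Checking the fast-burn rule first means an alert that satisfies both conditions always pages.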

Severity levels and notification routing

Define severity before incidents happen. When everyone uses the same definitions, triage is faster and escalations are clearer.

Severity levels

Severity | Typical impact | Expected response | Notification channel
SEV-1 Critical | Complete outage or data loss. All users affected. | Immediate page. Respond within 15 min. Escalate if unacknowledged. | PagerDuty
SEV-2 High | Major feature broken or significant degradation affecting many users. | Page during business hours. Page on-call at night if burn is fast. Respond within 30 min. | PagerDuty + Slack
SEV-3 Medium | Partial degradation on a non-critical path. Subset of users affected. | Notify team. Handle same day. | Slack #incidents
SEV-4 Low | Minor issue or slow burn with plenty of error budget remaining. | Create a ticket. Handle in the next sprint. | Ticketing system

Tip

Keep your severity definitions short enough to fit on a sticky note. If engineers need to read a paragraph to decide between SEV-2 and SEV-3, the definitions are too complicated. The goal is a call made in 30 seconds under pressure.

Notification channels by severity

  • PagerDuty (or equivalent): SEV-1 and fast-burn SEV-2 only. Reserve for alerts that genuinely need a human awake at 3am.
  • Slack #incidents: all severity levels. Gives the whole team visibility. Use for lower-severity alerts so engineers can monitor without being paged.
  • Pub/Sub into a ticketing system: SEV-3 and SEV-4. Automatically create a ticket and assign to the service team for next-business-day handling.

Danger

Alert fatigue is a real incident risk. When every alert pages the same person at the same urgency, engineers start ignoring pages. Reserve paging for SEV-1 and fast-burn SEV-2 only. If you are unsure whether an alert warrants a page, it probably does not. Route it to Slack first and upgrade the severity if it worsens.
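
The channel rules in this section fit in a few lines of code, which is a useful test of whether they are simple enough. The sketch below follows the "Notification channels by severity" list: Slack for every level, paging only for SEV-1 and fast-burn SEV-2, and a ticket for SEV-3 and SEV-4. The function name and channel labels are hypothetical.

```python
def channels_for(severity: int, fast_burn: bool = False) -> list[str]:
    """Return the notification channels for a severity level (1 = SEV-1 ... 4 = SEV-4)."""
    if severity not in (1, 2, 3, 4):
        raise ValueError(f"unknown severity: {severity}")
    channels = ["slack"]  # every severity is visible to the team
    if severity == 1 or (severity == 2 and fast_burn):
        channels.append("pagerduty")  # reserved for alerts that need a human now
    if severity >= 3:
        channels.append("ticket")  # next-business-day handling
    return channels


print(channels_for(1))                  # ['slack', 'pagerduty']
print(channels_for(2))                  # ['slack']
print(channels_for(2, fast_burn=True))  # ['slack', 'pagerduty']
print(channels_for(4))                  # ['slack', 'ticket']
```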

Runbooks linked from alerts

Every Cloud Monitoring alerting policy has a documentation field. Use it. When the alert fires, the documentation is included in the notification, so the on-call engineer has the runbook link before they even open a browser tab.

A good runbook entry for a single alert should include:

  • What this alert means: which service, what threshold, which SLO it protects, and why this threshold matters.
  • Dashboard link: direct URL to the service dashboard with relevant charts already visible.
  • Log query: a pre-written Logs Explorer query for the most likely error messages. Copy-paste ready.
  • Trace starting point: how to filter Cloud Trace for this service and failure mode.
  • Common causes: ranked by frequency. “Usually this is X; occasionally Y; rarely Z.”
  • First mitigation steps: ordered. “1. Check for a recent deployment. 2. Roll back if yes. 3. Check database connection count if no.”
  • Service owner and escalation path: who to contact if standard steps do not resolve it.

A short list of bullet points readable in 60 seconds under pressure is more useful than a ten-page document. Keep runbooks short and specific per alert.
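
A runbook entry like the one described above can be generated from a few fields and placed in the policy's documentation field. In the sketch below, the content and mimeType keys mirror the documentation object of an alerting policy. The helper name, URL, owner, and example text are hypothetical.

```python
def runbook_documentation(
    meaning: str,
    dashboard_url: str,
    log_query: str,
    first_steps: list[str],
    owner: str,
) -> dict:
    """Build the documentation object for an alerting policy from runbook fields."""
    lines = [
        f"What this alert means: {meaning}",
        f"Dashboard: {dashboard_url}",
        f"Log query: {log_query}",
        "First mitigation steps:",
    ]
    # Number the steps so the responder can follow them in order.
    lines += [f"{i}. {step}" for i, step in enumerate(first_steps, start=1)]
    lines.append(f"Owner and escalation: {owner}")
    return {"content": "\n".join(lines), "mimeType": "text/markdown"}


# Hypothetical values for illustration.
doc = runbook_documentation(
    meaning="api-service 5xx ratio is above 5%, which threatens the availability SLO.",
    dashboard_url="https://example.com/dashboards/api-service",
    log_query='resource.type="cloud_run_revision" severity>=ERROR',
    first_steps=[
        "Check for a recent deployment.",
        "Roll back if yes.",
        "Check database connection count if no.",
    ],
    owner="api-team on-call",
)
print(doc["content"])
```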

Analogy

A runbook linked to an alert is like a fire escape plan posted next to the alarm. You do not want to be reading it for the first time while the building is on fire. The value is that it was written calmly, in advance, by someone who already knew all the exits. That is exactly what a well-written runbook does for an on-call engineer paged at 2am.

Common beginner mistakes

  1. No severity definition. When every alert is treated as equally urgent, engineers either ignore all of them or panic over minor ones. Define SEV-1 through SEV-4 before the first incident. Keep the definitions simple enough that anyone can apply them in 30 seconds.
  2. No clear owner. When an alert fires and no one is sure who responds, engineers either all jump in at once or all wait for someone else to act. Every alerting policy should have a named service owner and on-call rotation in the alert documentation field.
  3. Routing all alerts to the same channel. If every alert pages the same person at the same urgency, alert fatigue sets in fast. Segment by severity. Reserve paging for events that genuinely need immediate human response.
  4. Diagnosing before stabilizing. Root cause analysis during active impact extends the outage. Mitigate first: rollback, traffic reroute, failover. Diagnosis happens after the error rate drops.
  5. Not documenting actions in real time. It feels faster to just fix things. But after recovery, you have lost the timeline. Write one sentence in the incident notes field per action, with a timestamp. It saves hours in the postmortem.
  6. Closing the incident too early. An alert condition can temporarily stop triggering while the underlying problem continues. After mitigation, watch metrics for at least 10 minutes before closing. Recovery is confirmed by stable metrics, not a quiet alert.
  7. Skipping postmortems for small incidents. Small incidents often have systemic causes. A five-minute postmortem after every incident builds a pattern library that prevents future failures. Teams that skip them keep fighting the same fires.

Frequently asked questions

What is the difference between an alert and an incident in Google Cloud?

An alert is the notification you receive. An incident is the record behind it. When the condition in an alerting policy holds for the configured duration, Cloud Monitoring automatically opens an incident and sends notifications to the policy's channels. The notification tells you something might be wrong. The incident tracks the lifecycle of the response: when it started, what was affected, what actions were taken, and when recovery was confirmed.

How do I link a runbook to an alerting policy?

Every Cloud Monitoring alerting policy has a documentation field. Add a runbook URL or markdown instructions there. When the alert fires, the notification message includes this documentation, so the on-call engineer has the runbook link immediately without searching for it.

When should I use fast-burn vs slow-burn SLO alerts?

Fast-burn alerts (burn rate greater than 14.4 over the last hour) catch outages consuming your 30-day error budget in under 50 hours and should trigger an immediate page. Slow-burn alerts (burn rate greater than 6 over the last 6 hours) catch sustained degradation on track to exhaust the budget in under 5 days. Route these to Slack or a ticket rather than paging. Use longer lookback windows to reduce false positives from transient spikes.

How do I know an incident is really resolved?

An incident is resolved when the alerting condition is no longer true and the metrics have stayed stable for at least 10 minutes. Watch error rate, latency, and request success rate in Cloud Monitoring. If the incident auto-closes while metrics are still elevated, keep treating it as active and continue the response. Recovery is confirmed by the metrics, not by the alert going quiet.
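
The 10-minute stability rule can be checked mechanically. The sketch below takes a list of (minute, error rate) samples and confirms recovery only if the samples cover the full stability window and every sample inside it is at or below the threshold. The function name, sample format, and data are hypothetical. In practice the samples would come from your metrics backend.

```python
def recovery_confirmed(
    samples: list[tuple[int, float]],
    threshold: float,
    stable_minutes: int = 10,
) -> bool:
    """True if the error rate stayed at or below `threshold` for the last `stable_minutes`.

    `samples` is a time-ordered list of (minute, error_rate) pairs.
    """
    if not samples:
        return False
    latest_minute = samples[-1][0]
    window_start = latest_minute - stable_minutes
    # Not enough history to cover the whole stability window.
    if samples[0][0] > window_start:
        return False
    return all(rate <= threshold for minute, rate in samples if minute >= window_start)


# Hypothetical data: elevated errors for 5 minutes, then healthy for 11 samples.
recovered = [(m, 0.06) for m in range(0, 5)] + [(m, 0.001) for m in range(5, 16)]
# Hypothetical data: still elevated 7 minutes before the latest sample.
too_early = [(m, 0.06) for m in range(0, 9)] + [(m, 0.001) for m in range(9, 16)]

print(recovery_confirmed(recovered, threshold=0.002))  # True
print(recovery_confirmed(too_early, threshold=0.002))  # False
```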

What should a postmortem include?

A blameless postmortem should include: a plain-language summary of what happened, a timeline from first detection to full recovery, the root cause, a review of whether the alert fired at the right time, and specific action items with owners and deadlines. Each action item should address a concrete gap in alerting, deployment safety, runbook quality, or infrastructure. A postmortem without action items is just a report.

Last verified: 25 March 2026.

Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.