Production Incidents Explained: What Actually Happens When Things Break

Production incidents are one of the most misunderstood parts of cloud engineering. They sound catastrophic from the outside, but they follow a pattern. Understanding that pattern — before you are in the middle of one — makes you more effective and less anxious when they happen.

What counts as a production incident

An incident is any unplanned disruption that affects users or internal services. The severity varies enormously:

  • P1 (Critical): The product is down for all or most users. Revenue is directly impacted. Every minute counts.
  • P2 (High): A significant feature is broken or performance is severely degraded. Many users are affected, but the product as a whole is still usable.
  • P3 (Medium): A subset of users is affected, or a non-critical feature is broken. The team investigates during business hours.
  • P4 (Low): Minor issues, cosmetic problems, or single-user edge cases. Fixed as normal tickets.

Different companies use different names (Sev 1, Sev 2, or just Priority levels), but the structure is similar. Most cloud engineers deal with P3s regularly and P2s occasionally. P1s are rare but intense.
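As a rough illustration of how these levels might be applied, here is a minimal sketch of a severity classifier. The field names and the numeric cutoffs (50%, 25%, 1%) are assumptions made for this example, not an industry standard; real teams define their own criteria.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    fraction_of_users_affected: float  # 0.0 to 1.0
    significant_feature_broken: bool   # e.g. checkout or login
    revenue_impacted: bool


def classify(impact: Impact) -> str:
    """Map an impact estimate to a priority level (hypothetical cutoffs)."""
    if impact.fraction_of_users_affected >= 0.5 and impact.revenue_impacted:
        return "P1"
    if impact.significant_feature_broken or impact.fraction_of_users_affected >= 0.25:
        return "P2"
    if impact.fraction_of_users_affected >= 0.01:
        return "P3"
    return "P4"


# Checkout failing for 18% of users: a significant feature is broken.
print(classify(Impact(0.18, True, True)))  # P2
```

In practice the initial classification is a judgment call made with incomplete information, and it is normal to raise or lower the level as the picture becomes clearer.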

How incidents start

Incidents begin in one of three ways:

  1. A monitoring alert fires. A threshold was crossed, such as error rate above 5%, p99 latency above 2 seconds, or a failing health check, and PagerDuty (or Opsgenie, or whatever tool the team uses) calls or pages the on-call engineer.
  2. A user or customer reports it. Support gets a ticket, someone posts in Slack, or a customer emails sales. Often you find out about problems this way before monitoring catches them, which tells you something about monitoring coverage.
  3. Someone notices it. An engineer sees something wrong in a dashboard during normal work, or a deployment goes out and they watch the error rate climb.

The first is the ideal case. The second means your alerting has a gap. The third is how most incidents were caught before monitoring cultures matured.
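The threshold idea behind the first case can be sketched in a few lines. This is a simplified model, not how any particular monitoring product is configured: the rule names and metric keys are invented for the example, and the thresholds are the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float  # the alert fires when the metric is strictly above this


# Hypothetical rules mirroring the examples in the text.
RULES = [
    AlertRule("high-error-rate", "error_rate", 0.05),
    AlertRule("slow-p99-latency", "latency_p99_seconds", 2.0),
    AlertRule("health-check-failing", "failed_health_checks", 0),
]


def firing_alerts(metrics: dict, rules=RULES) -> list:
    """Return the names of all rules whose threshold is exceeded."""
    return [r.name for r in rules if metrics.get(r.metric, 0) > r.threshold]


print(firing_alerts({"error_rate": 0.18, "latency_p99_seconds": 1.2}))
# ['high-error-rate']
```

Real alerting systems usually also require the condition to hold for some duration before paging, so that a single noisy data point does not wake anyone up.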

A realistic incident timeline

This is a composite of what a P2 incident looks like at a company with a reasonable incident process. The product is a SaaS platform, and checkout is failing for a portion of users.

T+0 — Alert fires

PagerDuty pages the on-call engineer at 2:17 PM on a Tuesday. The alert: “checkout-service error rate 18% (threshold: 5%).” The engineer acknowledges the alert, which stops the escalation timer.
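To make the escalation timer concrete, here is a minimal sketch of the logic. It is a toy model, not PagerDuty's API: the chain, the five-minute interval, and the function name are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical escalation policy.
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "team lead"]
ESCALATION_INTERVAL = timedelta(minutes=5)


def current_responder(
    fired_at: datetime,
    now: datetime,
    acknowledged_at: Optional[datetime] = None,
) -> str:
    """Return who holds the page at `now`.

    Each unacknowledged interval moves the page one step up the chain.
    Acknowledging freezes it with whoever held it at that moment.
    """
    effective = min(now, acknowledged_at) if acknowledged_at is not None else now
    elapsed = max(effective - fired_at, timedelta(0))
    steps = elapsed // ESCALATION_INTERVAL  # whole intervals elapsed
    return ESCALATION_CHAIN[min(steps, len(ESCALATION_CHAIN) - 1)]


fired = datetime(2025, 1, 7, 14, 17)  # arbitrary Tuesday, 2:17 PM
later = fired + timedelta(minutes=12)
print(current_responder(fired, later))                                # team lead
print(current_responder(fired, later, fired + timedelta(minutes=1)))  # primary on-call
```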

T+3 minutes — Initial triage

The engineer opens the monitoring dashboard. The error rate started climbing about six minutes ago. They check the deployment history: a release went out twelve minutes ago, at 2:08 PM. Almost certainly related. They post in the incident Slack channel: “Investigating elevated errors in checkout-service. Possible connection to the 2:08 PM deploy.”
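The correlation step, checking whether anything shipped shortly before the errors began, can be sketched as follows. The 30-minute lookback window, the data shape, and the second service name are assumptions for the example; in a real incident this information comes from the team's deployment tooling.

```python
from datetime import datetime, timedelta


def suspect_deploys(deploys, error_onset, lookback=timedelta(minutes=30)):
    """Return (service, deployed_at) pairs that shipped within `lookback`
    before the errors began, most recent first."""
    window_start = error_onset - lookback
    candidates = [d for d in deploys if window_start <= d[1] <= error_onset]
    return sorted(candidates, key=lambda d: d[1], reverse=True)


# The calendar date is arbitrary; the times follow the timeline above.
deploys = [
    ("search-service", datetime(2025, 1, 7, 11, 30)),    # hypothetical, unrelated
    ("checkout-service", datetime(2025, 1, 7, 14, 8)),   # the 2:08 PM deploy
]
error_onset = datetime(2025, 1, 7, 14, 14)  # six minutes before the T+3 triage

for service, deployed_at in suspect_deploys(deploys, error_onset):
    print(service, deployed_at.strftime("%H:%M"))  # checkout-service 14:08
```

A recent deploy is a strong lead, not proof. The logs still need to confirm the connection.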

T+8 minutes — Investigation

They pull the checkout service logs in Cloud Logging. The errors are all the same: “connection refused” to an internal payment validation service. They check the payment validation service: it is running and its health checks are green. They check network policy: no recent changes. Then they look at the application code change in the deploy. The new checkout release changed the hostname it uses to reach the payment service from payment-validation to payment-svc, but the service still answers only at the old name. The rename was not coordinated with the team that owns the payment validation service.
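Noticing that "the errors are all the same" is usually a matter of grouping log lines by message. A minimal sketch, assuming a simplified log format of "LEVEL message" (the sample lines are invented for illustration and are not real Cloud Logging output):

```python
from collections import Counter


def top_errors(log_lines, n=3):
    """Count ERROR lines by message and return the n most common."""
    counts = Counter()
    for line in log_lines:
        level, _, message = line.partition(" ")
        if level == "ERROR":
            counts[message] += 1
    return counts.most_common(n)


# Invented sample lines in the assumed format.
sample = [
    "INFO order received",
    "ERROR connection refused: payment-svc",
    "ERROR connection refused: payment-svc",
    "ERROR connection refused: payment-svc",
    "ERROR request timed out",
]

print(top_errors(sample))
# [('connection refused: payment-svc', 3), ('request timed out', 1)]
```

When one message dominates the counts, as here, the investigation can focus on a single failure mode instead of chasing unrelated noise.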

T+15 minutes — Mitigation decision

Two options: roll back the checkout service deploy, or roll forward with a hotfix. Rolling back is faster and lower risk. The engineer posts the plan in the incident channel, gets a quick thumb-up from the team lead who has joined the channel, and initiates the rollback.

T+22 minutes — Resolution

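Rollback complete. The error rate drops to zero within ninety seconds of the rollback completing. They post the all-clear in the incident channel and update the status page. Time from alert to resolution: 22 minutes. Total user impact is closer to 25 minutes, because errors had been climbing for about three minutes before the alert fired.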
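The 22 minutes in this timeline are counted from the alert. A small sketch of the arithmetic shows how detection delay adds to the time users were affected. It assumes errors began about three minutes before the alert (six minutes before the T+3 triage); the calendar date is arbitrary.

```python
from datetime import datetime, timedelta

alert_fired = datetime(2025, 1, 7, 14, 17)            # T+0
error_onset = alert_fired - timedelta(minutes=3)      # errors began before the alert
rollback_done = alert_fired + timedelta(minutes=22)   # T+22


def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


time_to_detect = minutes(alert_fired - error_onset)     # onset to alert
time_to_resolve = minutes(rollback_done - alert_fired)  # alert to rollback complete
user_impact = minutes(rollback_done - error_onset)      # onset to rollback complete

print(time_to_detect, time_to_resolve, user_impact)  # 3.0 22.0 25.0
```

Postmortems typically report these durations separately, since a long detection time and a long resolution time call for different fixes.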

T+24 hours — Postmortem

The team writes a postmortem. More on that below.

Tools used during incidents

During an incident you will be moving quickly between:

  • Monitoring dashboards (Grafana, Datadog, Cloud Monitoring) — to understand what is broken and when it started
  • Log search (Cloud Logging, CloudWatch Logs Insights, Kibana) — to find the specific errors that tell you why
  • Deployment history (GitHub, GitLab, ArgoCD) — to correlate the start of the problem with a code or config change
  • Communication (Slack, Teams) — to keep the team informed without people interrupting each other
  • Status pages (Statuspage.io or similar) — to update users if the impact is customer-facing
  • Runbooks — pre-written procedures for common incident types. If the team has a good runbook for this type of failure, use it. If not, write one after the incident.

The blameless postmortem

A postmortem (sometimes called an incident review or a PIR — Post-Incident Review) is a written analysis of what happened, why, and what the team will do to prevent it from happening again.

“Blameless” means the document does not name individuals as the cause of the incident. People make decisions based on the information they had at the time. The goal is to fix the system — the process, the tooling, the communication — not to assign fault to a person. A good incident culture makes it safe to be honest about mistakes, which is the only way to learn from them.

A typical postmortem includes:

  • Summary: What happened, how long it lasted, what the impact was
  • Timeline: A chronological log of events, detections, and actions
  • Root cause: The underlying reason the incident happened — not just “the deploy failed” but why the failure was possible
  • Contributing factors: Things that made the incident worse or longer — missing alerts, unclear runbooks, slow communication
  • Action items: Specific, owned, time-bound tasks to address the root cause and contributing factors. “Improve monitoring” is not an action item. “Add alert for payment-svc connection errors with a 1% threshold by 2026-04-01, owned by Alex” is.
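The difference between the two action items above can be checked mechanically, at least in part. This is a minimal sketch with a hypothetical ActionItem type; the five-word minimum is a crude stand-in for "specific" and is purely an assumption of the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def problems(item: ActionItem) -> list:
    """Return the reasons an action item is not yet specific, owned, and time-bound."""
    issues = []
    if len(item.description.split()) < 5:  # crude heuristic, not a real measure
        issues.append("description too short to be specific")
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    return issues


vague = ActionItem("Improve monitoring")
specific = ActionItem(
    "Add alert for payment-svc connection errors with a 1% threshold",
    owner="Alex",
    due=date(2026, 4, 1),
)

print(problems(vague))     # all three issues
print(problems(specific))  # []
```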

Writing postmortems well is a skill. Junior engineers often produce lists of facts. Senior engineers produce analysis that connects causes to effects and identifies systemic improvements rather than tactical patches.

Incidents as a junior engineer

If you are new to cloud engineering, your first on-call week can feel high-stakes. A few things to know:

  • You will not be alone. Most teams have escalation paths — if you are stuck, you page the next person in the chain.
  • The goal during an incident is to restore service, not to understand every detail of why it happened. That comes afterward, in the postmortem.
  • Rolling back is almost always safer than pushing a rushed fix forward. When in doubt, roll back first.
  • Communicate frequently in the incident channel. “Still investigating, update in 10 minutes” is better than silence.
  • Every incident you handle teaches you something about the system. After your first year of on-call, your mental model of how things fail will be dramatically better than before.

For more on what on-call rotation looks and feels like over time, see the on-call life page, which covers the structure and emotional reality of rotation in more detail.