Documentation Skills for Cloud Engineers

Documentation is one of the most underrated engineering skills. It does not show up in certifications. It rarely gets mentioned in job descriptions. But teams that document well move faster, have fewer repeated incidents, onboard new engineers more smoothly, and have an institutional memory that survives people leaving. And the engineers who write well get noticed.

Types of documentation cloud engineers write

Cloud engineers produce several distinct types of documentation, each with a different audience and purpose. Knowing what you are writing and who it is for is the first step to writing it well.

Runbooks

Step-by-step procedures for handling specific operational situations — responding to an alert, performing a deployment, rotating credentials, scaling a service. Runbooks are written for an on-call engineer who may be unfamiliar with the specific system, working under pressure, possibly at an inconvenient time. They need to be clear, specific, and executable without deep prior knowledge.

Architecture decision records (ADRs)

Documents that capture a significant technical decision — what options were considered, what was chosen, and why. ADRs are written once (when the decision is made) and remain as a historical record. They answer the question future engineers always ask: “why does this work this way?”

Post-mortems

Structured accounts of incidents — what happened, what the impact was, what the root cause was, and what will change as a result. Post-mortems are both a communication tool (sharing what happened with the wider team) and a learning record.

Onboarding documentation

Guides for new engineers joining a team — how to set up their development environment, how to get access to systems, how the deployment process works, who to ask about what. Often neglected, always invaluable.

API and infrastructure documentation

Reference documentation for the infrastructure you build and maintain — how services connect, what the expected behaviour of an API is, what configuration options are available and what they do.

What makes a good runbook

The test for a good runbook: can a competent engineer who has never seen this system before follow it successfully at 2am, under pressure, without asking anyone any questions?

Good runbooks:

Are scoped to a specific situation — one runbook per alert type, per procedure, per failure mode. A 40-page general operations guide is not a runbook.
Start with context — what is this runbook for? What does the alert or situation mean? This helps the operator know they are in the right place before they start following steps.
Use numbered steps, not paragraphs — numbered steps can be followed sequentially and tracked (“I am on step 4”). Prose paragraphs cannot.
Include actual commands — not “check the database logs” but the specific command that retrieves them, for the specific tool in use.
Describe expected outcomes — after each significant step, state what success looks like. “The command should return ‘OK’. If it returns an error, proceed to step 8.”
Include escalation paths — if the runbook does not resolve the issue, who should be contacted? What is the escalation procedure?
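
Some of the qualities above can be checked mechanically before a runbook is merged. The following is a minimal sketch, not a standard tool: the function name `lint_runbook` and the regular expressions are hypothetical heuristics, and they assume runbooks are written in markdown with numbered steps, inline commands in backticks, and an "Escalation" heading. A check like this cannot judge whether a runbook is clear, only whether the expected parts are present.

```python
import re

# Hypothetical heuristics; adjust them to your team's runbook conventions.
STEP_RE = re.compile(r"^\s*\d+\.\s+\S", re.MULTILINE)          # "1. Do something"
COMMAND_RE = re.compile(r"`[^`\n]+`")                          # inline `command`
EXPECTED_RE = re.compile(r"\b(expected|should (see|return))\b", re.IGNORECASE)
ESCALATION_RE = re.compile(r"^#+\s*Escalation\b", re.MULTILINE | re.IGNORECASE)


def lint_runbook(text: str) -> list[str]:
    """Return a list of human-readable problems found in a runbook."""
    problems = []
    if not STEP_RE.search(text):
        problems.append("no numbered steps")
    if not COMMAND_RE.search(text):
        problems.append("no inline commands")
    if not EXPECTED_RE.search(text):
        problems.append("no expected outcomes")
    if not ESCALATION_RE.search(text):
        problems.append("no escalation section")
    return problems
```

An empty list means every heuristic passed. Running this in CI over a runbook directory would catch the most common omissions, such as a missing escalation section, without replacing a human review.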

A simple runbook template

# Alert: High Database CPU

## What this alert means
Database CPU has exceeded 80% for more than 5 minutes. This may indicate
a runaway query, missing index, or traffic spike.

## Initial assessment (5 minutes)
1. Check the RDS Performance Insights dashboard in the AWS console
   URL: https://console.aws.amazon.com/rds/...
   Expected: You should see a breakdown of top SQL queries by CPU contribution.

2. Identify the top consuming query.
   If it is a known periodic job: see the "Scheduled jobs" section below.
   If it is an unexpected query: proceed to step 3.

## If an unexpected query is consuming CPU
3. Get the query text from Performance Insights.
4. Check if a recent deployment could have introduced this query pattern.
   Run: `git log --oneline --since "2 hours ago"` in the app repository.
5. If a deployment correlates: notify the relevant team in #incidents and
   consider rolling back. Proceed to runbook: Emergency Rollback.

## Escalation
If CPU does not reduce within 20 minutes of your interventions, or if you
cannot identify the cause, page the senior DBA on-call via PagerDuty.

## Related runbooks
- Emergency Rollback: [link]
- Database Read Replica Failover: [link]

Architecture decision records (ADRs)

An architecture decision record is a short document that captures a significant technical decision. The goal is to preserve the reasoning — not just what was decided, but why, and what alternatives were rejected.

Without ADRs, institutional knowledge lives only in people’s heads. When those people leave, the knowledge leaves with them. The team is left with infrastructure they do not fully understand and cannot safely change because they do not know why it was built the way it was.

What an ADR covers

# ADR-0042: Use GCP Cloud SQL over self-managed PostgreSQL

Date: 2026-01-15
Status: Accepted
Deciders: Platform team

## Context
We need a relational database for the new billing service. We are evaluating
whether to use GCP Cloud SQL (managed) or run PostgreSQL on Compute Engine (self-managed).

## Decision
Use Cloud SQL (PostgreSQL).

## Rationale
- Managed backups and point-in-time recovery are handled by GCP, reducing operational overhead
- Automatic minor version patching reduces security maintenance burden
- High availability configuration is simpler to set up and test than self-managed replication
- The cost premium over self-managed (~15%) is justified by reduced on-call burden

## Alternatives considered
**Self-managed PostgreSQL on Compute Engine**: rejected because it requires
managing backups, replication, failover, and patching ourselves. Given current
team size, this is not a worthwhile trade-off for a billing system.

## Consequences
- We are tied to Cloud SQL's supported PostgreSQL versions and feature set
- Some advanced PostgreSQL extensions may not be available
- Egress costs apply for queries from outside the same region

ADRs do not need to be long. One to two pages covering the context, the decision, the reasoning, and what was rejected is sufficient. The most important section is the rationale, because that is the information that fades most quickly from people’s memories.
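
One way to keep ADRs consistent is to generate the skeleton from a small script. The sketch below is an illustration only: the `Adr` class and its field names are hypothetical, and the output simply mirrors the section layout of the example above. It refuses to build an ADR with an empty rationale, since that is the section that matters most.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Adr:
    """Hypothetical container for the fields of one ADR."""
    number: int
    title: str
    context: str
    decision: str
    rationale: list[str]
    alternatives: dict[str, str] = field(default_factory=dict)  # name -> why rejected
    consequences: list[str] = field(default_factory=list)
    status: str = "Accepted"
    decided_on: date = field(default_factory=date.today)
    deciders: str = ""

    def __post_init__(self) -> None:
        # An ADR without reasoning records the "what" but loses the "why".
        if not self.rationale:
            raise ValueError("an ADR needs at least one rationale entry")

    def render(self) -> str:
        """Render the ADR as markdown in the layout shown above."""
        lines = [
            f"# ADR-{self.number:04d}: {self.title}",
            "",
            f"Date: {self.decided_on.isoformat()}",
            f"Status: {self.status}",
            f"Deciders: {self.deciders}",
            "",
            "## Context", self.context, "",
            "## Decision", self.decision, "",
            "## Rationale",
            *[f"- {reason}" for reason in self.rationale],
            "",
            "## Alternatives considered",
            *[f"**{name}**: {why}" for name, why in self.alternatives.items()],
            "",
            "## Consequences",
            *[f"- {item}" for item in self.consequences],
        ]
        return "\n".join(lines) + "\n"
```

Calling `Adr(number=42, title="...", ...).render()` returns the markdown text, which can then be written to wherever the team keeps its ADRs.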

Post-mortem writing

A good post-mortem does three things: it accurately describes what happened, it identifies the systemic causes (not individual blame), and it produces specific actions that will make recurrence less likely.

Post-mortems should be written within 24–72 hours of an incident, while memories are fresh and timelines can be reconstructed accurately.

A post-mortem structure that works

# Post-mortem: Checkout Service Outage — 2026-03-18

## Summary
A deployment of v2.4.1 introduced a database query that ran without an index,
causing database CPU saturation and a 35-minute partial outage affecting checkout.
Impact: approximately 2,400 failed transactions during the window.

## Timeline
- 14:10 — v2.4.1 deployed to production
- 14:15 — Database CPU alert fires (threshold: 80%)
- 14:18 — On-call acknowledges alert, begins investigation
- 14:28 — Root cause identified: missing index on orders.customer_id
- 14:35 — Decision made to roll back v2.4.1
- 14:45 — Rollback complete, CPU returns to normal, service restored

## Root cause
The v2.4.1 deployment added a new query that joins on orders.customer_id.
The index on this column was missing. Under production query volume, this
caused a full table scan on every checkout request, saturating database CPU.

## Contributing factors
- The query executed correctly in staging (lower data volume masked the missing index)
- No query performance testing exists as part of the deployment pipeline
- The database CPU alert threshold was 80%; the issue began affecting users at ~60%

## What went well
- Alert fired within 5 minutes of the issue beginning
- Root cause was identified within 10 minutes of investigation starting
- Rollback procedure worked correctly and was completed within 10 minutes of the decision

## Action items
| Action | Owner | Due |
|--------|-------|-----|
| Add EXPLAIN ANALYZE check to CI pipeline for new queries | Backend team | 2026-03-25 |
| Lower database CPU alert threshold from 80% to 60% | Platform team | 2026-03-20 |
| Add index on orders.customer_id and re-attempt v2.4.1 deployment | Backend team | 2026-03-22 |

The action items table is the most important part of a post-mortem. Actions that are specific, have a named owner, and carry a due date get done. Actions that are vague, unassigned, and undated do not.
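
The "owner and due date" rule is simple enough to check automatically. The following is a rough sketch under stated assumptions: the function name `check_action_items` is hypothetical, the table is assumed to be a markdown table with exactly three columns (Action, Owner, Due), and due dates are assumed to be in ISO format (YYYY-MM-DD) as in the example above.

```python
from datetime import date


def check_action_items(table: str) -> list[str]:
    """Return problems found in a markdown table with Action | Owner | Due columns."""
    problems = []
    rows = [line for line in table.splitlines() if line.strip().startswith("|")]
    # Skip the header row and the |---|---| separator row.
    for row in rows[2:]:
        cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
        if len(cells) != 3:
            problems.append(f"malformed row: {row.strip()}")
            continue
        action, owner, due = cells
        if not action:
            problems.append("row with no action")
        if not owner:
            problems.append(f"no owner: {action}")
        try:
            date.fromisoformat(due)  # raises ValueError for "soon", "", "TBD", ...
        except ValueError:
            problems.append(f"no valid due date: {action}")
    return problems
```

A row such as `| Improve monitoring | | soon |` is reported twice, once for the missing owner and once for the missing date. Whether an action is specific enough still needs a human reader.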

Why documentation quality affects your career

Documentation quality is visible to senior engineers in ways that junior engineers often do not anticipate. When a senior engineer reviews a pull request and sees a well-written description explaining what the change does, why it was made, what the risk is, and what was tested — they form a clear impression of that engineer’s quality. The same is true for tickets, post-mortems, and runbooks.

More concretely: when promotion decisions are made, senior engineers think about whether they would trust someone to work independently on important projects. An engineer who documents well is clearly thinking about the people who come after them — a mark of seniority. An engineer who never documents forces others to ask questions or reverse-engineer their work — a sign of immaturity.

The engineers who get picked for high-visibility projects are often not the most technically brilliant — they are the ones who are reliable, communicate clearly, and leave things better than they found them. Good documentation is evidence of all three.

Documentation anti-patterns to avoid

Documentation can fail in several ways. Knowing the failure modes helps you avoid them.

No audience in mind

Documentation written for nobody in particular usually serves nobody well. Before writing, ask: who will read this? A new engineer? An on-call responder? A product manager? A future version of yourself? Write for that person. The vocabulary, detail level, and format should match what they need.

Too long, too much context

A 30-page runbook is not more useful than a 2-page one — it is less useful, because the person who needs to follow it under pressure cannot find the relevant steps. Documentation should be as long as it needs to be and no longer. Edit ruthlessly.

Not updated as systems change

Outdated documentation is worse than no documentation in some ways — it misleads. A runbook that describes steps for a system that was decommissioned six months ago wastes time and creates false confidence. Make updating documentation part of the definition of done for infrastructure changes. If you change a system, update its runbook.

The knowledge dump

A knowledge dump is documentation that transfers data from the author’s head to the page without considering the reader. It is often characterised by excessive detail about implementation, little context about purpose, and no clear structure. Before publishing documentation, read it as if you are encountering it for the first time.

Documentation as a substitute for conversation

Some decisions and designs need discussion before documentation. Writing a 10-page architecture document and sending it to stakeholders for “review” when the design has never been discussed is a documentation anti-pattern. Use documents to capture decisions after alignment, not to substitute for the conversation that should precede alignment.

Building good documentation habits

Good documentation is much easier to maintain if it is built as a habit rather than treated as a separate task.

Write the runbook as you do the thing for the first time — the first time you respond to an alert type or perform an operation, take notes. Convert those notes into a runbook immediately after. The context is fresh and the effort is minimal.
Update runbooks when you find discrepancies — if you follow a runbook step and the system behaves differently from what was described, update the runbook before moving on.
Write ADRs at decision time — the context and reasoning are clearest when the decision is being made. Writing an ADR six months later is harder and produces worse output.
Treat documentation as part of “done” — a ticket that delivers infrastructure without a runbook or architecture notes is not really done. Include documentation in your personal definition of done for every task.
Review documentation regularly — a quarterly pass through your team’s runbook library to remove outdated documents and update stale ones takes a few hours and maintains trust in the library.
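
The quarterly review is easier to start with a list of candidates. The sketch below is one possible approach, with assumptions that may not match your setup: the function name `find_stale_docs` is hypothetical, runbooks are assumed to be `.md` files under a single directory, and file modification time is used as a stand-in for "last reviewed". In a git checkout, modification times reflect when files were written locally, so the last commit date of each file would be a better signal there.

```python
import time
from pathlib import Path


def find_stale_docs(root, max_age_days=90, now=None):
    """Return markdown files under `root` not modified in `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = [
        path for path in Path(root).rglob("*.md")
        if path.stat().st_mtime < cutoff
    ]
    return sorted(stale)
```

The default of 90 days matches a quarterly cadence. The result is a starting list for the review, not a verdict: an old runbook for a stable system may still be correct.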
