Disaster Recovery in the Cloud: RTO, RPO, and Real Implementation
Disaster recovery planning is one of those engineering responsibilities that gets postponed until something goes catastrophically wrong. The cloud makes many DR strategies dramatically cheaper and more achievable than they were on-premises — but only if someone actually implements them. This page explains what you need to know to do DR properly.
RTO and RPO explained in plain terms
Two numbers define every disaster recovery strategy. Everything else flows from these.
RTO — Recovery Time Objective
How long can the service be unavailable before the business is significantly harmed? This is the maximum acceptable downtime. An RTO of 4 hours means the service must be restored within 4 hours of the disaster occurring.
Example: A B2B SaaS product used by enterprise customers during business hours might have an RTO of 4 hours — losing 4 hours of availability is damaging but survivable. A payment processing service might have an RTO of 15 minutes — any longer is commercially unacceptable.
RPO — Recovery Point Objective
How much data loss is acceptable? This is the maximum acceptable age of the most recent backup you can recover from. An RPO of 24 hours means that if a disaster occurs, you might lose up to 24 hours of data.
Example: A content management system where articles are published occasionally might accept an RPO of 24 hours: losing a day of draft content is inconvenient but tolerable. A financial transactions database might have an RPO of 0, meaning no data loss is acceptable, ever.
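To make the RPO definition concrete, here is a minimal sketch of checking a backup schedule against an RPO target. The function names are hypothetical, and the model assumes each backup captures the data as it was when the backup started and only becomes usable once it finishes.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the disaster hits just before the running backup finishes.

    The newest usable backup then started one full interval earlier, so the
    data at risk spans the interval plus the time the backup takes to complete.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    # Daily backups that take 30 minutes do not quite meet a 24-hour RPO.
    print(meets_rpo(timedelta(hours=24), timedelta(hours=24), timedelta(minutes=30)))
    # Hourly backups meet it comfortably.
    print(meets_rpo(timedelta(hours=1), timedelta(hours=24), timedelta(minutes=30)))
```

The practical point is that a backup interval equal to the RPO leaves no margin. The schedule usually needs to be somewhat tighter than the target.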
The relationship between RTO/RPO and cost
Lower RTO and lower RPO both cost more to achieve. Zero RPO requires synchronous replication to a second site — data is written to both locations before an operation is confirmed, which adds latency and complexity. A 1-hour RTO requires standby infrastructure that is ready to take over quickly. A 24-hour RTO might be achievable by restoring from a backup, which requires no standby infrastructure at all.
The business decides what RTO and RPO are acceptable. The engineers figure out how to achieve them at a reasonable cost. If the desired RTO/RPO is technically achievable but prohibitively expensive, that is a conversation worth having with the business before committing to the architecture.
DR strategies: four approaches with different trade-offs
There is no single right DR strategy. The four main approaches span a spectrum from low-cost/high-recovery-time to high-cost/low-recovery-time.
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup and restore | Hours | Hours | Low | Low |
| Pilot light | 30–60 min | Minutes | Medium-low | Medium |
| Warm standby | Minutes | Seconds–minutes | Medium-high | Medium-high |
| Multi-region active-active | Near-zero | Near-zero | High | High |
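The table can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO fit the targets. The sketch below shows that logic. The numeric values are illustrative assumptions chosen to sit inside the ranges in the table, not measurements, and the names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta
    typical_rpo: timedelta


# Ordered from cheapest to most expensive. The figures are assumptions.
STRATEGIES: List[Strategy] = [
    Strategy("backup and restore", timedelta(hours=8), timedelta(hours=24)),
    Strategy("pilot light", timedelta(minutes=60), timedelta(minutes=15)),
    Strategy("warm standby", timedelta(minutes=10), timedelta(minutes=1)),
    Strategy("multi-region active-active", timedelta(seconds=30), timedelta(seconds=5)),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both targets."""
    for strategy in STRATEGIES:
        if strategy.typical_rto <= rto and strategy.typical_rpo <= rpo:
            return strategy
    return None  # No listed strategy meets targets this strict.


if __name__ == "__main__":
    choice = cheapest_strategy(rto=timedelta(minutes=15), rpo=timedelta(minutes=5))
    print(choice.name if choice else "no strategy fits")
```

In practice the figures should come from your own DR tests rather than a table, but the selection rule stays the same: do not pay for a stricter strategy than the targets require.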
Backup and restore
The simplest DR strategy: take regular backups, store them durably (cross-region), and restore from backup in a disaster. The recovery process involves provisioning new infrastructure (which can be automated with Terraform), restoring data from backup, updating DNS or load balancer configs to point at the new environment, and verifying the restored service.
This is appropriate for non-critical services, internal tools, or systems where several hours of downtime is acceptable. The cost is minimal — you only pay for backup storage, not for standby infrastructure.
Pilot light
A minimal version of the core infrastructure runs continuously in the recovery region, but at a fraction of production capacity. Data is continuously replicated to this environment. In a disaster, you scale up the pilot light to full production size and redirect traffic.
The “pilot light” keeps the most critical components running so startup time is much shorter than starting from scratch. A typical pilot light might run the database replication endpoint and minimal compute, ready to scale up within 30–60 minutes.
Warm standby
A scaled-down but fully functional version of the production environment runs in the recovery region. Data is continuously replicated. In a disaster, you scale the standby up to production capacity (often a matter of minutes with auto-scaling) and redirect traffic.
The warm standby can handle some traffic in normal operation — useful for serving some users from a different region or for running DR tests against live traffic. The cost is higher than pilot light because you are running actual compute continuously.
Multi-region active-active
Full production capacity runs in multiple regions simultaneously, serving live traffic. There is no “recovery” — if a region fails, traffic is redistributed to the remaining regions automatically by the global load balancer.
This is the most resilient and most expensive approach. It requires careful design to handle data consistency across regions (a write in region A must eventually reach region B, and conflicts must be resolved). It is appropriate for high-revenue, globally distributed services where downtime is genuinely not acceptable.
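The redistribution step can be sketched as follows. This is only the arithmetic a global load balancer might apply after health checks mark a region as down, not any particular product's behaviour. The region names and the function are hypothetical.

```python
from typing import Dict, Set


def redistribute(weights: Dict[str, float], healthy: Set[str]) -> Dict[str, float]:
    """Return traffic shares over healthy regions, proportional to their weights.

    The shares sum to 1.0. Regions that are unhealthy or have zero weight
    receive no traffic.
    """
    live = {region: w for region, w in weights.items() if region in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region available to take traffic")
    return {region: w / total for region, w in live.items()}


if __name__ == "__main__":
    normal = {"eu-west": 50, "us-east": 30, "ap-south": 20}
    # eu-west fails its health checks, so the other two absorb its share.
    print(redistribute(normal, healthy={"us-east", "ap-south"}))
```

Note what the example implies for capacity: after losing the largest region, "us-east" goes from 30% to 60% of total traffic, so each region needs enough headroom to absorb its neighbours' load.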
Backup vs disaster recovery — they are not the same thing
Backups protect against data loss. Disaster recovery protects against service unavailability. They overlap but are not identical.
A backup gives you a copy of your data at a point in time. Restoring from backup requires somewhere to restore it to — an environment, infrastructure, configuration. If you have excellent backups but no DR plan, you have data protection but not service recovery.
Conversely, DR infrastructure with no backups protects against infrastructure failures (region outage, Kubernetes cluster going down) but not against logical data corruption or accidental deletion. Both are needed.
Common backup practices for cloud environments:
- Automated database snapshots on a schedule (daily at minimum, hourly for critical databases)
- Cross-region backup replication — a backup in the same region as the original data is not protected against regional failures
- Retention policies — how long to keep backups. Longer retention enables recovery from mistakes discovered late
- Object versioning on S3/GCS for critical storage buckets — protects against accidental deletion or overwrite
- Infrastructure-as-code (Terraform) stored in version control — your infrastructure definition is a form of backup for the environment itself
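A retention policy from the list above might look like the sketch below. The specific policy (keep every backup from the last 7 days, then the newest backup of each ISO week for 4 more weeks) is an assumption for illustration, as is the function name. Managed backup services usually offer their own retention settings.

```python
from datetime import date, timedelta
from typing import Iterable, Set


def backups_to_keep(backup_dates: Iterable[date], today: date,
                    daily_days: int = 7, weekly_weeks: int = 4) -> Set[date]:
    """Select the backups to retain; everything else may be deleted."""
    keep: Set[date] = set()
    weeks_seen = set()
    # Newest first, so the first backup seen in each week is the newest one.
    for d in sorted(backup_dates, reverse=True):
        age = (today - d).days
        if age < 0:
            continue  # Ignore dates in the future.
        if age < daily_days:
            keep.add(d)
        elif age < daily_days + weekly_weeks * 7:
            week = tuple(d.isocalendar())[:2]  # (ISO year, ISO week number)
            if week not in weeks_seen:
                weeks_seen.add(week)
                keep.add(d)
    return keep


if __name__ == "__main__":
    today = date(2024, 3, 31)
    daily_backups = [today - timedelta(days=i) for i in range(60)]
    kept = backups_to_keep(daily_backups, today)
    print(sorted(kept))
```

The trade-off is the one described in the list: a longer weekly tail costs more storage but lets you recover from mistakes that are discovered weeks later.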
Testing DR plans — why most teams don’t, and why they should
The uncomfortable reality: most teams have a DR plan that has never been tested. They know the plan exists. They believe it will work. They have never actually tried to fail over to the recovery region.
Untested DR plans fail in practice at an alarming rate. Reasons real DR tests fail:
- The backup restoration process takes 6 hours, not the 30 minutes estimated
- The Terraform that provisions the recovery environment was written for the production region and fails with configuration errors when applied to the recovery region
- The database backup was running but nobody noticed the backup process had been failing silently for three months
- DNS TTL is set to 3600 seconds, so resolvers can keep sending traffic to the failed environment for up to an hour after the records are changed
- Application configuration references the production region endpoint explicitly and needs manual updates to work in the recovery region
These problems are easily fixed before a disaster. They are very hard to fix during one.
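Two of the failures above, the silently failing backup and the long DNS TTL, can be caught by very simple automated checks. The following is a minimal sketch of the logic only. How you obtain the last successful backup time or the TTL depends on your tooling, and the function names and thresholds are assumptions.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(last_success: datetime, now: datetime,
                    max_age: timedelta) -> bool:
    """True if the newest successful backup is older than we tolerate."""
    return now - last_success > max_age


def ttl_fits_rto(dns_ttl_seconds: int, rto: timedelta,
                 other_recovery_steps: timedelta) -> bool:
    """True if worst-case DNS cache expiry plus the other steps fit in the RTO."""
    return timedelta(seconds=dns_ttl_seconds) + other_recovery_steps <= rto


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    # A daily backup that last succeeded 90 days ago should raise an alert.
    print(backup_is_stale(now - timedelta(days=90), now, max_age=timedelta(hours=26)))
    # A 3600-second TTL cannot fit inside a 30-minute RTO.
    print(ttl_fits_rto(3600, rto=timedelta(minutes=30),
                       other_recovery_steps=timedelta(minutes=10)))
```

Running checks like these on a schedule and alerting on failure turns a problem discovered during a disaster into one discovered on an ordinary working day.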
How to test DR
Start small and low-risk. Test the backup restoration process first — can you actually restore a database backup to a clean environment and have a working database? Do this in staging. Then test the full recovery process in a scheduled, low-stakes exercise. Announce to the team: “On Thursday at 2pm we are going to test the DR process. The staging environment will be unavailable.”
At least once a year, test the real DR process against production (during a low-traffic window, with full team awareness). The goal is not to find that everything works — it is to find the things that do not work before a real disaster does.
Cost vs availability trade-offs
DR decisions are ultimately cost decisions. A few examples to make the trade-offs concrete:
- Moving from backup-and-restore to warm standby for a critical database might increase infrastructure cost by $2,000/month and reduce RTO from 4 hours to 5 minutes. Is that worth it? For a revenue-generating service losing $50,000 per hour of downtime, a single avoided incident saves roughly $196,000 against $24,000 a year in extra cost, so the answer is almost certainly yes.
- Cross-region database replication with synchronous writes adds at least one cross-region round trip to every write, typically 60–100ms between distant regions and less between nearby ones. For most applications, this is acceptable. For high-throughput, latency-sensitive applications, it may not be.
- Running a warm standby in a second region adds a large share of production infrastructure cost, approaching double if the standby is sized close to production capacity. Using spot instances for the standby reduces that cost significantly, at the risk of the standby being unavailable right when you need it, which defeats the purpose.
The right level of DR investment is proportional to the cost of the disruption it prevents. Spending $10,000/month to protect a service that generates $5,000/month in revenue is not a good trade. Spending $5,000/month to protect a service that generates $500,000/month probably is.
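The proportionality argument can be turned into a rough break-even calculation. The sketch below uses the figures from the warm standby example and assumes, for simplicity, that actual downtime in an incident equals the RTO. The function names are hypothetical.

```python
def saved_per_incident(rto_before_hours: float, rto_after_hours: float,
                       loss_per_hour: float) -> float:
    """Money saved in one incident by recovering faster."""
    return (rto_before_hours - rto_after_hours) * loss_per_hour


def break_even_incidents_per_year(extra_monthly_cost: float,
                                  saving_per_incident: float) -> float:
    """Incidents per year at which the extra DR spend pays for itself."""
    return extra_monthly_cost * 12 / saving_per_incident


if __name__ == "__main__":
    # 4 hours down to 5 minutes, at $50,000 per hour of downtime.
    saving = saved_per_incident(4, 5 / 60, 50_000)
    rate = break_even_incidents_per_year(2_000, saving)
    print(f"saved per incident: ${saving:,.0f}")
    print(f"break-even: {rate:.3f} incidents/year (one every {1 / rate:.1f} years)")
```

With these numbers the upgrade pays for itself if a qualifying disaster happens about once every eight years. The same calculation makes the counter-example obvious: for a low-revenue service, the break-even incident rate becomes implausibly high.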
What to document in a DR plan
A DR plan is only useful if it can be executed by someone under pressure, possibly at 3am, possibly not by the person who wrote it.
A useful DR plan document includes:
- Service inventory — what services are covered, their RTO and RPO targets
- Recovery triggers — what conditions warrant declaring a disaster and initiating DR (not every outage needs full DR activation)
- Decision-making authority — who can declare a disaster and authorise the DR process
- Step-by-step recovery runbook — the specific commands and procedures to execute, in order. Assume the person following it is competent but has never done this before
- Dependencies and contacts — third-party services with their status page URLs, support contacts for critical vendors
- Test history — when the plan was last tested and what the result was
- Known gaps — what the plan does not cover, honestly documented
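Some teams keep the service inventory and test history in a structured form so that gaps can be flagged automatically. The sketch below shows one possible shape. The field names are hypothetical, and the one-year threshold is an assumption based on testing at least annually.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ServiceEntry:
    name: str
    rto: timedelta
    rpo: timedelta
    last_tested: Optional[date] = None             # None means never tested
    measured_recovery: Optional[timedelta] = None  # Result of the last test


def plan_gaps(entries: List[ServiceEntry], today: date,
              max_test_age: timedelta = timedelta(days=365)) -> List[str]:
    """List the problems a reviewer should see at the top of the DR plan."""
    gaps: List[str] = []
    for entry in entries:
        if entry.last_tested is None:
            gaps.append(f"{entry.name}: never tested")
            continue
        if today - entry.last_tested > max_test_age:
            gaps.append(f"{entry.name}: last tested {entry.last_tested.isoformat()}")
        if entry.measured_recovery is not None and entry.measured_recovery > entry.rto:
            gaps.append(f"{entry.name}: measured recovery exceeded RTO")
    return gaps


if __name__ == "__main__":
    inventory = [
        ServiceEntry("billing-api", timedelta(minutes=15), timedelta(0),
                     date(2024, 5, 1), timedelta(minutes=40)),
        ServiceEntry("internal-wiki", timedelta(hours=24), timedelta(hours=24)),
    ]
    for gap in plan_gaps(inventory, today=date(2024, 6, 1)):
        print(gap)
```

The output of a check like this is a natural source for the "Known gaps" section, since it records what the plan does not yet cover rather than what the team hopes it covers.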
Junior vs senior ownership of DR
Junior engineers do not typically own DR strategy — that is a senior responsibility. But junior engineers absolutely need to understand the DR plan for systems they work on, be able to execute the runbook, and contribute to testing and documentation.
Building DR awareness early makes you a better engineer and a more valuable team member. If you are new to a team, ask: “What is the DR plan for the services I will be working on? When was it last tested?” If the answer is vague, offer to help document it. That contribution will be noticed and remembered.
As a senior engineer, you own the DR posture — setting the RTO/RPO targets, designing the recovery architecture, ensuring DR plans exist and are tested, and representing the reliability position to business stakeholders.
Summary
- RTO is the maximum acceptable recovery time; RPO is the maximum acceptable data loss — the business sets these, engineers achieve them
- Four DR strategies span a spectrum from low-cost backup-and-restore to high-cost multi-region active-active — choose based on your RTO/RPO requirements and budget
- Backups protect data; DR plans protect service availability — both are needed, and they are not interchangeable
- Most DR plans have never been tested, and untested DR plans routinely fail when needed — test regularly, starting with backup restoration
- A DR plan is only useful if it can be executed under pressure by someone who did not write it — write it accordingly