High Availability Design: How Cloud Engineers Build Reliable Systems
High availability does not mean your system never fails. It means when individual components fail — and they will — the overall service keeps running. Understanding how to design for this is a fundamental cloud engineering skill, and it shapes almost every infrastructure decision you make.
What availability numbers actually mean
Availability is usually expressed as a percentage of time the service is accessible. It sounds abstract until you translate it to downtime:
| Availability | Annual downtime | Monthly downtime |
|---|---|---|
| 99% | 87.6 hours | 7.3 hours |
| 99.9% (“three nines”) | 8.8 hours | 43.8 minutes |
| 99.95% | 4.4 hours | 21.9 minutes |
| 99.99% (“four nines”) | 52.6 minutes | 4.4 minutes |
| 99.999% (“five nines”) | 5.3 minutes | 26.3 seconds |
Most business-facing services need to be in the 99.9–99.99% range. Achieving better than 99.99% requires significant architectural investment and is usually reserved for payment systems, healthcare platforms, and critical infrastructure.
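The conversion from a percentage to a downtime budget is simple arithmetic. The sketch below reproduces the table under two stated assumptions: a 365-day year, and a month taken as one twelfth of a year. The function name `allowed_downtime_hours` is illustrative, not from any library.

```python
HOURS_PER_YEAR = 365 * 24               # 8760, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12   # 730, an averaged month


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Return the downtime budget, in hours, for an availability target."""
    return period_hours * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    yearly_hours = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    monthly_minutes = allowed_downtime_hours(pct, HOURS_PER_MONTH) * 60
    print(f"{pct}%: {yearly_hours:.2f} h/year, {monthly_minutes:.2f} min/month")
```

Running the same function against your own SLA period (for example a 30-day billing month) shows how much the budget shifts with the measurement window.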
The most important thing about these numbers: you cannot achieve them with a single instance of anything. A single VM, a single database, a single availability zone — any single point of failure caps your achievable availability at whatever that component’s availability is. Building for high availability means systematically eliminating single points of failure.
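A rough way to see why redundancy matters is to combine component availabilities. The sketch below assumes failures are independent, which real systems only approximate (a shared deployment bug or a regional outage breaks that assumption), so treat the results as an upper bound. The helper names `serial` and `parallel` are illustrative.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Any one component is enough: the service is down only if all are down."""
    return 1 - prod(1 - a for a in availabilities)


single_vm = 0.99
two_vms = parallel(single_vm, single_vm)   # redundant pair of 99% instances
stack = serial(two_vms, 0.9995)            # that pair in front of a 99.95% database

print(f"one VM:  {single_vm:.4%}")
print(f"two VMs: {two_vms:.4%}")
print(f"stack:   {stack:.4%}")
```

The last line illustrates the cap described above: however redundant the application tier is, the whole stack cannot exceed the availability of the single database it depends on.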
Availability zones and regions
Cloud providers divide their infrastructure into regions (geographically separate data centre clusters) and availability zones (physically separate data centres within a region, with independent power, cooling, and networking).
Availability zones within a region are close enough to have low latency between them (typically under 2 ms), but physically separate enough that a flood, fire, or power outage affecting one zone is unlikely to affect the others. This makes multi-AZ deployment the standard approach for HA within a region.
The practical implication: any resource you want to be highly available should be deployed in at least two availability zones. This includes:
- Application servers — run instances in multiple AZs behind a load balancer
- Databases — use multi-AZ deployments with automatic failover
- Caches (Redis, Memcached) — use replicas in separate AZs
- Load balancers — managed load balancers are automatically multi-AZ; make sure you have healthy targets in each zone
- NAT Gateways — create one per AZ rather than routing all traffic through a single NAT Gateway
Health checks and automatic failover
Health checks are how the infrastructure knows whether a component is working. Without health checks, a load balancer will continue sending traffic to an instance that is running but returning errors. With health checks, it detects the problem and routes around it automatically.
Application-layer health checks
A good health check endpoint does more than just confirm the server is running. It confirms the application is working — connected to the database, connected to its dependencies, able to process requests. A simple /health endpoint that returns 200 when healthy and 503 when not is the minimum. A detailed health check might return the status of each dependency individually.
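
    # Example health check endpoint (Flask)
    from flask import Flask, jsonify

    app = Flask(__name__)

    # check_db_connection() and check_cache_connection() are application-specific
    # helpers that return True when the dependency is reachable.
    @app.route('/health')
    def health():
        checks = {
            'database': check_db_connection(),
            'cache': check_cache_connection(),
        }
        all_healthy = all(checks.values())
        status_code = 200 if all_healthy else 503
        body = {'status': 'ok' if all_healthy else 'degraded', 'checks': checks}
        return jsonify(body), status_code

Load balancer health check configuration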
Configure load balancers to perform health checks frequently enough to detect failures quickly, but not so frequently that healthy instances are under constant load from health check traffic. Common settings: check every 10–30 seconds, require 2–3 consecutive failures before marking unhealthy, require 2–3 consecutive successes before marking healthy again.
The “unhealthy threshold” is particularly important. A threshold of 2 with a 10-second interval means a failed instance is removed from rotation within 20 seconds of the failure. A threshold of 10 means it takes over 100 seconds — during which users are hitting a broken instance.
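The arithmetic behind those figures can be sketched as below. This is a simplified model, not any provider's exact behaviour: it assumes checks run at a fixed interval and that the failure happens just after a successful check, and it adds an optional per-check timeout. The function name and parameters are illustrative.

```python
def worst_case_detection_seconds(
    interval_s: float,
    unhealthy_threshold: int,
    timeout_s: float = 0.0,
) -> float:
    """Estimate how long a failed instance can keep receiving traffic.

    The instance must fail `unhealthy_threshold` consecutive checks, each
    `interval_s` apart, and the final check may take `timeout_s` to give up.
    """
    return interval_s * unhealthy_threshold + timeout_s


print(worst_case_detection_seconds(10, 2))       # threshold of 2
print(worst_case_detection_seconds(10, 10))      # threshold of 10
print(worst_case_detection_seconds(10, 2, 5))    # threshold of 2 with a 5 s timeout
```

Plugging in your own interval, threshold, and timeout gives a quick estimate of how long users may hit a broken instance before it leaves rotation.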
Load balancer routing patterns
Load balancers distribute traffic across healthy instances. The routing algorithm you choose affects how evenly load is distributed.
- Round-robin — requests go to each instance in turn. Simple and works well when requests are similar in cost
- Least connections — sends each new request to the instance with the fewest active connections. Better for long-lived connections or variable request duration
- IP hash / sticky sessions — routes the same client consistently to the same instance. Useful for session-based applications, but reduces the benefit of horizontal scaling and creates hotspots
- Weighted round-robin — sends proportionally more traffic to certain instances. Useful for canary deployments where you want a small percentage of traffic going to the new version
For most cloud-native, stateless applications: round-robin or least connections is appropriate. If your application relies on sticky sessions, that is a warning sign that it needs to be redesigned for stateless operation before you can scale horizontally.
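To make the difference between the two recommended algorithms concrete, here is a minimal in-memory sketch. It is not how any particular load balancer is implemented; the class names and the `acquire`/`release` methods are assumptions made for illustration.

```python
from itertools import cycle


class RoundRobin:
    """Hand out instances in a fixed rotation, ignoring their current load."""

    def __init__(self, instances):
        self._cycle = cycle(instances)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new request to the instance with the fewest open connections."""

    def __init__(self, instances):
        self.active = {name: 0 for name in instances}

    def acquire(self):
        # min() returns the first instance among those tied for fewest connections
        name = min(self.active, key=self.active.get)
        self.active[name] += 1
        return name

    def release(self, name):
        # Call when the request finishes so the count reflects live connections
        self.active[name] -= 1


rr = RoundRobin(["app-1", "app-2", "app-3"])
print([rr.pick() for _ in range(4)])

lc = LeastConnections(["app-1", "app-2"])
first = lc.acquire()       # both idle, so the first instance is chosen
second = lc.acquire()      # the other instance now has fewer connections
lc.release(second)         # its request finishes quickly
print(lc.acquire())        # the instance that freed up gets the next request
```

Round-robin needs no state about the backends, while least connections has to track when requests finish. That extra bookkeeping is what lets it adapt when request durations vary.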
Active-active vs active-passive
These terms describe how redundant systems handle load in normal operation.
Active-active
Both (or all) instances actively handle traffic at the same time. If one fails, the others absorb its share of traffic. This is the most efficient model: all your capacity is in use, there is no idle standby, and failover is close to instantaneous because the other instances are already running. The only delay is the time health checks take to detect the failure, and the surviving instances need enough headroom to carry the extra load.
Example: two application servers behind a load balancer, each handling 50% of traffic. If one fails, the load balancer routes all traffic to the remaining server, which must be sized to handle 100% on its own. Recovery is automatic and needs no promotion step.
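One consequence of active-active worth checking in advance is how loaded the survivors become after a failure. The sketch below is a back-of-the-envelope calculation that assumes traffic is spread evenly across instances. The function name is illustrative.

```python
def utilisation_after_failures(instances: int, utilisation: float, failed: int = 1) -> float:
    """Per-instance utilisation once `failed` instances drop out of an even pool.

    `utilisation` is the fraction of capacity each instance uses before the
    failure (0.5 means 50%). A result above 1.0 means the survivors are overloaded.
    """
    survivors = instances - failed
    if survivors <= 0:
        raise ValueError("no instances left to absorb traffic")
    return utilisation * instances / survivors


print(utilisation_after_failures(2, 0.5))   # two servers, each half loaded
print(utilisation_after_failures(2, 0.7))   # two servers, each 70% loaded
print(utilisation_after_failures(4, 0.7))   # four servers, each 70% loaded
```

A pair of servers running above 50% each cannot absorb the loss of one. Spreading the same traffic over more, smaller instances reduces the headroom each one has to reserve.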
Active-passive
One instance handles all traffic (the active), while the other waits on standby (the passive). If the active fails, the passive is promoted. This wastes the capacity of the standby instance, and failover takes some time (promoting the passive involves a switchover process that typically takes seconds to minutes).
Active-passive is common for databases where write consistency is critical and running two simultaneous writers would create conflicts. The primary handles all writes; the replica is ready to be promoted if the primary fails.
Database high availability
Databases are the hardest part of high availability design because they are stateful — data written to the primary needs to reach the replica before a failover is safe.
Read replicas
A read replica is a continuously updated copy of the primary database. It can serve read queries (SELECT statements), reducing load on the primary and providing a target for failover. Read replicas have a small lag behind the primary — replication is near-real-time but not instantaneous. For most read operations this is acceptable. For reads that must reflect the very latest write, you still need to query the primary.
Automatic failover
Most managed database services (AWS RDS Multi-AZ, GCP Cloud SQL HA, Azure SQL Database) handle failover automatically. When the primary becomes unhealthy, the managed service promotes the standby and updates the connection endpoint. Application connection strings that point to the managed endpoint (rather than a specific instance IP) will automatically connect to the new primary.
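Typical managed failover time ranges from roughly 30 seconds to a couple of minutes, depending on the service and on database activity at the time of the failure. Your application should handle this gracefully: retry logic on database connections is essential for HA applications.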
Connection string resilience
Use the database endpoint URL provided by the managed service (which automatically routes to the current primary), not the IP address of a specific instance. Add connection retry logic with exponential backoff. Set appropriate connection pool settings to handle the brief unavailability during a failover window.
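A minimal sketch of connection retry with exponential backoff follows. It assumes the database driver raises `ConnectionError` (or a subclass) when the endpoint is unreachable; real drivers often raise their own exception types, so adjust the `except` clause accordingly. The function `connect_with_retry`, its parameters, and the added random jitter are illustrative choices, not part of any specific library.

```python
import random
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries.

    `connect` is any zero-argument callable that returns a connection or
    raises ConnectionError. `sleep` is injectable so tests need not wait.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delay cap doubles each attempt: 0.5 s, 1 s, 2 s, ... up to max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from reconnecting in lockstep
            sleep(random.uniform(0, delay))
```

Usage would look like `conn = connect_with_retry(lambda: driver.connect(DATABASE_URL))`, where `driver` and `DATABASE_URL` stand in for your own client library and managed endpoint. Choose `attempts` and `max_delay` so the total retry window comfortably exceeds the failover time you expect.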
Dependency failure chains
A system can be highly available in isolation but fail in practice because it depends on something that is not. If your application requires a connection to a third-party API on every request, and that API goes down, your application is effectively down too — regardless of how many application instances you are running.
Design for dependency failures:
- Circuit breakers — after N consecutive failures to a dependency, stop trying and return an error immediately (rather than timing out slowly on every request). This prevents a slow dependency from making your service slow too
- Graceful degradation — decide in advance what your service should do when a dependency is unavailable. Can you serve a cached response? Show reduced functionality? Return a meaningful error rather than a generic 500?
- Timeouts on all outbound calls — every request to a database, API, or service should have an explicit timeout. Without timeouts, a slow dependency can exhaust your connection pool or thread pool, bringing your service down
- Bulkheads — isolate connections to different dependencies so that a slow downstream service only consumes its own connection pool, not the shared pool for everything
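The circuit breaker pattern from the list above can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production implementation: the class names, the `reset_timeout` parameter, and the trial call allowed after the timeout (often called the half-open state) are assumptions added here. Established libraries handle concurrency and metrics that this omits.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a dependency known to be down
                raise CircuitOpenError("dependency marked unavailable")
            # Reset timeout elapsed: let one trial call through (half-open)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_prices, timeout=2)` with a hypothetical `fetch_prices` function. Catching `CircuitOpenError` is the natural place to apply the graceful degradation you decided on in advance, such as serving a cached response.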
A practical HA checklist
Before declaring a service production-ready, work through this checklist:
- Are application instances deployed across at least two availability zones?
- Is there a load balancer distributing traffic across those instances?
- Are health checks configured with appropriate thresholds and intervals?
- Does the health check endpoint test application-layer health (database connectivity etc.), not just “is the server running”?
- Is the database using a managed multi-AZ or HA configuration?
- Does the application retry database connections with backoff?
- Are NAT Gateways deployed per-AZ (not shared across all AZs)?
- Do all outbound calls to dependencies have explicit timeouts?
- Is there a defined behaviour for when each dependency is unavailable?
- Has failover been tested — manually or automatically — in a staging environment?
That last point is the one most teams skip. An untested failover is a failover you should assume will not work when you need it.
Summary
- High availability means the service survives component failures — it requires eliminating single points of failure, not just adding redundancy as an afterthought
- Multi-AZ deployment is the standard approach: run instances across at least two availability zones behind a load balancer
- Health checks must test actual application health, not just server responsiveness — misconfigured health checks route traffic to broken instances
- Active-active deployments use all capacity and fail over almost instantly; active-passive leaves standby capacity idle but is appropriate for databases
- All outbound calls need timeouts and retry logic — slow dependencies without timeouts can cascade into full service outages