Scaling Systems in the Cloud: A Practical Engineering Guide

Scaling is one of the most frequently misunderstood concepts in cloud engineering. It sounds simple — add more servers when you need them — but the decisions around when to scale, how to scale, and what scaling actually costs are what separate systems that handle growth well from systems that collapse or bankrupt their operators.

Horizontal scaling vs vertical scaling

These are the two fundamental approaches to adding capacity, and they have very different trade-offs.

Vertical scaling (scale up)

Vertical scaling means making a single resource bigger: upgrading a VM from 4 CPUs to 16 CPUs, or a database from 32GB RAM to 128GB RAM. It is simple to do and usually requires no application changes, but it has hard limits (you cannot make a single server infinitely large), and resizing in either direction typically means a restart, which on many managed database services shows up as a brief outage or failover. It can also be expensive at the high end: the largest instance sizes often cost more than the same capacity split across smaller instances.

Horizontal scaling (scale out)

Horizontal scaling means adding more instances of the same resource — running 10 servers instead of 2. This is the approach cloud infrastructure is designed for. It is theoretically unlimited, it provides redundancy (if one instance fails, the others continue), and it allows gradual scaling — add one instance, verify things look stable, add more if needed.

The catch: your application must be designed for horizontal scaling. Stateless applications (where any request can be handled by any instance) scale horizontally with no changes. Stateful applications (where a user’s session is stored in memory on a specific server) need redesigning — session state needs to move to an external store like Redis before horizontal scaling works properly.
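
As a minimal sketch of that redesign, the Python below keeps session data in a shared store instead of in each instance's memory. The names `SessionStore` and `AppInstance` are hypothetical, and the dict-backed store is only a stand-in: in a real deployment it would be an external service such as Redis, reached over the network.

```python
class SessionStore:
    """Stand-in for an external store such as Redis (assumption: a plain dict here)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        # Return a copy so callers never share mutable state with the store.
        return dict(self._data.get(session_id, {}))

    def set(self, session_id, session):
        self._data[session_id] = dict(session)


class AppInstance:
    """One application server. It holds no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = list(session.get("cart", []))
        cart.append(item)
        session["cart"] = cart
        self.store.set(session_id, session)
        return cart


shared = SessionStore()
instance_a = AppInstance("instance-a", shared)
instance_b = AppInstance("instance-b", shared)

instance_a.add_to_cart("user-42", "book")
# The next request lands on a different instance and still sees the cart.
cart = instance_b.add_to_cart("user-42", "pen")
```

Because neither instance keeps the cart in its own memory, a load balancer can send each request to any instance, and an instance can be removed without losing sessions.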

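|                       | Vertical scaling        | Horizontal scaling            |
|-----------------------|-------------------------|-------------------------------|
| Requires app changes? | No                      | Sometimes (for stateful apps) |
| Hard limits?          | Yes (max instance size) | No (add more instances)       |
| Downtime required?    | Sometimes               | No                            |
| Redundancy?           | No                      | Yes                           |
| Cost at scale         | Expensive               | More predictable              |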

Auto-scaling: the theory vs the practice

Auto-scaling means the infrastructure adjusts its own capacity in response to load — adding instances when traffic increases, removing them when traffic drops. In theory, this is ideal: you pay only for what you use, and you always have enough capacity.

In practice, auto-scaling requires careful configuration to work well. The most common mistakes:

  • Setting the scale-out threshold too high — by the time scaling triggers, users are already experiencing degraded performance
  • Setting the scale-in threshold too aggressively — instances are terminated while they still have active connections
  • Not accounting for the time it takes to launch and warm up a new instance — if startup takes 3 minutes and your traffic spike is 2 minutes, auto-scaling does not help
  • Setting minimum capacity too low — in a multi-AZ setup, the minimum should be at least 2 instances (one per zone) so a single failure does not cause an outage
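
The first and third mistakes are linked: the scale-out threshold has to leave enough headroom to cover the time new capacity takes to arrive. The sketch below estimates that threshold under a deliberately simple assumption of linear growth in utilisation. The function name, its parameters, and the example numbers are illustrative, not taken from any provider's API.

```python
def scale_out_threshold(saturation_pct, growth_pct_per_min, startup_min, evaluation_min=0):
    """Utilisation (%) at which scale-out must trigger so that new capacity is
    serving traffic before the fleet reaches saturation_pct.

    Assumes utilisation grows linearly at growth_pct_per_min.
    evaluation_min is how long the metric must stay high before the policy fires.
    """
    lead_time = startup_min + evaluation_min
    threshold = saturation_pct - growth_pct_per_min * lead_time
    # A result of 0 means reactive scaling cannot keep up at this growth rate.
    return max(threshold, 0)


# Users notice degradation at 85% CPU, load climbs 5 points per minute,
# and a new instance needs 3 minutes before it can serve traffic.
threshold = scale_out_threshold(saturation_pct=85, growth_pct_per_min=5, startup_min=3)
```

With these example numbers the trigger needs to sit at 70%, not 85%. If the function returns 0, no threshold is low enough, which is the case where pre-scaling or faster instance startup is the only fix.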

AWS Auto Scaling Groups (ASGs)

AWS Auto Scaling Groups manage a fleet of EC2 instances as a single unit. You define minimum, maximum, and desired instance counts, attach scaling policies (scale out when CPU > 70% for 3 minutes, scale in when CPU < 30% for 10 minutes), and the ASG handles the rest. ASGs integrate with load balancers to automatically register and deregister instances as they start and stop.
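
To make the "sustained for N minutes" part of such a policy concrete, here is a simplified model of the decision in Python. It is not how AWS evaluates alarms internally; the function name and the one-sample-per-minute input are assumptions for illustration.

```python
def decide(cpu_by_minute, out_above=70, out_minutes=3, in_below=30, in_minutes=10):
    """Return 'scale_out', 'scale_in', or 'hold' for the latest samples.

    cpu_by_minute: average fleet CPU %, one sample per minute, oldest first.
    A condition only fires when every sample in its window breaches the threshold.
    """
    recent_out = cpu_by_minute[-out_minutes:]
    if len(recent_out) == out_minutes and all(c > out_above for c in recent_out):
        return "scale_out"

    recent_in = cpu_by_minute[-in_minutes:]
    if len(recent_in) == in_minutes and all(c < in_below for c in recent_in):
        return "scale_in"

    return "hold"


spike = decide([50, 72, 75, 81])        # three consecutive minutes above 70%
blip = decide([72, 75, 60])             # the high reading did not persist
quiet = decide([25] * 10)               # ten consecutive minutes below 30%
```

The asymmetry is deliberate: a short window for scale-out reacts quickly to load, while a longer window for scale-in avoids removing capacity during a brief lull.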

GCP Managed Instance Groups (MIGs)

GCP’s equivalent is the Managed Instance Group. MIGs support autoscaling based on CPU utilisation, HTTP load balancing metrics, or custom metrics from Cloud Monitoring. They also support rolling updates — you can update the instance template (the configuration for new instances) and roll it out gradually without downtime.

Kubernetes Horizontal Pod Autoscaler (HPA)

If you are running workloads on Kubernetes, the Horizontal Pod Autoscaler handles scaling at the pod level. HPA increases or decreases the number of pod replicas in a Deployment based on observed metrics.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65

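This HPA maintains between 2 and 20 replicas, scaling to keep average CPU utilisation near 65%. Utilisation here is measured against each pod's CPU request, so the Deployment's containers must declare CPU requests for this metric to work. When average utilisation climbs meaningfully above 65%, HPA adds replicas; when it falls below, HPA removes them, with a stabilisation window (five minutes by default for scale-down) to prevent thrashing.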
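
The replica count HPA aims for follows the formula given in the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default). The Python sketch below reproduces only that arithmetic. It leaves out pod readiness handling, missing metrics, and the stabilisation window, and the function name is an assumption.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=20, tolerance=0.1):
    """Approximate the HPA replica calculation for a single Utilization metric."""
    ratio = current_utilization / target_utilization
    # Small deviations from the target are ignored to avoid constant resizing.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


busy = desired_replicas(4, current_utilization=90, target_utilization=65)
steady = desired_replicas(4, current_utilization=68, target_utilization=65)
idle = desired_replicas(10, current_utilization=20, target_utilization=65)
capped = desired_replicas(15, current_utilization=130, target_utilization=65)
```

With the manifest's values, 4 replicas at 90% CPU become 6, 68% is inside the tolerance band so nothing changes, and a request for 30 replicas is capped at the `maxReplicas` of 20.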

HPA also supports custom and external metrics through the Custom Metrics and External Metrics APIs, provided a metrics adapter is installed in the cluster. You can scale based on queue depth (scale out when there are more than 1000 messages in a queue), request latency, or any metric you can expose. This is more sophisticated than CPU-based scaling and often produces better results for workloads where CPU is not the primary bottleneck.

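For scaling the underlying Kubernetes nodes themselves (not just the pods), you need a node autoscaler: the Cluster Autoscaler (or Karpenter on AWS) if you manage nodes yourself, or a managed option such as GKE's built-in cluster autoscaler or GKE Autopilot. Note that EKS managed node groups do not resize themselves in response to pending pods; they still need the Cluster Autoscaler or Karpenter to drive them.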

Choosing the right scaling trigger

The metric you use to trigger scaling should be the resource that is actually constraining your service. CPU is the default, but it is not always the right choice.

  • CPU utilisation — good for compute-bound workloads (data processing, transcoding, ML inference)
  • Memory utilisation — good for memory-bound workloads, though memory-based scaling is less common because memory leaks can trigger constant scale-out
  • Request rate or active connections — better than CPU for web services where CPU usage is low but request volume is high
  • Queue depth — ideal for worker processes consuming from a queue; scale workers proportionally to backlog size
  • Custom application metrics — the most precise option for workloads with complex scaling behaviour
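
For the queue-depth case, "proportionally to backlog size" can be made concrete with a little arithmetic. The sketch below sizes a worker pool so the current backlog drains within a target time. The function name, the per-worker throughput, and the bounds are illustrative assumptions, and it ignores messages that arrive while the backlog is draining.

```python
import math


def workers_for_backlog(backlog, msgs_per_worker_per_min, drain_target_min,
                        min_workers=1, max_workers=50):
    """Workers needed to clear `backlog` messages within `drain_target_min` minutes."""
    capacity_per_worker = msgs_per_worker_per_min * drain_target_min
    needed = math.ceil(backlog / capacity_per_worker)
    # Keep at least a baseline pool and never exceed the configured ceiling.
    return max(min_workers, min(max_workers, needed))


# 1000 queued messages, each worker handles 50 per minute, target: drain in 5 minutes.
normal = workers_for_backlog(1000, msgs_per_worker_per_min=50, drain_target_min=5)
empty = workers_for_backlog(0, msgs_per_worker_per_min=50, drain_target_min=5)
flood = workers_for_backlog(100_000, msgs_per_worker_per_min=50, drain_target_min=5)
```

The per-worker throughput is the number to measure carefully: if it is overestimated, the pool will be too small and the backlog will keep growing.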

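A useful exercise: load test your service and observe which resource hits its limit first. That resource is usually your scaling trigger, unless it sits outside the tier you are scaling. If your service saturates database connections before CPU, adding application instances will not help and may make things worse, because each new instance opens more connections; fix the connection pooling or add database capacity instead.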

Load testing basics

You cannot know how your system scales without testing it. Load testing is the practice of generating synthetic traffic to understand how a system behaves under increasing load.

The questions load testing answers:

  • At what request rate does response time start to degrade?
  • What is the maximum throughput the system can sustain?
  • Does auto-scaling trigger at the right point and keep up with load increases?
  • Where is the bottleneck — is it the application servers, the database, a downstream API?
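
The first question can be answered mechanically from load-test output. As a sketch, the Python below takes latency samples recorded at each request rate and reports the lowest rate at which p95 latency exceeds a limit. The data shape, the function names, and the sample numbers are assumptions; a real tool such as k6 would supply the measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def degradation_point(results, p95_limit_ms):
    """results maps request rate (req/s) to latency samples in ms for that stage.

    Returns the lowest rate whose p95 latency exceeds the limit, or None.
    """
    for rate in sorted(results):
        if percentile(results[rate], 95) > p95_limit_ms:
            return rate
    return None


# Made-up measurements from three load-test stages.
results = {
    100: [40] * 19 + [60],
    200: [50] * 18 + [120, 130],
    400: [200] * 20,
}
knee = degradation_point(results, p95_limit_ms=100)
```

Using a percentile instead of the mean matters here: at 200 req/s most requests are still fast, but the slowest tail has already crossed the limit.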

Simple load testing tools: k6 (scripted, developer-friendly, outputs metrics), Apache JMeter (GUI-based, good for non-engineers), hey (simple CLI tool for quick HTTP load tests). For sustained production-like load testing, consider running tests against a staging environment that mirrors production sizing.

One important caution: load testing production is risky. If your scaling does not work as expected and your system falls over, that is a self-inflicted outage. Load test in staging first. Only load test production with careful planning, live monitoring in place, and the ability to abort the test quickly.

Capacity planning vs reactive scaling

Reactive scaling (auto-scaling based on live metrics) handles unpredictable traffic well. But it does not replace capacity planning — thinking ahead about how much capacity you will need as your product grows.

Capacity planning is particularly important for:

  • Planned events — a product launch, a marketing campaign, a TV appearance. If you know traffic will spike at a specific time, pre-scale before the event rather than relying on auto-scaling to keep up
  • Database scaling — databases do not scale horizontally as easily as stateless services. Vertical database scaling requires maintenance windows. Capacity planning prevents reactive, emergency database upgrades
  • Reserved capacity — committing to reserved instances requires forecasting. Accurate capacity planning turns reactive cloud spend into predictable, discounted spend
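
For a planned event, the pre-scale target can be estimated from a traffic forecast and a per-instance throughput figure taken from load testing. The sketch below is one simple way to do that; the function name, the 60% utilisation target, and the single spare instance are assumptions to adjust for your own service.

```python
import math


def instances_for_peak(peak_rps, rps_per_instance, target_utilization_pct=60, spare=1):
    """Instances to run before an event so the forecast peak fits with headroom.

    rps_per_instance: sustainable throughput of one instance, from load testing.
    target_utilization_pct: how full each instance should be at the forecast peak.
    spare: extra instances to absorb a single failure.
    """
    usable_rps = rps_per_instance * target_utilization_pct / 100
    return math.ceil(peak_rps / usable_rps) + spare


# Forecast peak of 3000 req/s; load tests showed 250 req/s per instance.
launch_day = instances_for_peak(peak_rps=3000, rps_per_instance=250)
```

Setting this number as the auto-scaling minimum for the duration of the event keeps reactive scaling available as a safety net on top of the planned capacity.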

Cost implications of scaling decisions

Auto-scaling does not automatically mean cost efficiency. Poorly configured scaling can increase costs significantly.

Scale-out without scale-in is a common trap. An ASG or HPA scales out under load, but if the scale-in policy is too conservative or the minimum instance count is set too high, the additional capacity stays running long after the load has passed. Check that your scale-in policy is as carefully tuned as your scale-out policy.

Also consider what type of instances are in your auto-scaling pool. If you are scaling out with on-demand instances during traffic spikes, the cost can be high. Running your auto-scaling pool on spot or preemptible instances can cut that cost substantially (providers advertise discounts of up to roughly 90%, though actual savings vary by instance type and region). The trade-off is that instances can be reclaimed at short notice (a two-minute warning on AWS, about 30 seconds on GCP), which is acceptable for stateless, horizontally-scaled workloads.
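
A rough calculation shows how both effects play out. The sketch below compares the cost of a burst under three scenarios; the hourly price and the 70% discount are hypothetical figures chosen for the arithmetic, not quotes from any provider.

```python
def burst_cost(extra_instances, hours, on_demand_hourly, spot_discount_pct=0):
    """Cost of running `extra_instances` for `hours` at a given hourly price."""
    hourly = on_demand_hourly * (100 - spot_discount_pct) / 100
    return extra_instances * hours * hourly


PRICE = 0.20  # hypothetical on-demand price per instance-hour

# 10 extra instances for a 2-hour spike, removed promptly afterwards.
on_demand_spike = burst_cost(10, hours=2, on_demand_hourly=PRICE)

# Same spike, but a sluggish scale-in policy leaves them running for 24 hours.
on_demand_lingering = burst_cost(10, hours=24, on_demand_hourly=PRICE)

# Same 2-hour spike on spot capacity at an assumed 70% discount.
spot_spike = burst_cost(10, hours=2, on_demand_hourly=PRICE, spot_discount_pct=70)
```

In this example the lingering instances cost twelve times as much as the spike itself, which is a larger effect than the choice between on-demand and spot.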

When NOT to scale

Scaling is sometimes the wrong answer to a performance problem. Before adding more instances, check whether the issue is actually a scaling problem or a different kind of problem wearing the costume of a scaling problem.

  • N+1 query problems: an application making 100 database queries per user request will make 10,000 queries for 100 concurrent requests. Adding more app servers only pushes more of those queries at the database, which makes things worse, not better. Fix the query pattern first.
  • Memory leaks: a service that steadily consumes memory until it crashes will keep crashing however many instances you run. Scaling out just means more instances crashing. Fix the leak.
  • External bottlenecks: if your service is slow because a downstream API is slow, adding more instances of your service creates more requests to the slow API. Identify the bottleneck correctly.
  • Unnecessary work: if requests are doing work that could be cached, pre-computed, or eliminated, scaling does not help; optimisation does. Ten well-optimised instances are better than fifty inefficient ones.
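
The N+1 pattern from the first bullet is easiest to recognise in code. The sketch below uses an in-memory SQLite database with a made-up `orders`/`items` schema and counts the queries each approach issues; the table names and helper functions are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 'ana'), (2, 'ben'), (3, 'cy');
    INSERT INTO items VALUES (1, 'A'), (1, 'B'), (2, 'C'), (3, 'D');
""")

stats = {"queries": 0}


def run(sql, params=()):
    """Execute a query and count it, so the two approaches can be compared."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def items_n_plus_one():
    """One query for the orders, then one more query per order."""
    result = {}
    for order_id, _customer in run("SELECT id, customer FROM orders"):
        rows = run("SELECT sku FROM items WHERE order_id = ? ORDER BY sku", (order_id,))
        result[order_id] = [sku for (sku,) in rows]
    return result


def items_single_query():
    """The same data fetched with a single JOIN."""
    result = {}
    rows = run("SELECT o.id, i.sku FROM orders o "
               "JOIN items i ON i.order_id = o.id ORDER BY o.id, i.sku")
    for order_id, sku in rows:
        result.setdefault(order_id, []).append(sku)
    return result


stats["queries"] = 0
slow = items_n_plus_one()
n_plus_one_queries = stats["queries"]

stats["queries"] = 0
fast = items_single_query()
single_queries = stats["queries"]
```

Both functions return the same data, but the first issues one query plus one per order while the second always issues one. With three orders the difference is small; with thousands of orders per request it is the difference that no amount of extra app servers will fix.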

The diagnostic question before scaling: “What resource is actually at capacity, and is adding more of it the right way to address that?” If the honest answer is “I am not sure,” do the load testing first.