Horizontal Pod Autoscaling in GKE: Kubernetes HPA Explained for Beginners
Horizontal Pod Autoscaling (HPA) is how Kubernetes automatically adjusts the number of running pods to match actual demand. When traffic rises and CPU climbs, the HPA adds more pods. When demand drops, it removes them. On GKE, it works out of the box with no extra setup required. This guide explains what HPA is, how it calculates scaling decisions, how to configure it correctly, and what to watch out for.
What is Horizontal Pod Autoscaling?
Horizontal Pod Autoscaling is a built-in Kubernetes controller that watches a Deployment (or StatefulSet, or ReplicaSet) and automatically changes the number of running pod replicas based on observed metrics.
“Horizontal” means adding or removing pod replicas: more instances of your application running in parallel. This is different from vertical scaling, which changes the CPU or memory limits on existing pods.
The HPA checks metrics roughly every 15 seconds. If utilisation is above your target, it creates more pods. If utilisation is below it, it reduces the count down to your configured minimum.
The HPA uses metrics-server, which is pre-installed in every GKE cluster. You do not need to deploy anything extra to start using CPU or memory-based autoscaling. Just set resource requests on your containers and create an HPA.
Horizontal Pod Autoscaling in simple terms
Think of a supermarket during a Saturday rush. When queues get long, the manager opens more checkout lanes. When the store quiets down, some cashiers go on break. The manager does not guess in advance. They respond to what they can actually see: how long the queues are right now.
HPA works the same way. Instead of checkout lanes, it manages pod replicas. Instead of queue length, it watches CPU utilisation (or memory, or custom metrics). Instead of a manager making calls every hour, it checks automatically every 15 seconds.
The key idea: you stop guessing how many pods you need and let the cluster respond to real demand. Your application gets more capacity when it needs it, and stops paying for idle capacity when it does not.
Why HPA matters
Without HPA, you face two bad options: over-provision (run too many pods permanently and waste money) or under-provision (run too few and get hammered during traffic spikes).
HPA gives you a third option: scale dynamically with demand.
- Handles traffic spikes automatically. A news story breaks, a sale goes live, a cron job fires off thousands of requests. HPA responds in seconds without you paging anyone.
- Reduces idle resource cost. At 3am when traffic is low, HPA scales down to your minimum replica count. You are not running ten pods when two would do.
- Improves resilience. Setting minReplicas to 2 or more means your service always has redundancy. One pod crashing does not take down your whole application.
- Removes manual toil. Engineers stop watching dashboards and manually running kubectl scale. The HPA handles it.
For stateless web applications and API backends, the most common workloads on GKE, HPA is the standard first tool for scaling.
How Horizontal Pod Autoscaling works
The HPA runs a continuous reconciliation loop inside the Kubernetes control plane. Here is what happens on each cycle:
1. Metrics collection
Every 15 seconds, the HPA queries metrics-server for current resource usage across all pods in the target Deployment. Metrics-server aggregates CPU and memory usage from the kubelet on each node.
2. Utilisation calculation
The HPA calculates current average utilisation as a percentage of resource requests across all pods.
For CPU: (total current CPU across all pods) ÷ (total requested CPU across all pods) × 100
If you have three pods each requesting 250m CPU and they are collectively using 525m, average utilisation is 70%.
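The utilisation arithmetic above can be sketched in a few lines of Python. This is an illustration of the formula only, not the controller's code; the function name and the per-pod usage split (200m, 175m, 150m) are assumptions chosen so the total matches the 525m in the example.

```python
def average_utilisation(usage_millicores, request_millicores):
    """Return CPU utilisation as a percentage of requested CPU across all pods."""
    if len(usage_millicores) != len(request_millicores):
        raise ValueError("need one usage value and one request value per pod")
    total_requested = sum(request_millicores)
    if total_requested == 0:
        # No requests means no denominator, so utilisation is undefined.
        raise ValueError("no CPU requests set; utilisation is undefined")
    return 100 * sum(usage_millicores) / total_requested


# Three pods requesting 250m each and using 525m in total.
# The per-pod split is hypothetical; only the total comes from the example.
print(average_utilisation([200, 175, 150], [250, 250, 250]))  # 70.0
```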
3. Desired replica calculation
The HPA uses this formula:
desiredReplicas = ceil( currentReplicas × (currentUtilisation ÷ targetUtilisation) )
If you have 3 pods at 70% utilisation and your target is 50%, the HPA calculates ceil(3 × (70 ÷ 50)) = ceil(4.2) = 5. It will scale up to 5 pods.
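As a quick check of the formula, here is a minimal Python sketch. The function name is an assumption for illustration; it covers only the formula as written and ignores the replica bounds applied in the next step.

```python
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation):
    """Apply ceil(currentReplicas * currentUtilisation / targetUtilisation)."""
    return math.ceil(current_replicas * current_utilisation / target_utilisation)


# 3 pods at 70% utilisation with a 50% target: ceil(4.2) = 5
print(desired_replicas(3, 70, 50))  # 5
```

Rounding up matters: 4.2 pods' worth of load cannot be served by 4 pods without exceeding the target, so the HPA always takes the next whole number.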
4. Scaling action
The HPA clamps the desired count to your configured minReplicas and maxReplicas bounds. If the result differs from the current count, it updates spec.replicas on the Deployment. The Deployment controller then creates or terminates pods to match.
The HPA works like a thermostat. You set a target temperature: say, 70% CPU utilisation. The thermostat checks the current reading every 15 seconds. Too hot? It spins up more cooling units (creates pods). Too cool? It switches some off (removes pods). It never goes below your minimum or above your maximum.
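One pass of the thermostat can be sketched as follows. This is a simplified model, not the controller's code. The function name is hypothetical, and the tolerance value is not from this guide: it is the upstream Kubernetes default (0.1) that stops the HPA from reacting to very small deviations from the target.

```python
import math

# Upstream Kubernetes default; the HPA skips scaling when the
# utilisation/target ratio is within this distance of 1.0.
TOLERANCE = 0.1


def next_replica_count(current, utilisation, target, min_replicas, max_replicas):
    """Return the replica count one reconciliation cycle would settle on."""
    ratio = utilisation / target
    if abs(ratio - 1.0) <= TOLERANCE:
        desired = current  # close enough to target: leave the count alone
    else:
        desired = math.ceil(current * ratio)
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, desired))


print(next_replica_count(3, 72, 70, 2, 10))   # 3  (within tolerance)
print(next_replica_count(8, 140, 70, 2, 10))  # 10 (16 wanted, capped at max)
print(next_replica_count(4, 10, 70, 2, 10))   # 2  (1 wanted, raised to min)
```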
Resource requests: why HPA depends on them
CPU-based HPA requires that every container in the Deployment has resources.requests.cpu set. This is non-negotiable.
The HPA expresses CPU utilisation as a percentage of the requested CPU, not as an absolute value in millicores. If a pod requests 250m CPU and is currently using 175m, utilisation is 70%. Metrics-server still reports raw usage when a request is missing, but the HPA has no denominator for the percentage and shows the metric as unknown.
If the TARGETS column shows <unknown>/70%, missing resource requests are almost always the cause. The fix is straightforward: add resources.requests.cpu and resources.requests.memory to every container in your Deployment. The HPA will start reporting real values within about a minute.
Here is a Deployment correctly configured for use with HPA:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: us-docker.pkg.dev/my-project/my-repo/my-app:v1
            resources:
              requests:
                cpu: "250m"
                memory: "256Mi"
              limits:
                cpu: "500m"
                memory: "512Mi"
Both requests and limits are set here. The HPA uses requests to calculate utilisation. The limits cap what any individual pod can consume. Setting both is good practice regardless of whether you use HPA.
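A missing CPU request is easy to spot programmatically. The sketch below assumes the Deployment manifest has already been loaded into a Python dictionary (for example from YAML or JSON). The function name and the log-shipper sidecar are hypothetical, added only to show a container that would break CPU-based scaling.

```python
def containers_missing_cpu_request(deployment):
    """Return the names of containers that have no resources.requests.cpu."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = []
    for container in containers:
        requests = (container.get("resources") or {}).get("requests") or {}
        if "cpu" not in requests:
            missing.append(container["name"])
    return missing


deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "my-app",
         "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
        # Hypothetical sidecar with no requests at all.
        {"name": "log-shipper"},
    ]}}}
}

print(containers_missing_cpu_request(deployment))  # ['log-shipper']
```

Every container counts: one sidecar without a request is enough to leave the HPA without a usable utilisation figure.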
How to create an HPA in GKE
Quick test with kubectl
For testing or quick experimentation, the imperative command is the fastest way to create an HPA:
    kubectl autoscale deployment my-app \
      --cpu-percent=70 \
      --min=2 \
      --max=10
This creates an HPA targeting 70% average CPU utilisation, with a minimum of 2 pods and a maximum of 10. It is useful for testing but does not belong in production config; you cannot track it in version control.
Production example with YAML
For production, use a HorizontalPodAutoscaler manifest with the autoscaling/v2 API. This is the current stable API on GKE and supports multiple metrics:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80
When both CPU and memory metrics are configured, the HPA calculates a desired replica count for each metric independently and uses whichever is higher. If CPU requires 4 pods and memory requires 6, the HPA scales to 6.
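The "highest metric wins" rule can be sketched like this. The function names and the utilisation figures (90% CPU, 160% memory on 3 replicas) are assumptions, picked so the per-metric results match the 4 and 6 in the example above.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    """Replica count that a single metric would ask for."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_replicas(current_replicas, metrics):
    """metrics maps a metric name to (current utilisation, target utilisation)."""
    return max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    )


# Hypothetical readings on 3 replicas: CPU asks for 4 pods, memory asks for 6.
metrics = {"cpu": (90, 70), "memory": (160, 80)}
print(desired_replicas(3, metrics))  # 6
```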
Apply the manifest with:
    kubectl apply -f my-app-hpa.yaml
Always use autoscaling/v2 for new HPAs. The older autoscaling/v1 only supports CPU and lacks the behavior block for tuning scale-up and scale-down speeds. The v2 API is stable and the default on all current GKE versions.
How to check whether HPA is working
After creating an HPA, these commands tell you what it is doing and why.
List all HPAs in the namespace:
    kubectl get hpa
Example output:
    NAME         REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    my-app-hpa   Deployment/my-app   45%/70%   2         10        3          12m
The TARGETS column shows current/target. 45%/70% means current CPU is 45% against a target of 70%. The HPA is satisfied and not scaling.
Get a detailed view with scaling history:
    kubectl describe hpa my-app-hpa
This output includes the current metric values, recent scaling decisions, and an event log showing exactly when the HPA scaled and why. This is your first diagnostic tool when something unexpected happens.
Common warning signs to look for:
- <unknown> in TARGETS: missing resource requests on the container
- Pending pods after a scale-up: not enough node capacity; check if Cluster Autoscaler is enabled
- Replicas stuck at maxReplicas: your maximum may be too low, or the application needs optimisation
- Scale-down not happening: the 5-minute stabilisation window is likely still active
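The first three warning signs can be turned into a small check. This is only a sketch: the function name is hypothetical, and it assumes you have already read the TARGETS text, the replica counts and the number of Pending pods from kubectl yourself.

```python
def hpa_warnings(targets, replicas, max_replicas, pending_pods=0):
    """Map values read from kubectl output to the warning signs listed above."""
    warnings = []
    if "<unknown>" in targets:
        warnings.append("metric unknown: check resource requests on every container")
    if pending_pods > 0:
        warnings.append("pods Pending: check node capacity and Cluster Autoscaler")
    if replicas >= max_replicas:
        warnings.append("at maxReplicas: raise the maximum or optimise the application")
    return warnings


print(hpa_warnings("45%/70%", replicas=3, max_replicas=10))        # []
print(hpa_warnings("<unknown>/70%", replicas=2, max_replicas=10))
```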
Scale-up and scale-down behaviour
The HPA has asymmetric default behaviour, and this is deliberate.
Scale-up is fast. When utilisation exceeds the target, the HPA acts aggressively. By default it can double the pod count, or add 4 pods if that is more, every 15 seconds. This protects against sudden traffic spikes where a slow response would immediately degrade your service.
Scale-down is slow. After utilisation drops below the target, the HPA waits until it has stayed below target for 5 minutes (300 seconds) before reducing replicas.
The 5-minute stabilisation window prevents thrashing: a loop where the HPA scales down, traffic bounces back, it scales up again, and the cycle repeats continuously. Thrashing wastes resources and creates instability. The slow default is intentional protection, not a limitation.
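The stabilisation window can be modelled as "act on the highest recommendation seen during the window". The sketch below is a simplified model of that idea rather than the controller's code; the function name and the recommendation history are made up for illustration.

```python
def stabilised_replicas(recommendations, now, window_seconds=300):
    """Return the highest replica recommendation made within the window.

    recommendations is a list of (timestamp_seconds, desired_replicas) pairs
    and must contain at least one entry inside the window.
    """
    recent = [
        replicas
        for timestamp, replicas in recommendations
        if 0 <= now - timestamp <= window_seconds
    ]
    return max(recent)


# Hypothetical history: load drops, briefly bounces back at t=180, then stays low.
history = [(0, 8), (60, 4), (120, 3), (180, 6), (240, 3), (300, 3)]

print(stabilised_replicas(history, now=300))  # 8: the t=0 reading is still in the window
print(stabilised_replicas(history, now=330))  # 6: the bounce at t=180 still counts
print(stabilised_replicas(history, now=540))  # 3: only low readings remain
```

The bounce at t=180 is exactly the case the window protects against: had the HPA dropped to 3 pods at t=120, it would have had to scale straight back up a minute later.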
You can tune both behaviours in autoscaling/v2:
    spec:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0 # Scale up immediately
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15 # Can double replicas every 15 seconds
        scaleDown:
          stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
          policies:
          - type: Pods
            value: 2
            periodSeconds: 60 # Remove at most 2 pods per minute
The scaleDown.policies entry limits the HPA to removing at most 2 pods per minute, even after the stabilisation window passes. This is useful for services where rapid pod removal could affect in-flight requests.
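To see what the 2-pods-per-minute limit means in practice, the sketch below steps through a scale-down from 10 pods to 3. It is a simplified model: the function name is an assumption, time zero is the moment the stabilisation window has passed, and it assumes the desired count stays at 3 throughout.

```python
def scale_down_schedule(current, desired, pods_per_period=2, period_seconds=60):
    """Return (elapsed_seconds, replicas) steps under a Pods-type scale-down policy."""
    steps = []
    elapsed = 0
    while current > desired:
        # Remove at most pods_per_period pods in each period.
        current = max(desired, current - pods_per_period)
        steps.append((elapsed, current))
        elapsed += period_seconds
    return steps


print(scale_down_schedule(10, 3))
# [(0, 8), (60, 6), (120, 4), (180, 3)]
```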
Custom metrics
CPU and memory cover most use cases, but some applications scale better on application-level signals. The autoscaling/v2 API supports custom metrics via the Custom Metrics API and External Metrics API.
A common GKE example is scaling based on the depth of a Cloud Pub/Sub subscription queue. If messages are accumulating, consumers are falling behind; scale out. If the queue is empty, scale in.
This requires a metrics adapter that exposes Pub/Sub metrics from Cloud Monitoring to the Kubernetes External Metrics API:
    spec:
      metrics:
      - type: External
        external:
          metric:
            name: pubsub.googleapis.com|subscription|num_undelivered_messages
            selector:
              matchLabels:
                resource.labels.subscription_id: my-subscription
          target:
            type: AverageValue
            averageValue: "100" # Target: 100 undelivered messages per pod
With 500 undelivered messages and 2 pods, the HPA calculates ceil(500 ÷ 100) = 5 and scales up to 5 pods. Other useful custom metrics include HTTP requests per second from a load balancer or queue depth from Redis.
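The AverageValue arithmetic is simpler than the utilisation formula because the current pod count drops out: the total is divided by the per-pod target. A minimal sketch, with a hypothetical function name:

```python
import math


def desired_for_average_value(total_metric_value, target_per_pod):
    """Replicas needed so that each pod handles at most target_per_pod."""
    return math.ceil(total_metric_value / target_per_pod)


# 500 undelivered messages with a target of 100 per pod
print(desired_for_average_value(500, 100))  # 5
```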
When to use Horizontal Pod Autoscaling
HPA is a strong fit for:
- Stateless web applications. HTTP services where any pod can handle any request are the ideal case. HPA was built for this.
- API backends. REST or gRPC services with variable request rates across the day can scale up during business hours and scale down overnight.
- Worker deployments consuming queues. Workers pulling from a Pub/Sub topic or task queue can scale based on queue depth using custom metrics, keeping processing latency stable under variable load.
- Batch processing frontends. Services that receive work in bursts benefit from HPA scaling up quickly when a batch arrives, then scaling back down once it is processed.
- Services with predictable daily traffic patterns. Even if you know traffic is higher during the day, HPA handles the exact timing automatically with no cron-based scaling needed.
If your workload is stateless and receives variable traffic, you should probably be using HPA.
When HPA is a poor fit
HPA is not the right tool for every workload:
- Stateful workloads with non-trivial state management. Databases, caches, and other stateful services often cannot simply add more replicas without coordination. Scaling a StatefulSet with HPA is technically possible but requires careful thought about data sharding and consistency.
- Workloads with poor or missing metrics. If your pods have no resource requests and expose no meaningful custom metrics, HPA cannot make useful scaling decisions. Fix the metrics problem first.
- Workloads that need scale-to-zero. The HPA cannot scale below minReplicas: 1. If you need a deployment to run zero pods until an event triggers it (for example, a batch job that only runs when a queue has messages), standard HPA cannot do this. Look at KEDA instead, covered below.
If your pod takes 3 to 5 minutes to start and accept traffic, HPA may add capacity too slowly to help during a fast spike. For slow-starting services, a higher minReplicas is often a better answer than aggressive autoscaling. Keeping spare capacity warm is cheaper than an outage.
HPA vs Cluster Autoscaler
These two tools operate at different levels and are designed to work together, not as alternatives.
| | HPA | Cluster Autoscaler |
|---|---|---|
| What it scales | Pod replicas | Cluster nodes |
| Trigger | CPU, memory, or custom metrics | Unschedulable pods or underutilised nodes |
| Where it runs | Kubernetes control plane | GKE control plane integration |
| GKE Standard | Built-in, always available | Must be enabled per node pool |
| GKE Autopilot | Built-in, always available | Handled automatically by Google |
The HPA operates at the pod level. But if the cluster does not have enough node capacity to schedule those pods, they stay in Pending state. This is where the Cluster Autoscaler comes in. It watches for unschedulable pods and provisions new nodes to accommodate them.
On GKE Standard, enable the Cluster Autoscaler on a node pool:
    gcloud container clusters update my-cluster \
      --enable-autoscaling \
      --min-nodes=1 \
      --max-nodes=10 \
      --node-pool=default-pool \
      --region=europe-west2
With both enabled, the full scaling sequence looks like this:
1. Traffic increases, CPU utilisation rises above target.
2. HPA creates additional pods.
3. New pods cannot be scheduled: not enough node capacity.
4. Cluster Autoscaler detects unschedulable pods and adds nodes.
5. New nodes become ready, pods are scheduled, utilisation normalises.
On GKE Autopilot, node management is handled by Google. You configure the HPA and GKE handles the rest.
For HPA and Cluster Autoscaler to work well together, accurate resource requests are essential. The Cluster Autoscaler uses requests (not actual usage) to judge whether a node is underutilised and safe to remove. Pods with no resource requests appear to consume zero resources; the CA may incorrectly remove nodes that are actually busy.
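The effect of missing requests on node removal can be shown with a small model. It is a sketch only: the function names, the 4000m node size and the 0.5 threshold are illustrative assumptions, and the real Cluster Autoscaler weighs more than CPU requests when deciding whether a node can go.

```python
def requested_cpu_fraction(pod_requests_millicores, node_allocatable_millicores):
    """Fraction of a node's allocatable CPU claimed by pod requests.

    A pod with no CPU request is represented as None and counts as zero.
    """
    requested = sum(request or 0 for request in pod_requests_millicores)
    return requested / node_allocatable_millicores


def looks_removable(pod_requests_millicores, node_allocatable_millicores, threshold=0.5):
    """True if the node appears underutilised when judged by requests alone."""
    fraction = requested_cpu_fraction(pod_requests_millicores, node_allocatable_millicores)
    return fraction < threshold


# The same three busy pods on a hypothetical 4000m node, with and without requests.
print(looks_removable([1000, 1000, 1000], 4000))  # False: 75% of the node is requested
print(looks_removable([None, None, None], 4000))  # True: the node looks empty
```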
HPA vs KEDA
KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that extends the Kubernetes autoscaling system with support for event-driven metrics and scale-to-zero.
| | HPA | KEDA |
|---|---|---|
| Scale to zero | No (minimum 1 pod) | Yes |
| Metrics sources | CPU, memory, Custom Metrics API | 60+ event sources (Pub/Sub, Kafka, RabbitMQ, HTTP, etc.) |
| Setup | Built in, no install needed | Requires installing the KEDA operator |
| Best for | Stateless services with steady traffic | Event-driven workloads, batch jobs, scale-to-zero scenarios |
KEDA does not replace HPA for general-purpose autoscaling; it extends it. For a standard web service scaling on CPU, the built-in HPA is simpler and sufficient. Where KEDA shines is workloads that should run zero pods when idle, or workloads that need to scale based on external events like a Pub/Sub queue depth or Kafka consumer lag without a custom metrics adapter.
Common beginner mistakes
- Not setting resource requests on containers. Without resources.requests.cpu, the HPA cannot calculate CPU utilisation and reports <unknown> in the TARGETS column. This is the single most common reason HPA appears not to work. Fix it by setting both CPU and memory requests on every container in the Deployment.
- Setting minReplicas to 1. A minimum of 1 pod means the HPA can scale down to a single instance with no redundancy. If that pod crashes or its node fails, your service is completely down. Set minReplicas to at least 2 for any production workload.
- Setting the CPU target too low. A target of 20 to 30% CPU causes the HPA to scale out aggressively at the slightest load, wasting resources. Most stateless applications handle 60 to 80% CPU without performance problems. Start at 70% and adjust based on observed response latency.
- Manually scaling a Deployment that has an HPA. Running kubectl scale deployment my-app --replicas=10 will be overridden by the HPA within 15 to 30 seconds. The HPA owns spec.replicas while active. To override, remove the HPA or set minReplicas and maxReplicas to the same value.
- Not enabling the Cluster Autoscaler on GKE Standard. HPA can request more pods than the cluster has capacity for. Without the Cluster Autoscaler, those pods stay in Pending indefinitely. On GKE Autopilot, node scaling is handled automatically.
Best practices
- Start with a realistic CPU target. 70% is a reasonable starting point for most stateless applications. Adjust based on your observed latency at different utilisation levels, not by guessing.
- Set sensible min and max replicas. Minimum replicas should ensure availability (at least 2 for production). Maximum replicas should reflect the highest load you have seen, with headroom, but cap it to avoid runaway scaling from a bug.
- Test scaling under load before relying on it. Use a load testing tool to verify the HPA responds as expected before a real traffic spike. Check that Cluster Autoscaler adds nodes in time if needed.
- Monitor latency, not just CPU. CPU utilisation is a proxy. What you actually care about is response time. Set up latency-based alerting in Cloud Monitoring alongside HPA to catch cases where CPU looks fine but users experience slow responses.
- Combine HPA with Cluster Autoscaler on GKE Standard. HPA is not useful without enough node capacity to schedule the pods it creates. Enable the Cluster Autoscaler on your node pools.
- Use accurate resource requests. Padded or guessed requests make utilisation calculations meaningless and cause HPA to behave unpredictably. Profile your application under realistic load to set accurate values.
- Use the autoscaling/v2 API. It supports multiple metrics, fine-grained behaviour tuning, and is the current stable API. Avoid the older v1 API, which supports only CPU and has no behavior block.
If you are setting up HPA for the first time: start with CPU only at 70%, set minReplicas: 2 and a generous maxReplicas, then run a load test. Watch kubectl get hpa -w while the test runs. Tune from there once you can see how your specific application responds.
Summary
- HPA automatically adjusts pod replica counts based on CPU, memory, or custom metrics, keeping utilisation close to your target without manual intervention.
- It runs a reconciliation loop every 15 seconds: collect metrics, calculate desired replicas, update the Deployment if needed.
- Resource requests (resources.requests.cpu) are required for CPU-based HPA. Without them, utilisation reports as unknown and no scaling occurs.
- Create an HPA imperatively with kubectl autoscale for testing, or declaratively with a HorizontalPodAutoscaler YAML using the autoscaling/v2 API for production.
- Scale-up is aggressive by default; scale-down has a 5-minute stabilisation window to prevent thrashing.
- HPA scales pods; Cluster Autoscaler scales nodes. Both work together on GKE Standard. On GKE Autopilot, node scaling is automatic.
- HPA cannot scale to zero. For scale-to-zero and event-driven autoscaling, use KEDA.
Frequently asked questions
What is Horizontal Pod Autoscaling in Kubernetes?
Horizontal Pod Autoscaling (HPA) is a built-in Kubernetes feature that automatically adjusts the number of running Pod replicas based on observed resource usage. When CPU utilisation rises above your target, the HPA adds more Pods. When utilisation drops, it removes them. On GKE, the HPA works out of the box using metrics-server, which is pre-installed in every cluster.
Why does my HPA show 'unknown' for CPU usage?
The most common cause is missing resource requests on the Deployment. The HPA calculates CPU utilisation as a percentage of the Pod's requested CPU. If resources.requests.cpu is not set, the HPA has no baseline to calculate against, so it reports <unknown>. Set CPU and memory requests on every container in your Deployment and the HPA will start reporting real values within about a minute.
Can HPA scale based on memory?
Yes. The autoscaling/v2 API supports memory as a target metric alongside CPU. You can configure both at once and the HPA will scale to satisfy whichever metric requires the most Pods. Memory-based scaling is less common because many applications hold memory regardless of load, making it a less reliable signal than CPU for most stateless workloads.
What is the difference between HPA and Cluster Autoscaler?
HPA scales Pods: it increases or decreases the number of replicas inside your cluster. Cluster Autoscaler scales nodes: it adds or removes virtual machines in response to Pods that cannot be scheduled due to insufficient capacity. Both work together. HPA requests more Pods, and Cluster Autoscaler provisions the nodes needed to run them. On GKE Autopilot, node scaling is handled automatically by Google.
Can HPA scale to zero?
No. The minimum value for minReplicas is 1. The standard HPA always keeps at least one Pod running. If you need scale-to-zero, for example for event-driven workloads or batch jobs that only run when triggered, look at KEDA (Kubernetes Event-Driven Autoscaling), which extends the HPA to support zero-replica scaling based on external event sources.