Horizontal Pod Autoscaling in GKE: Kubernetes HPA Explained for Beginners

Horizontal Pod Autoscaling (HPA) is how Kubernetes automatically adjusts the number of running pods to match actual demand. When traffic rises and CPU climbs, the HPA adds more pods. When demand drops, it removes them. On GKE, it works out of the box with no extra setup required. This guide explains what HPA is, how it calculates scaling decisions, how to configure it correctly, and what to watch out for.

What is Horizontal Pod Autoscaling?

Horizontal Pod Autoscaling is a built-in Kubernetes controller that watches a Deployment (or StatefulSet, or ReplicaSet) and automatically changes the number of running pod replicas based on observed metrics.

“Horizontal” means adding or removing pod replicas: more instances of your application running in parallel. This is different from vertical scaling, which changes the CPU or memory limits on existing pods.

The HPA checks metrics roughly every 15 seconds. If utilisation is above your target, it creates more pods, up to your configured maximum. If utilisation is below the target, it reduces the count, but never below your configured minimum.

On GKE

The HPA uses metrics-server, which is pre-installed in every GKE cluster. You do not need to deploy anything extra to start using CPU or memory-based autoscaling. Just set resource requests on your containers and create an HPA.

Horizontal Pod Autoscaling in simple terms

Analogy

Think of a supermarket during a Saturday rush. When queues get long, the manager opens more checkout lanes. When the store quiets down, some cashiers go on break. The manager does not guess in advance. They respond to what they can actually see: how long the queues are right now.

HPA works the same way. Instead of checkout lanes, it manages pod replicas. Instead of queue length, it watches CPU utilisation (or memory, or custom metrics). Instead of a manager making calls every hour, it checks automatically every 15 seconds.

The key idea: you stop guessing how many pods you need and let the cluster respond to real demand. Your application gets more capacity when it needs it, and stops paying for idle capacity when it does not.

Why HPA matters

Without HPA, you face two bad options: over-provision (run too many pods permanently and waste money) or under-provision (run too few and get hammered during traffic spikes).

HPA gives you a third option: scale dynamically with demand.

  • Handles traffic spikes automatically. A news story breaks, a sale goes live, a cron job fires off thousands of requests. HPA starts adding pods within seconds of the metric crossing your target, without anyone being paged.
  • Reduces idle resource cost. At 3am when traffic is low, HPA scales down to your minimum replica count. You are not running ten pods when two would do.
  • Improves resilience. Setting minReplicas to 2 or more means your service always has redundancy. One pod crashing does not take down your whole application.
  • Removes manual toil. Engineers stop watching dashboards and manually running kubectl scale. The HPA handles it.

For stateless web applications and API backends, the most common workloads on GKE, HPA is the standard first tool for scaling.

How Horizontal Pod Autoscaling works

The HPA runs a continuous reconciliation loop inside the Kubernetes control plane. Here is what happens on each cycle:

1. Metrics collection

Every 15 seconds, the HPA queries metrics-server for current resource usage across all pods in the target Deployment. Metrics-server aggregates CPU and memory usage from the kubelet on each node.

2. Utilisation calculation

The HPA calculates current average utilisation as a percentage of resource requests across all pods.

For CPU: (total current CPU across all pods) ÷ (total requested CPU across all pods) × 100

If you have three pods each requesting 250m CPU and they are collectively using 525m, average utilisation is 70%.
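As a rough sketch, the same calculation can be written in Python. The function name and the per-pod usage figures are illustrative assumptions chosen to add up to 525m; they are not part of Kubernetes.

```python
def average_utilisation(usage_millicores, request_millicores):
    """Return CPU usage as a percentage of the total CPU requested."""
    total_usage = sum(usage_millicores)
    total_requested = sum(request_millicores)
    if total_requested == 0:
        # No requests set: there is no denominator, so utilisation is undefined.
        raise ValueError("pods have no CPU requests")
    return total_usage * 100 / total_requested


# Three pods, each requesting 250m, using 525m between them.
print(average_utilisation([200, 175, 150], [250, 250, 250]))  # 70.0
```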

3. Desired replica calculation

The HPA uses this formula:

desiredReplicas = ceil( currentReplicas × (currentUtilisation ÷ targetUtilisation) )

If you have 3 pods at 70% utilisation and your target is 50%, the HPA calculates ceil(3 × (70 ÷ 50)) = ceil(4.2) = 5. It will scale up to 5 pods.
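The formula can be sketched in a few lines of Python. The function name is hypothetical. The tolerance parameter reflects the upstream Kubernetes default of 10%, under which the controller skips scaling when the ratio is close to 1; treat this as a simplified model rather than the controller's actual code.

```python
import math


def desired_replicas(current_replicas, current_utilisation, target_utilisation,
                     tolerance=0.1):
    """Apply the HPA formula, skipping changes inside the tolerance band."""
    ratio = current_utilisation / target_utilisation
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 70, 50))  # 5
print(desired_replicas(3, 52, 50))  # 3 (ratio 1.04 is inside the tolerance)
```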

4. Scaling action

The HPA clamps the desired count to your configured minReplicas and maxReplicas bounds. If the result differs from the current count, the HPA updates spec.replicas on the Deployment. The Deployment controller then creates or terminates pods to match.
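A minimal sketch of how the configured bounds apply to a computed replica count. The helper name is an assumption for illustration.

```python
def clamp_replicas(desired, min_replicas, max_replicas):
    """Keep the computed replica count inside the HPA's configured bounds."""
    return max(min_replicas, min(desired, max_replicas))


print(clamp_replicas(5, 2, 10))   # 5  (inside the bounds, used as-is)
print(clamp_replicas(14, 2, 10))  # 10 (capped at maxReplicas)
print(clamp_replicas(1, 2, 10))   # 2  (raised to minReplicas)
```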

Analogy

The HPA works like a thermostat. You set a target temperature: say, 70% CPU utilisation. The thermostat checks the current reading every 15 seconds. Too hot? It spins up more cooling units (creates pods). Too cool? It switches some off (removes pods). It never goes below your minimum or above your maximum.

Resource requests: why HPA depends on them

CPU-based HPA requires that every container in the Deployment has resources.requests.cpu set. This is non-negotiable.

The HPA expresses CPU utilisation as a percentage of the requested CPU, not as an absolute value in millicores. If a pod requests 250m CPU and is currently using 175m, utilisation is 70%. Metrics-server only reports raw usage. Without a request value, the HPA has no denominator for the percentage and reports the metric as unknown.

Watch out

If the TARGETS column shows <unknown>/70%, missing resource requests are almost always the cause. The fix is straightforward: add resources.requests.cpu and resources.requests.memory to every container in your Deployment. The HPA will start reporting real values within about a minute.

Here is a Deployment correctly configured for use with HPA:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: us-docker.pkg.dev/my-project/my-repo/my-app:v1
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

Both requests and limits are set here. The HPA uses requests to calculate utilisation. The limits cap what any individual pod can consume. Setting both is good practice regardless of whether you use HPA.
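If you want to check a manifest before applying it, something like the following sketch could flag containers that lack a CPU request. It assumes the Deployment has already been parsed into a Python dict (for example from kubectl get deployment my-app -o json). The function name and the "sidecar" container are hypothetical.

```python
def containers_missing_cpu_request(deployment):
    """Return the names of containers without resources.requests.cpu."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    missing = []
    for container in containers:
        requests = (container.get("resources") or {}).get("requests") or {}
        if "cpu" not in requests:
            missing.append(container["name"])
    return missing


deployment = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "my-app",
         "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
        {"name": "sidecar"},  # hypothetical container with no requests
    ]}}}
}
print(containers_missing_cpu_request(deployment))  # ['sidecar']
```

An empty list means every container has a CPU request, which is what CPU-based HPA needs.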

How to create an HPA in GKE

Quick test with kubectl

For testing or quick experimentation, the imperative command is the fastest way to create an HPA:

kubectl autoscale deployment my-app \
  --cpu-percent=70 \
  --min=2 \
  --max=10

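This creates an HPA targeting 70% average CPU utilisation, with a minimum of 2 pods and a maximum of 10. Unless you pass --name, the HPA is named after the Deployment (my-app here). The command is useful for testing but does not belong in production config, because an imperative command leaves no manifest to track in version control.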

Production example with YAML

For production, use a HorizontalPodAutoscaler manifest with the autoscaling/v2 API. This is the current stable API on GKE and supports multiple metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

When both CPU and memory metrics are configured, the HPA calculates a desired replica count for each metric independently and uses whichever is higher. If CPU requires 4 pods and memory requires 6, the HPA scales to 6.
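The "highest wins" rule can be sketched as follows. The function names and the utilisation figures are illustrative assumptions, and the sketch ignores tolerance and replica bounds.

```python
import math


def desired_for_metric(current_replicas, current_utilisation, target_utilisation):
    """Desired replicas for one metric, using the standard HPA formula."""
    return math.ceil(current_replicas * current_utilisation / target_utilisation)


def desired_across_metrics(current_replicas, metrics):
    """metrics maps a metric name to (current, target) utilisation percentages."""
    per_metric = {
        name: desired_for_metric(current_replicas, current, target)
        for name, (current, target) in metrics.items()
    }
    # The HPA scales to the largest of the per-metric recommendations.
    return max(per_metric.values()), per_metric


replicas, detail = desired_across_metrics(4, {"cpu": (70, 70), "memory": (120, 80)})
print(detail)    # {'cpu': 4, 'memory': 6}
print(replicas)  # 6
```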

Apply the manifest with:

kubectl apply -f my-app-hpa.yaml

Tip

Always use autoscaling/v2 for new HPAs. The older autoscaling/v1 only supports CPU and lacks the behavior block for tuning scale-up and scale-down speeds. The v2 API is stable and the default on all current GKE versions.

How to check whether HPA is working

After creating an HPA, these commands tell you what it is doing and why.

List all HPAs in the namespace:

kubectl get hpa

Example output:

NAME          REFERENCE              TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
my-app-hpa    Deployment/my-app      45%/70%     2         10        3          12m

The TARGETS column shows current/target. 45%/70% means current CPU is 45% against a target of 70%. The HPA is satisfied and not scaling.

Get a detailed view with scaling history:

kubectl describe hpa my-app-hpa

This output includes the current metric values, recent scaling decisions, and an event log showing exactly when the HPA scaled and why. This is your first diagnostic tool when something unexpected happens.

Common warning signs to look for:

Diagnostic checklist
  • <unknown> in TARGETS: missing resource requests on the container
  • Pending pods after a scale-up: not enough node capacity; check if Cluster Autoscaler is enabled
  • Replicas stuck at maxReplicas: your maximum may be too low, or the application needs optimisation
  • Scale-down not happening: the 5-minute stabilisation window is likely still active
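Two of these checks can be automated against the JSON form of an HPA (kubectl get hpa my-app-hpa -o json). The sketch below assumes the autoscaling/v2 status layout, with a ScalingActive condition and a currentReplicas field. The sample object is hand-written for illustration and the function name is hypothetical.

```python
def diagnose_hpa(hpa):
    """Return a list of warnings for one HPA object given as a parsed dict."""
    warnings = []
    status = hpa.get("status", {})
    for condition in status.get("conditions", []):
        # ScalingActive=False usually means the HPA cannot read its metrics.
        if condition["type"] == "ScalingActive" and condition["status"] == "False":
            warnings.append("metrics unavailable: " + condition.get("reason", "unknown"))
    if status.get("currentReplicas") == hpa["spec"]["maxReplicas"]:
        warnings.append("replicas are at maxReplicas")
    return warnings


sample = {  # hand-written sample, not real cluster output
    "spec": {"minReplicas": 2, "maxReplicas": 10},
    "status": {
        "currentReplicas": 10,
        "conditions": [
            {"type": "ScalingActive", "status": "False",
             "reason": "FailedGetResourceMetric"},
        ],
    },
}
for warning in diagnose_hpa(sample):
    print(warning)
```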

Scale-up and scale-down behaviour

The HPA has asymmetric default behaviour, and this is deliberate.

Scale-up is fast. When utilisation exceeds the target, the HPA acts aggressively. By default it can add up to 4 pods or 100% of the current replica count, whichever is larger, every 15 seconds. This protects against sudden traffic spikes where a slow response would immediately degrade your service.
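To get a feel for how quickly replicas can grow, here is a simplified simulation of the upstream default scale-up policy (the larger of 4 pods or 100% of current replicas per 15-second period). The function is a hypothetical model and ignores pod start-up time and metric lag.

```python
import math


def scale_up_steps(current, desired, pods_per_period=4, percent_per_period=100):
    """Replica count after each 15-second period until desired is reached."""
    steps = []
    while current < desired:
        by_pods = current + pods_per_period
        by_percent = current + math.ceil(current * percent_per_period / 100)
        # The more permissive of the two policies applies on scale-up.
        current = min(desired, max(by_pods, by_percent))
        steps.append(current)
    return steps


print(scale_up_steps(2, 20))  # [6, 12, 20]
```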

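Scale-down is slow. By default the HPA applies a 5-minute (300-second) stabilisation window to scale-down: it only reduces replicas to the highest count recommended during that window, so utilisation has to stay below target for the full five minutes before the replica count drops.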

Why scale-down is slow

The 5-minute stabilisation window prevents thrashing: a loop where the HPA scales down, traffic bounces back, it scales up again, and the cycle repeats continuously. Thrashing wastes resources and creates instability. The slow default is intentional protection, not a limitation.
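One way to picture the window: the HPA remembers its recent recommendations and, when scaling down, uses the highest one from the last 300 seconds. The sketch below models that idea with made-up timestamps and recommendations. It is a simplification, not the controller's code.

```python
def stabilised_recommendation(history, now, window_seconds=300):
    """history is a list of (timestamp_seconds, recommended_replicas) pairs."""
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    # Scale-down uses the highest recommendation still inside the window.
    return max(recent)


history = [(0, 8), (60, 5), (120, 4), (180, 4), (240, 4), (300, 4)]
print(stabilised_recommendation(history, now=300))  # 8 (the t=0 value still counts)
history.append((330, 4))
print(stabilised_recommendation(history, now=330))  # 5 (the t=0 value has expired)
```

A brief dip in load therefore does not remove pods. The older, higher recommendation holds until it ages out of the window.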

You can tune both behaviours in autoscaling/v2:

spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15            # Can double replicas every 15 seconds
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60            # Remove at most 2 pods per minute

The scaleDown.policies entry limits the HPA to removing at most 2 pods per minute, even after the stabilisation window passes. This is useful for services where rapid pod removal could affect in-flight requests.
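As an illustration of that rate limit, the following sketch lists the replica count after each 60-second period once the stabilisation window has passed. The function name and the starting numbers are assumptions.

```python
def scale_down_steps(current, desired, max_pods_per_period=2):
    """Replica count after each period when removal is capped per period."""
    steps = []
    while current > desired:
        current = max(desired, current - max_pods_per_period)
        steps.append(current)
    return steps


print(scale_down_steps(10, 3))  # [8, 6, 4, 3]
```

Going from 10 pods to 3 takes four periods under this policy, rather than a single step.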

Custom metrics

CPU and memory cover most use cases, but some applications scale better on application-level signals. The autoscaling/v2 API supports custom metrics via the Custom Metrics API and External Metrics API.

A common GKE example is scaling based on the depth of a Cloud Pub/Sub subscription queue. If messages are accumulating, consumers are falling behind; scale out. If the queue is empty, scale in.

This requires a metrics adapter that exposes Pub/Sub metrics from Cloud Monitoring to the Kubernetes External Metrics API:

spec:
  metrics:
    - type: External
      external:
        metric:
          name: pubsub.googleapis.com|subscription|num_undelivered_messages
          selector:
            matchLabels:
              resource.labels.subscription_id: my-subscription
        target:
          type: AverageValue
          averageValue: "100"    # Target: 100 undelivered messages per pod

With 500 undelivered messages and 2 pods, the HPA calculates ceil(500 ÷ 100) = 5 and scales up to 5 pods. Other useful custom metrics include HTTP requests per second from a load balancer or queue depth from Redis.
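For an AverageValue target, the arithmetic reduces to dividing the total metric value by the per-pod target and rounding up. A minimal sketch, with a hypothetical function name:

```python
import math


def replicas_for_average_value(total_metric_value, target_per_pod):
    """Replicas needed so each pod handles at most target_per_pod."""
    return math.ceil(total_metric_value / target_per_pod)


print(replicas_for_average_value(500, 100))  # 5
print(replicas_for_average_value(120, 100))  # 2
print(replicas_for_average_value(0, 100))    # 0, which the HPA raises to minReplicas
```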

When to use Horizontal Pod Autoscaling

HPA is a strong fit for:

  • Stateless web applications. HTTP services where any pod can handle any request are the ideal case. HPA was built for this.
  • API backends. REST or gRPC services whose request rates vary across the day can scale up during business hours and back down overnight.
  • Worker deployments consuming queues. Workers pulling from a Pub/Sub topic or task queue can scale based on queue depth using custom metrics, keeping processing latency stable under variable load.
  • Batch processing frontends. Services that receive work in bursts benefit from HPA scaling up quickly when a batch arrives, then scaling back down once it is processed.
  • Services with predictable daily traffic patterns. Even if you know traffic is higher during the day, HPA handles the exact timing automatically with no cron-based scaling needed.

If your workload is stateless and receives variable traffic, you should probably be using HPA.

When HPA is a poor fit

HPA is not the right tool for every workload:

  • Stateful workloads with non-trivial state management. Databases, caches, and other stateful services often cannot simply add more replicas without coordination. Scaling a StatefulSet with HPA is technically possible but requires careful thought about data sharding and consistency.
  • Workloads with poor or missing metrics. If your pods have no resource requests and expose no meaningful custom metrics, HPA cannot make useful scaling decisions. Fix the metrics problem first.
  • Workloads that need scale-to-zero. The HPA cannot scale below minReplicas: 1. If you need a deployment to run zero pods until an event triggers it (for example, a batch job that only runs when a queue has messages), standard HPA cannot do this. Look at KEDA instead, covered below.

Slow-starting workloads

If your pod takes 3 to 5 minutes to start and accept traffic, HPA may add capacity too slowly to help during a fast spike. For slow-starting services, a higher minReplicas is often a better answer than aggressive autoscaling. Keeping spare capacity warm is cheaper than an outage.

HPA vs Cluster Autoscaler

These two tools operate at different levels and are designed to work together, not as alternatives.

                HPA                              Cluster Autoscaler
What it scales  Pod replicas                     Cluster nodes
Trigger         CPU, memory, or custom metrics   Unschedulable pods or underutilised nodes
Where it runs   Kubernetes control plane         GKE control plane integration
GKE Standard    Built-in, always available       Must be enabled per node pool
GKE Autopilot   Built-in, always available       Handled automatically by Google

The HPA operates at the pod level. But if the cluster does not have enough node capacity to schedule those pods, they stay in Pending state. This is where the Cluster Autoscaler comes in. It watches for unschedulable pods and provisions new nodes to accommodate them.

On GKE Standard, enable the Cluster Autoscaler on a node pool:

gcloud container clusters update my-cluster \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10 \
  --node-pool=default-pool \
  --region=europe-west2

With both enabled, the full scaling sequence looks like this:

  1. Traffic increases, CPU utilisation rises above target.
  2. HPA creates additional pods.
  3. New pods cannot be scheduled: not enough node capacity.
  4. Cluster Autoscaler detects unschedulable pods and adds nodes.
  5. New nodes become ready, pods are scheduled, utilisation normalises.
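For step 4, a back-of-the-envelope estimate of how many nodes the Cluster Autoscaler would need to add can be sketched as below. It assumes identical pods, CPU as the only constraint, and a hypothetical node with 1900m allocatable CPU. The real autoscaler also considers memory, other pods on the node, and scheduling rules.

```python
import math


def extra_nodes_needed(pending_pods, pod_cpu_request_m, node_allocatable_cpu_m):
    """Estimate new nodes required to fit the pending pods, by CPU only."""
    pods_per_node = node_allocatable_cpu_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pending_pods / pods_per_node)


# 9 pending pods of 250m each; 7 fit on one hypothetical 1900m node.
print(extra_nodes_needed(9, 250, 1900))  # 2
```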

On GKE Autopilot, node management is handled by Google. You configure the HPA and GKE handles the rest.

Note

For HPA and Cluster Autoscaler to work well together, accurate resource requests are essential. The Cluster Autoscaler uses requests (not actual usage) to judge whether a node is underutilised and safe to remove. Pods with no resource requests appear to consume zero resources, so the Cluster Autoscaler may remove nodes that are actually busy.

HPA vs KEDA

KEDA (Kubernetes Event-Driven Autoscaling) is an open-source project that extends the Kubernetes autoscaling system with support for event-driven metrics and scale-to-zero.

                 HPA                                      KEDA
Scale to zero    No (minimum 1 pod)                       Yes
Metrics sources  CPU, memory, Custom Metrics API          60+ event sources (Pub/Sub, Kafka, RabbitMQ, HTTP, etc.)
Setup            Built in, no install needed              Requires installing the KEDA operator
Best for         Stateless services with steady traffic   Event-driven workloads, batch jobs, scale-to-zero scenarios

KEDA does not replace HPA for general-purpose autoscaling; it extends it. For a standard web service scaling on CPU, the built-in HPA is simpler and sufficient. Where KEDA shines is workloads that should run zero pods when idle, or workloads that need to scale based on external events like a Pub/Sub queue depth or Kafka consumer lag without a custom metrics adapter.

Common beginner mistakes

  1. Not setting resource requests on containers. Without resources.requests.cpu, the HPA cannot calculate CPU utilisation and reports <unknown> in the TARGETS column. This is the single most common reason HPA appears not to work. Fix it by setting both CPU and memory requests on every container in the Deployment.
  2. Setting minReplicas to 1. A minimum of 1 pod means the HPA can scale down to a single instance with no redundancy. If that pod crashes or its node fails, your service is completely down. Set minReplicas to at least 2 for any production workload.
  3. Setting the CPU target too low. A target of 20 to 30% CPU causes the HPA to scale out aggressively at the slightest load, wasting resources. Most stateless applications handle 60 to 80% CPU without performance problems. Start at 70% and adjust based on observed response latency.
  4. Manually scaling a Deployment that has an HPA. Running kubectl scale deployment my-app --replicas=10 will be overridden by the HPA within 15 to 30 seconds. The HPA owns spec.replicas while active. To override, remove the HPA or set minReplicas and maxReplicas to the same value.
  5. Not enabling the Cluster Autoscaler on GKE Standard. HPA can request more pods than the cluster has capacity for. Without the Cluster Autoscaler, those pods stay in Pending indefinitely. On GKE Autopilot, node scaling is handled automatically.
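To see why a low CPU target (mistake 3) is expensive, the sketch below estimates the steady-state replica count for the same load at different targets. The load figure of 1400m total CPU and the function name are assumptions for illustration.

```python
import math


def replicas_needed(total_cpu_demand_m, request_m, target_percent):
    """Replicas required to keep average utilisation at or below the target."""
    # From the HPA's point of view each pod is "full" at request * target.
    return math.ceil(total_cpu_demand_m * 100 / (request_m * target_percent))


for target in (25, 50, 70):
    print(target, replicas_needed(1400, 250, target))
# 25 23
# 50 12
# 70 8
```

Under these assumptions, the same workload needs nearly three times as many pods at a 25% target as at 70%.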

Best practices

  • Start with a realistic CPU target. 70% is a reasonable starting point for most stateless applications. Adjust based on your observed latency at different utilisation levels, not by guessing.
  • Set sensible min and max replicas. Minimum replicas should ensure availability (at least 2 for production). Maximum replicas should reflect the highest load you have seen, with headroom, but cap it to avoid runaway scaling from a bug.
  • Test scaling under load before relying on it. Use a load testing tool to verify the HPA responds as expected before a real traffic spike. Check that Cluster Autoscaler adds nodes in time if needed.
  • Monitor latency, not just CPU. CPU utilisation is a proxy. What you actually care about is response time. Set up latency-based alerting in Cloud Monitoring alongside HPA to catch cases where CPU looks fine but users experience slow responses.
  • Combine HPA with Cluster Autoscaler on GKE Standard. HPA is not useful without enough node capacity to schedule the pods it creates. Enable the Cluster Autoscaler on your node pools.
  • Use accurate resource requests. Padded or guessed requests make utilisation calculations meaningless and cause HPA to behave unpredictably. Profile your application under realistic load to set accurate values.
  • Use the autoscaling/v2 API. It supports multiple metrics and fine-grained behaviour tuning, and it is the current stable API. The older autoscaling/v1 API is still served but only supports a CPU target, so prefer v2 for new HPAs.

Quick wins

If you are setting up HPA for the first time: start with CPU only at 70%, set minReplicas: 2 and a generous maxReplicas, then run a load test. Watch kubectl get hpa -w while the test runs. Tune from there once you can see how your specific application responds.

Frequently asked questions

What is Horizontal Pod Autoscaling in Kubernetes?

Horizontal Pod Autoscaling (HPA) is a built-in Kubernetes feature that automatically adjusts the number of running Pod replicas based on observed resource usage. When CPU utilisation rises above your target, the HPA adds more Pods. When utilisation drops, it removes them. On GKE, the HPA works out of the box using metrics-server, which is pre-installed in every cluster.

Why does my HPA show 'unknown' for CPU usage?

The most common cause is missing resource requests on the Deployment. The HPA calculates CPU utilisation as a percentage of the Pod's requested CPU. If resources.requests.cpu is not set, the HPA has no baseline to calculate against, so it reports 'unknown'. Set CPU and memory requests on every container in your Deployment and the HPA will start reporting real values within a minute.

Can HPA scale based on memory?

Yes. The autoscaling/v2 API supports memory as a target metric alongside CPU. You can configure both at once and the HPA will scale to satisfy whichever metric requires the most Pods. Memory-based scaling is less common because many applications hold memory regardless of load, making it a less reliable signal than CPU for most stateless workloads.

What is the difference between HPA and Cluster Autoscaler?

HPA scales Pods: it increases or decreases the number of replicas inside your cluster. Cluster Autoscaler scales nodes: it adds or removes virtual machines in response to Pods that cannot be scheduled due to insufficient capacity. Both work together. HPA requests more Pods, and Cluster Autoscaler provisions the nodes needed to run them. On GKE Autopilot, node scaling is handled automatically by Google.

Can HPA scale to zero?

No. The minimum value for minReplicas is 1. The standard HPA always keeps at least one Pod running. If you need scale-to-zero, for example for event-driven workloads or batch jobs that only run when triggered, look at KEDA (Kubernetes Event-Driven Autoscaling), which extends the HPA to support zero-replica scaling based on external event sources.

Last verified: 23 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.