GCP Load Balancer Backend Unhealthy? Fix Health Checks Fast

“Backend unhealthy” means the GCP load balancer tried to verify that your server is alive and the server did not respond correctly. The load balancer pulls that backend out of rotation and stops sending it traffic. If every backend is unhealthy, users see 502 or 503 errors because there is nowhere to send requests.

This guide covers the three situations you will hit most often:

All backends are unhealthy. Usually a firewall rule or health check configuration problem.
Some backends are unhealthy. Usually an application-level issue on specific instances.
Backends are healthy but users still get 502 or 503. The health check passes, but the app returns errors to real requests.

The fastest path is to check which situation applies to you, then jump to the matching fix below.

Simple explanation

Mental model

Think of a load balancer like a restaurant host. Before seating guests at a table, the host peeks into each dining room to make sure a waiter is actually there. If the host cannot see a waiter (the health check fails), that room gets closed off and no guests are seated there. The waiter might be in the room but standing behind a locked door (firewall). The waiter might be in a completely different room (wrong port). Or the waiter might have gone home (app crashed). The host does not care why. No visible waiter means no guests.

A backend is any server, VM, container, or service that sits behind a load balancer and handles actual requests. In GCP, backends are grouped in managed instance groups, unmanaged instance groups, or network endpoint groups (NEGs).

A health check is a small probe the load balancer sends to each backend on a regular interval. It might be an HTTP GET to /health, a TCP connection attempt, or an SSL handshake. The backend must respond within a timeout and return an acceptable status code.

When a backend fails enough consecutive health checks, the load balancer marks it unhealthy and stops routing traffic to it. This protects users from hitting a broken server. But it also means a misconfigured health check or a missing firewall rule can make a perfectly working backend look dead.

The load balancer is doing its job correctly. The fix is almost always in the firewall, the health check config, or the application itself.

How GCP health checks work

The load balancer sends a probe to each backend at a fixed interval (default: 5 seconds). Each probe must get a valid response within the configured timeout (default: 5 seconds).

Unhealthy threshold: If a backend fails this many consecutive probes (default: 2), it is marked UNHEALTHY and removed from rotation.

Healthy threshold: After you fix the problem, the backend must pass this many consecutive probes (default: 2) before the load balancer considers it healthy again and starts sending traffic to it.

Key distinction

Health check probes originate from GCP infrastructure, not from your VPC or your machine. They come from specific IP ranges. This is why a backend can respond to your own curl requests perfectly but still fail health checks. The probe traffic gets blocked at the firewall before it ever reaches the app.

A health check failure is not the same as an application error. A health check failure means the load balancer could not reach or verify the backend at all. An application returning 500 to user requests is a different problem: the backend is reachable, but the app has a bug. The load balancer can mark a backend as healthy (health check passes) while the app still returns 5xx errors to real user traffic.

When to use this guide

Use this guide if you see any of these symptoms:

The Google Cloud console shows backends as UNHEALTHY in your backend service or load balancer details
Users get 502 Bad Gateway or 503 Service Unavailable from the load balancer
A Cloud Run or serverless NEG backend is failing health checks
Some instances in a managed instance group pass while others fail
gcloud compute backend-services get-health returns UNHEALTHY for one or more backends
You just set up a new HTTP load balancer and traffic is not flowing

Fast diagnosis: start here

Run this command to see the current status of every backend in your backend service:

# Global backend service (external HTTP/S load balancers)
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

# Regional backend service (internal or regional external load balancers)
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --region=REGION

Then match your situation to the right starting point:

Symptom	Most likely cause	Start with
All backends UNHEALTHY	Missing or wrong firewall rule for health check IPs	Fix 1: Firewall rules
Some backends UNHEALTHY, others healthy	App issue on specific instances, wrong tags, or zone-level firewall differences	Fix 3: Backend and instance group mismatches
All backends healthy but users get 502	App crashes on real traffic, connection resets, or protocol mismatch	Backend unhealthy vs 502 vs 503
All backends healthy but users get 503	No capacity, backend at max connections, or backend draining	Backend unhealthy vs 502 vs 503
Cloud Run / serverless NEG unhealthy	Cloud Run service failing, wrong NEG config, or wrong region	Fix 3: NEG and Cloud Run section
Backend flapping between healthy and unhealthy	Timeout too short, app slow under load, or threshold set to 1	Fix 2: Health check settings
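
The first rows of the table can be sketched as a small triage helper. This is an illustration, not part of gcloud: triage_health is a hypothetical function name, and it assumes the default get-health output contains one "healthState:" line per instance or endpoint.

```shell
# Hypothetical helper: reads `get-health` output on stdin and reports
# which of the guide's situations applies.
# Assumption: each instance/endpoint appears as a "healthState: VALUE" line.
triage_health() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "healthState:") {
          total++
          if ($(i + 1) == "HEALTHY") healthy++
        }
      }
    }
    END {
      if (total == 0)           print "NO_BACKENDS_REPORTED"
      else if (healthy == 0)    print "ALL_UNHEALTHY"
      else if (healthy < total) print "SOME_UNHEALTHY"
      else                      print "ALL_HEALTHY"
    }'
}
```

A possible way to use it: gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global | triage_health. ALL_UNHEALTHY points at Fix 1, SOME_UNHEALTHY at Fix 3, and ALL_HEALTHY with user-facing errors at the 502 vs 503 section.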

Fix 1: Firewall rules for health checks

Most common cause

If all your backends show UNHEALTHY at the same time, start here. A missing health check firewall rule is the number one reason for all-backends-unhealthy on new load balancer setups.

GCP load balancer health checks originate from Google infrastructure using these IP ranges:

35.191.0.0/16
130.211.0.0/22

Your VPC firewall must have an ingress allow rule from both ranges on the port your health check probes. Without this rule, health check traffic is silently dropped and every backend appears unhealthy.

This applies to external and internal HTTP(S) load balancers and to other load balancers that use compute health checks from these ranges. External passthrough (TCP/UDP) network load balancers probe from a different set of ranges: 35.191.0.0/16, 209.85.152.0/22, and 209.85.204.0/22. Always verify the required ranges for your specific load balancer type in the GCP documentation.
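
One way to confirm that probes actually arrive is to look at the source IPs in your application's access log and test whether any fall inside the two ranges listed above. The sketch below is a minimal IPv4 range check. The function names are hypothetical, and it assumes a shell with 64-bit arithmetic such as bash.

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr IP NETWORK/PREFIX: succeeds if IP is inside the CIDR block.
in_cidr() {
  local ip_int net_int mask
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  mask=$(( (4294967295 << (32 - ${2#*/})) & 4294967295 ))
  [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}

# Succeeds if IP belongs to one of the health check ranges from this guide.
is_health_check_ip() {
  in_cidr "$1" 35.191.0.0/16 || in_cidr "$1" 130.211.0.0/22
}
```

If no logged source IP ever passes is_health_check_ip, the probes are most likely being dropped before they reach the app.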

Create the firewall rule

# Allow health check probes on specific ports
# Adjust --rules to match YOUR health check port
gcloud compute firewall-rules create allow-health-checks \
  --network=my-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --source-ranges=35.191.0.0/16,130.211.0.0/22 \
  --rules=tcp:80,tcp:443 \
  --target-tags=load-balanced-backend

Verify the rule applies to your backends

The --target-tags flag scopes this rule to VMs with the matching network tag. Your backend VMs must carry the same tag. If the tags do not match, the rule exists but does not apply to your backends.

# Check which tags are on your VM
gcloud compute instances describe VM_NAME --zone=ZONE \
  --format="value(tags.items)"

# Add the tag if missing
gcloud compute instances add-tags VM_NAME --zone=ZONE \
  --tags=load-balanced-backend
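
The tag comparison can be scripted. The sketch below assumes the value() format prints multiple tags as one semicolon-separated string (for example web;load-balanced-backend). has_tag is a hypothetical helper name.

```shell
# has_tag "TAGS_SEPARATED_BY_SEMICOLONS" WANTED_TAG
# Succeeds only on an exact tag match, not a substring match.
has_tag() {
  case ";$1;" in
    *";$2;"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

A possible use: pass the output of the describe command above as the first argument and the tag from your firewall rule as the second, then run add-tags only when has_tag fails.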

Verify the port matches

The firewall rule must allow the exact port the health check probes. If your health check is configured for port 8080 but the firewall rule only allows 80 and 443, the probes are still blocked.

# Check which port the health check probes
gcloud compute health-checks describe HEALTH_CHECK_NAME \
  --format="value(httpHealthCheck.port, httpsHealthCheck.port, tcpHealthCheck.port)"
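
To make the port comparison concrete, here is a minimal sketch that checks whether a TCP port is covered by a --rules value such as tcp:80,tcp:443 or tcp:8000-8100. port_allowed is a hypothetical helper. It only handles TCP entries, a bare tcp entry, and all.

```shell
# port_allowed RULES PORT
# RULES uses the same syntax as the --rules flag, e.g. "tcp:80,tcp:443".
port_allowed() {
  local rules=$1 port=$2 rule spec lo hi
  local IFS=,
  for rule in $rules; do
    case $rule in
      all|tcp) return 0 ;;          # no port list means every port
      tcp:*)   spec=${rule#tcp:} ;;
      *)       continue ;;          # ignore non-TCP entries
    esac
    lo=${spec%-*}                   # "8000-8100" -> 8000, "80" -> 80
    hi=${spec#*-}                   # "8000-8100" -> 8100, "80" -> 80
    if [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]; then
      return 0
    fi
  done
  return 1
}
```

With the rule created earlier, port_allowed "tcp:80,tcp:443" 8080 fails, which matches the blocked-probe case described above.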

Note

The --target-tags flag needs at least one tag as its value. If you omit --target-tags entirely (and set no target service accounts), the rule applies to all VMs in the network. Using a specific tag is best practice so the rule only affects load-balanced backends.

Fix 2: Health check port, path, protocol, timeout, and thresholds

Once the firewall rule is correct, the next most common cause is a mismatch between what the health check expects and what the application actually serves.

Analogy

Imagine calling a phone number to confirm a restaurant is open. If you dial the wrong number (wrong port), nobody picks up. If you call the right number but ask for the wrong department (wrong path), you get transferred to a dead line. If you hang up after two rings but the restaurant picks up on the third (timeout too short), you assume they are closed. Every part of the health check must match what the application actually does.

Wrong port

If the health check probes port 80 but your app listens on 8080, nothing answers on the probed port, so the probe is refused or times out. This is surprisingly common after changing application ports.

# Check the health check port
gcloud compute health-checks describe HEALTH_CHECK_NAME --format="yaml"

# Update to match your application port
gcloud compute health-checks update http HEALTH_CHECK_NAME --port=8080

Wrong path

The health check path must return HTTP 200 (OK). A path that returns 404 (not found) or 403 (forbidden) fails the check. Double-check that the path exists and is accessible without authentication.

Redirects (301 / 302)

Common trap

GCP health checks do not follow redirects. If your app redirects /health to /health/, or redirects HTTP to HTTPS, the health check receives a 301 or 302 and counts it as a failure. Point the health check to a path that returns 200 directly.

# Test from a VM in the same network to see what the path actually returns
curl -I http://BACKEND_IP:PORT/health
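
The pass/fail rule can be sketched as a small classifier for the status code that curl reports. It assumes that only 200 counts as success. probe_verdict is a hypothetical helper, and curl prints 000 for %{http_code} when it gets no HTTP response at all.

```shell
# probe_verdict STATUS_CODE
# Prints how a health check would likely treat that response.
probe_verdict() {
  case $1 in
    200)             echo "PASS" ;;
    301|302|307|308) echo "FAIL: redirect (not followed)" ;;
    000)             echo "FAIL: no HTTP response" ;;
    *)               echo "FAIL: status $1" ;;
  esac
}

# Usage sketch (BACKEND_IP and PORT are placeholders; curl does not
# follow redirects unless -L is given, which mirrors the prober here):
# probe_verdict "$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
#   "http://BACKEND_IP:PORT/health")"
```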

Protocol mismatch

If your app serves HTTPS on port 443 but the health check is configured as HTTP, the probe sends a plain HTTP request to an HTTPS endpoint. The response will be garbled or a connection reset. Match the health check protocol to what the app actually serves on that port.

# Switch from HTTP to HTTPS health check
gcloud compute health-checks update https HEALTH_CHECK_NAME \
  --port=443 \
  --request-path=/health

Timeout too short

If the app takes 3 seconds to respond to the health check path and the timeout is 2 seconds, the check always fails, even though the app eventually returns 200. Increase the timeout to be comfortably longer than the worst-case response time for the health endpoint. The timeout cannot be longer than the check interval, so raise the interval as well if needed.

# Increase timeout to 10 seconds (the check interval must be at least as long)
gcloud compute health-checks update http HEALTH_CHECK_NAME \
  --check-interval=10s \
  --timeout=10s

Unhealthy threshold too low

With --unhealthy-threshold=1, a single slow response removes the backend immediately. This causes flapping under load. Use at least 3 for production workloads to tolerate occasional slow responses.

# Set a more forgiving threshold
gcloud compute health-checks update http HEALTH_CHECK_NAME \
  --unhealthy-threshold=3 \
  --healthy-threshold=2

Healthy threshold delay after a fix

After you fix the problem, the backend does not become healthy instantly. It must pass the healthy threshold number of consecutive checks. With a healthy threshold of 2 and a check interval of 5 seconds, expect 10 to 15 seconds. If you raised the healthy threshold, it takes proportionally longer.
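
The rule of thumb above can be written out as arithmetic. This is a rough estimate only: the lower bound is interval times threshold, and the upper bound adds one more interval for a fix that lands just after a probe. recovery_window is a hypothetical helper name.

```shell
# recovery_window INTERVAL_SECONDS HEALTHY_THRESHOLD
# Prints an approximate "min-max" wait in seconds before the backend
# is marked healthy again.
recovery_window() {
  local interval=$1 threshold=$2
  echo "$(( interval * threshold ))-$(( interval * (threshold + 1) ))s"
}

recovery_window 5 2   # prints 10-15s
```

With a healthy threshold of 5 and the same 5-second interval, the same estimate gives 25-30s.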

Best practice

Create a dedicated health check endpoint (like /healthz) that returns 200 immediately without doing heavy work. This avoids timeout issues and keeps health check results stable. Do not reuse a page that queries a database or calls external services.

Fix 3: Backend service, instance group, and NEG mismatches

Managed instance groups

Managed instance groups (MIGs) run identical VMs from an instance template. If the template is wrong (wrong startup script, wrong container image, wrong port), every instance in the group fails health checks.

Verify the instance template specifies the correct container image or startup script
Check that the template includes the correct network tag for the health check firewall rule
Confirm the app in the template listens on the port the health check probes
If you updated the template, make sure you rolled out new instances. Old instances still use the old template.

Unmanaged instance groups

Unmanaged instance groups contain manually added VMs that may have different configurations. If some VMs are unhealthy, check each one individually. Common causes: app not running, wrong port, missing network tag, or the VM is in a subnet with different firewall rules.

Zonal NEGs

Zonal network endpoint groups contain specific IP:port pairs. If the NEG references the wrong IP or port, health checks fail. Verify that each endpoint matches a running backend.

# List endpoints in a zonal NEG
gcloud compute network-endpoint-groups list-network-endpoints NEG_NAME \
  --zone=ZONE

Serverless NEGs and Cloud Run

Cloud Run backends use serverless NEGs. Backend services with serverless NEG backends do not support load balancer health checks, so there is no probe or firewall rule to fix and the troubleshooting is different:

Verify the serverless NEG points to the correct Cloud Run service name and region
Check that the Cloud Run service is deployed and not in a failed state
Look at Cloud Run logs for container startup failures. See the Cloud Run Container Failed to Start guide.
Confirm the Cloud Run service is in the same region as the serverless NEG
Check that the Cloud Run service accepts unauthenticated requests if the load balancer does not add authentication headers

Cloud Run difference

Serverless NEGs do not use the same health check resources as instance group backends. You will not see a separate health check resource in the console. If the Cloud Run service itself is healthy but the load balancer shows errors, verify that the NEG region matches the Cloud Run service region and that the service name is spelled correctly.

# Describe a serverless NEG
gcloud compute network-endpoint-groups describe NEG_NAME \
  --region=REGION

# Check Cloud Run service status
gcloud run services describe SERVICE_NAME --region=REGION \
  --format="value(status.conditions)"

Wrong backend attached or wrong region

A backend service can have backends in multiple regions. If you attached the wrong instance group or NEG, or attached one from the wrong region, the health check targets the wrong servers. Verify the attached backends:

gcloud compute backend-services describe BACKEND_SERVICE_NAME \
  --global \
  --format="yaml(backends)"

Application listening on localhost only

Silent failure

If your app binds to 127.0.0.1 (localhost) instead of 0.0.0.0 (all interfaces), it only accepts connections from the VM itself. Health check probes arrive from outside the VM and are refused. The app looks fine when you SSH in and test locally, but the health checker can never reach it. Configure your app to listen on 0.0.0.0 or the specific internal IP of the VM.
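
A quick way to spot this on the VM is to inspect the listening sockets. The sketch below reads ss -ltn output on stdin and reports how a given port is bound. bind_scope is a hypothetical helper, and it assumes the usual ss column layout where the fourth column is Local Address:Port.

```shell
# bind_scope PORT   (reads `ss -ltn` output on stdin)
# Prints: loopback-only | reachable | not-listening
bind_scope() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")
      if (parts[n] != port) next
      found = 1
      # Everything before the final ":PORT" is the bind address.
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      if (addr !~ /^127\./ && addr != "[::1]") external = 1
    }
    END {
      if (!found)        print "not-listening"
      else if (external) print "reachable"
      else               print "loopback-only"
    }'
}
```

Run it as ss -ltn | bind_scope 8080 on the backend VM. A result of loopback-only means health check probes cannot connect, even though a local curl succeeds.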

Backend unhealthy vs 502 vs 503

These look similar but have different causes and different fixes:

Condition	What it means	Check first	Typical root cause	Fastest fix
Backend UNHEALTHY	Health check probe cannot reach or verify the backend	Firewall rules, health check config	Missing firewall rule for health check source IPs	Create firewall rule allowing 35.191.0.0/16 and 130.211.0.0/22
502 Bad Gateway	Load balancer reached the backend but got an invalid or no response	Application logs, backend connectivity	App crash, connection reset, or protocol mismatch between LB and backend	Check app logs; verify app is running and responding on the expected port and protocol
503 Service Unavailable	No healthy backends available, or all backends are at capacity	Backend health status, capacity settings	All backends unhealthy, or max connections/utilization reached	Fix health checks so backends are healthy, or scale out the backend group
Health check timeout	Backend did not respond within the timeout window	App response time, timeout setting	App too slow, timeout too short, or health endpoint does heavy work	Increase timeout or make the health endpoint lighter
Wrong health check path / redirect	Health check gets 301, 302, 404, or 403 instead of 200	The actual response from the health check path	Path does not exist, redirects to another URL, or requires auth	Point health check to a path that returns 200 directly without redirect

Use Logs Explorer to see exactly which status codes the load balancer is returning and which backend handled each request.

Common beginner mistakes

Assuming the app works because curl works from your machine. Your machine is not the health checker. Health checks come from 35.191.0.0/16 and 130.211.0.0/22. If the firewall blocks those ranges, curl works but health checks fail.
Missing the health check firewall rule entirely. VPC networks deny ingress by default, so probes from these ranges are blocked unless a rule explicitly allows them. This is the number one cause of all-backends-unhealthy on new load balancer setups.
Using the wrong health check path. Pointing the health check to / when the app serves a heavy page there, or to a path that does not exist. Use a dedicated lightweight health endpoint like /healthz.
Redirecting the health check path. HTTP-to-HTTPS redirects, trailing slash redirects, or path rewrites all return 301/302. Health checks count these as failures. The path must return 200 directly.
Exposing the app on the wrong port. The health check port, the app listening port, and the firewall rule port must all agree. A one-digit typo in any of them breaks the chain.
Binding only to 127.0.0.1. An app that listens on localhost refuses connections from outside the VM, including health check probes. Bind to 0.0.0.0.
Forgetting the healthy threshold delay. After fixing the problem, the backend is not instantly healthy. It must pass multiple consecutive checks. Wait at least 10 to 15 seconds with default settings before concluding the fix did not work.
Checking the wrong backend service or region. If you have multiple load balancers or backend services, make sure you are inspecting the right one. Global load balancers use --global; regional ones need --region.

Step-by-step commands to verify backend health

Inspect backend health

# Global backend service
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

# Regional backend service
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --region=REGION

Inspect health check configuration

# List all health checks
gcloud compute health-checks list

# Describe a specific health check
gcloud compute health-checks describe HEALTH_CHECK_NAME --format="yaml"

Inspect backend service configuration

# See which backends are attached and which health check is used
gcloud compute backend-services describe BACKEND_SERVICE_NAME \
  --global \
  --format="yaml(backends, healthChecks)"

Inspect NEG configuration

# List all NEGs
gcloud compute network-endpoint-groups list

# Describe a specific NEG
gcloud compute network-endpoint-groups describe NEG_NAME --zone=ZONE

# For serverless NEGs
gcloud compute network-endpoint-groups describe NEG_NAME --region=REGION

Inspect relevant logs

Use Logs Explorer or the gcloud CLI to check load balancer logs and application logs:

# Load balancer access logs, filtered for 502 and 503 errors
gcloud logging read \
  'resource.type="http_load_balancer" AND (httpRequest.status=502 OR httpRequest.status=503)' \
  --project=PROJECT_ID \
  --limit=20 \
  --format="table(timestamp, httpRequest.requestUrl, httpRequest.status)"

# Application logs from a specific instance
gcloud logging read \
  'resource.type="gce_instance" AND resource.labels.instance_id="INSTANCE_ID"' \
  --project=PROJECT_ID \
  --limit=30

Verify app response directly from a VM in the same network

# SSH into a VM in the same VPC and test the health check path
# This tells you whether the app itself is responding correctly
curl -v http://BACKEND_INTERNAL_IP:PORT/healthz

# Check what the response headers and status code are
curl -I http://BACKEND_INTERNAL_IP:PORT/healthz

Tip

If the curl test returns 200, the app is fine and the problem is between the health checker and the app (most likely the firewall). If the curl test returns an error, fix the app first.

For deeper network-level troubleshooting, use VPC Connectivity Tests to trace the exact path health check traffic takes through your network and identify where it gets blocked.

Frequently asked questions

Why does my backend stay unhealthy even though curl returns 200?

When you curl from your own machine, the traffic comes from your IP. GCP health checks come from completely different IP ranges (35.191.0.0/16 and 130.211.0.0/22). If your VPC firewall does not have an ingress allow rule from those ranges on the health check port, the probes never reach your backend. Your app is fine, but the health checker cannot talk to it. Create or fix the firewall rule, verify the port matches, and check that any target tags on the rule also exist on your backend VMs.

Does a 301 or 302 redirect count as a healthy response?

No. GCP health checks do not follow redirects. A 301 or 302 response is treated as a failure. If your health check path redirects (for example, /health redirecting to /health/ or HTTP redirecting to HTTPS), the backend stays unhealthy. Fix this by pointing the health check to a path that returns 200 directly, without any redirect, and make sure the protocol in the health check matches what the app actually serves on that port.

Why are some backends healthy while others are unhealthy?

When only some backends fail, the problem is specific to those instances rather than a missing firewall rule (which would cause all backends to fail). Common causes include: the application crashed or is not running on the unhealthy instances, the instances are in a different zone or subnet with different firewall rules, disk is full, the instance ran out of memory, or the app is listening on the wrong port on those specific VMs. Check application logs and instance status for the unhealthy backends individually.

How long after the fix does the backend turn healthy again?

The backend must pass the healthy threshold number of consecutive health checks before the load balancer marks it healthy. With the default healthy threshold of 2 and check interval of 5 seconds, expect roughly 10 to 15 seconds. If you set a higher healthy threshold, it takes proportionally longer. You can check your health check settings with gcloud compute health-checks describe HEALTH_CHECK_NAME.

What is different for Cloud Run or serverless NEG backends?

Cloud Run backends use serverless NEGs instead of instance groups. Health checks work differently: backend services with serverless NEG backends do not support load balancer health checks, so you do not configure firewall rules or health check resources for them. If a Cloud Run backend returns errors through the load balancer, the problem is usually the Cloud Run service itself. It may be failing to start, returning errors, or the serverless NEG may point to the wrong service or region. Check Cloud Run logs and verify the NEG configuration points to the correct service name and region.

Last verified: 27 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.