GCP Load Balancer Backend Unhealthy? Fix Health Checks Fast
“Backend unhealthy” means the GCP load balancer tried to verify that your server is alive and the server did not respond correctly. The load balancer pulls that backend out of rotation and stops sending it traffic. If every backend is unhealthy, users see 502 or 503 errors because there is nowhere to send requests.
This guide covers the three situations you will hit most often:
- All backends are unhealthy. Usually a firewall rule or health check configuration problem.
- Some backends are unhealthy. Usually an application-level issue on specific instances.
- Backends are healthy but users still get 502 or 503. The health check passes, but the app returns errors to real requests.
The fastest path is to check which situation applies to you, then jump to the matching fix below.
Simple explanation
Think of a load balancer like a restaurant host. Before seating guests at a table, the host peeks into each dining room to make sure a waiter is actually there. If the host cannot see a waiter (the health check fails), that room gets closed off and no guests are seated there. The waiter might be in the room but standing behind a locked door (firewall). The waiter might be in a completely different room (wrong port). Or the waiter might have gone home (app crashed). The host does not care why. No visible waiter means no guests.
A backend is any server, VM, container, or service that sits behind a load balancer and handles actual requests. In GCP, backends are grouped in managed instance groups, unmanaged instance groups, or network endpoint groups (NEGs).
A health check is a small probe the load balancer sends to each backend on a regular interval. It might be an HTTP GET to /health, a TCP connection attempt, or an SSL handshake. The backend must respond within a timeout and return an acceptable status code.
When a backend fails enough consecutive health checks, the load balancer marks it unhealthy and stops routing traffic to it. This protects users from hitting a broken server. But it also means a misconfigured health check or a missing firewall rule can make a perfectly working backend look dead.
The load balancer is doing its job correctly. The fix is almost always in the firewall, the health check config, or the application itself.
How GCP health checks work
The load balancer sends a probe to each backend at a fixed interval (default: 5 seconds). Each probe must get a valid response within the configured timeout (default: 5 seconds).
Unhealthy threshold: If a backend fails this many consecutive probes (default: 2), it is marked UNHEALTHY and removed from rotation.
Healthy threshold: After you fix the problem, the backend must pass this many consecutive probes (default: 2) before the load balancer considers it healthy again and starts sending traffic to it.
Health check probes originate from GCP infrastructure, not from your VPC or your machine. They come from specific IP ranges. This is why a backend can respond to your own curl requests perfectly but still fail health checks. The probe traffic gets blocked at the firewall before it ever reaches the app.
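If you see a request in a backend access log and want to know whether it came from a health checker, you can test its source IP against the probe ranges. The sketch below is a minimal illustration in plain shell; the function names are hypothetical helpers, and the two ranges are the ones this guide lists for HTTP(S) load balancers.

```shell
# Hypothetical helpers (not part of gcloud): check whether an IPv4 address
# falls inside the health check probe ranges named in this guide.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1   # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IPv4 address $1 is inside CIDR block $2 (for example 10.0.0.0/8).
in_cidr() {
  local ip="$1" net="${2%/*}" bits="${2#*/}" mask
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & $mask )) -eq $(( $(ip_to_int "$net") & $mask )) ]
}

# Succeed if $1 is in one of the probe ranges (assumed: the two HTTP(S) ranges).
is_health_check_probe() {
  local range
  for range in 35.191.0.0/16 130.211.0.0/22; do
    if in_cidr "$1" "$range"; then return 0; fi
  done
  return 1
}

# Usage:
#   is_health_check_probe 35.191.10.5 && echo "probe traffic"
```

If your own curl requests show up in the log but no address from these ranges ever does, the probes are most likely being dropped before they reach the app.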
A health check failure is not the same as an application error. A health check failure means the load balancer could not reach or verify the backend at all. An application returning 500 to user requests is a different problem: the backend is reachable, but the app has a bug. The load balancer can mark a backend as healthy (health check passes) while the app still returns 5xx errors to real user traffic.
When to use this guide
Use this guide if you see any of these symptoms:
- The Google Cloud console shows backends as UNHEALTHY in your backend service or load balancer details
- Users get 502 Bad Gateway or 503 Service Unavailable from the load balancer
- A Cloud Run or serverless NEG backend is failing health checks
- Some instances in a managed instance group pass while others fail
- gcloud compute backend-services get-health returns UNHEALTHY for one or more backends
- You just set up a new HTTP load balancer and traffic is not flowing
Fast diagnosis: start here
Run this command to see the current status of every backend in your backend service:
# Global backend service (external HTTP/S load balancers)
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global
# Regional backend service (internal or regional external load balancers)
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --region=REGION
Then match your situation to the right starting point:
| Symptom | Most likely cause | Start with |
|---|---|---|
| All backends UNHEALTHY | Missing or wrong firewall rule for health check IPs | Fix 1: Firewall rules |
| Some backends UNHEALTHY, others healthy | App issue on specific instances, wrong tags, or zone-level firewall differences | Fix 3: Backend and instance group mismatches |
| All backends healthy but users get 502 | App crashes on real traffic, connection resets, or protocol mismatch | Backend unhealthy vs 502 vs 503 |
| All backends healthy but users get 503 | No capacity, backend at max connections, or backend draining | Backend unhealthy vs 502 vs 503 |
| Cloud Run / serverless NEG unhealthy | Cloud Run service failing, wrong NEG config, or wrong region | Fix 3: NEG and Cloud Run section |
| Backend flapping between healthy and unhealthy | Timeout too short, app slow under load, or threshold set to 1 | Fix 2: Health check settings |
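The first rows of the table can be read as a small decision rule on how many backends are healthy. The sketch below is only an illustration of that rule; the function name and its two arguments (healthy count, total count) are assumptions, and it does not cover the Cloud Run or flapping rows.

```shell
# Hypothetical helper: map "healthy backends out of total" to the
# starting point suggested in the table above.
triage() {
  local healthy="$1" total="$2"
  if [ "$healthy" -eq 0 ]; then
    echo "Fix 1: Firewall rules"
  elif [ "$healthy" -lt "$total" ]; then
    echo "Fix 3: Backend and instance group mismatches"
  else
    # Every backend passes its health check, so look at user-facing errors.
    echo "Backend unhealthy vs 502 vs 503"
  fi
}

# Usage:
#   triage 0 4    # all four backends unhealthy
#   triage 3 4    # one backend unhealthy
```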
Fix 1: Firewall rules for health checks
If all your backends show UNHEALTHY at the same time, start here. A missing health check firewall rule is the number one reason for all-backends-unhealthy on new load balancer setups.
GCP load balancer health checks originate from Google infrastructure using these IP ranges:
- 35.191.0.0/16
- 130.211.0.0/22
Your VPC firewall must have an ingress allow rule from both ranges on the port your health check probes. Without this rule, health check traffic is silently dropped and every backend appears unhealthy.
This applies to external and internal HTTP(S) load balancers that use compute health checks. External passthrough Network Load Balancers (formerly external TCP/UDP network load balancers) are probed from a different set of ranges: 35.191.0.0/16, 209.85.152.0/22, and 209.85.204.0/22. Always verify the required ranges for your specific load balancer type in the GCP documentation.
Create the firewall rule
# Allow health check probes on specific ports
# Adjust --rules to match YOUR health check port
gcloud compute firewall-rules create allow-health-checks \
--network=my-vpc \
--direction=INGRESS \
--action=ALLOW \
--source-ranges=35.191.0.0/16,130.211.0.0/22 \
--rules=tcp:80,tcp:443 \
--target-tags=load-balanced-backend
Verify the rule applies to your backends
The --target-tags flag scopes this rule to VMs with the matching network tag. Your backend VMs must carry the same tag. If the tags do not match, the rule exists but does not apply to your backends.
# Check which tags are on your VM
gcloud compute instances describe VM_NAME --zone=ZONE \
--format="value(tags.items)"
# Add the tag if missing
gcloud compute instances add-tags VM_NAME --zone=ZONE \
--tags=load-balanced-backend
Verify the port matches
The firewall rule must allow the exact port the health check probes. If your health check is configured for port 8080 but the firewall rule only allows 80 and 443, the probes are still blocked.
# Check which port the health check probes
gcloud compute health-checks describe HEALTH_CHECK_NAME \
--format="value(httpHealthCheck.port, httpsHealthCheck.port, tcpHealthCheck.port)"
If you omit --target-tags (and do not use --target-service-accounts), the rule applies to every VM in the network. Using a specific tag is best practice so the rule only affects load-balanced backends.
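To make the port requirement concrete, the sketch below checks whether a health check port is covered by a rule's allowed ports. It is a hypothetical helper, not a gcloud feature; it assumes the allowed ports are written in the same comma-separated form as the --rules flag and only looks at TCP entries, since health check probes are TCP-based.

```shell
# Hypothetical helper: port_allowed PORT RULES
# RULES uses the same form as --rules, for example "tcp:80,tcp:443".
port_allowed() {
  local port="$1" rule range
  local IFS=,
  for rule in $2; do
    case "$rule" in
      all|tcp)            # no port list: every TCP port is allowed
        return 0 ;;
      tcp:*-*)            # port range such as tcp:8000-8100
        range="${rule#tcp:}"
        if [ "$port" -ge "${range%-*}" ] && [ "$port" -le "${range#*-}" ]; then
          return 0
        fi ;;
      tcp:*)              # single port such as tcp:443
        if [ "${rule#tcp:}" = "$port" ]; then return 0; fi ;;
    esac
  done
  return 1
}

# Usage:
#   port_allowed 8080 "tcp:80,tcp:443" || echo "probes on 8080 are blocked"
```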
Fix 2: Health check port, path, protocol, timeout, and thresholds
Once the firewall rule is correct, the next most common cause is a mismatch between what the health check expects and what the application actually serves.
Imagine calling a phone number to confirm a restaurant is open. If you dial the wrong number (wrong port), nobody picks up. If you call the right number but ask for the wrong department (wrong path), you get transferred to a dead line. If you hang up after two rings but the restaurant picks up on the third (timeout too short), you assume they are closed. Every part of the health check must match what the application actually does.
Wrong port
If the health check probes port 80 but your app listens on 8080, the probe connects to nothing and times out. This is surprisingly common after changing application ports.
# Check the health check port
gcloud compute health-checks describe HEALTH_CHECK_NAME --format="yaml"
# Update to match your application port
gcloud compute health-checks update http HEALTH_CHECK_NAME --port=8080
Wrong path
The health check path must return HTTP 200. Any other status, such as 404 (not found) or 403 (forbidden), fails the check. Double-check that the path exists and is accessible without authentication.
Redirects (301 / 302)
GCP health checks do not follow redirects. If your app redirects /health to /health/, or redirects HTTP to HTTPS, the health check receives a 301 or 302 and counts it as a failure. Point the health check to a path that returns 200 directly.
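The pass/fail rule described above can be written down as a small classifier for the status code a probe would see. This is a sketch under the assumption that only HTTP 200 counts as success; the function name is hypothetical, and the 000 case reflects what curl prints for %{http_code} when it gets no HTTP response at all.

```shell
# Hypothetical helper: interpret an HTTP status code the way this guide
# describes health check evaluation (only 200 passes, redirects fail).
classify_probe_status() {
  case "$1" in
    200) echo "pass" ;;
    3??) echo "fail: redirect (health checks do not follow redirects)" ;;
    000) echo "fail: no HTTP response (refused, reset, or timed out)" ;;
    *)   echo "fail: status $1 is not 200" ;;
  esac
}

# Usage from a VM in the same network (BACKEND_IP and PORT are placeholders):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://BACKEND_IP:PORT/health)
#   classify_probe_status "$code"
```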
# Test from a VM in the same network to see what the path actually returns
curl -I http://BACKEND_IP:PORT/health
Protocol mismatch
If your app serves HTTPS on port 443 but the health check is configured as HTTP, the probe sends a plain HTTP request to an HTTPS endpoint. The response will be garbled or a connection reset. Match the health check protocol to what the app actually serves on that port.
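# Update an existing HTTPS health check. To move from an HTTP health check to HTTPS,
# create a new HTTPS health check and attach it to the backend service instead.
gcloud compute health-checks update https HEALTH_CHECK_NAME \
--port=443 \
--request-path=/health
Timeout too short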
If the app takes 3 seconds to respond to the health check path and the timeout is 2 seconds, the check always fails, even though the app eventually returns 200. Increase the timeout to be comfortably longer than the worst-case response time for the health endpoint.
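A quick sanity check on the numbers helps before changing settings. The sketch below is a hypothetical helper that compares a measured worst-case response time with the timeout, and also checks one constraint GCP enforces: the timeout cannot be larger than the check interval.

```shell
# Hypothetical helper: check_timing WORST_CASE_MS TIMEOUT_SECONDS INTERVAL_SECONDS
check_timing() {
  local worst_ms="$1" timeout_s="$2" interval_s="$3"
  if [ "$timeout_s" -gt "$interval_s" ]; then
    echo "invalid: timeout exceeds check interval"
    return 1
  fi
  if [ "$worst_ms" -ge $(( $timeout_s * 1000 )) ]; then
    echo "too short: worst-case response does not fit in the timeout"
    return 1
  fi
  echo "ok"
}

# Usage: the 3-second app with a 2-second timeout from the example above
#   check_timing 3000 2 5
```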
# Increase timeout to 10 seconds. The timeout cannot exceed the check interval,
# so raise the interval too if it is still at the 5-second default.
gcloud compute health-checks update http HEALTH_CHECK_NAME \
--timeout=10s \
--check-interval=10s
Unhealthy threshold too low
With --unhealthy-threshold=1, a single slow response removes the backend immediately. This causes flapping under load. Use at least 3 for production workloads to tolerate occasional slow responses.
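To see why a threshold of 1 flaps, it helps to replay a sequence of probe results against different thresholds. The sketch below is a simplified model, not GCP's actual prober logic: it assumes a single prober, a backend that starts healthy, and results given as P (pass) or F (fail).

```shell
# Hypothetical helper: count_removals UNHEALTHY_THRESHOLD HEALTHY_THRESHOLD RESULT...
# Prints how many times the backend would be removed from rotation.
count_removals() {
  local unhealthy_thr="$1" healthy_thr="$2"
  shift 2
  local state=HEALTHY fails=0 passes=0 removals=0 r
  for r in "$@"; do
    if [ "$r" = F ]; then
      fails=$(( $fails + 1 )); passes=0
      if [ "$state" = HEALTHY ] && [ "$fails" -ge "$unhealthy_thr" ]; then
        state=UNHEALTHY
        removals=$(( $removals + 1 ))
      fi
    else
      passes=$(( $passes + 1 )); fails=0
      if [ "$state" = UNHEALTHY ] && [ "$passes" -ge "$healthy_thr" ]; then
        state=HEALTHY
      fi
    fi
  done
  echo "$removals"
}

# Usage: one slow probe in every three
#   count_removals 1 2 P P F P P F P P F P P
#   count_removals 3 2 P P F P P F P P F P P
```

In this model the same stream of occasional failures removes the backend repeatedly at a threshold of 1 and never at a threshold of 3.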
# Set a more forgiving threshold
gcloud compute health-checks update http HEALTH_CHECK_NAME \
--unhealthy-threshold=3 \
--healthy-threshold=2
Healthy threshold delay after a fix
After you fix the problem, the backend does not become healthy instantly. It must pass the healthy threshold number of consecutive checks. With a healthy threshold of 2 and a check interval of 5 seconds, expect 10 to 15 seconds. If you raised the healthy threshold, it takes proportionally longer.
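The arithmetic behind that estimate is simple enough to script. The sketch below is a rough rule of thumb rather than a documented formula: the lower bound assumes probes start passing right after the fix, and the upper bound adds one extra interval for probe alignment. Both the function name and that extra-interval allowance are assumptions.

```shell
# Hypothetical helper: recovery_window INTERVAL_SECONDS HEALTHY_THRESHOLD
# Prints an approximate "min max" recovery time in seconds.
recovery_window() {
  local interval="$1" healthy_thr="$2"
  echo "$(( $interval * $healthy_thr )) $(( $interval * ($healthy_thr + 1) ))"
}

# Usage: default settings, then a raised healthy threshold
#   recovery_window 5 2
#   recovery_window 5 5
```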
Create a dedicated health check endpoint (like /healthz) that returns 200 immediately without doing heavy work. This avoids timeout issues and keeps health check results stable. Do not reuse a page that queries a database or calls external services.
Fix 3: Backend service, instance group, and NEG mismatches
Managed instance groups
Managed instance groups (MIGs) run identical VMs from an instance template. If the template is wrong (wrong startup script, wrong container image, wrong port), every instance in the group fails health checks.
- Verify the instance template specifies the correct container image or startup script
- Check that the template includes the correct network tag for the health check firewall rule
- Confirm the app in the template listens on the port the health check probes
- If you updated the template, make sure you rolled out new instances. Old instances still use the old template.
Unmanaged instance groups
Unmanaged instance groups contain manually added VMs that may have different configurations. If some VMs are unhealthy, check each one individually. Common causes: app not running, wrong port, missing network tag, or the VM is in a subnet with different firewall rules.
Zonal NEGs
Zonal network endpoint groups contain specific IP:port pairs. If the NEG references the wrong IP or port, health checks fail. Verify that each endpoint matches a running backend.
# List endpoints in a zonal NEG
gcloud compute network-endpoint-groups list-network-endpoints NEG_NAME \
--zone=ZONE
Serverless NEGs and Cloud Run
Cloud Run backends use serverless NEGs. GCP manages health checking internally for these, so the troubleshooting is different:
- Verify the serverless NEG points to the correct Cloud Run service name and region
- Check that the Cloud Run service is deployed and not in a failed state
- Look at Cloud Run logs for container startup failures. See the Cloud Run Container Failed to Start guide.
- Confirm the Cloud Run service is in the same region as the serverless NEG
- Check that the Cloud Run service accepts unauthenticated requests if the load balancer does not add authentication headers
Serverless NEGs do not use the same health check resources as instance group backends. You will not see a separate health check resource in the console. If the Cloud Run service itself is healthy but the load balancer shows errors, verify that the NEG region matches the Cloud Run service region and that the service name is spelled correctly.
# Describe a serverless NEG
gcloud compute network-endpoint-groups describe NEG_NAME \
--region=REGION
# Check Cloud Run service status
gcloud run services describe SERVICE_NAME --region=REGION \
--format="value(status.conditions)"
Wrong backend attached or wrong region
A backend service can have backends in multiple regions. If you attached the wrong instance group or NEG, or attached one from the wrong region, the health check targets the wrong servers. Verify the attached backends:
gcloud compute backend-services describe BACKEND_SERVICE_NAME \
--global \
--format="yaml(backends)"
Application listening on localhost only
If your app binds to 127.0.0.1 (localhost) instead of 0.0.0.0 (all interfaces), it only accepts connections from the VM itself. Health check probes arrive from outside the VM and are refused. The app looks fine when you SSH in and test locally, but the health checker can never reach it. Configure your app to listen on 0.0.0.0 or the specific internal IP of the VM.
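One way to check the bind address is to look at the listening sockets on the VM and classify the local address. The sketch below is a hypothetical helper that takes one address:port value in the form the ss tool prints; the category names are made up for illustration.

```shell
# Hypothetical helper: classify a listening address such as "127.0.0.1:8080".
classify_listen_address() {
  case "$1" in
    127.*|"[::1]":*|localhost:*)
      echo "loopback-only" ;;        # unreachable for health check probes
    0.0.0.0:*|"*":*|"[::]":*)
      echo "all-interfaces" ;;       # reachable from outside the VM
    *)
      echo "specific-interface" ;;   # reachable only on that address
  esac
}

# Usage on the backend VM (column 4 of `ss -ltn` is the local address:port):
#   ss -ltn | awk 'NR>1 {print $4}' | while read -r addr; do
#     echo "$addr $(classify_listen_address "$addr")"
#   done
```

A loopback-only result on the health check port explains why local tests pass while the backend stays unhealthy.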
Backend unhealthy vs 502 vs 503
These look similar but have different causes and different fixes:
| Condition | What it means | Check first | Typical root cause | Fastest fix |
|---|---|---|---|---|
| Backend UNHEALTHY | Health check probe cannot reach or verify the backend | Firewall rules, health check config | Missing firewall rule for health check source IPs | Create firewall rule allowing 35.191.0.0/16 and 130.211.0.0/22 |
| 502 Bad Gateway | Load balancer reached the backend but got an invalid or no response | Application logs, backend connectivity | App crash, connection reset, or protocol mismatch between LB and backend | Check app logs; verify app is running and responding on the expected port and protocol |
| 503 Service Unavailable | No healthy backends available, or all backends are at capacity | Backend health status, capacity settings | All backends unhealthy, or max connections/utilization reached | Fix health checks so backends are healthy, or scale out the backend group |
| Health check timeout | Backend did not respond within the timeout window | App response time, timeout setting | App too slow, timeout too short, or health endpoint does heavy work | Increase timeout or make the health endpoint lighter |
| Wrong health check path / redirect | Health check gets 301, 302, 404, or 403 instead of 200 | The actual response from the health check path | Path does not exist, redirects to another URL, or requires auth | Point health check to a path that returns 200 directly without redirect |
Use Logs Explorer to see exactly which status codes the load balancer is returning and which backend handled each request.
Common beginner mistakes
Assuming the app works because curl works from your machine. Your machine is not the health checker. Health checks come from 35.191.0.0/16 and 130.211.0.0/22. If the firewall blocks those ranges, curl works but health checks fail.
Missing the health check firewall rule entirely. VPC networks deny ingress by default, so nothing admits the health check ranges unless you explicitly added a rule for them. This is the number one cause of all-backends-unhealthy on new load balancer setups.
Using the wrong health check path. Pointing the health check to / when the app serves a heavy page there, or to a path that does not exist. Use a dedicated lightweight health endpoint like /healthz.
Redirecting the health check path. HTTP-to-HTTPS redirects, trailing slash redirects, or path rewrites all return 301/302. Health checks count these as failures. The path must return 200 directly.
Exposing the app on the wrong port. The health check port, the app listening port, and the firewall rule port must all agree. A one-digit typo in any of them breaks the chain.
Binding only to 127.0.0.1. An app that listens on localhost refuses connections from outside the VM, including health check probes. Bind to 0.0.0.0.
Forgetting the healthy threshold delay. After fixing the problem, the backend is not instantly healthy. It must pass multiple consecutive checks. Wait at least 10 to 15 seconds with default settings before concluding the fix did not work.
Checking the wrong backend service or region. If you have multiple load balancers or backend services, make sure you are inspecting the right one. Global load balancers use --global; regional ones need --region.
Step-by-step commands to verify backend health
Inspect backend health
# Global backend service
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global
# Regional backend service
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --region=REGION
Inspect health check configuration
# List all health checks
gcloud compute health-checks list
# Describe a specific health check
gcloud compute health-checks describe HEALTH_CHECK_NAME --format="yaml"
Inspect backend service configuration
# See which backends are attached and which health check is used
gcloud compute backend-services describe BACKEND_SERVICE_NAME \
--global \
--format="yaml(backends, healthChecks)"
Inspect NEG configuration
# List all NEGs
gcloud compute network-endpoint-groups list
# Describe a specific NEG
gcloud compute network-endpoint-groups describe NEG_NAME --zone=ZONE
# For serverless NEGs
gcloud compute network-endpoint-groups describe NEG_NAME --region=REGION
Inspect relevant logs
Use Logs Explorer or the gcloud CLI to check load balancer logs and application logs:
# Load balancer access logs, filtered for 502 and 503 errors
gcloud logging read \
'resource.type="http_load_balancer" AND (httpRequest.status=502 OR httpRequest.status=503)' \
--project=PROJECT_ID \
--limit=20 \
--format="table(timestamp, httpRequest.requestUrl, httpRequest.status)"
# Application logs from a specific instance
gcloud logging read \
'resource.type="gce_instance" AND resource.labels.instance_id="INSTANCE_ID"' \
--project=PROJECT_ID \
--limit=30
Verify app response directly from a VM in the same network
# SSH into a VM in the same VPC and test the health check path
# This tells you whether the app itself is responding correctly
curl -v http://BACKEND_INTERNAL_IP:PORT/healthz
# Check what the response headers and status code are
curl -I http://BACKEND_INTERNAL_IP:PORT/healthz
If the curl test returns 200, the app is fine and the problem is between the health checker and the app (most likely the firewall). If the curl test returns an error, fix the app first.
For deeper network-level troubleshooting, use VPC Connectivity Tests to trace the exact path health check traffic takes through your network and identify where it gets blocked.
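When a backend service has many instances, the get-health output gets long. The sketch below counts healthy and unhealthy entries from that output on standard input. It assumes the default YAML-style output, where each instance has a healthState line; the function name is hypothetical, and other states (for example draining) are not counted.

```shell
# Hypothetical helper: summarize get-health output read from stdin.
summarize_health() {
  local output healthy unhealthy
  output=$(cat)
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  healthy=$(printf '%s\n' "$output" | grep -c 'healthState: HEALTHY' || true)
  unhealthy=$(printf '%s\n' "$output" | grep -c 'healthState: UNHEALTHY' || true)
  echo "healthy=$healthy unhealthy=$unhealthy"
}

# Usage:
#   gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global | summarize_health
```

Zero healthy entries points toward the firewall rule; a mix points toward the individual instances.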
Summary
- Health check probes come from 35.191.0.0/16 and 130.211.0.0/22. Your firewall must allow ingress from these ranges on the health check port.
- All backends UNHEALTHY usually means a missing firewall rule. Some backends UNHEALTHY usually means an app-level issue on those specific instances.
- The health check port, path, protocol, and timeout must all match what the application actually serves.
- Health checks do not follow redirects. A 301 or 302 counts as a failure.
- The app must bind to 0.0.0.0, not 127.0.0.1, to accept health check probes.
- After fixing the issue, wait for the backend to pass the healthy threshold (roughly 10 to 15 seconds with default settings) before it shows healthy.
- Use gcloud compute backend-services get-health to see per-backend status and start diagnosing from there.
- Use Cloud Monitoring and load balancer access logs to correlate health check failures with 502/503 errors.
Frequently asked questions
Why does my backend stay unhealthy even though curl returns 200?
When you curl from your own machine, the traffic comes from your IP. GCP health checks come from completely different IP ranges (35.191.0.0/16 and 130.211.0.0/22). If your VPC firewall does not have an ingress allow rule from those ranges on the health check port, the probes never reach your backend. Your app is fine, but the health checker cannot talk to it. Create or fix the firewall rule, verify the port matches, and check that any target tags on the rule also exist on your backend VMs.
Does a 301 or 302 redirect count as a healthy response?
No. GCP health checks do not follow redirects. A 301 or 302 response is treated as a failure. If your health check path redirects (for example, /health redirecting to /health/ or HTTP redirecting to HTTPS), the backend stays unhealthy. Fix this by pointing the health check to a path that returns 200 directly, without any redirect, and make sure the protocol in the health check matches what the app actually serves on that port.
Why are some backends healthy while others are unhealthy?
When only some backends fail, the problem is specific to those instances rather than a missing firewall rule (which would cause all backends to fail). Common causes include: the application crashed or is not running on the unhealthy instances, the instances are in a different zone or subnet with different firewall rules, disk is full, the instance ran out of memory, or the app is listening on the wrong port on those specific VMs. Check application logs and instance status for the unhealthy backends individually.
How long after the fix does the backend turn healthy again?
The backend must pass the healthy threshold number of consecutive health checks before the load balancer marks it healthy. With the default healthy threshold of 2 and check interval of 5 seconds, expect roughly 10 to 15 seconds. If you set a higher healthy threshold, it takes proportionally longer. You can check your health check settings with gcloud compute health-checks describe HEALTH_CHECK_NAME.
What is different for Cloud Run or serverless NEG backends?
Cloud Run backends use serverless NEGs instead of instance groups. Health checks work differently: GCP manages the health checking internally for serverless NEGs, so you do not configure firewall rules or health check resources the same way. If a Cloud Run backend shows unhealthy, the problem is usually the Cloud Run service itself. It may be failing to start, returning errors, or the serverless NEG may point to the wrong service or region. Check Cloud Run logs and verify the NEG configuration points to the correct service name and region.