GKE CrashLoopBackOff: Logs, Exit Codes, Causes, and Fixes

CrashLoopBackOff is not the error. It is Kubernetes telling you that a container keeps crashing and it is spacing out the restart attempts to avoid hammering a failing process. The actual cause is always something else: a missing environment variable, an out-of-memory kill, a liveness probe that fires too early, or any number of application-level failures. Your job is to find that underlying cause and fix it. This page gives you a clear, step-by-step workflow to do exactly that, starting with reading the right logs and exit codes, then working through each common root cause with the exact commands and YAML fixes you need.

If you are new to Kubernetes or GKE, do not panic when you see CrashLoopBackOff. The name sounds alarming, but it is simply a restart state. Every CrashLoopBackOff has a concrete, fixable cause. By the end of this page, you will know how to find it in under five minutes.

CrashLoopBackOff in simple terms

Think of CrashLoopBackOff like a car that will not start. You turn the key, the engine tries to start, then stalls. You wait a moment and try again. Each time it fails, you wait a little longer before the next attempt. Kubernetes does the same thing with your container: it starts the container, the container crashes, and Kubernetes waits before trying again. That waiting period is the “backoff.”

When you run kubectl get pods and see a pod in CrashLoopBackOff, you are seeing two important pieces of information:

  • STATUS: CrashLoopBackOff means the container has crashed and Kubernetes is waiting before the next restart attempt
  • RESTARTS: N tells you how many times the container has been restarted so far (a high number means the crash has been happening for a while)

The container is not currently running. It is sitting in a wait period, so kubectl logs POD_NAME on its own often shows nothing useful: the current container instance has not started yet. You need --previous to see the logs from the last crash.

Key concept

CrashLoopBackOff is the symptom, not the disease. Think of it like a fever: it tells you something is wrong, but not what is wrong. The exit code and the previous logs are the actual diagnosis.

How CrashLoopBackOff works

Exponential backoff timing

After each crash, Kubernetes applies exponential backoff before restarting the container. The timing works like this:

  • First crash: wait 10 seconds before restart
  • Second crash: wait 20 seconds
  • Third crash: wait 40 seconds
  • Fourth crash: wait 80 seconds
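  • Fifth crash: wait 160 seconds
  • Sixth crash and beyond: wait 5 minutes (300 seconds, the cap)

Once the backoff reaches 5 minutes, it stays at 5 minutes for every subsequent restart attempt. Kubernetes will keep trying indefinitely. It never gives up. Once you fix the underlying problem and the container runs for 10 minutes without crashing, the backoff delay resets to its starting value. The RESTARTS count itself is not reset; it only returns to zero when the pod is replaced.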
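
The doubling-with-a-cap rule can be sketched as a few lines of shell arithmetic. This is an illustration only: backoff_delay is a hypothetical helper, and it assumes the kubelet defaults of a 10-second base and a 300-second cap. Under pure doubling the fifth delay is 160 seconds and the cap is reached on the sixth crash.

```shell
# backoff_delay CRASH_NUMBER: seconds Kubernetes waits after the Nth crash.
# Assumes a 10s base that doubles per crash and is capped at 300s.
backoff_delay() {
  crash=$1
  delay=10
  i=1
  while [ "$i" -lt "$crash" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

# Print the delay after each of the first seven crashes.
for n in 1 2 3 4 5 6 7; do
  echo "crash $n: wait $(backoff_delay "$n")s"
done
```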

Restart policy

The pod’s restartPolicy controls whether Kubernetes restarts a crashed container. Deployments and StatefulSets use Always (the default, and the only value they accept), which means Kubernetes always restarts a container that exits. Jobs and CronJobs use OnFailure or Never. With Always, Kubernetes restarts the container even after a clean exit (exit code 0), so a process that finishes its work and exits over and over also ends up in CrashLoopBackOff.

Why kubectl logs --previous matters

During the backoff window, the current container has not started yet. Running kubectl logs POD_NAME returns nothing or an error. The flag --previous tells kubectl to fetch logs from the last terminated container, the one that actually crashed. This is nearly always where the root cause is visible: an exception stack trace, a “file not found” error, a “connection refused” message, or a segfault.

Where to find the exit code and reason

Run kubectl describe pod POD_NAME and look for the Last State section under the container entry. It shows:

  • Reason: a human-readable label like OOMKilled, Error, or Completed
  • Exit Code: the numeric code the process returned (0, 1, 137, 143, etc.)
  • Started / Finished: timestamps showing how long the container ran before it crashed

The exit code and reason together are your first diagnostic data point. A container that ran for 0 seconds before crashing has a different problem than one that ran for 30 minutes. The exit code narrows the cause category. The pod status (CrashLoopBackOff) tells you nothing about why. The Last State section does.

Watch out

If Started and Finished timestamps are identical (the container ran for 0 seconds), the process crashed at startup. If the container ran for minutes or hours before crashing, look for memory leaks, connection pool exhaustion, or timeout-driven failures instead of configuration errors.
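
To turn the Started and Finished timestamps into a run duration without reading them by eye, something like the following may help. It is a sketch under assumptions: run_seconds and last_run_seconds are hypothetical helper names, GNU date (the -d flag) is assumed, and only the first container in the pod is inspected.

```shell
# run_seconds START FINISH: seconds between two RFC 3339 timestamps.
# Assumes GNU date, which accepts the -d flag.
run_seconds() {
  start=$(date -u -d "$1" +%s) || return 1
  finish=$(date -u -d "$2" +%s) || return 1
  echo $((finish - start))
}

# last_run_seconds POD NAMESPACE: how long the first container's last run lasted.
last_run_seconds() {
  times=$(kubectl get pod "$1" -n "$2" -o \
    jsonpath='{.status.containerStatuses[0].lastState.terminated.startedAt} {.status.containerStatuses[0].lastState.terminated.finishedAt}')
  # Left unquoted on purpose so the two timestamps become two arguments.
  run_seconds $times
}
```

A result of 0 or 1 points at a startup failure; a result in the hundreds or thousands points at something that builds up over time.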

When to use this guide

This page helps when you see any of these situations:

  • Pod status shows CrashLoopBackOff in kubectl get pods
  • The RESTARTS column keeps increasing
  • Your application starts but immediately exits
  • The pod runs for a few seconds then gets killed
  • Events show “Back-off restarting failed container”
  • A liveness or readiness probe keeps killing your container
  • You deployed a new image and the pod will not stabilise

If the pod status shows ImagePullBackOff instead of CrashLoopBackOff, the container image cannot be pulled. That is a different problem entirely. See the ImagePullBackOff section below for the fix.

If the pod status shows Pending and never transitions to Running or CrashLoopBackOff, the issue is scheduling. The cluster cannot find a node with enough resources. That is outside the scope of this guide.

Fast triage: 5-minute workflow

Follow these five steps in order. By step 5, you will know which category the crash falls into and can jump to the matching fix section below.

Analogy

This workflow is like diagnosing a car that will not start. Step 1: check the dashboard warning lights (pod status). Step 2: pop the bonnet and look at the engine (describe). Step 3: check the error log from the last drive (previous logs). Step 4: ask a mechanic what they noticed (events). Step 5: decide whether it is a fuel, electrical, or engine problem (classify).

Step 1: List the pods and confirm the status

kubectl get pods -n NAMESPACE

Look at the STATUS and RESTARTS columns. A high restart count means the crash has been happening for a while. Note the exact pod name for the next steps.
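
In a busy namespace it can help to filter the list down to the crashing pods. The sketch below assumes the single-namespace column order NAME READY STATUS RESTARTS AGE; crashlooping_pods is a hypothetical helper, not a kubectl feature.

```shell
# crashlooping_pods: reads `kubectl get pods --no-headers` output on stdin and
# prints "NAME RESTARTS" for every pod whose STATUS is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -n NAMESPACE --no-headers | crashlooping_pods
```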

Step 2: Describe the pod

kubectl describe pod POD_NAME -n NAMESPACE

Focus on three sections in the output:

  • Last State (under each container): the exit code and reason from the last crash
  • Events (at the bottom): Kubernetes-level events like image pull errors, probe failures, and OOM kills
  • Containers, Restart Count: confirms which container is crashing in multi-container pods

Step 3: Read the previous logs

# Single-container pod
kubectl logs POD_NAME -n NAMESPACE --previous

# Multi-container pod: specify the crashing container
kubectl logs POD_NAME -n NAMESPACE -c CONTAINER_NAME --previous

This is almost always where you find the actual error. Look for exception stack traces, “connection refused” errors, “file not found” messages, or “permission denied” lines.

Step 4: Check recent events

kubectl get events -n NAMESPACE --sort-by='.lastTimestamp'

Events show cluster-level context that logs miss: failed image pulls, failed scheduling attempts, node pressure events, and probe failures.

Step 5: Classify the root cause

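Based on steps 2–4, the crash falls into one of these categories. Each one has its own section below:

  • Application error (exit code 1, or 126/127 for a broken entrypoint)
  • OOMKilled (exit code 137 with Reason: OOMKilled)
  • Liveness or startup probe failure (“Liveness probe failed” in Events)
  • Dependency failure (“connection refused” or DNS errors in the previous logs)
  • Unexpected SIGTERM (exit code 143)
  • Init container failure (the main containers never start)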
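
One rough way to express that classification is a small decision function over the fields you collected in steps 2 and 4. This is a sketch, not an official taxonomy: classify_crash is a hypothetical helper, and the mapping simply mirrors the sections of this page.

```shell
# classify_crash REASON EXIT_CODE [PROBE_FAILED]
# REASON and EXIT_CODE come from Last State in `kubectl describe pod`.
# PROBE_FAILED is "yes" when Events contain "Liveness probe failed".
classify_crash() {
  reason=$1
  code=$2
  probe=${3:-no}
  if [ "$reason" = "OOMKilled" ]; then
    echo "OOMKilled: raise the memory limit or reduce usage"
  elif [ "$probe" = "yes" ]; then
    echo "Probe failure: review liveness and startup probe settings"
  else
    case "$code" in
      0)       echo "Clean exit: check restartPolicy and the container command" ;;
      126|127) echo "Entrypoint problem: command not executable or not found" ;;
      137)     echo "SIGKILL without OOM: check probes and the grace period" ;;
      143)     echo "SIGTERM: check for evictions, drains, and signal handling" ;;
      *)       echo "Application error: read the --previous logs" ;;
    esac
  fi
}
```

For example, classify_crash Error 1 prints the application-error line, which sends you to the previous logs first.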

Reading exit codes

The exit code from the previous container run is your first diagnostic data point. You can find it in the kubectl describe pod output under Last State, Exit Code.

  • Exit code 0. The container exited successfully. It did not crash. If the pod still shows CrashLoopBackOff, either the restartPolicy is Always and the container is being restarted after a clean exit, or a liveness probe is failing and forcing a restart even though the app completed normally.

  • Exit code 1. Application error. An uncaught exception, a missing configuration file, a failed database connection on startup, or any unhandled error that caused the process to call exit(1). Read --previous logs for the specific error.

  • Exit code 126. The command specified in the container entrypoint was found but is not executable. Common when a shell script is missing the executable bit or the binary format does not match the container architecture (e.g. running an ARM image on an AMD64 node).

  • Exit code 127. The command was not found. The entrypoint or command in the Dockerfile or pod spec refers to a binary that does not exist in the container image. Check the image contents with docker run --rm -it --entrypoint sh IMAGE.

  • Exit code 137 (SIGKILL, often OOMKilled). The container exceeded its memory limit and was killed by the Linux kernel OOM killer, or it received SIGKILL from Kubernetes (e.g. after a failed liveness probe and an expired grace period). Check the Reason field: OOMKilled confirms a memory issue.

  • Exit code 143. The container received SIGTERM and shut down. This is normal during rolling updates, node drains, and scale-down events. If it appears as CrashLoopBackOff, the container may not be handling SIGTERM gracefully and is being killed by SIGKILL after the termination grace period expires.

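  • Exit code 255. A generic or out-of-range exit status, for example a process that calls exit(-1). Treat it like exit code 1 and read the --previous logs. Note that a segmentation fault normally appears as exit code 139 (128 + 11, SIGSEGV), and an unrecovered Go panic exits with code 2.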

Tip

Exit codes above 128 usually indicate the process was killed by a signal. The signal number is the exit code minus 128. Exit code 137 = 128 + 9 (SIGKILL). Exit code 143 = 128 + 15 (SIGTERM). This formula helps you identify unexpected signals quickly.
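
The subtraction can be left to the shell. The sketch below relies on the shell builtin kill -l NUMBER, which prints a signal name for a signal number; signal_from_exit_code is a hypothetical helper name.

```shell
# signal_from_exit_code CODE: name the signal behind an exit code above 128.
signal_from_exit_code() {
  code=$1
  if [ "$code" -le 128 ]; then
    echo "none"          # the process exited on its own
    return 0
  fi
  # kill -l fails for numbers that are not valid signals (e.g. 255 - 128).
  name=$(kill -l "$((code - 128))" 2>/dev/null) || { echo "unknown"; return 1; }
  echo "SIG$name"
}

signal_from_exit_code 137   # prints SIGKILL
signal_from_exit_code 143   # prints SIGTERM
```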

Fixing application crashes (exit code 1)

Exit code 1 means the application itself crashed during startup or shortly after. The previous logs contain the specific error. Start here:

kubectl logs POD_NAME -n NAMESPACE --previous

Missing environment variables

The application reads a required environment variable that is not set in the pod spec. The logs typically show “undefined,” “KeyError,” “env var X not set,” or a NullPointerException when the code tries to use the value.

Fix by adding the variable to the container spec:

containers:
- name: my-app
  env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: url
  - name: APP_ENV
    valueFrom:
      configMapKeyRef:
        name: app-config
        key: environment

Check which environment variables the pod currently has:

kubectl exec POD_NAME -n NAMESPACE -- env
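
If the pod crashes too quickly for kubectl exec to work, a fail-fast check at the top of the container entrypoint makes the previous logs say exactly which variable is missing. This is a sketch, assuming a shell entrypoint; require_env is a hypothetical helper and /app/server is a placeholder for your own binary.

```shell
# require_env NAME...: fail with one clear log line if any variable is
# unset or empty, instead of crashing later with a vague error.
require_env() {
  missing=""
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "FATAL: required environment variables not set:$missing" >&2
    return 1
  fi
}

# In an entrypoint script (variable names taken from the pod spec above):
#   require_env DATABASE_URL APP_ENV || exit 1
#   exec /app/server    # hypothetical application binary
```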

Failed database connection

The app tries to connect to a database on startup and fails because the hostname is wrong, the Cloud SQL Auth Proxy sidecar has not started yet, or credentials are incorrect. If the sidecar starts after the main container, the connection attempt fails before the proxy is ready.

See Cloud SQL Connection Refused for the full diagnosis. For sidecar ordering, consider adding a startup probe or an init container that waits for the database port to become available.
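
The wait step in such an init container or entrypoint could look roughly like this. It is a sketch under assumptions: the image contains bash (for /dev/tcp) and coreutils timeout, wait_for_port is a hypothetical helper, and 127.0.0.1:5432 in the usage line is only an example for a PostgreSQL proxy.

```shell
# wait_for_port HOST PORT TIMEOUT_SECONDS
# Polls about once per second until a TCP connection succeeds or time runs out.
wait_for_port() {
  host=$1
  port=$2
  deadline=$3
  waited=0
  while [ "$waited" -lt "$deadline" ]; do
    if timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "$host:$port is accepting connections"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out after ${deadline}s waiting for $host:$port" >&2
  return 1
}

# Usage (example address for a local database proxy):
#   wait_for_port 127.0.0.1 5432 60 || exit 1
```

Exiting non-zero on timeout keeps the failure visible in the logs instead of letting the application start against a dependency that is not there.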

Missing files or configuration

A mounted ConfigMap, Secret, or volume is missing or has the wrong path. The logs show “file not found” or “no such file or directory.” Verify volumes are mounted correctly:

# Check volume mounts in the pod spec
kubectl describe pod POD_NAME -n NAMESPACE | grep -A 5 "Mounts:"

# Verify the file exists inside the container (if it is briefly running)
kubectl exec POD_NAME -n NAMESPACE -- ls -la /path/to/config/

Common trap

Do not delete and recreate the pod to “start fresh.” Deleting the pod erases the previous logs and restart history you need for diagnosis. Investigate first, then fix the Deployment or StatefulSet spec. Kubernetes will roll out new pods automatically.

Permission errors calling GCP APIs

GKE pods use Workload Identity or the node service account for GCP API calls. A missing IAM binding causes “permission denied” or “forbidden” errors that crash the application on startup. Check which service account the pod is using:

# Check the Kubernetes service account
kubectl get pod POD_NAME -n NAMESPACE -o jsonpath='{.spec.serviceAccountName}'

# Check whether Workload Identity is configured on the KSA
kubectl describe serviceaccount KSA_NAME -n NAMESPACE

If the annotation iam.gke.io/gcp-service-account is missing, the pod falls back to the node service account. See Permission Denied Errors for the full fix.

Fixing OOMKilled (exit code 137)

OOMKilled means the container exceeded its memory limit and the Linux kernel killed it. This is one of the most common causes of CrashLoopBackOff on GKE.

Confirm the OOM kill

# Check the termination reason
kubectl describe pod POD_NAME -n NAMESPACE | grep -A 3 "Last State:"

# Check actual memory usage across pods
kubectl top pods -n NAMESPACE

If the Reason field shows OOMKilled, the fix is to increase the memory limit, reduce the application’s memory consumption, or both.

Set appropriate resource limits

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
      - name: my-app
        image: my-image:latest
        resources:
          requests:
            memory: "256Mi"    # Minimum reserved for the container
            cpu: "100m"
          limits:
            memory: "512Mi"    # Hard ceiling. OOMKilled if exceeded
            cpu: "500m"

# Apply the updated deployment
kubectl apply -f deployment.yaml

# Or patch directly without editing the file
kubectl set resources deployment my-app \
  -c my-app \
  --limits=memory=512Mi,cpu=500m \
  --requests=memory=256Mi,cpu=100m

Analogy

Think of requests and limits like booking a hotel room. The request is the room you reserved: it is guaranteed to be there when you arrive. The limit is the maximum room size the hotel will allow. If you try to move into a suite that exceeds your booking class, the hotel kicks you out (OOMKilled).

Sizing guidelines

  • Requests set the guaranteed minimum. The scheduler uses requests to decide which node to place the pod on.

  • Limits set the hard ceiling. The container is killed if it exceeds the memory limit.

  • Set memory limits to at least 1.5–2× the typical peak usage to give headroom for load spikes, garbage collection bursts, and JVM metaspace growth.

  • If the OOM kills happen gradually (the container runs for minutes before being killed), suspect a memory leak. Profile the application or check for unbounded caches.
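
The 1.5–2× guideline is simple arithmetic, sketched below. suggest_memory_limit is a hypothetical helper, and the peak figure is something you have to supply yourself: kubectl top shows current usage only, so take the peak from your monitoring history.

```shell
# suggest_memory_limit PEAK_MI: print the 1.5x to 2x range for a peak in MiB.
suggest_memory_limit() {
  peak=$1
  low=$((peak * 3 / 2))
  high=$((peak * 2))
  echo "${low}Mi-${high}Mi"
}

suggest_memory_limit 256   # prints 384Mi-512Mi
```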

Tip

On GKE Standard clusters, Metrics Server is installed automatically, so kubectl top pods works immediately. On Autopilot clusters, GKE manages resource limits automatically based on the requests you set. If you are on Autopilot and see OOMKilled, increase the memory request. Autopilot adjusts limits accordingly.

Liveness and readiness probe failures

Kubernetes kills a container if its liveness probe fails consecutively. This can cause CrashLoopBackOff even when the application itself is healthy but temporarily slow or still starting up. Readiness probe failures do not kill the container, but they remove it from the Service, which can cause cascading failures if all replicas become unready simultaneously.

Detect probe failures

# Check for probe failure events
kubectl describe pod POD_NAME -n NAMESPACE | grep -E "(Liveness|Readiness|Startup) probe"

# Check current probe configuration
kubectl get pod POD_NAME -n NAMESPACE -o jsonpath='{.spec.containers[0].livenessProbe}' | python3 -m json.tool

If you see “Liveness probe failed” in the events, the probe is killing the container. The application may be perfectly healthy but too slow to respond within the probe timeout.

Common probe mistakes

  • Probe fires before the app is ready. The initialDelaySeconds is shorter than the application’s startup time. The probe runs, gets no response, and kills the container repeatedly.
  • Timeout too short. The probe timeoutSeconds is 1 second, but the health endpoint takes 2 seconds under load. The probe fails even though the app is running.
  • Wrong port or path. The probe targets port 8080 but the app listens on 3000, or the probe path is /health but the app serves /healthz.
  • failureThreshold: 1. A single slow response kills the container. Always set failureThreshold to at least 3.

Warning

Liveness probe failures look identical to application crashes in kubectl get pods. Both show CrashLoopBackOff with a rising restart count. The only way to tell them apart is kubectl describe pod: look for “Liveness probe failed” in the Events section. If you see it, the kubelet is restarting the container because of the probe, and the application may be healthy but slow. Rule out probes before you start debugging application code.
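
That check can be scripted as a first pass. The sketch below only looks for the kubelet's “Liveness probe failed” and “Startup probe failed” event messages; crash_source is a hypothetical helper. Readiness failures are deliberately ignored because they do not restart the container.

```shell
# crash_source: reads `kubectl describe pod` output (or events) on stdin.
# Prints "probe" when a liveness or startup probe failure was recorded,
# otherwise "application".
crash_source() {
  if grep -E -q '(Liveness|Startup) probe failed'; then
    echo "probe"
  else
    echo "application"
  fi
}

# Usage:
#   kubectl describe pod POD_NAME -n NAMESPACE | crash_source
```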

Fix the probe configuration

# Use a startupProbe for slow-starting applications
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30       # 30 attempts × 10s = 5 minutes to start
  periodSeconds: 10

# The livenessProbe only begins after the startupProbe succeeds
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0     # startupProbe already handled the delay
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3        # Require 3 consecutive failures before killing

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

Tip

If your application has a variable startup time (common with JVM-based apps, large model loading, or database migrations), use a startupProbe instead of relying on a large initialDelaySeconds on the liveness probe. Kubernetes holds off the liveness and readiness probes until the startup probe succeeds. This avoids both premature kills and unnecessarily long delays before liveness checking begins.
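
The time a container gets before a probe restarts it is roughly failureThreshold × periodSeconds, plus any initialDelaySeconds. A quick sketch of that arithmetic, using the values from the configuration above (probe_budget is a hypothetical helper, and the result is approximate because timeoutSeconds is not included):

```shell
# probe_budget FAILURE_THRESHOLD PERIOD_SECONDS [INITIAL_DELAY_SECONDS]
# Approximate seconds before repeated probe failures trigger a restart.
probe_budget() {
  threshold=$1
  period=$2
  initial=${3:-0}
  echo $((initial + threshold * period))
}

probe_budget 30 10   # startupProbe: prints 300
probe_budget 3 10    # livenessProbe: prints 30
```

If your slowest observed startup is close to the first number, raise failureThreshold on the startup probe before touching anything else.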

Dependency failures

If the previous logs show “connection refused,” “connection timed out,” “name resolution failed,” or “no route to host,” the container started successfully but crashed because it could not reach a dependency. This could be a database, an external API, a message queue, or another microservice.

Analogy

A dependency failure is like arriving at work and finding the office door locked. You (the container) started up fine, but you cannot do your job because something you depend on is unavailable. The fix is not in your code; it is in whatever is behind the locked door.

Common dependency failure patterns

  • Cloud SQL Auth Proxy not ready. The main container starts before the Auth Proxy sidecar is listening. Add retry logic or an init container that waits for the proxy port. See Cloud SQL Connection Refused.

  • DNS not resolving. The container tries to reach a service by hostname but cluster DNS (CoreDNS) has not resolved it yet, or the hostname is wrong. Check DNS resolution inside the pod:

# Test DNS resolution from inside the pod
kubectl exec POD_NAME -n NAMESPACE -- nslookup SERVICE_NAME.NAMESPACE.svc.cluster.local

# If the pod is crashing too fast, run a temporary debug pod
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup SERVICE_NAME.NAMESPACE.svc.cluster.local

  • Network policy blocking traffic. A NetworkPolicy in the namespace may be blocking egress to the dependency. Check:

kubectl get networkpolicies -n NAMESPACE

  • External API unreachable. The pod needs to reach an external endpoint but lacks internet egress. On private GKE clusters, nodes do not have external IP addresses by default. You need Cloud NAT configured for the subnet. See Private GKE Clusters for the full setup.

SIGTERM handling (exit code 143)

Exit code 143 means the container received SIGTERM (signal 15) and shut down. SIGTERM is sent during rolling updates, node drains, and scale-down events. If the container does not handle SIGTERM and shut down within the terminationGracePeriodSeconds (default 30 seconds), Kubernetes sends SIGKILL to force-stop it.

If exit code 143 appears in a CrashLoopBackOff cycle (not during a deployment), something is sending SIGTERM to the container unexpectedly. Possible causes:

  • A liveness probe failure (Kubernetes sends SIGTERM before SIGKILL)
  • The node is under memory pressure and evicting pods
  • A preemptible or Spot VM was reclaimed
  • A cluster autoscaler is draining the node

Check events for eviction or drain messages:

kubectl get events -n NAMESPACE --sort-by='.lastTimestamp' | grep -i -E "(evict|drain|preempt)"
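
If the container's main process is a shell script, handling SIGTERM can be as small as a trap that sets a flag. The following is a minimal sketch, not a full process supervisor: worker_loop is a hypothetical stand-in for your long-running work, and the sleep is a placeholder for one unit of it.

```shell
# worker_loop: runs until SIGTERM arrives, then exits cleanly with status 0.
worker_loop() {
  stop=0
  trap 'stop=1' TERM
  while [ "$stop" -eq 0 ]; do
    sleep 0.2   # placeholder for one unit of real work
  done
  echo "SIGTERM received, finishing in-flight work and exiting"
  return 0
}

# In an entrypoint script:
#   worker_loop
```

The shell runs the trap only after the current foreground command returns, so keep each unit of work well under terminationGracePeriodSeconds. When the entrypoint launches a real application binary, using exec to replace the shell lets the application receive SIGTERM directly.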

ImagePullBackOff alongside CrashLoopBackOff

Some pods show ImagePullBackOff rather than CrashLoopBackOff. This means the container image cannot be pulled. This is a completely different problem. The container never started, so there are no application logs to read.

Diagnose the image pull failure

# Check the events for the specific pull error
kubectl describe pod POD_NAME -n NAMESPACE | grep -A 5 "Events:"

# Verify the image exists in Artifact Registry
gcloud artifacts docker images list REGION-docker.pkg.dev/PROJECT/REPO

# Check which service account the GKE node pool uses for image pulls
gcloud container clusters describe CLUSTER_NAME \
  --zone=ZONE \
  --format="value(nodeConfig.serviceAccount)"

Common image pull failure causes

  • Image tag does not exist. The tag was never pushed, was overwritten, or you have a typo. Verify with gcloud artifacts docker images list.

  • Registry permissions. GKE nodes need roles/artifactregistry.reader on the Artifact Registry repository to pull images. Grant it to the node service account:

gcloud artifacts repositories add-iam-policy-binding REPO \
  --location=REGION \
  --member="serviceAccount:NODE_SA@PROJECT.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.reader"

  • Private registry without imagePullSecrets. If you are pulling from a registry outside GCP, the pod needs an imagePullSecret configured.

  • Using :latest with a cached image. When no imagePullPolicy is set, Kubernetes defaults to Always for :latest, so every container start pulls the tag again. If the policy is explicitly IfNotPresent or Never, a node that already has an older :latest cached keeps running the stale image. Set imagePullPolicy: Always or use immutable tags (e.g. the git SHA).

Danger

Never use :latest tags in production. Two deployments can run different versions of the same tag, and rollbacks become impossible because there is no immutable reference to roll back to. Use the git commit SHA or a build number as the tag instead.
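
A quick way to spot mutable tags is to parse the image reference in the pod spec. The sketch below is a simplified parser, not a full implementation of the image reference grammar; image_tag and is_mutable_tag are hypothetical helpers.

```shell
# image_tag IMAGE_REF: print the tag, "latest" when no tag is given
# (the default Kubernetes applies), or "digest" for @sha256 references.
image_tag() {
  ref=$1
  case "$ref" in
    *@*) echo "digest"; return 0 ;;
  esac
  last=${ref##*/}   # drop registry and path, which may contain a port
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# is_mutable_tag IMAGE_REF: succeeds when the reference resolves to :latest.
is_mutable_tag() {
  [ "$(image_tag "$1")" = "latest" ]
}

# Usage:
#   kubectl get pod POD_NAME -n NAMESPACE -o jsonpath='{.spec.containers[*].image}'
#   is_mutable_tag my-image:latest && echo "pin this image to an immutable tag"
```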

Init container failures

If an init container fails, the main containers never start, and the pod may enter CrashLoopBackOff. Init containers run sequentially before the main containers and must complete successfully.

# Check init container status
kubectl describe pod POD_NAME -n NAMESPACE | grep -A 10 "Init Containers:"

# Get logs from a failed init container
kubectl logs POD_NAME -n NAMESPACE -c INIT_CONTAINER_NAME

Common init container failures: a database migration that fails, a secret-fetching init container that lacks IAM permissions, or a network-check init container that cannot reach a dependency.

Warning

kubectl logs without -c does not select an init container. You must explicitly name the init container with -c to get its logs, and add --previous if it has already been restarted. If you skip this step, you may miss the real failure entirely.
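
To find out which init container to pass to -c, you can list each one with the exit code of its last termination. This is a sketch: both function names are hypothetical, and the jsonpath expression assumes the init container has terminated at least once.

```shell
# init_container_exit_codes POD NAMESPACE: print NAME<TAB>EXIT_CODE per init container.
init_container_exit_codes() {
  kubectl get pod "$1" -n "$2" -o \
    jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# failed_init_containers: reads NAME<TAB>EXIT_CODE lines on stdin and prints
# the names whose last termination had a non-zero exit code.
failed_init_containers() {
  awk -F '\t' '$2 != "" && $2 != "0" { print $1 }'
}

# Usage:
#   init_container_exit_codes POD_NAME NAMESPACE | failed_init_containers
#   kubectl logs POD_NAME -n NAMESPACE -c INIT_CONTAINER_NAME --previous
```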

GKE-specific considerations

Some CrashLoopBackOff causes are specific to GKE or more common on GKE than on other Kubernetes platforms.

  • Workload Identity misconfiguration. The Workload Identity binding between the Kubernetes service account (KSA) and the GCP service account (GSA) is missing or incorrect. The pod authenticates as the wrong identity and gets “permission denied” on every GCP API call.

  • Autopilot resource adjustments. On GKE Autopilot, GKE automatically adjusts resource limits based on your requests. If your requests are too low, Autopilot may set limits that are still insufficient for peak load.

  • Node pool machine type too small. On Standard clusters, if the node pool uses small machine types (e.g. e2-micro), system pods consume a large fraction of available resources, leaving little for your workloads.

  • GKE version skew. If the cluster control plane and node pools are on different Kubernetes minor versions, API incompatibilities can cause unexpected behaviour in admission controllers or mutating webhooks. Keep node pools within one minor version of the control plane. See Upgrading GKE Clusters Safely.

Common beginner mistakes

  1. Reading logs from the current container instead of the previous one. During the backoff window, the current container has not started. Always use --previous to get logs from the last crash.

  2. Setting memory limits too close to baseline usage. A limit only slightly above baseline triggers OOMKilled under any load spike. Set limits with a significant buffer above typical peak, at least 1.5–2× peak usage.

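  3. Not setting resource requests at all. Without requests (and with no limits to derive them from), Kubernetes does not reserve CPU or memory for the pod and treats it as BestEffort. Under node memory pressure, BestEffort pods are the first to be evicted or OOM-killed, which can produce repeated restarts on an otherwise healthy application.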

  4. Using failureThreshold: 1 on liveness probes. A single slow probe response kills the container. Always set failureThreshold to at least 3.

  5. Assuming CrashLoopBackOff is a Kubernetes bug. CrashLoopBackOff is almost always an application-level or configuration-level issue. Kubernetes is doing its job correctly. The container is the thing that needs fixing.

  6. Deleting and recreating the pod instead of investigating. Deleting the pod erases the previous logs and restart history. Investigate first, then fix the Deployment or StatefulSet spec. The new pods will roll out automatically.

  7. Using :latest tags with imagePullPolicy: IfNotPresent. Nodes cache images. Kubernetes defaults to Always for :latest, but if the policy is set to IfNotPresent and the node already has an old :latest cached, the pod runs stale code. Use immutable tags (e.g. the git commit SHA) instead.

Frequently asked questions

What exactly is CrashLoopBackOff and why does the backoff time keep increasing?

CrashLoopBackOff means a container is crashing repeatedly and Kubernetes is applying exponential backoff before restarting it. The backoff starts at 10 seconds, doubles with each restart (20s, 40s, 80s...), and caps at 5 minutes. Kubernetes does this to avoid hammering a failing system. The container will keep trying to restart until you fix the underlying problem. CrashLoopBackOff is not the error itself. It is the restart management state.

How do I see the logs from a pod that keeps crashing before I can read them?

Use kubectl logs POD_NAME --previous to see the logs from the last terminated container. If the pod is in the middle of the backoff wait, the current container has not started yet, so --previous is the only way to get the most recent crash logs. You can also use kubectl describe pod POD_NAME to see the last exit code and termination reason without needing the logs.

My pod shows OOMKilled in the exit reason. What causes this and how do I fix it?

OOMKilled (exit code 137) means the Linux kernel killed the container because it exceeded its configured memory limit. Fix it by increasing the memory limit in the container spec, reducing the application's memory usage, or both. The request sets the minimum reserved; the limit is the ceiling. Set limits to at least 1.5–2× typical peak usage to give headroom for bursts.

A liveness probe is killing my container before it finishes starting up. How do I fix this?

Increase the initialDelaySeconds on the liveness probe to give the application time to start before the probe begins. Set failureThreshold to at least 3 so a single slow response does not kill the container. If startup is highly variable, consider adding a separate startupProbe which disables the liveness probe until the startup check passes.

What is the difference between CrashLoopBackOff and ImagePullBackOff?

CrashLoopBackOff means the container image was pulled successfully and the container started, but then crashed. ImagePullBackOff means Kubernetes could not pull the container image at all. The image tag does not exist, the registry is unreachable, or the node lacks permission to pull from the registry. Check kubectl describe pod to see which state your pod is in and follow the matching fix.

How do I tell whether a CrashLoopBackOff is caused by a probe failure or an application crash?

Run kubectl describe pod POD_NAME and check the Events section. If the events show "Liveness probe failed" (or "Startup probe failed") messages, a probe is killing the container. "Readiness probe failed" on its own does not restart the container; it only removes the pod from the Service. If the events only show "Back-off restarting failed container," the application is crashing on its own. Also check the Last State section: exit code 137 with reason OOMKilled points to a memory issue, while exit code 1 with reason Error points to an application-level crash.

Last verified: 27 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.