How to Debug Production Issues in GCP: Metrics, Logs, Traces, and Rollbacks

This page teaches a practical workflow for debugging live production issues in Google Cloud. You will learn when to check metrics, how to write fast log queries, how to use traces to find the failing operation in a distributed system, and how to correlate incidents with recent deployments. The goal is to cut your time-to-diagnosis. This workflow is for live production issues where you cannot attach a local debugger to a running service.

Incident triage fast path
  1. Open Cloud Monitoring: check error rate, latency, and request rate for the last 2 hours
  2. Note the exact time the behavior changed and which services are affected
  3. Check Cloud Audit Logs for deployments or config changes at that time
  4. Open Logs Explorer: filter to the affected service, the incident time window, and severity>=ERROR
  5. If the error source is unclear across multiple services, open Cloud Trace to find the failing span
  6. If the issue is CPU or memory with no clear error, open Cloud Profiler

Warning

If you find a deployment that matches the incident start time, roll back first and diagnose after. Do not spend time in logs or traces while users are affected. A working service with an unexplained cause is better than a broken service with a detailed investigation underway.

Simple explanation

Each signal type answers a different question about a production incident:

  • Metrics tell you when and where something changed: error rate spiked, latency increased, request volume dropped.
  • Logs tell you what error happened: the message, stack trace, and request context for each individual failure.
  • Traces tell you which operation in the request path was slow or failing. In distributed systems, the service that returns an error is often not where the error originated. Each traced operation is called a span.
  • Profiler helps when the issue is CPU, memory, or lock contention over time. It shows where your code spends its runtime resources, not just when something failed.

Each signal costs time to search. Use them in order, from coarse to fine, so each step narrows the next one.

Analogy

Debugging production is like diagnosing a car breakdown. First, check the dashboard warning lights: when did they come on, and which ones lit up? That is metrics. Then look under the hood for anything obviously wrong: a disconnected hose, a burnt smell, a visible crack. That is logs. Then plug in the diagnostic scanner to find which specific component is throwing the fault code. That is traces. Work from coarse to fine. Do not start replacing parts until you know which one is broken.

When to use this guide

This page is the right starting point when you are dealing with:

  • A sudden spike in error rate or 5xx responses
  • Rising p99 latency (the slowest 1% of requests) with no obvious cause
  • Partial outages where some requests succeed and some fail
  • Post-deploy regressions where a service behaved correctly before a recent change
  • Intermittent failures that are hard to reproduce locally

If you already know which service is failing and what kind of failure it is, jump directly to the relevant troubleshooting page: Cloud Run container failed to start, GKE CrashLoopBackOff, Cloud SQL connection refused, or Cloud Functions failures.

Metrics vs Logs vs Traces vs Profiler

Signal | Best for | Fastest question it answers | Common weakness | Primary GCP tool
Metrics | Scope and timing | When did it break, and how widespread? | No request-level detail | Cloud Monitoring
Logs | Error detail | What was the exact error message? | Slow to search without filters | Logs Explorer
Traces | Request path | Which service or query caused the slowness? | Only covers instrumented paths | Cloud Trace
Profiler | Resource behavior | Where is CPU or memory time going? | Not real-time; needs a sampling period | Cloud Profiler

Note

The order matters as much as the tools. Logs are more detailed than metrics, but jumping to logs without context from metrics first is slower overall. Each signal in the table above produces better results when the one above it has already narrowed the search space.

How the debugging workflow works

Production debugging is an operational decision tree. Each step narrows the search space for the next one.

  1. Start with metrics to establish the blast radius (which services are affected, which regions, how many users) and the precise start time. Open your service dashboard and look at the last 2 hours.
  2. As soon as you have the start time, check what changed. Most sudden production incidents are caused by a recent deployment or configuration update. Check Cloud Audit Logs before spending time anywhere else.
  3. Use logs to find the error pattern. With the time window from step 1, your log filter is now specific. Search for severity>=ERROR in the affected service during the incident window. Read the full log entry: stack traces, request IDs, and dependency errors are usually in structured log fields alongside the message.
  4. Use traces to locate the failing hop when logs show an error but the root cause is in a different service. A span is a single operation within a distributed request. The trace shows all spans in sequence so you can see which one failed or took the most time.
  5. Use profiler only when the issue points to resource behavior. If metrics show high CPU or memory and logs show no errors, Cloud Profiler reveals which function or code path is consuming the most resources over time.
  6. Mitigate first, diagnose second. If a rollback is available and the incident is ongoing, roll back before spending time diagnosing. Diagnose the root cause once the service has recovered.

Step-by-step production debugging workflow

Step 1: Confirm what changed in metrics

Open Cloud Monitoring and check request rate, error rate, and latency for the affected service over the last 2 hours. Identify the exact minute the behavior changed. Note whether the change is sudden (deployment, config push) or gradual (resource exhaustion, traffic growth).

If you have alerting policies configured, they will tell you which metric threshold was breached and when. If you do not have alerts yet, this incident is a practical reason to add them after the postmortem.
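
If you prefer the command line to the dashboard, the same error signal can be pulled from the Cloud Monitoring API. The following is a minimal sketch, not a definitive implementation. It assumes a Cloud Run service, the run.googleapis.com/request_count metric with its response_code_class label, and an authenticated gcloud installation. The function name list_5xx_counts, the project my-app-prod, and the timestamps in the usage comment are placeholders.

```shell
# Sketch: list 5xx request counts for one Cloud Run service through the
# Cloud Monitoring API (projects.timeSeries.list).
# Arguments (all placeholders): project ID, service name, window start and
# end as RFC 3339 timestamps.
list_5xx_counts() {
  project="$1"; service="$2"; start="$3"; end="$4"

  # Monitoring filter: one metric type, one service, server errors only.
  filter="metric.type=\"run.googleapis.com/request_count\""
  filter="$filter AND resource.labels.service_name=\"$service\""
  filter="$filter AND metric.labels.response_code_class=\"5xx\""

  # -G sends the --data-urlencode values as URL query parameters.
  curl -sS -G "https://monitoring.googleapis.com/v3/projects/$project/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=$filter" \
    --data-urlencode "interval.startTime=$start" \
    --data-urlencode "interval.endTime=$end"
}

# Example call (hypothetical values):
# list_5xx_counts my-app-prod api-service 2026-03-25T10:00:00Z 2026-03-25T12:00:00Z
```

The response is JSON with one time series per label combination. Scan the point timestamps for the first interval where the count jumps; that is the start time to carry into the next steps.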

Step 2: Check what changed in the environment

The most common cause of a sudden production incident is a recent deployment or configuration change. Before opening logs or traces, spend two minutes here.

Warning

Skipping this step is the single most common mistake in production debugging. It takes two minutes to check Cloud Audit Logs. Skipping it can mean spending thirty minutes searching logs in the wrong service for a cause that was a bad deployment all along.

In Logs Explorer, filter Cloud Audit Logs for admin activity. This filter shows all API-level changes across your project:

log_id("cloudaudit.googleapis.com/activity")

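To narrow to Cloud Run deployments specifically (this method name covers updates to an existing service through the v1 Admin API; deployments made through the v2 API are logged under google.cloud.run.v2.Services.UpdateService instead):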

log_id("cloudaudit.googleapis.com/activity")
protoPayload.serviceName="run.googleapis.com"
protoPayload.methodName="google.cloud.run.v1.Services.ReplaceService"

The protoPayload.authenticationInfo.principalEmail field shows who made each change. If the deployment timestamp aligns with the incident start time, roll back to the previous revision and confirm the service recovers before continuing your investigation.
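
The same audit-log check can be run from a terminal. This is a sketch under stated assumptions: the function name recent_changes, the project my-app-prod, and the timestamps are placeholders, and the window should be a few minutes either side of the start time you found in metrics.

```shell
# Sketch: list admin-activity audit entries inside a time window, showing
# when each change happened, which API method was called, and who called it.
# Arguments (placeholders): project ID, window start, window end (RFC 3339).
recent_changes() {
  project="$1"; start="$2"; end="$3"
  gcloud logging read \
    "log_id(\"cloudaudit.googleapis.com/activity\") AND timestamp>=\"$start\" AND timestamp<=\"$end\"" \
    --project="$project" \
    --limit=50 \
    --format='table(timestamp,protoPayload.methodName,protoPayload.authenticationInfo.principalEmail)'
}

# Example call (hypothetical values):
# recent_changes my-app-prod 2026-03-25T10:30:00Z 2026-03-25T10:50:00Z
```

An empty result for the window is useful information too: it suggests the cause is not a change made through the Google Cloud APIs in this project.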

Step 3: Narrow with logs

Move to Logs Explorer. Filter to the affected service and the incident time window. Start with severity>=ERROR to find the highest-priority signals first.

# Pull recent error logs from the command line
gcloud logging read \
  'severity>=ERROR resource.type="cloud_run_revision" resource.labels.service_name="api-service"' \
  --limit=50 \
  --format='value(timestamp,jsonPayload.message)' \
  --project=my-app-prod

# Pull all logs from the last 10 minutes
gcloud logging read \
  'resource.type="cloud_run_revision" resource.labels.service_name="api-service"' \
  --freshness=10m \
  --limit=100 \
  --project=my-app-prod

# Pull logs and pipe to jq for field extraction
gcloud logging read \
  'severity>=ERROR resource.type="cloud_run_revision"' \
  --limit=20 \
  --format=json \
  --project=my-app-prod | jq '.[].jsonPayload.message'

Tip

The --freshness flag accepts relative offsets like 5m, 1h, or 2d. Use it for quick pulls when you know roughly how long ago the incident started. Once you have the exact window from metrics, switch to an explicit timestamp filter so you are not pulling logs from outside the incident period.
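
An explicit timestamp filter might look like the following sketch. The helper name incident_filter and the example timestamps are hypothetical; the filter fields themselves (resource.type, resource.labels.service_name, severity, timestamp) are standard Logging query language.

```shell
# Sketch: build a Logging filter limited to one Cloud Run service and an
# explicit incident window. Arguments (placeholders): service name, window
# start, window end (RFC 3339 timestamps).
incident_filter() {
  service="$1"; start="$2"; end="$3"
  printf 'resource.type="cloud_run_revision" AND resource.labels.service_name="%s" AND severity>=ERROR AND timestamp>="%s" AND timestamp<="%s"' \
    "$service" "$start" "$end"
}

# Example use with gcloud (hypothetical values):
# gcloud logging read \
#   "$(incident_filter api-service 2026-03-25T10:42:00Z 2026-03-25T11:05:00Z)" \
#   --limit=50 --project=my-app-prod
```

Building the filter in one place keeps the window consistent when you rerun the query against several services.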

Read the full log entry, not just the message. Structured logs carry request IDs, user IDs, upstream service names, and error codes as separate queryable fields. These let you correlate errors across services and identify which specific request triggered the failure.
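
When the window returns many entries, counting repeated messages is a quick way to see the dominant error pattern. The sketch below assumes one log message per line on standard input; the function name top_errors is a placeholder.

```shell
# Sketch: read one log message per line on stdin and print the most
# frequent messages first. Optional argument: how many to show (default 10).
top_errors() {
  sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example use with gcloud (hypothetical service and project):
# gcloud logging read \
#   'severity>=ERROR AND resource.type="cloud_run_revision" AND resource.labels.service_name="api-service"' \
#   --freshness=1h --limit=500 \
#   --format='value(jsonPayload.message)' \
#   --project=my-app-prod | top_errors 5
```

Messages that embed request-specific values such as IDs or durations will not group together, so treat the counts as a rough guide rather than an exact tally.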

Step 4: Confirm the failing operation with traces

Open Cloud Trace and filter to the affected time window. Sort by latency or filter for traces with errors. Click a failing or slow trace to open the waterfall view.

In a microservices architecture, the service that returns an error to the user is often not where the error originated. The trace shows the full call chain. Each span represents one operation: an HTTP call, a database query, a Pub/Sub publish. Find the span that failed or consumed most of the time, then go directly to that service’s logs.
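
Once you have a trace ID from the waterfall view, you can pull every log entry attached to that trace, across services, in one query. This is a sketch: it assumes your log entries carry the trace field in the form projects/PROJECT_ID/traces/TRACE_ID, and the function name, project, and trace ID shown are placeholders.

```shell
# Sketch: list all log entries that belong to one trace, oldest first.
# Arguments (placeholders): project ID, trace ID copied from Cloud Trace.
logs_for_trace() {
  project="$1"; trace_id="$2"
  gcloud logging read "trace=\"projects/$project/traces/$trace_id\"" \
    --project="$project" \
    --freshness=2h \
    --order=asc \
    --format='table(timestamp,resource.labels.service_name,severity,jsonPayload.message)'
}

# Example call (hypothetical values):
# logs_for_trace my-app-prod 4bf92f3577b34da6a3ce929d0e0e4736
```

Only entries that were written with a trace field are matched, so services that log without trace context will be missing from the result.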

Analogy

Think of a trace like a flight itinerary for a lost package. You would not just check the departure airport. You would check each leg of the journey: first connection, second connection, final delivery hub. Distributed traces work the same way. Every service the request passes through adds a span to the itinerary. You follow the itinerary until you find the leg where the package went missing, then investigate that stop specifically.

For more on how distributed tracing works across services, see Distributed Tracing.

Step 5: Escalate to profiler when the issue is resource behavior

If metrics show elevated CPU or memory, logs show no clear error, and traces show slow requests without an obvious failing span, the issue may be runtime resource behavior: a hot loop, a memory leak, or lock contention.

Cloud Profiler collects continuous, low-overhead profiling data from running services. Open Profiler in the Cloud Console, select the service and time range, and look for functions that consume a disproportionate share of CPU or memory. This is the right tool when the problem is a resource consumption pattern rather than a logic error.

Step 6: Mitigate first, diagnose second

During an active incident, recovery takes priority over root cause analysis. If a rollback is available: roll back, confirm the service is healthy, then investigate the failed revision in a non-production environment. If rollback is not available, consider routing traffic away from the affected region or instance first.
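
For Cloud Run, a rollback is a traffic change rather than a redeploy. The sketch below assumes the previous revision still exists; the function name rollback_cloud_run and all argument values in the comments are placeholders.

```shell
# Sketch: send 100% of traffic to a known-good Cloud Run revision.
# Arguments (placeholders): service name, revision name, region, project ID.
rollback_cloud_run() {
  service="$1"; revision="$2"; region="$3"; project="$4"
  gcloud run services update-traffic "$service" \
    --to-revisions="$revision=100" \
    --region="$region" \
    --project="$project"
}

# Find the previous revision name first (hypothetical values):
# gcloud run revisions list --service=api-service \
#   --region=us-central1 --project=my-app-prod
#
# Then roll back to it:
# rollback_cloud_run api-service api-service-00041-abc us-central1 my-app-prod
```

After the traffic change, go back to the metrics from step 1 and confirm the error rate returns to its earlier level before starting the diagnosis.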

Tip

Document what you observed at each step as you go, not after. That real-time record becomes the basis for your post-incident review and is far more accurate than reconstructing the timeline from memory an hour later. See Incident Response with Monitoring for the full detect-triage-mitigate-postmortem structure.

Common GCP production debugging scenarios

Error spike after a deployment

What you see

Error rate climbs sharply at a specific minute. Latency is up. Request volume has not changed. Nothing else is different except something was deployed.

Open Cloud Audit Logs and filter for changes at the incident start time. For Cloud Run, filter for protoPayload.methodName="google.cloud.run.v1.Services.ReplaceService". If the timestamp matches, roll back immediately without further investigation. See Monitoring Cloud Run for revision traffic controls and error metrics.

Slow database-backed requests

What you see

p99 latency is elevated. CPU is normal. Error rate is low. Requests do eventually succeed, just slowly. Users notice timeouts on specific pages or API calls.

Open Cloud Trace and filter for slow requests above your SLO threshold. Click a slow trace and look for a database span consuming most of the elapsed time. For Cloud SQL, cross-check with Query Insights for per-query execution time. See Cloud SQL Connection Refused if requests are failing rather than just slow.

Cloud Run 500s with no application error in logs

What you see

Cloud Run returns 500s but application logs show nothing. Either the container logs are missing entirely, or the process exits before writing anything. The error is at the infrastructure level, not the application level.

Search Logs Explorer for container-level events including OOM kills and startup failures:

resource.type="cloud_run_revision"
resource.labels.service_name="api-service"
("OutOfMemory" OR "Memory limit of")

Also check Cloud Monitoring for the container/memory/utilizations metric approaching 1.0 (100% of the configured limit) before the errors started. If memory is the cause, increase the limit or investigate the leak with Cloud Profiler. See Cloud Run container failed to start for the full startup failure checklist.
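
If memory turns out to be the cause, raising the limit is a single command. This sketch uses placeholder values throughout (function name, service, memory size, region, project). Note that changing the limit creates a new revision, so treat it like any other deployment.

```shell
# Sketch: raise the memory limit on a Cloud Run service.
# Arguments (placeholders): service name, memory size such as 1Gi,
# region, project ID.
set_memory_limit() {
  service="$1"; memory="$2"; region="$3"; project="$4"
  gcloud run services update "$service" \
    --memory="$memory" \
    --region="$region" \
    --project="$project"
}

# Example call (hypothetical values):
# set_memory_limit api-service 1Gi us-central1 my-app-prod
```

A higher limit only buys time if the service is leaking; watch the memory metric afterwards to see whether usage levels off or keeps climbing.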

GKE pod crashes and CrashLoopBackOff

What you see

Pods restart repeatedly, kubectl get pods shows CrashLoopBackOff, and application traffic is intermittently disrupted.

Check GKE monitoring dashboards for pod restart count and container exit codes. Then pull logs from the last crash with kubectl logs pod-name --previous. See GKE CrashLoopBackOff Explained for the full diagnostic process and common root causes.
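
The exit code of the last crashed container is often the fastest hint. The sketch below reads it with kubectl and maps a few common values to a likely meaning. The function names are placeholders, the mapping follows the usual convention that codes above 128 mean "128 plus the signal number", and it assumes the crashing container is the first one in the pod.

```shell
# Sketch: print the exit code of the first container's last termination.
# Argument (placeholder): pod name.
last_exit_code() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Sketch: map a container exit code to a likely cause. These are
# conventions, not guarantees; always confirm against the logs.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "application error; read the previous container's logs" ;;
    137) echo "killed by SIGKILL (128+9); often an OOM kill, check memory limits" ;;
    139) echo "segmentation fault, SIGSEGV (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "exit code $1 not in this table; read the previous container's logs" ;;
  esac
}

# Example use (hypothetical pod name):
# explain_exit_code "$(last_exit_code api-service-7d9f8b6c5-x2x9q)"
```

kubectl describe pod shows the same termination details along with the reason field, which reads OOMKilled when the kernel killed the container for exceeding its memory limit.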

Cloud Functions runtime failures

What you see

Invocations fail and Cloud Monitoring shows an increased error count, but the failure rate is not 100%, so the function is clearly running sometimes.

Filter Logs Explorer to resource.type="cloud_function" (1st gen functions; 2nd gen functions log under resource.type="cloud_run_revision") and severity>=ERROR. Look for timeout errors, memory limit exceeded messages, and dependency import failures. Check whether execution time is approaching the configured timeout. See Debugging Cloud Functions Failures for a scenario-by-scenario breakdown.

Cloud SQL connection failures

What you see

The application returns database connection errors, the Cloud SQL instance shows as running, and the issue appeared after a deployment or restart.

Tip

Connection failures after a restart are often caused by connection pool exhaustion or the Cloud SQL Auth Proxy not restarting alongside the application. Check Logs Explorer for pool exhaustion messages first, then verify the Auth Proxy is running with the correct service account permissions.

See Cloud SQL Connection Refused for the most common root causes and fixes.

What replaced Cloud Debugger?

Warning

Cloud Debugger was deprecated in May 2022 and shut down in May 2023. Any guide or workflow that references it is out of date. Do not build debugging processes that depend on it.

Cloud Debugger allowed attaching a snapshot debugger to a running production application without pausing it. Modern approaches cover the same ground more reliably:

  • Structured logging: emit enough context in each log entry (request ID, user ID, relevant application state) to reconstruct what happened without a live debugger. Add spans with attributes to capture state at key decision points. This is the primary replacement for breakpoint-style debugging in production.
  • Cloud Trace: trace the full path of a specific request and inspect the state captured in each span. Useful when the problem is latency or a failing downstream call.
  • Cloud Profiler: continuous profiling for CPU, memory, and lock contention. Useful when the problem is runtime resource behavior rather than a logic error.
  • Cloud Workstations: run a full development environment in GCP with VPC access. Debug against a staging environment that mirrors production rather than attaching to production directly.
  • Feature flags and targeted diagnostic logging: enable verbose logging for a specific user or request subset in production without a full redeploy. Combine with structured logs to capture additional context on demand.
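
As an illustration of the structured logging point above: on Cloud Run, a JSON object written as a single line to standard output is ingested as a structured log entry, with severity and message treated as special fields. The sketch below is a minimal shell version. The function name log_json and the request_id field are hypothetical, and the function does not escape quotes or backslashes in its arguments, so it is only suitable for simple values.

```shell
# Sketch: write one structured log entry as a single JSON line on stdout.
# Arguments (placeholders): severity, message, request ID.
# Assumption: the arguments contain no double quotes or backslashes,
# because this sketch does not JSON-escape them.
log_json() {
  severity="$1"; message="$2"; request_id="$3"
  printf '{"severity":"%s","message":"%s","request_id":"%s"}\n' \
    "$severity" "$message" "$request_id"
}

# Example call (hypothetical values):
# log_json ERROR "payment provider timed out" req-8f3a2c
```

Extra fields such as request_id land under jsonPayload, so they can be queried directly, for example with jsonPayload.request_id="req-8f3a2c" in Logs Explorer. In application code, prefer your language's logging library with a JSON formatter, which handles escaping for you.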

Common beginner mistakes

  1. Starting with logs before checking metrics. Logs are the most detailed signal but the slowest to search. Within seconds, metrics tell you when the problem started, which service is affected, and how widespread it is. Always check your service dashboard first to understand scope before opening Logs Explorer.
  2. Searching logs without a time filter. severity>=ERROR across the full log history of a busy service returns thousands of entries from weeks ago. Always add a timestamp filter scoped to the incident window. Narrow the time range progressively until you are looking at the exact moment the problem started.
  3. Looking at the wrong service’s logs. In a microservices system, the service that returns the error to the user is often not where the error originated. Use Cloud Trace to find which service in the call chain is actually failing before you open its logs. Going directly to the front-end service wastes time when the real problem is a downstream dependency.
  4. Not checking for a recent deployment first. The most common cause of a sudden production incident is a recent deployment or configuration change. Spend two minutes in Cloud Audit Logs before spending thirty minutes in traces. If there was a deployment at the start of the incident window, roll back first.
  5. Diagnosing during an active incident instead of mitigating first. A working service with an unexplained history is better than a broken service with a detailed investigation in progress. Roll back, reroute traffic, or apply a config change to stop the incident. Then diagnose the root cause in the postmortem.

Frequently asked questions

What is the fastest way to find what changed before an incident?

Check Cloud Audit Logs for recent API calls around the time the incident started. In Logs Explorer, filter log_id("cloudaudit.googleapis.com/activity") and narrow by service name to see deployments, config changes, and IAM updates made in the minutes before the incident.

Should I start with logs or metrics?

Always start with metrics. Metrics tell you when something changed, which service is affected, and how widespread the impact is. Without that context, searching logs means scanning thousands of entries with no clear target. Use metrics to set your time window and scope, then use logs to find the specific error.

When should I use traces instead of logs?

Use traces when the issue is latency in a distributed system, or when logs show errors but you cannot identify which service in the call chain is the root cause. Traces give you the full request path across services and show you exactly which span is slow or failing.

What replaced Cloud Debugger in Google Cloud?

Cloud Debugger was deprecated in 2022 and shut down in 2023. Modern replacements include structured logging with enough context to reconstruct what happened, Cloud Trace for request-level visibility, Cloud Profiler for CPU and memory behavior, and Cloud Workstations for debugging against a staging environment.

How do I debug a production issue when logs show nothing obvious?

If logs show nothing obvious, check Cloud Audit Logs for recent infrastructure changes. Check Cloud Trace for slow or dropped spans. Check Cloud Monitoring for upstream dependency errors from external APIs, Cloud SQL, or Pub/Sub. If the issue looks resource-related, open Cloud Profiler for CPU, memory, or lock contention patterns.

Last verified: 25 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.