Monitoring GKE in Google Cloud: Metrics, Alerts & Logs

When a pod starts crashing in production, the root cause could be in your application code, a misconfigured resource limit, a node under memory pressure, or a bad deploy rolling out while the previous version is still partially serving traffic. GKE monitoring means watching all three layers (containers, nodes, and workloads) at the same time so you can isolate the cause quickly.

This page explains how GKE monitoring works end to end: which metrics GKE sends by default, what Managed Service for Prometheus adds, how to use dashboards, logs, and alerts together, and what to check first when things go wrong.

Simple explanation

GKE monitoring has three distinct layers, and problems at each layer look different.

Pod and container monitoring

Tracks CPU usage, memory usage, and restart counts for individual containers. A pod in CrashLoopBackOff shows a rising restart count. A container that exceeds its memory limit is terminated with the reason OOMKilled and then restarted. These signals are fast and specific: they tell you which container failed and why.

Node monitoring

Tracks how much of each node’s allocatable CPU and memory is actually in use. When a node approaches its allocatable memory limit, Kubernetes starts evicting lower-priority pods to free space. Node pressure is a cluster-wide signal: it means your fleet does not have enough capacity, not just that one pod is misbehaving.

Workload and deployment monitoring

Tracks whether your deployments have the expected number of available replicas, whether a rollout is progressing, and whether your application is serving traffic correctly. This layer connects infrastructure health to user-facing availability. Horizontal Pod Autoscaling also depends on workload-level metrics to make scaling decisions.
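
As a rough illustration of the replica signal described above, the sketch below compares desired and available replicas. It assumes input shaped like the output of kubectl get deployments -o json; the function name and the sample data are hypothetical.

```python
def unavailable_deployments(deployment_list):
    """Return (name, desired, available) for deployments short of replicas."""
    short = []
    for item in deployment_list.get("items", []):
        name = item["metadata"]["name"]
        # spec.replicas defaults to 1 when it is not set.
        desired = item.get("spec", {}).get("replicas", 1)
        # status.availableReplicas is omitted when no replica is available.
        available = item.get("status", {}).get("availableReplicas", 0)
        if available < desired:
            short.append((name, desired, available))
    return short


# Hypothetical sample data, not taken from a real cluster.
SAMPLE_DEPLOYMENTS = {
    "items": [
        {"metadata": {"name": "api-service"},
         "spec": {"replicas": 5}, "status": {"availableReplicas": 3}},
        {"metadata": {"name": "worker"},
         "spec": {"replicas": 2}, "status": {"availableReplicas": 2}},
        {"metadata": {"name": "cron-proxy"},
         "spec": {"replicas": 1}, "status": {}},
    ]
}

print(unavailable_deployments(SAMPLE_DEPLOYMENTS))
# [('api-service', 5, 3), ('cron-proxy', 1, 0)]
```

A check like this is only a point-in-time view; an alerting policy on the same signal is what catches the problem when nobody is looking.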

All three layers need to be visible at the same time. A slow API endpoint might be caused by a node being overloaded, a container being CPU-throttled, or a rollout deploying a broken image version. Without visibility into all three, you are debugging with partial information.

How GKE monitoring works

GKE integrates with Cloud Monitoring automatically. When you create a GKE cluster with the default settings, monitoring is enabled and metrics start flowing without any additional configuration.

Default system metrics

New GKE clusters (Standard and Autopilot) automatically send system metrics to Cloud Monitoring under the kubernetes.io/ namespace. These cover containers, pods, and nodes:

  • kubernetes.io/container/cpu/core_usage_time: cumulative CPU core-seconds consumed by a container. Apply rate() to get the average number of cores in use.
  • kubernetes.io/container/memory/used_bytes: current memory used by a container. Compare to its memory limit to measure utilization.
  • kubernetes.io/container/restart_count: total restarts for a container. This is the fastest signal for CrashLoopBackOff.
  • kubernetes.io/node/cpu/allocatable_utilization: fraction of allocatable CPU in use on the node.
  • kubernetes.io/node/memory/allocatable_utilization: fraction of allocatable memory in use. When this nears 1.0, evictions follow.
  • kubernetes.io/node/pod_count: pods currently running per node.

Note

The current metric namespace for GKE system metrics is kubernetes.io/. You may see older docs or dashboards referencing container.googleapis.com/ — that is a legacy namespace from earlier GKE versions. New clusters and current dashboards use kubernetes.io/ exclusively.
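
Two of the metrics listed above need a small calculation before they are useful: the CPU counter is cumulative, and memory usage only means something relative to a limit. The sketch below shows that arithmetic on made-up sample values; the function names are hypothetical and not part of any Google Cloud library.

```python
def cores_used(prev_core_seconds, curr_core_seconds, interval_seconds):
    """Average cores in use between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_core_seconds - prev_core_seconds) / interval_seconds


def memory_utilization(used_bytes, limit_bytes):
    """Fraction of the container memory limit currently in use."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return used_bytes / limit_bytes


# Hypothetical samples of core_usage_time taken 60 seconds apart.
print(cores_used(1200.0, 1230.0, 60))  # 0.5

# Hypothetical container using 384 MiB of a 512 MiB limit.
print(memory_utilization(384 * 1024**2, 512 * 1024**2))  # 0.75
```

Thirty core-seconds consumed over sixty seconds is half a core on average, which is what a rate over the counter reports.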

Cloud Logging and Kubernetes events

GKE sends container logs and Kubernetes system logs to Cloud Logging by default. Container stdout and stderr streams are captured automatically and searchable in Logs Explorer. Kubernetes events (OOM kills, image pull failures, scheduling failures, probe failures) appear as structured log entries you can filter and alert on. See Logging in Kubernetes for how log routing works.

Managed Service for Prometheus

Managed Service for Prometheus (GMP) collects Prometheus-format metrics from your application pods. GMP is enabled by default on Autopilot clusters and on Standard clusters created with GKE 1.27 or later. On older Standard clusters, you may need to enable it explicitly in the cluster’s observability settings.

GMP runs a managed collection agent on each node. You define what to scrape using PodMonitoring resources. Collected metrics appear in Cloud Monitoring under the prometheus.googleapis.com/ namespace and are queryable using PromQL.

Optional observability packages

Beyond the defaults, GKE supports additional observability components you can enable per cluster:

  • Kube State Metrics: deployment replica status, pod phase distribution, resource request vs. limit comparisons. Useful for workload-level visibility beyond what default system metrics provide.
  • Control plane metrics: kube-scheduler and kube-apiserver metrics. Relevant when diagnosing scheduling latency or API server load on Standard clusters.
  • cAdvisor / Kubelet metrics: more detailed per-container resource metrics from the Kubelet. Useful for fine-grained capacity analysis.

Enable these in the GKE cluster settings under Observability → Managed Collection in the Cloud Console, or via gcloud container clusters update. They are not required for basic GKE monitoring.

What to monitor first

Start with signals that have the highest incident rate and clearest remediation path. This checklist is ordered by urgency:

  • Pod restart count: a container restarting 3+ times in 10 minutes is in or approaching CrashLoopBackOff. Alert before it causes sustained user impact.
  • OOMKilled events: containers killed for exceeding their memory limit generate OOMKilled events in logs. Create a log-based metric and alert on it. OOMKilled is almost always caused by a memory limit set too low or a memory leak.
  • Node allocatable memory utilization: when this exceeds 85%, evictions become likely. Alert early so you can add nodes before pods start getting evicted.
  • Node allocatable CPU utilization: CPU is a compressible resource, so saturation does not cause evictions, but every pod on the node competes for CPU time and slows down. Alert at 80% as an early warning.
  • Unavailable replicas: when a deployment has fewer available replicas than desired, something is wrong. Use kube state metrics or check kubectl get deployments.
  • Rollout failures: a deployment stalling mid-rollout (new pods not becoming ready) is a common failure mode after a bad deploy. Kubernetes events capture this.
  • Warning and error logs: filter container logs for severity=WARNING or ERROR on high-traffic services. Spikes in error rate often precede escalation.
  • Request latency: if your application exposes Prometheus metrics, alert on p99 latency. Infrastructure metrics alone will not surface a slow database query or N+1 problem.

Start here

If you have zero alerts configured today, add these three first. They catch the most common GKE production failures with the lowest false-positive rate:

  1. Pod restart count > 3 in 10 minutes (metric: kubernetes.io/container/restart_count)
  2. Node allocatable memory utilization > 85% (metric: kubernetes.io/node/memory/allocatable_utilization)
  3. OOMKilled count in logs > 0 in 5 minutes (via a log-based metric on jsonPayload.reason="OOMKilling")
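
The three starter conditions can be sketched as plain functions to make the thresholds and windows concrete. This is an illustration of the logic only, not how Cloud Monitoring evaluates alerting policies; the Sample type, the function names, and the data are hypothetical, and samples are assumed to be sorted by time.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds
    value: float      # cumulative counter value


def increase_over_window(samples, now, window_seconds):
    """Increase of a cumulative counter over the trailing window."""
    recent = [s for s in samples if now - window_seconds <= s.timestamp <= now]
    if len(recent) < 2:
        return 0.0
    return recent[-1].value - recent[0].value


def restart_alert(samples, now, threshold=3, window_seconds=600):
    """Condition 1: more than `threshold` restarts in the last 10 minutes."""
    return increase_over_window(samples, now, window_seconds) > threshold


def memory_alert(utilization, threshold=0.85):
    """Condition 2: node allocatable memory utilization above 85%."""
    return utilization > threshold


def oom_alert(event_timestamps, now, window_seconds=300):
    """Condition 3: at least one OOM kill event in the last 5 minutes."""
    return any(now - window_seconds <= t <= now for t in event_timestamps)


# Hypothetical data: restart_count sampled every 5 minutes.
restarts = [Sample(0, 2), Sample(300, 4), Sample(600, 7)]
print(restart_alert(restarts, now=600))  # True (increase of 5)
print(memory_alert(0.91))                # True
print(oom_alert([580.0], now=600))       # True
```

The restart condition compares the increase over the window rather than the raw value, because the counter keeps growing for the lifetime of the container.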

Managed Service for Prometheus in GKE

Google Managed Service for Prometheus (GMP) removes the operational burden of running a Prometheus server. You do not manage storage, retention, or high availability — GMP handles all of that inside Cloud Monitoring’s infrastructure.

GMP stores metrics in Cloud Monitoring’s time-series backend and supports PromQL queries natively. You can use existing Prometheus dashboards and alerting rules with minimal changes. The main tradeoff is that GMP does not support every Prometheus feature (some long-retention, high-cardinality patterns differ), but for most production GKE workloads it is the right default.

Configuring GMP with PodMonitoring

To tell GMP which pods to scrape, create a PodMonitoring resource in the same namespace as your pods:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-service
  endpoints:
  - port: metrics
    interval: 30s

This tells GMP to scrape the metrics port on pods labeled app: api-service every 30 seconds. Your application must expose a /metrics endpoint in Prometheus text format. Scraped metrics appear in Cloud Monitoring under prometheus.googleapis.com/ and are queryable with PromQL shortly after the first scrape.
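
To make the /metrics requirement concrete, here is a minimal sketch of an endpoint that serves one counter in the Prometheus text format, using only the Python standard library. The metric name, the label values, and port 8080 are assumptions for illustration; a real service would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters keyed by (method, status code).
REQUEST_COUNTS = {("GET", "200"): 42, ("POST", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Serve /metrics until interrupted. The port number is an assumption."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(REQUEST_COUNTS), end="")
```

Calling serve() starts the endpoint. For the PodMonitoring resource above to find it, the container port would also need to be named metrics in the pod spec.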

Tip

GMP scrapes at the pod level, not the service level. With 5 replicas, GMP scrapes each one individually and stores per-pod time series. This gives you per-replica visibility — exactly what you need when one replica is behaving differently from the rest.
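
One simple way to use per-replica series is to compare each pod against the median of its peers. The sketch below does that on hypothetical p99 latency values; the pod names and the 2x factor are arbitrary choices for illustration, not a recommended threshold.

```python
from statistics import median


def outlier_pods(per_pod_values, factor=2.0):
    """Return pods whose value exceeds `factor` times the median of all pods."""
    if not per_pod_values:
        return []
    mid = median(per_pod_values.values())
    if mid <= 0:
        return []
    return sorted(pod for pod, v in per_pod_values.items() if v > factor * mid)


# Hypothetical p99 latency in seconds for 5 replicas.
P99_BY_POD = {
    "api-service-a": 0.21,
    "api-service-b": 0.19,
    "api-service-c": 0.22,
    "api-service-d": 1.40,
    "api-service-e": 0.20,
}

print(outlier_pods(P99_BY_POD))  # ['api-service-d']
```

An average across the five replicas would show elevated latency without saying which pod is responsible, which is the case per-pod time series are meant to cover.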

Dashboards, logs, and alerts for GKE

Effective GKE observability uses dashboards, logs, and alerts together. Each tool solves a different part of the operational problem.

Dashboards

Cloud Monitoring dashboards give you a continuous picture of cluster health over time. GKE provides pre-built dashboards in the Cloud Console under Kubernetes Engine → Observability that show node CPU/memory utilization, pod resource usage, container restart counts, and workload availability — with no setup required.

Custom dashboards are useful for correlating application-level metrics from GMP with infrastructure metrics from the kubernetes.io/ namespace on a single screen. A dashboard that shows p99 request latency, pod CPU usage, and node memory utilization side by side makes it much easier to determine whether a latency spike is infrastructure-driven or application-driven.

Logs and Kubernetes events

Logs answer what happened; metrics show how much. When a pod restarts, the restart count metric tells you something went wrong. The logs tell you what the container printed before it died. The Kubernetes event tells you whether it was an OOM kill, a liveness probe failure, or an image pull error.

In Logs Explorer, filter with resource.type="k8s_container" to narrow to container logs, or resource.type="k8s_cluster" for Kubernetes events. The query jsonPayload.reason="OOMKilling" surfaces OOM kills directly.
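
When the same filter is reused in scripts or log-based metrics, it can help to build it in one place. The sketch below assembles a filter string from the clauses mentioned above; the function name and the cluster name are hypothetical, and the optional cluster clause assumes the cluster_name resource label.

```python
def k8s_event_filter(reason, cluster_name=None):
    """Build a Logs Explorer filter for Kubernetes events with a given reason."""
    clauses = [
        'resource.type="k8s_cluster"',
        f'jsonPayload.reason="{reason}"',
    ]
    if cluster_name:
        clauses.append(f'resource.labels.cluster_name="{cluster_name}"')
    return " AND ".join(clauses)


print(k8s_event_filter("OOMKilling"))
# resource.type="k8s_cluster" AND jsonPayload.reason="OOMKilling"

# "prod-cluster" is a placeholder cluster name.
print(k8s_event_filter("OOMKilling", cluster_name="prod-cluster"))
```

Note that the filter uses straight double quotes. Curly quotes copied from a web page or document are not valid in the query language.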

Alerts

Dashboards require someone to be watching them. Alerting policies notify you without requiring active monitoring. The most impactful GKE alerts:

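  • Container restart count > 3 in 10 minutes: use the kubernetes.io/container/restart_count metric and alert on its increase over a 10-minute window. The metric is a cumulative counter, so a threshold on the raw value would keep firing long after the restarts stop.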
  • Node memory allocatable utilization > 85%: use kubernetes.io/node/memory/allocatable_utilization.
  • OOM kill events in logs: create a log-based metric on jsonPayload.reason="OOMKilling" and alert when the count is greater than 0 over 5 minutes.

Using all three together during an incident

Incident flow

  1. An alert fires for pod restarts.
  2. Open the GKE dashboard to see which nodes the affected pods are on.
  3. Check Kubernetes events in Logs Explorer to identify whether it is an OOM kill or a probe failure.
  4. Check container logs for the error message just before termination.
  5. Correlate with a recent deploy in the audit log.

See Incident Response with Monitoring for the full workflow.

Quick diagnostics with kubectl

Use kubectl for immediate in-terminal visibility before opening dashboards, especially right after an alert fires:

# See CPU and memory for all pods in a namespace, sorted by memory usage
kubectl top pods -n production --sort-by=memory

# See node-level CPU and memory utilization
kubectl top nodes

# List pods with restart counts — a high RESTARTS number is the signal
kubectl get pods -n production

# View recent Kubernetes events sorted by time
# This is the fastest way to find OOM kills and scheduling failures
kubectl get events -n production --sort-by='.lastTimestamp'

# Deep-dive on a specific pod: status, resource usage, events, and conditions
kubectl describe pod POD_NAME -n production

# See logs from the previous container instance (useful right after a restart)
kubectl logs POD_NAME -n production --previous

Tip

kubectl get events is almost always the fastest first step when a pod is unhealthy. Events show OOM kills, scheduling failures, image pull errors, and liveness probe failures in chronological order. Read events before diving into application logs — the cause is often visible immediately, without reading hundreds of log lines.
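
For a quick summary across many pods, the restart count and last termination reason can also be read from the pod status. The sketch below assumes input shaped like the output of kubectl get pods -n production -o json; the function name and sample data are hypothetical.

```python
def restarting_containers(pod_list, min_restarts=1):
    """Return (pod, container, restarts, last_reason), highest restarts first."""
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            restarts = status.get("restartCount", 0)
            if restarts < min_restarts:
                continue
            # lastState.terminated is only present after a previous termination.
            terminated = status.get("lastState", {}).get("terminated", {})
            reason = terminated.get("reason", "unknown")
            rows.append((pod_name, status["name"], restarts, reason))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample data, not taken from a real cluster.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "worker-abc"},
         "status": {"containerStatuses": [
             {"name": "worker", "restartCount": 0, "lastState": {}}]}},
        {"metadata": {"name": "api-service-7d9f"},
         "status": {"containerStatuses": [
             {"name": "api", "restartCount": 6,
              "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
    ]
}

print(restarting_containers(SAMPLE_PODS))
# [('api-service-7d9f', 'api', 6, 'OOMKilled')]
```

To run it against a live namespace, one option is to load the JSON with json.loads(subprocess.run(["kubectl", "get", "pods", "-n", "production", "-o", "json"], capture_output=True, text=True, check=True).stdout) and pass the result to the function.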

When to use this

GKE monitoring applies in several recurring operational situations:

  • New cluster going to production: configure system metric alerts and GMP before your first production deploy, not after the first incident.
  • Pods restarting unexpectedly: check restart count metrics and Kubernetes events first. Most causes (OOM kills, liveness probe failures, image pull errors) are visible within seconds from events alone.
  • Performance regression after a deploy: compare pre- and post-deploy dashboards for CPU usage, memory usage, and request latency. A new version consuming significantly more memory than the previous one shows up immediately in kubernetes.io/container/memory/used_bytes.
  • Node pressure or capacity issues: when nodes approach their allocatable limits, use kubectl top nodes and node allocatable utilization dashboards to identify which nodes are saturated. This informs whether you need to scale the node pool or rightsize your pod resource requests.
  • Moving from kubectl-only debugging to proper monitoring: understanding pods and using kubectl describe gets you started, but Cloud Monitoring dashboards and alerts catch problems before a user reports them.
  • Comparing GKE Autopilot vs Standard observability: see the Autopilot vs Standard comparison for how node management differences affect what you monitor.

Monitoring GKE vs self-managed Prometheus

Teams migrating existing Prometheus setups to GKE often ask whether to keep self-managed Prometheus or switch to GMP. Here is a direct comparison:

| | Managed Service for Prometheus (GMP) | Self-managed Prometheus |
|---|---|---|
| Setup effort | Minimal. Enabled by default on new clusters | Requires Helm chart, PVCs, and RBAC config |
| Maintenance burden | None. Google manages upgrades, storage, and HA | You manage Prometheus, Alertmanager, and storage |
| Scale | Handles high-cardinality workloads automatically | Requires manual scaling and federation for large clusters |
| Dashboards | Cloud Monitoring dashboards + PromQL | Grafana with Prometheus data source |
| Querying | PromQL via Cloud Monitoring UI and API | PromQL via Prometheus UI or Grafana |
| Alerting | Cloud Monitoring alerting policies | Alertmanager with routing rules |
| Existing Prometheus rules | Mostly compatible (some federation patterns differ) | Full compatibility |
| Best for beginners | Yes. No infrastructure to operate | No. Requires Prometheus expertise |

For most teams deploying on GKE without an existing Prometheus investment, GMP is the better default. If you have existing Grafana dashboards, alert rules, or a large Prometheus ecosystem to maintain, self-managed Prometheus gives you more control, at an operational cost.

Common beginner mistakes

  1. Not setting resource requests and limits. Without requests, the Kubernetes scheduler cannot make good placement decisions and pods can be evicted unpredictably. Without limits, a runaway container can consume an entire node’s memory. Cloud Monitoring’s resource metrics are also far less useful without budgets to compare against.

    This one matters most

    Skipping resource requests and limits is the single most common reason GKE clusters behave unpredictably. You will not get useful monitoring data out of a cluster where half the pods have no resource budget set. Always define both before deploying to production.

  2. Only monitoring at the service level. Service-level metrics can look healthy when working pods mask unhealthy ones. Monitor at the pod level: individual restart counts and per-pod CPU/memory reveal problems that aggregates hide.
  3. Ignoring Kubernetes events. Most new engineers jump straight to application logs when a pod is crashing. Kubernetes events are faster to read and often contain the full explanation: “OOMKilled”, “Back-off pulling image”, “Liveness probe failed”. Check events first with kubectl get events.
  4. Confusing CPU throttling with CPU utilization. A container can show low CPU utilization but high CPU throttling if its CPU limit is set too low. Throttling means the container is being artificially slowed down. kubectl top does not show throttling — use kubernetes.io/container/cpu/core_usage_time alongside GKE Workload Insights to see the full picture.
  5. Assuming dashboards replace alerts. Dashboards require someone to be watching. Alerts fire when no one is watching. Configure alerts for the signals in the checklist above before relying on dashboards for production visibility.
  6. Not correlating logs and metrics during troubleshooting. A restart count spike in a dashboard and a matching OOMKilled event in logs together tell a clear story. Either one alone is incomplete. When investigating an incident, open both Cloud Monitoring and Logs Explorer at the same time with the same time window.
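
The difference between utilization and throttling in mistake 4 is easier to see with numbers. The sketch below assumes you have deltas of the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total over some window; whether those counters are collected in your cluster depends on which observability packages are enabled, and the sample values are hypothetical.

```python
def throttle_ratio(throttled_periods_delta, total_periods_delta):
    """Fraction of CFS scheduling periods in which the container was throttled."""
    if total_periods_delta <= 0:
        return 0.0
    return throttled_periods_delta / total_periods_delta


def cpu_utilization(cores_used, cpu_limit_cores):
    """Average CPU usage as a fraction of the container CPU limit."""
    if cpu_limit_cores <= 0:
        raise ValueError("cpu_limit_cores must be positive")
    return cores_used / cpu_limit_cores


# Hypothetical bursty container observed for 60 seconds:
# 0.15 cores used on average against a 0.5 core limit,
# 600 CFS periods (100 ms each), 240 of them throttled.
print(cpu_utilization(0.15, 0.5))  # 0.3
print(throttle_ratio(240, 600))    # 0.4
```

In this example the container averages 30% of its limit but is throttled in 40% of periods, because short bursts hit the limit even though the average stays low.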

Frequently asked questions

Do I need to install Prometheus on GKE?

No. GKE includes Managed Service for Prometheus (GMP), which handles Prometheus-format metric collection without you running or operating a Prometheus server. You create PodMonitoring resources to define what to scrape, and GMP handles collection, storage in Cloud Monitoring, and PromQL querying.

What is the difference between system metrics and Prometheus metrics in GKE?

System metrics are emitted by GKE automatically under the kubernetes.io/ namespace: pod CPU usage, memory usage, restart counts. Prometheus metrics are application-level metrics your code exposes in Prometheus format on a /metrics endpoint. Managed Service for Prometheus collects the latter. Both end up queryable in Cloud Monitoring.

What should I alert on first in GKE?

Start with three alerts: pod restart count increasing unexpectedly (catches CrashLoopBackOff early), node allocatable memory utilization above 85% (predicts eviction before it happens), and OOM kill events from logs (signals containers hitting their memory limit). These three cover the most common GKE production failures.

How do I check pod CPU and memory usage quickly?

Run kubectl top pods -n NAMESPACE to see current CPU and memory for all pods in a namespace. Run kubectl top nodes for node-level usage. Both commands use the Metrics Server, which GKE enables by default on all cluster types.

Do I need different monitoring for Autopilot vs Standard GKE?

The core setup is the same. Cloud Monitoring system metrics and Managed Service for Prometheus work on both. The key difference is that Autopilot manages nodes for you, so node-pressure alerting matters less. On Autopilot clusters, focus on pod and workload metrics. On Standard clusters, node allocatable utilization and node pool capacity deserve dedicated alerts.

Last verified: 25 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.