Blue Green Deployment in GCP: Cloud Run, GKE, Rollbacks

A blue green deployment lets you ship a new version to production, test it fully in a production-identical environment, then switch all user traffic in a single command. If anything goes wrong, you roll back in seconds. On Cloud Run this is built into the revision model. On GKE it takes two Deployments and a Service selector switch. Either way, your users see zero downtime.

This guide walks through how blue green works, how to implement it on both Cloud Run and GKE, when to use it, how it compares to canary and rolling deployments, and what to validate before you flip the switch.

Simple explanation

Analogy

Imagine you run a restaurant. The kitchen is open and serving lunch (that is blue, the live version). You want to switch to the new seasonal menu. Instead of changing dishes mid-service, you set up a second kitchen (green) with the new menu and run a quiet test service before any customers see it. Once you are happy everything is right, you open the new kitchen and close the old one. If a dish comes out wrong in the first five minutes, you reopen the old kitchen immediately.

In deployment terms: blue is your current live production version. Green is the new version you have deployed but not yet exposed to users. You validate green, then flip a switch to make it the live version. Blue stays running and ready so you can flip back instantly.

What is a blue green deployment in GCP?

A blue green deployment maintains two complete, production-equivalent environments at the same time. One environment (blue) serves all live user traffic. The other (green) runs the new version of your application, fully deployed, fully configured, connected to the same production dependencies, but receiving no user traffic yet.

When you are ready to release, you move all traffic from blue to green in one step. The switch is effectively instantaneous from the user’s perspective: new requests stop going to blue and start going to green, while requests already in flight finish on the version that accepted them. There is no prolonged in-between state where both versions might handle the same session unpredictably.

Because blue stays running after the switch, rollback is just the reverse operation. There is no redeployment, no image pull, no cold start wait. You shift traffic back to the previous version and it is live again within seconds.

How this differs from rolling deploys

A rolling deployment gradually replaces instances of the old version with new ones, creating a window where both versions serve traffic at the same time. Blue green eliminates that mixed-version window entirely. At any moment, all traffic goes to exactly one version.

How blue green deployments work

The sequence looks like this:

  1. Deploy the new version (green) without traffic. Green runs and is fully started but receives zero user requests. On Cloud Run this uses --no-traffic. On GKE, a separate Deployment runs alongside blue with a different version label.
  2. Validate green in isolation. Run smoke tests, integration checks, and health checks against the green environment directly. Because it receives no production traffic, you can test thoroughly without any user impact. See the CI/CD pipeline guide for how to automate this step.
  3. Switch all traffic to green. One command moves 100% of traffic from blue to green. Users hit green from this point on.
  4. Monitor the post-switch window. Watch error rates and latency for the first few minutes. Automated checks against Cloud Monitoring alerts can trigger automatic rollback if metrics degrade.
  5. Roll back if needed. If anything looks wrong, shift traffic back to blue instantly. No redeployment required.
  6. Clean up. Once green is confirmed stable (typically after 24 hours or more), decommission blue.

The key distinction from canary: in blue green, green is tested in isolation before any users see it. In a canary, the new version gets a small slice of real user traffic as a validation mechanism. Blue green is faster and simpler; canary provides more evidence from real traffic at the cost of more operational complexity. See canary deployments for when to choose that path instead.

Blue green on Cloud Run

Cloud Run’s revision model makes blue green straightforward. Every deploy creates a new revision, and you control traffic allocation between revisions explicitly. The Cloud Run overview explains the revision model in more detail if this is new to you.

# Step 1: Deploy the new revision with no production traffic
# --no-traffic means this revision starts but receives zero requests
# --tag gives it a stable, predictable URL for testing
gcloud run deploy api-service \
  --image=europe-west2-docker.pkg.dev/my-app-prod/api/api:v2.0.0 \
  --region=europe-west2 \
  --no-traffic \
  --tag=green

# Green is now accessible for testing at its stable tagged URL:
# https://green---api-service-xxxx-ew.a.run.app
# This URL is stable and persists until you remove the tag

# Step 2: Run your validation suite against the tagged URL
# Smoke tests, integration checks, dependency verification
# (these should be automated in your pipeline)

# Step 3: Switch all traffic to green instantly
gcloud run services update-traffic api-service \
  --region=europe-west2 \
  --to-revisions=LATEST=100

# Step 4 (if needed): List revisions to find the previous blue revision name
gcloud run revisions list \
  --service=api-service \
  --region=europe-west2

# Roll back to the previous revision by name
gcloud run services update-traffic api-service \
  --region=europe-west2 \
  --to-revisions=api-service-00049-xyz=100
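
Rather than looking up the blue revision name after something has already gone wrong, a pipeline can record it before green is deployed. The following is a minimal sketch, assuming blue is the latest ready revision at the moment it runs. The function names current_revision and roll_back_to are illustrative helpers, not part of gcloud.

```shell
#!/usr/bin/env bash
# Sketch: record the blue revision name before green is deployed.
# Assumes blue is the latest ready revision when current_revision runs.

SERVICE="api-service"
REGION="europe-west2"

# Print the name of the latest ready revision for the service.
current_revision() {
  gcloud run services describe "$SERVICE" \
    --region="$REGION" \
    --format='value(status.latestReadyRevisionName)'
}

# Send 100% of traffic back to the revision named in the first argument.
roll_back_to() {
  gcloud run services update-traffic "$SERVICE" \
    --region="$REGION" \
    --to-revisions="$1=100"
}
```

In a pipeline, BLUE_REVISION=$(current_revision) would run before the gcloud run deploy step, and roll_back_to "$BLUE_REVISION" would replace the hard-coded revision name in the rollback command.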

Cloud Run billing during blue green

With Cloud Run’s default request-based billing, you are charged for actual request processing: CPU time, memory, and requests served. A revision that receives no traffic is not billed for request processing.

Billing caveat: minimum instances

If you configure --min-instances on the green revision, those instances are kept warm even at zero traffic and you will be billed for their idle time. For most blue green workflows you do not need minimum instances on the green revision during validation. The first smoke test request to the tagged URL will incur a cold start, which is acceptable. Only set minimum instances on the live (blue) revision unless you have a specific latency requirement during validation.

After switching, the old blue revision goes idle and its request billing drops to zero. It stays available for rollback without generating costs, unless it also has minimum instances configured.

Tip

Tagged revision URLs are stable and persist until you explicitly remove the tag. Build your smoke test suite to target the tagged URL so it always hits the exact revision under test, regardless of which revision is currently live. The Cloud Build deployment guide shows how to wire this into an automated pipeline.
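
A smoke test against the tagged URL can be as small as a loop over the paths that matter. This is a sketch only: the GREEN_URL default reuses the placeholder URL from the deploy step, and the function names and any paths you pass in are assumptions to replace with your own.

```shell
#!/usr/bin/env bash
# Sketch: fail unless every listed path on the green revision returns HTTP 200.
# GREEN_URL is a placeholder; set it to the tagged URL of your own service.

GREEN_URL="${GREEN_URL:-https://green---api-service-xxxx-ew.a.run.app}"

# Print the HTTP status code for a single path on the green revision.
status_for() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "${GREEN_URL}$1"
}

# Return 0 only if every path given as an argument responds with 200.
smoke_test() {
  local path code failed=0
  for path in "$@"; do
    code=$(status_for "$path")
    if [ "$code" != "200" ]; then
      echo "FAIL ${path} returned ${code}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

A pipeline step might then call smoke_test with a handful of core paths and stop before the traffic switch if it returns non-zero. If the service requires authentication, each request would also need an identity token header, which this sketch leaves out.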

Blue green on GKE

On GKE, there is no built-in revision model, so you manage it yourself with two Deployments and a Service selector switch. The Service routes traffic to pods based on label selectors. Changing the selector is the traffic switch:

# The Service selects pods by the 'version' label
# Change 'blue' to 'green' to switch all traffic
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  selector:
    app: api
    version: blue   # <-- this is the only thing that changes during a switch
  ports:
    - port: 80
      targetPort: 8080
---
# Blue Deployment: current live version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
      version: blue
  template:
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
        - name: api
          image: europe-west2-docker.pkg.dev/my-app-prod/api/api:v1.9.0
---
# Green Deployment: new version, not yet selected by the Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: api
      version: green
  template:
    metadata:
      labels:
        app: api
        version: green
    spec:
      containers:
        - name: api
          image: europe-west2-docker.pkg.dev/my-app-prod/api/api:v2.0.0

# Switch all traffic to green by patching the Service selector
kubectl patch service api-service -n production \
  -p '{"spec":{"selector":{"app":"api","version":"green"}}}'

# Rollback: patch back to blue
kubectl patch service api-service -n production \
  -p '{"spec":{"selector":{"app":"api","version":"blue"}}}'
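
Patching the selector while green pods are still starting would send traffic to a Deployment that cannot serve it. One way to guard against that, sketched here under the assumption that the Deployments are named api-blue and api-green as in the manifests above, is to wait for the rollout to complete before patching. The function names and the 120 second timeout are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: switch the Service selector only after the target Deployment is ready.

NAMESPACE="production"
SERVICE="api-service"

# switch_to <colour>: wait for api-<colour> to finish rolling out, then
# point the Service at it. Returns non-zero without patching if the
# Deployment is not ready within the timeout.
switch_to() {
  local colour="$1"
  kubectl rollout status "deployment/api-${colour}" \
    -n "$NAMESPACE" --timeout=120s || return 1
  kubectl patch service "$SERVICE" -n "$NAMESPACE" \
    -p "{\"spec\":{\"selector\":{\"app\":\"api\",\"version\":\"${colour}\"}}}"
}

# Print the version label the Service currently selects.
live_colour() {
  kubectl get service "$SERVICE" -n "$NAMESPACE" \
    -o jsonpath='{.spec.selector.version}'
}
```

With these helpers, switch_to green performs the release and switch_to blue performs the rollback, while live_colour lets a pipeline confirm which side is serving before and after.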

GKE capacity and cost tradeoffs

GKE cost: two environments means double compute

Unlike idle Cloud Run revisions, idle GKE pods still consume node resources. If those nodes are dedicated to your workload, they cost money regardless of traffic. Running two full-capacity Deployments simultaneously can roughly double your compute bill during the validation window.

A practical approach: run the green Deployment at a small replica count (1 or 2 pods) during development. Scale it to full production capacity in the hour before a planned release. After the switch is confirmed stable, scale the blue Deployment down to zero or delete it entirely. You pay for dual capacity only during the brief release window.
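
The scaling steps described above could be scripted roughly as follows. This is a sketch that assumes the Deployment names and replica count from the manifests in this guide; the function names and the 300 second timeout are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: pay for dual capacity only around the release window.

NAMESPACE="production"
FULL_REPLICAS=10   # matches the replica count in the example manifests

# Scale green up to production capacity and wait until it is ready.
scale_up_green() {
  kubectl scale deployment api-green -n "$NAMESPACE" \
    --replicas="$FULL_REPLICAS" &&
  kubectl rollout status deployment/api-green -n "$NAMESPACE" --timeout=300s
}

# Scale blue to zero once green is confirmed stable. The Deployment object
# remains, so it can be scaled back up, but rollback is no longer instant.
retire_blue() {
  kubectl scale deployment api-blue -n "$NAMESPACE" --replicas=0
}
```

Scaling blue to zero rather than deleting it keeps the manifest in the cluster, which trades a slower rollback (pods must start again) for a lower bill.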

See monitoring GKE for how to watch both Deployments during the switch window. This is one of the key operational differences from Cloud Run blue green: on Cloud Run the cost concern is minimal, but on GKE it requires deliberate scaling management.

When to use blue green deployments

Good fits for blue green

  • High-risk releases. Major version changes, database schema migrations, significant refactors. Anything where you want instant rollback available.
  • User-facing services where downtime is unacceptable. Blue green gives you a clean switch with no mixed-version window.
  • Changes that are hard to canary safely. If the new version is incompatible with the old version during a gradual rollout (for example, a breaking API change), blue green is cleaner than a canary.
  • Regulated or compliance-sensitive workloads. Blue green makes it easy to demonstrate that only one known version was serving traffic at any given time.
  • Teams with strong pre-deploy validation automation. The more automated your smoke tests and integration checks, the safer the full-traffic switch becomes.

When to choose something else

  • Low-risk, high-frequency deploys. Dependency updates, config tweaks, minor bug fixes. A standard rolling deployment is simpler and fast enough.
  • When you need production traffic to validate. If you cannot fully trust pre-switch testing and want real-user signal before committing, use a canary deployment instead.
  • When validation automation is weak. Blue green exposes 100% of users to the new version in one step. Without meaningful pre-switch validation, you are just doing a faster, less safe rolling deploy.
  • Complex stateful migrations where two-version compatibility is impossible. Some changes cannot safely run against the same database or state store at the same time. In those cases, a maintenance window may be unavoidable.
  • GKE workloads with tight budget constraints. If running two Deployments simultaneously is not cost-feasible, consider a rolling deployment with a pause-and-validate checkpoint instead.

The dev vs staging vs production guide is useful context here. Blue green works best when your staging environment already mirrors production closely enough to catch most issues before the green environment is even built.

Blue green vs canary vs rolling deployments

These three strategies solve the same problem differently. Knowing when to reach for each one is part of designing a solid release process.

| | Blue Green | Canary | Rolling |
|---|---|---|---|
| Traffic exposure | 100% switches at once after isolated testing | Small % of real users hit the new version first | Gradually replaces old instances |
| Rollback speed | Instant: shift traffic back, no redeploy | Fast for canary slice; full rollback is quick | Slow: must wait for old version to redeploy |
| Validation style | Pre-switch in isolation, synthetic traffic only | Live production traffic on a small slice | No separate validation window |
| Mixed-version window | None, switch is atomic | Yes, during the canary phase | Yes, throughout the rollout |
| Operational complexity | Medium: need two environments running | High: requires per-revision traffic splitting and monitoring | Low: often built into deployment tools |
| Cost on GKE | High during switch window (two full environments) | Low (canary slice is small) | Low (in-place replacement) |
| Best for | High-risk releases, instant rollback requirement | When you need real-traffic validation before full rollout | Low-risk frequent deploys |

The short version: blue green is the right choice when your pre-switch testing is thorough and you want the cleanest possible rollback. Canary is better when the change is risky enough that you want production signal before going to 100%. Rolling is fine for routine, low-risk deploys where simplicity and low cost matter more than instant rollback.

Validating green before switching

The safety of blue green depends entirely on what you check before switching. A careless switch is worse than a careful canary, because you expose 100% of users to an unvalidated version at once. Automate as much of this as possible so the pipeline cannot proceed to the traffic switch unless all checks pass.

Smoke and functional tests

  • Cover your core user journeys against the tagged URL, not just a /healthz endpoint. A health check returning 200 does not mean your application is processing requests correctly.
  • Test the critical paths that would cause user-visible failures: login, checkout, data reads and writes, third-party integrations.

Dependency and configuration checks

  • Verify the green revision can connect to its database, caches, and external APIs. A missing or malformed secret causes startup failures that may not surface until the first real request. The secrets in CI/CD guide covers how to catch these problems early.
  • Check that the new version is reading the correct config values. Environment variable renames and config schema changes are a common source of silent failures.

Logs and error checks

  • Inspect Cloud Logging for the green revision before switching. Look for startup errors, configuration warnings, and any stack traces that appear during your smoke test traffic.
  • A revision that has errors in its logs during smoke testing is not ready to serve production traffic.

Baseline metrics

  • Check the error rate and latency of the green revision under your synthetic smoke test traffic. Even a small test load can surface obvious problems. A 5xx rate on the green revision during your own tests is a clear stop signal.
  • Set a hard threshold: if the error rate on green exceeds X% during validation, the pipeline aborts. Do not let warnings get ignored.
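
A hard threshold is easiest to enforce when it is a function the pipeline calls, not a number someone reads off a dashboard. The sketch below assumes your smoke test run reports an error count and a total request count; the function name and the way those counts are collected are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: decide whether the smoke-test error rate on green is too high.

# error_rate_exceeds <errors> <total> <max_percent>
# Returns 0 (true) if errors/total is strictly greater than max_percent,
# or if no requests were made at all, since that validates nothing.
error_rate_exceeds() {
  local errors="$1" total="$2" max_percent="$3"
  if [ "$total" -eq 0 ]; then
    return 0
  fi
  # Integer arithmetic: errors/total > max/100  <=>  errors*100 > max*total
  [ $(( errors * 100 )) -gt $(( max_percent * total )) ]
}
```

A pipeline step could call error_rate_exceeds with the counts from the validation run and a 1 percent limit, then exit non-zero when it returns true so the traffic switch never runs.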

Schema and backward compatibility

  • Before switching, confirm that the green version’s expected database schema is in place. Run any necessary migrations before flipping traffic, and make sure those migrations are backward compatible with the previous version.

Define rollback criteria before the deploy, not during it

Know in advance: if the error rate exceeds X% in the first five minutes, or if more than Y users report problem Z, the rollback triggers automatically. Deciding this under pressure after something goes wrong leads to slower, less confident decisions.

Database and stateful workload considerations

Blue green is cleanest when your application is stateless. When state is involved, you need to plan more carefully.

Database schema migrations

During and after a blue green switch, both the blue (old) and green (new) versions may be talking to the same database at the same time. Green is the live version after the switch, while blue stays on standby and becomes live again if you roll back. Any schema change must therefore be backward compatible with the old version:

Safe migrations

Adding nullable columns, new tables, or new indexes. Blue ignores the new column; green uses it. Neither version breaks.

Unsafe migrations

Dropping a column that blue still reads, renaming a column, or changing a data type. These break blue if you need to roll back. For destructive changes, use a multi-step approach: first add the new column alongside the old one, then release the new code that uses it, then drop the old column in a later release once blue is safely retired.

Sessions and user state

If your application stores session state in a cookie, a token, or a server-side session store, check that the new version can read sessions created by the old version. A session format change that silently logs users out after the switch is a real user impact even if no errors appear in monitoring.

Caches and queues

Shared caches (Memorystore, Redis) and queues (Pub/Sub, Cloud Tasks) can cause problems if the new version uses a different data format than the old one. During the rollback window, blue might dequeue a message that green serialised, or vice versa. Design your message and cache formats to be forward and backward compatible across adjacent versions.

API contract compatibility

If blue and green both call downstream services, make sure those services can handle requests from both versions at the same time during the switch window. This matters most in microservice architectures where many services deploy together.

Automated rollback and post-deploy verification

Manual rollback only works if someone notices the problem quickly. For anything beyond a low-traffic service, automate it.

Post-switch monitoring window on Cloud Run

After switching traffic to green, your pipeline should pause and query Cloud Monitoring for the new revision’s error rate. If the error rate exceeds a threshold within the first five minutes, the pipeline triggers the rollback command automatically and posts to your incident channel:

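# Example post-switch automated rollback (a sketch for a pipeline script)
# get-error-rate.sh is a placeholder for your own query against Cloud
# Monitoring (for example via the API's timeSeries.list method). It is
# assumed to print the new revision's error rate as a whole-number percent.
THRESHOLD=1   # percent; choose a value that suits your service

# 1. Switch traffic to green
gcloud run services update-traffic api-service \
  --region=europe-west2 \
  --to-revisions=LATEST=100

# 2. Wait for monitoring data to accumulate
sleep 300

# 3. Query Cloud Monitoring for the new revision's error rate
ERROR_RATE=$(./get-error-rate.sh)

# 4. Roll back only if the error rate exceeds the threshold
if [ "$ERROR_RATE" -gt "$THRESHOLD" ]; then
  gcloud run services update-traffic api-service \
    --region=europe-west2 \
    --to-revisions=api-service-00049-xyz=100
fi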

For a production-quality implementation, pair this with Cloud Monitoring alerts and a Cloud Run error rate metric. The monitoring Cloud Run guide shows which metrics to watch.

Cloud Deploy verification jobs

If you use Cloud Deploy to manage releases, it supports post-deployment verification via a verify job in your delivery pipeline definition. The verify job runs after the deployment and executes a test container of your choice. If the verify job fails, Cloud Deploy can be configured to automatically roll back to the previous release.

Cloud Deploy rollback speed

Cloud Deploy’s rollback is a re-deployment of the previous release artifact through the same pipeline, not an instant traffic switch like the Cloud Run revision model. It takes longer than a manual traffic shift. For time-sensitive rollbacks on Cloud Run, the direct update-traffic command is faster. See rollbacks in Cloud Deploy for how the verification and rollback flow works in practice.

The Cloud Deploy overview explains how to structure a delivery pipeline if you are not already using it.

Common mistakes

  1. Skipping meaningful validation before switching. Running a single health check and calling it done removes the entire safety net. Test your core user journeys, not just whether the process started.

  2. Switching 100% of traffic before you are confident. If you have doubts, use a canary first. Blue green is not appropriate when the new version is still genuinely uncertain.

  3. Ignoring schema compatibility. If the green version requires a database change that breaks the blue version, you cannot roll back without also reverting the migration. Plan backward-compatible migrations explicitly.

  4. Deleting blue too soon after switching. Keep the previous version available for at least 24 hours. Problems that smoke testing did not catch, like edge-case failures or traffic pattern issues, may only surface over time.

  5. Using blue green for every deployment regardless of risk level. Low-risk routine deploys do not need the overhead of managing two environments. Reserve blue green for releases that genuinely warrant it.

  6. Forgetting the cost implications on GKE. Running two full-capacity Deployments costs real money on GKE. Plan your scaling strategy so you are only paying for dual capacity during the brief release window.

  7. Not defining rollback criteria in advance. Deciding when to roll back should happen before the release, not in the middle of an incident. Set concrete thresholds: error rate, latency p99, user reports.

  8. Configuring minimum instances on the green revision unnecessarily. On Cloud Run, minimum instances on an idle green revision will generate billing even at zero traffic. Only set minimum instances if your validation requires warm instances from the start.

Frequently asked questions

What is the difference between blue green and canary deployments?

Blue green switches all traffic at once after testing the new version in isolation. Canary gradually shifts a small percentage of real user traffic to the new version while monitoring for errors. Blue green gives you instant rollback if something goes wrong; canary gives you evidence from real traffic before committing to a full rollout. Use blue green when you are confident the new version works and want a clean, fast switch. Use canary when you want production signal before committing 100%.

How do blue green deployments work on Cloud Run?

Deploy the new revision with --no-traffic so it receives no production requests. Test it using the tagged revision URL. When validation passes, shift 100% of traffic in one command. The old revision stays running and can serve traffic again within seconds if you roll back. Cloud Run only bills for requests actually served, so a revision sitting idle at zero traffic is not billed for request processing. The exception is if you configure minimum instances on that revision, which will generate idle instance charges even with no traffic.

Are blue green deployments expensive on GKE?

They can be. On GKE with dedicated nodes, running both blue and green at full production capacity doubles your compute cost during the validation window. A common approach is to run the green Deployment at minimum replica count during development, scale it to full production capacity shortly before the planned switch, then scale down the blue Deployment after the switch is confirmed stable.

Do I need separate databases for blue and green?

No, but you need to manage schema changes carefully. Both the blue and green versions will talk to the same database during and after the switch. Any schema migration you run for the new version must be backward compatible with the old version, so that blue can still function if you need to roll back. Additive changes like new nullable columns are generally safe. Destructive changes like dropping a column or renaming a field require a multi-step migration across multiple releases.

When should I avoid blue green deployments?

Skip blue green for very low-risk or high-frequency deploys where a rolling update is fast enough. Also avoid it when your validation automation is weak, since blue green exposes 100% of users to the new version in one step. A poorly validated switch is more disruptive than a cautious canary. Avoid it too when your changes involve complex stateful or schema migrations that make running two versions simultaneously unsafe.

Last verified: 25 March 2026. Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.