Canary Deployments on GCP Explained: Cloud Run Traffic Splitting
A canary deployment releases a new version to a controlled slice of production traffic and monitors it before rolling out further. If the canary encounters elevated errors or unexpected latency, you roll back before the problem reaches most users. On GCP, this is built into Cloud Run’s revision and traffic-splitting system with no extra infrastructure required to get started.
What is a canary deployment?
When you deploy a new version of an application, you face a choice: replace the old version immediately and trust your tests, or expose the new version to a controlled percentage of real users first and watch what happens. A canary deployment takes the second approach.
Instead of sending all production traffic to the new version at once, you route a small fraction, typically 1 to 5%, while the majority continues to the proven stable version. You monitor both versions side by side. If the new version performs well, you gradually shift more traffic. If it does not, you move all traffic back to the stable version and most users never experience the problem.
The reduced blast radius is the core value here. A bug that passes all pre-production tests may only surface under specific real-world conditions: particular user data, high concurrency, or traffic patterns that staging never replicates. A canary exposes the new version to genuine production conditions on a controlled subset before you commit fully.
Coal miners once sent canaries into tunnels before entering themselves. If the canary was harmed by gas, miners knew to stay back. In software, a small fraction of users acts as the canary. If they encounter problems, the rest of your users never do and you roll back before increasing the percentage.
How canary deployments work on Cloud Run
Cloud Run is built around the concept of named revisions. Every time you deploy a new container image, Cloud Run creates a new immutable revision. By default it routes all traffic to the latest revision, but you can override this and split traffic between any number of revisions explicitly, down to the percentage level.
This makes the canary pattern straightforward to implement without any additional tooling:
- Deploy the new revision with --no-traffic so it receives zero production traffic initially
- When you also pass --tag, Cloud Run assigns it a stable tagged URL you can use for smoke testing in the real production environment with zero user exposure
- When smoke tests pass, shift a small percentage of live traffic to the new revision
- Monitor error rate and latency per revision side by side
- Increase the percentage gradually as confidence builds
- Roll back instantly if problems surface at any stage
You can also automate this entire progression using Cloud Deploy, GCP’s managed delivery pipeline service. Cloud Deploy handles traffic splitting, verification steps, and rollback without manual commands at each stage. For teams deploying frequently, this is the more sustainable approach. It is also less likely to be skipped under pressure.
Step-by-step canary deployment with Cloud Run
Here is a complete canary rollout from initial deploy to full promotion. Each step is intentional. Do not skip ahead to save time.
Step 1: Deploy with no traffic
Deploy the new image and tell Cloud Run not to route any production traffic to it yet. The --tag flag assigns a stable subdomain URL so you can test this specific revision directly.
gcloud run deploy api-service \
--image=europe-west2-docker.pkg.dev/my-app-prod/api/api:v2.0.0 \
--region=europe-west2 \
--no-traffic \
--tag=canary
After this command completes, the new revision exists and is running but serves zero production traffic. It is only accessible at its tagged URL, for example: https://canary---api-service-xxxx-ew.a.run.app
Step 2: Smoke test the tagged revision
Run your smoke tests against the tagged URL. Check that the application starts correctly, connects to its dependencies, and returns expected responses on core paths. Look at Cloud Logging for the new revision to confirm there are no startup errors before proceeding.
A health check endpoint returning 200 does not mean your application is working correctly. Test the paths that matter: submit a form, trigger a key workflow, check a core API response. Failures caught here are free. Failures caught after traffic shifts are not.
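A smoke test of this kind could be sketched as a small shell script. This is a minimal sketch, assuming `curl` is installed; the paths `/health` and `/api/v1/orders`, the `CANARY_URL` value, and the function names are hypothetical placeholders rather than endpoints this service is known to expose.

```shell
#!/usr/bin/env bash
# Smoke-test sketch for a tagged Cloud Run revision.
# CANARY_URL and the paths below are placeholders (assumptions), not real endpoints.
CANARY_URL="${CANARY_URL:-https://canary---api-service-xxxx-ew.a.run.app}"

# check_path BASE_URL PATH EXPECTED_STATUS
# Succeeds only when the endpoint answers with the expected HTTP status.
check_path() {
  local status
  status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1$2")"
  if [ "$status" = "$3" ]; then
    echo "ok   $2 -> $status"
  else
    echo "FAIL $2 -> $status (expected $3)"
    return 1
  fi
}

# run_smoke_tests BASE_URL
# Checks every core path and returns non-zero if any of them failed.
run_smoke_tests() {
  local failures=0
  check_path "$1" "/health" 200 || failures=$((failures + 1))
  check_path "$1" "/api/v1/orders" 200 || failures=$((failures + 1))
  [ "$failures" -eq 0 ]
}
```

Running `run_smoke_tests "$CANARY_URL"` and stopping the rollout on a non-zero exit status keeps the check scriptable. Replace the placeholder paths with the workflows that actually matter for your service.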
Step 3: Shift a small initial percentage
Once smoke tests pass, send a small slice of live traffic to the new revision. Get the exact revision name from the deploy output or by listing revisions first.
# Confirm the new revision name
gcloud run revisions list \
--service=api-service \
--region=europe-west2
# Send 5% to the canary, keeping the rest on the stable revision
gcloud run services update-traffic api-service \
--region=europe-west2 \
--to-revisions=api-service-00050-abc=5,LATEST=0
Step 4: Monitor both revisions
Watch your metrics with revision-level filtering. You need revision-level comparison, not just overall service health. Give the canary enough time to collect meaningful signal before drawing conclusions. Set up alert policies on the canary revision’s error rate before proceeding to the next step.
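For a quick command-line cross-check alongside Cloud Monitoring, one option is to count recent 5xx entries in the Cloud Run request logs for a single revision. This is a rough sketch, assuming request logs are being written to Cloud Logging and that `gcloud` is authenticated against the right project; the function names are hypothetical.

```shell
# revision_5xx_filter SERVICE REVISION
# Prints a Cloud Logging filter matching 5xx request logs for one revision.
revision_5xx_filter() {
  printf 'resource.type="cloud_run_revision" AND resource.labels.service_name="%s" AND resource.labels.revision_name="%s" AND httpRequest.status>=500' "$1" "$2"
}

# count_recent_5xx SERVICE REVISION [FRESHNESS]
# Counts matching log entries in the recent window (default: last 15 minutes).
count_recent_5xx() {
  gcloud logging read "$(revision_5xx_filter "$1" "$2")" \
    --freshness="${3:-15m}" \
    --format='value(httpRequest.status)' | wc -l
}
```

Calling `count_recent_5xx api-service api-service-00050-abc` and then the same for `api-service-00049-xyz` gives two raw counts to compare. Raw counts are not rates, so divide by each revision's request volume before drawing conclusions.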
Step 5: Increase traffic gradually
If the canary looks healthy after a suitable observation window, increase the percentage. There is no fixed rule for how many stages to use; calibrate them to your service’s traffic volume and the risk level of the change.
# After a monitoring window, increase to 20%
gcloud run services update-traffic api-service \
--region=europe-west2 \
--to-revisions=api-service-00050-abc=20,LATEST=0
# After another monitoring window, increase to 50%
gcloud run services update-traffic api-service \
--region=europe-west2 \
--to-revisions=api-service-00050-abc=50,LATEST=0
Step 6: Full rollout
Once you are confident the new revision is stable at higher percentages, send it all traffic.
gcloud run services update-traffic api-service \
--region=europe-west2 \
--to-revisions=api-service-00050-abc=100
Step 7: Rollback command (have it ready before you start)
Paste this into an open terminal before any traffic is shifted. Do not search for it during an active incident. If the canary shows problems at any step, this returns all traffic to the stable revision immediately.
# Rollback: send all traffic back to the stable revision
gcloud run services update-traffic api-service \
--region=europe-west2 \
--to-revisions=api-service-00049-xyz=100
Cloud Run revisions that receive zero traffic and scale to zero cost nothing; only revisions with minimum instances configured keep billing while idle. Running a canary at 5% means you pay for 5% of requests on the new revision and 95% on the stable one. There is no meaningful cost penalty for canary testing on Cloud Run.
How it works in practice
Stepping back from the individual commands, here is how a canary release actually flows in a real team’s deployment cycle:
- The stable revision is handling 100% of production traffic and is known to be healthy.
- A new revision is deployed with no traffic. The team runs smoke tests against the tagged URL to catch obvious failures before any real users are involved.
- The team shifts 1 to 5% of live traffic to the new revision. Revision-filtered dashboard panels and alert policies are already configured, so nobody has to watch dashboards manually.
- After a suitable observation window, they compare error rate and latency between the two revisions. If the canary looks equivalent to stable, they increase the percentage.
- This continues in stages until the new revision handles 100%. The old revision then sits idle at zero cost for a day before anyone considers removing it.
- If at any stage the canary shows problems, a single command restores the stable revision to 100% and the incident process takes over.
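The flow above can be sketched as a small wrapper script. This is an illustrative sketch, not a production tool: the function names and the `DRY_RUN` switch are hypothetical, the revision names are the example ones used in this article, and the health check is whatever command you supply. Unlike a command that names only the canary, it pins both revisions explicitly so the split is unambiguous.

```shell
# Staged-rollout sketch. DRY_RUN=1 (the default) only prints the gcloud commands.
SERVICE="api-service"
REGION="europe-west2"
CANARY="api-service-00050-abc"   # example revision names from this article
STABLE="api-service-00049-xyz"
DRY_RUN="${DRY_RUN:-1}"

# set_canary_percent PERCENT
# Sends PERCENT of traffic to the canary and the remainder to the stable revision.
set_canary_percent() {
  local canary_pct="$1"
  local stable_pct=$((100 - canary_pct))
  set -- gcloud run services update-traffic "$SERVICE" \
    --region="$REGION" \
    --to-revisions="${CANARY}=${canary_pct},${STABLE}=${stable_pct}"
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# rollback: returns all traffic to the stable revision.
rollback() { set_canary_percent 0; }

# promote_in_stages HEALTH_CHECK_CMD [ARGS...]
# Walks 5 -> 20 -> 50 -> 100 and rolls back as soon as the health check fails.
promote_in_stages() {
  local pct
  for pct in 5 20 50 100; do
    set_canary_percent "$pct"
    if ! "$@"; then
      echo "canary unhealthy at ${pct}%, rolling back"
      rollback
      return 1
    fi
  done
}
```

With the default dry run, `promote_in_stages true` prints the four traffic commands without touching the service. A real health check command would wait out the observation window and then compare per-revision metrics before returning.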
Blue/green is a light switch: off, then fully on, all at once. A canary is a dimmer: you bring the new version up slowly and watch what happens as the light changes. If something looks wrong, you dim it back before most people notice.
The important thing to understand: a canary is not just a traffic split command. It is a strategy that only provides safety if you have per-revision monitoring in place before traffic starts shifting. Without that, you are rolling out gradually but blind, which is less safe than a well-validated blue/green switch.
For teams integrating canary into their full delivery pipeline, see CI/CD pipelines for Cloud Run to understand how the build-deploy-promote cycle fits together end to end.
When to use canary deployments
Canary is not the right choice for every release. It adds process overhead and requires mature monitoring in place before it delivers real safety benefits.
Strong fit
- High-risk changes: authentication changes, pricing logic, payment flows, database query refactors — any code path where a bug would be immediately visible or costly
- Changes with uncertain production behaviour: features that are difficult to test exhaustively in staging, or that depend on production-scale traffic patterns to surface bugs
- Services with sufficient traffic: you need enough requests flowing to the canary revision to measure a meaningful signal within a practical time window
- Teams with monitoring already in place: the blast radius reduction only works if someone is watching the canary, or alert policies are watching it automatically
Weaker fit
- Low-traffic services: at 5% of 10 requests per day, the canary sees roughly one request every two days, so the observation window becomes impractical; you would need days or weeks, not hours, to collect a meaningful signal
- Simple, low-risk changes: a dependency version bump or minor copy change does not warrant a staged rollout and the overhead that comes with it
- Session-sensitive applications: Cloud Run splits traffic per-request, not per-user session; a single user may hit the stable revision on one request and the canary on the next
If your application stores user state that is incompatible between versions — cached tokens, session cookies, shopping cart data — a canary can create confusing mixed-state experiences. A user might add an item to their cart on the stable revision, then hit the canary revision on the next request and find it missing. Plan for this before shifting any traffic, or choose a different strategy for that release.
For managing how changes flow through dev, staging, and production as part of a wider delivery process, see managing environments in CI/CD.
Canary vs blue/green vs rolling deployments
These three strategies solve different problems. The right choice depends on what kind of risk you are managing and what operational tooling you have available.
| Strategy | Traffic exposure | Rollback speed | Monitoring needed | Complexity |
|---|---|---|---|---|
| Canary | Gradual, small % of real users | Fast (single traffic command) | High, per-revision metrics required | Medium-high |
| Blue/green | All-at-once after isolated testing | Instant (traffic switch) | Medium, post-switch window | Medium |
| Rolling | Progressive replacement (pod by pod) | Slower, requires redeploy or scale-down | Low | Low |
Blue/green deployments give you a clean, all-or-nothing switch: you validate the new version in isolation and then flip all traffic at once. The rollback is instant because the old version stays running. The key limitation is that you validate against synthetic or pre-production traffic. If a bug only surfaces under real production load, blue/green will not catch it before the switch.
Rolling deployments replace instances one by one and are the Kubernetes default. They are the simplest strategy operationally, but both old and new versions serve traffic simultaneously during the rollout with no explicit traffic control. You cannot cap exposure at 5%. You are rolling forward whether or not problems emerge.
Canary sits between these. It provides the production-traffic validation that blue/green lacks and the explicit traffic control that rolling deployments lack, but it requires the most preparation to execute safely. The monitoring infrastructure is not optional.
What to monitor during a canary
Set up these checks before shifting any traffic. You need the dashboards ready before the canary starts, not after you notice something is wrong. Cloud Monitoring lets you filter all Cloud Run metrics by revision name, which is the capability that makes per-revision comparison possible.
- Error rate per revision: filter run.googleapis.com/request_count by response code class and revision name. Compare the 5xx rate on the canary against the stable revision. A canary error rate more than 2 percentage points above baseline is a signal to roll back immediately. Do not wait to gather more data.
- Latency (p99): average latency hides slow outliers. Watch p99 on run.googleapis.com/request_latencies filtered to the canary revision. A p99 increase of more than 50% compared to the stable revision suggests a performance regression that will affect user experience.
- Downstream dependency health: check Cloud Trace to see if the canary is generating unusual failure rates on calls to databases or external APIs that the stable revision is not. A regression in a downstream call may not show up as a server error but will degrade user experience in ways that are harder to catch.
- Business metrics: if you emit log-based metrics for key events such as orders placed, searches completed, or sign-ups, compare the rate per request between revisions. A functional regression that does not cause server errors will only surface here. See metrics in GCP for how to build log-based metrics for this kind of tracking.
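The two numeric thresholds above (more than 2 percentage points of extra 5xx rate, or a p99 more than 50% above stable) can be turned into a simple decision helper. This is a sketch under assumptions: the function name is hypothetical, the inputs are counts and milliseconds you have already pulled from your per-revision metrics, and rates are truncated to whole basis points because the arithmetic is integer-only.

```shell
# canary_verdict CANARY_5XX CANARY_TOTAL STABLE_5XX STABLE_TOTAL CANARY_P99_MS STABLE_P99_MS
# Prints "rollback", "healthy", or "insufficient-data".
canary_verdict() {
  local c_err="$1" c_total="$2" s_err="$3" s_total="$4" c_p99="$5" s_p99="$6"

  # Without requests on both revisions there is nothing to compare.
  if [ "$c_total" -eq 0 ] || [ "$s_total" -eq 0 ]; then
    echo "insufficient-data"
    return 0
  fi

  # Error rates in basis points (1 percentage point = 100 bp), truncated.
  local c_bp=$((c_err * 10000 / c_total))
  local s_bp=$((s_err * 10000 / s_total))
  if [ $((c_bp - s_bp)) -gt 200 ]; then
    echo "rollback"
    return 0
  fi

  # p99 more than 50% above stable: c_p99 > 1.5 * s_p99  <=>  2 * c_p99 > 3 * s_p99
  if [ $((c_p99 * 2)) -gt $((s_p99 * 3)) ]; then
    echo "rollback"
    return 0
  fi

  echo "healthy"
}
```

For example, `canary_verdict 3 100 5 1000 400 380` compares a 3% canary error rate with a 0.5% stable rate and prints `rollback`. The helper covers only error rate and latency; dependency health and business metrics still need their own checks.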
Give the canary enough time at each percentage before concluding it is healthy. At 5% traffic on a service handling 100 requests per minute, you get only 5 canary requests per minute. Wait at least 15 to 30 minutes before moving to the next stage. For lower-traffic services, the observation window needs to extend to hours, not minutes.
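The arithmetic behind the observation window is easy to script. The helper below is a sketch: the function name is hypothetical, and the target of 150 canary requests is an assumed sample size chosen to match the 30-minute example, not a statistical recommendation.

```shell
# minutes_for_samples REQUESTS_PER_MINUTE CANARY_PERCENT TARGET_REQUESTS
# Prints the minutes (rounded up) until the canary has served TARGET_REQUESTS.
minutes_for_samples() {
  # Canary requests per 100 minutes; scaled by 100 to stay in integer arithmetic.
  local per_100_min=$(( $1 * $2 ))
  if [ "$per_100_min" -le 0 ]; then
    echo "never"
    return 1
  fi
  # Ceiling division: (target * 100) / per_100_min, rounded up.
  echo $(( ($3 * 100 + per_100_min - 1) / per_100_min ))
}
```

`minutes_for_samples 100 5 150` prints `30`, matching the upper end of the window above. At 2 requests per minute, `minutes_for_samples 2 5 150` prints `1500`, which is 25 hours for the same sample size.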
Build a Cloud Monitoring dashboard with side-by-side panels for canary and stable revisions before any rollout begins. Switching between separate metric queries during an active incident adds seconds you cannot afford. Have the comparison view ready before the first traffic shift.
Common mistakes
No monitoring before starting the canary. A canary without per-revision metrics is just a slow rollout with no safety net. Set up error rate and latency alerts on the canary revision before shifting any traffic. Monitoring is not something you add after you first notice a problem.
Starting too high. Beginning at 20% or 50% defeats the purpose. Start at 1 to 5%. You can increase quickly if things look healthy, but you cannot undo a large blast radius once a bug is in front of half your users.
Not having the rollback command ready. Before shifting any traffic, write the rollback command in an open terminal and leave it there. Searching for the correct syntax during an active incident is avoidable and adds stress you do not need.
Not waiting long enough. Five minutes at 5% is not enough signal for most services. Wait for a statistically meaningful number of canary requests before deciding the release is safe. Low-traffic services require longer windows. The temptation to promote quickly is highest when you feel confident, which is precisely when patience matters most.
Ignoring session behaviour. Cloud Run splits traffic per-request, not per-user session. A single user may hit the stable revision on one request and the canary on the next. If your application has version-incompatible session data or user-state assumptions, account for this in your rollout plan or consider a different strategy for that release.
Promoting based on elapsed time alone. “It has been running for an hour” is not the same as “the canary looks healthy.” If you were not watching the right metrics, you have not actually validated anything. Promotion decisions should be based on observed data, not time elapsed.
Automating canary progression with Cloud Deploy
Manual traffic splitting works, but it requires someone to run each command at each stage, watch dashboards, and decide when to promote. Under deadline pressure, these steps get skipped or rushed. Cloud Deploy has a built-in canary strategy that automates the entire progression: traffic splitting, verification, promotion, and rollback.
serialPipeline:
  stages:
  - targetId: prod
    strategy:
      canary:
        runtimeConfig:
          cloudRun:
            automaticTrafficControl: true
        canaryDeployment:
          percentages: [5, 20, 50]
          verify: true
This configuration defines canary phases at 5%, 20%, and 50% ahead of the final 100% phase. With verify: true, Cloud Deploy runs a verification job in each phase, and a failed verification fails the rollout at that phase so it does not advance. Combined with Cloud Deploy automation rules that advance phases and repair failed rollouts, the progression and the rollback need no human at the exact moment a problem surfaces.
# Create a release and Cloud Deploy manages the entire canary progression
gcloud deploy releases create api-v2-0-0 \
--delivery-pipeline=api-pipeline \
--region=europe-west2 \
--images=api=europe-west2-docker.pkg.dev/my-app-prod/api/api:v2.0.0
For understanding what happens when automated verification fails and how to trigger manual rollbacks within Cloud Deploy, see rollbacks in Cloud Deploy. For hardening the pipeline itself, see secure CI/CD pipelines.
Summary
- Canary deployments expose the new version to a controlled percentage of real production traffic, providing validation that synthetic tests and staging environments cannot replicate
- On Cloud Run: deploy with --no-traffic, smoke test at the tagged URL, then shift traffic gradually using --to-revisions=REVISION=PERCENT
- The blast radius reduction only works if you have per-revision monitoring in place before any traffic is shifted. Monitoring is the mechanism, not an afterthought.
- Monitor error rate and p99 latency filtered to each revision separately. Overall service health metrics will not show you what you need.
- Have your rollback command ready in an open terminal before the canary begins
- Canary differs from blue/green in that both versions serve real users simultaneously; blue/green switches all traffic at once after isolated testing
- Cloud Deploy automates progression through configurable traffic percentages with verification steps and automatic rollback if anything fails
Frequently asked questions
What is a canary deployment?
A canary deployment routes a small percentage of production traffic to a new version while the majority continues to the stable version. You monitor error rates and latency on the new version. If it behaves well, you increase the split incrementally until the new version handles 100%. If problems emerge, you send all traffic back to the stable version and most users never experience the issue.
How does Cloud Run support canary deployments?
Cloud Run's revision system natively supports traffic splitting between revisions. Deploy a new revision with --no-traffic so it receives nothing initially, verify it at its tagged URL, then use --to-revisions=REVISION=PERCENT to shift a small percentage. You can run multiple revisions simultaneously with explicit traffic weights, and Cloud Run only charges for requests actually served.
What is the difference between canary and blue/green deployments?
A canary gradually shifts traffic so both versions serve real users simultaneously over an extended period. Blue/green switches all traffic at once after testing in isolation. Canary provides more evidence because you validate under real production load, but it requires per-revision monitoring to be safe. Blue/green is simpler and faster to execute, but cannot catch bugs that only surface under real traffic patterns.
How much traffic should a canary start with?
Start at 1 to 5%. A canary at 50% exposes half your users to an untested version, which defeats the purpose. Start small, collect signal, then increase gradually. A progression of 5%, 20%, 50%, 100% is a reasonable starting point, adjusted to your service's traffic volume and risk tolerance.
How long should a canary run before full rollout?
Long enough to collect statistically meaningful signal. At 5% traffic on a service handling 100 requests per minute, you get only 5 canary requests per minute. Wait at least 15 to 30 minutes before concluding the canary is healthy. For lower-traffic services, the observation window may need to be several hours. The temptation to promote quickly is highest when you are most confident, which is exactly when patience matters most.