Canary Deployments in AWS: Safe Rollouts for ECS, Lambda, and ALB-Based Apps
A canary deployment sends a small slice of real production traffic to a new version of your service while the rest continues hitting the stable version. If the new version has a serious bug, only that fraction of users is affected. When signals look healthy, you complete the rollout. If signals look bad, you roll back in seconds. This page covers canary deployments in AWS for ECS, Lambda, and ALB-fronted applications, including CodeDeploy traffic shifting, automatic rollback with CloudWatch alarms, and how to choose the right pattern for your workload.
What a canary deployment actually does
Instead of flipping a switch that sends 100 percent of your users to the new version at once, you nudge the dial to 10 percent. Most users keep hitting the stable version. A small slice hits the new one. You watch what happens for 10 to 15 minutes.
If error rates stay flat, latency looks normal, and business metrics hold steady, you nudge to 50 percent, observe again, then go to 100 percent. If anything looks wrong, you nudge back to 0 percent and the rollback is complete. No scrambling, no incident bridge, no manual deploys at 2am.
The key advantage over a full release is that you get real production signal before the whole fleet runs the new code: real users, real data, real integrations, in a configuration staging cannot replicate.
Where the name comes from: Coal miners carried canaries into mines because the birds succumb to toxic gases such as carbon monoxide well before levels become dangerous for humans. In software, the canary deployment shows distress first, before the problem reaches all users.
Which AWS pattern fits your platform
AWS offers multiple canary deployment patterns. The right one depends on where your workload runs.
| Platform | Recommended pattern | How traffic shifts |
|---|---|---|
| Lambda | CodeDeploy with Lambda aliases | Alias routing splits invocations between two function versions |
| ECS (with CodeDeploy) | CodeDeploy blue/green with TimeBasedCanary | CodeDeploy manages ALB traffic between two ECS task sets |
| ECS / ALB-fronted apps | ALB weighted target groups | Manually adjust weights between stable and canary target groups |
| EC2-backed services | ALB weighted target groups (versioned backends) | Register old and new instances in separate target groups and adjust ALB rule weights |
EC2 and CodeDeploy: CodeDeploy supports canary and linear traffic shifting natively for Lambda and ECS. For EC2 and on-premises workloads, CodeDeploy uses a different deployment model (in-place or blue/green with instance replacement) that does not provide the same granular percentage-based traffic shifting. If you need percentage-based canaries for EC2-backed services, use ALB weighted target groups directly.
How canary deployments work in AWS
The rollout flow
- Deploy a new version alongside the stable one. The old version keeps serving traffic. Nothing is decommissioned yet.
- Route a small percentage of traffic to the new version. Typically 5 to 10 percent, depending on your traffic volume and risk tolerance.
- Watch technical and business signals. Error rates, latency p99, failed transactions, unusual log patterns.
- Promote if healthy. Shift to 50 percent, observe again, then shift to 100 percent.
- Roll back automatically if alarms fire. CodeDeploy or your ALB rule reverts the traffic split before most users notice anything.
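The rollout flow above can be sketched as a small loop. This is a minimal illustration, not how CodeDeploy is implemented: `run_canary`, `set_weight`, and `is_healthy` are hypothetical names, and the callbacks stand in for whatever shifts traffic and reads your alarms.

```python
from typing import Callable, List


def run_canary(
    stages: List[int],
    set_weight: Callable[[int], None],
    is_healthy: Callable[[int], bool],
) -> bool:
    """Walk through traffic stages; revert to 0 percent on the first unhealthy check.

    Returns True if the rollout reached the final stage, False if it rolled back.
    """
    for percent in stages:
        set_weight(percent)          # shift this share of traffic to the new version
        if not is_healthy(percent):  # observe signals at this stage
            set_weight(0)            # roll back: all traffic returns to the stable version
            return False
    return True


# Stand-in callbacks that only record what would happen.
# The fake health check fails once the canary reaches 50 percent.
history: List[int] = []
ok = run_canary([10, 50, 100], history.append, lambda percent: percent < 50)
print(ok, history)
```

In a real pipeline the health check would wait out the observation window before returning, and the weight setter would call the ALB or CodeDeploy.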
Prerequisites for a successful canary
- Working health checks: Both ALB target groups and your ECS tasks or Lambda functions must have health checks configured and passing.
- Observability in place: You need CloudWatch metrics watching the canary before you deploy. A canary with no monitoring provides no safety net.
- A configured rollback path: Either CodeDeploy alarm-linked rollback or a runbook for manual ALB weight reversion. See Rollbacks in AWS CodeDeploy.
- Backward-compatible schema changes: Both versions run simultaneously. A schema change the old version cannot read will cause data corruption or errors.
- Enough traffic volume: At very low traffic, canary percentages do not generate enough requests to detect bugs reliably.
When to use canary deployments
Canary deployments are the right tool when:
- You are releasing a change to a customer-facing API and want to validate it under real load before full exposure.
- You are updating a Lambda-backed production function that handles payments, notifications, or other high-stakes operations.
- You are deploying a new version of an ECS service behind an ALB and want gradual traffic promotion with automatic rollback.
- The change is high-risk: a large refactor, a dependency upgrade, or a new integration with a third-party service.
- You need production validation that staging cannot provide, especially for performance or data-dependent behavior.
When not to use canary deployments
- Very low-traffic workloads. With 100 requests per day and a 10 percent split, you get roughly 10 canary requests per day. That is not enough signal to catch most bugs. Consider a blue-green deployment instead.
- Breaking database or schema changes. Running two versions against the same schema when only the new version understands it will corrupt data. Make migrations additive first.
- Batch jobs and queue processors. There is no live request routing to split. A canary percentage does not apply to workers consuming from a queue.
- Simple, low-risk changes. Updating a log message or a response field label does not need a canary. The operational overhead is not worth it for trivial changes.
- Changes that break session affinity. If users have session state that must stay pinned to a single version, splitting requests across versions will cause broken experiences.
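The low-traffic arithmetic above is easy to check before committing to a canary. The sketch below assumes traffic is spread evenly across the day and that requests hit a bug independently of each other; both are simplifications, and the function names are illustrative.

```python
def expected_canary_requests(requests_per_day: float, canary_percent: float,
                             window_minutes: float) -> float:
    """Expected number of requests the canary sees during the observation window."""
    per_minute = requests_per_day / (24 * 60)
    return per_minute * window_minutes * canary_percent / 100


def detection_probability(bug_rate: float, canary_requests: float) -> float:
    """Chance that at least one canary request triggers a bug that affects
    the given fraction of requests."""
    return 1 - (1 - bug_rate) ** canary_requests


# 100 requests per day, 10 percent canary, observed for a full day.
low_traffic = expected_canary_requests(100, 10, 24 * 60)
# 5 million requests per day, 10 percent canary, observed for 15 minutes.
high_traffic = expected_canary_requests(5_000_000, 10, 15)

for label, n in (("low traffic", low_traffic), ("high traffic", high_traffic)):
    chance = detection_probability(0.01, n)  # a bug hitting 1 percent of requests
    print(f"{label}: {n:.0f} canary requests, {chance:.1%} chance of seeing the bug")
```

With roughly 10 canary requests, a bug that affects 1 percent of requests will most likely go unseen, which is why a different strategy or a much longer window makes sense at that volume.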
Canary vs blue-green vs rolling vs feature flags
These four strategies all reduce deployment risk but solve different problems. Use this table to decide which fits your situation. See also: Blue-Green Deployments in AWS.
| Strategy | Best for | How traffic moves | Rollback speed | Main downside |
|---|---|---|---|---|
| Canary | High-risk releases with enough traffic volume | Gradual percentage shift: 10% to 50% to 100% | Fast (seconds to minutes) | Two versions run in parallel; schema changes must be backward-compatible |
| Blue-green | Releases needing instant full rollback capability | Switches 100% at once; old environment stays on standby | Very fast (near-instant ALB switch) | Double infrastructure cost during transition; no gradual real-traffic validation |
| Rolling | Instance-by-instance updates with no spare environment | Replaces instances one at a time, mixing old and new | Slow (must re-deploy old version) | Mixed versions run simultaneously; no easy full rollback path |
| Feature flags | User-level feature exposure within the same codebase | App-level toggle per user; no infrastructure routing change | Instant (flip the flag) | Requires flag management infrastructure; long-lived flags accumulate technical debt |
Use canary or blue-green for deploying a new service version, and feature flags for controlling which users see a new feature within that version. They complement each other. AWS AppConfig integrates with Lambda and ECS for feature flags without building your own evaluation system.
Canary with ALB weighted target groups
Best for: ECS services, EC2-backed services, or any ALB-fronted workload where you want direct control over traffic splitting without requiring CodeDeploy integration.
An Application Load Balancer can forward traffic across two target groups using weighted rules. You register the new version’s tasks or instances in a second target group, set the ALB listener rule weights, and observe. From there you either promote the canary by raising its weight or revert the weights if something goes wrong.
What you need first
- An ALB listener rule you can modify
- A stable target group (current version with passing health checks)
- A canary target group (new version, same health check path)
- CloudWatch alarms watching the canary target group’s error rate
Step 1: Create the canary target group and set initial weights at 10%
# Create target group for new version
aws elbv2 create-target-group \
  --name my-app-canary \
  --protocol HTTP \
  --port 8080 \
  --vpc-id vpc-abc123 \
  --health-check-path /health \
  --target-type ip

# Route 10% of traffic to the canary
aws elbv2 modify-rule \
  --rule-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:listener-rule/... \
  --actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "arn:...my-app-stable", "Weight": 90},
          {"TargetGroupArn": "arn:...my-app-canary", "Weight": 10}
        ]
      }
    }
  ]'

Step 2: Observe, then promote in stages
Monitor for 15 to 30 minutes. Watch the canary target group’s 5xx rate, latency p99, and application-level metrics in CloudWatch. If signals are healthy, promote:
# Shift to 50/50
aws elbv2 modify-rule \
  --rule-arn arn:... \
  --actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "arn:...my-app-stable", "Weight": 50},
          {"TargetGroupArn": "arn:...my-app-canary", "Weight": 50}
        ]
      }
    }
  ]'

# Complete rollout
aws elbv2 modify-rule \
  --rule-arn arn:... \
  --actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "arn:...my-app-stable", "Weight": 0},
          {"TargetGroupArn": "arn:...my-app-canary", "Weight": 100}
        ]
      }
    }
  ]'

The ALB weighted target group approach gives you full control over timing and works without any CodeDeploy setup. The tradeoff is that there is no built-in alarm-triggered rollback. You need to script the weight reversion or trigger it from a pipeline step. For automated rollback, the CodeDeploy-integrated patterns below are a better fit.
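Because this approach leaves rollback scripting to you, it can help to generate the `--actions` payload from one function so that promotion and reversion cannot drift apart. The sketch below only builds the JSON; `forward_actions` and `rollback_actions` are hypothetical helper names, and the ARNs are the same elided placeholders used in the steps above.

```python
import json
from typing import Dict, List


def forward_actions(stable_arn: str, canary_arn: str, canary_weight: int) -> List[Dict]:
    """Build the forward action for an ALB rule with the given canary weight (0-100)."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": stable_arn, "Weight": 100 - canary_weight},
                {"TargetGroupArn": canary_arn, "Weight": canary_weight},
            ]
        },
    }]


def rollback_actions(stable_arn: str, canary_arn: str) -> List[Dict]:
    """Reversion is just the same payload with the canary weight at zero."""
    return forward_actions(stable_arn, canary_arn, 0)


# Placeholder ARNs, elided as in the CLI examples.
payload = rollback_actions("arn:...my-app-stable", "arn:...my-app-canary")
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed to `aws elbv2 modify-rule --actions` from a pipeline step or an alarm-triggered script.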
Lambda canary deployments with CodeDeploy
Best for: Lambda functions in production where you want percentage-based traffic shifting with automatic alarm-triggered rollback, fully managed by CodeDeploy.
Lambda canary deployments use function aliases and AWS CodeDeploy. A Lambda alias points to a specific function version. CodeDeploy adjusts the alias’s routing configuration to start sending a small percentage of invocations to the new version, then promotes or rolls back based on CloudWatch alarm state.
For Lambda-specific metrics to watch during a canary, see Monitoring Lambda in CloudWatch.
How aliases work: Think of a Lambda alias like a pointer on a sign outside a restaurant. The sign says “Today’s special” and points to kitchen A (your current version). You quietly start routing one table out of ten to kitchen B (the new version) without changing the sign. If kitchen B serves bad food, you stop sending tables there and kitchen A handles everything again.
Built-in deployment configurations
| Configuration | What it does |
|---|---|
| Canary10Percent5Minutes | Sends 10% to new version, waits 5 minutes, then shifts 100% |
| Canary10Percent10Minutes | Sends 10% to new version, waits 10 minutes, then shifts 100% |
| Canary10Percent30Minutes | Sends 10% to new version, waits 30 minutes, then shifts 100% |
| Linear10PercentEvery1Minute | Increases traffic to new version by 10% every minute until 100% |
| Linear10PercentEvery3Minutes | Increases traffic to new version by 10% every 3 minutes until 100% |
| AllAtOnce | Shifts all traffic immediately with no canary window |
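To compare these configurations, it can help to expand each name into the traffic schedule it implies. The sketch below parses the names in the table and assumes the first increment is applied as soon as the deployment starts; `traffic_schedule` is an illustrative helper, not part of any AWS SDK.

```python
import re
from typing import List, Tuple


def traffic_schedule(config: str) -> List[Tuple[int, int]]:
    """Return (minute, percent_on_new_version) steps implied by a configuration name."""
    if config == "AllAtOnce":
        return [(0, 100)]

    canary = re.fullmatch(r"Canary(\d+)Percent(\d+)Minutes", config)
    if canary:
        percent, wait = int(canary.group(1)), int(canary.group(2))
        return [(0, percent), (wait, 100)]

    linear = re.fullmatch(r"Linear(\d+)PercentEvery(\d+)Minutes?", config)
    if linear:
        step, interval = int(linear.group(1)), int(linear.group(2))
        schedule = []
        minute, percent = 0, step
        while percent < 100:
            schedule.append((minute, percent))
            minute += interval
            percent += step
        schedule.append((minute, 100))  # final step always lands on 100 percent
        return schedule

    raise ValueError(f"unrecognized configuration: {config}")


for name in ("Canary10Percent10Minutes", "Linear10PercentEvery3Minutes"):
    print(name, traffic_schedule(name))
```

Under that assumption, a canary configuration has one observation window at a fixed percentage, while a linear configuration spreads exposure across many short steps.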
SAM template with canary deployment and alarm-triggered rollback
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    AutoPublishAlias: live
    DeploymentPreference:
      Type: Canary10Percent10Minutes
      Alarms:
        - !Ref CanaryErrorAlarm
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs20.x
      CodeUri: ./src
  CanaryErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: MyFunction-CanaryErrors
      MetricName: Errors
      Namespace: AWS/Lambda
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 5
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: FunctionName
          Value: !Ref MyFunction
        - Name: Resource
          Value: !Sub '${MyFunction}:live'

When CanaryErrorAlarm fires during the canary window, CodeDeploy automatically rolls back. The alias returns to pointing 100 percent at the previous version without manual intervention. See Rollbacks in AWS CodeDeploy for how Lambda rollbacks work across deployment types.
What metrics matter for Lambda canaries
- Errors: The primary rollback trigger. Set the threshold based on your normal error baseline, not zero.
- Duration p99: Performance regressions appear here before they show up as errors.
- Throttles: A spike may indicate resource exhaustion or a misconfigured concurrency limit.
- Custom business metrics: If your function handles payments or job processing, emit custom metrics for failed operations and alarm on those too.
ECS canary deployments with CodeDeploy
Best for: ECS services where you want CodeDeploy to manage the full deployment lifecycle including task set creation, traffic shifting, and alarm-triggered rollback, without manually adjusting ALB rules.
For ECS workloads, CodeDeploy manages traffic shifting between two ECS task sets rather than target groups you modify manually. CodeDeploy creates the new task set, registers it with the ALB, shifts traffic according to your configuration, and tears down the old task set after a successful deployment. If alarms breach, it shifts traffic back to the original task set and the old tasks keep running.
This requires configuring your ECS service to use the CODE_DEPLOY deployment controller. For teams building this into a pipeline, see CI/CD Pipelines for ECS.
Deployment configuration: time-based canary
{
  "deploymentConfig": {
    "deploymentType": "BLUE_GREEN",
    "blueGreenDeploymentConfiguration": {
      "terminateBlueInstancesOnDeploymentSuccess": {
        "action": "TERMINATE",
        "terminationWaitTimeInMinutes": 5
      },
      "deploymentReadyOption": {
        "actionOnTimeout": "CONTINUE_DEPLOYMENT",
        "waitTimeInMinutes": 0
      }
    },
    "trafficRoutingConfig": {
      "type": "TimeBasedCanary",
      "timeBasedCanary": {
        "canaryPercentage": 10,
        "canaryInterval": 5
      }
    }
  }
}

With TimeBasedCanary, CodeDeploy sends 10 percent of traffic to the new task set for 5 minutes. If CloudWatch alarms stay in OK state, it shifts 100 percent and terminates the old task set. If an alarm breaches, traffic shifts back to the original task set.

The JSON above combines settings for readability. In the CodeDeploy API they live in two places: trafficRoutingConfig belongs to a deployment configuration (a custom one, or a predefined one such as CodeDeployDefault.ECSCanary10Percent5Minutes), while the blue/green settings are set on the deployment group.
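As a sketch of the traffic-routing part on its own, the function below builds a request body in the shape the CodeDeploy CreateDeploymentConfig operation accepts for the ECS compute platform. It only constructs the dictionary; the helper name and the configuration name `my-ecs-canary-10-5` are hypothetical.

```python
from typing import Dict


def canary_deployment_config(name: str, percentage: int, interval_minutes: int) -> Dict:
    """Build a CreateDeploymentConfig request body for a time-based ECS canary."""
    if not 0 < percentage < 100:
        raise ValueError("percentage must be between 1 and 99")
    if interval_minutes < 1:
        raise ValueError("interval_minutes must be at least 1")
    return {
        "deploymentConfigName": name,
        "computePlatform": "ECS",
        "trafficRoutingConfig": {
            "type": "TimeBasedCanary",
            "timeBasedCanary": {
                "canaryPercentage": percentage,     # share of traffic sent to the new task set
                "canaryInterval": interval_minutes  # minutes to wait before shifting the rest
            },
        },
    }


config = canary_deployment_config("my-ecs-canary-10-5", 10, 5)
print(config)
```

With boto3, a dictionary like this could be passed as keyword arguments to the CodeDeploy client's `create_deployment_config` call, and the resulting configuration name referenced from the deployment group.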
ALB weighted target groups vs ECS CodeDeploy: The ALB approach gives you manual control over timing and works without CodeDeploy integration. It is useful for simpler setups or EC2-backed targets. The ECS CodeDeploy approach automates task set creation, traffic shifting, and rollback. It is better for production ECS services where you want the full deployment lifecycle managed.
What to monitor during a canary
Error rate alone is not enough. A canary can have zero HTTP 5xx errors while silently returning incorrect data, creating corrupted records, or degrading performance in ways users notice later. Monitor across both of these categories:
Technical signals
- HTTP 5xx / function errors: The minimum baseline. Watch both the absolute count and the rate relative to request volume on the canary target group.
- Latency p95 / p99: Performance regressions often appear here before they surface as errors. A 3x increase in p99 latency is a problem even with a zero error rate.
- Resource saturation: CPU, memory, and connection pool usage. A memory leak may not produce errors during a short canary window but will cause problems hours after a full rollout.
- Unusual log patterns: Unexpected exceptions, retry storms, timeout warnings, or error messages that did not appear in the previous version.
Business signals
- Failed transactions: Checkout completions, payment successes, signup completions, job success rates.
- Data integrity: Records written correctly, no duplicate or missing writes caused by the new version’s logic.
- Downstream impact: If this service calls others, watch for cascading error spikes in dependent services.
Set up your CloudWatch alarms on both categories before you deploy, not after the canary is already running. For deeper investigation when something goes wrong mid-canary, see Debugging Production Systems.
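One way to reason about both categories together is to compare the canary against the stable version over the same window. The sketch below is illustrative only: the `Signals` fields and the thresholds are assumptions (the 3x p99 ratio echoes the latency example above), and real values should come from your own baselines.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    requests: int
    errors: int               # HTTP 5xx or function errors
    p99_ms: float             # p99 latency in milliseconds
    failed_transactions: int  # business-level failures, e.g. failed checkouts


def canary_problems(stable: Signals, canary: Signals,
                    max_error_rate_delta: float = 0.01,
                    max_p99_ratio: float = 3.0,
                    max_txn_failure_delta: float = 0.01) -> List[str]:
    """Return the reasons the canary looks worse than stable (empty list if none)."""
    if stable.requests == 0 or canary.requests == 0:
        return ["not enough traffic to compare"]
    problems = []
    error_delta = canary.errors / canary.requests - stable.errors / stable.requests
    if error_delta > max_error_rate_delta:
        problems.append(f"error rate up by {error_delta:.2%}")
    if stable.p99_ms > 0 and canary.p99_ms / stable.p99_ms >= max_p99_ratio:
        problems.append(f"p99 latency is {canary.p99_ms / stable.p99_ms:.1f}x stable")
    txn_delta = (canary.failed_transactions / canary.requests
                 - stable.failed_transactions / stable.requests)
    if txn_delta > max_txn_failure_delta:
        problems.append(f"failed transaction rate up by {txn_delta:.2%}")
    return problems


# Same error rate as stable, but a clear latency regression.
stable = Signals(requests=9000, errors=18, p99_ms=220.0, failed_transactions=9)
canary = Signals(requests=1000, errors=2, p99_ms=700.0, failed_transactions=1)
problems = canary_problems(stable, canary)
print(problems)
```

Comparing rates rather than raw counts matters because the canary sees far fewer requests than the stable version.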
Automatic rollback with CloudWatch alarms
Automatic rollback is what makes canary deployments genuinely safe at scale. Manual monitoring during the canary window is unreliable. It does not work at 3am, it does not catch slow-building problems, and it requires someone to act quickly under pressure. Connect your alarms to CodeDeploy and let the system roll back without human intervention.
See Rollbacks in AWS CodeDeploy for how rollback behavior differs across Lambda, ECS, and EC2 deployment types.
# Create a CloudWatch alarm on canary target group 5xx errors.
# ALB target metrics are published with a LoadBalancer dimension alongside
# TargetGroup; app/my-alb/def456 is a placeholder for your load balancer.
aws cloudwatch put-metric-alarm \
  --alarm-name "canary-5xx-rate" \
  --alarm-description "Triggers rollback if canary error rate exceeds threshold" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --dimensions Name=TargetGroup,Value=targetgroup/my-app-canary/abc123 Name=LoadBalancer,Value=app/my-alb/def456 \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching

Reference this alarm in your CodeDeploy deployment group. When the alarm breaches during a deployment, CodeDeploy automatically reverses the traffic shift and restores the previous version. Your on-call engineer gets notified that a rollback occurred but does not need to take action.
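To build intuition for the alarm settings used here (Sum per 60-second period, 2 evaluation periods, threshold 10, missing data treated as not breaching), the sketch below models when such an alarm would fire. It is a deliberately simplified model; CloudWatch's actual handling of missing datapoints is more nuanced, and `alarm_breached` is an illustrative name.

```python
from typing import List, Optional


def alarm_breached(sums_per_period: List[Optional[float]],
                   threshold: float = 10,
                   evaluation_periods: int = 2) -> bool:
    """True if each of the most recent periods has a sum at or above the threshold.

    A missing datapoint (None) is treated as not breaching.
    """
    recent = sums_per_period[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return False  # not enough periods observed yet
    return all(value is not None and value >= threshold for value in recent)


# Per-minute 5xx sums from the canary target group (made-up numbers).
print(alarm_breached([0, 12, 15]))     # two consecutive breaching periods
print(alarm_breached([12, 3, 11]))     # a quiet minute in between
print(alarm_breached([12, None, 15]))  # missing data counts as not breaching
```

A single noisy minute does not trigger a rollback with these settings, but two consecutive bad minutes do, which also means the fastest possible reaction is about two minutes.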
Test your rollback before you need it. Deploy a canary that you know will fail by introducing a deliberate error, and verify that the alarm fires and CodeDeploy rolls back correctly. Do this in staging. On-call engineers should never discover that rollback automation was misconfigured during a real incident.
Common mistakes
- Canary window too short. A 1 to 2 minute window will not catch memory leaks, rare code paths, or bugs that only appear under sustained load. Start with at least 10 to 15 minutes for most services.
- Percentage too low for the traffic volume. At 1 percent on a service handling 100 requests per day, you get one canary request per day. Scale the percentage to get enough requests to be statistically meaningful.
- Monitoring only infrastructure error rates. A canary that processes payments should also alarm on failed transactions. A canary handling ML inference should alarm on abnormal latency. Infrastructure errors are a floor, not a ceiling.
- Never testing the automatic rollback. If you have not verified that alarm breach triggers CodeDeploy rollback end-to-end, you do not have rollback protection. You have rollback hope. Test it in staging before you depend on it in production.
- Deploying a canary over a breaking schema change. Two versions running against the same database where only the new version understands the new schema will cause data corruption or query errors. Use additive migrations first.
- Treating canary as a substitute for feature flags. A canary shifts traffic by percentage; it cannot target specific users. If you need to roll out a feature to beta users or a specific cohort, use feature flags alongside your canary deployment.
Schema changes and canary deployments do not mix unless migrations are additive. When the new version introduces a column or table that the old version does not know about, and both versions write to the same database simultaneously, you will get errors, duplicate records, or silent data loss. Run the migration first. Deploy the canary after both versions can handle the new schema safely.
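A small, self-contained illustration of an additive migration, using an in-memory SQLite database: the new column is nullable, so the old version's statements keep working while the new version starts writing the extra field. The `orders` table, the `currency` column, and the "USD" fallback are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER NOT NULL)")

# Additive migration, applied BEFORE the canary: a new nullable column.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")


def old_version_write(amount_cents: int) -> None:
    # The old version does not know about the new column and never mentions it.
    conn.execute("INSERT INTO orders (amount_cents) VALUES (?)", (amount_cents,))


def new_version_write(amount_cents: int, currency: str) -> None:
    conn.execute("INSERT INTO orders (amount_cents, currency) VALUES (?, ?)",
                 (amount_cents, currency))


def old_version_read():
    # Names its columns explicitly, so the extra column is invisible to it.
    return conn.execute("SELECT id, amount_cents FROM orders ORDER BY id").fetchall()


def new_version_read():
    rows = conn.execute("SELECT id, amount_cents, currency FROM orders ORDER BY id").fetchall()
    # Rows written by the old version have no currency; fall back to an assumed default.
    return [(row_id, amount, currency or "USD") for row_id, amount, currency in rows]


# Both versions write during the canary window.
old_version_write(1200)
new_version_write(3400, "EUR")

print(old_version_read())
print(new_version_read())
```

The same idea applies in reverse when removing a column: stop reading and writing it in every running version first, and drop it only after the old version is gone.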
Summary
- A canary deployment sends 5 to 10 percent of production traffic to a new version. If it has bugs, only that fraction of users is affected before rollback.
- For Lambda, use CodeDeploy with aliases and a configuration like Canary10Percent10Minutes. For ECS, use CodeDeploy with TimeBasedCanary or ALB weighted target groups for manual control. For EC2 and ALB-fronted apps, use ALB weighted target groups directly since EC2 CodeDeploy uses a different deployment model.
- Configure CloudWatch alarms on error rates and key business metrics, and link them to CodeDeploy for automatic rollback. Never rely on manual monitoring during the canary window.
- Set the canary window long enough to get statistical signal. For high-traffic services, 10 to 15 minutes is usually enough. For low-traffic services, consider whether canary is the right strategy at all.
- Ensure schema migrations are backward-compatible before running a canary. Two versions running against an incompatible schema will cause data corruption.
- Canary is traffic-based. Feature flags are user-based. Use both: canary for safe infrastructure rollouts, feature flags for controlled feature exposure within a version.
Frequently asked questions
What percentage of traffic should a canary receive?
Start with 5 to 10 percent. That is enough to detect real problems while keeping the blast radius small: if something goes wrong, only that fraction of users is affected. For very high-traffic services, even 1 percent can be statistically significant. For low-traffic services, you may need a higher percentage to generate enough requests to detect bugs reliably.
How long should a canary run?
Long enough to generate meaningful signal. For services handling thousands of requests per minute, 10 to 15 minutes is usually enough. For low-traffic services handling dozens of requests per day, you may need hours. Monitor error rates, latency p99, and business metrics throughout, not just whether the alarm stayed quiet.
What if my service has very low traffic?
Canary deployments are less effective at very low traffic volumes. With 100 requests per day and a 10 percent canary split, you get roughly 10 canary requests per day, which is not enough to detect most bugs. Consider a higher percentage, a longer observation window, or a blue-green deployment instead.
Can I use canary deployments with database migrations?
Only if the migration is backward-compatible. Both versions of your service run simultaneously and hit the same database. A schema change that the old version cannot read will cause errors or data corruption. Always use additive migrations: add columns or tables before removing old ones. Never deploy a canary that depends on a breaking schema change.
What is the difference between canary, blue-green, and feature flags?
A canary gradually shifts a percentage of traffic to a new service version. Blue-green flips 100 percent of traffic to a new environment at once, with the old environment on standby for rollback. A feature flag controls whether a specific feature is visible to specific users within a single running version of your code. Canary and blue-green are infrastructure patterns. Feature flags are an application-level pattern. Teams often use all three together.