How to Create CloudWatch Alarms in AWS (Console + CLI Examples)
A CloudWatch alarm watches a metric and takes action when it crosses a threshold you set. Without alarms, you have to watch dashboards manually and hope someone notices a problem before users report it. With well-configured alarms, CloudWatch pages you automatically when something goes wrong. This guide covers how to create CloudWatch alarms in the AWS Console and with the CLI, how to configure SNS notifications, how to choose thresholds that won’t flood you with false positives, and the mistakes that make alarms unreliable.
Simple explanation
Before creating an alarm, it helps to understand the difference between three related CloudWatch concepts:
- A metric is a time-series measurement. CPU usage at 2:00 PM, 2:05 PM, 2:10 PM. It is just data. CloudWatch collects hundreds of metrics automatically from AWS services with no setup required.
- A dashboard visualizes those metrics in charts. CloudWatch Dashboards are useful for understanding trends, but someone still has to look at them.
- An alarm watches a metric and does something when the value crosses a line you draw. No one has to be watching. The alarm fires on its own.
A useful way to picture it
A metric is the thermometer. A dashboard is the display on the wall that shows the reading. An alarm is the smoke detector that wakes you up at 3 AM. The first two give you information. Only the third one acts.
How CloudWatch alarms work
When you create an alarm, you define a chain of components that determine when and how it fires:
- Metric: which specific measurement to watch (for example, CPUUtilization for a specific EC2 instance)
- Statistic: how to aggregate data points within each period: Average, Sum, Minimum, Maximum, SampleCount, or a percentile like p99
- Period: the time window for each data point, in seconds (60 = 1 minute, 300 = 5 minutes)
- Threshold: the numeric value and comparison operator (greater than 80%, less than 10 GB, and so on)
- Evaluation periods: how many consecutive data points must breach the threshold before the alarm fires; higher values reduce false positives from brief spikes
- Datapoints to alarm (M of N): a flexible variant in which M out of the last N periods must breach the threshold. Setting 3 out of 5 fires the alarm if 3 of the last 5 data points were over the threshold, even if they weren’t all consecutive.
- Missing data treatment: what to do when no data arrives for a period (covered below)
- Actions: what happens when the alarm transitions to a new state
The three alarm states
| State | Meaning | When it occurs |
|---|---|---|
| OK | Metric is within the acceptable threshold | Normal operation; no threshold breach detected |
| ALARM | Threshold has been breached | Metric exceeded (or fell below) the threshold for the required evaluation periods |
| INSUFFICIENT_DATA | Not enough data to evaluate | New alarm before the first period completes, metric stopped reporting, or instance was stopped |
Actions fire on state transitions, not while an alarm stays in a state. To be notified when an incident starts and again when it resolves, configure actions for both the ALARM state and the OK state.
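One way to see which transitions an alarm actually went through is to read its history. The following is a minimal sketch, not a definitive recipe: `recent_state_changes` is a hypothetical wrapper name, and the alarm name in the example call is a placeholder. It assumes the AWS CLI is installed and allowed to call `describe-alarm-history`.

```shell
#!/bin/sh
# Hypothetical helper: list the most recent state transitions for one alarm.
# Assumes the AWS CLI is configured with permission for
# cloudwatch:DescribeAlarmHistory.
recent_state_changes() {
  aws cloudwatch describe-alarm-history \
    --alarm-name "$1" \
    --history-item-type StateUpdate \
    --max-records 10 \
    --query 'AlarmHistoryItems[*].[Timestamp,HistorySummary]' \
    --output table
}

# Example call (alarm name is a placeholder):
# recent_state_changes "EC2-HighCPU-prod-web-01"
```

Each row corresponds to one transition, which is also the moment any configured action for the new state would have run.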
M-of-N evaluation catches intermittent problems. If you set evaluation periods to 5 and datapoints to alarm to 3, the alarm fires when 3 of the last 5 data points breach the threshold, even if they were not consecutive. This catches recurring spikes that a strictly consecutive check might miss, while still ignoring a single isolated blip.
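The M-of-N rule can be modelled in a few lines of shell. This is only an illustration of the counting logic, not how CloudWatch is implemented: `m_of_n` is a hypothetical helper, the values are made up, and it handles integers only.

```shell
#!/bin/sh
# Hypothetical helper: evaluate an "M out of N" rule over a list of datapoints.
# Usage: m_of_n THRESHOLD M N value1 value2 ...
# Prints ALARM if at least M of the last N values are greater than THRESHOLD,
# otherwise OK.
m_of_n() {
  threshold=$1; m=$2; n=$3
  shift 3
  # Drop older datapoints so that only the last N remain.
  while [ "$#" -gt "$n" ]; do shift; done
  breaching=0
  for value in "$@"; do
    if [ "$value" -gt "$threshold" ]; then
      breaching=$((breaching + 1))
    fi
  done
  if [ "$breaching" -ge "$m" ]; then echo ALARM; else echo OK; fi
}

m_of_n 80 3 5 85 70 90 75 95   # three non-consecutive breaches -> ALARM
m_of_n 80 3 5 70 70 95 70 70   # one isolated spike -> OK
```

With the CLI, the same rule is expressed by passing `--evaluation-periods 5 --datapoints-to-alarm 3` to `put-metric-alarm`.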
Missing data treatment
When no metric data arrives during a period, CloudWatch needs to know how to handle the gap. The four options are:
- notBreaching — treat the missing period as within threshold. Good for sparse metrics like Lambda invocations that genuinely have quiet periods.
- breaching — treat the missing period as a threshold violation. Use this when absence of data is itself a problem, such as an EC2 health check that should always be reporting.
- ignore — keep the current alarm state unchanged during the gap.
- missing — the alarm transitions to INSUFFICIENT_DATA.
Choosing notBreaching for a metric that should always be reporting means the alarm goes silent if the data source disappears. For things like health checks and heartbeat metrics, set this to breaching so the alarm fires when data stops arriving.
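As a sketch of the heartbeat case, the function below creates an alarm that fires when a custom heartbeat metric stops arriving. The namespace `Custom/App`, the metric name `Heartbeat`, and the function name are assumptions made for illustration; substitute whatever your application actually publishes.

```shell
#!/bin/sh
# Hypothetical helper: alarm when a heartbeat metric goes quiet.
# Arguments: alarm name, SNS topic ARN.
# "Custom/App" and "Heartbeat" are placeholder names, not AWS-provided metrics.
create_heartbeat_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --namespace "Custom/App" \
    --metric-name Heartbeat \
    --statistic SampleCount \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 1 \
    --comparison-operator LessThanThreshold \
    --treat-missing-data breaching \
    --alarm-actions "$2"
}

# Example call (both values are placeholders):
# create_heartbeat_alarm "App-HeartbeatMissing" \
#   arn:aws:sns:us-east-1:123456789012:production-alerts
```

Because missing periods count as breaching, three minutes without any heartbeat datapoint moves the alarm to ALARM instead of leaving it in INSUFFICIENT_DATA.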
When to use CloudWatch alarms
Almost every production AWS workload needs alarms. These are the most common scenarios and what each alarm is protecting against:
- EC2 high CPU or status checks — sustained CPU above 80% may indicate a runaway process or an undersized instance. A failing status check means the instance or underlying hardware is unhealthy and needs immediate attention.
- Lambda errors or throttles — any errors in a function that should be error-free, or throttles showing you’ve hit the concurrency limit and requests are being dropped. See Monitoring Lambda in AWS for function-level alerting strategies.
- RDS low storage or connection pressure — storage exhaustion stops a database instance with no warning. Connection pressure approaching max_connections causes client errors as new connections are rejected.
- ALB 5xx spikes — a spike in 5xx responses from a load balancer means backend instances are returning errors. Alarm separately on ALB-level 5xx and target-level 5xx to distinguish load balancer issues from application issues.
- SQS queue backlog — if the age of the oldest message climbs, your consumer is falling behind. This usually means a consumer is failing silently or scaling hasn’t kicked in.
- Log-based patterns — you can create log-based metrics from CloudWatch Logs and alarm on them, which is useful for custom application errors that don’t map to a standard AWS metric.
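For the log-based case in the list above, a rough sketch looks like this. The filter name, metric name, namespace, and alarm name are all placeholders, and the pattern simply matches log events containing the term ERROR; real applications usually need a more specific pattern.

```shell
#!/bin/sh
# Hypothetical helper: count log events containing the term ERROR as a custom metric.
# Argument: log group name. Filter, metric, and namespace names are placeholders.
create_error_metric_filter() {
  aws logs put-metric-filter \
    --log-group-name "$1" \
    --filter-name "app-error-count" \
    --filter-pattern "ERROR" \
    --metric-transformations \
      metricName=AppErrorCount,metricNamespace=Custom/App,metricValue=1
}

# Hypothetical helper: alarm when the filter above records any error in 5 minutes.
# Argument: SNS topic ARN. Unless a default value is configured on the filter,
# no datapoint is emitted when nothing matches, so quiet periods are treated
# as notBreaching.
create_error_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "App-LogErrors" \
    --namespace "Custom/App" \
    --metric-name AppErrorCount \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$1"
}
```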
Choose the right alarm type
| Type | How it works | Best for |
|---|---|---|
| Metric alarm | Compares a metric (or metric math expression) against a fixed threshold you define | Most alarms. Use this by default: CPU > 80%, errors > 0, storage < 10 GB. |
| Composite alarm | Combines the states of multiple metric alarms using AND, OR, NOT logic | Reducing noise when a single metric spiking isn’t enough to confirm a real incident. Its actions can also be suppressed during maintenance windows. |
| Anomaly detection alarm | Learns the metric’s expected range from historical data and fires when it deviates significantly from the predicted band | Metrics with variable patterns: web traffic that spikes on weekdays, batch jobs with fluctuating volume, anything where a fixed threshold would generate constant false positives. |
Start with metric alarms for everything you can express as a fixed threshold. Add composite alarms once you find yourself getting paged on single-signal spikes that resolve on their own. Switch to anomaly detection when traffic patterns vary enough that no stable fixed threshold exists.
How to create a CloudWatch alarm in the AWS Console
The following steps create a metric alarm. If you’re new to CloudWatch, the CloudWatch overview explains how alarms fit into the broader service.
Open CloudWatch. In the AWS Console, search for CloudWatch in the top search bar and open the service. Confirm you’re in the correct region. Alarms are region-scoped and don’t cross regions.
Navigate to Alarms. In the left sidebar, under Alarms, click All alarms. Then click Create alarm.
Select a metric. Click Select metric. You’ll see namespaces for every AWS service that reports to CloudWatch. Browse to the service you want (for example, EC2 > Per-Instance Metrics) or use the search box to find a specific metric. Select the metric row and click Select metric.
Configure the metric and conditions. You’re now on the “Specify metric and conditions” screen.
- Under Metric: choose your statistic (Average is appropriate for most continuous metrics; Sum is better for error counts and event-based metrics) and your period (5 minutes is a sensible default; 1 minute gives faster detection at higher cost).
- Under Conditions: choose Static threshold type (or Anomaly detection if you want a learned band). Set the comparison operator and enter the threshold value.
Set evaluation logic. Expand Additional configuration:
- Set Datapoints to alarm. For example, “2 out of 3” means two of the last three periods must breach the threshold before the alarm fires. This avoids false positives from brief spikes.
- Set Missing data treatment. For most metrics, Treat missing data as missing (transitions to INSUFFICIENT_DATA) is a safe default. Choose Treat missing data as bad if absence of the metric is itself an alert condition.
Click Next.
Configure actions. Under Notification, select the alarm state that triggers the action (In alarm). Choose an existing SNS topic or create a new one. To also receive a recovery notification, click Add notification, choose the OK state, and select the same topic. You can also add EC2 actions (reboot, stop, recover), Auto Scaling policies, or Lambda invocations from this screen. Click Next.
Name the alarm. Give the alarm a descriptive name that includes the service, metric, and environment. For example: EC2-HighCPU-prod-web-01. A clear name makes it obvious what’s wrong when the alarm fires at 2 AM. Add an optional description. Click Next.
Review and create. Check the alarm configuration on the preview screen. Click Create alarm. The alarm will start in INSUFFICIENT_DATA state until the first evaluation period completes.
Confirm your SNS subscription before moving on. If you created a new SNS topic with an email subscription, AWS sends a confirmation email immediately. The subscription stays inactive until you click the confirmation link. Alarms will appear to work in the console but no notification will be delivered. Check your spam folder if the email doesn’t arrive.
How to create a CloudWatch alarm with the AWS CLI
The CLI is ideal for scripting alarm creation, managing alarms across many resources at once, or wiring alarm setup into infrastructure automation. All examples use put-metric-alarm.
EC2 high CPU alarm
This fires if average CPU utilization stays above 80% for two consecutive 5-minute periods. Ten minutes of sustained high CPU is a more reliable signal than a single spike, and two evaluation periods prevents false positives from brief bursts.
aws cloudwatch put-metric-alarm \
--alarm-name "EC2-HighCPU-prod-web-01" \
--alarm-description "CPU utilization above 80% for 10 minutes" \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-0abc123def456789 \
--statistic Average \
--period 300 \
--evaluation-periods 2 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--treat-missing-data breaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:production-alerts \
--ok-actions arn:aws:sns:us-east-1:123456789012:production-alerts
The --ok-actions flag points to the same SNS topic so you’re notified when CPU recovers, not just when it spikes. Setting --treat-missing-data breaching means if the instance stops reporting (for example, because it was terminated unexpectedly), the alarm fires rather than going silent.
Lambda error rate alarm using metric math
Lambda’s raw Errors metric counts errors, but the rate matters more when invocation volume varies. This example uses metric math to compute errors divided by invocations and fires when the error rate exceeds 5%. See Monitoring Lambda in AWS for more Lambda-specific alerting patterns.
aws cloudwatch put-metric-alarm \
--alarm-name "Lambda-HighErrorRate-process-orders" \
--alarm-description "Lambda error rate above 5%" \
--metrics '[
{
"Id": "errors",
"MetricStat": {
"Metric": {
"Namespace": "AWS/Lambda",
"MetricName": "Errors",
"Dimensions": [{"Name": "FunctionName", "Value": "process-orders"}]
},
"Period": 300,
"Stat": "Sum"
},
"ReturnData": false
},
{
"Id": "invocations",
"MetricStat": {
"Metric": {
"Namespace": "AWS/Lambda",
"MetricName": "Invocations",
"Dimensions": [{"Name": "FunctionName", "Value": "process-orders"}]
},
"Period": 300,
"Stat": "Sum"
},
"ReturnData": false
},
{
"Id": "error_rate",
"Expression": "errors / invocations * 100",
"Label": "Error Rate",
"ReturnData": true
}
]' \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:production-alerts
RDS low free storage alarm
RDS reports FreeStorageSpace in bytes, not gigabytes. The threshold used here, 10,737,418,240 bytes, is 10 × 1024³, which is strictly 10 GiB. Always check a metric’s unit before setting a threshold. Getting it wrong silently creates an alarm that never fires.
aws cloudwatch put-metric-alarm \
--alarm-name "RDS-LowFreeStorage-prod-postgres" \
--alarm-description "RDS free storage below 10 GB" \
--namespace AWS/RDS \
--metric-name FreeStorageSpace \
--dimensions Name=DBInstanceIdentifier,Value=prod-postgres \
--statistic Average \
--period 300 \
--evaluation-periods 3 \
--threshold 10737418240 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:production-alerts
RDS, EBS, and several other services report storage metrics in bytes. Setting a threshold of 10 thinking it means 10 GB actually means 10 bytes. The alarm will never fire. Always check the CloudWatch metric documentation for the unit before writing the threshold value.
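To avoid typing long byte counts by hand, the conversion can be done in the shell. This is a small sketch; `gib_to_bytes` is a hypothetical helper name and it only handles whole numbers.

```shell
#!/bin/sh
# Hypothetical helper: convert whole GiB to bytes for byte-based metrics
# such as FreeStorageSpace.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

threshold_bytes=$(gib_to_bytes 10)
echo "--threshold $threshold_bytes"   # prints: --threshold 10737418240
```

The resulting value can be passed straight to `put-metric-alarm`, for example as `--threshold "$(gib_to_bytes 10)"`.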
Testing alarms before relying on them
Use set-alarm-state to force an alarm into any state without affecting real metrics. This verifies that your SNS notification path works end to end before an actual incident.
# Force the alarm into ALARM state to test the SNS notification
aws cloudwatch set-alarm-state \
--alarm-name "EC2-HighCPU-prod-web-01" \
--state-value ALARM \
--state-reason "Testing alarm notification path"
# Reset it back to OK when done
aws cloudwatch set-alarm-state \
--alarm-name "EC2-HighCPU-prod-web-01" \
--state-value OK \
--state-reason "Test complete"
# List all alarms currently in ALARM state
aws cloudwatch describe-alarms \
--state-value ALARM \
--query 'MetricAlarms[*].{Name:AlarmName,Metric:MetricName,State:StateValue}' \
--output table
# Delete an alarm
aws cloudwatch delete-alarms \
--alarm-names "EC2-HighCPU-prod-web-01"
Setting up notifications with SNS
CloudWatch alarms deliver notifications through Amazon SNS (Simple Notification Service). SNS acts as a pub/sub fanout layer: you create a topic, subscribe one or more endpoints to it, and then point alarms at the topic ARN. When an alarm fires, SNS delivers the notification to every active subscription on the topic.
# Create an SNS topic for alarm notifications
aws sns create-topic --name production-alerts \
--query 'TopicArn' --output text
# Returns: arn:aws:sns:us-east-1:123456789012:production-alerts
# Subscribe an email address to the topic
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:production-alerts \
--protocol email \
--notification-endpoint your-email@example.com
# You will receive a confirmation email — click the link to activate
# Subscribe a webhook (for Slack, PagerDuty, or OpsGenie)
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:production-alerts \
--protocol https \
--notification-endpoint https://hooks.example.com/your/webhook/url
Email subscriptions work well for low-urgency or informational alarms. For production incidents that need immediate human response, connect SNS to a tool like PagerDuty or OpsGenie through an HTTPS subscription. An HTTPS endpoint has to confirm the subscription before SNS delivers to it, so a plain Slack incoming webhook generally needs an intermediary (for example, a small Lambda function) between the topic and the channel. The SNS messaging model page covers topic configuration and subscription filtering in more detail.
SNS email subscriptions don’t activate automatically. AWS sends a confirmation email as soon as you subscribe. Until you click the confirmation link, the subscription is pending and all alarm notifications are silently dropped. Alarms will show as firing in the console, but nothing is delivered. Always verify the subscription is confirmed (not pending) before treating any alarm as production-ready.
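A quick way to check for unconfirmed subscriptions from the CLI is sketched below. The function names are hypothetical and the topic ARN is a placeholder. The check relies on SNS reporting the literal string `PendingConfirmation` in place of a subscription ARN until the endpoint confirms.

```shell
#!/bin/sh
# Hypothetical helper: print endpoints on a topic that are still awaiting
# confirmation.
pending_subscriptions() {
  aws sns list-subscriptions-by-topic \
    --topic-arn "$1" \
    --query "Subscriptions[?SubscriptionArn=='PendingConfirmation'].Endpoint" \
    --output text
}

# Returns non-zero if any subscription on the topic is still pending.
topic_is_ready() {
  pending=$(pending_subscriptions "$1")
  if [ -n "$pending" ] && [ "$pending" != "None" ]; then
    echo "Still pending: $pending"
    return 1
  fi
  echo "All subscriptions confirmed"
}

# Example call (topic ARN is a placeholder):
# topic_is_ready arn:aws:sns:us-east-1:123456789012:production-alerts
```

Running a check like this as part of alarm setup catches the pending case before the first real incident does.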
Composite alarms: reducing alert noise
A composite alarm combines the states of multiple metric alarms using Boolean expressions (AND, OR, NOT). The most common use case: don’t page anyone when a single metric spikes in isolation. Require two or more related signals to be in ALARM simultaneously before notifying.
Think of it like a two-factor alarm
A burglar alarm that fires when the motion sensor OR the door sensor triggers will wake you up every time the cat walks past. One that fires only when the door sensor AND the motion sensor both trigger at 3 AM is actually useful. Composite alarms work the same way.
# Component alarm 1: Lambda errors above threshold
aws cloudwatch put-metric-alarm \
--alarm-name "Lambda-HighErrors" \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=process-orders \
--statistic Sum \
--period 300 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1
# Component alarm 2: Lambda p99 duration above threshold
aws cloudwatch put-metric-alarm \
--alarm-name "Lambda-HighDuration" \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=process-orders \
--statistic p99 \
--period 300 \
--threshold 8000 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1
# Composite alarm: fires only when BOTH component alarms are in ALARM state
aws cloudwatch put-composite-alarm \
--alarm-name "Lambda-DegradedService" \
--alarm-rule "ALARM(\"Lambda-HighErrors\") AND ALARM(\"Lambda-HighDuration\")" \
--alarm-description "Both high errors and high duration — service is degraded" \
--alarm-actions arn:aws:sns:us-east-1:123456789012:production-alerts
Composite alarms can also hold back notifications during a known outage or maintenance window. You designate a suppressor alarm, and while that alarm is in ALARM state the composite alarm’s actions are suppressed. If the component alarms have no actions of their own and only the composite notifies, this avoids an alert storm when you already know something is wrong. This pattern is covered in the incident response with monitoring guide.
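Action suppression is configured on the composite alarm itself by naming a suppressor alarm. The sketch below reuses the alarm names from the example above; the function name, the suppressor alarm name in the example call, and the 60-second periods are assumptions to adapt.

```shell
#!/bin/sh
# Hypothetical helper: composite alarm whose actions are suppressed while
# another alarm (the suppressor) is in ALARM state.
# Arguments: suppressor alarm name, SNS topic ARN.
create_suppressible_composite() {
  aws cloudwatch put-composite-alarm \
    --alarm-name "Lambda-DegradedService" \
    --alarm-rule 'ALARM("Lambda-HighErrors") AND ALARM("Lambda-HighDuration")' \
    --actions-suppressor "$1" \
    --actions-suppressor-wait-period 60 \
    --actions-suppressor-extension-period 60 \
    --alarm-actions "$2"
}

# Example call ("Maintenance-Window" is a placeholder for an alarm you
# put into ALARM state during planned work):
# create_suppressible_composite "Maintenance-Window" \
#   arn:aws:sns:us-east-1:123456789012:production-alerts
```

The wait period is how long the composite alarm waits for the suppressor to enter ALARM before running its actions. The extension period is how long it keeps waiting after the suppressor leaves ALARM.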
Choosing thresholds without creating noisy alerts
The most common complaint about CloudWatch alarms is that they’re either too noisy or miss real incidents. Both problems usually trace back to thresholds set without looking at actual baseline behavior.
Start with observed behavior
Before setting a threshold, look at your metric’s history in CloudWatch Dashboards. Find the normal range — not just the average, but also peak behavior during traffic spikes and deployments. Set your threshold where you would genuinely want to be paged, not at a round number that happens to be above average.
A good rule of thumb: watch a new service’s metrics for a full week before setting alarm thresholds. You want to see at least one weekday peak, one weekend, and ideally one deployment cycle. Thresholds set from a single hour of data are almost always wrong.
Period length affects both speed and noise
A 1-minute period detects problems faster than a 5-minute period, but it also makes alarms more sensitive to brief spikes. A metric that naturally jitters between 75% and 85% CPU will generate constant false positives with a 1-minute period and an 80% threshold. The same alarm with a 5-minute average period fires only during a sustained problem, which is usually what you actually want.
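To make the effect concrete, the following sketch compares five 1-minute CPU readings with their 5-minute average. The readings are made up for illustration, `average` and `count_above` are hypothetical helpers using integer arithmetic, and nothing here calls CloudWatch.

```shell
#!/bin/sh
# Integer average of all arguments (the result is truncated).
average() {
  sum=0; count=0
  for v in "$@"; do
    sum=$((sum + v))
    count=$((count + 1))
  done
  echo $((sum / count))
}

# Count how many of the values after the first argument exceed it.
count_above() {
  threshold=$1; shift
  hits=0
  for v in "$@"; do
    if [ "$v" -gt "$threshold" ]; then hits=$((hits + 1)); fi
  done
  echo "$hits"
}

# Made-up 1-minute CPU readings that jitter around the 80% threshold.
readings="75 85 78 84 76"

# Word splitting on $readings is intentional here.
echo "1-minute datapoints above 80: $(count_above 80 $readings)"
echo "5-minute average: $(average $readings)"
```

Two of the five 1-minute datapoints breach the threshold, while the 5-minute average (79 after truncation) stays below it.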
Use evaluation periods to absorb transient spikes
A single evaluation period with a tight threshold is appropriate only when you need zero tolerance. StatusCheckFailed > 0 is a good example: any failure is immediately serious and should fire without delay. For metrics that naturally vary, use 2 or 3 consecutive evaluation periods to require sustained bad behavior before the alarm fires.
When patterns vary, use anomaly detection
If your application traffic varies significantly by time of day or day of week, a fixed threshold will either be too tight during peak hours or too loose during quiet ones. Anomaly detection alarms learn the expected range from historical patterns and fire when the metric deviates beyond a configurable band, with no manual threshold tuning required.
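An anomaly detection alarm is created with the same `put-metric-alarm` command, using an `ANOMALY_DETECTION_BAND` expression in place of a fixed threshold. This is a sketch under assumptions: the function name, alarm name, load balancer dimension value, and the band width of 2 are all placeholders.

```shell
#!/bin/sh
# Hypothetical helper: alarm when ALB request count rises above the learned band.
# Argument: SNS topic ARN. The LoadBalancer dimension value is a placeholder.
create_anomaly_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "ALB-RequestCount-Anomaly" \
    --comparison-operator GreaterThanUpperThreshold \
    --evaluation-periods 3 \
    --threshold-metric-id band \
    --alarm-actions "$1" \
    --metrics '[
      {
        "Id": "m1",
        "ReturnData": true,
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/ApplicationELB",
            "MetricName": "RequestCount",
            "Dimensions": [{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}]
          },
          "Period": 300,
          "Stat": "Sum"
        }
      },
      {
        "Id": "band",
        "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
        "Label": "Expected range",
        "ReturnData": true
      }
    ]'
}
```

The second argument to `ANOMALY_DETECTION_BAND` sets the width of the band: a larger number tolerates more deviation before the alarm fires.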
Combine signals to reduce single-metric noise
If CPU alone fires multiple times per week without a real incident behind it, combine it with another metric. A composite alarm requiring both high CPU and elevated 5xx errors is almost always pointing at a real problem. CPU alone rarely is.
| Service | Metric | Suggested starting threshold | Notes |
|---|---|---|---|
| EC2 | CPUUtilization | > 80% for 10 min | 2 × 5-minute periods; adjust down for latency-sensitive apps |
| EC2 | StatusCheckFailed | > 0 for 1 min | Any failure here is serious; act immediately |
| Lambda | Errors (Sum) | > 0 in 5 min | For functions that should be error-free; use error rate for high-volume functions |
| Lambda | Throttles | > 0 in 5 min | Any throttle means requests are being dropped; check concurrency limits |
| Lambda | Duration | > 80% of timeout | Functions close to the timeout are at risk of timing out; optimize or increase the limit |
| RDS | FreeStorageSpace | < 20% of total | Storage exhaustion stops the instance abruptly; give yourself lead time to respond |
| RDS | DatabaseConnections | > 80% of max_connections | Connection exhaustion causes client errors; check for connection leaks |
| ALB | HTTPCode_ELB_5XX_Count | > 10 in 5 min | Load balancer-level errors; alarm separately from target-level 5xx |
| SQS | ApproximateAgeOfOldestMessage | > 300 seconds | Consumer falling behind or failing silently; investigate consumers first |
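Several rows in the table are percentages of a limit, so they have to be converted into the metric’s own unit before they can be used as a threshold. A small sketch of that arithmetic follows; the timeout, storage size, and connection limit are assumed values, and `percent_of` is a hypothetical helper.

```shell
#!/bin/sh
# percent_of PERCENT VALUE: integer PERCENT% of VALUE.
percent_of() { echo $(( $1 * $2 / 100 )); }

# Lambda Duration is reported in milliseconds; timeouts are configured in seconds.
lambda_timeout_s=10        # assumed function timeout
percent_of 80 $((lambda_timeout_s * 1000))              # prints 8000 (ms)

# FreeStorageSpace is reported in bytes; allocated storage is sized in GiB.
allocated_gib=100          # assumed allocated storage
percent_of 20 $((allocated_gib * 1024 * 1024 * 1024))   # prints 21474836480 (bytes)

# DatabaseConnections is a plain count.
max_connections=200        # assumed engine setting
percent_of 80 "$max_connections"                        # prints 160
```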
Common mistakes
- Using 1 evaluation period for noisy metrics. One data point above the threshold is enough to fire the alarm. For metrics that naturally vary (like CPU on a busy application server), this generates false positives constantly. Use evaluation-periods=2 or 3. Reserve 1 for metrics where any breach is unacceptable, like StatusCheckFailed.
- Not configuring OK actions. If you only configure an ALARM action, you get paged when something breaks but never notified when it recovers. Your on-call engineer has to keep manually checking the console. Add an OK action to the same SNS topic so recovery is communicated automatically.
- Forgetting to confirm SNS email subscriptions. SNS sends a confirmation email immediately after you create a subscription. Until you click the link, the subscription is inactive and alarm notifications are silently dropped. Always confirm subscriptions before treating an alarm as production-ready, and check your spam folder.
- Setting the wrong missing-data treatment. Choosing notBreaching for a metric that should always be reporting means the alarm goes quiet when the data source disappears. Think carefully about what absence of data means for each specific metric before accepting the default.
- Setting thresholds with no baseline. Picking 80% CPU because it sounds reasonable, without checking that your service normally runs at 75%, guarantees constant false positives. Observe metrics in CloudWatch Dashboards for at least a few days before fixing thresholds.
- Alarming on one noisy signal only. A single CPU or memory metric is rarely enough to confirm a real incident. Combine signals using composite alarms, or use anomaly detection for metrics with variable baselines, to reduce noise without sacrificing coverage.
- Never testing the alarm path. Creating an alarm and assuming it works is not the same as verifying it. Use set-alarm-state to force the alarm into ALARM state and confirm the notification arrives, routes to the right person, and is actionable. An alarm that silently fails to deliver is worse than no alarm at all.
Summary
- CloudWatch alarms have three states: OK, ALARM, and INSUFFICIENT_DATA. Actions fire on state transitions, not while staying in a state.
- Every alarm defines a metric, statistic, period, threshold, and evaluation logic. The combination of period and evaluation periods controls sensitivity.
- Use metric alarms for fixed thresholds, composite alarms for multi-signal noise reduction, and anomaly detection alarms for metrics with variable baselines.
- Send notifications through SNS topics. Confirm email subscriptions before treating any alarm as production-ready.
- Configure both ALARM and OK actions so recovery is communicated automatically, not just the initial incident.
- Test every alarm with set-alarm-state before relying on it in production.
- Set thresholds based on observed baseline behavior, not round numbers.
Frequently asked questions
What is a CloudWatch alarm?
A CloudWatch alarm watches a single metric, or the result of a metric math expression, over a specified time period and changes state when the value crosses a threshold you define. When an alarm enters the ALARM state, it can trigger actions such as sending an SNS notification, triggering Auto Scaling, stopping or rebooting an EC2 instance, or invoking a Lambda function.
What are the three CloudWatch alarm states?
OK means the metric is within the defined threshold. ALARM means the threshold has been breached for the required evaluation periods. INSUFFICIENT_DATA means CloudWatch does not have enough data points to evaluate. This is the starting state for a new alarm and also occurs when a metric stops reporting, such as when an EC2 instance is stopped.
What is the difference between a metric alarm and a composite alarm?
A metric alarm watches a single CloudWatch metric (or metric math expression) and fires when it crosses a threshold. A composite alarm combines the states of other alarms using Boolean logic (AND, OR, NOT). Composite alarms help reduce noise: for example, one can fire only when both a high-CPU alarm and a high-error-rate alarm are in ALARM state simultaneously, rather than when either one spikes independently.
How often does CloudWatch evaluate alarms?
CloudWatch evaluates alarms at the end of every period you define. If your period is 5 minutes (300 seconds), the alarm is re-evaluated every 5 minutes. High-resolution alarms using 10-second or 30-second periods are evaluated at that frequency but incur higher costs.
What happens when CloudWatch data is missing?
You control this with the treat-missing-data setting. notBreaching treats missing data as within threshold, which is good for sparse metrics with genuine quiet periods. breaching treats missing data as a threshold violation, which is useful when absence of data is itself a problem. ignore keeps the current alarm state unchanged. missing transitions the alarm to INSUFFICIENT_DATA.