Monitoring and Alerting for Cloud Engineers

Monitoring is how you know whether your system is healthy without someone telling you. Alerting is how you get told when it is not. Both are skills — not just tool configuration. This page covers the concepts, the practical decisions, and the difference between monitoring that helps and monitoring that is just noise.

Metrics, logs, and traces: the three pillars of observability

Observability is the ability to understand the internal state of a system from its external outputs. Three types of data form the basis of it.

Metrics are numeric measurements collected over time — CPU usage, request rate, error rate, latency percentiles, queue depth. They are cheap to store and fast to query, which makes them ideal for dashboards and alerting.

Logs are timestamped records of events — what the application was doing at a specific moment. They are high-detail but high-volume. Useful for understanding the context of a specific failure after the fact.

Traces follow a single request as it moves through multiple services. Each service adds a span to the trace, which records how long that segment took. Traces reveal where time is being spent in a distributed system — which database call is slow, which external API is the bottleneck.
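
As a rough illustration of how spans expose a bottleneck, the sketch below models a trace as a list of spans and computes each span's self time: its duration minus the time spent in its direct children. The Span fields, service names, and timings are invented for the example. It also assumes child spans run one after another rather than in parallel; real tracing systems record considerably more detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span name -> time spent in that span, excluding its direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.name: span.duration_ms - child_total.get(span.span_id, 0.0)
        for span in spans
    }


# Hypothetical trace for one request (all names and timings are made up).
trace = [
    Span("a", None, "api-gateway: GET /orders", 0.0, 480.0),
    Span("b", "a", "orders-service: list_orders", 10.0, 470.0),
    Span("c", "b", "orders-db: SELECT", 20.0, 440.0),
    Span("d", "b", "pricing-api: get_prices", 445.0, 465.0),
]

times = self_times(trace)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])
```

In this made-up trace the database query accounts for 420 ms of the 480 ms request, so that is where an investigation would start.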

In practice: metrics tell you that something is wrong. Logs tell you what was happening when it went wrong. Traces tell you where in the system the problem occurred.

SLIs and SLOs: measuring what matters

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) give monitoring a purpose beyond “graph everything and hope for the best”.

An SLI is a specific, measurable indicator of service health. Good SLIs represent what users actually experience:

  • Availability: percentage of requests that succeed (non-5xx responses)
  • Latency: percentage of requests that complete within a target time (e.g. under 200ms)
  • Error rate: percentage of requests that return an error
  • Throughput: requests handled per second

An SLO is the target value for an SLI. “99.9% of requests will succeed” or “95% of requests will complete in under 500ms” are SLOs. They give you an objective definition of “healthy” versus “unhealthy”.
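
To make the SLI/SLO distinction concrete, here is a minimal sketch that computes two SLIs from a batch of request records and compares them with the example targets above. The Request type and the sample data are assumptions made for illustration; in practice these numbers would usually come from your metrics system rather than from raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to complete the request


def availability_sli(requests):
    """Fraction of requests that did not return a 5xx response."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, target_ms):
    """Fraction of requests that completed within target_ms."""
    fast = sum(1 for r in requests if r.latency_ms <= target_ms)
    return fast / len(requests)


# Made-up sample: 1000 requests from one measurement window.
sample = (
    [Request(200, 120.0)] * 990
    + [Request(200, 900.0)] * 8
    + [Request(503, 50.0)] * 2
)

availability = availability_sli(sample)
latency = latency_sli(sample, target_ms=500)

# SLO targets taken from the examples in the text.
print(f"availability {availability:.1%}, meets 99.9% SLO: {availability >= 0.999}")
print(f"latency {latency:.1%}, meets 95% SLO: {latency >= 0.95}")
```

With this sample the latency SLO is met (99.2% of requests finish within 500ms) while the availability SLO is not (99.8% against a 99.9% target).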

An error budget is what you have left before you breach an SLO. If your SLO is 99.9% availability, your error budget is the remaining 0.1%: one failed request in every thousand or, measured as time, about 43 minutes of downtime in a 30-day month. When your error budget is shrinking too fast, reliability work takes priority over new features.
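
The arithmetic behind an error budget can be sketched in a few lines. The function names here are hypothetical, and the request counts in the second call are made up.

```python
def downtime_budget_minutes(slo, days=30):
    """Minutes of full downtime allowed by an availability SLO over `days`."""
    return (1 - slo) * days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


print(round(downtime_budget_minutes(0.999), 1))           # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

The second call reads as: with a 99.9% SLO and one million requests so far, 1000 failures are allowed, so 250 failures leave three quarters of the budget.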

For most teams, starting simple is correct: define availability and latency SLOs for your critical services, build dashboards around them, and alert when you are burning through your error budget too quickly.
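
One common way to express "burning through your error budget too quickly" is a burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the SLO period; higher values spend it proportionally faster. The sketch below assumes a 30-day period, and the paging threshold is a hypothetical value, not a recommendation.

```python
def burn_rate(observed_error_rate, slo):
    """How many times faster than sustainable the error budget is being spent."""
    return observed_error_rate / (1 - slo)


def days_until_exhausted(observed_error_rate, slo, period_days=30):
    """Days a full budget would last if the current error rate continued."""
    rate = burn_rate(observed_error_rate, slo)
    return float("inf") if rate == 0 else period_days / rate


PAGE_ABOVE = 10  # hypothetical paging threshold; tune for your own service

rate = burn_rate(observed_error_rate=0.005, slo=0.999)
days_left = days_until_exhausted(0.005, 0.999)
print(round(rate, 2), round(days_left, 1), rate > PAGE_ABOVE)
```

Here a 0.5% error rate against a 99.9% SLO burns the budget about five times too fast, so a full budget would last roughly six days instead of thirty. That is worth attention, but it sits below the example paging threshold.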

What to monitor

Two widely used frameworks for deciding what to instrument are the RED method for services and the USE method for resources.

RED (for services):

  • Rate — how many requests per second
  • Errors — how many requests are failing
  • Duration — how long requests take (latency)

USE (for infrastructure resources):

  • Utilisation — what percentage of the resource is being used
  • Saturation — how much work is queued or waiting
  • Errors — how many errors the resource is producing

For a web application, you would apply RED to your API endpoints and USE to your databases, CPU, memory, and network interfaces. This combination covers most of what can go wrong.
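
As a sketch of the RED side, the function below derives rate, error ratio, and a latency percentile from raw request records for one endpoint. The record shape (status, duration in milliseconds) and the sample traffic are assumptions for illustration; a metrics library would normally aggregate these for you.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, and Duration from (status, duration_ms) pairs."""
    statuses = [status for status, _ in requests]
    durations = [duration for _, duration in requests]
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": sum(1 for s in statuses if s >= 500) / len(requests),
        "p95_ms": percentile(durations, 95),
    }


# Made-up traffic for one endpoint over a 60-second window.
window = [(200, 80.0)] * 540 + [(200, 400.0)] * 54 + [(500, 30.0)] * 6
metrics = red_metrics(window, window_seconds=60)
print(metrics)
```

For this window the endpoint handles 10 requests per second, 1% of them fail, and the 95th percentile latency is 400ms even though most requests take 80ms.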

Cloud monitoring tools: what each provider gives you

Each major cloud provider includes a monitoring platform. These handle basic metrics automatically, but you need to configure dashboards, custom metrics, and alerts yourself.

Provider   Metrics                   Logs               Traces
AWS        CloudWatch Metrics        CloudWatch Logs    X-Ray
GCP        Cloud Monitoring          Cloud Logging      Cloud Trace
Azure      Azure Monitor / Metrics   Log Analytics      Application Insights

Most teams also use third-party observability platforms such as Datadog, New Relic, or a Prometheus and Grafana stack. These often provide a better query experience, cross-cloud visibility, and richer alerting capabilities than native cloud tools. In interviews, knowing the native tools is important; knowing that third-party tools exist and why teams use them is also valued.

Prometheus deserves special mention for Kubernetes environments. It is the de facto standard metrics system for Kubernetes clusters, and most cloud providers’ managed Kubernetes services either include it or offer a compatible managed service. Prometheus scrapes metrics from your applications and Kubernetes components over HTTP, and Grafana visualises them.
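
To give a feel for what Prometheus scrapes, the sketch below renders a counter in the Prometheus text exposition format, which an application conventionally serves at a /metrics path. The metric name, labels, and values are made up, and the function does not escape special characters in label values. A real service would normally use an official Prometheus client library rather than building this text by hand.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, keyed by label set.
exposition = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "GET"), ("status", "500")): 3,
    },
)
print(exposition, end="")
```

Because counters only ever increase, Prometheus derives request rate and error rate from how quickly these values grow between scrapes.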

What good alerting looks like

Bad alerting is expensive and demoralising. A team that gets paged 20 times a night for alerts that are not actionable loses trust in the monitoring system and starts ignoring pages. Good alerting is precise and rare.

Properties of a useful alert:

  • Actionable: The person on call can do something about it. An alert for “CPU at 60%” is usually not actionable. An alert for “error rate above 5% for 5 minutes” is.
  • Symptom-based, not cause-based: Alert on what users experience (errors, latency, unavailability) rather than individual components. A database CPU spike might be harmless; the resulting query latency increase is what matters to users.
  • With context: The alert notification should include the affected service, what the metric is, what the threshold is, and a link to the relevant dashboard or runbook.
  • Not too sensitive: Brief spikes should not trigger pages. Use time-based thresholds — “above 5% for 5 consecutive minutes” is better than “above 5% for any 1 minute”.
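
The "for 5 consecutive minutes" rule and the context requirement can be sketched as follows. This assumes one error-rate sample per minute; the service name, threshold, and runbook URL are hypothetical placeholders. Monitoring platforms generally provide this kind of duration condition as a built-in setting, so treat this as an illustration of the logic rather than something to deploy.

```python
def sustained_breach(samples, threshold, for_minutes):
    """True if the most recent `for_minutes` one-minute samples all exceed threshold."""
    if len(samples) < for_minutes:
        return False
    return all(value > threshold for value in samples[-for_minutes:])


def format_alert(service, metric, value, threshold, for_minutes, runbook_url):
    """Build a notification that carries the context an on-call engineer needs."""
    return (
        f"[PAGE] {service}: {metric} at {value:.1%}, "
        f"above {threshold:.0%} for {for_minutes} minutes. Runbook: {runbook_url}"
    )


brief_spike = [0.01, 0.09, 0.01, 0.01, 0.01, 0.01]
sustained = [0.01, 0.06, 0.07, 0.08, 0.06, 0.07]

for series in (brief_spike, sustained):
    if sustained_breach(series, threshold=0.05, for_minutes=5):
        print(format_alert(
            "checkout-api", "error rate", series[-1], 0.05, 5,
            "https://runbooks.example.com/checkout-api/error-rate",
        ))
```

Only the second series produces a page: the one-minute spike to 9% in the first series never holds above the threshold for five minutes in a row.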

A common mistake when first setting up alerting: creating alerts for every metric with aggressive thresholds. This creates alert fatigue — too many pages, most of which resolve themselves or are not worth investigating at night. Start with fewer, higher-confidence alerts. Add more as you understand your system’s normal behaviour.

Dashboards: building something people actually use

A dashboard that nobody looks at provides no value. A useful dashboard tells you the health of a system at a glance — someone should be able to open it and immediately know whether things are normal or not.

Principles for dashboards that get used:

  • Put the most important signal at the top. The first row should answer “is this service healthy right now?” — availability, error rate, and latency. Details go further down.
  • Show trends, not just current values. A single number for “current error rate: 2%” is less informative than a graph of the last 24 hours showing that the rate spiked 30 minutes ago and is now trending down.
  • Use percentiles for latency. Average latency hides problems. P95 (the latency that 95% of requests complete at or below) and P99 reveal the tail: the slow requests that frustrate a subset of users.
  • Include context markers. Annotate deployments, incidents, and significant configuration changes on time-series graphs. When a metric changes, it is then immediately obvious whether a corresponding event occurred.
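
A small made-up example shows why averages mislead for latency. It uses statistics.quantiles from the Python standard library (available from Python 3.8); the latency values are invented so that 10% of requests are slow.

```python
import statistics

# Made-up latencies: 90 fast requests and 10 slow ones.
latencies_ms = [100.0] * 90 + [2000.0] * 10

mean = statistics.mean(latencies_ms)
# n=100 returns 99 cut points; cuts[i - 1] is the i-th percentile.
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.0f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

The mean comes out at 290ms, a latency no request actually experienced. The median shows the typical request takes 100ms, and P95 and P99 show that the slowest requests take two seconds.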

On-call basics for cloud engineers

Many cloud engineering roles involve on-call rotation — being the person responsible for responding to production incidents outside business hours. If you have not done on-call before, here is what to know.

The goal is time to restore service, not time to find root cause. When paged at 3am, your priority is getting the service healthy again. Root cause analysis happens after users are unblocked.

Runbooks reduce stress. A runbook is a documented procedure for a known problem: “if alert X fires, do steps A, B, C”. Good teams maintain runbooks for their most common alerts. If you write a runbook the first time you investigate a problem, the resolution is already documented the next time it happens, whether for you or for the next person on call.

A reasonable on-call alert load: Industry experience suggests that engineers should not need to respond to more than a couple of pages per on-call shift to remain effective. More than that and either the system needs reliability work or the alerts need tuning.

Career insight: Engineers who invest in good monitoring and alerting make life significantly better for their whole team. Improving the alert signal-to-noise ratio, adding useful dashboard annotations, and writing runbooks are the kind of work that teammates notice and appreciate — and that creates a reputation as someone who thinks about operational quality.

Getting started: a practical checklist

If you are setting up monitoring for a service for the first time, work through this in order:

  1. Define your SLOs — availability and latency targets for the service
  2. Instrument the application — ensure it emits metrics for request rate, error rate, and latency
  3. Set up a dashboard showing RED metrics and the current SLO status
  4. Create one alert: error rate exceeds your SLO threshold for more than 5 minutes
  5. Test the alert by intentionally causing an error condition
  6. Write a runbook for the alert
  7. Add more alerts gradually as you understand what normal looks like

Starting with one meaningful alert and a clear runbook is better than having 50 noisy alerts with no documentation. Monitoring is a system you build and improve over time.