How to Build a Cloud Monitoring Portfolio Project
Monitoring projects are rare in junior portfolios and disproportionately impressive when they exist. Most engineers build systems without adding observability, so showing that you think about what happens after deployment signals a maturity that stands out. This guide covers how to build a monitoring project that goes beyond a “Hello World” dashboard.
Why a monitoring project signals seniority
Anyone can provision an EC2 instance or write a Lambda function. Fewer people understand how to observe a system once it is running. A monitoring project demonstrates:
- You think about systems in operation, not just at deployment
- You understand the difference between a metric, a log, and a trace
- You can define what “working” means, which has to come before defining “broken”
- You understand on-call realities — alerts that fire too often are as bad as no alerts
This thinking is expected in mid-level and senior cloud and SRE roles. Having evidence of it at junior level is a genuine advantage.
What to build: choose your tool based on your target role
Pick the stack that the roles you are applying for are most likely to use:
| Tool stack | Best for |
|---|---|
| Prometheus + Grafana (self-hosted) | SRE, DevOps, and platform engineering roles |
| CloudWatch (AWS) | AWS-specialist cloud engineering roles |
| Cloud Monitoring + Cloud Logging (GCP) | GCP-specialist cloud engineering roles |
| Datadog or New Relic (managed) | Companies using SaaS observability platforms |
Prometheus and Grafana are the most portable choice and the stack interviewers most commonly expect you to know. Even at companies that use Datadog, you are expected to understand what Prometheus does. If you are not sure which to choose, start with Prometheus and Grafana.
Building with Prometheus and Grafana
What to run Prometheus on
Deploy Prometheus and Grafana either:
- On a single VM (a t2.micro/e2-micro is sufficient for a portfolio project)
- Inside a Kubernetes cluster using the kube-prometheus-stack Helm chart (the better option if you already have a Kubernetes project)
The Kubernetes deployment is more complex, but it demonstrates Kubernetes and observability skills in one project. It is also closer to production practice: kube-prometheus-stack is a widely used way to run Prometheus and Grafana on Kubernetes.
What to monitor
Monitor an application that you have already built (a REST API, a serverless function, or a web service). At minimum, instrument the application to expose:
- Request rate (requests per second by endpoint and HTTP method)
- Error rate (rate of 4xx and 5xx responses)
- Request latency (p50, p95, p99 percentiles)
These three signals (request rate, error rate, and latency) make up the RED method: Rate, Errors, Duration. It is the standard framework for monitoring request-driven services and a common interview topic.
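As a rough illustration of what the three RED signals mean numerically, the sketch below computes them from raw request records for one time window. The `Request` record and `red_summary` function are hypothetical names invented for this example, and the percentiles are calculated directly from samples rather than from histogram buckets as Prometheus would do.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    endpoint: str
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarise one time window of requests as rate, errors, and duration."""
    total = len(requests)
    if total < 2:
        raise ValueError("need at least two requests to estimate percentiles")
    errors = sum(1 for r in requests if r.status >= 400)  # 4xx and 5xx responses
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles([r.duration_s for r in requests], n=100, method="inclusive")
    return {
        "rate_per_s": total / window_s,   # Rate
        "error_ratio": errors / total,    # Errors
        "p50_s": cuts[49],                # Duration
        "p95_s": cuts[94],
        "p99_s": cuts[98],
    }


if __name__ == "__main__":
    sample = [Request("/items", 500 if i < 5 else 200, 0.01 * (i + 1)) for i in range(100)]
    print(red_summary(sample, window_s=50.0))
```

In a Prometheus setup the same numbers would normally come from a request counter and a latency histogram queried with `rate()` and `histogram_quantile()`, so treat this only as a way to check your understanding of the definitions.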
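Additionally, collect system-level metrics such as CPU usage, memory usage, and disk I/O. On VMs these come from node_exporter. On Kubernetes, node_exporter covers the nodes and the kubelet's built-in cAdvisor endpoint covers containers, while kube-state-metrics adds object state such as pod restarts and deployment replica counts.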
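node_exporter reports CPU time as cumulative counters (`node_cpu_seconds_total`, split by CPU and mode), so "CPU usage" is something you derive rather than read directly. The sketch below shows the arithmetic under the assumption that you have two readings of the idle counter summed across all CPUs; the function name and parameters are hypothetical.

```python
def cpu_utilisation(idle_start_s: float, idle_end_s: float,
                    elapsed_s: float, cpu_count: int) -> float:
    """Return the fraction of CPU capacity in use between two counter readings.

    idle_start_s / idle_end_s: idle CPU seconds summed across all CPUs
    elapsed_s: wall-clock seconds between the two readings
    cpu_count: number of CPUs contributing to the counter
    """
    if elapsed_s <= 0 or cpu_count <= 0:
        raise ValueError("elapsed_s and cpu_count must be positive")
    # Total CPU seconds available in the window is elapsed time times CPU count.
    idle_fraction = (idle_end_s - idle_start_s) / (elapsed_s * cpu_count)
    return 1.0 - idle_fraction


if __name__ == "__main__":
    # Two CPUs, 60 seconds apart, 90 idle CPU-seconds accumulated in between.
    print(cpu_utilisation(1000.0, 1090.0, elapsed_s=60.0, cpu_count=2))
```

In PromQL the equivalent is usually written with `rate()` over the idle-mode series, which also handles counter resets that this sketch ignores.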
Building dashboards
Create at least two Grafana dashboards:
- An application dashboard: request rate, error rate, and latency panels using the RED method
- A system dashboard: CPU, memory, and disk panels for the infrastructure running the application
Define dashboards as code using Grafana’s JSON export or a tool like Grafonnet or Terraform’s Grafana provider. Do not rely on dashboards configured manually through the UI — they cannot be reproduced from the repository.
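One lightweight way to keep a dashboard in the repository is to generate its JSON from a small script. The sketch below is an assumption-heavy example: the field names follow Grafana's dashboard JSON model as commonly exported, but the exact shape (including fields such as `schemaVersion`, omitted here) depends on your Grafana version, so compare it against an export from your own instance. The metric names, the `app-red` uid, and the helper functions are hypothetical.

```python
import json


def timeseries_panel(panel_id: int, title: str, expr: str, y: int) -> dict:
    """Build one full-width time series panel holding a single PromQL query."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": y},
        "targets": [{"refId": "A", "expr": expr}],
    }


def application_dashboard() -> dict:
    """Assemble a RED dashboard; metric names are placeholders for your own."""
    queries = [
        ("Request rate", "sum by (endpoint) (rate(http_requests_total[5m]))"),
        ("Error ratio",
         'sum(rate(http_requests_total{status=~"5.."}[5m]))'
         " / sum(rate(http_requests_total[5m]))"),
        ("p95 latency",
         "histogram_quantile(0.95, sum by (le) "
         "(rate(http_request_duration_seconds_bucket[5m])))"),
    ]
    panels = [
        timeseries_panel(i + 1, title, expr, y=i * 8)  # stack panels vertically
        for i, (title, expr) in enumerate(queries)
    ]
    return {"uid": "app-red", "title": "Application (RED)", "panels": panels}


if __name__ == "__main__":
    # Write the output to a file and commit it next to the rest of the stack.
    print(json.dumps(application_dashboard(), indent=2))
```

Generating the JSON keeps panel layout and queries reviewable in pull requests; Grafonnet or the Terraform provider serve the same purpose with more structure.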
Writing real alerting rules
The alerting section is what most monitoring portfolio projects get wrong. Generic rules like “alert when CPU exceeds 90% for 5 minutes” are better than nothing, but they show no understanding of what the alert means in context.
A good alerting rule has three parts:
- Condition — the metric expression and threshold
- Duration — how long the condition must be true before alerting (avoids flapping)
- Severity and runbook — how urgent is this? What should a human do?
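To make the role of the duration concrete, here is a minimal sketch of how a rule with a threshold and a hold time moves between states. It borrows the inactive, pending, and firing state names that Prometheus uses, but it is a simplified model written for this guide, not Prometheus's evaluation logic; `AlertRule` and `alert_state` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float    # condition: breach when the value exceeds this
    for_seconds: float  # duration: how long the breach must last before firing
    severity: str       # how urgent the alert is
    runbook: str        # what a human should do when it fires


def alert_state(rule: AlertRule, samples: list[tuple[float, float]]) -> str:
    """Return 'inactive', 'pending', or 'firing' for time-ordered (timestamp_s, value) samples."""
    breach_started = None
    for timestamp_s, value in samples:
        if value > rule.threshold:
            if breach_started is None:
                breach_started = timestamp_s  # first sample of the current breach
        else:
            breach_started = None             # condition cleared, so reset the timer
    if breach_started is None:
        return "inactive"
    held_for = samples[-1][0] - breach_started
    return "firing" if held_for >= rule.for_seconds else "pending"


if __name__ == "__main__":
    rule = AlertRule("HighErrorRatio", threshold=0.05, for_seconds=300,
                     severity="page", runbook="Check the latest deploy and roll back if needed.")
    brief_spike = [(0, 0.01), (60, 0.08), (120, 0.02), (180, 0.07), (240, 0.07)]
    print(alert_state(rule, brief_spike))
```

A short spike that clears resets the timer, which is why a sensible duration suppresses flapping without hiding sustained problems.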
Define at least three alerting rules. For each one, document:
- Why you chose that threshold (5% error rate rather than 1% or 20%?)
- Why you chose that duration window (5 minutes rather than 1 minute?)
- What action the alert requires from an on-call engineer
This runbook documentation is the part that will actually impress an SRE or senior engineer reviewing your portfolio. It shows you understand that alerts are not just technical — they require a human response, and that response should be defined in advance.
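One way to keep that documentation from drifting away from the rules is to store it on the rule itself and refuse to build a rule without it. The sketch below assembles a structure with the `alert`, `expr`, `for`, `labels`, and `annotations` fields used in Prometheus rule files; the annotation keys (`threshold_rationale`, `duration_rationale`, `runbook`), the metric names, and the helper functions are assumptions made for this example.

```python
import json

# Hypothetical annotation keys that every rule in this project must document.
REQUIRED_ANNOTATIONS = ("threshold_rationale", "duration_rationale", "runbook")


def build_rule(name: str, expr: str, duration: str, severity: str,
               annotations: dict[str, str]) -> dict:
    """Return one alerting rule, rejecting it if any required documentation is missing."""
    missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
    if missing:
        raise ValueError(f"{name}: missing documentation for {', '.join(missing)}")
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": dict(annotations),
    }


def build_rule_file(group_name: str, rules: list[dict]) -> dict:
    """Wrap rules in the groups/rules layout used by Prometheus rule files."""
    return {"groups": [{"name": group_name, "rules": rules}]}


if __name__ == "__main__":
    rule = build_rule(
        name="HighErrorRatio",
        expr='sum(rate(http_requests_total{status=~"5.."}[5m]))'
             " / sum(rate(http_requests_total[5m])) > 0.05",
        duration="5m",
        severity="page",
        annotations={
            "threshold_rationale": "Normal traffic stays under 1%; 5% means real user impact.",
            "duration_rationale": "Deploys cause error spikes of a minute or two.",
            "runbook": "Check the most recent deploy and roll back if errors began after it.",
        },
    )
    print(json.dumps(build_rule_file("application", [rule]), indent=2))
```

JSON is printed here only to keep the sketch free of third-party dependencies; in a real repository you would more likely serialise the same structure with a YAML library or write the YAML by hand.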
Adding centralised logging
Metrics tell you something is wrong. Logs tell you why. Add centralised logging to make this a fuller observability project:
- With Prometheus/Grafana: add Loki for log aggregation and configure it as a Grafana data source
- With AWS: ship application logs to CloudWatch Logs and create a metric filter that counts error log lines
- With GCP: use Cloud Logging with a log-based metric and alert
At minimum, configure structured logging in your application (JSON output rather than plain text lines) and create one log-based metric or search query that surfaces error patterns.
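A minimal sketch of structured logging using only Python's standard `logging` module is shown below. The `JsonFormatter` class, the field names (`endpoint`, `status`, `duration_ms`), and the `count_error_lines` helper are assumptions chosen for this example; the helper stands in for what a log-based metric or a Loki query would count.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    # Hypothetical request fields that callers may pass through `extra=`.
    EXTRA_FIELDS = ("endpoint", "status", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def count_error_lines(lines: list[str]) -> int:
    """Count JSON log lines at ERROR level, as a log-based metric would."""
    return sum(1 for line in lines if json.loads(line).get("level") == "ERROR")


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("api")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("request served", extra={"endpoint": "/items", "status": 200, "duration_ms": 12})
    logger.error("upstream timeout", extra={"endpoint": "/items", "status": 502})
```

Because every line is a JSON object, a log pipeline can filter on `level` or `status` as fields instead of pattern-matching free text.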
Everything as code
Your monitoring setup should be deployable from the repository. This means:
- Prometheus configuration files in version control
- Alert rules defined as code (Prometheus alerting rules in YAML)
- Grafana dashboards exported as JSON and committed to the repository
- The entire stack provisioned with Terraform or a Helm chart
A monitoring setup that requires clicking through a Grafana UI to recreate is not portfolio-ready.
What to document in the README
- Why you chose Prometheus/Grafana (or CloudWatch, or another tool) for this project
- The three alerting rules and the reasoning behind each threshold and duration
- The runbook action for each alert — what a human should do when it fires
- How to deploy the full stack from scratch using the repository
- What you would add in a production environment: longer metric retention, Alertmanager routing to PagerDuty or Slack, distributed tracing with OpenTelemetry, SLO-based alerting
For guidance on combining this with a Kubernetes project, see how to build a Kubernetes portfolio project.
Summary
- Monitoring projects are rare in junior portfolios — having one is a genuine differentiator for SRE and senior cloud roles
- The RED method (rate, errors, duration) is the standard framework for monitoring request-driven services and a common interview topic
- Alerting rules need documented thresholds (why that number?), duration windows (why that long?), and runbook actions
- Define dashboards and alert rules as code — a monitoring setup that cannot be reproduced from the repository is not complete
- Adding centralised logging alongside metrics makes this a full observability project, not just a metrics project
- Deploy on Kubernetes using kube-prometheus-stack if you want to combine Kubernetes and observability in one build