Cloud Cost Optimisation: A Practical Engineering Skill
Cloud costs do not spiral because engineers are careless. They spiral because the default state of cloud infrastructure is “leave it running, figure out billing later.” Cost optimisation is an engineering discipline — and treating it that way is what separates good cloud engineers from expensive ones.
Why cloud costs spiral in the first place
Understanding the failure modes helps you avoid them. Cloud cost overruns usually come from a handful of predictable patterns, not rare accidents.
Overprovisioning
A developer asks for a production-sized database instance in a dev environment. Nobody pushes back. That instance runs 24/7 for six months before anyone notices. Multiply this pattern across ten teams and you have a significant bill for infrastructure nobody is actively using.
Overprovisioning happens because it feels safer to over-spec than under-spec. If the service falls over because the instance is too small, that is visible and embarrassing. If the instance is too large and wastes money, nobody notices immediately. The incentives are misaligned.
Forgotten resources
A load balancer gets created during testing. The test passes, and the service gets deployed through a different route. The load balancer keeps running. A Kubernetes cluster gets spun up for a proof of concept. The PoC is abandoned. The cluster keeps running. At $0.10 per hour per node, a forgotten three-node cluster costs about $2,600 per year.
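The arithmetic behind the cluster figure is worth making explicit, because the same calculation applies to any always-on resource. A minimal sketch using the $0.10 per node-hour rate from the example; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost(hourly_rate_per_node: float, node_count: int) -> float:
    """Cost of leaving node_count nodes running for a full year."""
    return hourly_rate_per_node * node_count * HOURS_PER_YEAR


# The forgotten three-node cluster from the text.
forgotten_cluster = annual_cost(hourly_rate_per_node=0.10, node_count=3)
print(f"${forgotten_cluster:,.0f} per year")
```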
Dev environments left running
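Development and staging environments often run continuously even when engineers are not working. Scheduling these environments to shut down overnight and at weekends can cut their running hours by roughly 65–70% with minimal impact on productivity: a 10 to 12 hour weekday schedule covers only 50 to 60 of the 168 hours in a week. The saving on the bill is usually somewhat lower, because disks and some managed services keep charging while compute is stopped. This is still one of the highest-return cost actions available.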
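As a rough illustration of where the saving comes from, the sketch below decides whether an environment should be up at a given moment and works out the share of weekly hours a schedule removes. The 07:00 to 19:00 weekday window and the function names are assumptions for the example; a real scheduler would also need to handle time zones and public holidays.

```python
from datetime import datetime

HOURS_PER_WEEK = 24 * 7  # 168


def should_be_running(now: datetime, start_hour: int = 7, end_hour: int = 19) -> bool:
    """True on weekdays inside the working window, False otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


def off_hours_fraction(start_hour: int = 7, end_hour: int = 19) -> float:
    """Share of the week during which the environment is shut down."""
    running_hours = (end_hour - start_hour) * 5  # five working days
    return 1 - running_hours / HOURS_PER_WEEK
```

With the assumed 12-hour window the environment is off for about 64% of the week; a 10-hour window raises that to about 70%.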
Storage accumulation
Old snapshots, unused disk volumes, and log files retained forever all add up. Storage costs grow quietly and continuously. Many teams have terabytes of snapshots retained “just in case” for services that were decommissioned years ago.
Right-sizing: matching resources to actual usage
Right-sizing means choosing the instance or service tier that matches what the workload actually needs, not what it might theoretically need at peak load.
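The process is straightforward: look at actual CPU, memory, and network utilisation over the past 30 days. If a VM is running at 8% CPU average with peaks of 25%, it is probably overprovisioned by at least one size class and possibly two, since each step down typically halves the vCPU count and roughly doubles utilisation. Check memory and network headroom before downsizing, because CPU is not always the constraint. Most cloud providers offer a right-sizing recommendations tool that does this analysis automatically.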
Tools to know:
- AWS Compute Optimizer — analyses EC2, Lambda, and ECS Fargate usage and recommends right-sized alternatives
- GCP Recommender — provides machine type and committed use discount recommendations
- Azure Advisor — cost, security, and performance recommendations including VM right-sizing
A practical starting point: identify your ten most expensive instances, check their average utilisation, and right-size the ones running below 30% CPU. This exercise often uncovers 20–40% savings on compute spend.
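The starting-point exercise can be sketched in a few lines once cost and utilisation data have been exported. The `InstanceUsage` record and its fields are hypothetical; in practice the numbers would come from your provider's billing export and monitoring service.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class InstanceUsage:
    name: str
    monthly_cost: float       # in your billing currency
    cpu_samples: List[float]  # CPU utilisation in percent over the review window


def rightsizing_candidates(
    instances: List[InstanceUsage], top_n: int = 10, cpu_threshold: float = 30.0
) -> List[str]:
    """Among the top_n most expensive instances, return those averaging below the CPU threshold."""
    most_expensive = sorted(instances, key=lambda i: i.monthly_cost, reverse=True)[:top_n]
    return [i.name for i in most_expensive if mean(i.cpu_samples) < cpu_threshold]
```

This only looks at average CPU, so treat the result as a list of candidates to investigate rather than a list of changes to make.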
Pricing models: on-demand, reserved, and spot
The same compute capacity can cost very different amounts depending on which pricing model you use. Understanding the trade-offs is a core cost engineering skill.
| Model | Cost | Commitment | Best for |
|---|---|---|---|
| On-demand | Baseline (100%) | None | Unpredictable or short-lived workloads |
| Reserved / Committed use | 30–60% cheaper | 1 or 3 years | Stable, always-on workloads |
| Spot / Preemptible | 70–90% cheaper | None (interruptible) | Batch jobs, CI/CD, stateless workloads |
| Savings Plans (AWS) | 20–66% cheaper | 1 or 3 years (flexible) | Mix of instance types or services |
Reserved instances and committed use discounts
If you have a workload that runs continuously and is unlikely to change significantly, reserving capacity for one or three years is almost always the right call. The discount is substantial. The risk is that you commit to capacity you no longer need — which is why it is important to right-size before reserving.
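One way to reason about the commitment risk: a reservation is billed for every hour whether or not the instance runs, so it only pays off when the workload runs for more than (1 - discount) of the time. The sketch below uses illustrative numbers; the 40% discount, the $0.10 hourly rate in the usage note, and the 730-hour month are assumptions, and real offers vary by provider, term, and payment option.

```python
HOURS_IN_MONTH = 730  # common approximation: 8,760 hours / 12


def breakeven_utilisation(reserved_discount: float) -> float:
    """Fraction of the month a workload must run before reserving beats on-demand."""
    return 1.0 - reserved_discount


def cheaper_option(hours_needed: float, on_demand_rate: float, reserved_discount: float) -> str:
    """Compare one month of on-demand usage against a reservation billed for every hour."""
    on_demand_cost = on_demand_rate * hours_needed
    reserved_cost = on_demand_rate * (1.0 - reserved_discount) * HOURS_IN_MONTH
    return "reserved" if reserved_cost < on_demand_cost else "on-demand"
```

For example, `cheaper_option(730, 0.10, 0.40)` favours the reservation, while a workload that only needs 200 hours a month at the same rate is cheaper on-demand.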
Spot and preemptible instances
Spot instances can be reclaimed by the cloud provider at short notice: two minutes on AWS, and around 30 seconds on GCP and Azure. For workloads that can tolerate interruption, such as batch processing, data pipelines, CI/CD runners, and machine learning training jobs, spot instances offer dramatic savings. Many teams run their test suites on spot instances and save around 80% compared to on-demand pricing.
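What "tolerating interruption" means in practice is that a job can be killed at any point and resume without redoing finished work. The sketch below shows one simple way to get that property with a checkpoint file; the file format and function names are assumptions for illustration, and a real job on spot capacity would keep the checkpoint on durable storage rather than the instance's local disk.

```python
import json
import os
from pathlib import Path
from typing import Callable, Sequence


def process_items(items: Sequence, checkpoint_path: Path, handle: Callable) -> int:
    """Process items in order, recording progress so a restart resumes where it stopped.

    Returns the number of items handled during this run.
    """
    done = 0
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())["done"]

    for index in range(done, len(items)):
        handle(items[index])
        # Write to a temporary file and rename, so an interruption mid-write
        # cannot leave a corrupt checkpoint behind.
        tmp_path = checkpoint_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps({"done": index + 1}))
        os.replace(tmp_path, checkpoint_path)

    return len(items) - done
```

An item that was being handled when the instance was reclaimed is handled again on restart, so `handle` should be safe to repeat.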
Storage tiering
Not all data needs to be stored in the same way. Cloud storage services offer multiple tiers with different cost and access characteristics.
- Standard / Hot tier — highest cost, immediate access, for data read frequently
- Nearline / Cool tier — lower cost, slight access latency penalty, for data accessed monthly
- Coldline / Archive tier — very low cost, higher retrieval cost and latency, for data accessed rarely
The typical mistake: storing application logs, database backups, and old snapshots in the standard tier because nobody thought about it. Moving historical logs to coldline or archive can reduce storage costs by 80–90% on that data with no practical impact on operations.
Most cloud providers support lifecycle policies that automatically transition objects between tiers based on age. A practical default: move objects older than 30 days to nearline, move objects older than 90 days to coldline, and delete objects older than 365 days (for log data that is no longer operationally useful).
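The default rule set can be written down as a small function, which is handy for estimating how much existing data would move before turning a policy on. This is a plain-Python illustration of the rule, not any provider's lifecycle configuration format; the tier names follow the list above.

```python
def lifecycle_action(age_days: int) -> str:
    """Return where an object of the given age should live under the default policy."""
    if age_days > 365:
        return "delete"
    if age_days > 90:
        return "coldline"
    if age_days > 30:
        return "nearline"
    return "standard"
```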
Tagging strategy for cost attribution
You cannot optimise what you cannot see. Tagging every cloud resource with consistent metadata is the foundation of cost attribution — knowing which team, project, or service is responsible for which spend.
A practical minimum tagging schema:

    # Example resource tags
    environment: production      # production, staging, development
    team: platform               # which team owns this resource
    project: payments-service    # which project or product it supports
    cost-centre: eng-platform    # for billing allocation to departments
    managed-by: terraform        # how the resource is provisioned

Without consistent tags, your cost report shows that $50,000 was spent on EC2 this month but you cannot tell whether it was the data team’s Spark cluster, the platform team’s Kubernetes nodes, or a mix. With good tags, you can break that $50,000 down by team, project, and environment in a few clicks.
A common mistake: tagging as an afterthought on existing infrastructure. The right approach is to enforce tags at the policy level — use AWS Service Control Policies, GCP Organisation Policies, or Azure Policy to require tags before resources can be created. Retrofitting tags to hundreds of existing resources is painful; preventing untagged resources from being created is much easier.
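Alongside policy-level enforcement, the minimum schema can be checked earlier, for example in a CI step that inspects planned resources. The sketch below is an assumed illustration of such a check, not a replacement for the provider policies named above; the tag keys and allowed environment values come from the example schema, and the function name is hypothetical.

```python
from typing import Dict, List

REQUIRED_TAGS = ("environment", "team", "project", "cost-centre", "managed-by")
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def tag_problems(tags: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the tags pass."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {environment}")
    return problems
```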
Setting up budgets and alerts
Budget alerts do not reduce costs by themselves. They make cost surprises visible early, which gives you time to investigate and act before the bill arrives.
A practical alert setup:
- Alert at 50% of monthly budget — normal, nothing to act on yet, just awareness
- Alert at 80% — review what is driving spend, check for anomalies
- Alert at 100% — active investigation and possible action required
- Alert at 150% — escalate, something has gone wrong (runaway workload, forgotten resource, mistake)
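The four thresholds map naturally onto a lookup from spend ratio to response. A minimal sketch; the level names are shorthand for the actions in the list above, and in practice the provider's budget service would evaluate the thresholds and send the notifications.

```python
from typing import Optional

# Highest threshold first, so the first match is the most severe level reached.
ALERT_LEVELS = [
    (1.50, "escalate"),
    (1.00, "investigate"),
    (0.80, "review"),
    (0.50, "awareness"),
]


def alert_level(month_to_date_spend: float, monthly_budget: float) -> Optional[str]:
    """Return the most severe alert level reached, or None below 50% of budget."""
    ratio = month_to_date_spend / monthly_budget
    for threshold, level in ALERT_LEVELS:
        if ratio >= threshold:
            return level
    return None
```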
Cost anomaly detection is worth setting up separately. AWS Cost Anomaly Detection, GCP Budget Alerts with forecasting, and Azure Cost Alerts can detect unusual spending patterns that would not trigger a threshold-based alert. For example: a workload that normally costs $200/day suddenly costing $1,400/day would trigger an anomaly alert even if you have not hit your monthly budget yet.
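To see why the two kinds of alert complement each other, the sketch below uses a deliberately naive anomaly check: today's spend is compared against a multiple of the recent daily average. This is a stand-in for illustration only, and the managed services named above use their own detection methods. The factor of 3 and the $6,000 monthly budget are assumptions.

```python
from statistics import mean
from typing import Sequence


def is_spend_anomalous(recent_daily_spend: Sequence[float], today: float, factor: float = 3.0) -> bool:
    """Flag today's spend if it exceeds factor times the recent daily average."""
    baseline = mean(recent_daily_spend)
    return today > baseline * factor


def budget_fraction_used(month_to_date: float, monthly_budget: float) -> float:
    """Share of the monthly budget consumed so far."""
    return month_to_date / monthly_budget


# Four normal days at $200, then the $1,400 day from the text.
history = [200.0, 200.0, 200.0, 200.0]
today = 1400.0
monthly_budget = 6000.0  # assumed: roughly 30 days at $200

anomaly = is_spend_anomalous(history, today)
used = budget_fraction_used(sum(history) + today, monthly_budget)
```

Here the anomaly check fires on day five while only about 37% of the budget has been used, so even the 50% threshold alert would still be silent.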
Cost optimisation at junior vs senior level
What cost awareness looks like in practice changes significantly as you gain experience.
At junior level
You are not expected to own the cost strategy, but you are expected to be conscious. This means: not spinning up large instances without asking, shutting down dev environments when not in use, flagging anything that looks unexpectedly expensive when you notice it, and following the team’s tagging conventions when provisioning resources.
At mid level
You start contributing to cost reviews, identifying right-sizing opportunities, and implementing scheduled shutdowns for dev environments. You can make the case for reserved capacity on stable workloads. You understand the pricing models well enough to choose the right one for new workloads.
At senior level
You own the cost posture for your infrastructure domain. You set tagging policies, run quarterly cost reviews, build the business case for reserved capacity purchases, and design systems with cost efficiency in mind from the start. You can articulate the cost implications of architectural decisions before they are built.
A framework for reviewing costs
A monthly cost review does not need to be complex. A simple structure that works:
- Total spend this month vs last month vs same month last year — is cost growing, flat, or shrinking? Is growth proportional to business growth?
- Top 10 most expensive resources — what are they? Are they all expected? Are any surprising?
- Right-sizing recommendations — any flagged by the cloud provider’s advisor tools?
- Idle or stopped resources still incurring cost — stopped VMs often still charge for attached disks. Unattached disks accumulate quietly.
- Untagged resources — anything without tags cannot be attributed. Treat untagged resources as a problem to fix.
- Spot/reserved coverage — what percentage of eligible compute is on reserved or spot pricing?
Doing this consistently once a month for six months builds the institutional knowledge to spot anomalies quickly and understand what “normal” spend looks like for your environment.
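Several items in the review can be computed from a single billing export. The sketch below assumes each resource is a dictionary with `name`, `cost`, and `tags` keys, which is a hypothetical shape chosen for the example; real exports differ by provider and need mapping into something like it first.

```python
from collections import defaultdict
from typing import Dict, List


def monthly_review(resources: List[dict], previous_total: float, top_n: int = 10) -> Dict:
    """Summarise total spend, month-over-month change, top resources, and untagged spend."""
    total = sum(r["cost"] for r in resources)
    spend_by_team = defaultdict(float)
    untagged_spend = 0.0
    for r in resources:
        team = r.get("tags", {}).get("team")
        if team:
            spend_by_team[team] += r["cost"]
        else:
            untagged_spend += r["cost"]  # cannot be attributed to anyone

    top = sorted(resources, key=lambda r: r["cost"], reverse=True)[:top_n]
    return {
        "total": total,
        "change_vs_last_month": (total - previous_total) / previous_total,
        "top_resources": [r["name"] for r in top],
        "spend_by_team": dict(spend_by_team),
        "untagged_spend": untagged_spend,
    }
```

Right-sizing recommendations and reserved or spot coverage come from the provider's own tools and are left out of this sketch.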
Realistic savings patterns
When teams get serious about cost optimisation for the first time, the savings can be significant. Some common patterns seen when starting from scratch:
- Scheduling dev/staging environments off overnight and weekends: 40–65% reduction on those environments (running hours fall further, but storage and always-on services keep billing)
- Converting always-on workloads from on-demand to reserved: 30–55% reduction on those instances
- Right-sizing overprovisioned instances: 20–40% reduction on compute costs
- Storage lifecycle policies on log data: 70–85% reduction on storage costs for log archives
- Deleting forgotten resources (old snapshots, unused volumes): one-time reduction of 5–15% of total bill
Stacking several of these together is realistic for most teams that have never focused on cost. A 30–50% total cloud cost reduction in the first year of active optimisation is not unusual — and it requires no architecture changes, only operational discipline.
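Each percentage above applies only to its own slice of the bill, so the overall reduction depends on how spend is distributed. The sketch below works through one hypothetical $100,000 monthly bill; the split between categories and the reduction picked for each are assumptions chosen from within the ranges listed, not measured data.

```python
# category: (monthly cost, assumed reduction on that category)
bill = {
    "dev_staging_compute": (20_000, 0.50),  # scheduled shutdowns
    "always_on_compute": (40_000, 0.40),    # reserved pricing
    "log_storage": (10_000, 0.75),          # lifecycle policies
    "everything_else": (30_000, 0.00),      # left untouched
}

total_before = sum(cost for cost, _ in bill.values())
total_saved = sum(cost * reduction for cost, reduction in bill.values())
overall_reduction = total_saved / total_before

print(f"Saved ${total_saved:,.0f} of ${total_before:,.0f} ({overall_reduction:.1%})")
```

Under these assumptions the overall reduction is 33.5%, with 30% of the bill not touched at all.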
Summary
- Cloud costs spiral from overprovisioning, forgotten resources, and dev environments left running — all preventable with operational discipline
- Right-sizing, reserved/committed use pricing, and spot instances are among the highest-leverage cost controls available to cloud engineers
- Tagging every resource consistently is the foundation of cost attribution — enforce tags at the policy level, not as an afterthought
- Budget alerts and anomaly detection give you early warning before surprises arrive on the bill
- Monthly cost reviews using a consistent framework build the knowledge to keep costs under control as infrastructure grows