Debugging Cloud Systems: A Practical Framework
Debugging cloud systems is a skill that separates good engineers from great ones. It is not about knowing every service — it is about having a systematic approach that works regardless of what is broken. This page gives you that framework, explains how to isolate problems across layers, and walks through a realistic debugging scenario end to end.
The debugging mindset
The biggest mistake in debugging is acting on guesses. An experienced engineer and a junior one often face the same unknown problem — the difference is the experienced engineer tests hypotheses before acting on them.
The three principles that matter most:
- Read the error before doing anything else. The error message is the most direct signal you have. Most people skim it, miss the key detail, and start trying random fixes. Slow down and read the full error.
- Change one thing at a time. If you change two things at once, you do not know which one fixed it (or broke it further). Make a change, observe the result, draw a conclusion.
- Work backwards from the symptom. The symptom is where you start, not where you finish. A user cannot connect to your API — that symptom might have ten possible causes. Start at the symptom and work inward until you find where the chain breaks.
The layers of a cloud system
Cloud systems have predictable layers. When something breaks, the problem lives in one (or sometimes two) of these layers. Identifying which layer saves you from looking in the wrong place.
| Layer | What it covers | Symptoms when wrong |
|---|---|---|
| Application | Code bugs, exceptions, logic errors | 500 errors, unexpected output, crashes |
| Configuration | Env vars, feature flags, app config | Wrong behaviour, auth failures, missing settings |
| Container / Runtime | Dockerfile, dependencies, startup | Container exits, CrashLoopBackOff, import errors |
| Compute | VM size, memory, CPU limits | OOMKilled, slow response, CPU throttling |
| Network | Security groups, DNS, routing | Connection refused, timeout, name not found |
| IAM / Permissions | Service account, policies, roles | 403 Forbidden, access denied |
| Platform | Cloud service limits, region issues | Service unavailable, quota exceeded |
The most common layers for production incidents are application (bugs), configuration (wrong environment variables), network (security groups/DNS), and IAM (missing permissions). Platform-level issues are rare but happen.
Starting at the error message
The error message tells you which layer to look at. Here is how common error patterns map to layers:
- “Connection refused” or timeout: network layer. The service is not listening, the firewall is blocking, or you are connecting to the wrong address.
- “Permission denied” or “403 Forbidden”: IAM layer. The service or user does not have permission to do what it is trying to do.
- “Name or service not known”: network layer (DNS). The hostname does not resolve. Check DNS configuration and whether the service actually has a DNS record.
- “No space left on device”: compute layer. The disk is full. Check disk usage on the instance or container.
- “OOMKilled”: compute layer. The container or process exceeded its memory limit.
- Unhandled exception with stack trace: application layer. Find the frame where the error was raised; the rest of the trace is the call chain that led there. Which end that frame sits at depends on the language: Python prints the most recent call last (at the bottom), while Java and many other runtimes print it first (at the top).
When an error message is not immediately clear, search for the exact error text. Cloud service error messages are documented: an AWS API response carrying the error code AccessDenied, for example, leads to the access-denied troubleshooting pages of the IAM documentation.
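The mapping from error text to layer can be written down as a small lookup. The sketch below is only an illustration: `classify_error` is a hypothetical helper, the patterns are simple substring matches, and the result is a first guess about where to look rather than a diagnosis (a "permission denied" can also come from file permissions, for instance).

```shell
# Sketch: map an error message to the layer to inspect first.
# classify_error is a hypothetical helper; patterns mirror the list above.
classify_error() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"name or service not known"*)
      echo "dns" ;;
    *"connection refused"*|*"timed out"*|*"timeout"*)
      echo "network" ;;
    *"permission denied"*|*"403"*|*"access denied"*|*"accessdenied"*)
      echo "iam" ;;
    *"no space left on device"*|*"oomkilled"*)
      echo "compute" ;;
    *"traceback"*|*"exception"*)
      echo "application" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Calling `classify_error "$(tail -n 1 app.log)"` (with `app.log` standing in for whatever log you are reading) prints one of the layer names, or `unknown` when nothing matches.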
Isolation techniques
Once you have identified the likely layer, you isolate the problem by testing the smallest possible thing that could fail.
For network problems: Use curl or wget from inside the failing service’s environment (not from your laptop). If you can exec into a container or SSH into a VM, test connectivity from there:
# Can the service reach the database?
curl -v telnet://db.internal:5432
# Does DNS resolve the hostname?
dig db.internal
# Is the port open?
nc -zv db.internal 5432 # nc = netcat
For IAM problems: Read the full IAM error. It usually tells you exactly what action was denied on what resource. Check the policy attached to the role or service account and look for the specific action (s3:GetObject, ec2:DescribeInstances) on the specific resource ARN.
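As a sketch of what "read the full IAM error" means in practice, the helper below pulls the denied action and resource out of a message. It assumes the AWS-style wording "is not authorized to perform: ACTION on resource: ARN"; the exact wording varies by service and provider, and `parse_denied` is a hypothetical name.

```shell
# Sketch: extract the denied action and resource from an AWS-style
# "is not authorized to perform" message. parse_denied is hypothetical,
# and the message format is an assumption that will not fit every service.
parse_denied() {
  local msg="$1" action resource
  # Capture the token after "not authorized to perform: ".
  action=$(printf '%s\n' "$msg" |
    sed -nE 's/.*not authorized to perform: ([^ ]+).*/\1/p')
  # Capture the token after "on resource: ".
  resource=$(printf '%s\n' "$msg" |
    sed -nE 's/.*on resource: ([^ ]+).*/\1/p')
  printf 'action=%s resource=%s\n' "${action:-unknown}" "${resource:-unknown}"
}
```

The two values it prints are the ones to search for in the policy attached to the role: the action must be allowed, and the resource ARN must be covered by the statement that allows it.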
For application problems: Reproduce the problem locally if possible. Add more logging temporarily if the existing logs are not enough. Use a minimal test case — strip away everything except the code path that is failing.
Checking service status and recent changes
Two questions to ask early in a debugging session, especially for incidents:
Is there a known cloud service issue? Check the status page for your cloud provider (status.aws.amazon.com, status.cloud.google.com, azurestatus.microsoft.com). If a service you depend on is degraded, the problem is probably not yours to fix. Subscribe to these status pages for the services you rely on.
What changed recently? The most common cause of a production incident is a recent change. Check deployment logs, Terraform apply history, recent merged PRs, and recent configuration changes. The pattern “it worked yesterday, now it doesn’t” almost always has a change at the root.
# Which revisions of this deployment have been rolled out? (shows revision numbers and change causes, not timestamps)
kubectl rollout history deployment/my-app -n production
# What was the last apply in Terraform?
# (Check your CI/CD logs or Terraform Cloud run history)
# What changed in the last 24 hours?
git log --oneline --since="24 hours ago" --all
A full debugging walkthrough
Here is a realistic scenario worked through end to end.
Scenario: A new version of a web application was deployed 30 minutes ago. Users are now getting 502 errors from the load balancer. On-call engineer gets paged.
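Step 1: Read the error. 502 Bad Gateway means the load balancer could not get a valid response from the backend: the connection was refused or reset, or the response was malformed. (A backend that is merely slow usually produces 504 Gateway Timeout instead.) The problem is likely in the application tier or between the load balancer and the application.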
Step 2: Check if the deployment was recent. Yes — a new version deployed 30 minutes ago. This is probably the cause.
Step 3: Check pod status.
kubectl get pods -n production
# Output:
# NAME                     READY   STATUS             RESTARTS   AGE
# my-app-5d8c7f9b4-abc12   0/1     CrashLoopBackOff   5          28m
# my-app-5d8c7f9b4-xyz34   0/1     CrashLoopBackOff   5          28m
All pods are in CrashLoopBackOff. The application is not running, so the load balancer has no healthy backend to route to, hence the 502.
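With two pods the problem is obvious at a glance; in a larger namespace it helps to filter the table. The following is a minimal sketch, assuming the default column layout of `kubectl get pods` (NAME, READY, STATUS first); `unhealthy_pods` is a hypothetical helper.

```shell
# Sketch: read `kubectl get pods` table output on stdin and print
# "<name> <status> <ready>" for every pod that is not fully ready.
# unhealthy_pods is a hypothetical helper; it assumes the default columns.
unhealthy_pods() {
  awk 'NR > 1 {
    if ($3 == "Completed") next      # finished jobs are not a problem
    split($2, ready, "/")            # READY looks like "0/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}

# Usage (assumes kubectl is configured for the cluster):
#   kubectl get pods -n production | unhealthy_pods
```

It also reports pods that are Running but not ready, since those receive no traffic either.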
Step 4: Read the pod logs.
kubectl logs my-app-5d8c7f9b4-abc12 -n production --previous
# Output:
# Error: DATABASE_URL environment variable is not set
# Process exited with code 1
Step 5: Identify the cause. The new version added a required environment variable (DATABASE_URL) that was not added to the deployment configuration. The application starts, tries to read the variable, fails, and exits.
Step 6: Fix and verify. Two options: roll back the deployment to the previous version (fastest, safest) or add the missing environment variable to the deployment manifest. For a production incident, roll back first to restore service, then fix the root cause in a proper PR.
# Roll back to the previous deployment
kubectl rollout undo deployment/my-app -n production
# Watch pods recover
kubectl get pods -n production -w
Step 7: Write a short post-mortem. Note what happened, what the root cause was, and what would have prevented it. In this case, that would be a CI check that validates required environment variables are present before deployment.
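The CI check suggested in the post-mortem could look something like the sketch below. It is deliberately naive: it assumes variables are declared as literal `name: VAR` entries in a single manifest file, so it will not see variables injected through `envFrom`, ConfigMaps, or templating. `missing_env_vars` and the file name in the usage line are hypothetical.

```shell
# Sketch of a pre-deploy check: report required variable names that do not
# appear as "name: <VAR>" entries in a Kubernetes manifest.
# missing_env_vars is a hypothetical helper; the matching is plain text.
missing_env_vars() {
  local manifest="$1" var status=0
  shift
  for var in "$@"; do
    # Match lines such as "- name: DATABASE_URL" (dash and quotes optional).
    if ! grep -Eq "^[[:space:]]*(-[[:space:]]*)?name:[[:space:]]*\"?${var}\"?[[:space:]]*$" "$manifest"; then
      echo "missing: $var"
      status=1
    fi
  done
  return "$status"
}

# Usage in CI (fails the job when anything is missing):
#   missing_env_vars deployment.yaml DATABASE_URL || exit 1
```

The list of required names has to come from somewhere the application and the pipeline both trust, for example a file checked in next to the code; that part is left out here.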
Common patterns that appear repeatedly
These scenarios come up often enough that recognising them saves significant time:
- Permissions error on startup: Service starts and immediately fails with “access denied”. Usually a missing IAM permission. Check the execution role, find the action name in the error, add it to the policy.
- DNS resolution failure inside a VPC: Service cannot resolve an internal hostname. Check that the VPC has DNS resolution enabled and that the target service has an internal DNS record.
- Resource limit hit: Service was working, now intermittently failing. Check CPU/memory limits (OOMKilled events), disk space, or cloud service quotas.
- Certificate expired: HTTPS connections fail with TLS error. Check certificate expiry date in the cloud console. Most teams use auto-renewal — if it broke, the renewal process failed.
- Secret rotation broke something: Service suddenly cannot authenticate to a database or external API. A secret was rotated but the service was not updated with the new value.
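For the certificate pattern, expiry can also be checked from the command line rather than the console. This is a minimal sketch, assuming the certificate is available as a local PEM file and that `openssl` is installed; `cert_expires_within` is a hypothetical helper built on `openssl x509 -checkend`.

```shell
# Sketch: report whether a PEM certificate expires within N days.
# cert_expires_within is a hypothetical helper. `openssl x509 -checkend S`
# exits non-zero when the certificate expires within S seconds.
# Note: an unreadable or invalid file is also reported as "expiring".
cert_expires_within() {
  local pem="$1" days="$2"
  if openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$pem" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "expiring"
    return 1
  fi
}

# Usage: alert when the certificate has fewer than 14 days left.
#   cert_expires_within /path/to/cert.pem 14 || echo "renew soon"
```

Run on a schedule, a check like this catches a failed auto-renewal before the certificate actually lapses.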
Summary
- Read the full error message before doing anything — it tells you which layer to look at
- Change one thing at a time; work backwards from the symptom
- Error patterns map to layers: connection refused = network, 403 = IAM, name not found = DNS
- Check recent changes and cloud service status pages early in any incident
- For production incidents, restore service first (roll back) then fix the root cause