Reading Cloud Logs: Log Literacy for Cloud Engineers

Logs are the primary way you understand what a running cloud system is doing. Reading them well is a skill — not just opening a log viewer and scrolling, but knowing which logs to look at, how to query them efficiently, and what patterns signal a real problem. This page builds that skill from the ground up.

The different types of cloud logs

Cloud environments produce several distinct types of logs. Each answers a different question. Knowing which type to look at first saves significant time.

Application logs are written by your code. They tell you what the application was doing — which requests came in, what decisions were made, what errors occurred. These are the most useful for debugging application behaviour.

Access logs are written by load balancers, API gateways, and web servers. They record every HTTP request: timestamp, source IP, path, status code, response time. Essential for identifying which requests are failing and whether a problem is widespread or isolated to specific paths.

Audit logs record who did what in your cloud account — API calls, console logins, resource creation and deletion. In AWS, this is CloudTrail. In GCP, it is Cloud Audit Logs. Audit logs are primarily useful for security and compliance, and for understanding what changed in an infrastructure incident.

VPC flow logs record network traffic metadata in your virtual network — source and destination IP, port, protocol, bytes transferred, and whether traffic was accepted or rejected. They do not contain the packet contents, but they show you what tried to connect to what and whether it was allowed.

Platform logs come from cloud services themselves — managed database logs, Kubernetes control plane logs, function execution logs. These are separate from application logs and tell you what the platform was doing.

Structured versus unstructured logs

An unstructured log line is free-form text:

2026-03-20 14:32:01 ERROR Failed to connect to database: connection refused at 10.0.1.50:5432

A structured log line is formatted as a parseable object, usually JSON:

{
  "timestamp": "2026-03-20T14:32:01.234Z",
  "level": "ERROR",
  "message": "Failed to connect to database",
  "error": "connection refused",
  "host": "10.0.1.50",
  "port": 5432,
  "service": "payment-service",
  "request_id": "abc-123"
}

Structured logs are significantly easier to work with in cloud logging systems. Cloud Logging and CloudWatch can filter and query on individual fields: you can search for host = "10.0.1.50", or for level = "ERROR" AND service = "payment-service". The exact syntax differs per system; in Cloud Logging, for example, fields of a JSON log are addressed as jsonPayload.host. With unstructured logs, you are limited to text matching, which is less precise and slower.
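
To make the difference concrete, here is a minimal Python sketch that applies both approaches to log lines like the two samples above. The function names and the third "misleading" line are invented for illustration; real logging backends do this filtering server-side, so treat it as a model of the idea rather than of any particular product.

```python
import json
import re

unstructured = ("2026-03-20 14:32:01 ERROR Failed to connect to database: "
                "connection refused at 10.0.1.50:5432")

# A hypothetical INFO line that merely mentions the word ERROR.
misleading = ("2026-03-20 14:35:10 INFO Recovered from earlier ERROR; "
              "reconnected to 10.0.1.50:5432")

structured = json.dumps({
    "timestamp": "2026-03-20T14:32:01.234Z",
    "level": "ERROR",
    "message": "Failed to connect to database",
    "host": "10.0.1.50",
    "port": 5432,
})
structured_info = json.dumps({
    "timestamp": "2026-03-20T14:35:10.000Z",
    "level": "INFO",
    "message": "Recovered from earlier ERROR; reconnected",
    "host": "10.0.1.50",
    "port": 5432,
})


def matches_text(line: str) -> bool:
    """Text matching: 'ERROR' followed somewhere later by the IP address."""
    return re.search(r"\bERROR\b.*10\.0\.1\.50", line) is not None


def matches_fields(line: str) -> bool:
    """Field matching: the level and host fields must have exact values."""
    entry = json.loads(line)
    return entry.get("level") == "ERROR" and entry.get("host") == "10.0.1.50"
```

The text pattern matches the misleading INFO line as well as the real error, while the field comparison does not, which is the precision gap described above.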

If you are writing application code that will run in a cloud environment, emit structured JSON logs. Most popular logging libraries support this as a configuration option.
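
If your logging library has no JSON option, the standard library is enough for a basic version. The sketch below assumes Python's built-in logging module; the class name JsonFormatter, the helper build_logger, the "fields" convention for extra data, and the hard-coded service name are all choices made for this example, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "payment-service",  # hypothetical service name
        }
        # Merge any fields passed as extra={"fields": {...}} (our own convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

A call such as build_logger("payments").error("Failed to connect to database", extra={"fields": {"host": "10.0.1.50", "port": 5432}}) then produces one JSON line with host and port as separate, queryable fields.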

Cloud Logging and CloudWatch: basic querying

GCP Cloud Logging and AWS CloudWatch Logs both store and query logs, but with different query languages.

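GCP Cloud Logging uses the Logging query language, which filters log entries by comparing fields to values. Conditions written on separate lines are combined with an implicit AND. Key query patterns: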

# Filter by severity
severity >= ERROR

# Filter by resource (a specific Kubernetes container)
resource.type="k8s_container"
resource.labels.namespace_name="production"
resource.labels.container_name="my-app"

# Filter by a field in a structured log
jsonPayload.request_id="abc-123"
jsonPayload.level="ERROR"

# Filter by time and text
timestamp >= "2026-03-20T14:00:00Z"
textPayload:"database connection"

# Combine conditions
severity >= ERROR AND resource.labels.namespace_name="production"
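
When the same filters are used repeatedly, it can help to assemble them in code rather than retype them. The following is a small sketch in plain Python; build_filter and quote are hypothetical helpers written for this page, and the escaping rule (backslash before quotes and backslashes) is an assumption you should check against the query language reference.

```python
def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_filter(*clauses: str) -> str:
    """Join non-empty clauses with an explicit AND."""
    cleaned = [clause.strip() for clause in clauses if clause.strip()]
    return " AND ".join(cleaned)


production_errors = build_filter(
    "severity >= ERROR",
    'resource.type="k8s_container"',
    f"resource.labels.namespace_name={quote('production')}",
    f"jsonPayload.request_id={quote('abc-123')}",
    'timestamp >= "2026-03-20T14:00:00Z"',
)
```

The resulting string can be pasted into the Logs Explorer query box or passed as the filter argument to gcloud logging read.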

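AWS CloudWatch Logs Insights uses a pipe-based query language in which each command passes its results to the next. The time range is not part of the query text; you choose it separately, in the console's time picker or as parameters of the API call: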

# Find errors in the selected time range (for example, the last hour)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

# Count errors by type
fields @timestamp, @message
| filter level = "ERROR"
| stats count(*) as error_count by error_type
| sort error_count desc

# Find slow requests (assuming duration is logged in milliseconds)
fields @timestamp, duration, path
| filter duration > 1000
| sort duration desc
| limit 50
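
Queries like these can also be run from a script. The sketch below assumes a boto3 CloudWatch Logs client (created with boto3.client("logs")) and uses its start_query and get_query_results calls; the function name, the polling defaults, and the way rows are flattened into dictionaries are choices made for this example.

```python
import time


def run_insights_query(client, log_group: str, query: str,
                       start_epoch: int, end_epoch: int,
                       poll_seconds: float = 1.0, max_polls: int = 60):
    """Start a Logs Insights query and poll until it completes.

    `client` is expected to behave like boto3.client("logs").
    `start_epoch` and `end_epoch` are Unix timestamps in seconds.
    """
    started = client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )
    query_id = started["queryId"]

    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError("query did not complete within the polling limit")
```

Passing the client in as a parameter keeps the function easy to test with a stub and leaves credentials and region configuration to the caller.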

What to look for in logs

Knowing how to query is one skill. Knowing what to look for is another. These are the patterns that reliably signal problems:

Error spikes: A sudden increase in error log volume usually correlates with a deployment, a configuration change, or an external dependency becoming unavailable. Look at error rate over time, not just absolute error count.

Repeated authentication failures: Multiple failed login or API authentication attempts in a short window could be a misconfigured service (wrong credentials) or a brute-force attack. Either way, it needs attention.

Memory exhaustion patterns: Applications running out of memory often produce warnings before they crash — “heap usage at 90%”, “GC pressure increasing”. These appear in logs before the OOMKill event.

Timeout patterns: Repeated timeout errors calling the same downstream service suggest that service is degraded or overloaded. Check if the target service is healthy and whether retry logic is making the problem worse.

Missing logs: Sometimes the most important signal is absence. If logs stop appearing from a service, the service may have crashed and not restarted, or the log shipping pipeline may have broken.
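
The first pattern, looking at error rate rather than raw error count, can be sketched in a few lines of Python. This example assumes structured entries with ISO 8601 UTC "timestamp" and "level" fields like the JSON sample earlier on this page; the function names and the spike thresholds (three times the previous minute, and at least 5%) are arbitrary illustrations, not recommended values.

```python
from collections import Counter


def error_rate_by_minute(entries):
    """Return {minute: fraction of entries at ERROR level}."""
    totals, errors = Counter(), Counter()
    for entry in entries:
        # "2026-03-20T14:32:01.234Z"[:16] -> "2026-03-20T14:32"
        minute = entry["timestamp"][:16]
        totals[minute] += 1
        if entry["level"] == "ERROR":
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in sorted(totals)}


def find_spikes(rates, factor=3.0, floor=0.05):
    """Flag minutes whose error rate is at least `floor` and at least
    `factor` times the rate of the previous minute that has data."""
    minutes = sorted(rates)
    spikes = []
    for previous, current in zip(minutes, minutes[1:]):
        if rates[current] >= floor and rates[current] >= factor * rates[previous]:
            spikes.append(current)
    return spikes
```

Dividing by total volume matters because a traffic increase raises the absolute error count even when the service is behaving normally.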

Reading VPC flow logs for connectivity problems

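VPC flow logs describe network traffic at the flow level: each record summarizes the packets exchanged between one source and one destination during a capture window, without the packet contents. They are particularly useful when you know a service cannot connect somewhere but the application logs do not give you enough information.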

An AWS VPC flow log record in the default format has these fields, separated by spaces (start and end are Unix timestamps in seconds):

version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status

# Example: ACCEPT
2 123456789012 eni-abc123 10.0.1.50 10.0.2.30 54321 5432 6 10 1200 1679318400 1679318460 ACCEPT OK

# Example: REJECT
2 123456789012 eni-abc123 10.0.1.50 10.0.2.30 54321 5432 6 5 600 1679318400 1679318460 REJECT OK

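The action field is the most important: ACCEPT means the traffic was allowed through, REJECT means a security group or network ACL blocked it. In both examples, protocol 6 is TCP and destination port 5432 is the PostgreSQL default, so the records show 10.0.1.50 trying to reach a database at 10.0.2.30. If you see REJECT entries for the source and destination you are debugging, check the security group and network ACL rules on that path first. If you see only ACCEPT entries, the network allowed the traffic and the cause lies elsewhere, for example in the target service itself.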
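
When there are too many records to scan by eye, a short script can pick out the rejected flows for one path. This sketch assumes the space-separated default field order listed above; the function names are made up for the example, and the sample records are modelled on the ones shown here.

```python
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line: str) -> dict:
    """Split one default-format flow log record into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected_flows(lines, srcaddr: str, dstaddr: str, dstport: int) -> list:
    """Return REJECT records for one source, destination, and port."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        record = parse_flow_record(line)
        if (record["action"] == "REJECT"
                and record["srcaddr"] == srcaddr
                and record["dstaddr"] == dstaddr
                and record["dstport"] == str(dstport)):
            matches.append(record)
    return matches


sample = [
    "2 123456789012 eni-abc123 10.0.1.50 10.0.2.30 54321 5432 6 10 1200 1679318400 1679318460 ACCEPT OK",
    "2 123456789012 eni-abc123 10.0.1.50 10.0.2.30 54321 5432 6 5 600 1679318400 1679318460 REJECT OK",
]
blocked = rejected_flows(sample, "10.0.1.50", "10.0.2.30", 5432)
```

Here blocked contains only the second record, which points the investigation at the firewall rules between the two addresses.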

Writing log lines that are actually useful

Reading logs is half the skill. Writing code that produces useful logs is the other half. The logs your application writes today are the ones you will be debugging at 2am.

Principles for useful log lines:

Include context. A log line that says ERROR: failed is useless. ERROR: failed to process order {"order_id": "ord-123", "error": "payment gateway timeout", "attempt": 2} tells you exactly what failed and what the state was.
Include a request or trace ID. If every log line for a single request shares a request_id, you can filter all logs for that request and see the full story. Without it, you are trying to reconstruct a sequence from timestamps alone.
Log at the right level. DEBUG for detailed execution traces (usually off in production). INFO for significant events (request received, job started). WARN for unexpected but handled situations. ERROR for failures that need investigation.
Do not log sensitive data. Passwords, credit card numbers, API keys, personal information — none of this belongs in logs. Check what you are logging in error handlers, which often capture request data.

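import uuid
from typing import Optional

import structlog

# Render each event as one JSON object with a level and an ISO timestamp.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

def process_payment(order_id: str, amount: float, request_id: Optional[str] = None):
    # Reuse the caller's request ID so every log line for one request shares it;
    # generate a new one only when none was passed in.
    request_id = request_id or str(uuid.uuid4())
    logger = log.bind(request_id=request_id, order_id=order_id)

    logger.info("payment_processing_started", amount=amount)

    try:
        # payment_gateway is assumed to be a client object defined elsewhere.
        result = payment_gateway.charge(amount)
        logger.info("payment_successful", transaction_id=result.transaction_id)
        return result
    except TimeoutError as e:
        logger.error("payment_gateway_timeout", error=str(e), amount=amount)
        raise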
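
The last principle, keeping sensitive data out of logs, can be partly enforced in code by redacting known fields before anything is logged. The sketch below is one possible approach in plain Python; the key names in SENSITIVE_KEYS and the "[REDACTED]" placeholder are assumptions for illustration, and a key list like this only catches fields you have thought of in advance.

```python
# Hypothetical list of field names that must never appear in logs.
SENSITIVE_KEYS = frozenset({"password", "card_number", "api_key", "authorization"})


def redact(data, sensitive_keys=SENSITIVE_KEYS):
    """Return a copy of `data` with sensitive values replaced.

    Walks nested dicts and lists; any other value is returned unchanged.
    """
    if isinstance(data, dict):
        return {
            key: "[REDACTED]"
            if str(key).lower() in sensitive_keys
            else redact(value, sensitive_keys)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact(item, sensitive_keys) for item in data]
    return data
```

Calling redact on request data inside error handlers, before passing it to the logger, addresses the case mentioned above where handlers capture the whole request.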

Log retention and cost

Cloud logging services charge for log ingestion and storage. High-volume applications can generate significant logging costs if log retention is not managed. A few practical points:

Set retention periods on log groups or buckets. Most teams keep production logs for 30–90 days. Audit logs may need longer retention for compliance reasons.
Sample high-volume logs where appropriate. If a health check endpoint is called 1,000 times per minute, logging every successful 200 response adds up. Log only failures, or log successes at a sampled rate (1 in 100).
Use log levels to control volume. DEBUG logs should be off in production unless you are actively investigating something — they produce large volumes of data for little operational value.
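
Sampling can be implemented at the logger itself. The following sketch uses a filter from Python's standard logging module that keeps every warning and error but only one in every n lower-level records; the class name EveryNthFilter is invented for this example, and the simple counter is not synchronized across threads, so the ratio is approximate under concurrency.

```python
import logging


class EveryNthFilter(logging.Filter):
    """Keep all WARNING-and-above records and 1 in `n` of the rest."""

    def __init__(self, n: int = 100):
        super().__init__()
        self.n = n
        self._seen = 0  # count of low-severity records seen so far

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop failures
        self._seen += 1
        # Keep the 1st, (n+1)th, (2n+1)th, ... low-severity record.
        return (self._seen - 1) % self.n == 0
```

Attaching it with logger.addFilter(EveryNthFilter(100)) on a health check logger would cut 1,000 successful entries per minute to about 10 while leaving failures untouched.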

Career insight: Engineers who write good log lines and know how to query logs quickly become the people their team relies on during incidents. It is a skill that is learned by doing — the next time something breaks, resist the temptation to jump to fixes and spend a few minutes thoroughly reading the logs. Each incident is practice.