Debug AWS Lambda Failures: Logs, Timeouts, OOM & X-Ray

Lambda runtime failures are harder to diagnose than startup errors because the cause depends on what your code does, what input it receives, and how downstream services behave. This page walks through the full diagnostic sequence: reading the REPORT line in CloudWatch Logs, identifying timeouts, memory exhaustion, and exceptions, tracing downstream bottlenecks with X-Ray, and handling async invocation failures that never surface to the caller.

What is a Lambda runtime failure?

A Lambda failure means the function invocation did not complete successfully. There are three distinct patterns:

The function never starts. The deployment package is broken, the handler path is wrong, or a dependency is missing. This is a startup failure. See Lambda Function Failed to Start for that case.
The function starts but fails during execution. Your code threw an unhandled exception, the function ran out of time, the function hit its memory limit, or a downstream service stopped responding.
The function succeeds sometimes but fails under load or on specific inputs. Concurrency throttling, intermittent downstream errors, or input data that triggers an untested code path.

This page covers the second and third patterns: failures that happen at runtime, after the function has started.

When to use this guide

Come here when you see any of these symptoms:

CloudWatch metrics show invocation errors but the log messages are not obvious
Logs end with “Task timed out after X.XX seconds”
The log ends abruptly with no error message or stack trace
An async event (S3, SNS, EventBridge) is not producing results downstream
The function fails intermittently; some invocations succeed, some don’t
Failures appear under load but not in isolated testing

If the function fails immediately on every invocation and never produces a START line, check Lambda Function Failed to Start instead.

How Lambda failure debugging works

Every Lambda invocation writes a log stream to CloudWatch. For runtime failures, the diagnostic sequence is:

Identify the invocation type. Synchronous (API Gateway, direct SDK call) vs asynchronous (S3, SNS, EventBridge) vs poll-based (SQS, Kinesis). Async and poll-based failures don’t automatically surface to the caller; you need a failure destination to see them.
Read the REPORT line. Duration and memory usage are in every invocation log, and Init Duration is added on cold starts. These numbers narrow the failure type immediately.
Look for a stack trace or timeout message. A stack trace means a code error. A timeout message means the function ran out of time. No message and an abrupt log ending often points to memory exhaustion.
Separate Lambda problem from downstream problem. A timeout doesn’t necessarily mean your code is wrong. The bottleneck may be a downstream service (DynamoDB, RDS, an external API). Enable X-Ray to see time broken down by service call.
Check async failure handling. For async invocations, failures are retried internally, then discarded unless you’ve configured a dead-letter queue or Lambda destination.
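The first step, identifying the invocation type, can be kept as a small lookup. This is an illustrative sketch only: the dictionary names and the where_failures_surface helper are assumptions made for this example, not an AWS API, and the table covers only the event sources named in the steps above.

```python
# Illustrative lookup, not an AWS API: event source -> invocation type.
INVOCATION_TYPE = {
    "API Gateway": "synchronous",
    "direct SDK call": "synchronous",  # assuming the default request/response call
    "S3": "asynchronous",
    "SNS": "asynchronous",
    "EventBridge": "asynchronous",
    "SQS": "poll-based",
    "Kinesis": "poll-based",
}

# Where to look for a failure, by invocation type.
FAILURE_SURFACE = {
    "synchronous": "error is returned to the caller",
    "asynchronous": "retried, then discarded unless a DLQ or destination is configured",
    "poll-based": "no caller to notify; check the event source and its failure settings",
}


def where_failures_surface(source):
    """Return (invocation type, where failures show up) for a known event source."""
    kind = INVOCATION_TYPE[source]
    return kind, FAILURE_SURFACE[kind]


print(where_failures_surface("S3"))
```

Knowing the type up front tells you whether to start in the caller's error response or in the failure destination.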

What to check first

Work through this sequence before diving into code changes:

Open CloudWatch Logs for the function. Log group: /aws/lambda/{function-name}.
Find the REPORT line for a failing invocation. Is Duration close to the configured timeout? Likely a timeout.
Is Max Memory Used equal to Memory Size? Possible memory exhaustion.
Is there a stack trace in the log? Unhandled exception. Read the exception type and message.
Does the log end without any error message or trace? Strong signal of memory exhaustion.
Is the function asynchronous? Check whether a DLQ or Lambda destination is configured and whether it received anything.
Is the failure intermittent and only appears under load? Check Lambda throttling metrics in CloudWatch. Lambda scaling and concurrency limits can cause sporadic failures when the concurrency ceiling is hit.

Startup failures vs runtime failures

Lambda failures split into two categories with different debugging paths:

Startup failure: the function never executes your handler. The error appears immediately, either before the START line or right after it. Common messages:

Runtime.ImportModuleError — a dependency is missing from the deployment package
Runtime.ExitError — the runtime process exited before the handler ran
Handler 'lambda_handler' missing on module 'handler' — wrong handler path in the function configuration

Runtime failure: the function starts, executes your handler, and then fails. The log shows real work happening: database calls, processing lines, log statements from your code. Then either a stack trace, a timeout message, or an abrupt ending.

This page is for runtime failures. For startup failures, see Lambda Function Failed to Start.

Read the REPORT line first

Every Lambda invocation ends with a REPORT line in CloudWatch Logs. It’s the fastest diagnostic signal available and should always be the first thing you read.

Mental model: Think of the REPORT line as a flight data recorder. It captures the same three measurements on every invocation: how long the function ran, how much memory it used, and whether it started cold. You read it after the failure to understand what the function was doing right before it ended.

START RequestId: abc-123-def-456 Version: $LATEST
...your log output...
END RequestId: abc-123-def-456
REPORT RequestId: abc-123-def-456  Duration: 1234.56 ms  Billed Duration: 1235 ms  Memory Size: 256 MB  Max Memory Used: 87 MB  Init Duration: 412.34 ms

The fields:

Duration — how long your handler ran, in milliseconds
Billed Duration — Duration rounded up to the nearest 1ms (used for billing)
Max Memory Used — the peak memory the function used during this invocation
Memory Size — the memory limit configured for the function
Init Duration — only appears on cold starts; the time Lambda spent initializing the execution environment before the handler ran
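If you export log lines and want these fields as numbers, a regular expression is enough. The sketch below assumes the REPORT layout shown above; parse_report and the cold_start key are names chosen for this example.

```python
import re

# Longer field names come first so "Duration" does not shadow "Billed Duration".
_FIELD = re.compile(
    r"(Billed Duration|Init Duration|Max Memory Used|Memory Size|Duration):\s*([\d.]+)\s*(ms|MB)"
)


def parse_report(line):
    """Parse a REPORT line into a dict of floats, or return None for other lines."""
    if not line.startswith("REPORT"):
        return None
    fields = {name: float(value) for name, value, _unit in _FIELD.findall(line)}
    # Init Duration is only present when the invocation was a cold start.
    fields["cold_start"] = "Init Duration" in fields
    return fields


sample = (
    "REPORT RequestId: abc-123-def-456  Duration: 1234.56 ms  Billed Duration: 1235 ms  "
    "Memory Size: 256 MB  Max Memory Used: 87 MB  Init Duration: 412.34 ms"
)
print(parse_report(sample))
```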

Three failure patterns visible from the REPORT line alone:

Timeout: Duration matches the configured timeout, and the log contains “Task timed out after X.XX seconds”.

[ERROR] Task timed out after 30.01 seconds
REPORT RequestId: abc-123  Duration: 30010.00 ms  Billed Duration: 30000 ms  Memory Size: 256 MB  Max Memory Used: 89 MB

Memory exhaustion: Max Memory Used equals Memory Size, and the log ends without a stack trace. When a function hits its memory limit, Lambda kills the runtime process immediately, so your code never gets the chance to log an exception. At most you may see a runtime-level line such as “Runtime exited with error: signal: killed” near the REPORT line. The abrupt ending combined with Max Memory Used = Memory Size is the reliable signal.

REPORT RequestId: abc-123  Duration: 5234.56 ms  Billed Duration: 5235 ms  Memory Size: 256 MB  Max Memory Used: 256 MB

Unhandled exception: Duration is well under the timeout, and there is a stack trace before the REPORT line.

[ERROR] ValueError: invalid literal for int() with base 10: 'abc'
Traceback (most recent call last):
  File "/var/task/handler.py", line 23, in lambda_handler
    count = int(event['count'])
ValueError: invalid literal for int() with base 10: 'abc'
END RequestId: abc-123
REPORT RequestId: abc-123  Duration: 45.23 ms  Billed Duration: 46 ms  Memory Size: 256 MB  Max Memory Used: 62 MB

Short Duration and a stack trace means a code error, not a resource limit.
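The three patterns can be written down as a rough triage function. This is a sketch under stated assumptions, not a definitive classifier: classify is a hypothetical helper, timeout_s is the configured timeout (it is not part of the REPORT line, so you supply it), and the "Traceback" check only recognizes Python stack traces.

```python
def classify(duration_ms, timeout_s, max_memory_mb, memory_size_mb, log_text):
    """Guess the failure type from REPORT values plus the invocation's log text."""
    # Timeout: explicit message, or the duration reached the configured limit.
    if "Task timed out" in log_text or duration_ms >= timeout_s * 1000:
        return "timeout"
    # Stack trace well under the timeout: a code error, not a resource limit.
    if "Traceback" in log_text:
        return "unhandled exception"
    # No trace and memory pinned at the limit: likely killed for memory.
    if max_memory_mb >= memory_size_mb:
        return "possible memory exhaustion"
    return "unclassified"


print(classify(30010.0, 30, 89, 256, "Task timed out after 30.01 seconds"))
print(classify(5234.56, 30, 256, 256, ""))
print(classify(45.23, 30, 62, 256, "Traceback (most recent call last):"))
```

The memory check comes last on purpose: a function can finish successfully while sitting at its limit, so that signal only counts once a timeout and a stack trace have been ruled out.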

Note: For monitoring Lambda at scale, including tracking error rates, memory trends, and cold start frequency across many invocations, see AWS Lambda Monitoring with CloudWatch.

Failure type comparison

Failure type	Symptom in logs	First thing to check	Likely fix
Timeout	“Task timed out after X.XX seconds”, Duration equals timeout	X-Ray to find the slow segment	Increase timeout or fix the downstream service
Memory exhaustion	Max Memory Used = Memory Size, log ends abruptly	Increase memory allocation, rerun to confirm	Increase Lambda memory or reduce memory usage in code
Unhandled exception	Stack trace before END line, Duration well under timeout	Read the exception type and message	Fix the code error
Downstream bottleneck	High Duration, no exception, downstream call is slow	X-Ray trace, downstream service metrics	Scale or fix the downstream service
Throttle / concurrency limit	`TooManyRequestsException` or errors only under load	Lambda concurrency metrics in CloudWatch	Increase reserved concurrency or request a limit increase
Async failure not received	No errors in Lambda logs, no result downstream	DLQ or Lambda destination configured?	Configure a DLQ or destination for the function

CloudWatch Logs workflow

Every Lambda invocation writes to a log group named /aws/lambda/{function-name}. Useful queries for finding failures:

Stream logs in real time during testing:

aws logs tail /aws/lambda/my-function --follow

Filter for errors in the last hour:

# date -d is GNU syntax; on macOS/BSD use $(date -v-1H +%s000) instead
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000)

Filter for timeout events specifically:

aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "Task timed out"

Filter for cold starts:

aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "Init Duration"

What to look for when reading logs:

The REPORT line first, always. It tells you the failure category before you read anything else.
The gap between your last log statement and the REPORT line. If your code logged “Starting DB query” and the next line is the REPORT, the failure happened during that call.
The exception type. TimeoutError, ConnectionError, and AccessDeniedException each point to different root causes.
Whether errors are consistent or intermittent. If the same error appears on every invocation, the cause is systematic. If it appears on some invocations and not others, look at what differs between them: input size, time of day, concurrency level.
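The "gap before the REPORT line" check is easy to automate on exported log text. The sketch below assumes the lines come from a single log stream, where invocations do not interleave; last_line_before_report is a name invented for this example.

```python
def last_line_before_report(lines):
    """Map each RequestId to the last application log line before its END/REPORT line."""
    result = {}
    current_id = None
    last_app_line = None
    for line in lines:
        if line.startswith("START RequestId: "):
            current_id = line.split()[2]
            last_app_line = None
        elif line.startswith(("END RequestId: ", "REPORT RequestId: ")):
            # Record once per invocation, whichever closing line arrives first.
            if current_id is not None:
                result[current_id] = last_app_line
                current_id = None
        elif current_id is not None:
            last_app_line = line
    return result


logs = [
    "START RequestId: abc-123 Version: $LATEST",
    "Loaded 40 records",
    "Starting DB query",
    "REPORT RequestId: abc-123  Duration: 30010.00 ms  Billed Duration: 30000 ms",
]
print(last_line_before_report(logs))
```

A value of None means the invocation produced no application output at all, which is itself a hint that the failure happened very early in the handler.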

For deeper querying, log metric filters, and alarms on Lambda error rates, see CloudWatch Logs.

X-Ray for downstream failures

CloudWatch Logs tell you that something failed. X-Ray tells you where the time went. This distinction matters most for timeout failures where the Lambda code is correct but a downstream service is slow or throttling.

Enable X-Ray before you need it: X-Ray traces only exist for invocations that occurred while tracing was enabled. Enable active tracing in production now, while things are working, so the data is there when something breaks.

Enable active tracing on a function:

aws lambda update-function-configuration \
  --function-name my-function \
  --tracing-config Mode=Active

The execution role needs xray:PutTraceSegments and xray:PutTelemetryRecords. The AWSXRayDaemonWriteAccess managed policy covers both.

Once enabled, every invocation produces a trace in the X-Ray console showing:

Initialization segment — cold start time, if applicable
Invocation segment — total handler duration
Subsegments — individual AWS service calls (DynamoDB, S3, SQS, etc.) with per-call durations and error rates

If your function times out because DynamoDB is throttling writes, X-Ray shows the DynamoDB subsegment consuming nearly all the invocation time. Without X-Ray, the log only shows a total duration of 30 seconds with no indication of which call was responsible.
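The same question, which call consumed the time, can be asked of an exported segment document. X-Ray segment documents nest subsegments with name, start_time, and end_time fields (epoch seconds); the helper name and the sample values below are illustrative assumptions, not output from a real trace.

```python
def slowest_leaf_subsegment(segment):
    """Return (name, seconds) for the longest leaf subsegment, or None if there are none."""
    slowest = None
    stack = list(segment.get("subsegments", []))
    while stack:
        sub = stack.pop()
        children = sub.get("subsegments", [])
        if children:
            # A parent's time includes its children, so only leaves are compared.
            stack.extend(children)
            continue
        if "end_time" not in sub:
            continue  # still in progress when the document was written
        elapsed = sub["end_time"] - sub["start_time"]
        if slowest is None or elapsed > slowest[1]:
            slowest = (sub["name"], elapsed)
    return slowest


# Hypothetical segment document shaped like the throttled-write case above.
segment = {
    "name": "my-function",
    "subsegments": [
        {"name": "query", "start_time": 100.000, "end_time": 100.045},
        {"name": "put_item", "start_time": 100.050, "end_time": 129.850},
    ],
}
print(slowest_leaf_subsegment(segment))
```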

For custom subsegments around your own code (external HTTP calls, database queries):

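from aws_xray_sdk.core import xray_recorder

def lambda_handler(event, context):
    # in_subsegment() is the context-manager form; begin_subsegment() has to be closed manually.
    with xray_recorder.in_subsegment('external-api-call') as subsegment:
        # call_external_api is a placeholder for your own HTTP or database call.
        result = call_external_api(event['id'])
        subsegment.put_annotation('status_code', result.status_code)
    # Return a JSON-serializable value rather than the raw response object.
    return {'status_code': result.status_code}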

For how traces, segments, and sampling work in production, see Distributed Tracing in AWS.

Async invocation failures

When Lambda is invoked synchronously (API Gateway, direct SDK call), failures return immediately to the caller. When Lambda is invoked asynchronously by S3 event notifications, SNS topics, or EventBridge rules, failures are retried internally and then silently discarded unless you configure a failure destination.

Analogy: Configuring a DLQ or Lambda destination for an async function is like setting up voicemail. Without it, if Lambda can’t answer the call after a few retries, the event just disappears. There is no missed-call log. With a destination configured, every failed event lands somewhere you can inspect and replay.

How async retries work:

Lambda retries a failed async invocation up to two more times by default (three total attempts). You can configure the retry count (0, 1, or 2) and the maximum event age (up to 6 hours). If an event ages out before all retries complete, it goes to the failure destination regardless of the retry count.
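The retry arithmetic is simple enough to check in a few lines. This sketch builds the two settings using the parameter names from the PutFunctionEventInvokeConfig API (MaximumRetryAttempts and MaximumEventAgeInSeconds); the helper itself and its validation are assumptions for illustration.

```python
def event_invoke_settings(retry_attempts=2, max_event_age_s=21600):
    """Validate async retry settings and return (settings, total attempts)."""
    if retry_attempts not in (0, 1, 2):
        raise ValueError("retry attempts must be 0, 1, or 2")
    # Event age is configured in seconds: 60 up to 21600 (6 hours).
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("maximum event age must be between 60 and 21600 seconds")
    settings = {
        "MaximumRetryAttempts": retry_attempts,
        "MaximumEventAgeInSeconds": max_event_age_s,
    }
    # The first attempt is not a retry, so total attempts = 1 + retries.
    return settings, 1 + retry_attempts


print(event_invoke_settings())     # default: three attempts in total
print(event_invoke_settings(0, 3600))  # no retries, events expire after an hour
```

Setting retries to 0 is reasonable for non-idempotent handlers, as long as a failure destination is in place to catch what fails on the first attempt.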

Dead-letter queue (DLQ):

Configure an SQS queue to receive events that exhaust all retries:

aws lambda update-function-configuration \
  --function-name my-async-function \
  --dead-letter-config TargetArn=arn:aws:sqs:us-east-1:111122223333:lambda-dlq

Failed events arrive in the DLQ with metadata about the failure. You can inspect the messages to understand what data was being processed when the failure occurred.

Lambda destinations (preferred for new functions):

Destinations route both success and failure outcomes to SQS, SNS, EventBridge, or another Lambda function. Unlike a DLQ, they include the full request payload and error response, making it much easier to understand what failed and why:

aws lambda put-function-event-invoke-config \
  --function-name my-async-function \
  --destination-config '{
    "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:111122223333:failures-queue"},
    "OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:111122223333:success-queue"}
  }'

Use a DLQ when you only need to capture failures. Use destinations when you also want to route successes or need the full event context for debugging.

SQS-triggered Lambda works differently: When Lambda is triggered by an SQS queue via an event source mapping, it is a poll-based invocation, not a true async invocation. Lambda reads message batches from the queue and the event source mapping deletes messages only after successful processing. On failure, messages are not deleted; they become visible again after the queue visibility timeout and Lambda retries them. When the SQS maxReceiveCount is exceeded, SQS moves messages to the DLQ configured on the source queue via a redrive policy. Lambda-level DLQs and destinations do not apply to SQS-triggered functions. See AWS Lambda Event Triggers Explained for the full invocation type breakdown.
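For the SQS case, the redrive policy is a JSON string stored in the source queue's RedrivePolicy attribute. The sketch below builds the attribute values only; the helper name, the DLQ ARN, and the numbers are placeholders, and the six-times-the-function-timeout visibility setting follows AWS's general guidance for SQS event sources rather than anything specific to this guide.

```python
import json


def source_queue_attributes(dlq_arn, max_receive_count, function_timeout_s):
    """Build SQS queue attributes for failure handling on a Lambda source queue."""
    return {
        # SQS expects the redrive policy as a JSON-encoded string.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
        # Queue attribute values are strings; visibility timeout is in seconds.
        "VisibilityTimeout": str(6 * function_timeout_s),
    }


attrs = source_queue_attributes(
    "arn:aws:sqs:us-east-1:111122223333:source-queue-dlq",  # placeholder ARN
    max_receive_count=5,
    function_timeout_s=30,
)
print(attrs)
```

A visibility timeout shorter than the function timeout makes messages reappear while they are still being processed, which inflates the receive count and sends healthy messages to the DLQ.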

For SNS delivery failures that happen before the Lambda invocation even starts, see SNS Message Delivery Failures.

Cold start debugging

A cold start happens when Lambda initializes a new execution environment. The Init Duration field in the REPORT line shows how long this took. Cold starts only affect the first invocation on a new environment; subsequent invocations on the same environment run without Init Duration.

When cold starts are the real problem:

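Init Duration is large relative to the latency the caller can tolerate. For on-demand functions the init phase normally has its own 10-second limit and is not deducted from the function timeout, but the caller still waits for it: with a 4-second Init Duration, a client that gives up after 5 seconds leaves the handler only 1 second on a cold start. If init runs past 10 seconds, Lambda reruns it as part of the first invocation, where it does count against the configured timeout.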
The function is attached to a VPC. VPC cold starts involve ENI attachment, which adds latency.
The deployment package or Lambda layers are very large. Large packages take longer to unzip and initialize.
The function initializes a heavy framework or large model at the module level, outside the handler.

When cold starts are just noise:

Init Duration is short (under 500ms) and timeouts happen on warm invocations too. Cold starts are not the bottleneck.
Failures affect all invocations consistently, not just the first invocation on each environment.
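One way to decide between "real problem" and "noise" is to compare timeout rates for cold and warm invocations. This is a minimal sketch assuming you have already extracted, per invocation, the Init Duration (None when absent) and whether it timed out; the function name and input shape are choices made for this example.

```python
def timeout_rate_by_start_type(invocations):
    """invocations: iterable of (init_duration_ms or None, timed_out bool) pairs."""
    counts = {"cold": [0, 0], "warm": [0, 0]}  # [timeouts, total]
    for init_ms, timed_out in invocations:
        bucket = counts["cold" if init_ms is not None else "warm"]
        bucket[0] += int(timed_out)
        bucket[1] += 1
    # None means there were no invocations of that kind in the sample.
    return {kind: (t / n if n else None) for kind, (t, n) in counts.items()}


sample = [(412.34, True), (None, False), (None, False), (380.0, True)]
print(timeout_rate_by_start_type(sample))
```

Similar rates in both groups suggest cold starts are not the bottleneck; a rate that is high only for cold invocations points back at initialization.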

To filter for cold starts in logs:

aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "Init Duration"

If cold starts are genuinely causing failures, provisioned concurrency pre-initializes execution environments and eliminates Init Duration for those instances. For package size and memory optimization strategies, see Lambda Cost Optimisation.

Real troubleshooting scenarios

Scenario 1: Lambda times out after months of working fine

A function processes records from DynamoDB and writes summaries to a results table. Timeout is 30 seconds. Logs show:

[ERROR] Task timed out after 30.01 seconds
REPORT RequestId: xyz  Duration: 30010.23 ms  Billed Duration: 30000 ms  Memory Size: 256 MB  Max Memory Used: 88 MB

Memory is fine: 88 MB used out of 256 MB. It ran the full 30 seconds. Something is blocking the function, not exhausting its resources.

Enable X-Ray and trigger another invocation. The trace shows:

DynamoDB Query: 45ms (normal)
DynamoDB PutItem: 29,800ms (abnormal; nearly the entire timeout budget)

The write is taking almost the entire 30 seconds. That points to throttling: write capacity is exhausted, and the SDK keeps retrying the throttled request with backoff until the function runs out of time.

Confirm with CloudWatch metrics:

# date -d is GNU syntax; on macOS/BSD replace -d '2 days ago' with -v-2d
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name WriteThrottleEvents \
  --dimensions Name=TableName,Value=results-table \
  --start-time $(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics Sum

The metrics confirm WriteThrottleEvents spiking as data volume grew past provisioned write capacity.
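The JSON that get-metric-statistics returns has a Datapoints list, which is not guaranteed to be in time order. A small sketch for pulling out the hours with throttling; the function name and the sample values are made up for illustration and are not real metrics.

```python
import json


def throttled_periods(cli_output):
    """Return (timestamp, sum) pairs with throttle events, oldest first."""
    points = json.loads(cli_output)["Datapoints"]
    hits = [(p["Timestamp"], p["Sum"]) for p in points if p.get("Sum", 0) > 0]
    # ISO 8601 timestamps in one format sort correctly as plain strings.
    return sorted(hits)


# Hypothetical output for illustration only.
sample_output = json.dumps({
    "Label": "WriteThrottleEvents",
    "Datapoints": [
        {"Timestamp": "2026-05-12T14:00:00Z", "Sum": 310.0, "Unit": "Count"},
        {"Timestamp": "2026-05-12T09:00:00Z", "Sum": 0.0, "Unit": "Count"},
        {"Timestamp": "2026-05-12T13:00:00Z", "Sum": 42.0, "Unit": "Count"},
    ],
})
print(throttled_periods(sample_output))
```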

Fix: Switch to on-demand billing mode, which scales automatically:

aws dynamodb update-table \
  --table-name results-table \
  --billing-mode PAY_PER_REQUEST

The Lambda function code was correct throughout. X-Ray exposed a downstream bottleneck that CloudWatch Logs alone would not have revealed.

Scenario 2: Permission error discovered only in production

A Lambda function creates S3 objects as part of processing. It works during testing but fails in production on some invocations.

Symptom in logs:

[ERROR] ClientError: An error occurred (AccessDenied) when calling the
PutObject operation: User: arn:aws:sts::111122223333:assumed-role/process-role/process-function
is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::prod-results-bucket/..."

Root cause: The function was deployed to production without updating its IAM policy. The process-role grants s3:PutObject on the development bucket ARN only. In production, the bucket name is different and the policy doesn’t cover it.

Fix: Update the IAM policy to include the production bucket ARN, then use the IAM Policy Simulator to confirm before redeploying. See Fixing IAM AccessDenied Errors for the full simulation workflow.
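Before reaching for the simulator, a rough local check can show why the policy misses the production bucket. The sketch below is deliberately simplified and is not a substitute for the IAM Policy Simulator: it ignores Deny statements, conditions, and NotAction/NotResource, and approximates IAM wildcards with fnmatch. The dev-results-bucket name is hypothetical; only the production bucket name comes from the log above.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM accepts either a single item or a list for Statement, Action, and Resource."""
    return [value] if isinstance(value, (str, dict)) else list(value)


def allows(policy, action, resource_arn):
    """Rough check: does any Allow statement match this action and resource ARN?"""
    for statement in _as_list(policy["Statement"]):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        # Action names are case-insensitive; resource ARNs are matched as written.
        action_ok = any(fnmatchcase(action.lower(), a.lower()) for a in actions)
        resource_ok = any(fnmatchcase(resource_arn, r) for r in resources)
        if action_ok and resource_ok:
            return True
    return False


policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::dev-results-bucket/*",  # hypothetical dev bucket
    }],
}
print(allows(policy, "s3:PutObject", "arn:aws:s3:::dev-results-bucket/out.json"))
print(allows(policy, "s3:PutObject", "arn:aws:s3:::prod-results-bucket/out.json"))
```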

Scenario 3: Async function produces no output and shows no errors

A Lambda function triggered by S3 events is supposed to write processed results to an output bucket. The source bucket is receiving files, but the output bucket has no new objects.

Check invocation count in CloudWatch first. If invocations are zero, the function is not being triggered. Check the S3 event notification configuration. A prefix or suffix filter on the notification may not match the uploaded file names.

If invocations are non-zero but results are missing: check CloudWatch Logs. If logs show successful invocations, the function is running but writing to the wrong bucket name or key prefix.

If invocations show errors: enable a Lambda destination for OnFailure to capture the failed event payloads and understand what input is causing the failures.
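For the zero-invocations branch, the prefix and suffix filter can be checked against a real object key in a couple of lines. This sketch assumes simple literal matching on the object key, and the key and filter values are examples, not taken from a real configuration.

```python
def notification_matches(key, prefix="", suffix=""):
    """Return True if an object key passes a prefix/suffix notification filter."""
    # Matching is literal and case-sensitive; an empty filter matches everything.
    return key.startswith(prefix) and key.endswith(suffix)


print(notification_matches("uploads/report.csv", prefix="uploads/", suffix=".csv"))
print(notification_matches("uploads/report.CSV", prefix="uploads/", suffix=".csv"))
print(notification_matches("incoming/report.csv", prefix="uploads/", suffix=".csv"))
```

A mismatch in case or a missing trailing slash in the prefix is enough to stop every trigger, and nothing appears in the function's logs when that happens.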

Common mistakes

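Setting a tight latency budget without accounting for cold starts. Init Duration is added on top of the handler's Duration, so a caller with a 5-second deadline and a 3-second Init Duration leaves the handler only 2 seconds on a cold start. The function timeout itself normally covers only the invoke phase, unless init runs past its 10-second limit and is retried during the first invocation.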
Not configuring a DLQ or destination for async functions. Without a failure destination, async invocations that exhaust retries are silently discarded. You won’t know they failed and the event data is lost.
Not enabling X-Ray until after a production incident. X-Ray traces only exist for invocations that occurred while tracing was enabled. Enable it proactively so the data is there when failures happen.
Diagnosing memory exhaustion by looking for an error message. Lambda does not produce a clear error message when a function hits its memory limit. The reliable signal is Max Memory Used == Memory Size in the REPORT line combined with an abrupt log ending and no stack trace.
Adding Lambda memory to fix a timeout. Timeouts are caused by time, not memory. More memory gives more vCPU, which helps CPU-bound work, but if the bottleneck is a slow downstream service, more memory won’t help. Use X-Ray to confirm where time is going before changing memory settings.
Configuring a Lambda-level DLQ for SQS-triggered functions. Lambda-level DLQs and destinations apply to async invocations (SNS, S3, EventBridge). For SQS-triggered Lambda, failure handling is managed through the SQS queue’s redrive policy, not through the Lambda function configuration.

Summary

Read the REPORT line first. Duration near the timeout means a timeout. Max Memory Used = Memory Size with no stack trace means memory exhaustion. Stack trace with short Duration means an unhandled exception.
Enable X-Ray active tracing to see time broken down by downstream service call. Essential for diagnosing timeouts caused by DynamoDB throttling, slow database queries, or external API latency.
For async invocations (S3, SNS, EventBridge), configure a Lambda destination or DLQ so failed events are captured, not silently discarded.
For SQS-triggered Lambda, configure a redrive policy and DLQ on the SQS source queue. Lambda-level DLQs do not apply to poll-based invocations.
Cold starts only affect the first invocation on a new execution environment. If failures are consistent across all invocations, cold starts are not the root cause.

Frequently asked questions

How do I tell if a Lambda function timed out versus threw an exception?

Timeout: the REPORT line shows Duration equal to the configured timeout, and the log ends with "Task timed out after X.XX seconds" with no stack trace. Exception: the log shows a stack trace before the END line, and Duration is well below the timeout. Both appear as failed invocations, but the REPORT line and the presence or absence of a stack trace distinguish them immediately.

What happens to messages when a Lambda triggered by SQS fails?

Lambda polls SQS through an event source mapping and deletes each message only after the function processes it successfully. On failure, the message is not deleted. It becomes visible again after the queue visibility timeout expires and Lambda retries it. SQS tracks delivery attempts via the ApproximateReceiveCount attribute. When that count exceeds the queue maxReceiveCount (set in the queue redrive policy), SQS moves the message to the configured dead-letter queue. Without a redrive policy, messages retry until the queue message retention period expires. The DLQ for SQS-triggered Lambda is configured on the SQS source queue, not on the Lambda function. Lambda-level DLQs and destinations do not apply to poll-based invocations.

Does enabling X-Ray cost extra?

X-Ray has a free tier of 100,000 traces recorded and 1,000,000 traces scanned per month. Beyond that, there are per-trace charges. Enabling X-Ray temporarily for debugging adds minimal cost. In production, configure sampling rules to trace a fraction of invocations rather than all of them.

Max Memory Used equals Memory Size in the REPORT line. Does that always mean a memory failure?

Not always. If the function succeeded and Max Memory Used equals Memory Size, the function was close to the limit but may not have been killed. If the function failed and the log ends abruptly with no stack trace, that is a reliable signal the function hit its memory limit. Increase the memory allocation and rerun. If the failure disappears, memory exhaustion was the cause.

My Lambda works with small payloads but times out with large ones. What should I check?

Three things: (1) whether the timeout is set too low for the time required to process large payloads; (2) whether a downstream service (S3 download, database query) takes longer with larger inputs — enable X-Ray to check per-segment timing; (3) whether memory pressure increases significantly with large payloads, causing slow garbage collection. The REPORT line Duration and Max Memory Used together usually point to the answer.

Last verified: 13 May 2026 Cloud services change frequently. Verify details against official documentation before making infrastructure decisions.
