Debug AWS Lambda Failures: Logs, Timeouts, OOM & X-Ray
Lambda runtime failures are harder to diagnose than startup errors because the cause depends on what your code does, what input it receives, and how downstream services behave. This page walks through the full diagnostic sequence: reading the REPORT line in CloudWatch Logs, identifying timeouts, memory exhaustion, and exceptions, tracing downstream bottlenecks with X-Ray, and handling async invocation failures that never surface to the caller.
What is a Lambda runtime failure?
A Lambda failure means the function invocation did not complete successfully. There are three distinct patterns:
- The function never starts. The deployment package is broken, the handler path is wrong, or a dependency is missing. This is a startup failure. See Lambda Function Failed to Start for that case.
- The function starts but fails during execution. Your code threw an unhandled exception, the function ran out of time, the function hit its memory limit, or a downstream service stopped responding.
- The function succeeds sometimes but fails under load or on specific inputs. Concurrency throttling, intermittent downstream errors, or input data that triggers an untested code path.
This page covers the second and third patterns: failures that happen at runtime, after the function has started.
When to use this guide
Come here when you see any of these symptoms:
- CloudWatch metrics show invocation errors but the log messages are not obvious
- Logs end with “Task timed out after X.XX seconds”
- The log ends abruptly with no error message or stack trace
- An async event (S3, SNS, EventBridge) is not producing results downstream
- The function fails intermittently; some invocations succeed, some don’t
- Failures appear under load but not in isolated testing
If the function fails immediately on every invocation and never produces a START line, check Lambda Function Failed to Start instead.
How Lambda failure debugging works
Every Lambda invocation writes a log stream to CloudWatch. For runtime failures, the diagnostic sequence is:
- Identify the invocation type. Synchronous (API Gateway, direct SDK call) vs asynchronous (S3, SNS, EventBridge) vs poll-based (SQS, Kinesis). Async and poll-based failures don’t automatically surface to the caller; you need a failure destination to see them.
- Read the REPORT line. Duration and memory usage are in every invocation log, and Init Duration is added on cold starts. These numbers narrow the failure type immediately.
- Look for a stack trace or timeout message. A stack trace means a code error. A timeout message means the function ran out of time. No message and an abrupt log ending often points to memory exhaustion.
- Separate Lambda problem from downstream problem. A timeout doesn’t necessarily mean your code is wrong. The bottleneck may be a downstream service (DynamoDB, RDS, an external API). Enable X-Ray to see time broken down by service call.
- Check async failure handling. For async invocations, failures are retried internally, then discarded unless you’ve configured a dead-letter queue or Lambda destination.
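The first step of that sequence can be written down as a small lookup. This is a minimal sketch built only from the event sources named in this guide; the source labels (such as `sdk-invoke`) and the function name are illustrative, not AWS identifiers.

```python
# Mapping built from the sources named in this guide; extend it for your own triggers.
# "sdk-invoke" assumes a direct SDK call that waits for the response.
INVOCATION_TYPES = {
    "apigateway": "synchronous",
    "sdk-invoke": "synchronous",
    "s3": "asynchronous",
    "sns": "asynchronous",
    "eventbridge": "asynchronous",
    "sqs": "poll-based",
    "kinesis": "poll-based",
}

WHERE_FAILURES_SURFACE = {
    "synchronous": "returned to the caller",
    "asynchronous": "function DLQ or Lambda destination, if configured",
    "poll-based": "source-side handling, such as the SQS queue's redrive policy",
}


def where_to_look(event_source):
    """Return (invocation type, where failures surface) for a known event source."""
    invocation_type = INVOCATION_TYPES.get(event_source.lower())
    if invocation_type is None:
        raise ValueError(f"unknown event source: {event_source!r}")
    return invocation_type, WHERE_FAILURES_SURFACE[invocation_type]


print(where_to_look("S3"))
print(where_to_look("sqs"))
```

Knowing the invocation type up front tells you whether an empty error log on the caller side is meaningful or simply expected.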
What to check first
Work through this sequence before diving into code changes:
- Open CloudWatch Logs for the function. Log group: /aws/lambda/{function-name}.
- Find the REPORT line for a failing invocation. Is Duration close to the configured timeout? Likely a timeout.
- Is Max Memory Used equal to Memory Size? Possible memory exhaustion.
- Is there a stack trace in the log? Unhandled exception. Read the exception type and message.
- Does the log end without any error message or trace? Strong signal of memory exhaustion.
- Is the function asynchronous? Check whether a DLQ or Lambda destination is configured and whether it received anything.
- Is the failure intermittent and only appears under load? Check Lambda throttling metrics in CloudWatch. Lambda scaling and concurrency limits can cause sporadic failures when the concurrency ceiling is hit.
Startup failures vs runtime failures
Lambda failures split into two categories with different debugging paths:
Startup failure: the function never executes your handler. The error appears immediately, either before the START line or right after it. Common messages:
- Runtime.ImportModuleError: a dependency is missing from the deployment package
- Runtime.ExitError: the runtime process exited before the handler ran
- Handler 'lambda_handler' missing on module 'handler': wrong handler path in the function configuration
Runtime failure: the function starts, executes your handler, and then fails. The log shows real work happening: database calls, processing lines, log statements from your code. Then either a stack trace, a timeout message, or an abrupt ending.
This page is for runtime failures. For startup failures, see Lambda Function Failed to Start.
Read the REPORT line first
Every Lambda invocation ends with a REPORT line in CloudWatch Logs. It’s the fastest diagnostic signal available and should always be the first thing you read.
Mental model: Think of the REPORT line as a flight data recorder. It captures the same three measurements on every invocation: how long the function ran, how much memory it used, and whether it started cold. You read it after the failure to understand what the function was doing right before it ended.
START RequestId: abc-123-def-456 Version: $LATEST
...your log output...
END RequestId: abc-123-def-456
REPORT RequestId: abc-123-def-456 Duration: 1234.56 ms Billed Duration: 1235 ms Memory Size: 256 MB Max Memory Used: 87 MB Init Duration: 412.34 ms
The fields:
- Duration — how long your handler ran, in milliseconds
- Billed Duration — Duration rounded up to the nearest 1ms (used for billing)
- Max Memory Used — the peak memory the function used during this invocation
- Memory Size — the memory limit configured for the function
- Init Duration — only appears on cold starts; the time Lambda spent initializing the execution environment before the handler ran
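If you need these fields programmatically, for example when scanning exported logs, a regular expression is enough. The sketch below assumes the field order shown in the sample REPORT line above; the function name and dictionary keys are illustrative.

```python
import re

# Field layout taken from the sample REPORT line; Init Duration is optional.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_size_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_used_mb>\d+) MB"
    r"(?:\s+Init Duration: (?P<init_ms>[\d.]+) ms)?"
)


def parse_report_line(line):
    """Return the REPORT fields as a dict, or None if the line is not a REPORT line."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "request_id": fields["request_id"],
        "duration_ms": float(fields["duration_ms"]),
        "billed_ms": float(fields["billed_ms"]),
        "memory_size_mb": int(fields["memory_size_mb"]),
        "max_memory_used_mb": int(fields["max_memory_used_mb"]),
        # None on warm invocations, where the field is absent
        "init_ms": float(fields["init_ms"]) if fields["init_ms"] else None,
    }


sample = (
    "REPORT RequestId: abc-123-def-456 Duration: 1234.56 ms "
    "Billed Duration: 1235 ms Memory Size: 256 MB "
    "Max Memory Used: 87 MB Init Duration: 412.34 ms"
)
print(parse_report_line(sample))
```

The pattern uses `\s+` between fields so it tolerates both spaces and tabs as separators.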
Three failure patterns visible from the REPORT line alone:
Timeout: Duration matches the configured timeout, and the log contains “Task timed out after X.XX seconds”.
[ERROR] Task timed out after 30.01 seconds
REPORT RequestId: abc-123 Duration: 30010.00 ms Billed Duration: 30000 ms Memory Size: 256 MB Max Memory Used: 89 MB
Memory exhaustion: Max Memory Used equals Memory Size, and the log ends without a stack trace. When a function hits its memory limit, Lambda kills the process immediately, so your code never gets the chance to log an exception. At most, the platform may add a terse line such as "Runtime exited with error: signal: killed". The abrupt ending combined with Max Memory Used = Memory Size is the reliable signal.
REPORT RequestId: abc-123 Duration: 5234.56 ms Billed Duration: 5235 ms Memory Size: 256 MB Max Memory Used: 256 MB
Unhandled exception: Duration is well under the timeout, and there is a stack trace before the REPORT line.
[ERROR] ValueError: invalid literal for int() with base 10: 'abc'
Traceback (most recent call last):
  File "/var/task/handler.py", line 23, in lambda_handler
    count = int(event['count'])
ValueError: invalid literal for int() with base 10: 'abc'
END RequestId: abc-123
REPORT RequestId: abc-123 Duration: 45.23 ms Billed Duration: 46 ms Memory Size: 256 MB Max Memory Used: 62 MB
Short Duration and a stack trace means a code error, not a resource limit.
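The three patterns can be combined into one rough classifier. This is a sketch under stated assumptions: the traceback marker is specific to Python runtimes, the 1% tolerance for "near the timeout" is an arbitrary choice, and the function name and labels are illustrative.

```python
def classify_invocation(duration_ms, timeout_s, max_memory_used_mb,
                        memory_size_mb, log_lines):
    """Apply the REPORT-line heuristics and return a short label."""
    log_text = "\n".join(log_lines)
    if "Task timed out after" in log_text:
        return "timeout"
    # Python-runtime marker; other runtimes print different stack trace headers
    if "Traceback (most recent call last)" in log_text:
        return "unhandled exception"
    if max_memory_used_mb >= memory_size_mb:
        return "probable memory exhaustion"
    # Duration within 1% of the timeout but no timeout message: worth a closer look
    if duration_ms >= timeout_s * 1000 * 0.99:
        return "possible timeout"
    return "no failure signal in REPORT line"


print(classify_invocation(30010.0, 30, 89, 256,
                          ["[ERROR] Task timed out after 30.01 seconds"]))
print(classify_invocation(5234.56, 30, 256, 256, ["Starting DB query"]))
```

The memory check runs after the stack trace check on purpose: a function that throws while near its limit is still a code error first.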
Note: For monitoring Lambda at scale, including tracking error rates, memory trends, and cold start frequency across many invocations, see AWS Lambda Monitoring with CloudWatch.
Failure type comparison
| Failure type | Symptom in logs | First thing to check | Likely fix |
|---|---|---|---|
| Timeout | “Task timed out after X.XX seconds”, Duration equals timeout | X-Ray to find the slow segment | Increase timeout or fix the downstream service |
| Memory exhaustion | Max Memory Used = Memory Size, log ends abruptly | Increase memory allocation, rerun to confirm | Increase Lambda memory or reduce memory usage in code |
| Unhandled exception | Stack trace before END line, Duration well under timeout | Read the exception type and message | Fix the code error |
| Downstream bottleneck | High Duration, no exception, downstream call is slow | X-Ray trace, downstream service metrics | Scale or fix the downstream service |
| Throttle / concurrency limit | TooManyRequestsException or errors only under load | Lambda concurrency metrics in CloudWatch | Increase reserved concurrency or request a limit increase |
| Async failure not received | No errors in Lambda logs, no result downstream | DLQ or Lambda destination configured? | Configure a DLQ or destination for the function |
CloudWatch Logs workflow
Every Lambda invocation writes to a log group named /aws/lambda/{function-name}. Useful queries for finding failures:
Stream logs in real time during testing:
aws logs tail /aws/lambda/my-function --follow
Filter for errors in the last hour:
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000)
Filter for timeout events specifically:
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "Task timed out"
Filter for cold starts:
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "Init Duration"
What to look for when reading logs:
- The REPORT line first, always. It tells you the failure category before you read anything else.
- The gap between your last log statement and the REPORT line. If your code logged “Starting DB query” and the next line is the REPORT, the failure happened during that call.
- The exception type. TimeoutError, ConnectionError, and AccessDeniedException each point to different root causes.
- Whether errors are consistent or intermittent. If the same error appears on every invocation, the cause is systematic. If it appears on some invocations and not others, look at what differs between them: input size, time of day, concurrency level.
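The "gap before the REPORT line" check is easy to automate over a downloaded log stream. The sketch below assumes the lines come from a single log stream in order, with START, END, and REPORT platform lines as shown earlier; the function name is illustrative.

```python
def last_line_before_report(log_lines):
    """Map each RequestId to the last application log line seen before its REPORT line."""
    last_seen = {}
    current_request = None
    last_app_line = None
    for line in log_lines:
        if line.startswith("START RequestId: "):
            current_request = line.split()[2]
            last_app_line = None
        elif line.startswith("END RequestId: "):
            continue  # platform line, not application output
        elif line.startswith("REPORT RequestId: "):
            if current_request is not None:
                last_seen[current_request] = last_app_line
            current_request = None
        elif current_request is not None:
            last_app_line = line
    return last_seen


lines = [
    "START RequestId: abc-123 Version: $LATEST",
    "Loaded 40 records",
    "Starting DB query",
    "END RequestId: abc-123",
    "REPORT RequestId: abc-123 Duration: 30010.00 ms",
]
print(last_line_before_report(lines))
```

An invocation whose last application line is the start of an operation, with no matching completion line, points at that operation as the place where the function stalled or was killed.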
For deeper querying, log metric filters, and alarms on Lambda error rates, see CloudWatch Logs.
X-Ray for downstream failures
CloudWatch Logs tell you that something failed. X-Ray tells you where the time went. This distinction matters most for timeout failures where the Lambda code is correct but a downstream service is slow or throttling.
Enable X-Ray before you need it: X-Ray traces only exist for invocations that occurred while tracing was enabled. Enable active tracing in production now, while things are working, so the data is there when something breaks.
Enable active tracing on a function:
aws lambda update-function-configuration \
--function-name my-function \
--tracing-config Mode=Active
The execution role needs xray:PutTraceSegments and xray:PutTelemetryRecords. The AWSXRayDaemonWriteAccess managed policy covers both.
Once enabled, every invocation produces a trace in the X-Ray console showing:
- Initialization segment — cold start time, if applicable
- Invocation segment — total handler duration
- Subsegments — individual AWS service calls (DynamoDB, S3, SQS, etc.) with per-call durations and error rates
If your function times out because DynamoDB is throttling writes, X-Ray shows the DynamoDB subsegment consuming nearly all the invocation time. Without X-Ray, the log only shows a total duration of 30 seconds with no indication of which call was responsible.
For custom subsegments around your own code (external HTTP calls, database queries):
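from aws_xray_sdk.core import xray_recorder

def lambda_handler(event, context):
    # in_subsegment() is the context-manager form; begin_subsegment() has to be
    # paired with an explicit end_subsegment() call instead.
    with xray_recorder.in_subsegment('external-api-call') as subsegment:
        # call_external_api is a placeholder for your own HTTP client wrapper
        result = call_external_api(event['id'])
        subsegment.put_annotation('status_code', result.status_code)
    # Return a JSON-serializable value rather than the raw response object
    return {'status_code': result.status_code}
For how traces, segments, and sampling work in production, see Distributed Tracing in AWS.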
Async invocation failures
When Lambda is invoked synchronously (API Gateway, direct SDK call), failures return immediately to the caller. When Lambda is invoked asynchronously by S3 event notifications, SNS topics, or EventBridge rules, failures are retried internally and then silently discarded unless you configure a failure destination.
Analogy: Configuring a DLQ or Lambda destination for an async function is like setting up voicemail. Without it, if Lambda can’t answer the call after a few retries, the event just disappears. There is no missed-call log. With a destination configured, every failed event lands somewhere you can inspect and replay.
How async retries work:
Lambda retries a failed async invocation up to two more times by default (three total attempts). You can configure the retry count (0, 1, or 2) and the maximum event age (up to 6 hours). If an event ages out before all retries complete, it goes to the failure destination regardless of the retry count.
Dead-letter queue (DLQ):
Configure an SQS queue to receive events that exhaust all retries:
aws lambda update-function-configuration \
--function-name my-async-function \
--dead-letter-config TargetArn=arn:aws:sqs:us-east-1:111122223333:lambda-dlq
Failed events arrive in the DLQ with metadata about the failure. You can inspect the messages to understand what data was being processed when the failure occurred.
Lambda destinations (preferred for new functions):
Destinations route both success and failure outcomes to SQS, SNS, EventBridge, or another Lambda function. Unlike a DLQ, they include the full request payload and error response, making it much easier to understand what failed and why:
aws lambda put-function-event-invoke-config \
--function-name my-async-function \
--destination-config '{
"OnFailure": {"Destination": "arn:aws:sqs:us-east-1:111122223333:failures-queue"},
"OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:111122223333:success-queue"}
}'
Use a DLQ when you only need to capture failures. Use destinations when you also want to route successes or need the full event context for debugging.
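The retry count, maximum event age, and failure destination can be set together in one call. The sketch below only builds and validates the arguments; it assumes the parameter names of the Lambda `PutFunctionEventInvokeConfig` API as exposed by boto3, and a 60-second lower bound on event age taken from that API's documented range. The helper name is illustrative.

```python
def build_event_invoke_config(function_name, on_failure_arn,
                              max_retries=2, max_event_age_s=21600):
    """Validate async settings and return kwargs for put_function_event_invoke_config."""
    if max_retries not in (0, 1, 2):
        raise ValueError("max_retries must be 0, 1, or 2")
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("max_event_age_s must be between 60 and 21600 (6 hours)")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_s,
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }


kwargs = build_event_invoke_config(
    "my-async-function",
    "arn:aws:sqs:us-east-1:111122223333:failures-queue",
    max_retries=1,
    max_event_age_s=3600,
)
# With boto3 installed and credentials configured, this would apply the settings:
#   import boto3
#   boto3.client("lambda").put_function_event_invoke_config(**kwargs)
print(kwargs)
```

Validating before the API call keeps a typo in a deployment script from silently leaving the defaults in place.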
SQS-triggered Lambda works differently
When Lambda is triggered by an SQS queue via an event source mapping, it is a poll-based invocation, not a true async invocation. Lambda reads message batches from the queue and the event source mapping deletes messages only after successful processing. On failure, messages are not deleted — they become visible again after the queue visibility timeout and Lambda retries them. When the SQS maxReceiveCount is exceeded, SQS moves messages to the DLQ configured on the source queue via a redrive policy. Lambda-level DLQs and destinations do not apply to SQS-triggered functions. See AWS Lambda Event Triggers Explained for the full invocation type breakdown.
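For the SQS case, the redrive policy lives on the source queue as a JSON-encoded `RedrivePolicy` attribute. The sketch below builds that attribute and models the move-to-DLQ rule described above; the DLQ ARN, the default of 5 receives, and both function names are assumptions for illustration.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count=5):
    """Return the Attributes dict for an SQS SetQueueAttributes call on the source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the policy as a JSON string, not a nested object
    return {"RedrivePolicy": json.dumps(policy)}


def is_moved_to_dlq(receive_count, max_receive_count):
    """A message is moved once its receive count exceeds maxReceiveCount."""
    return receive_count > max_receive_count


attrs = build_redrive_attributes("arn:aws:sqs:us-east-1:111122223333:orders-dlq")
# With boto3 available, the attributes would be applied to the source queue:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=source_queue_url, Attributes=attrs)
print(attrs)
print(is_moved_to_dlq(receive_count=6, max_receive_count=5))
```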
For SNS delivery failures that happen before the Lambda invocation even starts, see SNS Message Delivery Failures.
Cold start debugging
A cold start happens when Lambda initializes a new execution environment. The Init Duration field in the REPORT line shows how long this took. Cold starts only affect the first invocation on a new environment; subsequent invocations on the same environment run without Init Duration.
When cold starts are the real problem:
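- Init Duration is large relative to what the caller can tolerate. The function timeout applies to the invoke phase, and the REPORT Duration excludes Init Duration, but the caller still waits for both: a 4-second init plus a 2-second handler is a 6-second response on a cold start. For on-demand functions, an init phase that overruns its 10-second limit is retried as part of the first invocation, where it does count against the configured timeout.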
- The function is attached to a VPC. VPC cold starts involve ENI attachment, which adds latency.
- The deployment package or Lambda layers are very large. Large packages take longer to unzip and initialize.
- The function initializes a heavy framework or large model at the module level, outside the handler.
When cold starts are just noise:
- Init Duration is short (under 500ms) and timeouts happen on warm invocations too. Cold starts are not the bottleneck.
- Failures affect all invocations consistently, not just the first invocation on each environment.
To filter for cold starts in logs:
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "Init Duration"
If cold starts are genuinely causing failures, provisioned concurrency pre-initializes execution environments and eliminates Init Duration for those instances. For package size and memory optimization strategies, see Lambda Cost Optimisation.
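To judge whether cold starts are the problem or just noise, it helps to know how often they occur and how long the worst one took. A minimal sketch over REPORT lines pulled from the log group; the function name and result keys are illustrative.

```python
import re

INIT_PATTERN = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_summary(log_lines):
    """Summarize cold starts from the REPORT lines in a batch of log lines."""
    reports = [line for line in log_lines if line.startswith("REPORT ")]
    init_durations = []
    for line in reports:
        match = INIT_PATTERN.search(line)
        if match:
            init_durations.append(float(match.group(1)))
    return {
        "invocations": len(reports),
        "cold_starts": len(init_durations),
        "cold_start_ratio": len(init_durations) / len(reports) if reports else 0.0,
        "max_init_ms": max(init_durations) if init_durations else None,
    }


logs = [
    "REPORT RequestId: a Duration: 1234.56 ms Billed Duration: 1235 ms Memory Size: 256 MB Max Memory Used: 87 MB Init Duration: 412.34 ms",
    "REPORT RequestId: b Duration: 45.23 ms Billed Duration: 46 ms Memory Size: 256 MB Max Memory Used: 62 MB",
    "REPORT RequestId: c Duration: 51.10 ms Billed Duration: 52 ms Memory Size: 256 MB Max Memory Used: 62 MB",
]
print(cold_start_summary(logs))
```

A low cold start ratio alongside failures on warm invocations is a strong hint to look elsewhere.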
Real troubleshooting scenarios
Scenario 1: Lambda times out after months of working fine
A function processes records from DynamoDB and writes summaries to a results table. Timeout is 30 seconds. Logs show:
[ERROR] Task timed out after 30.01 seconds
REPORT RequestId: xyz Duration: 30010.23 ms Billed Duration: 30000 ms Memory Size: 256 MB Max Memory Used: 88 MB
Memory is fine: 88 MB used out of 256 MB. It ran the full 30 seconds. Something is blocking the function, not exhausting its resources.
Enable X-Ray and trigger another invocation. The trace shows:
- DynamoDB Query: 45ms (normal)
- DynamoDB PutItem: 29,800ms (abnormal; nearly the entire timeout budget)
The write is taking nearly 30 seconds. That pattern is consistent with throttling: write capacity is exhausted, DynamoDB rejects the write, and the SDK keeps retrying with backoff until the function runs out of time.
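The same reading of the trace can be made explicit by ranking subsegments by their share of the recorded time. This is a minimal sketch using the two durations from the trace above; the dictionary layout and function name are illustrative, not an X-Ray API.

```python
def rank_subsegments(durations_ms):
    """Return (name, duration_ms, share_of_total) tuples, slowest first."""
    total = sum(durations_ms.values())
    if total == 0:
        return []
    ranked = sorted(durations_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]


trace = {"DynamoDB Query": 45, "DynamoDB PutItem": 29800}
for name, ms, share in rank_subsegments(trace):
    print(f"{name}: {ms} ms ({share:.1%} of recorded time)")
```

When a single subsegment accounts for almost all of the recorded time, the fix belongs in that downstream service rather than in the handler code.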
Confirm with CloudWatch metrics:
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name WriteThrottleEvents \
--dimensions Name=TableName,Value=results-table \
--start-time $(date -u -d '2 days ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 \
--statistics Sum
The metrics confirm WriteThrottleEvents spiking as data volume grew past provisioned write capacity.
Fix: Switch to on-demand billing mode, which scales automatically:
aws dynamodb update-table \
--table-name results-table \
--billing-mode PAY_PER_REQUEST
The Lambda function code was correct throughout. X-Ray exposed a downstream bottleneck that CloudWatch Logs alone would not have revealed.
Scenario 2: Permission error discovered only in production
A Lambda function creates S3 objects as part of processing. It works during testing but fails in production on some invocations.
Symptom in logs:
[ERROR] ClientError: An error occurred (AccessDenied) when calling the
PutObject operation: User: arn:aws:sts::111122223333:assumed-role/process-role/process-function
is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::prod-results-bucket/..."
Root cause: The function was deployed to production without updating its IAM policy. The process-role grants s3:PutObject on the development bucket ARN only. In production, the bucket name is different and the policy doesn’t cover it.
Fix: Update the IAM policy to include the production bucket ARN, then use the IAM Policy Simulator to confirm before redeploying. See Fixing IAM AccessDenied Errors for the full simulation workflow.
Scenario 3: Async function produces no output and shows no errors
A Lambda function triggered by S3 events is supposed to write processed results to an output bucket. The source bucket is receiving files, but the output bucket has no new objects.
Check invocation count in CloudWatch first. If invocations are zero, the function is not being triggered. Check the S3 event notification configuration. A prefix or suffix filter on the notification may not match the uploaded file names.
If invocations are non-zero but results are missing: check CloudWatch Logs. If logs show successful invocations, the function is running but writing to the wrong bucket name or key prefix.
If invocations show errors: enable a Lambda destination for OnFailure to capture the failed event payloads and understand what input is causing the failures.
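The branching in this scenario can be captured as a short triage function. The sketch assumes you have already read the Invocations and Errors metrics from CloudWatch and counted new objects in the output bucket; the function name, parameter names, and returned messages are illustrative.

```python
def triage_async_function(invocations, errors, output_objects_written):
    """Return the next check for an S3-triggered function that produces no output."""
    if invocations == 0:
        return "check the S3 event notification, including prefix and suffix filters"
    if errors > 0:
        return "configure an OnFailure destination and inspect the failed payloads"
    if output_objects_written == 0:
        return "check the output bucket name and key prefix used by the function"
    return "function is producing output"


print(triage_async_function(invocations=0, errors=0, output_objects_written=0))
print(triage_async_function(invocations=120, errors=0, output_objects_written=0))
```

Checking the invocation count first avoids debugging function code that never ran.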
Common mistakes
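- Setting a short timeout without accounting for cold starts. The function timeout covers the invoke phase, but anything the handler initializes lazily on its first call (SDK clients, connections, model loads) runs inside that budget, and an init phase that overruns its 10-second limit is retried under the function timeout. Size the timeout from the Duration of cold invocations, not warm ones.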
- Not configuring a DLQ or destination for async functions. Without a failure destination, async invocations that exhaust retries are silently discarded. You won’t know they failed and the event data is lost.
- Not enabling X-Ray until after a production incident. X-Ray traces only exist for invocations that occurred while tracing was enabled. Enable it proactively so the data is there when failures happen.
- Diagnosing memory exhaustion by looking for an error message. Lambda does not produce a clear error message when a function hits its memory limit. The reliable signal is Max Memory Used == Memory Size in the REPORT line combined with an abrupt log ending and no stack trace.
- Adding Lambda memory to fix a timeout. Timeouts are caused by time, not memory. More memory gives more vCPU, which helps CPU-bound work, but if the bottleneck is a slow downstream service, more memory won’t help. Use X-Ray to confirm where time is going before changing memory settings.
- Configuring a Lambda-level DLQ for SQS-triggered functions. Lambda-level DLQs and destinations apply to async invocations (SNS, S3, EventBridge). For SQS-triggered Lambda, failure handling is managed through the SQS queue’s redrive policy, not through the Lambda function configuration.
Summary
- Read the REPORT line first. Duration near the timeout means a timeout. Max Memory Used = Memory Size with no stack trace means memory exhaustion. Stack trace with short Duration means an unhandled exception.
- Enable X-Ray active tracing to see time broken down by downstream service call. Essential for diagnosing timeouts caused by DynamoDB throttling, slow database queries, or external API latency.
- For async invocations (S3, SNS, EventBridge), configure a Lambda destination or DLQ so failed events are captured, not silently discarded.
- For SQS-triggered Lambda, configure a redrive policy and DLQ on the SQS source queue. Lambda-level DLQs do not apply to poll-based invocations.
- Cold starts only affect the first invocation on a new execution environment. If failures are consistent across all invocations, cold starts are not the root cause.
Frequently asked questions
How do I tell if a Lambda function timed out versus threw an exception?
Timeout: the REPORT line shows Duration equal to the configured timeout, and the log ends with "Task timed out after X.XX seconds" with no stack trace. Exception: the log shows a stack trace before the END line, and Duration is well below the timeout. Both appear as failed invocations, but the REPORT line and the presence or absence of a stack trace distinguish them immediately.
What happens to messages when a Lambda triggered by SQS fails?
Lambda polls SQS through an event source mapping and deletes each message only after the function processes it successfully. On failure, the message is not deleted. It becomes visible again after the queue visibility timeout expires and Lambda retries it. SQS tracks delivery attempts via the ApproximateReceiveCount attribute. When that count exceeds the queue maxReceiveCount (set in the queue redrive policy), SQS moves the message to the configured dead-letter queue. Without a redrive policy, messages retry until the queue message retention period expires. The DLQ for SQS-triggered Lambda is configured on the SQS source queue, not on the Lambda function. Lambda-level DLQs and destinations do not apply to poll-based invocations.
Does enabling X-Ray cost extra?
X-Ray has a free tier of 100,000 traces recorded and 1,000,000 traces scanned per month. Beyond that, there are per-trace charges. Enabling X-Ray temporarily for debugging adds minimal cost. In production, configure sampling rules to trace a fraction of invocations rather than all of them.
Max Memory Used equals Memory Size in the REPORT line. Does that always mean a memory failure?
Not always. If the function succeeded and Max Memory Used equals Memory Size, the function was close to the limit but may not have been killed. If the function failed and the log ends abruptly with no stack trace, that is a reliable signal the function hit its memory limit. Increase the memory allocation and rerun. If the failure disappears, memory exhaustion was the cause.
My Lambda works with small payloads but times out with large ones. What should I check?
Three things: (1) whether the timeout is set too low for the time required to process large payloads; (2) whether a downstream service (S3 download, database query) takes longer with larger inputs — enable X-Ray to check per-segment timing; (3) whether memory pressure increases significantly with large payloads, causing slow garbage collection. The REPORT line Duration and Max Memory Used together usually point to the answer.