Domain 4: Troubleshooting and Optimization
Topic 4 of 4 · Study notes
AWS Certified Developer – Associate (DVA-C02)
Exam Code: DVA-C02 | Level: Associate
Domain Weight: 18% | Total Domains: 4 | Passing Score: 720/1000
Table of Contents
- Amazon CloudWatch
- AWS X-Ray — Distributed Tracing
- AWS CloudTrail
- Lambda — Optimization & Troubleshooting
- DynamoDB — Performance & Optimization
- API Gateway — Optimization & Troubleshooting
- SQS & SNS — Troubleshooting
- Caching Strategies
- Common Error Patterns & HTTP Status Codes
- Application Optimization Patterns
- Exam Tips & Quick Reference
1. Amazon CloudWatch
CloudWatch is the unified observability service for AWS. It collects metrics, logs, and events, and enables automated responses through alarms and actions.
1.1 Metrics — Built-in & Custom
Metric Fundamentals:
| Concept | Definition |
|---|---|
| Namespace | Container for metrics (AWS/Lambda, AWS/EC2, MyApp/Checkout) |
| Metric | A time-ordered set of data points (e.g., Errors, Duration, CPUUtilization) |
| Dimension | A name-value pair that identifies a metric (FunctionName=MyFunc) |
| Resolution | Standard (1-minute granularity) or High Resolution (1-second granularity) |
| Retention | 3h for < 60s resolution; 15 days for 60s; 63 days for 5m; 15 months for 1h |
Critical Lambda CloudWatch Metrics:
| Metric | What It Measures | Troubleshooting Use |
|---|---|---|
| `Invocations` | Number of function invocations | Traffic volume |
| `Errors` | Invocations that threw an error | Function failure rate |
| `Duration` | Execution time per invocation | Performance; timeout risk |
| `Throttles` | Invocations rejected due to concurrency limit | Concurrency exhaustion |
| `ConcurrentExecutions` | Running functions at a given time | Concurrency usage |
| `IteratorAge` | Age of last processed Kinesis/DDB record | Stream processing lag |
| `DeadLetterErrors` | Failed DLQ deliveries | DLQ misconfiguration |
Critical DynamoDB CloudWatch Metrics:
| Metric | What It Measures |
|---|---|
| `ConsumedReadCapacityUnits` | RCUs used per minute |
| `ConsumedWriteCapacityUnits` | WCUs used per minute |
| `ProvisionedReadCapacityUnits` | Provisioned RCUs |
| `SystemErrors` | DynamoDB server-side errors (5xx) |
| `UserErrors` | Client-side errors (4xx) |
| `SuccessfulRequestLatency` | Per-operation latency |
| `ThrottledRequests` | Requests throttled due to exceeding capacity |
Custom Metrics:
import boto3
cloudwatch = boto3.client('cloudwatch')
# Publish a custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp/Checkout',
    MetricData=[
        {
            'MetricName': 'OrderProcessingTime',
            'Value': 245.5,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Region', 'Value': 'us-east-1'}
            ],
            'StorageResolution': 1  # 1 = High Resolution (1s); 60 = Standard
        }
    ]
)
High Resolution Metrics: Standard metrics have 1-minute granularity. High Resolution metrics (StorageResolution=1) have 1-second granularity and support alarms at 10s or 30s periods. They are more expensive but essential for fast-responding systems.
Metric Math: Perform mathematical operations on metrics to derive new values. Example: calculate error rate = Errors / Invocations * 100 directly in CloudWatch.
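The error-rate expression can be sketched in Python. This is a minimal illustration, not an official recipe: `error_rate_percent` and `build_error_rate_queries` are hypothetical helper names, and the query list is shaped for the `MetricDataQueries` parameter of `cloudwatch.get_metric_data()` under the assumption that the function publishes the standard `AWS/Lambda` metrics.

```python
def error_rate_percent(errors: float, invocations: float) -> float:
    """Local mirror of the metric math expression Errors / Invocations * 100."""
    if invocations == 0:
        # Assumption: report 0% when there is no traffic instead of dividing by zero.
        return 0.0
    return errors / invocations * 100


def build_error_rate_queries(function_name: str, period: int = 60) -> list:
    """Build a MetricDataQueries list that asks CloudWatch to compute the error rate."""

    def stat(query_id: str, metric_name: str) -> dict:
        return {
            "Id": query_id,  # referenced by name inside the expression below
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the computed expression is returned
        }

    return [
        stat("errors", "Errors"),
        stat("invocations", "Invocations"),
        {
            "Id": "error_rate",
            "Expression": "errors / invocations * 100",
            "Label": "ErrorRatePercent",
            "ReturnData": True,
        },
    ]
```

The returned list would be passed as `MetricDataQueries=...` together with `StartTime` and `EndTime`; the same expression can also back an alarm.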
1.2 CloudWatch Logs & Logs Insights
Log Hierarchy:
Log Group (/aws/lambda/MyFunction)
└── Log Stream (2024/01/15/[$LATEST]abc123)
└── Log Events (individual log entries)
| Component | Description |
|---|---|
| Log Group | Container for log streams from the same source |
| Log Stream | Sequence of log events from the same source instance |
| Retention Policy | Configure per log group: 1 day to 10 years (or never expire) |
| Metric Filter | Extract numeric values from log events to create custom metrics |
| Subscription Filter | Stream log events to Lambda, Kinesis, or Firehose in real time |
Metric Filter Example (count errors in logs):
Filter Pattern: [timestamp, requestId, level="ERROR", ...]
# Matches log lines containing "ERROR" at the level field
# Creates a CloudWatch metric counting these events
CloudWatch Logs Insights:
Interactive query language for analyzing log data. Supports filtering, aggregation, sorting, and visualization.
# Find slowest Lambda invocations in the last hour
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 20
# Count errors by error type
fields @message
| filter @message like /ERROR/
| parse @message "* ERROR: *" as timestamp, errorMessage
| stats count(*) as errorCount by errorMessage
| sort errorCount desc
# Find all throttle events in API Gateway
fields @timestamp, @message
| filter @message like /429/
| stats count(*) as throttleCount by bin(5m)
Exporting Logs:
| Destination | Latency | Use Case |
|---|---|---|
| S3 (Export) | Up to 12 hours | Batch archiving, compliance |
| Kinesis Firehose | Near-real-time | S3/Redshift/OpenSearch delivery |
| Lambda (Subscription) | Real-time | Custom processing, alerting |
| Kinesis Data Streams | Real-time | Custom real-time analytics |
1.3 CloudWatch Alarms
Alarm States:
| State | Meaning |
|---|---|
| OK | Metric is within the defined threshold |
| ALARM | Metric has breached the threshold |
| INSUFFICIENT_DATA | Not enough data to evaluate (common on new alarms or during metric gaps) |
Alarm Configuration:
Alarm: Lambda Error Rate > 5% for 2 consecutive 1-minute periods
Period: 60 seconds ← Granularity of evaluation
Evaluation Periods: 2 ← Number of periods that must breach
Datapoints to Alarm: 2 ← Out of 2 periods, 2 must breach (can be different: M of N)
Threshold: 5 ← The breach value
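The "M of N" evaluation can be illustrated with a simplified model. This sketch is an assumption-laden approximation: `alarm_state` is a hypothetical function, it uses a "greater than threshold" comparison only, and it ignores CloudWatch's configurable missing-data treatment.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return OK, ALARM, or INSUFFICIENT_DATA for the most recent periods.

    datapoints: one value per period, oldest first; None marks a missing period.
    """
    recent = datapoints[-evaluation_periods:]
    present = [value for value in recent if value is not None]
    if len(recent) < evaluation_periods or not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in present if value > threshold)
    # M of N: at least `datapoints_to_alarm` of the last N periods must breach.
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"
```

With the configuration above (threshold 5, 2 evaluation periods, 2 datapoints to alarm), `alarm_state([1, 6, 7], 5, 2, 2)` reports `ALARM`, while a single spike such as `[6, 1]` stays `OK`.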
Alarm Actions:
| Action | Service | Use Case |
|---|---|---|
| SNS Notification | SNS Topic | Email, Slack, PagerDuty alerts |
| Auto Scaling | EC2/ECS Auto Scaling | Scale in/out based on metric |
| EC2 Action | Stop, Terminate, Reboot | Self-healing |
| Systems Manager | OpsCenter OpsItem | Automated runbook |
| CodeDeploy Rollback | CodeDeploy | Auto-rollback deployment |
Composite Alarms:
Combine multiple alarms using AND/OR logic. Reduces alarm noise — only alert when multiple conditions are true simultaneously.
Composite Alarm: ALARM if (Lambda Errors ALARM) AND (DynamoDB Throttles ALARM)
→ Only alert when BOTH conditions are true; ignore individual spikes
Best Practice: Use composite alarms for high-severity pages. Use individual alarms for metrics dashboards. This reduces alert fatigue.
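A composite alarm is defined by a rule expression over other alarms' states. The helper below is a small sketch that builds such an expression; `composite_rule` and the alarm names are hypothetical, and the resulting string is intended for the `AlarmRule` parameter of `cloudwatch.put_composite_alarm()`.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into a rule such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)
```

For example, `composite_rule(["LambdaErrors", "DynamoThrottles"])` produces a rule that fires only when both child alarms are in the ALARM state.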
1.4 CloudWatch Dashboards & Contributor Insights
Dashboards are customizable monitoring views. They can include metrics from multiple regions (cross-region dashboards) and support math expressions.
CloudWatch Contributor Insights:
Analyzes structured log data to identify the top contributors (the "heavy hitters") causing a problem.
- Find which IP addresses generate the most errors.
- Find which DynamoDB partition keys are throttled most often.
- Built-in rules for VPC Flow Logs, CloudTrail, Route 53 Resolver.
Rule: Find top 10 DynamoDB partition keys with ThrottledRequests
→ Instantly identifies hot partitions from DynamoDB logs
→ No manual log analysis required
1.5 CloudWatch Synthetics
Canary scripts (Node.js or Python) that continuously test your API endpoints, UI workflows, and URLs — even when no real users are active.
| Canary Blueprint | Tests |
|---|---|
| API Canary | REST API endpoints; checks status codes and response time |
| GUI Workflow | Simulates user login flow, form submission |
| Heartbeat Monitor | Basic availability check for URLs |
| Broken Link Checker | Scans for broken hyperlinks |
| Visual Monitoring | Screenshots and compares to baseline |
1.6 CloudWatch Events (EventBridge)
CloudWatch Events is now EventBridge. See Domain 1, Section 7.
Key integrations for troubleshooting:
- EC2 state changes → trigger Lambda for auto-remediation.
- CodeDeploy deployment failure → SNS notification.
- Scheduled events (cron) → trigger health check Lambda.
2. AWS X-Ray — Distributed Tracing
X-Ray provides distributed tracing for serverless and microservices architectures. It visualizes the full path of a request as it travels through your application, identifying bottlenecks and errors at the service level.
2.1 Core Concepts — Traces, Segments & Subsegments
┌─────────────────────────────────────────────────────────────────────────────────┐
│ X-Ray Trace Structure │
│ │
│ Trace (one end-to-end request): │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Segment: API Gateway [0ms ──────────────────────────────── 450ms] │ │
│ │ └── Subsegment: Lambda Cold Start [0ms ─── 150ms] │ │
│ │ Segment: Lambda Function [150ms ──────── 400ms] │ │
│ │ └── Subsegment: DynamoDB GetItem [160ms ── 200ms] │ │
│ │ └── Subsegment: External HTTP call [210ms ─────── 380ms] │ │
│ │ └── Subsegment: SQS SendMessage [385ms ─ 400ms] │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ The timeline shows exactly where time is spent — immediately identifies │
│ the "External HTTP call" as the bottleneck at 170ms. │
└─────────────────────────────────────────────────────────────────────────────────┘
| Concept | Definition |
|---|---|
| Trace | The complete end-to-end path of a single request. Has a unique Trace ID. |
| Segment | Work done by a single service/application within a trace |
| Subsegment | Granular unit of work within a segment (DB query, HTTP call, function call) |
| Service Map | Visual graph of all services and connections, with health and latency |
| Trace ID | A unique identifier passed in the X-Amzn-Trace-Id HTTP header |
X-Ray Service Map:
┌──────────┐
Client ─────────────► API GW ├────► Lambda ────► DynamoDB
(200, 45ms)│ (200, 45ms)│ (200, 30ms) (200, 15ms)
└──────────┘
└────► SES (error rate: 5%)
2.2 Annotations, Metadata & Groups
Annotations vs Metadata:
| | Annotations | Metadata |
|---|---|---|
| Indexed | Yes; can be searched and filtered | No; only viewable in trace details |
| Type | Key-value (string, number, boolean) | Key-value (any JSON) |
| Use case | Filter traces by business context | Debug data; large payloads |
| Size limit | 50 per trace | Bounded by the 64 KB segment document size |
# Python X-Ray SDK
from aws_xray_sdk.core import xray_recorder
# Add annotation: INDEXED, searchable
xray_recorder.put_annotation('userId', 'user-123')
xray_recorder.put_annotation('orderValue', 99.99)
# Add metadata: NOT indexed, debug only
xray_recorder.put_metadata('requestPayload', {'items': 3, 'currency': 'USD'})
# Create a custom subsegment
with xray_recorder.in_subsegment('processPayment') as subsegment:
    subsegment.put_annotation('paymentMethod', 'credit_card')
    result = process_payment(order)
X-Ray Groups:
Filter traces by annotation using filter expressions. Create separate dashboards/alarms per group.
# Filter expression: show only traces from the payment service with high latency
service("PaymentLambda") AND responsetime > 5 AND annotation.userId != null
2.3 Sampling Rules
X-Ray does NOT record every request by default — sampling reduces cost and noise.
Default Sampling Rule:
- First request per second per host → always recorded (reservoir).
- 5% of additional requests → sampled.
Custom Sampling Rules:
Define rules based on URL, host, method, and service name. Rules are evaluated in priority order (lower number = higher priority).
| Priority | Service Name | URL Path | Fixed Rate | Reservoir |
|---|---|---|---|---|
| 1 | PaymentLambda | `/checkout` | 100% | 10 |
| 2 | * | `/health` | 0% | 0 (ignore health checks) |
| 3 | * | * | 5% | 1 (default) |
Critical: Setting the rate to 0% for `/health` endpoints prevents health check polling from polluting your trace data. Always exclude health checks from sampling.
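How a reservoir and a fixed rate combine into a sampling decision can be sketched as follows. This is an illustrative model only, not the X-Ray SDK's implementation: `SamplingRule` is a hypothetical class, and the per-second reset is a simplification of how the reservoir quota is tracked.

```python
import random


class SamplingRule:
    """Sample the first `reservoir` requests each second, then `fixed_rate` of the rest."""

    def __init__(self, reservoir, fixed_rate, rng=random.random):
        self.reservoir = reservoir    # guaranteed traces per second
        self.fixed_rate = fixed_rate  # fraction (0.0 to 1.0) of additional requests
        self._rng = rng
        self._current_second = None
        self._used = 0

    def should_sample(self, now_seconds):
        second = int(now_seconds)
        if second != self._current_second:
            # A new second begins: the reservoir refills.
            self._current_second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the probabilistic fixed rate.
        return self._rng() < self.fixed_rate
```

A rule like `SamplingRule(reservoir=0, fixed_rate=0.0)` never samples, which matches the `/health` row, while `SamplingRule(reservoir=1, fixed_rate=0.05)` mirrors the default rule.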
2.4 X-Ray SDK & Daemon
X-Ray Architecture:
Application Code (with X-Ray SDK)
│
│ (sends trace data via UDP to localhost:2000)
▼
X-Ray Daemon
(runs alongside app)
│
│ (batches and sends to X-Ray API over HTTPS)
▼
AWS X-Ray Service
X-Ray SDK Key Capabilities:
import boto3
from aws_xray_sdk.core import xray_recorder, patch_all
# Patch all supported libraries (boto3, requests, urllib, etc.)
patch_all()
dynamodb = boto3.client('dynamodb')  # Module-level client, reused across warm invocations
# Lambda handler: with Active Tracing, Lambda creates the segment for each invocation
def lambda_handler(event, context):
    # All boto3 calls are automatically traced as subsegments
    response = dynamodb.get_item(...)
    return response
# Manual subsegment for custom code paths (validate and data are application-defined)
with xray_recorder.in_subsegment('validate-business-rule') as subseg:
    subseg.put_annotation('ruleId', 'RULE-001')
    result = validate(data)
X-Ray Daemon Deployment:
| Environment | Daemon Location |
|---|---|
| EC2 / On-Premises | Install and run as a service |
| Lambda | Auto-included (just enable Active Tracing) |
| ECS (Fargate) | Add X-Ray daemon as a sidecar container |
| ECS (EC2) | Run daemon on EC2 host or as a sidecar |
| Elastic Beanstalk | Enable via EB console or .ebextensions |
ECS X-Ray Sidecar Configuration:
// task definition: add X-Ray daemon as a sidecar container
{
  "name": "xray-daemon",
  "image": "amazon/aws-xray-daemon",
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [{"containerPort": 2000, "protocol": "udp"}]
}
// application container: reach the sidecar via "links" (bridge network mode only;
// with awsvpc mode, e.g. Fargate, omit "links" and use 127.0.0.1:2000)
{
  "name": "myapp",
  "image": "myapp:latest",
  "environment": [
    {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "xray-daemon:2000"}
  ],
  "links": ["xray-daemon"]
}
2.5 X-Ray with Lambda, API Gateway & ECS
Lambda X-Ray Enablement:
# SAM template: enable X-Ray Active Tracing
Globals:
  Function:
    Tracing: Active
# Or per function
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Tracing: Active
| Tracing Mode | Behavior |
|---|---|
| Active | Lambda samples and sends trace data to X-Ray. Follows sampling rules. |
| PassThrough | Lambda passes trace header through but only records if called by another traced service. |
API Gateway X-Ray:
Enable "X-Ray Tracing" on a REST API stage. API Gateway creates a segment for each request, which is connected to downstream Lambda segments.
Key Concept: For end-to-end tracing (API Gateway → Lambda → DynamoDB), you must enable X-Ray on ALL services in the chain. Enabling it on Lambda alone shows Lambda internals but misses API Gateway and downstream service connections.
3. AWS CloudTrail
CloudTrail records API calls made to AWS services. Every action taken via console, CLI, SDK, or another AWS service generates a CloudTrail event. Think of it as the "audit log" for your AWS account.
3.1 Event Types & Trail Configuration
| Event Type | Records | Default Enabled |
|---|---|---|
| Management Events | Control plane operations (CreateBucket, RunInstances, UpdateFunction, etc.) | Yes |
| Data Events | Data plane operations on specific resources (S3:GetObject, Lambda:Invoke, DynamoDB:GetItem) | No (extra cost) |
| Insights Events | Unusual API activity patterns (spike in EC2 RunInstances, IAM policy changes) | No (extra cost) |
Trail Configuration:
| Setting | Recommendation |
|---|---|
| Multi-Region Trail | Enable to capture events from all regions |
| S3 Storage | Logs delivered to S3 within 15 minutes |
| Log File Validation | Enable SHA-256 digest files to detect tampering |
| CloudWatch Logs Integration | Send events to CloudWatch Logs for real-time alerting |
| KMS Encryption | Encrypt logs with a CMK |
| Organization Trail | One trail covering all accounts in an AWS Organization |
Critical Timing: CloudTrail delivers log files within 15 minutes of an API call. For real-time monitoring, stream events to CloudWatch Logs via a Trail, then create metric filters and alarms. CloudTrail alone is NOT real-time.
Event Lookup:
# Find all DeleteBucket calls since a given start time
# (lookup-events returns management events; data events such as S3 DeleteObject are not included)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteBucket \
  --start-time 2024-01-14T00:00:00Z
# Find all actions by a specific user (replace <username> with the user name)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=<username>
Useful CloudTrail Log Fields:
{
  "eventTime": "2024-01-15T10:23:45Z",
  "eventName": "DeleteObject",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "alice",
    "arn": "arn:aws:iam::123456789012:user/alice"
  },
  "sourceIPAddress": "192.168.1.1",
  "requestParameters": {
    "bucketName": "my-bucket",
    "key": "sensitive-file.txt"
  },
  "responseElements": null,
  "errorCode": null
}
3.2 CloudTrail Insights
CloudTrail Insights detects unusual write API activity by learning the normal baseline of management events. When activity deviates significantly, an Insights event is generated.
Baseline: 5 RunInstances calls/hour
Spike: 200 RunInstances calls in 10 minutes → Insights Event generated
→ Possible causes: cryptomining, unauthorized access, application bug
CloudTrail vs CloudWatch vs X-Ray:
| | CloudTrail | CloudWatch | X-Ray |
|---|---|---|---|
| Records | Who called what AWS API | Metrics and logs from AWS services | Distributed request traces |
| Focus | Audit and compliance | Monitoring and alerting | Performance and debugging |
| Granularity | Per API call | Per metric/log entry | Per request end-to-end |
| Retention | 90 days (free), longer in S3 | Configurable | 30 days |
4. Lambda — Optimization & Troubleshooting
4.1 Lambda Performance Tuning
Memory and CPU Relationship:
Lambda CPU is proportional to memory:
128 MB → 1/8 vCPU (very slow CPU operations)
1,769 MB → 1 full vCPU (linear increase)
3,538 MB → 2 vCPUs
10,240 MB → ~6 vCPUs
Key Insight: Increasing memory also increases CPU. A function that uses
barely any memory but is CPU-intensive should still get more memory
to get more CPU allocation.
Lambda Power Tuning (AWS Lambda Power Tuning tool):
AWS open-source tool that tests your function at multiple memory configurations and finds the optimal setting for cost vs. performance.
Test 128MB → 512MB → 1024MB → 2048MB → 3008MB
Measure: cost per invocation = memory (GB) × duration (s) × price per GB-second
Find: sweet spot where cost is lowest or performance is highest
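The cost comparison that the tool automates can be sketched in a few lines. This is a rough illustration under stated assumptions: the function names are hypothetical, the durations are made-up measurements, the price is a placeholder to be replaced with the current figure from the Lambda pricing page, and the per-request charge is ignored.

```python
def invocation_cost(memory_mb, duration_ms, price_per_gb_second):
    """Compute-only cost of one invocation: GB × seconds × price per GB-second."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


def cheapest_configuration(measurements, price_per_gb_second):
    """Return the memory size (MB) with the lowest cost per invocation.

    measurements: dict mapping memory in MB to the measured average duration in ms.
    """
    return min(
        measurements,
        key=lambda memory_mb: invocation_cost(
            memory_mb, measurements[memory_mb], price_per_gb_second
        ),
    )


# Hypothetical measurements: more memory brings more CPU, so duration drops.
example_measurements = {128: 2400, 512: 500, 1024: 260, 2048: 250}
EXAMPLE_PRICE = 0.0000166667  # placeholder price per GB-second; verify before use
best_memory_mb = cheapest_configuration(example_measurements, EXAMPLE_PRICE)
```

In this made-up data set the 512 MB setting is cheapest: going from 128 MB to 512 MB cuts duration by more than the 4× memory increase costs, while 2048 MB is barely faster than 1024 MB and therefore roughly twice as expensive.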
Initialization Best Practices:
import os
import boto3
# ✅ CORRECT: Load config once per execution environment
TABLE_NAME = os.environ['TABLE_NAME']  # Read at init; cached in memory
# ✅ CORRECT: Initialize outside handler (warm reuse)
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(TABLE_NAME)
# Loaded once per execution environment; reused across warm invocations
def lambda_handler(event, context):
    # ✅ Reuse the initialized clients
    response = table.get_item(Key={'id': event['id']})
    # ❌ WRONG: Creating new clients inside the handler (slow)
    # db = boto3.resource('dynamodb')  # Don't do this
    return response.get('Item')  # None when the item does not exist
Reducing Package Size:
- Use Lambda Layers for shared libraries.
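- Use container images (up to 10 GB) when dependencies exceed the deployment package limits (50 MB zipped for direct upload, 250 MB unzipped).
- Enable tree-shaking in JavaScript/TypeScript builds.
- Use AWS SDK for JavaScript v3 modular imports (import only what you need).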
// ❌ WRONG: Import entire SDK
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
// ✅ CORRECT: Import only S3 client (v3, smaller bundle)
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
const s3 = new S3Client({region: 'us-east-1'});
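SnapStart (Java, plus newer Python and .NET runtimes):
Lambda SnapStart creates a snapshot of a fully initialized execution environment, dramatically reducing cold start time. It launched for Java, which is the case exam questions focus on; AWS has since extended it to recent Python and .NET managed runtimes. Enable it in the function configuration; Lambda takes the snapshot when you publish a new version.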
4.2 Lambda Error Patterns
| Error | Cause | Solution |
|---|---|---|
| `Task timed out after X seconds` | Function exceeds configured timeout | Increase timeout; optimize slow operations; add connection timeout to SDK clients |
| `Runtime exited with error: signal: killed` | Out of memory | Increase memory allocation |
| `Process exited before completing request` | Unhandled exception; process crash | Add global error handling; check CloudWatch Logs for traceback |
| `Error: EACCES permission denied` | File permissions in container | Check file permissions; use /tmp for writes |
| `Unable to import module` | Missing dependency in deployment package | Add dependency to package; check for layer compatibility |
| `Calling the invoke API failed` | IAM permissions missing | Add lambda:InvokeFunction permission to caller's policy |
| `EndpointResolutionError` | Wrong region or invalid endpoint in SDK client | Verify region configuration |
Structured Error Handling Pattern:
import json
import logging
import traceback
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
    try:
        result = process(event)
        logger.info(json.dumps({
            'level': 'INFO',
            'operation': 'process',
            'requestId': context.aws_request_id,
            'status': 'success'
        }))
        return {'statusCode': 200, 'body': json.dumps(result)}
    except ValueError as e:
        logger.error(json.dumps({
            'level': 'ERROR',
            'operation': 'process',
            'requestId': context.aws_request_id,
            'error': str(e),
            'errorType': 'ValidationError'
        }))
        return {'statusCode': 400, 'body': json.dumps({'error': str(e)})}
    except Exception as e:
        logger.error(json.dumps({
            'level': 'ERROR',
            'requestId': context.aws_request_id,
            'error': str(e),
            'traceback': traceback.format_exc()
        }))
        raise  # Re-raise for Lambda retry/DLQ
4.3 Lambda Concurrency Troubleshooting
Diagnosing Throttling:
Symptom: API Gateway returning 502 or 429 errors
CloudWatch Throttles metric > 0
Root Cause Analysis:
1. Check ConcurrentExecutions vs account/function Reserved Concurrency limit
2. Check if Reserved Concurrency is set too low
3. Check if another function is consuming all unreserved concurrency
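4. Check the concurrency scaling (burst) limit: historically 3,000 in us-east-1; Lambda now scales each function by up to 1,000 concurrent executions every 10 seconds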
Solutions:
→ Request account concurrency limit increase
→ Add Reserved Concurrency to protect critical functions
→ Implement exponential backoff in the calling service
→ Use SQS to buffer requests (decouple spike from Lambda)
IteratorAge for Stream Processing (Kinesis/DynamoDB Streams):
IteratorAge = Current time - timestamp of last processed record
High IteratorAge means Lambda is falling behind the stream:
Causes: Lambda timeout, insufficient concurrency, slow processing
Solutions:
→ Increase Lambda memory/CPU for faster processing
→ Increase Kinesis shard count (more parallelism)
→ Tune batch size (larger batches raise throughput per invocation; smaller batches help when invocations time out)
→ Enable Parallelization Factor (1-10 concurrent batches per shard)
5. DynamoDB — Performance & Optimization
5.1 Hot Partitions & Write Sharding
Hot Partition Problem:
DynamoDB distributes data across partitions using the partition key hash. If most writes go to the same partition key, you get a "hot partition" that exceeds the per-partition throughput limit (3,000 RCU + 1,000 WCU per partition).
❌ Hot Partition Example:
Table: stock-trades
Partition Key: stock_symbol = "AAPL" ← 90% of writes go here → HOT
✅ Write Sharding Fix:
Append random suffix: "AAPL#1", "AAPL#2", ..., "AAPL#10"
Writes distributed across 10 partitions
Read: query all 10 shards in parallel and aggregate
Strategies:
| Strategy | How | Use Case |
|---|---|---|
| Write Sharding | Append random suffix (1-N) to PK | High-write items (stock prices, IoT sensors) |
| Time-based sharding | Include time bucket in PK (date, hour) | Time-series data |
| Composite PK | Use natural high-cardinality key | User data, order data |
| Caching (DAX) | Cache hot reads in DAX | Read-heavy items |
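The write-sharding strategy can be sketched as two small helpers. This is a minimal illustration: `sharded_key` and `all_shard_keys` are hypothetical names, and the `#` separator and 1-to-N suffix range simply follow the "AAPL#1" … "AAPL#10" example above.

```python
import random


def sharded_key(base_key, shard_count, rng=random):
    """Partition key for a write: the base key plus a random suffix in 1..shard_count."""
    suffix = rng.randint(1, shard_count)
    return f"{base_key}#{suffix}"


def all_shard_keys(base_key, shard_count):
    """Every partition key a reader must query (and then aggregate) for one base key."""
    return [f"{base_key}#{n}" for n in range(1, shard_count + 1)]
```

Writers call `sharded_key("AAPL", 10)` for each item; readers issue one Query per key from `all_shard_keys("AAPL", 10)`, typically in parallel, and merge the results. The trade-off is cheaper, evenly spread writes in exchange for fan-out reads.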
5.2 DynamoDB Error Patterns
| Error | HTTP Code | Cause | Solution |
|---|---|---|---|
| `ProvisionedThroughputExceededException` | 400 | Read/write capacity exceeded | Exponential backoff; increase capacity; use On-Demand mode; fix hot partitions |
| `ConditionalCheckFailedException` | 400 | Condition expression evaluated to false | Expected; handle in application logic |
| `ResourceNotFoundException` | 400 | Table or index doesn't exist | Verify table name and region |
| `ValidationException` | 400 | Invalid request (wrong attribute type, missing key) | Fix the request format |
| `ItemCollectionSizeLimitExceededException` | 400 | LSI item collection exceeds 10 GB | Redesign schema; use GSI instead of LSI |
| `TransactionConflictException` | 400 | Two transactions tried to modify the same item | Retry with backoff |
| `RequestLimitExceeded` | 400 | API call rate limit (different from throughput) | Rate-limit your API calls |
Critical: `ProvisionedThroughputExceededException` must be retried with exponential backoff. The AWS SDK retries automatically, but you should also monitor `ThrottledRequests` in CloudWatch and proactively increase capacity or redesign access patterns.
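Exponential backoff with jitter can be sketched as a small retry wrapper. This is an illustration of the pattern, not the SDK's built-in retry logic: `ThrottledError` is a hypothetical stand-in for a throttling exception, and the base delay, cap, and attempt count are assumed values.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error from the service."""


def call_with_backoff(operation, max_attempts=5, base=0.05, cap=5.0, sleep=time.sleep):
    """Call operation(), retrying on ThrottledError with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the error to the caller
            # The delay ceiling doubles each attempt (base, 2*base, 4*base, ...) up to cap;
            # a random delay below that ceiling keeps clients from retrying in lockstep.
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

In real code the `except` clause would catch the SDK's client error and check its error code; the `sleep` parameter exists so the delay can be replaced in tests.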
5.3 Query Optimization
Key Principles:
Use Query over Scan: Query reads only the partition you specify. Scan reads the entire table. For a 100 GB table, a Scan reads all 100 GB before filtering.
Use Projection Expressions: Only retrieve the attributes you need. This reduces data transfer and response size, but not RCU usage: read capacity is charged on the size of the items read, before projection.
# ❌ Returns the entire item (extra payload if you only need 2 fields)
response = table.get_item(Key={'userId': 'U-001'})
# ✅ Only fetch needed attributes
response = table.get_item(
    Key={'userId': 'U-001'},
    ProjectionExpression='firstName, #em',
    ExpressionAttributeNames={'#em': 'email'}  # '#' placeholders are required for reserved words and safe for any name
)
FilterExpression does not save capacity: it is applied server-side AFTER the data is read, and in a Query it can reference only non-key attributes. It reduces response size but NOT the RCUs consumed; every item matching the key condition is read first.
Parallel Scan for ETL: For full-table scans (migrations, exports), use parallel scan:
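# Split the scan across N workers
import threading
import boto3
def scan_segment(segment, total_segments, results):
    # boto3 sessions and resources are not thread-safe, so each worker creates its own
    table = boto3.session.Session().resource('dynamodb').Table('MyTable')
    kwargs = {'Segment': segment, 'TotalSegments': total_segments}
    while True:
        response = table.scan(**kwargs)
        results[segment].extend(response['Items'])
        if 'LastEvaluatedKey' not in response:  # segment exhausted
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
results = [[] for _ in range(4)]
threads = [threading.Thread(target=scan_segment, args=(i, 4, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()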
6. API Gateway — Optimization & Troubleshooting
6.1 API Gateway Error Codes
| HTTP Code | Error | Cause | Solution |
|---|---|---|---|
| 400 | Bad Request | Missing required parameter; malformed request | Fix request; enable request validation |
| 403 | Forbidden | IAM authorization failed; WAF block; invalid API key; resource policy deny | Check IAM policy; WAF rules; API key |
| 404 | Not Found | Stage or resource doesn't exist; wrong URL | Verify stage and resource path |
| 429 | Too Many Requests | Throttled (stage/method limit or account limit) | Implement backoff; increase throttle limits; add caching |
| 500 | Internal Server Error | Lambda threw an exception and returned error; integration misconfiguration | Check Lambda logs; verify integration config |
| 502 | Bad Gateway | Lambda returned malformed response; Lambda throttled; Lambda out of memory | Check Lambda response format for proxy integration; check Lambda concurrency |
| 503 | Service Unavailable | Backend unavailable | Check Lambda; circuit breaker pattern |
| 504 | Gateway Timeout | Integration timeout exceeded (default 29s for REST API) | Optimize backend; use async integration pattern |
Critical 502 Debug: For Lambda Proxy integration, the response MUST follow this exact format. Any deviation causes 502:
# ✅ Correct Lambda Proxy response format
return {
    'statusCode': 200,  # Required: integer
    'headers': {  # Optional: dict
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*'
    },
    'body': json.dumps({'message': 'OK'}),  # Required: string (not dict!)
    'isBase64Encoded': False  # Optional
}
# ❌ Wrong: body is a dict, not a string → 502
return {'statusCode': 200, 'body': {'message': 'OK'}}
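One way to avoid the malformed-response 502 is to build every response through a single helper. The sketch below is an assumption, not an AWS-provided utility: `proxy_response` is a hypothetical function that always serializes the body to a string.

```python
import json


def proxy_response(status_code, payload, headers=None):
    """Build a Lambda Proxy integration response with a string body."""
    return {
        "statusCode": int(status_code),
        "headers": {"Content-Type": "application/json", **(headers or {})},
        # Dicts and lists are serialized; strings are passed through unchanged.
        "body": payload if isinstance(payload, str) else json.dumps(payload),
        "isBase64Encoded": False,
    }
```

A handler can then end with `return proxy_response(200, {'message': 'OK'})`, so no code path can return a dict body by accident.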
API Gateway Logging:
| Log Type | Content | Enable In |
|---|---|---|
| Access Logs | Request details (IP, method, latency, status, requestId) | Stage settings |
| Execution Logs | Full request/response bodies, integration details | Stage settings |
Warning: Enable Execution Logs only in development. They log full request bodies and can expose sensitive data. Use Access Logs for production.
6.2 Caching & Throttling Strategies
API Gateway Cache Invalidation:
# Client requests fresh data bypassing cache
headers = {
'Authorization': 'Bearer token...',
'Cache-Control': 'max-age=0' # Invalidates cached response
}
# Note: Caller needs execute-api:InvalidateCache IAM permission
Throttling Layers (REST API):
Request comes in
│
▼
Account-Level Throttle (10,000 RPS) ──► 429 if exceeded at account level
│
▼
Stage-Level Throttle (default = account limit) ──► 429 if stage limit hit
│
▼
Method-Level Throttle (optional, overrides stage) ──► 429 if method limit hit
│
▼
Usage Plan + API Key (optional) ──► 429 if plan limit hit; 403 if key invalid
│
▼
Backend (Lambda, etc.)
7. SQS & SNS — Troubleshooting
7.1 Common SQS Issues
Messages Returning to Queue (Double Processing):
Symptom: Same message processed multiple times
Cause: Lambda takes longer than Visibility Timeout → message reappears
Fix 1: Increase Visibility Timeout ≥ 6 × Lambda timeout
Fix 2: Lambda calls ChangeMessageVisibility to extend timeout during processing
Fix 3: Use FIFO queue with message deduplication (exactly-once)
Messages Going to DLQ Unexpectedly:
Symptom: Messages appear in DLQ without apparent processing failure
Cause: maxReceiveCount exceeded (even if processing succeeded each time)
Investigation:
1. Check Lambda function logs for exceptions
2. Check if Lambda is deleting the message after processing
(For Event Source Mapping, Lambda auto-deletes on success)
3. Check Visibility Timeout vs Lambda timeout
4. Check for Lambda concurrency throttling (causes receive without processing)
Large Message Handling:
# SQS Extended Client Pattern (messages > 256 KB)
from sqs_extended_client import SQSExtendedClientSession
session = SQSExtendedClientSession()
sqs = session.client('sqs',
sqs_large_payload_support='my-bucket', # S3 bucket for large payloads
always_through_s3=False # Only use S3 when necessary
)
# Messages > 256 KB automatically stored in S3; pointer in SQS
sqs.send_message(QueueUrl=queue_url, MessageBody=large_payload)
FIFO Queue Throughput Issues:
Symptom: FIFO queue throughput capped at 300 TPS
Cause: FIFO queues allow 300 API calls per second per action without batching; a single MessageGroupId also forces strictly ordered, serial processing
Fix: Batch messages and use multiple MessageGroupIds to parallelize:
- OrderID as MessageGroupId → each order processed independently
- Batches of 10 messages per call → up to 3,000 messages per second (more with high-throughput FIFO mode)
7.2 SNS Delivery Failures
Delivery Status Logging:
Enable SNS delivery status logging for HTTP/Lambda/SQS subscribers to diagnose failures.
Delivery Status Logs go to CloudWatch Logs and show:
- Delivery attempt timestamps
- HTTP status from subscriber
- Error messages for failures
Dead-Letter Queue for SNS:
Configure a DLQ (SQS) for an SNS subscription to catch messages that fail delivery after all retries.
| Subscriber | Retry Policy | DLQ Support |
|---|---|---|
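| HTTP/HTTPS | Default policy: up to 50 attempts over about 6 hours (customizable delivery policy) | Yes |
| Lambda | AWS-managed endpoint: up to 100,015 attempts over 23 days | Yes |
| SQS | AWS-managed endpoint: up to 100,015 attempts over 23 days if the queue is unreachable | Yes (on the subscription); the queue's own DLQ covers consumer failures |
| | 3 attempts, 72 hours | No |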
8. Caching Strategies
8.1 Caching Patterns
| Pattern | Description | Use Case | Risk |
|---|---|---|---|
| Cache-Aside (Lazy Loading) | App checks cache → cache miss → fetch from DB → store in cache | General purpose reads | Cache can have stale data |
| Write-Through | Write to cache AND DB simultaneously | Write + read heavy | Wasted cache space for infrequently read items |
| Write-Behind (Write-Back) | Write to cache → async write to DB later | Write-heavy; DB offload | Risk of data loss if cache fails before sync |
| Read-Through | Cache fetches from DB on miss (cache manages itself) | Transparent to app | Less control |
| Refresh-Ahead | Cache proactively refreshes before TTL expires | Predictable access patterns | Can refresh items never re-requested |
Cache-Aside Implementation (most common):
import json
def get_user(user_id, redis_client, dynamodb_table):
    cache_key = f"user:{user_id}"
    # 1. Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)  # Cache hit
    # 2. Cache miss: fetch from DynamoDB
    response = dynamodb_table.get_item(Key={'userId': user_id})
    user = response.get('Item')
    if user:
        # 3. Store in cache with TTL (default=str handles DynamoDB Decimal values)
        redis_client.setex(cache_key, 300, json.dumps(user, default=str))  # 5 min TTL
    return user
def update_user(user_id, new_data, redis_client, dynamodb_table):
    # Update DB
    dynamodb_table.update_item(Key={'userId': user_id}, ...)
    # Invalidate cache (NOT update; avoids race conditions)
    redis_client.delete(f"user:{user_id}")
Cache Stampede / Thundering Herd Problem:
When a cached item expires, multiple concurrent requests miss the cache simultaneously and all hit the database at once.
Solutions:
- Mutex/Lock: Only one process fetches from DB; others wait.
- Jitter on TTL: Add random time to TTL so items expire at different times.
- Background refresh: Refresh before expiry using a background job.
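The first two mitigations can be sketched in a few lines. This is a minimal in-process illustration under stated assumptions, not ElastiCache-specific code: `jittered_ttl`, `SingleFlightCache`, and the `loader` callback are hypothetical names, and a plain dict plus `threading.Lock` stands in for Redis and a distributed lock.

```python
import random
import threading
import time


def jittered_ttl(base_seconds, spread=0.2):
    """Return base_seconds adjusted by up to +/- spread (a fraction)."""
    return base_seconds * (1 + random.uniform(-spread, spread))


class SingleFlightCache:
    """In-memory cache where only one thread reloads a missing or expired key."""

    def __init__(self, loader, base_ttl=300):
        self._loader = loader           # hypothetical function that hits the DB
        self._base_ttl = base_ttl
        self._data = {}                 # key -> (value, expires_at)
        self._locks = {}                # key -> per-key lock
        self._guard = threading.Lock()  # protects the _locks dict

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                  # fresh hit, no locking needed
        with self._lock_for(key):            # other threads wait here
            entry = self._data.get(key)      # re-check after waiting
            if entry and entry[1] > time.monotonic():
                return entry[0]
            value = self._loader(key)        # single database call
            expires = time.monotonic() + jittered_ttl(self._base_ttl)
            self._data[key] = (value, expires)
            return value
```

With several threads asking for the same expired key, only the first one calls `loader`; the rest block on the per-key lock and then read the freshly cached value. The jitter spreads expiry times so that keys written together do not all expire in the same instant.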
8.2 Amazon ElastiCache — Redis vs Memcached
| Feature | Redis | Memcached |
|---|---|---|
| Data Structures | Strings, Lists, Sets, Sorted Sets, Hashes, Streams | Strings only |
| Persistence | RDB snapshots + AOF (append-only file) | None |
| Replication | Multi-AZ with automatic failover (Cluster mode disabled / enabled) | None |
| Cluster Mode | Yes (horizontal sharding) | Yes (multi-node) |
| Pub/Sub | Yes | No |
| Lua Scripting | Yes | No |
| Sorted Sets (Leaderboards) | Yes | No |
| Session Store | Yes | Yes |
| Horizontal Scaling | Redis Cluster (16,384 hash slots distributed across shards) | Add nodes; multi-threaded per node |
When to Use Each:
| Use Case | Choose |
|---|---|
| Session store (simple key-value) | Redis or Memcached |
| Leaderboards, ranked lists | Redis (Sorted Sets) |
| Pub/Sub messaging | Redis |
| Need persistence and backup | Redis |
| Need multi-AZ failover | Redis |
| Simple caching; multi-threaded scale | Memcached |
| Real-time analytics | Redis |
Exam Tip: For almost all use cases on DVA-C02, Redis is the right answer. Memcached is simpler but lacks persistence, replication, and advanced data structures.
ElastiCache Security:
- In-transit encryption: TLS between clients and cluster.
- At-rest encryption: For Redis. Not available for Memcached.
- Redis AUTH: Password authentication for Redis. Added via the AUTH command.
- IAM-based authentication: Available for Redis using IAM roles (newer feature).
- Security Groups: Control network access to ElastiCache cluster.
9. Common Error Patterns & HTTP Status Codes
9.1 AWS API Error Categories
┌────────────────────────────────────────────────────────────────────────────┐
│ AWS Error Classification │
│ │
│ 4xx — Client Errors (your fault, fix the request) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 400 Bad Request: Invalid parameter, malformed request body │ │
│ │ 401 Unauthorized: No/expired credentials │ │
│ │ 403 Forbidden: Valid credentials but no permission │ │
│ │ 404 Not Found: Resource doesn't exist │ │
│ │ 409 Conflict: Resource already exists; state conflict │ │
│ │ 429 Too Many Requests: Throttled; retry with backoff │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ 5xx — Server Errors (AWS's fault, always retry with backoff) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 500 Internal Server Error: AWS service error │ │
│ │ 503 Service Unavailable: Service is down/overloaded │ │
│ │ 504 Gateway Timeout: Request took too long │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
Retry Decision:
RETRYABLE_ERRORS = {
    # 5xx: always retry
    500: True,
    502: True,
    503: True,
    504: True,
    # 4xx throttling: retry with backoff
    429: True,
    # Client errors: don't retry
    400: False,
    401: False,
    403: False,
    404: False,
}
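A retry loop that honours this classification could look like the sketch below. It is illustrative only: `StatusError` and `call_with_backoff` are hypothetical names, and the delays use the common "full jitter" variant of exponential backoff.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # same codes marked retryable above


class StatusError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(); retry retryable failures with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except StatusError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise  # client error, or retries exhausted
            # Upper bound grows as base * 2^attempt, capped at max_delay
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # full jitter
```

The AWS SDKs already retry throttling and 5xx responses with backoff for most calls, so a hand-written loop like this is mainly relevant for custom HTTP clients or for adding retries on top of the SDK defaults.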
9.2 Service-Specific Error Reference
Lambda:
| Error | Code | Meaning |
|---|---|---|
| TooManyRequestsException | 429 | Concurrency limit hit; throttled |
| ResourceConflictException | 409 | Concurrent update conflict |
| InvalidParameterValueException | 400 | Invalid function configuration |
DynamoDB:
| Error | Retryable? |
|---|---|
| ProvisionedThroughputExceededException | Yes |
| RequestLimitExceeded | Yes |
| InternalServerError | Yes |
| ConditionalCheckFailedException | No (expected condition failure) |
| ValidationException | No (fix the request) |
| ResourceNotFoundException | No (resource doesn't exist) |
S3:
| Error | HTTP | Meaning |
|---|---|---|
| NoSuchBucket | 404 | Bucket doesn't exist |
| NoSuchKey | 404 | Object key doesn't exist |
| AccessDenied | 403 | Permissions issue |
| SlowDown | 503 | S3 throttle; use backoff |
| BucketAlreadyExists | 409 | Bucket name already taken globally |
10. Application Optimization Patterns
10.1 S3 Performance Optimization
Request Rate Performance:
- At least 3,500 PUT/COPY/POST/DELETE requests per second per prefix.
- At least 5,500 GET/HEAD requests per second per prefix.
- Distribute objects across multiple prefixes to achieve higher aggregate throughput.
❌ All objects in one prefix:
s3://bucket/uploads/file1.jpg ← all 3,500 PUT/s here
✅ Distributed across prefixes:
s3://bucket/2024/01/file1.jpg ← 3,500 PUT/s
s3://bucket/2024/02/file2.jpg ← another 3,500 PUT/s
s3://bucket/2024/03/file3.jpg ← another 3,500 PUT/s
Total: 10,500 PUT/s
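The arithmetic above, plus one way to spread keys over a fixed number of prefixes, can be sketched as follows. The per-prefix figures are the ones quoted in this section; `aggregate_limits`, `prefixed_key`, and the `pNN/` naming scheme are hypothetical and only illustrate the idea.

```python
import hashlib

PUT_PER_PREFIX = 3_500   # write requests per second per prefix (from the notes)
GET_PER_PREFIX = 5_500   # read requests per second per prefix (from the notes)


def aggregate_limits(prefix_count):
    """Return (writes/s, reads/s) achievable across prefix_count prefixes."""
    return prefix_count * PUT_PER_PREFIX, prefix_count * GET_PER_PREFIX


def prefixed_key(filename, prefix_count=4):
    """Place a file under one of prefix_count prefixes, chosen by a stable hash."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    index = int(digest, 16) % prefix_count
    return f"p{index:02d}/{filename}"
```

For example, `aggregate_limits(3)` reproduces the 10,500 PUT/s total shown above. A hash-based prefix is only one option; date-based or tenant-based prefixes work equally well as long as traffic is spread evenly across them.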
Byte-Range Fetches:
Download specific byte ranges of a large object in parallel.
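# Download a 1 GB object as 10 parallel 100 MB byte-range requests
import concurrent.futures
import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'large-file.bin'  # placeholder names
CHUNK = 100_000_000  # 100 MB

def download_chunk(bucket, key, start, end, part_num):
    response = s3.get_object(Bucket=bucket, Key=key,
                             Range=f'bytes={start}-{end}')
    return part_num, response['Body'].read()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(download_chunk, bucket, key,
                               i * CHUNK, (i + 1) * CHUNK - 1, i)
               for i in range(10)]
    # Collect every part, keyed by part number, once all ranges have arrived
    parts = dict(f.result() for f in futures)
data = b''.join(parts[i] for i in range(10))  # reassemble in order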
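The example above hard-codes ten equal chunks. For objects whose size is not an exact multiple of the chunk size, the ranges can be computed instead; the helper names below (`byte_ranges`, `range_header`) are hypothetical, and the object size is assumed to be known already (for instance from a HEAD request).

```python
def byte_ranges(total_size, chunk_size):
    """Yield (start, end) inclusive byte offsets covering total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size > 0")
    for start in range(0, total_size, chunk_size):
        # The final chunk is shortened so it never runs past the last byte
        end = min(start + chunk_size, total_size) - 1
        yield start, end


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range value."""
    return f"bytes={start}-{end}"
```

HTTP byte ranges are inclusive at both ends, which is why each `end` is one less than the next `start`.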
10.2 Kinesis Optimization
Shard Calculation:
Required Shards (write) = CEIL(Max Writes per second / 1,000)
OR CEIL(Max MB per second / 1)
(use whichever is larger)
Required Shards (read, standard consumers) = CEIL(Max read MB per second / 2)
Example:
Write: 5,000 records/s at 500 bytes each = 2.5 MB/s
Required write shards: max(CEIL(5000/1000), CEIL(2.5/1)) = max(5, 3) = 5 shards
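The same calculation as a small function, as a sketch: `required_shards` is a hypothetical helper, and it uses the per-shard limits stated above (1,000 records/s and 1 MB/s for writes, 2 MB/s for shared reads), with 1 MB taken as 1,000,000 bytes to match the worked example.

```python
import math

RECORDS_PER_SHARD = 1_000   # write limit: records per second per shard
WRITE_MB_PER_SHARD = 1.0    # write limit: MB per second per shard
READ_MB_PER_SHARD = 2.0     # shared read limit: MB per second per shard


def required_shards(records_per_sec, avg_record_bytes, read_mb_per_sec=0.0):
    """Return the shard count that satisfies every per-shard limit."""
    write_mb = records_per_sec * avg_record_bytes / 1_000_000
    by_records = math.ceil(records_per_sec / RECORDS_PER_SHARD)
    by_write_mb = math.ceil(write_mb / WRITE_MB_PER_SHARD)
    by_read_mb = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard
    return max(1, by_records, by_write_mb, by_read_mb)
```

`required_shards(5000, 500)` gives 5, matching the example: the record-count limit (5 shards) dominates the byte limit (3 shards).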
Resharding:
| Operation | Action | Time |
|---|---|---|
| Split Shard | One shard → two shards (increase capacity) | Minutes |
| Merge Shards | Two adjacent shards → one shard (decrease capacity) | Minutes |
| Scale Up/Down | Update shard count via console/CLI | Minutes |
Parent vs Child Shards: After a split/merge, parent shards are CLOSED but still readable until all existing records expire. Always read from parent shards first (ordered reads).
Kinesis Producer Library (KPL) vs Kinesis Client Library (KCL):
| | KPL (Producer) | KCL (Consumer) |
|---|---|---|
| Purpose | High-throughput producer | Robust consumer (checkpoint, lease, multi-worker) |
| Batching | Yes (aggregation + collection) | N/A |
| Retry | Yes (with backoff) | Yes (via checkpointing) |
| Language | Java (C++ core) | Java, Python, Ruby, .NET |
| Async | Yes | Yes |
10.3 ECS and Fargate Optimization
Task CPU and Memory:
- CPU values: 256 (.25 vCPU), 512 (.5 vCPU), 1024 (1 vCPU), 2048, 4096 (for Fargate).
- Memory must be within valid ranges for the chosen CPU.
- Under-provisioning CPU causes slow performance; over-provisioning wastes cost.
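A quick way to check a CPU/memory pairing before registering a task definition is a lookup table. The sketch below covers only the five CPU values listed above; the memory ranges are an assumption based on the Fargate task size table and should be verified against current AWS documentation. `is_valid_task_size` and `smallest_cpu_for` are hypothetical helpers.

```python
# CPU units -> allowed memory values in MiB (assumed from the Fargate task size table)
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),     # 1 GB to 4 GB
    1024: list(range(2048, 8192 + 1, 1024)),    # 2 GB to 8 GB
    2048: list(range(4096, 16384 + 1, 1024)),   # 4 GB to 16 GB
    4096: list(range(8192, 30720 + 1, 1024)),   # 8 GB to 30 GB
}


def is_valid_task_size(cpu_units, memory_mib):
    """Return True if memory_mib is allowed for the given CPU value."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, [])


def smallest_cpu_for(memory_mib):
    """Return the smallest CPU value that accepts memory_mib, or None."""
    for cpu in sorted(FARGATE_MEMORY_MIB):
        if memory_mib in FARGATE_MEMORY_MIB[cpu]:
            return cpu
    return None
```

For instance, a task that needs 4 GB of memory can run on 512 CPU units, so choosing 1024 or more only makes sense if the workload is actually CPU-bound.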
ECS Auto Scaling:
| Scale Target | Metric | Direction |
|---|---|---|
| CPU Utilization | ECSServiceAverageCPUUtilization | Scale out if > 70% |
| Memory Utilization | ECSServiceAverageMemoryUtilization | Scale out if > 80% |
| ALB Request Count | ALBRequestCountPerTarget | Scale with request traffic |
| SQS Queue Depth | ApproximateNumberOfMessagesVisible | Scale worker tier with backlog |
Fargate Spot:
Use Fargate Spot for fault-tolerant, stateless workloads. Up to 70% cost savings vs on-demand Fargate. Tasks may be interrupted with 2-minute notice.
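Because an interrupted task receives SIGTERM before it is stopped, a worker can use the notice period to finish its current item and exit cleanly. The following is a minimal sketch of that pattern; `run_worker`, `fetch_item`, and `process_item` are hypothetical names, and the queue is whatever source the worker reads from.

```python
import signal
import threading

stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Mark the worker as stopping; the loop finishes the item in progress."""
    stop_requested.set()


# Register the handler so a stop signal does not kill the process mid-item
signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(fetch_item, process_item):
    """Process items until the source is empty or a stop is requested."""
    processed = 0
    while not stop_requested.is_set():
        item = fetch_item()
        if item is None:
            break
        process_item(item)
        processed += 1
    return processed
```

The flag is checked only between items, so each item is either fully processed or not started, which suits the stateless, fault-tolerant workloads Fargate Spot is meant for.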
11. Exam Tips & Quick Reference
Scenario-to-Answer Mapping
| Scenario Keyword / Requirement | Correct Answer |
|---|---|
| "Find which Lambda invocations are slowest" | CloudWatch Logs Insights query on REPORT logs |
| "Trace a request from API Gateway through Lambda to DynamoDB" | AWS X-Ray with Active Tracing on Lambda + API Gateway stage |
| "Who deleted the S3 bucket at 2AM?" | CloudTrail (lookup-events for DeleteBucket) |
| "Alert when Lambda error rate exceeds 5%" | CloudWatch Alarm on a metric math expression (Errors / Invocations × 100) + SNS notification |
| "Detect unusual spike in EC2 RunInstances API calls" | CloudTrail Insights |
| "Monitor and alert when DynamoDB has hot partitions" | CloudWatch Contributor Insights on DynamoDB |
| "Lambda execution time is slow; CPU is the bottleneck" | Increase Lambda memory (CPU scales with memory) |
| "Messages processed multiple times from SQS" | Increase SQS Visibility Timeout |
| "DynamoDB returns ProvisionedThroughputExceededException" | Exponential backoff; increase capacity; fix hot partition |
| "API Gateway returns 502 from Lambda" | Check Lambda response format (body must be a string) |
| "API Gateway returns 504 Gateway Timeout" | Lambda execution exceeds 29s API GW timeout; use async integration |
| "Cache session data with persistence and failover" | ElastiCache Redis (Multi-AZ) |
| "Cache simple key-value without persistence" | ElastiCache Memcached or Redis |
| "Kinesis consumers falling behind; IteratorAge increasing" | Increase shards or enable Enhanced Fan-out |
| "S3 performance slow for high-throughput writes" | Distribute across multiple key prefixes |
| "X-Ray missing traces from production API" | Check sampling rules; ensure X-Ray enabled on API Gateway stage AND Lambda |
| "Lambda cold starts causing API latency spikes" | Provisioned Concurrency |
| "Detect that CloudFormation template has drifted" | CloudFormation Drift Detection |
Common Traps
- CloudTrail is NOT real-time: Log files arrive within 15 minutes. For real-time alerts on API activity, stream CloudTrail to CloudWatch Logs and create metric filters.
- X-Ray daemon vs SDK: The SDK instruments your code and sends data. The daemon collects and forwards to AWS. Both are needed. On Lambda, the daemon is included — just enable Active Tracing.
- Annotations are indexed; Metadata is not: If you need to filter/search traces by a field, it must be an Annotation. Metadata is for debugging details only.
- CloudWatch Logs Insights vs Metric Filters: Metric Filters create new metrics from incoming log events (useful for alarms). Logs Insights queries historical logs for ad-hoc analysis. Use the right tool for the job.
- ElastiCache and Lambda in VPC: ElastiCache nodes live inside a VPC, so the Lambda function must be attached to the same VPC (or a peered one), with security groups that allow the cache port. A Lambda function that is not VPC-attached cannot reach ElastiCache.
- DAX vs ElastiCache for DynamoDB: DAX is DynamoDB-compatible (same API). ElastiCache is a general-purpose cache requiring manual invalidation logic. DAX handles cache invalidation automatically for DynamoDB.
- SQS Visibility Timeout is PER RECEIVE: Each receive starts a new visibility timeout for that message. If processing runs longer than the timeout (for example, because the function is slow or a downstream call is throttled), the message becomes visible again and a second Lambda invocation may receive it even though the first has not finished.
- IteratorAge ≠ Lambda Errors: High IteratorAge means Lambda is processing slowly, not necessarily failing. It could be under-provisioned memory/CPU, not an error condition.
Key Terms — Domain 4
| Term | One-Line Definition |
|---|---|
| Trace | An X-Ray record of the complete end-to-end journey of a single request |
| Segment | An X-Ray record of work done by one service/application within a trace |
| Subsegment | A granular unit of work within a segment (e.g., a DB query or HTTP call) |
| Annotation | A key-value pair on an X-Ray segment that is INDEXED and filterable |
| Metadata | A key-value pair on an X-Ray segment that is NOT indexed; debug-only |
| Sampling Rate | The percentage of requests X-Ray records and traces |
| Management Events | CloudTrail records of control-plane API calls (create, delete, update) |
| Data Events | CloudTrail records of data-plane operations (S3 GetObject, Lambda Invoke) |
| CloudTrail Insights | Feature that detects anomalous API call volume patterns |
| Cache-Aside | Caching pattern where the app checks the cache before querying the database |
| Thundering Herd | Cache stampede where many concurrent requests miss and hit the DB simultaneously |
| IteratorAge | Age of the last record processed from a Kinesis/DynamoDB Stream — measures lag |
| Hot Partition | A DynamoDB partition receiving disproportionately high read/write traffic |
| Write Sharding | Appending a random suffix to a partition key to distribute writes across shards |
| Composite Alarm | A CloudWatch alarm combining multiple individual alarms with AND/OR logic |
| Metric Filter | A pattern that extracts values from CloudWatch Logs to create CloudWatch metrics |
End of Domain 4 — You have completed all four DVA-C02 domain notes.
DVA-C02 — Full Exam Summary
| Domain | Weight | Key Services |
|---|---|---|
| Domain 1: Development with AWS Services | 32% | Lambda, API Gateway, DynamoDB, S3, SQS, SNS, EventBridge, Kinesis, Step Functions, SAM |
| Domain 2: Security | 26% | IAM, STS, Cognito, KMS, Secrets Manager, SSM Parameter Store, WAF, ACM, SigV4 |
| Domain 3: Deployment | 24% | CodeCommit, CodeBuild, CodeDeploy, CodePipeline, Elastic Beanstalk, CloudFormation, SAM, ECS, ECR, Lambda Aliases |
| Domain 4: Troubleshooting & Optimization | 18% | CloudWatch, X-Ray, CloudTrail, ElastiCache, Performance Tuning |
Top 10 Things to Master Before the Exam:
- Lambda invocation models (sync vs async vs event source mapping) and concurrency math
- DynamoDB key design, RCU/WCU calculation, GSI vs LSI, and ProvisionedThroughputExceededException
- IAM policy evaluation order (Explicit Deny → SCP → Resource Policy → Boundary → Identity Policy)
- Cognito User Pool (authentication/JWT) vs Identity Pool (temporary AWS credentials)
- KMS envelope encryption and the GenerateDataKey API
- CodeDeploy deployment lifecycle hooks and traffic shifting strategies
- CloudFormation Change Sets, nested stacks, cross-stack references, and DeletionPolicy
- Elastic Beanstalk deployment policies and .ebextensions with leader_only
- X-Ray trace structure, annotations vs metadata, and sampling rules
- SQS Visibility Timeout and DLQ configuration (maxReceiveCount, FIFO DLQ rule)