AWS Certified Developer – Associate (DVA-C02)

Domain 4: Troubleshooting and Optimization

Exam Code: DVA-C02  |  Level: Associate
Domain Weight: 18%  |  Total Domains: 4  |  Passing Score: 720/1000


Table of Contents

  1. Amazon CloudWatch
  2. AWS X-Ray — Distributed Tracing
  3. AWS CloudTrail
  4. Lambda — Optimization & Troubleshooting
  5. DynamoDB — Performance & Optimization
  6. API Gateway — Optimization & Troubleshooting
  7. SQS & SNS — Troubleshooting
  8. Caching Strategies
  9. Common Error Patterns & HTTP Status Codes
  10. Application Optimization Patterns
  11. Exam Tips & Quick Reference

1. Amazon CloudWatch

CloudWatch is the unified observability service for AWS. It collects metrics, logs, and events, and enables automated responses through alarms and actions.

1.1 Metrics — Built-in & Custom

Metric Fundamentals:

Concept | Definition
Namespace | Container for metrics (AWS/Lambda, AWS/EC2, MyApp/Checkout)
Metric | A time-ordered set of data points (e.g., Errors, Duration, CPUUtilization)
Dimension | A name-value pair that identifies a metric (FunctionName=MyFunc)
Resolution | Standard (1-minute granularity) or High Resolution (1-second granularity)
Retention | 3 hours for periods under 60 s; 15 days for 60 s; 63 days for 5 min; 15 months for 1 h

Critical Lambda CloudWatch Metrics:

Metric | What It Measures | Troubleshooting Use
Invocations | Number of function invocations | Traffic volume
Errors | Invocations that threw an error | Function failure rate
Duration | Execution time per invocation | Performance; timeout risk
Throttles | Invocations rejected due to concurrency limit | Concurrency exhaustion
ConcurrentExecutions | Running functions at a given time | Concurrency usage
IteratorAge | Age of last processed Kinesis/DDB record | Stream processing lag
DeadLetterErrors | Failed DLQ deliveries | DLQ misconfiguration

Critical DynamoDB CloudWatch Metrics:

Metric | What It Measures
ConsumedReadCapacityUnits | RCUs used per minute
ConsumedWriteCapacityUnits | WCUs used per minute
ProvisionedReadCapacityUnits | Provisioned RCUs
SystemErrors | DynamoDB server-side errors (5xx)
UserErrors | Client-side errors (4xx)
SuccessfulRequestLatency | Per-operation latency
ThrottledRequests | Requests throttled due to exceeding capacity

Custom Metrics:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish a custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp/Checkout',
    MetricData=[
        {
            'MetricName': 'OrderProcessingTime',
            'Value': 245.5,
            'Unit': 'Milliseconds',
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Region', 'Value': 'us-east-1'}
            ],
            'StorageResolution': 1  # 1 = High Resolution (1s); 60 = Standard
        }
    ]
)

High Resolution Metrics: Standard metrics have 1-minute granularity. High Resolution metrics (StorageResolution=1) have 1-second granularity and support alarms at 10s or 30s periods. They are more expensive but essential for fast-responding systems.

Metric Math: Perform mathematical operations on metrics to derive new values. Example: calculate error rate = Errors / Invocations * 100 directly in CloudWatch.
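
The error-rate expression above can be sketched in two ways: as a local calculation, and as the query structure that GetMetricData accepts for the same metric math. This is a minimal sketch, not a definitive implementation. The function name MyFunc, the query IDs, and the zero-traffic guard are illustrative assumptions, and nothing here calls AWS.

```python
def error_rate_percent(errors: float, invocations: float) -> float:
    """Local equivalent of the metric math expression Errors / Invocations * 100."""
    if invocations == 0:
        return 0.0  # assumption: treat "no traffic" as 0% instead of dividing by zero
    return errors / invocations * 100


def lambda_metric(metric_id: str, metric_name: str, function_name: str) -> dict:
    """One raw-metric entry in the shape used by GetMetricData's MetricDataQueries."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            },
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,  # only the derived expression is returned
    }


# "MyFunc" is a hypothetical function name.
queries = [
    lambda_metric("errors", "Errors", "MyFunc"),
    lambda_metric("invocations", "Invocations", "MyFunc"),
    {
        "Id": "error_rate",
        "Expression": "errors / invocations * 100",
        "Label": "ErrorRatePercent",
        "ReturnData": True,
    },
]

print(error_rate_percent(25, 100))  # 25.0
```

The queries list could be passed as MetricDataQueries to a CloudWatch client, along with a start and end time.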

1.2 CloudWatch Logs & Logs Insights

Log Hierarchy:

Log Group (/aws/lambda/MyFunction)
    └── Log Stream (2024/01/15/[$LATEST]abc123)
            └── Log Events (individual log entries)

Component | Description
Log Group | Container for log streams from the same source
Log Stream | Sequence of log events from the same source instance
Retention Policy | Configure per log group: 1 day to 10 years (or never expire)
Metric Filter | Extract numeric values from log events to create custom metrics
Subscription Filter | Stream log events to Lambda, Kinesis, or Firehose in real time

Metric Filter Example (count errors in logs):

Filter Pattern: [timestamp, requestId, level="ERROR", ...]
# Matches log lines containing "ERROR" at the level field
# Creates a CloudWatch metric counting these events
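
To see what the space-delimited pattern above selects, its matching rule can be approximated locally. This sketch is only an emulation for intuition; the real matching happens inside CloudWatch Logs, and the sample log lines are invented for illustration.

```python
def matches_error_filter(line: str) -> bool:
    """Emulate [timestamp, requestId, level="ERROR", ...] on a space-delimited line."""
    fields = line.split()
    # The pattern names three leading fields; "..." absorbs whatever follows.
    return len(fields) >= 3 and fields[2] == "ERROR"


# Hypothetical log lines in "timestamp requestId level message" form.
sample_lines = [
    "2024-01-15T10:23:45Z req-1 ERROR payment declined",
    "2024-01-15T10:23:46Z req-2 INFO order stored",
    "2024-01-15T10:23:47Z req-3 ERROR timeout calling upstream",
]

error_count = sum(matches_error_filter(line) for line in sample_lines)
print(error_count)  # 2
```

Each matching event would increment the custom metric by the filter's metric value, typically 1.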

CloudWatch Logs Insights:
Interactive query language for analyzing log data. Supports filtering, aggregation, sorting, and visualization.

# Find slowest Lambda invocations in the last hour
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 20

# Count errors by error type
fields @message
| filter @message like /ERROR/
| parse @message "* ERROR: *" as timestamp, errorMessage
| stats count(*) as errorCount by errorMessage
| sort errorCount desc

# Find all throttle events in API Gateway
fields @timestamp, @message
| filter @message like /429/
| stats count(*) as throttleCount by bin(5m)

Exporting Logs:

Destination | Latency | Use Case
S3 (Export) | Up to 12 hours | Batch archiving, compliance
Kinesis Firehose | Near-real-time | S3/Redshift/OpenSearch delivery
Lambda (Subscription) | Real-time | Custom processing, alerting
Kinesis Data Streams | Real-time | Custom real-time analytics

1.3 CloudWatch Alarms

Alarm States:

State | Meaning
OK | Metric is within the defined threshold
ALARM | Metric has breached the threshold
INSUFFICIENT_DATA | Not enough data to evaluate (common on new alarms or during metric gaps)

Alarm Configuration:

Alarm: Lambda Error Rate > 5% for 2 consecutive 1-minute periods

Period: 60 seconds          ← Granularity of evaluation
Evaluation Periods: 2       ← Number of periods that must breach
Datapoints to Alarm: 2      ← Out of 2 periods, 2 must breach (can be different: M of N)
Threshold: 5                ← The breach value
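
The "M of N" evaluation described above can be sketched as a small function. This is a simplified model for a greater-than-threshold alarm. It assumes one datapoint per period and ignores CloudWatch's configurable missing-data handling, so treat it as intuition, not the service's exact algorithm.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int, datapoints_to_alarm: int) -> str:
    """Return ALARM, OK, or INSUFFICIENT_DATA for a greater-than-threshold alarm."""
    window = datapoints[-evaluation_periods:]  # the N most recent periods
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Error-rate percentages for the last three 1-minute periods (made-up values).
print(alarm_state([1.0, 6.2, 7.5], threshold=5,
                  evaluation_periods=2, datapoints_to_alarm=2))  # ALARM
```

Setting datapoints_to_alarm below evaluation_periods (for example 2 of 3) lets the alarm fire on breaches that are not consecutive.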

Alarm Actions:

Action | Service | Use Case
SNS Notification | SNS Topic | Email, Slack, PagerDuty alerts
Auto Scaling | EC2/ECS Auto Scaling | Scale in/out based on metric
EC2 Action | Stop, Terminate, Reboot | Self-healing
Systems Manager | OpsCenter OpsItem | Automated runbook
CodeDeploy Rollback | CodeDeploy | Auto-rollback deployment

Composite Alarms:
Combine multiple alarms using AND/OR logic. Reduces alarm noise — only alert when multiple conditions are true simultaneously.

Composite Alarm: ALARM if (Lambda Errors ALARM) AND (DynamoDB Throttles ALARM)
→ Only alert when BOTH conditions are true; ignore individual spikes
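
A composite alarm's rule is a boolean expression over child alarm states. The sketch below builds such a rule string and evaluates the same logic locally. The alarm names are hypothetical, and the helpers are illustrative only.

```python
def composite_rule(*alarm_names: str, operator: str = "AND") -> str:
    """Build an AlarmRule expression such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_in_alarm(child_states: dict, operator: str = "AND") -> bool:
    """Evaluate the same logic locally from a mapping of alarm name to state."""
    in_alarm = [state == "ALARM" for state in child_states.values()]
    return all(in_alarm) if operator == "AND" else any(in_alarm)


# Hypothetical child alarm names.
rule = composite_rule("lambda-errors", "dynamodb-throttles")
print(rule)  # ALARM("lambda-errors") AND ALARM("dynamodb-throttles")
print(composite_in_alarm({"lambda-errors": "ALARM", "dynamodb-throttles": "OK"}))  # False
```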

Best Practice: Use composite alarms for high-severity pages. Use individual alarms for metrics dashboards. This reduces alert fatigue.

1.4 CloudWatch Dashboards & Contributor Insights

Dashboards are customizable monitoring views. They can include metrics from multiple regions (cross-region dashboards) and support math expressions.

CloudWatch Contributor Insights:
Analyzes structured log data to identify the top contributors (the "heavy hitters") causing a problem.

  • Find which IP addresses generate the most errors.
  • Find which DynamoDB partition keys are getting the most throttled.
  • Built-in rules for VPC Flow Logs, CloudTrail, Route 53 Resolver.

Rule: Find top 10 DynamoDB partition keys with ThrottledRequests
→ Instantly identifies hot partitions from DynamoDB logs
→ No manual log analysis required

1.5 CloudWatch Synthetics

Canary scripts (Node.js or Python) that continuously test your API endpoints, UI workflows, and URLs — even when no real users are active.

Canary Blueprint | Tests
API Canary | REST API endpoints; checks status codes and response time
GUI Workflow | Simulates user login flow, form submission
Heartbeat Monitor | Basic availability check for URLs
Broken Link Checker | Scans for broken hyperlinks
Visual Monitoring | Screenshots and compares to baseline

1.6 CloudWatch Events (EventBridge)

CloudWatch Events is now EventBridge. See Domain 1, Section 7.

Key integrations for troubleshooting:

  • EC2 state changes → trigger Lambda for auto-remediation.
  • CodeDeploy deployment failure → SNS notification.
  • Scheduled events (cron) → trigger health check Lambda.

2. AWS X-Ray — Distributed Tracing

X-Ray provides distributed tracing for serverless and microservices architectures. It visualizes the full path of a request as it travels through your application, identifying bottlenecks and errors at the service level.

2.1 Core Concepts — Traces, Segments & Subsegments

┌─────────────────────────────────────────────────────────────────────────────────┐
│                            X-Ray Trace Structure                                 │
│                                                                                   │
│  Trace (one end-to-end request):                                                 │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │ Segment: API Gateway  [0ms ──────────────────────────────── 450ms]     │    │
│  │   └── Subsegment: Lambda Cold Start           [0ms ─── 150ms]          │    │
│  │   Segment: Lambda Function                    [150ms ──────── 400ms]   │    │
│  │      └── Subsegment: DynamoDB GetItem         [160ms ── 200ms]         │    │
│  │      └── Subsegment: External HTTP call       [210ms ─────── 380ms]   │    │
│  │      └── Subsegment: SQS SendMessage          [385ms ─ 400ms]          │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
│                                                                                   │
│  The timeline shows exactly where time is spent — immediately identifies          │
│  the "External HTTP call" as the bottleneck at 170ms.                            │
└─────────────────────────────────────────────────────────────────────────────────┘

Concept | Definition
Trace | The complete end-to-end path of a single request. Has a unique Trace ID.
Segment | Work done by a single service/application within a trace
Subsegment | Granular unit of work within a segment (DB query, HTTP call, function call)
Service Map | Visual graph of all services and connections, with health and latency
Trace ID | A unique identifier passed in the X-Amzn-Trace-Id HTTP header
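
The X-Amzn-Trace-Id header carries semicolon-separated key=value fields (Root, and optionally Parent and Sampled). A minimal parsing sketch follows; the header value is a sample in the documented format, not output from a real request.

```python
def parse_trace_header(header: str) -> dict:
    """Split an X-Amzn-Trace-Id value into its key=value fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # skip empty or malformed parts
            fields[key] = value
    return fields


# Sample header value in the documented Root/Parent/Sampled format.
header = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
parsed = parse_trace_header(header)
print(parsed["Root"])     # 1-5759e988-bd862e3fe1be46a994272793
print(parsed["Sampled"])  # 1
```

Logging the Root value alongside application logs makes it possible to jump from a log line to the matching trace.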

X-Ray Service Map:

                    ┌──────────┐
Client ─────────────► API GW    ├────► Lambda ────► DynamoDB
         (200, 45ms)│ (200, 45ms)│  (200, 30ms)  (200, 15ms)
                    └──────────┘
                                    └────► SES      (error rate: 5%)

2.2 Annotations, Metadata & Groups

Annotations vs Metadata:

Property | Annotations | Metadata
Indexed | Yes; can be searched and filtered | No; only viewable in trace details
Type | Key-value (string, number, boolean) | Key-value (any JSON)
Use case | Filter traces by business context | Debug data; large payloads
Size limit | 50 annotations per trace | No per-key limit; bounded by the 64 KB segment document size

# Python X-Ray SDK
from aws_xray_sdk.core import xray_recorder

# Add annotation — INDEXED, searchable
xray_recorder.put_annotation('userId', 'user-123')
xray_recorder.put_annotation('orderValue', 99.99)

# Add metadata — NOT indexed, debug only
xray_recorder.put_metadata('requestPayload', {'items': 3, 'currency': 'USD'})

# Create a custom subsegment
with xray_recorder.in_subsegment('processPayment') as subsegment:
    subsegment.put_annotation('paymentMethod', 'credit_card')
    result = process_payment(order)

X-Ray Groups:
Filter traces by annotation using filter expressions. Create separate dashboards/alarms per group.

# Filter expression: show only traces from the payment service with high latency
service("PaymentLambda") AND responsetime > 5 AND annotation.userId != null

2.3 Sampling Rules

X-Ray does NOT record every request by default — sampling reduces cost and noise.

Default Sampling Rule:

  • First request per second per host → always recorded (reservoir).
  • 5% of additional requests → sampled.

Custom Sampling Rules:
Define rules based on URL, host, method, and service name. Rules are evaluated in priority order (lower number = higher priority).

Priority | Service Name | URL Path | Fixed Rate | Reservoir
1 | PaymentLambda | /checkout | 100% | 10
2 | * | /health | 0% | 0 (ignore health checks)
3 | * | * | 5% | 1 (default)

Critical: Setting rate to 0% for /health endpoints prevents health check polling from polluting your trace data. Always exclude health checks from sampling.
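
The reservoir-plus-fixed-rate behavior described above can be simulated locally. This sketch models a single rule on a single host. The class name and the injected clock and random generator are assumptions made for testability, not part of the X-Ray SDK.

```python
import random


class SamplingRule:
    """Reservoir-plus-fixed-rate sampling for one rule (local simulation only)."""

    def __init__(self, reservoir: int, fixed_rate: float, rng=None):
        self.reservoir = reservoir    # guaranteed traces per second
        self.fixed_rate = fixed_rate  # fraction sampled beyond the reservoir
        self.rng = rng or random.Random()
        self._second = None
        self._used = 0

    def should_sample(self, now: float) -> bool:
        second = int(now)
        if second != self._second:    # new one-second window: refill the reservoir
            self._second, self._used = second, 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        return self.rng.random() < self.fixed_rate


health = SamplingRule(reservoir=0, fixed_rate=0.0)     # like the /health rule
checkout = SamplingRule(reservoir=10, fixed_rate=1.0)  # like the /checkout rule
default = SamplingRule(reservoir=1, fixed_rate=0.05)   # like the default rule

health_sampled = sum(health.should_sample(now=0.0) for _ in range(100))
checkout_sampled = sum(checkout.should_sample(now=0.0) for _ in range(100))
first_default = default.should_sample(now=0.0)
print(health_sampled, checkout_sampled, first_default)  # 0 100 True
```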

2.4 X-Ray SDK & Daemon

X-Ray Architecture:

Application Code (with X-Ray SDK)
         │
         │ (sends trace data via UDP to localhost:2000)
         ▼
    X-Ray Daemon
    (runs alongside app)
         │
         │ (batches and sends to X-Ray API over HTTPS)
         ▼
    AWS X-Ray Service

X-Ray SDK Key Capabilities:

import boto3
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch all supported libraries (boto3, requests, httplib, etc.)
patch_all()

# Create the client once, outside the handler, so warm invocations reuse it
dynamodb = boto3.client('dynamodb')

# Lambda handler: with Active Tracing, Lambda creates the segment for each invocation
def lambda_handler(event, context):
    # All boto3 calls are automatically traced as subsegments
    response = dynamodb.get_item(...)
    return response

# Manual subsegment for custom code paths (call this from within the handler)
with xray_recorder.in_subsegment('validate-business-rule') as subseg:
    subseg.put_annotation('ruleId', 'RULE-001')
    result = validate(data)

X-Ray Daemon Deployment:

Environment | Daemon Location
EC2 / On-Premises | Install and run as a service
Lambda | Auto-included (just enable Active Tracing)
ECS (Fargate) | Add X-Ray daemon as a sidecar container
ECS (EC2) | Run daemon on EC2 host or as a sidecar
Elastic Beanstalk | Enable via EB console or .ebextensions

ECS X-Ray Sidecar Configuration:

// task definition — add X-Ray daemon as a sidecar container
{
  "name": "xray-daemon",
  "image": "amazon/aws-xray-daemon",
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [{"containerPort": 2000, "protocol": "udp"}]
}
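// application container: link to the X-Ray sidecar
// ("links" works only in bridge network mode; with awsvpc, as on Fargate, the daemon is reachable at 127.0.0.1:2000)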
{
  "name": "myapp",
  "image": "myapp:latest",
  "environment": [
    {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "xray-daemon:2000"}
  ],
  "links": ["xray-daemon"]
}

2.5 X-Ray with Lambda, API Gateway & ECS

Lambda X-Ray Enablement:

# SAM template — enable X-Ray Active Tracing
Globals:
  Function:
    Tracing: Active

# Or per function
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Tracing: Active

Tracing Mode | Behavior
Active | Lambda samples and sends trace data to X-Ray. Follows sampling rules.
PassThrough | Lambda passes trace header through but only records if called by another traced service.

API Gateway X-Ray:
Enable "X-Ray Tracing" on a REST API stage. API Gateway creates a segment for each request, which is connected to downstream Lambda segments.

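Key Concept: For end-to-end tracing (API Gateway → Lambda → DynamoDB), tracing must be active at every hop: enable X-Ray on the API Gateway stage, enable Active Tracing on the Lambda function, and instrument the function's AWS SDK calls so DynamoDB requests appear as subsegments. Enabling it on Lambda alone shows Lambda internals but misses the API Gateway segment and downstream calls.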


3. AWS CloudTrail

CloudTrail records API calls made to AWS services. Every action taken via console, CLI, SDK, or another AWS service generates a CloudTrail event. Think of it as the "audit log" for your AWS account.

3.1 Event Types & Trail Configuration

Event Type | Records | Enabled by Default
Management Events | Control plane operations (CreateBucket, RunInstances, UpdateFunctionConfiguration, etc.) | Yes
Data Events | Data plane operations on specific resources (S3:GetObject, Lambda:Invoke, DynamoDB:GetItem) | No (extra cost)
Insights Events | Unusual API activity patterns (spike in EC2 RunInstances, IAM policy changes) | No (extra cost)

Trail Configuration:

Setting | Recommendation
Multi-Region Trail | Enable to capture events from all regions
S3 Storage | Log files are delivered to S3 in batches, typically within about 5 minutes (older guidance cites 15 minutes)
Log File Validation | Enable SHA-256 digest files to detect tampering
CloudWatch Logs Integration | Send events to CloudWatch Logs for near-real-time alerting
KMS Encryption | Encrypt logs with a customer managed KMS key
Organization Trail | One trail covering all accounts in an AWS Organization

Critical Timing: CloudTrail delivers log files to S3 in batches, typically within about 5 minutes of an API call (older guidance cites up to 15 minutes, and no delivery time is guaranteed). For near-real-time monitoring, send trail events to CloudWatch Logs, then create metric filters and alarms. CloudTrail alone is NOT real-time.

Event Lookup:

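# Find all DeleteBucket calls since a given start time
# (lookup-events returns management and Insights events from the last 90 days, not data events such as DeleteObject)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteBucket \
  --start-time 2024-01-14T00:00:00Z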

# Find all actions by a specific user
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=<username>

Useful CloudTrail Log Fields:

{
  "eventTime": "2024-01-15T10:23:45Z",
  "eventName": "DeleteObject",
  "userIdentity": {
    "type": "IAMUser",
    "userName": "alice",
    "arn": "arn:aws:iam::123456789012:user/alice"
  },
  "sourceIPAddress": "192.168.1.1",
  "requestParameters": {
    "bucketName": "my-bucket",
    "key": "sensitive-file.txt"
  },
  "responseElements": null,
  "errorCode": null
}

3.2 CloudTrail Insights

CloudTrail Insights detects unusual write API activity by learning the normal baseline of management events. When activity deviates significantly, an Insights event is generated.

Baseline: 5 RunInstances calls/hour
Spike: 200 RunInstances calls in 10 minutes → Insights Event generated
→ Possible causes: cryptomining, unauthorized access, application bug

CloudTrail vs CloudWatch vs X-Ray:

Aspect | CloudTrail | CloudWatch | X-Ray
Records | Who called what AWS API | Metrics and logs from AWS services | Distributed request traces
Focus | Audit and compliance | Monitoring and alerting | Performance and debugging
Granularity | Per API call | Per metric/log entry | Per request end-to-end
Retention | 90 days (free), longer in S3 | Configurable | 30 days

4. Lambda — Optimization & Troubleshooting

4.1 Lambda Performance Tuning

Memory and CPU Relationship:

Lambda CPU is proportional to memory:
  128 MB  → 1/8 vCPU   (very slow CPU operations)
  1,769 MB → 1 full vCPU (linear increase)
  3,538 MB → 2 vCPUs
  10,240 MB → ~6 vCPUs

Key Insight: Increasing memory also increases CPU. A function that uses
barely any memory but is CPU-intensive should still get more memory
to get more CPU allocation.

Lambda Power Tuning (AWS Lambda Power Tuning tool):
AWS open-source tool that tests your function at multiple memory configurations and finds the optimal setting for cost vs. performance.

Test 128MB → 512MB → 1024MB → 2048MB → 3008MB
Measure: duration × price per GB-second
Find: sweet spot where cost is lowest or performance is highest
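
The "duration × price per GB-second" comparison above can be worked through with made-up numbers. The durations are hypothetical measurements, the price constant is an assumed x86 rate that should be checked against current pricing, and the per-request charge is left out for simplicity.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 price; verify against current pricing


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute cost of one invocation: GB-seconds consumed times the unit price."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical measurements: average duration (ms) at each memory setting (MB).
measurements = {128: 2400.0, 512: 560.0, 1024: 290.0, 2048: 180.0}

costs = {mb: invocation_cost(mb, ms) for mb, ms in measurements.items()}
cheapest = min(costs, key=costs.get)
fastest = min(measurements, key=measurements.get)
print(cheapest, fastest)  # 512 2048
```

In this example the cheapest setting is not the smallest one: at 128 MB the function runs so slowly that it consumes more GB-seconds than at 512 MB.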

Initialization Best Practices:

import os

import boto3

# ✅ CORRECT: Load config once per execution environment
TABLE_NAME = os.environ['TABLE_NAME']  # Cached in memory

# ✅ CORRECT: Initialize outside handler (warm reuse)
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(TABLE_NAME)
# Loaded once per execution environment; reused across warm invocations

def lambda_handler(event, context):
    # ✅ Reuse the initialized clients
    response = table.get_item(Key={'id': event['id']})

    # ❌ WRONG: Creating new clients inside the handler (slow)
    # db = boto3.resource('dynamodb')  # Don't do this

    # GetItem omits 'Item' when the key does not exist
    return response.get('Item')

Reducing Package Size:

  • Use Lambda Layers for shared libraries.
  • Use container images (up to 10 GB) when the package exceeds the .zip limits (50 MB zipped for direct upload, 250 MB unzipped).
  • Enable tree-shaking in JavaScript/TypeScript builds.
  • Use AWS SDK for JavaScript v3 modular imports (import only what you need).

// ❌ WRONG: Import entire SDK
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// ✅ CORRECT: Import only S3 client (v3, smaller bundle)
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
const s3 = new S3Client({region: 'us-east-1'});

SnapStart (originally Java only):
Lambda SnapStart takes a snapshot of a fully initialized execution environment and restores from it, which sharply reduces cold start time. It launched for Java and has since been extended to newer Python and .NET runtimes. Enable it in the function configuration; Lambda takes the snapshot when you publish a new version.

4.2 Lambda Error Patterns

Error | Cause | Solution
Task timed out after X seconds | Function exceeds configured timeout | Increase timeout; optimize slow operations; add connection timeout to SDK clients
Runtime exited with error: signal: killed | Out of memory | Increase memory allocation
Process exited before completing request | Unhandled exception; process crash | Add global error handling; check CloudWatch Logs for traceback
Error: EACCES permission denied | File permissions in container | Check file permissions; use /tmp for writes
Unable to import module | Missing dependency in deployment package | Add dependency to package; check for layer compatibility
Calling the invoke API failed | IAM permissions missing | Add lambda:InvokeFunction permission to caller's policy
EndpointResolutionError | Wrong region or invalid endpoint in SDK client | Verify region configuration

Structured Error Handling Pattern:

import json
import logging
import traceback

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        result = process(event)
        logger.info(json.dumps({
            'level': 'INFO',
            'operation': 'process',
            'requestId': context.aws_request_id,
            'status': 'success'
        }))
        return {'statusCode': 200, 'body': json.dumps(result)}
    
    except ValueError as e:
        logger.error(json.dumps({
            'level': 'ERROR',
            'operation': 'process',
            'requestId': context.aws_request_id,
            'error': str(e),
            'errorType': 'ValidationError'
        }))
        return {'statusCode': 400, 'body': json.dumps({'error': str(e)})}
    
    except Exception as e:
        logger.error(json.dumps({
            'level': 'ERROR',
            'requestId': context.aws_request_id,
            'error': str(e),
            'traceback': traceback.format_exc()
        }))
        raise  # Re-raise for Lambda retry/DLQ

4.3 Lambda Concurrency Troubleshooting

Diagnosing Throttling:

Symptom: API Gateway returning 502 or 429 errors
         CloudWatch Throttles metric > 0
         
Root Cause Analysis:
1. Check ConcurrentExecutions vs account/function Reserved Concurrency limit
2. Check if Reserved Concurrency is set too low
3. Check if another function is consuming all unreserved concurrency
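4. Check the concurrency scaling rate (historically a burst limit of 3,000 in us-east-1; since late 2023 each function scales by up to 1,000 concurrent executions every 10 seconds)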

Solutions:
  → Request account concurrency limit increase
  → Add Reserved Concurrency to protect critical functions
  → Implement exponential backoff in the calling service
  → Use SQS to buffer requests (decouple spike from Lambda)

IteratorAge for Stream Processing (Kinesis/DynamoDB Streams):

IteratorAge = Current time - timestamp of last processed record

High IteratorAge means Lambda is falling behind the stream:
Causes: Lambda timeout, insufficient concurrency, slow processing
Solutions:
  → Increase Lambda memory/CPU for faster processing
  → Increase Kinesis shard count (more parallelism)
  → Tune batch size: larger batches amortize per-invocation overhead; smaller batches help when invocations time out
  → Enable Parallelization Factor (1-10 concurrent batches per shard)

5. DynamoDB — Performance & Optimization

5.1 Hot Partitions & Write Sharding

Hot Partition Problem:
DynamoDB distributes data across partitions using the partition key hash. If most writes go to the same partition key, you get a "hot partition" that exceeds the per-partition throughput limit (3,000 RCU + 1,000 WCU per partition).

❌ Hot Partition Example:
Table: stock-trades
Partition Key: stock_symbol = "AAPL"  ← 90% of writes go here → HOT

✅ Write Sharding Fix:
Append random suffix: "AAPL#1", "AAPL#2", ..., "AAPL#10"
Writes distributed across 10 partitions
Read: query all 10 shards in parallel and aggregate
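
The write-sharding fix above amounts to two small helpers: one that picks a random suffix on write, and one that lists every shard key a reader must query. This is a minimal sketch; the shard count of 10 mirrors the AAPL#1 to AAPL#10 example, and the function names are illustrative.

```python
import random

SHARD_COUNT = 10  # assumed shard count, matching the AAPL#1..AAPL#10 example


def sharded_write_key(base_key: str, rng=random) -> str:
    """Pick a random shard suffix so writes spread across partitions."""
    return f"{base_key}#{rng.randint(1, SHARD_COUNT)}"


def all_shard_keys(base_key: str) -> list:
    """List every key a reader must query (ideally in parallel) and then merge."""
    return [f"{base_key}#{n}" for n in range(1, SHARD_COUNT + 1)]


write_key = sharded_write_key("AAPL")
read_keys = all_shard_keys("AAPL")
print(write_key in read_keys, len(read_keys))  # True 10
```

The trade-off is visible in the two helpers: writes get cheaper to distribute, but every read fans out to SHARD_COUNT queries.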

Strategies:

Strategy | How | Use Case
Write Sharding | Append random suffix (1-N) to PK | High-write items (stock prices, IoT sensors)
Time-based sharding | Include time bucket in PK (date, hour) | Time-series data
Composite PK | Use natural high-cardinality key | User data, order data
Caching (DAX) | Cache hot reads in DAX | Read-heavy items

5.2 DynamoDB Error Patterns

Error | HTTP Code | Cause | Solution
ProvisionedThroughputExceededException | 400 | Read/write capacity exceeded | Exponential backoff; increase capacity; use On-Demand mode; fix hot partitions
ConditionalCheckFailedException | 400 | Condition expression evaluated to false | Expected; handle in application logic
ResourceNotFoundException | 400 | Table or index doesn't exist | Verify table name and region
ValidationException | 400 | Invalid request (wrong attribute type, missing key) | Fix the request format
ItemCollectionSizeLimitExceededException | 400 | LSI item collection exceeds 10 GB | Redesign schema; use GSI instead of LSI
TransactionConflictException | 400 | Two transactions tried to modify the same item | Retry with backoff
RequestLimitExceeded | 400 | API call rate limit (different from throughput) | Rate-limit your API calls

Critical: ProvisionedThroughputExceededException must be retried with exponential backoff. The AWS SDK retries automatically, but you should also monitor ThrottledRequests in CloudWatch and proactively increase capacity or redesign access patterns.
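
For code paths where the SDK's built-in retries are not enough, exponential backoff with jitter can be sketched as below. ThrottledError is a hypothetical stand-in for the SDK's throttling exception, and the base delay, cap, and attempt count are illustrative defaults, not recommended values.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for ProvisionedThroughputExceededException."""


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0, rng=random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_backoff(operation, max_attempts: int = 5, sleep=time.sleep):
    """Retry operation() on ThrottledError, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the throttle to the caller
            sleep(backoff_delay(attempt))


# Demo: an operation that is throttled twice, then succeeds.
attempts = []


def flaky_write():
    attempts.append(1)
    if len(attempts) < 3:
        raise ThrottledError("throttled")
    return "ok"


result = call_with_backoff(flaky_write, sleep=lambda seconds: None)  # skip real sleeping
print(result, len(attempts))  # ok 3
```

The jitter matters as much as the exponent: without it, clients throttled at the same moment retry at the same moment.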

5.3 Query Optimization

Key Principles:

  1. Use Query over Scan: Query reads only the partition you specify. Scan reads the entire table. For a 100 GB table, a Scan reads all 100 GB before filtering.

  2. Use Projection Expressions: Only retrieve the attributes you need. This shrinks the response payload and network transfer, but not RCU usage: DynamoDB charges for the full size of every item read.

# ❌ Returns the entire item (larger payload than needed if you only want 2 fields)
response = table.get_item(Key={'userId': 'U-001'})

# ✅ Only fetch needed attributes
response = table.get_item(
    Key={'userId': 'U-001'},
    ProjectionExpression='firstName, #em',
    ExpressionAttributeNames={'#em': 'email'}  # placeholder avoids clashes with reserved words
)

  3. Treat FilterExpression as a post-read filter: It is applied server-side AFTER the data is read, so it reduces response size but NOT the RCUs consumed. Every item that matches the key condition is read first. Narrow the read with the partition key and sort key in KeyConditionExpression, and use FilterExpression only for non-key attributes.

  4. Parallel Scan for ETL: For full-table scans (migrations, exports), use parallel scan:

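# Split the scan across N workers
import threading

TOTAL_SEGMENTS = 4
results = [[] for _ in range(TOTAL_SEGMENTS)]  # one result list per worker

def scan_segment(segment, total_segments):
    kwargs = {'Segment': segment, 'TotalSegments': total_segments}
    while True:
        response = table.scan(**kwargs)
        results[segment].extend(response['Items'])
        # Follow LastEvaluatedKey until this segment is exhausted
        if 'LastEvaluatedKey' not in response:
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

threads = [threading.Thread(target=scan_segment, args=(i, TOTAL_SEGMENTS))
           for i in range(TOTAL_SEGMENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()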
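
The capacity cost behind the projection and filter principles above can be checked with a little arithmetic. This sketch applies the standard read sizing rule (4 KB units, half cost for eventually consistent reads). The item sizes are made up, and the helper name is illustrative.

```python
import math


def rcus_for_read(bytes_read: int, strongly_consistent: bool = False) -> float:
    """RCUs for a read: bytes round up to 4 KB units; eventual consistency costs half."""
    units = max(1, math.ceil(bytes_read / 4096))
    return float(units) if strongly_consistent else units / 2


# GetItem on a 6 KB item: a ProjectionExpression returning two attributes
# still reads (and bills) the whole item.
get_item_cost = rcus_for_read(6 * 1024, strongly_consistent=True)

# Query that reads 100 items of 1 KB each; a FilterExpression then discards 95.
# The cost follows the 100 KB read, not the 5 items returned.
query_cost = rcus_for_read(100 * 1024)

print(get_item_cost, query_cost)  # 2.0 12.5
```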

6. API Gateway — Optimization & Troubleshooting

6.1 API Gateway Error Codes

HTTP Code | Error | Cause | Solution
400 | Bad Request | Missing required parameter; malformed request | Fix request; enable request validation
403 | Forbidden | IAM authorization failed; WAF block; invalid API key; resource policy deny | Check IAM policy; WAF rules; API key
404 | Not Found | Stage or resource doesn't exist; wrong URL | Verify stage and resource path
429 | Too Many Requests | Throttled (stage/method limit or account limit) | Implement backoff; increase throttle limits; add caching
500 | Internal Server Error | Lambda threw an exception and returned error; integration misconfiguration | Check Lambda logs; verify integration config
502 | Bad Gateway | Lambda returned malformed response; Lambda throttled; Lambda out of memory | Check Lambda response format for proxy integration; check Lambda concurrency
503 | Service Unavailable | Backend unavailable | Check Lambda; circuit breaker pattern
504 | Gateway Timeout | Integration timeout exceeded (default 29s for REST API) | Optimize backend; use async integration pattern

Critical 502 Debug: For Lambda Proxy integration, the response MUST follow this exact format. Any deviation causes 502:

# ✅ Correct Lambda Proxy response format
return {
    'statusCode': 200,                    # Required: integer
    'headers': {                          # Optional: dict
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*'
    },
    'body': json.dumps({'message': 'OK'}), # Required: string (not dict!)
    'isBase64Encoded': False              # Optional
}

# ❌ Wrong: body is a dict, not a string → 502
return {'statusCode': 200, 'body': {'message': 'OK'}}
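
One way to avoid the malformed-response 502 is to build every response through a single helper. The sketch below shows such a helper and a checker for the two rules called out above (integer statusCode, string body). Both function names are illustrative, not part of any AWS library.

```python
import json


def proxy_response(status_code: int, payload, headers: dict = None) -> dict:
    """Build a Lambda proxy integration response with a string body."""
    return {
        "statusCode": int(status_code),
        "headers": {"Content-Type": "application/json", **(headers or {})},
        "body": json.dumps(payload),  # always serialize: the body must be a string
        "isBase64Encoded": False,
    }


def is_valid_proxy_response(response) -> bool:
    """Check the two rules above: integer statusCode and string body."""
    return (
        isinstance(response, dict)
        and isinstance(response.get("statusCode"), int)
        and isinstance(response.get("body", ""), str)
    )


good = proxy_response(200, {"message": "OK"})
bad = {"statusCode": 200, "body": {"message": "OK"}}  # dict body: would cause a 502
print(is_valid_proxy_response(good), is_valid_proxy_response(bad))  # True False
```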

API Gateway Logging:

Log Type | Content | Enable In
Access Logs | Request details (IP, method, latency, status, requestId) | Stage settings
Execution Logs | Full request/response bodies, integration details | Stage settings

Warning: Enable Execution Logs only in development. They log full request bodies and can expose sensitive data. Use Access Logs for production.

6.2 Caching & Throttling Strategies

API Gateway Cache Invalidation:

# Client requests fresh data bypassing cache
headers = {
    'Authorization': 'Bearer token...',
    'Cache-Control': 'max-age=0'  # Invalidates cached response
}
# Note: Caller needs execute-api:InvalidateCache IAM permission

Throttling Layers (REST API):

Request comes in
    │
    ▼
Account-Level Throttle (10,000 RPS) ──► 429 if exceeded at account level
    │
    ▼
Stage-Level Throttle (default = account limit) ──► 429 if stage limit hit
    │
    ▼
Method-Level Throttle (optional, overrides stage) ──► 429 if method limit hit
    │
    ▼
Usage Plan + API Key (optional) ──► 429 if plan limit hit; 403 if key invalid
    │
    ▼
Backend (Lambda, etc.)
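
Each throttle layer above behaves like a token bucket: a steady refill rate plus a burst capacity. The sketch below models one such bucket with an injected clock so the behavior is reproducible. The rate and burst values are tiny illustrative numbers, not API Gateway defaults.

```python
class TokenBucket:
    """Single token bucket: `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate            # steady-state requests per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would receive 429 Too Many Requests


bucket = TokenBucket(rate=2, burst=3)
burst_results = [bucket.allow(now=0.0) for _ in range(4)]
after_refill = bucket.allow(now=1.0)
print(burst_results, after_refill)  # [True, True, True, False] True
```

A request passes only if every layer's bucket has a token, which is why a generous stage limit does not help once the account-level bucket is empty.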

7. SQS & SNS — Troubleshooting

7.1 Common SQS Issues

Messages Returning to Queue (Double Processing):

Symptom: Same message processed multiple times
Cause: Lambda takes longer than Visibility Timeout → message reappears

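Fix 1: Set the queue's Visibility Timeout to at least 6 × the Lambda function timeout
Fix 2: Have the consumer call ChangeMessageVisibility to extend the timeout during long processing
Fix 3: Make processing idempotent; a FIFO queue's deduplication only suppresses duplicate sends within a 5-minute window, not redelivery after a visibility timeout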

Messages Going to DLQ Unexpectedly:

Symptom: Messages appear in DLQ without apparent processing failure
Cause: maxReceiveCount exceeded (even if processing succeeded each time)

Investigation:
1. Check Lambda function logs for exceptions
2. Check if Lambda is deleting the message after processing
   (For Event Source Mapping, Lambda auto-deletes on success)
3. Check Visibility Timeout vs Lambda timeout
4. Check for Lambda concurrency throttling (causes receive without processing)

Large Message Handling:

# SQS Extended Client Pattern (messages > 256 KB)
from sqs_extended_client import SQSExtendedClientSession

session = SQSExtendedClientSession()
sqs = session.client('sqs',
    sqs_large_payload_support='my-bucket',  # S3 bucket for large payloads
    always_through_s3=False  # Only use S3 when necessary
)
# Messages > 256 KB automatically stored in S3; pointer in SQS
sqs.send_message(QueueUrl=queue_url, MessageBody=large_payload)

FIFO Queue Throughput Issues:

Symptom: FIFO queue throughput capped at 300 API calls per second (per action, without high throughput mode)
Cause 1: Messages sent one at a time → 300 messages per second
Cause 2: All messages share one MessageGroupId → strict ordering, no parallel consumers

Fix:
  - Batch up to 10 messages per call → up to 3,000 messages per second
  - Use multiple MessageGroupIds (e.g., OrderID) → each order processed independently, in parallel

7.2 SNS Delivery Failures

Delivery Status Logging:
Enable SNS delivery status logging for HTTP/Lambda/SQS subscribers to diagnose failures.

Delivery Status Logs go to CloudWatch Logs and show:
- Delivery attempt timestamps
- HTTP status from subscriber
- Error messages for failures

Dead-Letter Queue for SNS:
Configure a DLQ (SQS) for an SNS subscription to catch messages that fail delivery after all retries.

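Subscriber | Retry Policy | DLQ Support
HTTP/HTTPS | Configurable delivery policy (3 retries by default) | Yes
Lambda | Up to 100,015 attempts over 23 days (AWS-managed endpoint) | Yes
SQS | Up to 100,015 attempts over 23 days (AWS-managed endpoint) | Yes, for messages SNS cannot deliver to the queue
Email | Up to 50 attempts over 6 hours | No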

8. Caching Strategies

8.1 Caching Patterns

Pattern | Description | Use Case | Risk
Cache-Aside (Lazy Loading) | App checks cache → cache miss → fetch from DB → store in cache | General purpose reads | Cache can have stale data
Write-Through | Write to cache AND DB simultaneously | Write + read heavy | Wasted cache space for infrequently read items
Write-Behind (Write-Back) | Write to cache → async write to DB later | Write-heavy; DB offload | Risk of data loss if cache fails before sync
Read-Through | Cache fetches from DB on miss (cache manages itself) | Transparent to app | Less control
Refresh-Ahead | Cache proactively refreshes before TTL expires | Predictable access patterns | Can refresh items never re-requested

Cache-Aside Implementation (most common):

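import json

def get_user(user_id, redis_client, dynamodb_table):
    """Cache-aside read: try Redis first, fall back to DynamoDB on a miss."""
    cache_key = f"user:{user_id}"

    # 1. Check cache
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # Cache hit

    # 2. Cache miss: fetch from DynamoDB
    response = dynamodb_table.get_item(Key={'userId': user_id})
    user = response.get('Item')

    if user:
        # 3. Store in cache with TTL. boto3 returns numbers as Decimal, which
        #    json cannot serialize, so default=str stores them as strings.
        redis_client.setex(cache_key, 300, json.dumps(user, default=str))  # 5 min TTL

    return user

def update_user(user_id, new_data, redis_client, dynamodb_table):
    """Write path: update the database, then invalidate the cached copy."""
    # Update DB. The original elided these arguments; one possible way is to
    # SET every attribute in new_data, using placeholders to avoid reserved words.
    dynamodb_table.update_item(
        Key={'userId': user_id},
        UpdateExpression="SET " + ", ".join(f"#k{i} = :v{i}" for i in range(len(new_data))),
        ExpressionAttributeNames={f"#k{i}": k for i, k in enumerate(new_data)},
        ExpressionAttributeValues={f":v{i}": v for i, v in enumerate(new_data.values())},
    )

    # Invalidate the cache rather than updating it (avoids race conditions)
    redis_client.delete(f"user:{user_id}")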

Cache Stampede / Thundering Herd Problem:
When a cached item expires, multiple concurrent requests miss the cache simultaneously and all hit the database at once.

Solutions:

  • Mutex/Lock: Only one process fetches from DB; others wait.
  • Jitter on TTL: Add random time to TTL so items expire at different times.
  • Background refresh: Refresh before expiry using a background job.
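
The first two mitigations can be combined, as in the rough single-process sketch below. It uses an in-memory dictionary as a stand-in for the cache, and the class name `StampedeSafeCache` and its parameters are hypothetical. Across several processes or Lambda instances a distributed lock would be needed instead (for example Redis `SET key value NX EX seconds`).

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class StampedeSafeCache:
    """In-memory stand-in for a cache, showing TTL jitter plus a per-key lock."""

    def __init__(self, base_ttl: float = 300.0, jitter: float = 0.1) -> None:
        self.base_ttl = base_ttl
        self.jitter = jitter            # 0.1 means the TTL varies by +/-10%
        self._data: Dict[str, Tuple[float, Any]] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the _locks dict itself

    def _jittered_ttl(self) -> float:
        spread = self.base_ttl * self.jitter
        return self.base_ttl + random.uniform(-spread, spread)

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]             # cache hit
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                      # only one thread loads a given key
            entry = self._fresh(key)    # re-check: another thread may have filled it
            if entry is not None:
                return entry[1]
            value = loader()            # the single trip to the database
            self._data[key] = (time.monotonic() + self._jittered_ttl(), value)
            return value
```

With eight threads asking for the same missing key, `loader` runs once and the other seven threads wait on the lock and then read the cached value.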

8.2 Amazon ElastiCache — Redis vs Memcached

Feature | Redis | Memcached
Data Structures | Strings, Lists, Sets, Sorted Sets, Hashes, Streams | Strings only
Persistence | RDB snapshots + AOF (append-only file) | None
Replication | Multi-AZ with automatic failover (cluster mode disabled or enabled) | None
Cluster Mode | Yes (horizontal sharding) | Multi-node (keys partitioned by the client)
Pub/Sub | Yes | No
Lua Scripting | Yes | No
Sorted Sets (Leaderboards) | Yes | No
Session Store | Yes | Yes
Horizontal Scaling | Redis Cluster (16,384 hash slots distributed across shards) | Add nodes; multithreaded per node

When to Use Each:

Use Case | Choose
Session store (simple key-value) | Redis or Memcached
Leaderboards, ranked lists | Redis (Sorted Sets)
Pub/Sub messaging | Redis
Need persistence and backup | Redis
Need multi-AZ failover | Redis
Simple caching; multi-threaded scale | Memcached
Real-time analytics | Redis

Exam Tip: For almost all use cases on DVA-C02, Redis is the right answer. Memcached is simpler but lacks persistence, replication, and advanced data structures.

ElastiCache Security:

  • In-transit encryption: TLS between clients and cluster.
  • At-rest encryption: For Redis. Not available for Memcached.
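  • Redis AUTH: Token (password) authentication for Redis. The token is set on the cluster, clients send it with the AUTH command, and in-transit encryption must be enabled.
  • IAM-based authentication: Available for Redis 7.0 and later using IAM identities (newer feature).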
  • Security Groups: Control network access to ElastiCache cluster.

9. Common Error Patterns & HTTP Status Codes

9.1 AWS API Error Categories

┌────────────────────────────────────────────────────────────────────────────┐
│                       AWS Error Classification                               │
│                                                                              │
│  4xx — Client Errors (your fault, fix the request)                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ 400 Bad Request: Invalid parameter, malformed request body          │   │
│  │ 401 Unauthorized: No/expired credentials                             │   │
│  │ 403 Forbidden: Valid credentials but no permission                   │   │
│  │ 404 Not Found: Resource doesn't exist                               │   │
│  │ 409 Conflict: Resource already exists; state conflict               │   │
│  │ 429 Too Many Requests: Throttled; retry with backoff                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  5xx — Server Errors (AWS's fault, always retry with backoff)               │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ 500 Internal Server Error: AWS service error                        │   │
│  │ 503 Service Unavailable: Service is down/overloaded                 │   │
│  │ 504 Gateway Timeout: Request took too long                          │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────────────┘

Retry Decision:

RETRYABLE_ERRORS = {
    # 5xx — always retry
    500: True,
    502: True,
    503: True,
    504: True,
    # 4xx throttling — retry with backoff
    429: True,
    # Client errors — don't retry
    400: False,
    401: False,
    403: False,
    404: False,
}
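
A retry loop built on that decision might look like the sketch below, which uses capped exponential backoff with full jitter. The function name `call_with_backoff` and the `request` callable returning `(http_status, body)` are hypothetical. In practice the AWS SDKs already retry throttling and 5xx errors, so hand-written loops like this are mainly for custom HTTP clients.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def call_with_backoff(
    request: Callable[[], Tuple[int, object]],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, object]:
    """Call `request` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = request()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE_STATUS:
            break
        # Full jitter: wait a random time up to the capped exponential delay
        cap = min(max_delay, base_delay * (2 ** attempt))
        sleep(random.uniform(0, cap))
        status, body = request()
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.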

9.2 Service-Specific Error Reference

Lambda:

Error | HTTP Code | Meaning
TooManyRequestsException | 429 | Concurrency limit hit; throttled
ResourceConflictException | 409 | Concurrent update conflict
InvalidParameterValueException | 400 | Invalid function configuration

DynamoDB:

Error | Retryable?
ProvisionedThroughputExceededException | Yes
RequestLimitExceeded | Yes
InternalServerError | Yes
ConditionalCheckFailedException | No (expected condition failure)
ValidationException | No (fix the request)
ResourceNotFoundException | No (resource doesn't exist)

S3:

Error | HTTP Code | Meaning
NoSuchBucket | 404 | Bucket doesn't exist
NoSuchKey | 404 | Object key doesn't exist
AccessDenied | 403 | Permissions issue
SlowDown | 503 | S3 throttle; use backoff
BucketAlreadyExists | 409 | Bucket name already taken globally

10. Application Optimization Patterns

10.1 S3 Performance Optimization

Request Rate Performance:

  • 3,500 PUT/COPY/POST/DELETE requests per second per prefix.
  • 5,500 GET/HEAD requests per second per prefix.
  • Distribute objects across multiple prefixes to achieve higher aggregate throughput.

❌ All objects in one prefix:
s3://bucket/uploads/file1.jpg  ← every upload shares a single 3,500 PUT/s limit

✅ Distributed across prefixes:
s3://bucket/2024/01/file1.jpg  ← 3,500 PUT/s
s3://bucket/2024/02/file2.jpg  ← another 3,500 PUT/s
s3://bucket/2024/03/file3.jpg  ← another 3,500 PUT/s
Total: 10,500 PUT/s
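
When keys have no natural spread such as dates, one alternative is to derive the prefix from a hash of the object name. The sketch below shows that swapped-in technique; the function names, the `uploads` root, and the two-digit shard format are all illustrative assumptions.

```python
import hashlib


def sharded_key(filename: str, prefix_count: int = 16, root: str = "uploads") -> str:
    """Return an object key whose prefix is chosen by hashing the filename."""
    if prefix_count < 1:
        raise ValueError("prefix_count must be at least 1")
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % prefix_count   # stable shard number for this name
    return f"{root}/{shard:02d}/{filename}"


def aggregate_put_limit(prefix_count: int, per_prefix: int = 3500) -> int:
    """Aggregate PUT/s across prefixes, using the per-prefix figure from the notes."""
    return prefix_count * per_prefix
```

The same filename always maps to the same prefix, so readers can recompute the key without a lookup table.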

Byte-Range Fetches:
Download specific byte ranges of a large object in parallel.

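# Download a 1 GB file in 10 parallel 100 MB chunks
import boto3
import concurrent.futures

s3 = boto3.client('s3')
CHUNK = 100_000_000  # 100 MB

def download_chunk(bucket, key, start, end, part_num):
    # The Range header is inclusive on both ends
    response = s3.get_object(Bucket=bucket, Key=key,
                              Range=f'bytes={start}-{end}')
    return part_num, response['Body'].read()

def download_parallel(bucket, key, parts=10):
    with concurrent.futures.ThreadPoolExecutor(max_workers=parts) as executor:
        futures = [executor.submit(download_chunk, bucket, key,
                                   i * CHUNK, (i + 1) * CHUNK - 1, i)
                   for i in range(parts)]
        # Collect the parts and reassemble them in order
        chunks = dict(f.result() for f in futures)
    return b''.join(chunks[i] for i in range(parts))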
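
For objects whose size is not an exact multiple of the chunk size, the ranges have to be computed from the object length (available as `ContentLength` from `head_object`). A small sketch of that arithmetic, with hypothetical helper names:

```python
from typing import List, Tuple


def byte_ranges(object_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte offsets that cover the whole object."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, object_size) - 1)  # last chunk may be short
        for start in range(0, object_size, chunk_size)
    ]


def range_header(start: int, end: int) -> str:
    """Format one range as the value of the HTTP Range header."""
    return f"bytes={start}-{end}"
```

For example, `byte_ranges(250, 100)` yields three ranges, the last of which covers only bytes 200 to 249.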

10.2 Kinesis Optimization

Shard Calculation:

Required Shards (write) = CEIL(Max Writes per second / 1,000) 
                        OR CEIL(Max MB per second / 1)
                        — use whichever is larger

Required Shards (read, standard consumers) = CEIL(Max read MB per second / 2)

Example:
  Write: 5,000 records/s at 500 bytes each = 2.5 MB/s
  Required write shards: max(CEIL(5000/1000), CEIL(2.5/1)) = max(5, 3) = 5 shards
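
The same arithmetic as a small sketch. It assumes the per-shard limits used in these notes (1,000 records/s and 1 MB/s for writes, 2 MB/s for standard reads) and treats 1 MB as 1,000,000 bytes, as the worked example does; the function names are hypothetical.

```python
import math


def required_write_shards(records_per_second: float, avg_record_bytes: float) -> int:
    """Shards needed for writes: the larger of the record-count and byte-rate limits."""
    by_count = math.ceil(records_per_second / 1000)
    by_bytes = math.ceil(records_per_second * avg_record_bytes / 1_000_000)
    return max(by_count, by_bytes, 1)   # a stream has at least one shard


def required_read_shards(read_mb_per_second: float) -> int:
    """Shards needed for standard (shared-throughput) consumers at 2 MB/s per shard."""
    return max(math.ceil(read_mb_per_second / 2), 1)
```

With the example above, `required_write_shards(5000, 500)` gives 5, because the record-count limit (5) exceeds the byte-rate limit (3).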

Resharding:

Operation | Action | Time
Split Shard | One shard → two shards (increase capacity) | Minutes
Merge Shards | Two adjacent shards → one shard (decrease capacity) | Minutes
Scale Up/Down | Update shard count via console/CLI (UpdateShardCount) | Minutes

Parent vs Child Shards: After a split/merge, parent shards are CLOSED to new writes but remain readable until their existing records expire. To preserve ordering, consumers must finish reading a parent shard before reading its child shards; the KCL handles this automatically.

Kinesis Producer Library (KPL) vs Kinesis Client Library (KCL):

Aspect | KPL (Producer) | KCL (Consumer)
Purpose | High-throughput producer | Robust consumer (checkpoint, lease, multi-worker)
Batching | Yes (aggregation + collection) | N/A
Retry | Yes (with backoff) | Yes (via checkpointing)
Language | Java (C++ core) | Java; Python, Ruby, .NET via the MultiLangDaemon
Async | Yes | Yes

10.3 ECS and Fargate Optimization

Task CPU and Memory:

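  • CPU values (for Fargate): 256 (.25 vCPU), 512 (.5 vCPU), 1024 (1 vCPU), 2048 (2 vCPU), 4096 (4 vCPU). Larger sizes of 8192 (8 vCPU) and 16384 (16 vCPU) are available on newer Linux platform versions.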
  • Memory must be within valid ranges for the chosen CPU.
  • Under-provisioning CPU causes slow performance; over-provisioning wastes cost.
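
The CPU/memory pairing rule can be made concrete with a small lookup. The table below covers only the sizes from 256 to 4096 CPU units and reflects the commonly documented Fargate combinations. Treat it as an assumption to verify against the current Fargate documentation; the names are illustrative.

```python
from typing import Dict, List

# Valid memory values (MiB) per Fargate CPU setting, for 256 to 4096 CPU units
FARGATE_MEMORY_MIB: Dict[int, List[int]] = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),     # 1 GB to 4 GB in 1 GB steps
    1024: list(range(2048, 8192 + 1, 1024)),    # 2 GB to 8 GB
    2048: list(range(4096, 16384 + 1, 1024)),   # 4 GB to 16 GB
    4096: list(range(8192, 30720 + 1, 1024)),   # 8 GB to 30 GB
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True if the CPU/memory pair is one of the combinations listed above."""
    return memory_mib in FARGATE_MEMORY_MIB.get(cpu_units, [])
```

A check like this can catch an invalid task definition (for example 256 CPU units with 4096 MiB) before registration fails.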

ECS Auto Scaling:

Scale Target | Metric | Typical Policy
CPU Utilization | ECSServiceAverageCPUUtilization | Scale out if > 70%
Memory Utilization | ECSServiceAverageMemoryUtilization | Scale out if > 80%
ALB Request Count | ALBRequestCountPerTarget | Scale with traffic
SQS Queue Depth | ApproximateNumberOfMessagesVisible | Scale the worker tier on backlog
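
For the two predefined ECS metrics, a target tracking policy is described by a small configuration structure. The sketch below builds that structure only; the helper name and default cooldowns are assumptions for illustration, not recommended values.

```python
from typing import Any, Dict

ECS_PREDEFINED_METRICS = {
    "ECSServiceAverageCPUUtilization",
    "ECSServiceAverageMemoryUtilization",
}


def target_tracking_config(
    metric_type: str,
    target_value: float,
    scale_out_cooldown: int = 60,
    scale_in_cooldown: int = 300,
) -> Dict[str, Any]:
    """Build a TargetTrackingScalingPolicyConfiguration for an ECS service."""
    if metric_type not in ECS_PREDEFINED_METRICS:
        raise ValueError(f"Unsupported predefined metric: {metric_type}")
    return {
        "TargetValue": float(target_value),   # e.g. 70.0 for 70% average CPU
        "PredefinedMetricSpecification": {"PredefinedMetricType": metric_type},
        "ScaleOutCooldown": scale_out_cooldown,   # seconds
        "ScaleInCooldown": scale_in_cooldown,     # seconds
    }
```

The dictionary would be passed as `TargetTrackingScalingPolicyConfiguration` to Application Auto Scaling's `put_scaling_policy`, with `PolicyType="TargetTrackingScaling"`, `ServiceNamespace="ecs"`, and `ScalableDimension="ecs:service:DesiredCount"`.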

Fargate Spot:
Use Fargate Spot for fault-tolerant, stateless workloads. Up to 70% cost savings vs on-demand Fargate. Tasks may be interrupted with 2-minute notice.


11. Exam Tips & Quick Reference

Scenario-to-Answer Mapping

Scenario Keyword / Requirement | Correct Answer
"Find which Lambda invocations are slowest" | CloudWatch Logs Insights query on REPORT logs
"Trace a request from API Gateway through Lambda to DynamoDB" | AWS X-Ray with Active Tracing on Lambda + API Gateway stage
"Who deleted the S3 bucket at 2AM?" | CloudTrail (lookup-events for DeleteBucket)
"Alert when Lambda error rate exceeds 5%" | CloudWatch Alarm on metric math (Errors / Invocations) + SNS notification
"Detect unusual spike in EC2 RunInstances API calls" | CloudTrail Insights
"Monitor and alert when DynamoDB has hot partitions" | CloudWatch Contributor Insights on DynamoDB
"Lambda execution time is slow; CPU is the bottleneck" | Increase Lambda memory (CPU scales with memory)
"Messages processed multiple times from SQS" | Increase SQS Visibility Timeout (longer than the processing time)
"DynamoDB returns ProvisionedThroughputExceededException" | Exponential backoff; increase capacity; fix hot partition
"API Gateway returns 502 from Lambda" | Check Lambda response format (body must be a string)
"API Gateway returns 504 Gateway Timeout" | Lambda execution exceeds 29s API GW timeout; use async integration
"Cache session data with persistence and failover" | ElastiCache Redis (Multi-AZ)
"Cache simple key-value without persistence" | ElastiCache Memcached or Redis
"Kinesis consumers falling behind; IteratorAge increasing" | Increase shards or enable Enhanced Fan-out
"S3 performance slow for high-throughput writes" | Distribute across multiple key prefixes
"X-Ray missing traces from production API" | Check sampling rules; ensure X-Ray enabled on API Gateway stage AND Lambda
"Lambda cold starts causing API latency spikes" | Provisioned Concurrency
"Detect that CloudFormation template has drifted" | CloudFormation Drift Detection

Common Traps

  • CloudTrail is NOT real-time: Log files arrive within 15 minutes. For real-time alerts on API activity, stream CloudTrail to CloudWatch Logs and create metric filters.
  • X-Ray daemon vs SDK: The SDK instruments your code and sends data. The daemon collects and forwards to AWS. Both are needed. On Lambda, the daemon is included — just enable Active Tracing.
  • Annotations are indexed; Metadata is not: If you need to filter/search traces by a field, it must be an Annotation. Metadata is for debugging details only.
  • CloudWatch Logs Insights vs Metric Filters: Metric Filters create new metrics from incoming log events (useful for alarms). Logs Insights queries historical logs for ad-hoc analysis. Use the right tool for the right job.
  • ElastiCache and Lambda in VPC: Lambda must be attached to the same VPC as the ElastiCache cluster, or to a VPC with network connectivity to it (e.g., VPC peering). A Lambda function that is not attached to a VPC cannot access ElastiCache.
  • DAX vs ElastiCache for DynamoDB: DAX is DynamoDB-compatible (same API). ElastiCache is a general-purpose cache requiring manual invalidation logic. DAX is a write-through cache, so it keeps cached items current for writes made through DAX.
  • SQS Visibility Timeout is PER RECEIVE: The visibility timeout clock starts each time a message is received. If the consumer does not delete the message before the timeout expires (for example, a slow or throttled Lambda function), the message becomes visible again and a second consumer may receive it, even if the first has not finished.
  • IteratorAge ≠ Lambda Errors: High IteratorAge means Lambda is processing slowly, not necessarily failing. It could be under-provisioned memory/CPU, not an error condition.

Key Terms — Domain 4

Term | One-Line Definition
Trace | An X-Ray record of the complete end-to-end journey of a single request
Segment | An X-Ray record of work done by one service/application within a trace
Subsegment | A granular unit of work within a segment (e.g., a DB query or HTTP call)
Annotation | A key-value pair on an X-Ray segment that is INDEXED and filterable
Metadata | A key-value pair on an X-Ray segment that is NOT indexed; debug-only
Sampling Rate | The percentage of requests X-Ray records and traces
Management Events | CloudTrail records of control-plane API calls (create, delete, update)
Data Events | CloudTrail records of data-plane operations (S3 GetObject, Lambda Invoke)
CloudTrail Insights | Feature that detects anomalous API call volume patterns
Cache-Aside | Caching pattern where the app checks the cache before querying the database
Thundering Herd | Cache stampede where many concurrent requests miss and hit the DB simultaneously
IteratorAge | Age of the last record processed from a Kinesis/DynamoDB Stream; measures lag
Hot Partition | A DynamoDB partition receiving disproportionately high read/write traffic
Write Sharding | Appending a random suffix to a partition key to distribute writes across partitions
Composite Alarm | A CloudWatch alarm combining multiple individual alarms with AND/OR logic
Metric Filter | A pattern that extracts values from CloudWatch Logs to create CloudWatch metrics

End of Domain 4 — You have completed all four DVA-C02 domain notes.


DVA-C02 — Full Exam Summary

Domain | Weight | Key Services
Domain 1: Development with AWS Services | 32% | Lambda, API Gateway, DynamoDB, S3, SQS, SNS, EventBridge, Kinesis, Step Functions, SAM
Domain 2: Security | 26% | IAM, STS, Cognito, KMS, Secrets Manager, SSM Parameter Store, WAF, ACM, SigV4
Domain 3: Deployment | 24% | CodeCommit, CodeBuild, CodeDeploy, CodePipeline, Elastic Beanstalk, CloudFormation, SAM, ECS, ECR, Lambda Aliases
Domain 4: Troubleshooting & Optimization | 18% | CloudWatch, X-Ray, CloudTrail, ElastiCache, Performance Tuning

Top 10 Things to Master Before the Exam:

  1. Lambda invocation models (sync vs async vs event source mapping) and concurrency math
  2. DynamoDB key design, RCU/WCU calculation, GSI vs LSI, and ProvisionedThroughputExceededException
  3. IAM policy evaluation order (Explicit Deny → SCP → Resource Policy → Identity Policy → Permissions Boundary → Session Policy)
  4. Cognito User Pool (authentication/JWT) vs Identity Pool (temporary AWS credentials)
  5. KMS envelope encryption and the GenerateDataKey API
  6. CodeDeploy deployment lifecycle hooks and traffic shifting strategies
  7. CloudFormation Change Sets, nested stacks, cross-stack references, and DeletionPolicy
  8. Elastic Beanstalk deployment policies and .ebextensions with leader_only
  9. X-Ray trace structure, annotations vs metadata, and sampling rules
  10. SQS Visibility Timeout and DLQ configuration (maxReceiveCount, FIFO DLQ rule)
