Domain 4: Troubleshooting and Optimization

AWS Certified Developer – Associate (DVA-C02)

Exam Code: DVA-C02 | Level: Associate
Domain Weight: 18% | Total Domains: 4 | Passing Score: 720/1000

Amazon CloudWatch
AWS CloudTrail
- 2.1 Event Types & Insights
Amazon EventBridge
- 3.1 Rules, Targets & Scheduling
AWS X-Ray
- 4.1 Core Concepts & Sampling
- 4.2 X-Ray on EC2, Lambda, ECS & Elastic Beanstalk
AWS CLI & SDK Troubleshooting
- 5.1 Credential Resolution & Common Errors
- 5.2 Exponential Backoff & API Rate Limits
Lambda Troubleshooting & Optimization
DynamoDB Troubleshooting & Optimization
SQS Troubleshooting
Kinesis Troubleshooting
API Gateway Troubleshooting
S3 Troubleshooting & Performance
EC2 & Network Troubleshooting
RDS & ElastiCache Optimization
Exam Tips & Quick Reference

1. Amazon CloudWatch

CloudWatch is the primary observability service for AWS. It collects metrics, logs, and events from AWS services and your applications, and lets you take automated actions based on thresholds.

1.1 Metrics — Default vs Custom vs Agent

EC2 metrics available by default (no agent needed):

Metric	Available Without Agent
`CPUUtilization`	Yes
`NetworkIn` / `NetworkOut`	Yes
`DiskReadOps` / `DiskWriteOps` (instance store only)	Yes
`StatusCheckFailed`	Yes
RAM / Memory utilization	No — requires CloudWatch Agent
Disk space used/free	No — requires CloudWatch Agent
Swap space	No — requires CloudWatch Agent
Process count	No — requires CloudWatch Agent

Critical Exam Rule: Anything that lives inside the EC2 operating system (RAM, disk space, swap, processes) requires the CloudWatch Agent. AWS can only see network-level and hypervisor-level metrics from the outside.

Monitoring frequency:

Mode	Granularity	Cost
Basic Monitoring	Every 5 minutes	Free
Detailed Monitoring	Every 1 minute	Additional charge

Exam Tip: Enable Detailed Monitoring when faster Auto Scaling reactions are needed. ASG scaling policies respond to CloudWatch data — 1-minute granularity allows the ASG to react up to 5× faster than with Basic Monitoring.

Custom Metrics:

# Publish a custom metric via CLI
aws cloudwatch put-metric-data \
  --namespace "MyApplication" \
  --metric-name "ActiveUsers" \
  --value 247 \
  --unit Count \
  --dimensions Environment=Production,Region=us-east-1

Setting	Value
Standard resolution	1-minute granularity
High resolution	1, 5, 10, or 30 second granularity
Timestamp — past limit	Up to 2 weeks in the past
Timestamp — future limit	Up to 2 hours in the future
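The timestamp window and resolution settings above can be sketched as a small helper that builds one MetricData entry of the kind PutMetricData accepts. This is a minimal illustration, not an official client: `build_datum` and its arguments are hypothetical names, and the check only mirrors the two-week and two-hour limits listed in the table.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(weeks=2)    # oldest accepted timestamp
MAX_FUTURE = timedelta(hours=2)  # furthest accepted future timestamp


def build_datum(name, value, timestamp, now=None, high_resolution=False):
    """Return one MetricData entry, rejecting timestamps outside the window."""
    now = now or datetime.now(timezone.utc)
    if timestamp < now - MAX_PAST:
        raise ValueError("timestamp is more than 2 weeks in the past")
    if timestamp > now + MAX_FUTURE:
        raise ValueError("timestamp is more than 2 hours in the future")
    return {
        "MetricName": name,
        "Value": value,
        "Unit": "Count",
        "Timestamp": timestamp,
        # StorageResolution 1 = high resolution, 60 = standard resolution
        "StorageResolution": 1 if high_resolution else 60,
    }


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
datum = build_datum("ActiveUsers", 247, now - timedelta(days=3), now=now)
print(datum["StorageResolution"])  # 60 (standard resolution)
```

A list of such entries would normally be passed as the MetricData argument together with a namespace.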

Exam Trap: put-metric-alarm creates an alarm on a metric; it does NOT create the metric itself. If no data has been published for that metric (for example via put-metric-data), the alarm is created but stays in INSUFFICIENT_DATA and cannot fire until data points arrive.

CloudWatch Agents:

Agent	Sends Logs	Sends System Metrics	Recommended
CloudWatch Logs Agent (legacy)	Yes	No	No — use Unified Agent
CloudWatch Unified Agent	Yes	Yes (RAM, disk, swap, processes, netstat)	Yes

The Unified Agent configuration can be stored in SSM Parameter Store for centralized management.
On on-premises servers: install the Unified Agent and configure it with IAM user credentials (no Instance Profile available outside EC2).

1.2 CloudWatch Logs — Features & Subscriptions

Structure:

Log Group (/aws/lambda/my-function)
  └── Log Stream (2024/01/15/[$LATEST]abc123)
        └── Log Events (each individual line with timestamp)

Sources that send logs automatically (no agent required):
Lambda, ECS, Elastic Beanstalk, API Gateway (when logging enabled at stage level), VPC Flow Logs, CloudTrail, Route 53.

Sources that require the CloudWatch Unified Agent:
EC2 instances, on-premises servers.

CloudWatch Logs Insights — query language:

# Find the 10 most expensive Lambda invocations by duration
fields @timestamp, @duration, @requestId
| filter @message like /REPORT/
| sort @duration desc
| limit 10

Queries stored data — not real-time.
Can query multiple Log Groups simultaneously.
Queries can be saved and added to CloudWatch Dashboards.

Metric Filters:
Create a CloudWatch metric by counting pattern matches in log events.

Log Group → Metric Filter (count lines matching "ERROR") → CloudWatch Metric
                                                            → CloudWatch Alarm
                                                               → SNS Notification

Critical: Metric Filters are not retroactive. They only process new log events received after the filter is created. Historical log data is not counted.

Real-time log delivery (Subscription Filters):

Destination	Use Case
Lambda	Real-time processing, custom filtering, enrichment
Kinesis Data Streams	Fan-out to multiple consumers
Kinesis Firehose	Near-real-time delivery to S3, OpenSearch, Splunk

S3 export via CreateExportTask:

Batch export of log data to S3. Takes up to 12 hours to complete.
NOT suitable for real-time use cases. For near-real-time delivery to S3, use a Subscription Filter → Kinesis Firehose → S3.

Cross-account log aggregation:
Account B exposes a CloudWatch Logs destination backed by a Kinesis Data Stream. Account A creates a Subscription Filter that sends logs to that destination, so Account B's stream aggregates logs from multiple accounts for centralized analysis.

1.3 Alarms, Composite Alarms & Synthetics

Alarm states:

OK — metric is within the defined threshold.
ALARM — metric has breached the threshold.
INSUFFICIENT_DATA — not enough data points to evaluate (new alarm, or metric stopped publishing).

Alarm actions:

Action Type	Options
EC2 Actions	Stop, Terminate, Reboot, or Recover the instance
Auto Scaling	Scale out or scale in the ASG
SNS notification	Trigger an SNS topic → email, Lambda, SQS, etc.

EC2 Instance Recovery: Triggered by a StatusCheckFailed_System alarm. The CloudWatch recover action moves the instance to new hardware while preserving the private IP, public IP, Elastic IP, metadata, and placement group.

Composite Alarms:
Combine multiple alarms using AND/OR Boolean logic. The composite alarm fires only when the specified combination of conditions is met, which significantly reduces alert noise.

Example: Only alert on-call if BOTH of these are true:
  - CPUUtilization > 90%   AND
  - MemoryUtilization > 80%

Using separate alarms would generate noise when only one threshold is breached.
A Composite Alarm fires only when both underlying alarms are in ALARM state.
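The AND example above maps onto a composite alarm rule expression. The sketch below only builds that expression string; `composite_rule` is a hypothetical helper and `HighMemoryAlarm` is an assumed alarm name. The resulting string is what would be supplied as the alarm rule when creating the composite alarm.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarm names into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


rule = composite_rule(["HighCPUAlarm", "HighMemoryAlarm"])
print(rule)  # ALARM("HighCPUAlarm") AND ALARM("HighMemoryAlarm")
```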

Testing alarms without generating load:

# Force an alarm into ALARM state for testing purposes
aws cloudwatch set-alarm-state \
  --alarm-name "HighCPUAlarm" \
  --state-value ALARM \
  --state-reason "Testing alarm action"

CloudWatch Synthetics Canary:

Scripts that proactively monitor APIs, URLs, and user flows before real users are affected.
Written in Node.js or Python. Use headless Chrome for browser-based tests.
Run on a schedule or on-demand.
Integrate with CloudWatch Alarms to trigger automated remediation.

2. AWS CloudTrail

CloudTrail records all API calls made to your AWS account. It answers the question: who did what, when, from where, and to which resource?

2.1 Event Types & Insights

Event Type	Enabled by Default	Description
Management Events	Yes	Control plane: CreateBucket, PutRolePolicy, RunInstances, DeleteTable
Data Events	No (additional cost)	Data plane: S3 GetObject/PutObject, Lambda Invoke, DynamoDB GetItem
Insights Events	No (additional cost)	Detects unusual patterns in write management API call rates

Retention:

CloudTrail retains events for 90 days in the console.
For longer retention: create a Trail to continuously deliver events to an S3 bucket.
Analyze historical CloudTrail data in S3 using Amazon Athena (serverless SQL queries).

CloudTrail + EventBridge — automated security response:

User calls DeleteBucket on a production S3 bucket
  ──► CloudTrail records the API call
       ──► EventBridge matches event (source: aws.s3, eventName: DeleteBucket)
            ──► SNS alert to security team
            ──► Lambda function attempts to restore bucket from backup

3. Amazon EventBridge

EventBridge is a serverless event router that connects AWS services, SaaS partners, and your own applications using events. It supersedes and extends CloudWatch Events.

3.1 Rules, Targets & Scheduling

Three event bus types:

Type	Receives Events From
Default Event Bus	All AWS services automatically
Partner Event Bus	SaaS providers (Datadog, Zendesk, Shopify, PagerDuty)
Custom Event Bus	Your applications via `PutEvents` API

Rule types:

// Pattern-based rule — react to specific events
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": { "state": ["stopped", "terminated"] }
}

// Schedule-based rule — trigger on a cron or rate expression
// rate(5 minutes)        — every 5 minutes
// cron(0 9 * * ? *)      — every day at 9 AM UTC
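
To make the pattern semantics concrete, here is a deliberately simplified matcher, assuming only exact-value lists and nested objects. `matches` is a hypothetical function, not EventBridge's implementation; real patterns also support prefix, numeric, and other operators that this sketch ignores.

```python
def matches(pattern, event):
    """Return True if every pattern field is present in the event with an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(matches(pattern, event))  # True
```

Fields in the event that the pattern does not mention (such as `instance-id` here) are ignored, which is why narrow patterns still match rich events.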

Targets (up to 5 per rule): Lambda, SQS, SNS, Kinesis Data Streams, Kinesis Firehose, Step Functions, CodePipeline, EC2 Run Command, API Gateway, and more.

Key EventBridge features:

Archive and Replay: Archive all or filtered events. Replay archived events to reprocess historical data.
Schema Registry: Auto-discovers and stores event schemas. Generates typed code bindings for Python, Java, TypeScript.
EventBridge Pipes: Point-to-point integration with optional enrichment (Lambda or Step Functions) between source and target.

Critical: EventBridge is the evolution of CloudWatch Events. They share the same underlying API — the default EventBridge event bus IS the CloudWatch Events bus. For all new development, use EventBridge.

4. AWS X-Ray

X-Ray provides distributed tracing across microservices. It shows you the complete request path — which services were called, how long each took, and where errors occurred.

4.1 Core Concepts & Sampling

Key concepts:

Concept	Description
Segment	Data about a single service's work on a request (start time, end time, errors)
Subsegment	More granular detail within a segment (a DynamoDB call, an HTTP call to an external API)
Trace	A collection of segments from all services that handled one end-to-end request
Service Map	Visual graph showing all services and their connection latencies and error rates
Sampling	Controls what percentage of requests generate trace data (to reduce cost)
Annotations	Key-value pairs that are indexed — use for filtering and searching traces
Metadata	Key-value pairs that are NOT indexed — for additional context, not searchable

Critical: Annotations are indexed and filterable in the X-Ray console. Metadata is not. If you need to search traces by a custom attribute, it must be stored as an Annotation.
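As a rough illustration of where the two kinds of data live, the sketch below assembles a minimal segment document by hand. In practice the X-Ray SDK builds this for you; `new_segment`, `customer_tier`, and the `cart` payload are hypothetical, and the key check reflects the assumption that annotation keys are limited to letters, digits, and underscores with scalar values.

```python
import os
import re
import time


def new_segment(name, annotations, metadata):
    """Build a minimal segment document with indexed and non-indexed data."""
    for key, value in annotations.items():
        if not re.fullmatch(r"[A-Za-z0-9_]+", key):
            raise ValueError(f"invalid annotation key: {key}")
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError("annotation values must be string, number, or boolean")
    now = time.time()
    return {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": f"1-{int(now):08x}-{os.urandom(12).hex()}",
        "start_time": now,
        "end_time": now,
        "annotations": annotations,  # indexed: usable for filtering traces
        "metadata": metadata,        # not indexed: context only
    }


segment = new_segment(
    "checkout",
    annotations={"customer_tier": "premium"},
    metadata={"cart": {"items": 3}},
)
print(sorted(segment["annotations"]))  # ['customer_tier']
```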

Default sampling rule:

First request per second per host (reservoir) is always sampled.
5% of additional requests are sampled.
Sampling rules can be customized without redeploying code.
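
The default rule translates into simple arithmetic. The helper below is a hypothetical back-of-the-envelope estimate for a single host, assuming a reservoir of 1 request per second and a 5% fixed rate as stated above.

```python
def expected_sampled_per_second(requests_per_second, reservoir=1, fixed_rate=0.05):
    """Estimate traces sampled per second: reservoir first, then a fixed rate on the rest."""
    guaranteed = min(requests_per_second, reservoir)
    extra = max(requests_per_second - reservoir, 0) * fixed_rate
    return guaranteed + extra


print(expected_sampled_per_second(101))  # 1 + 100 * 0.05 = 6.0
```

At 101 requests per second, roughly 6 traces per second are recorded, which is why sampling keeps tracing costs low on busy services.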

Two IAM policies required for X-Ray:

AWSXRayDaemonWriteAccess: required on the service/role that sends traces (daemon, Lambda execution role).
AWSXrayReadOnlyAccess: required on the user/role that reads and analyzes traces in the console.

4.2 X-Ray on EC2, Lambda, ECS & Elastic Beanstalk

On EC2:

Your Application (instrumented with X-Ray SDK)
    │
    │ UDP port 2000 (localhost)
    ▼
X-Ray Daemon (separate process on the instance)
    │
    │ HTTPS → X-Ray service endpoint
    ▼
AWS X-Ray

Requirements:

Install and run the X-Ray daemon on the instance.
Instrument application code with the X-Ray SDK.
EC2 instance role must have AWSXRayDaemonWriteAccess.

On Lambda:

Lambda (Active Tracing enabled)
    │
    │ Lambda provides a managed X-Ray daemon automatically
    ▼
AWS X-Ray

Requirements:

Enable Active Tracing in Lambda function configuration.
Lambda execution role must have AWSXRayDaemonWriteAccess.
Optionally instrument code with X-Ray SDK to create custom subsegments.

Critical: Lambda manages the X-Ray daemon automatically — you do NOT install or configure a daemon on Lambda. Only Active Tracing and the IAM permission are required.

On ECS:

Launch Type	X-Ray Daemon Pattern
EC2	Run X-Ray daemon as a sidecar container in each task definition, OR as a daemon service (one per EC2 host)
Fargate	Sidecar container only — there is no underlying host to install a daemon on

ECS task role must include AWSXRayDaemonWriteAccess.

On Elastic Beanstalk:

The X-Ray daemon is pre-installed on most EB platform versions.
Enable via the console (Configuration → Software → X-Ray daemon) or via .ebextensions.
Instance profile must include AWSXRayDaemonWriteAccess.

5. AWS CLI & SDK Troubleshooting

5.1 Credential Resolution & Common Errors

AWS CLI and SDK credential resolution order (first match wins):

1. Command-line flags       (--profile, explicit parameters)
2. Environment variables    (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN)
3. ~/.aws/credentials file  (named profiles)
4. ~/.aws/config file       (named profiles with role assumption)
5. ECS container credentials
6. EC2 Instance Profile (IMDS) ← lowest priority

Critical: Environment variables take priority over the Instance Profile. If credentials are accidentally set as environment variables on an EC2 instance, the application will use those credentials instead of the Instance Profile — even if the Instance Profile has the correct permissions. Always remove hardcoded credentials from EC2 instances.
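
The first-match-wins behavior can be modeled with a toy resolver. This is a simplified simulation of the six-step list above, not the actual SDK logic; `resolve_source` and its boolean arguments are hypothetical.

```python
def resolve_source(cli_profile=None, env=None, credentials_file=False,
                   config_file=False, container=False, instance_profile=False):
    """Return the name of the first credential source that is available."""
    env = env or {}
    chain = [
        ("command-line flags", cli_profile is not None),
        ("environment variables",
         "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env),
        ("credentials file", credentials_file),
        ("config file", config_file),
        ("ECS container credentials", container),
        ("EC2 instance profile", instance_profile),
    ]
    for name, available in chain:
        if available:
            return name
    return None


# Stray environment variables shadow a correctly configured instance profile
stale_env = {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "example"}
print(resolve_source(env=stale_env, instance_profile=True))  # environment variables
```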

The --dry-run flag:
Tests whether the calling identity has the required permissions without actually executing the action.

# Test EC2 launch permissions without actually launching an instance
aws ec2 run-instances --dry-run --image-id ami-xxx --instance-type t3.micro

# Success response (has permission):   DryRunOperation
# Failure response (no permission):    UnauthorizedOperation

Common CLI and SDK errors:

Error	Cause	Fix
`aws: command not found`	CLI executable not in PATH	Add CLI installation directory to PATH
`UnauthorizedOperation`	IAM permission denied	Add the required IAM action to the policy
`InvalidClientTokenId`	Incorrect or expired access key ID	Verify credentials with `aws sts get-caller-identity`
`ExpiredTokenException`	STS temporary token has expired	Re-run `sts get-session-token` or refresh the role credentials
`NoCredentialProviders`	No credentials found in any location in the chain	Configure credentials or attach an Instance Profile
`ThrottlingException`	API call rate limit exceeded	Implement exponential backoff; request quota increase if sustained
`InsufficientCapabilitiesException`	CloudFormation needs IAM capability	Add `--capabilities CAPABILITY_NAMED_IAM`

Critical debugging commands:

# Identify which identity the CLI is currently using
aws sts get-caller-identity

# Decode the encoded error message returned in Access Denied responses
aws sts decode-authorization-message --encoded-message <value>

# Get temporary credentials using MFA
aws sts get-session-token \
  --serial-number arn:aws:iam::123456789012:mfa/alice \
  --token-code 123456

5.2 Exponential Backoff & API Rate Limits

Which errors to retry:

Category	Retry?	Examples
5xx errors	Yes	HTTP 500, HTTP 503, `ServiceUnavailableException`
Throttling errors	Yes	HTTP 429, `ThrottlingException`, `ProvisionedThroughputExceededException`
4xx client errors	No	HTTP 400, `ValidationException`, `AccessDeniedException`, `ResourceNotFoundException`

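Critical: Do not retry non-throttling 4xx client errors. They indicate a problem with the request itself, so retrying the same bad request produces the same failure. Throttling errors are the exception: they are transient even when returned with a 4xx status (HTTP 429, or HTTP 400 for ProvisionedThroughputExceededException). Only 5xx and throttling errors are worth retrying.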

Exponential backoff with jitter:

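import random
import time

from botocore.exceptions import ClientError

# Error codes treated as transient in this sketch
RETRYABLE_CODES = {"ThrottlingException", "ServiceUnavailableException"}

def call_with_backoff(api_call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return api_call()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            # Re-raise non-retryable errors and the final failed attempt
            if code not in RETRYABLE_CODES or attempt == max_retries - 1:
                raise
            # Full jitter: random wait in [0, min(30, 2^attempt)] seconds
            # to prevent a thundering herd
            wait = random.uniform(0, min(30, 2 ** attempt))
            time.sleep(wait)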

Exam Tip: The AWS SDK implements exponential backoff automatically for most retryable errors. You only need to implement it manually when using raw HTTP calls or when using an SDK version that does not handle a specific error code.

Common API rate limits:

Service	API	Limit
EC2	`DescribeInstances`	100 calls/second
S3	`GET` per prefix	5,500 requests/second
S3	`PUT/POST/DELETE` per prefix	3,500 requests/second
KMS	Symmetric CMK (us-east-1)	30,000 requests/second
API Gateway	Account-wide	10,000 requests/second
Lambda	Default concurrency	1,000 concurrent executions

6. Lambda Troubleshooting & Optimization

No logs appearing in CloudWatch:

The Lambda execution role is missing logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents permissions.
The AWSLambdaBasicExecutionRole managed policy covers these.
Lambda writes logs to stdout and stderr — log statements must use print() / console.log(), NOT write to a file.

Lambda timeout issues:

Lambda max timeout: 15 minutes.
API Gateway hard timeout: 29 seconds — API Gateway will return 504 even if Lambda is still running.
For tasks that may exceed 29 seconds: return 202 Accepted immediately, process asynchronously, and provide a polling endpoint.

Duplicate processing (same event processed twice):

Asynchronous Lambda is invoked at-least-once. Duplicate invocations can occur.
Duplicate log entries with the same requestId indicate Lambda retried a failed invocation.
Fix: design Lambda functions to be idempotent — processing the same event twice produces the same result.
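
A minimal sketch of an idempotent handler follows, assuming each event carries a unique key. The in-memory `processed` dict is a hypothetical stand-in for a durable store (for example a DynamoDB table written with a conditional put); `idempotency_key`, `amount`, and `charges` are illustrative names.

```python
processed = {}  # stand-in for a durable store keyed by idempotency key
charges = []    # records side effects so duplicates are easy to spot


def handler(event, context=None):
    key = event["idempotency_key"]
    if key in processed:
        # Duplicate delivery: return the stored result, skip the side effect
        return processed[key]
    charges.append(event["amount"])  # the side effect happens once
    result = {"status": "charged", "amount": event["amount"]}
    processed[key] = result
    return result


event = {"idempotency_key": "order-1001", "amount": 25}
first = handler(event)
second = handler(event)  # simulated retry of the same event
print(first == second, len(charges))  # True 1
```

With a real store, the check and the record should be a single atomic conditional write so that two concurrent invocations cannot both pass the check.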

Lambda can't connect to RDS:

Lambda must be configured to run inside the same VPC as RDS.
Requires: subnets, security groups, and the AWSLambdaVPCAccessExecutionRole managed policy on the execution role.
Use RDS Proxy to avoid exhausting RDS connection limits when many Lambda instances run concurrently.

Lambda in VPC — no internet access:

Lambda placed in a public VPC subnet does NOT automatically receive a public IP.
For internet access from a VPC: place Lambda in a private subnet with a NAT Gateway in a public subnet.
For S3 and DynamoDB: use VPC Gateway Endpoints — no NAT Gateway required and no data transfer charges.

Performance optimization summary:

Problem	Solution
Slow cold starts	Provisioned Concurrency; reduce package size; avoid Spring Framework in Java
CPU-intensive task slow	Increase memory (CPU scales linearly with memory — it is the only lever)
DB connection overhead on every invoke	Initialize DB connection outside the handler function
Package > 250 MB (unzipped)	Deploy as a container image (up to 10 GB); Lambda Layers help share libraries but still count toward the 250 MB unzipped limit
Too many RDS connections	Use RDS Proxy
SQS message reprocessed	Increase visibility timeout to at least 6× Lambda timeout

7. DynamoDB Troubleshooting & Optimization

ProvisionedThroughputExceededException:

Cause 1: Hot partition key
  → Low-cardinality partition key concentrates traffic on one partition
  → Fix: use higher-cardinality keys; add a random suffix for write sharding

Cause 2: Total table capacity insufficient
  → Fix: increase provisioned WCU/RCU; switch to On-Demand mode

Cause 3: GSI write throughput insufficient
  → Fix: increase GSI WCU specifically — GSI throttling causes main table throttling
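
Write sharding for Cause 1 can be sketched as follows. Note that this uses a calculated suffix (a hash of the item ID) rather than a purely random one, so a single item can be located again; `sharded_partition_key`, `all_shard_keys`, and the `#` separator are hypothetical choices.

```python
import hashlib


def sharded_partition_key(base_key, item_id, shard_count=10):
    """Append a deterministic shard suffix derived from the item ID."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    suffix = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_key}#{suffix}"


def all_shard_keys(base_key, shard_count=10):
    """Keys to query when reading everything stored under base_key."""
    return [f"{base_key}#{n}" for n in range(shard_count)]


key = sharded_partition_key("2024-01-15", "order-1001")
print(key in all_shard_keys("2024-01-15"))  # True
```

The trade-off is on the read side: fetching all items for one logical key now takes one query per shard suffix.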

GSI throttling affects the main table:
If a write to the main table requires updating a GSI and the GSI does not have sufficient WCU, the main table write is throttled — even if the main table has ample capacity. Always monitor and provision GSI capacity independently.

DynamoDB Streams not triggering Lambda:
Enabling DynamoDB Streams alone is insufficient. You must also create an Event Source Mapping in Lambda that connects the stream to the function. Both configurations are required.

FilterExpression does not save RCU:
A FilterExpression on a Scan or Query is applied after DynamoDB has already read the data. You are charged for all data read, not just the items that pass the filter. To reduce RCU consumption, design your primary key and indexes so the key conditions themselves narrow the result set.
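
A short calculation makes the cost visible. The helper below is a hypothetical estimate, assuming a Query is billed on the total size of items read, rounded up to 4 KB units, with eventually consistent reads costing half.

```python
import math


def query_read_units(items_read, item_size_kb, strongly_consistent=True):
    """Estimate RCUs for one Query page from the data read, before any filter."""
    total_kb = items_read * item_size_kb
    units = math.ceil(total_kb / 4)
    return units if strongly_consistent else units / 2


# 200 items of 2 KB match the key condition; only 5 pass the FilterExpression.
# The filter result does not appear in the calculation at all.
items_read, items_passing_filter = 200, 5
print(query_read_units(items_read, 2))         # 100
print(query_read_units(items_read, 2, False))  # 50.0
```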

DynamoDB Scan performance:

Scan is the most expensive operation — it reads every item in the table.
For faster full-table scans: use Parallel Scan with multiple workers, each scanning a distinct segment of the table.
Minimize impact: set a Limit on each page request and add pauses between pages to avoid consuming all provisioned capacity.

8. SQS Troubleshooting

Messages being processed multiple times (duplicates):

Root cause: visibility timeout is too short relative to message processing time.
The message becomes visible again before the consumer finishes and calls DeleteMessage.
Fix: set the visibility timeout longer than the maximum expected processing time (for Lambda consumers, at least 6× the function timeout).
Alternative: call ChangeMessageVisibility to extend the timeout mid-processing.

Messages stuck in the queue:

Consumer crashed or timed out without deleting the message.
Visibility timeout is very long — message stays invisible until the timeout expires.
Fix: reduce the visibility timeout so the message returns to the queue faster after a consumer crash.

High SQS cost despite low message volume:

Root cause: short polling — the application makes frequent ReceiveMessage calls that return empty responses.
Each empty response is billed.
Fix: enable long polling by setting ReceiveMessageWaitTimeSeconds to a value between 1 and 20 seconds.

DLQ not receiving messages:

Verify maxReceiveCount is set on the source queue's redrive policy.
For Lambda + SQS: the DLQ must be on the SQS queue, not on the Lambda function configuration.
FIFO queue DLQ must itself be a FIFO queue.

9. Kinesis Troubleshooting

ProvisionedThroughputExceededException on writes:

Cause: total write traffic exceeds total shard capacity (1 MB/s or 1,000 records/s per shard), or a single partition key is routing too much traffic to one shard (hot shard).
Fix: increase the shard count via UpdateShardCount (provisioned-mode streams do not scale automatically; on-demand mode does).
Fix: use higher-cardinality partition keys to spread writes evenly across shards.
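
The per-shard write limits give a quick sizing formula. `shards_needed` is a hypothetical helper that assumes partition keys spread traffic evenly; a hot key can still throttle a stream that looks large enough on paper.

```python
import math


def shards_needed(mb_per_second, records_per_second):
    """Minimum shard count for write traffic: 1 MB/s and 1,000 records/s per shard."""
    by_bytes = math.ceil(mb_per_second / 1.0)
    by_records = math.ceil(records_per_second / 1000)
    return max(1, by_bytes, by_records)


print(shards_needed(4.5, 3000))  # 5 (bytes are the bottleneck)
print(shards_needed(2, 6500))    # 7 (record count is the bottleneck)
```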

Consumer falling behind — high IteratorAgeMilliseconds:

GetRecords.IteratorAgeMilliseconds measures how far behind the consumer is from the tip of the stream.
Fix: add more shards (adds read capacity) or add more consumer instances (at most one KCL worker per shard).
Consider Enhanced Fan-out for consumers that need dedicated 2 MB/s read throughput per shard.

Lambda shard blocked by a bad record:

One bad record causes Lambda to repeatedly fail and retry, halting all processing on that shard.
Fix: set BisectBatchOnFunctionError: true — Lambda splits the batch in half on failure to isolate the bad record.
Fix: set MaximumRetryAttempts to limit the number of retries before discarding.
Fix: set DestinationConfig with an on-failure destination (SQS or SNS) to capture failed records.
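
The bisect idea can be simulated in-process. This is an illustration of the splitting strategy only, not how Lambda is implemented: in the real service each half is retried as a separate invocation and retries are bounded by MaximumRetryAttempts. `process_with_bisect`, `handler`, and the record names are hypothetical.

```python
def process_with_bisect(batch, handler, failed):
    """Run handler on the batch; on failure, split in half and retry each half."""
    try:
        handler(batch)
    except Exception:
        if len(batch) == 1:
            # Bad record isolated: this is what an on-failure destination would receive
            failed.append(batch[0])
            return
        mid = len(batch) // 2
        process_with_bisect(batch[:mid], handler, failed)
        process_with_bisect(batch[mid:], handler, failed)


processed = []


def handler(records):
    # The whole batch fails if any record is bad, mirroring batch semantics
    if "bad" in records:
        raise ValueError("cannot parse record")
    processed.extend(records)


failed = []
batch = ["r0", "r1", "r2", "r3", "r4", "bad", "r6", "r7"]
process_with_bisect(batch, handler, failed)
print(failed)     # ['bad']
print(processed)  # ['r0', 'r1', 'r2', 'r3', 'r4', 'r6', 'r7']
```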

KCL checkpointing failures:

KCL stores checkpoints in a DynamoDB table. If DynamoDB is throttled, checkpoint writes fail.
Symptom: the same records are reprocessed after a consumer restart.
Fix: increase DynamoDB provisioned throughput for the checkpoint table, or switch to on-demand mode.

10. API Gateway Troubleshooting

Error code reference:

Code	Root Cause	Fix
400 Bad Request	Malformed request syntax or failed request validation	Fix the request or update the request model/validator
403 Forbidden	IAM authorization denied, resource policy blocked, or WAF rule matched	Check IAM policy, resource policy, or WAF rules
429 Too Many Requests	Throttling — account, stage, or method rate limit exceeded	Implement exponential backoff; increase limits; add usage plan
502 Bad Gateway	Lambda returned a malformed response (missing `statusCode`, wrong body format)	Fix Lambda response to include `statusCode`, `headers`, `body`
504 Gateway Timeout	Backend integration took longer than the 29-second hard timeout	Optimize backend; use async pattern (202 + polling)

Critical: The 29-second API Gateway integration timeout is a hard limit. Even though Lambda supports 15 minutes, API Gateway will return 504 at 29 seconds. Design long-running tasks to return quickly and process asynchronously.

API changes not taking effect:
API Gateway changes (new routes, updated authorizers, modified integrations) are NOT live until a Deployment is created to a Stage. Always deploy after making changes.

New API key returning 403:
The key was created but not associated with a Usage Plan. Call CreateUsagePlanKey to link the key to the correct plan. Without this association, all requests with that key return 403.

CORS error in browser console:

For Lambda Proxy integration: Lambda must return the Access-Control-Allow-Origin header in its response. Enabling CORS in API Gateway console adds the OPTIONS method but does not add CORS headers to Lambda responses.
For non-proxy integrations: enabling CORS in the console is sufficient.
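
For Lambda Proxy integration, both the 502 and the CORS problem come down to the shape of the returned object. A minimal sketch of a well-formed response, assuming a hypothetical allowed origin of `https://app.example.com`:

```python
import json


def handler(event, context):
    payload = {"message": "ok"}
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # Must be returned by the function itself with proxy integration
            "Access-Control-Allow-Origin": "https://app.example.com",
        },
        # body must be a string; returning a dict here is a common cause of 502
        "body": json.dumps(payload),
    }


response = handler({}, None)
print(response["statusCode"], type(response["body"]).__name__)  # 200 str
```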

Diagnosing latency problems:

Metric	What It Measures	Interpretation
`IntegrationLatency`	Time from API GW to backend and back	High value = backend is slow
`Latency`	Total end-to-end time (client → API GW → backend → API GW → client)	Latency - IntegrationLatency = API GW overhead

11. S3 Troubleshooting & Performance

Infinite log growth:
Configuring S3 access logging to write to the same bucket being monitored creates an infinite feedback loop. Access logs generate new S3 API calls, which generate more access logs. The bucket grows exponentially. Always configure a separate destination bucket for access logs.

Large file upload error: "Your proposed upload exceeds the maximum allowed object size":
The single PUT limit is 5 GB. Files larger than 5 GB must use the Multipart Upload API. Files larger than 100 MB should use multipart upload for performance.
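
Choosing a part size is mostly arithmetic. The sketch below assumes the commonly documented multipart constraints of at least 5 MiB per part (except the last) and at most 10,000 parts; `plan_parts` and the 8 MiB default are hypothetical.

```python
import math

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # minimum size for every part except the last
MAX_PARTS = 10_000       # maximum number of parts per upload


def plan_parts(object_size, preferred_part_size=8 * MIB):
    """Return (part_size, part_count) that respects the part-count limit."""
    part_size = max(preferred_part_size, MIN_PART_SIZE,
                    math.ceil(object_size / MAX_PARTS))
    return part_size, math.ceil(object_size / part_size)


print(plan_parts(6 * 1024 * MIB))  # (8388608, 768) for a 6 GiB file
```

For very large objects the preferred size is overridden so that the part count stays within 10,000.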

Cross-origin resource blocked (CORS 403):
Configure CORS on the target bucket (the bucket serving the assets), not the bucket hosting the website. The CORS configuration must include the requesting origin in AllowedOrigins.

CloudFront serving stale S3 content:
Content is cached at edge locations after the original fetch. When the S3 object is updated, CloudFront continues to serve the cached version until the TTL expires or an invalidation is created.

# Invalidate specific paths in CloudFront after updating S3 objects
aws cloudfront create-invalidation \
  --distribution-id ABCDEF123456 \
  --paths "/images/*" "/documents/report.pdf"

S3 performance at high request rates:

S3 provides 3,500 PUT/s and 5,500 GET/s per prefix (prefix = the string before the last / in the key name).
To exceed these limits, distribute objects across multiple prefixes.
Using date-based prefixes (2024/01/15/file.jpg) concentrates traffic on today's prefix — use random or hash-based prefixes for write-heavy workloads.

12. EC2 & Network Troubleshooting

Cannot connect to EC2:

Connection Timeout (the connection attempt hangs, then times out):
  → Security Group is blocking the inbound traffic
  → Check the SG inbound rules for the correct port and source

Connection Refused (immediately rejected):
  → Traffic reached EC2 but the application is not running
  → Application is running on a different port than expected
  → Check application status and port configuration
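
The two failure modes can be told apart programmatically. `probe` is a hypothetical helper using only the standard library; the host and port in the comment are placeholders.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Packets reached the host, but nothing is listening on this port
        return "refused"
    except socket.timeout:
        # No reply at all: typically a security group or network ACL dropping traffic
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example call (placeholder address): probe("203.0.113.10", 22)
```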

EC2 in public subnet, no internet access:

Instance was launched without a public IP and without an Elastic IP.
Fix: associate an Elastic IP with the instance.
Verify: route table has 0.0.0.0/0 → Internet Gateway route.

NAT Instance not routing private subnet traffic:

The EC2 instance acting as NAT must have Source/Destination Check disabled.
By default, an EC2 instance drops traffic when it is neither the source nor the destination of the packet.
Disabling this check allows the NAT instance to forward traffic on behalf of other instances.

EC2 instance public IP — how to find it from code:

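# From within the instance, use IMDS (link-local address, no AWS credentials needed)
# IMDSv2: request a session token first, then pass it on the metadata call
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/public-ipv4

# Note: ifconfig / ip addr only shows the PRIVATE IP
# The public IP is NATted by AWS at the network layer and is not visible inside the OS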

VPC Flow Logs — what they show:

Source IP, destination IP, source port, destination port, protocol, packet count, byte count, action (ACCEPT/REJECT).
Does NOT show application-layer content. Does NOT show HTTP headers, request bodies, or application logs.
Useful for: diagnosing blocked traffic, security analysis, identifying unexpected traffic patterns.

13. RDS & ElastiCache Optimization

RDS read performance:

All read traffic hitting the primary instance → add a Read Replica and update the application's read connection string to use the replica endpoint.
Note: Multi-AZ standby does NOT serve read traffic — it exists only for failover.

RDS + Lambda connection exhaustion:
Lambda functions scale rapidly and create many short-lived database connections. RDS has connection limits. Use RDS Proxy to pool and reuse connections, preventing the database from being overwhelmed.

ElastiCache stale data:
Application reads old data from cache after the source database was updated.

Root cause: cache was not invalidated after the write.
Fix (Lazy Loading): after a DB write, explicitly delete the cache key (DEL key) so the next read fetches fresh data.
Fix (Write-Through): update both the DB and cache on every write — cache is always current.
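
Both fixes can be sketched with plain dictionaries standing in for the database and the cache. All names here (`db`, `cache`, `read`, `write_with_invalidation`, `write_through`) are hypothetical; a real setup would use a database client and a Redis or Memcached client, usually with a TTL as a safety net.

```python
db = {}     # stand-in for the source database
cache = {}  # stand-in for ElastiCache


def read(key):
    """Lazy loading: serve from cache, fall back to the DB on a miss."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    cache[key] = value
    return value


def write_with_invalidation(key, value):
    """Lazy-loading fix: write to the DB, then delete the cached copy."""
    db[key] = value
    cache.pop(key, None)


def write_through(key, value):
    """Write-through fix: update the DB and the cache together."""
    db[key] = value
    cache[key] = value


write_through("user:1", "alice")
print(read("user:1"))  # alice
write_with_invalidation("user:1", "alicia")
print(read("user:1"))  # alicia (cache miss, reloaded from the DB)
```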

ElastiCache node failure:

Memcached: complete cache loss when a node fails. All data must be re-fetched from the database. Data is NOT replicated.
Redis (Cluster Mode disabled): one primary + replicas. On primary failure, automatic failover to a replica. Data survives.
Redis (Cluster Mode enabled): data is sharded across multiple node groups. Each shard has replicas. Highest availability.

14. Exam Tips & Quick Reference

Scenario-to-Answer Mapping

Scenario Keyword / Requirement	Correct Answer
"EC2 RAM not appearing in CloudWatch"	Install CloudWatch Unified Agent — RAM is not a default EC2 metric
"CloudWatch alarm not visible after creation"	No metric data published yet — send data with `put-metric-data` first
"Metric filter not counting historical log data"	Metric filters are not retroactive — only counts events after filter creation
"Need near-real-time logs in S3"	Subscription Filter → Kinesis Firehose → S3 (not S3 export, which takes 12 hours)
"CloudWatch alarm fire only when two conditions are both true"	Composite Alarm with AND logic
"Identify bottleneck in microservice call chain"	AWS X-Ray — service map and trace timelines
"X-Ray annotations vs metadata"	Annotations = indexed/searchable. Metadata = not searchable. Use Annotations for filtering.
"X-Ray on Lambda"	Enable Active Tracing + add `AWSXRayDaemonWriteAccess` to execution role
"X-Ray on Fargate"	Sidecar container — no daemon on the host
"Who deleted the resource?"	CloudTrail Management Events
"S3 object access audit"	CloudTrail Data Events (not enabled by default)
"React to any AWS API call automatically"	CloudTrail + EventBridge rule pattern match
"Lambda logs not in CloudWatch"	Execution role missing `AWSLambdaBasicExecutionRole` (`logs:PutLogEvents`)
"Lambda 504 from API Gateway"	Backend exceeded 29-second API Gateway hard timeout
"Lambda 502 from API Gateway"	Lambda returned a malformed response — fix the `statusCode`/`body` format
"Lambda processes same SQS message twice"	Visibility timeout too short — set to 6× Lambda timeout
"DLQ for Lambda + SQS"	Configure DLQ on the SQS queue — not on the Lambda function
"Lambda can't reach RDS"	Lambda not in same VPC as RDS — configure VPC settings
"Lambda slow for CPU-intensive work"	Increase memory allocation — CPU scales linearly with memory
"DynamoDB throttled but table capacity looks fine"	GSI WCU is insufficient — GSI throttling blocks main table writes
"DynamoDB writes all going to one shard"	Low-cardinality partition key — use higher-cardinality key or write sharding
"Kinesis consumer falling behind"	Add shards (more capacity) or add consumer instances (up to 1 per shard)
"Kinesis Lambda shard blocked by one bad record"	Enable `BisectBatchOnFunctionError=true` and set `MaximumRetryAttempts`
"SQS high cost, low traffic"	Short polling making empty `ReceiveMessage` calls — enable long polling
"S3 bucket growing exponentially"	Access logs configured to same bucket — infinite loop — use separate bucket
"API key returns 403"	`CreateUsagePlanKey` not called to associate key with usage plan
"Need to search traces by custom attribute"	Use X-Ray Annotations (indexed), not Metadata
"Lambda or EC2 can't reach internet (in VPC)"	Resource is in a private subnet; add a NAT Gateway in a public subnet and route outbound traffic through it
"Access S3/DynamoDB from VPC cheaply"	VPC Gateway Endpoint — free, no NAT Gateway needed
"CloudFront serving old S3 content"	Create a CloudFront invalidation for the updated paths
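
The "Lambda 502 from API Gateway" row above comes down to the shape of the value the function returns. A minimal sketch, assuming a Lambda proxy integration: the response is a dictionary with an integer `statusCode` and a string `body`. The helper name `proxy_response` and the echo handler are hypothetical, used here only for illustration.

```python
import json


def proxy_response(status_code, payload, headers=None):
    """Build a response in the shape a Lambda proxy integration expects."""
    return {
        "statusCode": int(status_code),  # must be an integer
        "headers": headers or {"Content-Type": "application/json"},
        "body": json.dumps(payload),     # must be a string, not a dict
        "isBase64Encoded": False,
    }


def handler(event, context):
    # Hypothetical handler: echoes the request path back to the caller.
    return proxy_response(200, {"path": event.get("path", "/")})
```

Returning a raw dict or list as `body`, or omitting `statusCode`, is the usual cause of the malformed-response 502.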

Common Traps

RAM is never a default EC2 CloudWatch metric: No matter how the question is phrased, RAM utilization always requires the CloudWatch Unified Agent.
Metric Filters are not retroactive: They only count new log events after the filter is created. If you create a filter and the alarm shows 0, it is because no matching events have occurred since creation — not because the filter is broken.
CloudWatch Logs export to S3 takes up to 12 hours: For real-time or near-real-time log delivery to S3, use a Subscription Filter → Kinesis Firehose → S3, not CreateExportTask.
X-Ray daemon on Lambda is automatic: Lambda provides and manages the daemon. You only need Active Tracing enabled and the IAM permission — do not attempt to install a daemon on Lambda.
X-Ray metadata is not searchable: Only Annotations are indexed. If a question asks about searching or filtering traces by a custom attribute, the answer involves Annotations.
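API Gateway 29-second timeout: Lambda can run for 15 minutes, but API Gateway returns a 504 once the integration exceeds its timeout, which defaults to a maximum of 29 seconds. Treat this as fixed for the exam (AWS only allows raising it through a quota increase on Regional and private REST APIs). Design APIs to return quickly and use async patterns for long-running work.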
DLQ for SQS → Lambda goes on the queue: The Lambda function's own DLQ setting only applies to asynchronous (non-ESM) invocations. For SQS → Lambda (Event Source Mapping), configure the DLQ on the SQS queue itself.
GSI throttling throttles the main table: GSIs have independent WCU, and insufficient GSI capacity does not only affect GSI reads — it blocks main table writes that would update that GSI.
put-metric-alarm does not create the metric: An alarm can be created against a metric that has no data, but it stays in INSUFFICIENT_DATA until data points are published (for custom metrics, via put-metric-data).
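
The SQS → Lambda guidance in this document (DLQ on the queue, visibility timeout of at least 6× the function timeout) can be sketched as a small helper that builds the queue attributes. This is an illustration under stated assumptions, not a definitive setup: the function name, the example ARN, and the default `max_receive_count` of 5 are hypothetical. `VisibilityTimeout` and `RedrivePolicy` are real SQS queue attribute names, and SQS expects attribute values as strings.

```python
import json


def sqs_attributes_for_lambda(lambda_timeout_s, dlq_arn, max_receive_count=5):
    """Return SQS queue attributes for a queue consumed by Lambda."""
    return {
        # At least 6x the function timeout, so a message is not
        # redelivered while the function is still working on it.
        "VisibilityTimeout": str(6 * lambda_timeout_s),
        # The DLQ is attached to the source queue, not to the function.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


attrs = sqs_attributes_for_lambda(
    lambda_timeout_s=30,
    dlq_arn="arn:aws:sqs:us-east-1:123456789012:orders-dlq",  # hypothetical ARN
)
```

The resulting dictionary is in the form accepted by the `Attributes` parameter of the boto3 SQS `set_queue_attributes` call.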

Key Terms — Domain 4

Term	One-Line Definition
CloudWatch Unified Agent	Agent installed on EC2 or on-premises servers that collects logs and OS-level metrics (RAM, disk, swap)
Detailed Monitoring	EC2 metric collection at 1-minute intervals instead of the default 5 minutes
Metric Filter	CloudWatch Logs feature that creates a metric by counting pattern matches in log events
Composite Alarm	CloudWatch alarm that combines multiple alarms with AND/OR logic to reduce alert noise
Subscription Filter	CloudWatch Logs mechanism for real-time delivery of log events to Lambda, Kinesis, or Firehose
X-Ray Trace	Complete end-to-end record of a single request as it flows through all services
X-Ray Segment	One service's contribution to a trace (work done + latency)
X-Ray Annotation	Indexed key-value pair attached to a trace — searchable and filterable
X-Ray Metadata	Non-indexed key-value pair attached to a trace — for context, not searchable
X-Ray Sampling	Rules controlling what percentage of requests generate trace data (to control cost)
Exponential Backoff	Retry strategy that doubles the wait time between retries, plus random jitter
IteratorAgeMilliseconds	Kinesis metric showing how far behind a consumer is from the latest stream data
ProvisionedThroughputExceededException	DynamoDB or Kinesis error indicating the request rate exceeded provisioned capacity
RDS Proxy	Managed connection pool between application and RDS that prevents connection exhaustion
VPC Gateway Endpoint	Free VPC endpoint for S3 and DynamoDB that removes the need for a NAT Gateway for traffic to those two services
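
The Exponential Backoff entry above can be sketched in a few lines. This is a minimal illustration using the "full jitter" variant (a random delay between zero and the doubled ceiling), not the AWS SDKs' actual retry code; the SDKs already retry throttled calls on their own. `ThrottledError`, `backoff_delays`, and `call_with_backoff` are hypothetical names, and the base and cap values are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error such as ProvisionedThroughputExceededException."""


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: the ceiling doubles each attempt, capped at `cap`."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(operation, max_retries=5, sleep=time.sleep):
    """Call `operation`, retrying on ThrottledError with jittered backoff."""
    delays = backoff_delays(max_retries)
    while True:
        try:
            return operation()
        except ThrottledError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted, surface the throttling error
            sleep(delay)
```

The jitter matters because many clients throttled at the same moment would otherwise all retry at the same moment too.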

End of Domain 4: Troubleshooting and Optimization.
