AWS Certified Solutions Architect – Associate (SAA-C03) — Domain 2: Design Resilient Architectures

Exam Code: SAA-C03 | Level: Associate
Domain Weight: 26% | Total Domains: 4 | Passing Score: 720/1000

1. Decoupling and Messaging Patterns
2. Serverless and Event-Driven Architectures
3. Load Balancing and API Management
4. Highly Available and Fault-Tolerant Architecture
5. Storage Durability and Resilience
6. Resilience Patterns and Observability
- 6.1 Architectural Resilience Patterns
- 6.2 Monitoring and Tracing
Exam Tips & Quick Reference

1. Decoupling and Messaging Patterns

Tight coupling means Service A calls Service B directly — if B is slow or down, A is blocked. Decoupling with queues or event buses breaks this dependency so each component scales and fails independently. This is one of the most tested architectural concepts on the SAA-C03.

1.1 Amazon SQS — Message Queuing

Standard vs. FIFO Queues

Feature	Standard Queue	FIFO Queue
Throughput	Unlimited	300 TPS (3,000 with batching)
Ordering	Best-effort; not guaranteed	Strictly guaranteed FIFO
Delivery	At-least-once (duplicates possible)	Exactly-once processing
Naming	Any name	Must end in `.fifo`
Message Groups	Not supported	Yes — parallel processing within ordered groups

Key SQS Concepts

Concept	Value / Behavior
Visibility Timeout	Default 30 seconds; max 12 hours. Message hidden after receipt; reappears if not deleted in time.
Message Retention	1 minute to 14 days; default 4 days
Dead Letter Queue (DLQ)	Receives messages that fail processing N times; used for debugging
Long Polling	Consumer waits up to 20 seconds for messages; reduces empty responses and cost; always prefer over short polling
Delay Queue	Postpones delivery of new messages for 0–900 seconds (default 0); useful when consumers need a short pause before a message becomes visible
Message Size	Up to 256 KB per message

Exam Tip: SQS Visibility Timeout is a critical concept. If your consumer takes longer than the timeout to process, the message becomes visible again and another consumer may pick it up. Extend the timeout if processing is slow; don't rely on the 30-second default for complex workloads.

SQS Auto Scaling Integration

Custom CloudWatch Metric:
  ApproximateNumberOfMessagesVisible / NumberOfRunningInstances

Target Tracking Policy:
  Target = desired messages per instance (e.g., 100)
  
Result: ASG scales workers proportionally to queue backlog
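
The backlog-per-instance arithmetic above can be sketched in a few lines. This is a minimal illustration of the calculation only, not the algorithm AWS runs internally for target tracking; the function names and the min/max bounds are assumptions made for the example.

```python
import math

def backlog_per_instance(visible_messages: int, running_instances: int) -> float:
    """Custom metric: queue backlog divided by the number of running workers."""
    # Guard against division by zero when the group has scaled in to nothing.
    return visible_messages / max(running_instances, 1)

def desired_capacity(visible_messages: int, target_per_instance: int,
                     min_size: int = 1, max_size: int = 20) -> int:
    """Instances needed so each worker holds roughly the target backlog."""
    needed = math.ceil(visible_messages / target_per_instance)
    # Clamp to the (assumed) Auto Scaling group bounds.
    return max(min_size, min(max_size, needed))

print(backlog_per_instance(1500, 5))   # 300.0
print(desired_capacity(1500, 100))     # 15
```

With 1,500 visible messages and a target of 100 per instance, the group would need about 15 workers; five workers leave each one with a backlog of 300, three times the target.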

SQS FIFO — Deduplication and Ordering

A MessageGroupId is required and determines ordering scope — messages with the same group ID are processed in strict FIFO order. Deduplication uses either content-based deduplication (SHA-256 hash of body, 5-minute window) or an explicit MessageDeduplicationId.
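
To make the deduplication window concrete, here is a small in-memory sketch. It is a simulation under stated assumptions, not SQS itself: the class name, the `accept` method, and the explicit `now` argument are hypothetical, and the real service tracks duplicates on the server side.

```python
import hashlib
from typing import Optional

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute deduplication interval

class FifoDeduplicator:
    """Toy model of FIFO deduplication (hypothetical helper, not an AWS API)."""

    def __init__(self) -> None:
        self._first_seen = {}  # deduplication id -> time the message was accepted

    def accept(self, body: str, now: float, dedup_id: Optional[str] = None) -> bool:
        # Content-based deduplication: fall back to a SHA-256 hash of the body.
        key = dedup_id or hashlib.sha256(body.encode("utf-8")).hexdigest()
        seen_at = self._first_seen.get(key)
        if seen_at is not None and now - seen_at < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window: dropped
        self._first_seen[key] = now
        return True

dedup = FifoDeduplicator()
print(dedup.accept("order-42", now=0))    # True  (first delivery)
print(dedup.accept("order-42", now=120))  # False (same body, 2 minutes later)
print(dedup.accept("order-42", now=301))  # True  (window has expired)
```

Passing an explicit `dedup_id` models the MessageDeduplicationId case, where two messages with different bodies can still be treated as duplicates.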

1.2 Amazon SNS

SNS is a push-based publish/subscribe service. Publishers send messages to a topic, and every subscriber to that topic receives its own copy.

Feature	SQS	SNS
Pattern	Pull — consumers poll the queue	Push — SNS pushes to all subscribers
Persistence	Yes, up to 14 days	No — fire-and-forget
Multiple Consumers	No — each message goes to one consumer	Yes — all subscribers receive the message
Ordering	FIFO queues only	Standard topics: not guaranteed; FIFO topics preserve order

Fan-Out Pattern (Frequently Tested)

Order Placed → SNS Topic
                    ├── SQS Queue A → Fulfillment Service
                    ├── SQS Queue B → Notification Service
                    └── SQS Queue C → Analytics Service

Each queue processes independently. A failure in one downstream service does not affect the others. This is the correct answer for "process one event in multiple systems simultaneously."
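
The independence of the three queues can be shown with an in-memory stand-in. This is only a toy model of the fan-out idea; the `Topic` class and the queue names are hypothetical and no AWS API is involved.

```python
from collections import deque

class Topic:
    """In-memory stand-in for an SNS topic (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.subscribers = []

    def subscribe(self, queue: deque) -> None:
        self.subscribers.append(queue)

    def publish(self, message: dict) -> None:
        # Every subscribed queue receives its own copy of the message.
        for queue in self.subscribers:
            queue.append(dict(message))

fulfillment, notification, analytics = deque(), deque(), deque()
topic = Topic()
for q in (fulfillment, notification, analytics):
    topic.subscribe(q)

topic.publish({"order_id": 42})

# The fulfillment consumer processes its copy; the other queues are untouched,
# so a slow or failed analytics service would not block fulfillment.
fulfillment.popleft()
print(len(fulfillment), len(notification), len(analytics))  # 0 1 1
```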

SNS Subscribers: SQS, Lambda, HTTP/HTTPS endpoint, email, SMS, mobile push notifications, Kinesis Data Firehose.

1.3 Amazon EventBridge

EventBridge is a serverless event bus for loosely coupled application integration. It is more powerful than SNS for complex routing scenarios.

Feature	EventBridge	SNS
Event Sources	AWS services, SaaS apps, custom apps	Publishers (AWS services or code)
Routing	Complex pattern matching on event fields	Topic subscription only
Schema Registry	Yes	No
SaaS Integration	Yes (Zendesk, Datadog, etc.)	No
Event Replay	Yes (archive and replay)	No
Scheduling	Yes (EventBridge Scheduler; EventBridge itself is the successor to CloudWatch Events)	No
Number of Targets	20+ target types	~10 protocols
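
To illustrate what "pattern matching on event fields" means, the sketch below implements a deliberately simplified matcher. It covers only exact-value lists and nested fields; real EventBridge patterns also support operators such as prefix, numeric, and anything-but matching, which are omitted here. The `matches` function and the sample event are assumptions made for the example.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every field in the pattern must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True

pattern = {
    "source": ["com.example.orders"],
    "detail": {"status": ["PLACED", "PAID"]},
}
paid = {"source": "com.example.orders", "detail": {"status": "PAID", "total": 40}}
cancelled = {"source": "com.example.orders", "detail": {"status": "CANCELLED"}}

print(matches(pattern, paid))       # True
print(matches(pattern, cancelled))  # False
```

Fields that appear in the event but not in the pattern (such as `total`) are ignored, which is why one rule can select a narrow slice of a busy event bus.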

Key Concept: EventBridge Pipes create direct point-to-point connections between a source (SQS, DynamoDB Stream, Kinesis) and a target with optional filtering and enrichment via Lambda. Use Pipes when you need to connect two services with minimal code.

1.4 Amazon MQ

Amazon MQ is a managed message broker for Apache ActiveMQ and RabbitMQ. Use it when migrating existing applications that rely on industry-standard messaging APIs and protocols (JMS, AMQP, STOMP, MQTT, OpenWire) to avoid rewriting application code.

Exam Tip: For new applications, always choose SQS/SNS (more scalable, fully managed, AWS-native). Choose Amazon MQ only when the question mentions existing applications using standard broker protocols that cannot be refactored.

2. Serverless and Event-Driven Architectures

2.1 AWS Lambda — Deep Dive

Lambda Limits and Characteristics

Property	Value
Maximum timeout	15 minutes per invocation
Memory range	128 MB – 10,240 MB (CPU scales proportionally)
Ephemeral storage (/tmp)	Up to 10,240 MB
Concurrent executions	1,000 default per region (can request increase)
Deployment package	50 MB (zip) / 250 MB (unzipped)

Lambda Invocation Types

Type	Triggered By	Error Handling
Synchronous	API Gateway, ALB, SDK, CLI	Caller receives error; caller handles retries
Asynchronous	S3 events, SNS, EventBridge	Lambda retries twice automatically, then sends the event to a DLQ or on-failure destination if one is configured; otherwise the event is discarded
Event Source Mapping	SQS, DynamoDB Streams, Kinesis, MSK	Lambda polls the source; batch processing

Lambda Concurrency Controls

Reserved concurrency — limits the maximum concurrent executions for one function; prevents it from consuming the entire account concurrency (throttles excess requests).
Provisioned concurrency — pre-initializes N function instances to eliminate cold starts; charged per hour provisioned; critical for latency-sensitive applications.

Key Concept: Cold starts occur when Lambda initializes a new execution environment for the first invocation. Mitigate with Provisioned Concurrency (any runtime) or Lambda SnapStart (Java — snapshots the initialized state for fast restore).

Lambda Best Practices

Move expensive initialization (DB connections, SDK clients) outside the handler function so it is reused across warm invocations.
Use Lambda Layers for shared libraries to reduce deployment package size.
Store configuration in environment variables or SSM Parameter Store.
When Lambda needs VPC resources (RDS, ElastiCache), attach it to the VPC — but add a NAT Gateway for internet access and VPC Endpoints for AWS service calls.
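
The first practice above (initialize outside the handler) can be sketched as follows. The `create_client` function is a hypothetical stand-in for an expensive setup step such as opening a database connection; the loop only simulates repeated invocations in one warm execution environment.

```python
INIT_LOG = []  # records each time the expensive setup runs

def create_client() -> dict:
    """Hypothetical stand-in for opening a DB connection or building an SDK client."""
    INIT_LOG.append("client created")
    return {"name": "hypothetical-db-client"}

# Module scope: executed once per execution environment, during the cold start.
CLIENT = create_client()

def handler(event, context):
    # Only per-request work happens here; CLIENT is reused on warm invocations.
    return {"client": CLIENT["name"], "order_id": event["order_id"]}

# Simulate three invocations landing on the same warm environment.
results = [handler({"order_id": i}, None) for i in range(3)]
print(len(INIT_LOG))  # 1
```

Had `create_client()` been called inside `handler`, the setup cost would be paid on every invocation instead of once.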

Lambda Destinations (Async Only)

Route the result of asynchronous invocations to a next step without polling:

On Success → SQS, SNS, EventBridge, or another Lambda
On Failure → SQS, SNS, EventBridge, or another Lambda

2.2 AWS Step Functions

Step Functions orchestrates multi-step workflows as state machines, coordinating Lambda functions and AWS services with built-in error handling, retries, and parallel execution.

Feature	Standard Workflow	Express Workflow
Max Duration	1 year	5 minutes
Execution Semantics	Exactly-once	At-least-once
Execution History	Full audit in console	CloudWatch Logs
Cost	Per state transition	Per execution duration + requests
Use For	Order processing, long ETL, human approval workflows	High-volume, short-duration event processing

2.3 Container Orchestration — ECS and EKS

ECS Launch Types

Feature	EC2 Launch Type	Fargate Launch Type
Who Manages EC2	You — patch, scale, manage	AWS — fully managed
Pricing	Per EC2 instance	Per vCPU/memory/second
Use Case	Steady workload, cost control, GPU	Variable load, no management overhead

ECS Key Concepts

Concept	Definition
Task Definition	Blueprint: container image, CPU, memory, port mappings, environment variables, IAM task role
Task	A running instance of a task definition
Service	Maintains N running tasks; auto-restarts failed tasks; integrates with ALB
Cluster	Logical grouping of tasks and services

ECS Networking Modes:

awsvpc (recommended) — each task gets its own ENI and security group; best isolation and control
bridge — shared host network; port mapped via NAT on host
host — task shares the host's ENI directly; maximum performance, minimal isolation

Exam Tip: ECS tasks have two IAM roles. The Execution Role allows ECS to pull the image from ECR and write logs. The Task Role gives your application code permissions to call AWS services. These are separate and both may be required.

Amazon EKS

Managed Kubernetes control plane. Worker nodes run on EC2 node groups or Fargate. More complex than ECS but portable (standard Kubernetes API) across cloud providers. Use when the team is already Kubernetes-native or when workloads must be portable.

Amazon ECR

Managed container registry integrated with IAM. Features: image scanning (basic on-push or enhanced/continuous via Inspector), lifecycle policies to automatically remove old images, and cross-region/cross-account replication.

3. Load Balancing and API Management

3.1 Elastic Load Balancing — Full Comparison

Feature	ALB (Application)	NLB (Network)	GWLB (Gateway)
OSI Layer	Layer 7 (HTTP/HTTPS)	Layer 4 (TCP/UDP/TLS)	Layer 3 (IP)
Static IP	No	Yes — per AZ	No
Content-Based Routing	Path, host header, query string, HTTP header	No	No
WebSocket / gRPC	Yes	Yes	No
TLS Termination	Yes	Yes	No
HTTP → HTTPS Redirect	Yes (built-in rule)	No	No
Millions Req/sec	Yes	Yes	Yes
Preserve Client IP	Via X-Forwarded-For header	Yes (native)	Yes
Sticky Sessions	Yes (cookie-based)	Yes	No
Use Case	Web apps, APIs, microservices	Low-latency TCP, static IPs, extreme throughput	Third-party security appliances

ALB Content-Based Routing

Listener Rules (evaluated top to bottom, first match wins):
├── IF path = /api/*       → forward to API target group
├── IF host = admin.co.com → forward to Admin target group
├── IF header X-Version=v2 → forward to V2 target group
├── IF query ?color=blue   → weighted: 80% Blue TG, 20% Green TG
└── DEFAULT                → forward to Main target group
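
The "first match wins" evaluation can be modelled in a few lines. This is a sketch of the idea, not ALB's implementation: the rule tuples, target group names, and request shape are assumptions, and the weighted-forwarding rule is left out for brevity.

```python
from fnmatch import fnmatchcase

# (priority, condition, target group); lower priority numbers are checked first.
RULES = [
    (1, lambda r: fnmatchcase(r["path"], "/api/*"), "api-tg"),
    (2, lambda r: r["host"] == "admin.co.com", "admin-tg"),
    (3, lambda r: r["headers"].get("X-Version") == "v2", "v2-tg"),
]
DEFAULT_TARGET_GROUP = "main-tg"

def route(request: dict) -> str:
    """Return the target group of the first rule whose condition matches."""
    for _priority, condition, target_group in sorted(RULES, key=lambda rule: rule[0]):
        if condition(request):
            return target_group
    return DEFAULT_TARGET_GROUP

# Matches both the path rule and the host rule; the path rule has priority.
print(route({"path": "/api/users", "host": "admin.co.com", "headers": {}}))  # api-tg
print(route({"path": "/home", "host": "www.co.com", "headers": {}}))         # main-tg
```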

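Target Types: EC2 instances, IP addresses (including on-premises via Direct Connect), or Lambda functions. An ALB can itself be registered as a target of an NLB, which is how you combine static IPs with Layer 7 routing.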

3.2 Amazon API Gateway

API Types

Type	Best For
REST API	Standard HTTP/REST; most features; response caching
HTTP API	Lower latency and cost; OIDC and OAuth 2.0 support; simpler routing
WebSocket API	Real-time bidirectional communication (chat, live dashboards)

Key Features

Throttling — default 10,000 req/sec per account; burst limit 5,000; configurable per stage and per method
Caching — cache responses at the API Gateway level; configurable TTL (default 300 seconds); reduces backend load
Usage Plans — throttle and quota per API key; used for API monetization and partner access tiers
Stages — separate environments (dev/staging/prod) each with independent settings, throttling, and logging
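
API Gateway describes its throttling in terms of a token bucket, where the rate is the steady refill speed and the burst is the bucket size. The sketch below shows that mechanism with scaled-down numbers; the class and its explicit `now` parameter are assumptions for the example, not the service's implementation.

```python
class TokenBucket:
    """Toy token bucket: `rate` tokens refill per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous request.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a real API would answer 429 Too Many Requests here

bucket = TokenBucket(rate=2, burst=5)
print([bucket.allow(0.0) for _ in range(7)])  # five True, then two False
print([bucket.allow(1.0) for _ in range(3)])  # [True, True, False]
```

The first seven requests arrive at the same instant, so only the burst of five gets through; one second later two tokens have been refilled.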

API Gateway Authorizers

Authorizer	How It Works
Lambda Authorizer	Custom auth logic in Lambda; returns an IAM policy allow/deny
Cognito User Pool	Validates JWT tokens from a Cognito User Pool; no Lambda needed
IAM Authorization	Requires AWS Signature V4 signing; for internal service-to-service calls

3.3 Caching Strategies

Amazon ElastiCache — Redis vs. Memcached

Feature	Redis	Memcached
Persistence	Yes (RDB snapshots, AOF)	No
Multi-AZ Failover	Yes — automatic	No
Pub/Sub	Yes	No
Data Structures	Sorted sets, lists, hashes, geospatial	Strings only
Cluster Mode (Sharding)	Yes	Yes (multi-threaded horizontal scaling)
Transactions	Yes	No

Choose Redis when you need persistence, HA, pub/sub, rich data structures, or leaderboards. Choose Memcached when you need simple, pure caching with multi-threaded performance and no durability requirements.

Caching Patterns

Pattern	Behavior	Best For
Cache-aside (Lazy Loading)	App checks cache → miss → read DB → write to cache	Read-heavy; acceptable stale data window
Write-through	Write to cache AND DB simultaneously	No stale data; accepts additional write latency
Write-behind (Write-back)	Write to cache; async write to DB	Highest write performance; risk of data loss on failure
TTL	Cache entries expire after a set duration	All patterns; balance freshness vs. DB load
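
The cache-aside and write-through rows can be compared with a small in-memory model. Everything here is hypothetical scaffolding (the `TtlCache` and `FakeDatabase` classes, the explicit `now` argument); in practice the cache would be ElastiCache and the store a real database.

```python
class TtlCache:
    """Minimal cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            self._data.pop(key, None)  # drop expired entries
            return None
        return entry[0]

    def set(self, key, value, now) -> None:
        self._data[key] = (value, now + self.ttl)

class FakeDatabase:
    """Hypothetical backing store that counts reads."""

    def __init__(self, rows: dict) -> None:
        self.rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.rows[key]

    def put(self, key, value) -> None:
        self.rows[key] = value

def read_cache_aside(cache, db, key, now):
    value = cache.get(key, now)
    if value is None:             # miss: read the DB, then populate the cache
        value = db.get(key)
        cache.set(key, value, now)
    return value

def write_through(cache, db, key, value, now) -> None:
    db.put(key, value)            # write the DB and the cache together
    cache.set(key, value, now)

db = FakeDatabase({"user:1": "Ada"})
cache = TtlCache(ttl_seconds=60)
read_cache_aside(cache, db, "user:1", now=0)    # miss, reads the DB
read_cache_aside(cache, db, "user:1", now=30)   # hit
read_cache_aside(cache, db, "user:1", now=61)   # expired, reads the DB again
print(db.reads)  # 2
```

With cache-aside, a change made directly in the database stays invisible until the TTL expires; `write_through` avoids that stale window at the cost of an extra write on every update.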

Amazon DAX (DynamoDB Accelerator)

In-memory write-through cache for DynamoDB. Uses the same DynamoDB API, so it is transparent to the application. Provides microsecond read latency for frequently accessed items. Does not help write-heavy workloads. Cluster size: one primary node plus up to 10 read replicas, spread across AZs.

4. Highly Available and Fault-Tolerant Architecture

4.1 Multi-AZ and Multi-Region Design Patterns

Availability Zones

Each AZ consists of one or more discrete data centers within a Region, with independent power, cooling, and networking. Design all production workloads to span at least 2 AZs (3 recommended for critical applications). Auto Scaling groups automatically rebalance instances across specified AZs.

Disaster Recovery Strategy Comparison

Strategy	Description	RTO	RPO	Cost
Backup & Restore	Periodic backups copied to DR Region; restore on failure	Hours	Hours	$
Pilot Light	Minimal core infrastructure (DB) running in DR; compute off	10–60 min	Minutes	$$
Warm Standby	Scaled-down running copy of full environment in DR	Minutes	Seconds	$$$
Multi-Site Active/Active	Full production capacity in both Regions; live traffic split	Near-zero	Near-zero	$$$$

Key Concept: RPO is the maximum acceptable data loss (measured in time). RTO is the maximum acceptable downtime. Lower RPO and RTO = higher cost. The exam often presents a cost constraint and asks which DR strategy fits.

4.2 Database High Availability

RDS Multi-AZ vs. Read Replicas

Feature	Multi-AZ Deployment	Read Replica
Primary Purpose	High availability and automatic failover	Read scaling and DR
Replication	Synchronous — zero data loss	Asynchronous — potential lag
Readable	No — standby is not accessible	Yes — redirect read queries
Automatic Failover	Yes — 1–2 minutes; DNS updates automatically	No — manual promotion
Cross-Region	No — standby is in same region only	Yes — cross-region replicas supported

Exam Tip: Multi-AZ standby is NOT readable. If a question asks about offloading read traffic, the answer is Read Replicas. If a question asks about automatic failover or high availability, the answer is Multi-AZ. These are different features that can (and should) both be used together.

Amazon Aurora HA Architecture

Aurora stores 6 copies of data across 3 AZs automatically (2 copies per AZ). It can sustain writes with 4/6 copies and reads with 3/6 copies, and self-heals corrupted blocks via peer-to-peer replication.
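
The quorum numbers above translate directly into failure tolerance. The short calculation below uses only the figures from this section (6 copies, 2 per AZ, write quorum 4, read quorum 3); the helper names are made up for the example.

```python
COPIES = 6
COPIES_PER_AZ = 2
WRITE_QUORUM = 4
READ_QUORUM = 3

def can_write(lost_copies: int) -> bool:
    return COPIES - lost_copies >= WRITE_QUORUM

def can_read(lost_copies: int) -> bool:
    return COPIES - lost_copies >= READ_QUORUM

# Losing a whole AZ removes 2 copies; an AZ plus one more copy removes 3.
for lost in (COPIES_PER_AZ, COPIES_PER_AZ + 1, 4):
    print(lost, can_write(lost), can_read(lost))
# 2 True True
# 3 False True
# 4 False False

# Any read quorum overlaps any write quorum, so reads see the latest write.
print(READ_QUORUM + WRITE_QUORUM > COPIES)  # True
```

In other words, Aurora keeps accepting writes through the loss of an entire AZ, and keeps serving reads through the loss of an AZ plus one additional copy.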

Aurora Feature	Detail
Read Replicas	Up to 15; replicas share the cluster storage volume, so replica lag is typically well under 100 ms
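Aurora Serverless v2	Scales in 0.5-ACU increments with per-second billing; the classic range is 0.5 to 128 ACUs (minimum 0.5, not zero), while newer engine versions raise the maximum to 256 ACUs and allow automatic pause at 0 ACUs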
Aurora Global Database	Cross-region replication; RPO < 1 second; RTO < 1 minute; up to 5 secondary Regions
Writer / Reader Endpoints	Writer always points to primary; Reader load-balances across all replicas

Amazon DynamoDB High Availability Features

DynamoDB replicates data across 3 AZs by default — no configuration needed.

Feature	Detail
Global Tables	Multi-region, multi-master active-active replication; requires DynamoDB Streams enabled
On-Demand Capacity	Auto-scales instantly; no capacity planning; higher cost per request
DynamoDB Streams	Captures item-level changes (INSERT, MODIFY, REMOVE); triggers Lambda for event-driven processing

4.3 EC2 Auto Scaling — Full Reference

Scaling Policy Types

Policy Type	Behavior	Best For
Simple	One alarm triggers one fixed action; cooldown period	Basic scaling needs
Step	Multiple thresholds; proportional response steps	Graduated response to varying load levels
Target Tracking	Maintain a specific metric value automatically	Most use cases; easiest to configure
Scheduled	Pre-defined capacity changes at specific times	Predictable load patterns (business hours, batch windows)
Predictive	ML-based forecast; provisions capacity proactively before demand	Recurring, cyclical traffic patterns

Key Concept: Target Tracking is the simplest and recommended default. You specify a target metric value (e.g., 50% CPU) and AWS automatically adjusts capacity to maintain it. Predictive Scaling forecasts load from historical CloudWatch data (up to 14 days, with at least 24 hours required) and launches capacity ahead of the forecasted demand.

Auto Scaling Lifecycle Hooks

Lifecycle hooks allow custom actions during scale-out and scale-in events. The instance is paused in Pending:Wait (scale-out) or Terminating:Wait (scale-in) state.

EC2_INSTANCE_LAUNCHING hook — configure the instance before it enters service (install agents, run tests)
EC2_INSTANCE_TERMINATING hook — drain connections, copy logs to S3, or deregister from service discovery before termination

Launch Templates vs. Launch Configurations

Feature	Launch Template	Launch Configuration
Versioning	Yes — multiple versions	No — immutable
Spot + On-Demand Mix	Yes	No
Required for New Features	Yes	No
Recommendation	Preferred	Legacy; avoid for new ASGs

4.4 Route 53 for Availability

Routing Policy Reference

Policy	Behavior	Use For
Failover	Route to secondary when health check on primary fails	Active-passive failover
Latency-Based	Route to Region with lowest measured latency	Multi-region for global users
Weighted	Split traffic by percentage	A/B testing; gradual version migration
Geolocation	Route based on user's geographic location	Content localization; regulatory compliance
Geoproximity	Route by location with adjustable bias	Fine-tune traffic distribution
Multivalue	Return multiple healthy IPs	Simple load distribution — not an ELB replacement

Route 53 Health Checks

Route 53 health checkers are globally distributed. Supported check types: HTTP, HTTPS, TCP, and string matching (verify response body contains specific text). A Calculated health check combines multiple health checks with AND/OR logic. Use a CloudWatch alarm health check for resources that are not publicly accessible (e.g., internal ALBs).

5. Storage Durability and Resilience

5.1 S3 Durability, Storage Classes, and Replication

S3 Storage Class Durability and Availability

Storage Class	Durability	Availability	AZs	Notes
Standard	11 nines	99.99%	≥ 3	General purpose; most resilient
Standard-IA	11 nines	99.9%	≥ 3	Infrequent access; retrieval fee
One Zone-IA	11 nines	99.5%	1	Lower cost; risk if AZ fails
Glacier Instant	11 nines	99.9%	≥ 3	Archive; millisecond retrieval
Glacier Flexible	11 nines	99.99%	≥ 3	Archive; minutes–hours retrieval
Glacier Deep Archive	11 nines	99.99%	≥ 3	12–48 hour retrieval; lowest cost
Intelligent-Tiering	11 nines	99.9%	≥ 3	Auto-moves between access tiers

Exam Tip: Every S3 storage class listed here is designed for the same 11-nines (99.999999999%) durability. One Zone-IA carries the same rating but stores data in a single AZ, so the data can be lost if that AZ is destroyed. Availability differs across classes.
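
As a worked example of what 11 nines means, the calculation below treats the figure as an annual per-object durability and asks how often one object would be lost out of ten million. The object count is an arbitrary choice for illustration.

```python
from fractions import Fraction

# 1 - 99.999999999% expressed exactly, to avoid floating-point rounding.
annual_loss_probability = Fraction(1, 10**11)
objects_stored = 10_000_000

expected_losses_per_year = objects_stored * annual_loss_probability
years_per_lost_object = 1 / expected_losses_per_year

print(expected_losses_per_year)  # 1/10000
print(years_per_lost_object)     # 10000
```

So with ten million objects stored, the design target corresponds to losing a single object roughly once every 10,000 years on average.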

S3 Replication

Both replication types require versioning enabled on both source and destination.

Feature	CRR (Cross-Region)	SRR (Same-Region)
Purpose	DR, compliance, reduce latency for distant users	Log aggregation, test/prod separation
Latency	Near real-time (asynchronous)	Near real-time (asynchronous)
Replicate Existing Objects	No — only new objects after enabling (use S3 Batch Ops for existing)	Same
Delete Marker Replication	Optional (off by default)	Optional

5.2 Block and File Storage Resilience

EBS Resilience

EBS volumes are AZ-specific — they exist in one AZ only. For resilience:

Take snapshots (stored in S3, multi-AZ) at regular intervals
Copy snapshots to another Region for cross-region DR
Use AWS Data Lifecycle Manager (DLM) to automate snapshot schedules and retention

EFS Resilience

Amazon EFS automatically replicates data across multiple AZs in a Region (Standard tier). Use EFS Replication to create a read-only replica in a different Region for DR.

FSx for Windows — Multi-AZ

FSx for Windows File Server supports a Multi-AZ deployment with automatic failover between file servers in separate AZs.

5.3 AWS Backup

AWS Backup provides centralized, policy-driven backup management across: EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, and S3.

Feature	Detail
Backup Plans	Schedules, retention periods, and lifecycle transition rules
Cross-Account Backup	Copy backups to another account for isolation from operational account
Cross-Region Backup	Copy backups to another Region for DR compliance
Vault Lock (WORM)	Immutable backup vault; compliance mode prevents deletion even by root

6. Resilience Patterns and Observability

6.1 Architectural Resilience Patterns

Queue-Based Load Leveling

Place an SQS queue between a fast producer and a slow consumer. The producer never waits; the consumer processes at its own pace. Queue depth provides a scaling signal.

[Fast Producer] → [SQS Queue] → [Consumer Workers]
                                     ↑
                             (Auto Scaling based on
                             ApproximateNumberOfMessagesVisible)

Circuit Breaker Pattern

Stop calling a failing downstream service to prevent cascading failures. When the error rate exceeds a threshold, open the circuit (fail fast). Periodically allow a test request through to detect recovery.
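
A minimal sketch of the pattern, under simplifying assumptions: it counts consecutive failures rather than tracking an error rate, it takes the current time as an argument so it can be exercised without waiting, and all names (`CircuitBreaker`, `CircuitOpenError`) are hypothetical.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, now: float):
        if self.opened_at is not None and now - self.opened_at < self.reset_timeout:
            raise CircuitOpenError("circuit open: failing fast")
        # Either closed, or the timeout has passed and this is the trial request.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

def failing_service():
    raise ConnectionError("downstream unavailable")

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for t in (0.0, 1.0, 2.0):
    try:
        breaker.call(failing_service, now=t)
    except CircuitOpenError:
        print(t, "rejected without calling the service")
    except ConnectionError:
        print(t, "service call failed")

print(breaker.call(lambda: "ok", now=12.0))  # trial request succeeds, circuit closes
```

The third call is rejected immediately because two failures opened the circuit; after the reset timeout a single trial call is let through and, on success, normal traffic resumes.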

Retry with Exponential Backoff and Jitter

Do not retry immediately on failure — wait an exponentially increasing interval. Add jitter (random delay) to prevent a "thundering herd" where all retrying clients hit the service simultaneously. AWS SDKs implement this by default.
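
The sketch below shows one common variant, often called "full jitter", where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base, cap, and attempt count are illustrative assumptions, and the AWS SDKs' built-in retry logic may differ in its details.

```python
import random
import time

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

def call_with_retries(func, max_attempts: int = 5, sleep=time.sleep):
    """Call func, retrying on any exception with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))

# Example: a call that fails twice before succeeding.
outcomes = iter([TimeoutError("busy"), TimeoutError("busy"), "ok"])

def flaky():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

Passing `sleep` as a parameter keeps the example fast to run; in real code the default `time.sleep` would apply the delay.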

Bulkhead Pattern

Isolate workloads so a failure in one does not affect others. Use separate SQS queues per consumer type, separate Lambda functions per purpose, and separate ECS services per workload. Avoid monoliths that fail entirely.

6.2 Monitoring and Tracing

Amazon CloudWatch

Feature	Purpose
Metrics	Time-series data from AWS services and custom sources; 1-second to 1-day resolution
Logs	Centralized log storage from EC2 (via agent), Lambda, ECS, API Gateway, VPC Flow Logs
Alarms	Trigger on threshold or anomaly; actions include SNS notification, Auto Scaling, EC2 reboot
Dashboards	Cross-service, cross-region metric visualization
Container Insights	Enhanced metrics for ECS/EKS (CPU, memory, disk, network per task/pod)
Anomaly Detection	ML model of expected metric range; alarm when actual deviates

AWS X-Ray

Distributed tracing for microservices and serverless applications. Traces requests end-to-end across Lambda, ECS, EC2, API Gateway, SQS, and DynamoDB. The Service Map provides a visual representation of all components and inter-service latency. Use annotations (indexed) for filtering traces and metadata (non-indexed) for additional context. Sampling controls the fraction of traces collected to manage cost.

Exam Tips & Quick Reference

Scenario-to-Answer Mapping

Scenario Keyword / Requirement	Correct Answer
"Decouple components to handle traffic spikes"	SQS queue between producer and consumer
"Process messages in strict order, exactly once"	SQS FIFO queue
"One event → multiple systems process simultaneously"	SNS topic → fan-out to multiple SQS queues
"Migrate from JMS/AMQP broker to AWS"	Amazon MQ (protocol compatibility required)
"Serverless multi-step workflow with error handling"	AWS Step Functions
"Auto Scaling based on SQS queue depth"	Custom metric (QueueDepth/InServiceInstances) → Target Tracking
"Eliminate Lambda cold starts for critical function"	Lambda Provisioned Concurrency
"DynamoDB microsecond read latency"	DynamoDB DAX
"RDS high availability for production"	RDS Multi-AZ deployment
"Offload read queries from RDS primary"	RDS Read Replicas (each has its own endpoint; Aurora clusters add a load-balanced reader endpoint)
"Aurora cross-region DR with RPO < 1 second"	Aurora Global Database
"DynamoDB active-active multi-region replication"	DynamoDB Global Tables (requires Streams enabled)
"Gradual traffic shift to a new app version"	Weighted routing in Route 53 OR ALB weighted target groups
"Windows EC2 needs shared file system"	FSx for Windows File Server (SMB)
"Linux EC2 needs shared file system across AZs"	Amazon EFS (NFS)
"Retain instance logs before Auto Scaling terminates"	Lifecycle hook (Terminating:Wait) → copy logs to S3/CloudWatch
"Real-time distributed tracing across microservices"	AWS X-Ray
"Detect metric anomalies automatically"	CloudWatch Anomaly Detection
"Scheduled task without managing EC2"	EventBridge Scheduler + Lambda
"Long-running workflow up to 1 year"	Step Functions Standard Workflow
"High-volume short-duration event workflow"	Step Functions Express Workflow
"Automatic failover when primary Region fails"	Route 53 Failover routing + health checks
"Pilot Light DR; restore compute in minutes"	Pilot Light pattern (AMIs + Launch Templates pre-configured)

Common Traps

Multi-AZ standby not readable: The RDS Multi-AZ standby replica cannot serve read traffic. If the question mentions read scaling, use Read Replicas, not Multi-AZ.
DynamoDB Global Tables require Streams: Always enable DynamoDB Streams before creating Global Tables — a frequently tested prerequisite.
Amazon MQ vs. SQS: Only choose MQ when the scenario explicitly mentions existing applications using standard protocols (JMS, AMQP, STOMP). New applications should use SQS/SNS.
Step Functions Express duration: Express Workflows max out at 5 minutes. Standard Workflows support up to 1 year. Confusing these two is a common exam mistake.
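SNS does not persist messages: SNS is not a message store. It retries failed deliveries according to its delivery policy, but once retries are exhausted the message is dropped unless a dead-letter queue is attached to the subscription. Add an SQS queue as a subscriber to provide durability.
Lambda in a VPC: A VPC-attached Lambda function has no internet access unless its subnet routes through a NAT Gateway. To reach AWS services it needs either that NAT path or VPC Endpoints; without one of them, calls hang until the function times out.
S3 One Zone-IA risk: The 11-nines durability figure assumes the AZ survives. Data is stored in a single AZ and can be lost if that AZ is destroyed. Do not use One Zone-IA for data that cannot be recreated.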

Key Terms — Domain 2

Term	One-Line Definition
Visibility Timeout	Time an SQS message is hidden after receipt; reappears if not deleted within this window
DLQ (Dead Letter Queue)	Receives SQS/SNS messages that fail processing after N attempts
Fan-Out Pattern	SNS → multiple SQS queues; one event triggers multiple independent consumers
Provisioned Concurrency	Pre-warmed Lambda environments; eliminates cold starts; charged per hour
Task Definition	ECS blueprint defining container image, CPU, memory, ports, and IAM roles
Task Role	IAM role granting the application code inside an ECS container access to AWS services
Target Tracking	ASG policy that maintains a specific CloudWatch metric at a target value
Lifecycle Hook	Pauses an EC2 instance during ASG launch or termination for custom automation
Multi-AZ	RDS standby in a separate AZ; synchronous replication; automatic failover
Read Replica	RDS/Aurora async copy; readable; offloads queries; RDS replicas must be manually promoted (Aurora Replicas are automatic failover targets)
Aurora Global Database	Cross-region active-passive Aurora cluster; RPO < 1s; RTO < 1 min
DAX	DynamoDB Accelerator; in-memory write-through cache; microsecond reads
Circuit Breaker	Pattern that stops calling a failing service to prevent cascading failures
RPO	Recovery Point Objective; maximum acceptable data loss measured in time
RTO	Recovery Time Objective; maximum acceptable downtime before recovery
X-Ray	AWS distributed tracing service; end-to-end request visibility across services
