Domain 2: Design Resilient Architectures
AWS Certified Solutions Architect – Associate (SAA-C03) — Domain 2: Design Resilient Architectures
Exam Code: SAA-C03 | Level: Associate
Domain Weight: 26% | Total Domains: 4 | Passing Score: 720/1000
Table of Contents
- Decoupling and Messaging Patterns
- Serverless and Event-Driven Architectures
- Load Balancing and API Management
- Highly Available and Fault-Tolerant Architecture
- Storage Durability and Resilience
- Resilience Patterns and Observability
- Exam Tips & Quick Reference
1. Decoupling and Messaging Patterns
Tight coupling means Service A calls Service B directly — if B is slow or down, A is blocked. Decoupling with queues or event buses breaks this dependency so each component scales and fails independently. This is one of the most tested architectural concepts on the SAA-C03.
1.1 Amazon SQS — Message Queuing
Standard vs. FIFO Queues
| Feature | Standard Queue | FIFO Queue |
|---|---|---|
| Throughput | Nearly unlimited | 300 API calls per second per action (3,000 messages per second with batching) |
| Ordering | Best-effort; not guaranteed | Strictly guaranteed FIFO |
| Delivery | At-least-once (duplicates possible) | Exactly-once processing |
| Naming | Any name | Must end in .fifo |
| Message Groups | Not supported | Yes — parallel processing within ordered groups |
Key SQS Concepts
| Concept | Value / Behavior |
|---|---|
| Visibility Timeout | Default 30 seconds; max 12 hours. Message hidden after receipt; reappears if not deleted in time. |
| Message Retention | 1 minute to 14 days; default 4 days |
| Dead Letter Queue (DLQ) | Receives messages that fail processing N times; used for debugging |
| Long Polling | Consumer waits up to 20 seconds for messages; reduces empty responses and cost; always prefer over short polling |
| Delay Queue | Delay delivery 0–900 seconds; useful for initial processing pause |
| Message Size | Up to 256 KB per message |
Exam Tip: SQS Visibility Timeout is a critical concept. If your consumer takes longer than the timeout to process, the message becomes visible again and another consumer may pick it up. Extend the timeout if processing is slow; don't rely on the 30-second default for complex workloads.
SQS Auto Scaling Integration
Custom CloudWatch Metric:
ApproximateNumberOfMessagesVisible / NumberOfRunningInstances
Target Tracking Policy:
Target = desired messages per instance (e.g., 100)
Result: ASG scales workers proportionally to queue backlog
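The arithmetic behind this policy can be sketched in a few lines of Python. This is only an illustration of the calculation; the scaling itself is done by the Auto Scaling service, and the function names and the default group bounds (1 to 20 instances) are assumptions made for the example.

```python
import math


def backlog_per_instance(visible_messages: int, running_instances: int) -> float:
    """Custom metric: queue backlog divided by the number of running workers."""
    # Treat zero workers as one so the metric stays finite when the fleet is empty.
    return visible_messages / max(running_instances, 1)


def desired_capacity(visible_messages: int, target_per_instance: int,
                     min_size: int = 1, max_size: int = 20) -> int:
    """Instance count that brings the backlog per instance down to the target."""
    needed = math.ceil(visible_messages / target_per_instance)
    # Clamp to the (assumed) Auto Scaling group bounds.
    return max(min_size, min(max_size, needed))
```

With 1,500 visible messages and 5 workers the metric is 300 messages per instance; a target of 100 per instance implies a fleet of 15.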
SQS FIFO — Deduplication and Ordering
A MessageGroupId is required and determines ordering scope — messages with the same group ID are processed in strict FIFO order. Deduplication uses either content-based deduplication (SHA-256 hash of body, 5-minute window) or an explicit MessageDeduplicationId.
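The deduplication rule can be modeled in memory as follows. This is a simplified sketch, not the SQS implementation: the class name `FifoDeduplicator` and the explicit `now` parameter (seconds) are assumptions chosen to make the 5-minute window easy to see.

```python
import hashlib
from typing import Dict, Optional


class FifoDeduplicator:
    """In-memory model of SQS FIFO deduplication (not the real service)."""

    WINDOW_SECONDS = 300  # 5-minute deduplication interval

    def __init__(self) -> None:
        self._seen: Dict[str, float] = {}  # dedup ID -> time first accepted

    def accept(self, body: str, now: float, dedup_id: Optional[str] = None) -> bool:
        """Return True if the message is accepted, False if it is a duplicate."""
        # Content-based deduplication derives the ID from a SHA-256 hash of the body.
        key = dedup_id or hashlib.sha256(body.encode("utf-8")).hexdigest()
        first_seen = self._seen.get(key)
        if first_seen is not None and now - first_seen < self.WINDOW_SECONDS:
            return False
        self._seen[key] = now
        return True
```

A message resent 2 minutes after the original is dropped; the same body sent after the window has passed is accepted as a new message.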
1.2 Amazon SNS — Pub/Sub Messaging
SNS is a push-based publish/subscribe service. Publishers send to a Topic; all subscribers receive the same message simultaneously.
SNS vs. SQS
| Feature | SQS | SNS |
|---|---|---|
| Pattern | Pull — consumers poll the queue | Push — SNS pushes to all subscribers |
| Persistence | Yes, up to 14 days | No — fire-and-forget |
| Multiple Consumers | No — each message goes to one consumer | Yes — all subscribers receive the message |
| Ordering | FIFO queues only | Not guaranteed on standard topics; FIFO topics preserve order |
Fan-Out Pattern (Frequently Tested)
Order Placed → SNS Topic
├── SQS Queue A → Fulfillment Service
├── SQS Queue B → Notification Service
└── SQS Queue C → Analytics Service
Each queue processes independently. A failure in one downstream service does not affect the others. This is the correct answer for "process one event in multiple systems simultaneously."
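A toy model makes the isolation property concrete. The `Topic` class and `drain` helper below are hypothetical stand-ins written for this example, not AWS APIs; they only show that each queue holds its own copy and that one failing consumer leaves the others untouched.

```python
from collections import deque
from typing import Deque, Dict, List


class Topic:
    """Toy stand-in for an SNS topic that pushes to subscribed queues."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {}

    def subscribe(self, name: str) -> Deque[str]:
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message: str) -> None:
        # Every subscribed queue receives its own copy of the message.
        for queue in self.queues.values():
            queue.append(message)


def drain(queue: Deque[str], fail: bool = False) -> List[str]:
    """Consume a queue; on failure the messages stay queued for a later retry."""
    if fail:
        return []
    processed = list(queue)
    queue.clear()
    return processed
```

If the notification consumer fails, its message stays in Queue B while fulfillment and analytics finish normally.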
SNS Subscribers: SQS, Lambda, HTTP/HTTPS endpoint, email, SMS, mobile push notifications, Kinesis Data Firehose.
1.3 Amazon EventBridge
EventBridge is a serverless event bus for loosely coupled application integration. It is more powerful than SNS for complex routing scenarios.
| Feature | EventBridge | SNS |
|---|---|---|
| Event Sources | AWS services, SaaS apps, custom apps | Publishers (AWS services or code) |
| Routing | Complex pattern matching on event fields | Topic subscription only |
| Schema Registry | Yes | No |
| SaaS Integration | Yes (Zendesk, Datadog, etc.) | No |
| Event Replay | Yes (archive and replay) | No |
| Scheduling | Yes (EventBridge Scheduler; EventBridge itself is the successor to CloudWatch Events) | No |
| Number of Targets | 20+ target types | ~10 protocols |
Key Concept: EventBridge Pipes create direct point-to-point connections between a source (SQS, DynamoDB Stream, Kinesis) and a target with optional filtering and enrichment via Lambda. Use Pipes when you need to connect two services with minimal code.
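To illustrate what "pattern matching on event fields" means for rules and Pipes filters, here is a deliberately reduced matcher. It handles only exact-value matching on nested fields, where each leaf lists the accepted values; real EventBridge patterns support more operators. The `matches` function and the `com.example.orders` source are assumptions for the example.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field named in the pattern matches the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the pattern lists the accepted values for this field.
            return False
    return True


ORDER_PATTERN = {
    "source": ["com.example.orders"],            # hypothetical custom source
    "detail": {"status": ["PLACED", "PAID"]},
}
```

Fields that the pattern does not mention are ignored, so an event with extra attributes still matches.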
1.4 Amazon MQ
Amazon MQ is a managed message broker for Apache ActiveMQ and RabbitMQ. Use it when migrating existing applications that use standard messaging protocols (AMQP, STOMP, MQTT, OpenWire, JMS) to avoid rewriting application code.
Exam Tip: For new applications, always choose SQS/SNS (more scalable, fully managed, AWS-native). Choose Amazon MQ only when the question mentions existing applications using standard broker protocols that cannot be refactored.
2. Serverless and Event-Driven Architectures
2.1 AWS Lambda — Deep Dive
Lambda Limits and Characteristics
| Property | Value |
|---|---|
| Maximum timeout | 15 minutes per invocation |
| Memory range | 128 MB – 10,240 MB (CPU scales proportionally) |
| Ephemeral storage (/tmp) | Up to 10,240 MB |
| Concurrent executions | 1,000 default per region (can request increase) |
| Deployment package | 50 MB (zip) / 250 MB (unzipped) |
Lambda Invocation Types
| Type | Triggered By | Error Handling |
|---|---|---|
| Synchronous | API Gateway, ALB, SDK, CLI | Caller receives error; caller handles retries |
| Asynchronous | S3 events, SNS, EventBridge | Lambda retries 2× automatically; failed events then go to a DLQ or on-failure destination, if one is configured |
| Event Source Mapping | SQS, DynamoDB Streams, Kinesis, MSK | Lambda polls the source; batch processing |
Lambda Concurrency Controls
- Reserved concurrency — limits the maximum concurrent executions for one function; prevents it from consuming the entire account concurrency (throttles excess requests).
- Provisioned concurrency: pre-initializes N function instances to eliminate cold starts; billed for as long as it is configured, whether or not invocations arrive; critical for latency-sensitive applications.
Key Concept: Cold starts occur when Lambda must initialize a new execution environment before it can serve an invocation. Mitigate with Provisioned Concurrency (any runtime) or Lambda SnapStart (launched for Java and since extended to Python and .NET; it snapshots the initialized state for fast restore).
Lambda Best Practices
- Move expensive initialization (DB connections, SDK clients) outside the handler function so it is reused across warm invocations.
- Use Lambda Layers for shared libraries to reduce deployment package size.
- Store configuration in environment variables or SSM Parameter Store.
- When Lambda needs VPC resources (RDS, ElastiCache), attach it to the VPC — but add a NAT Gateway for internet access and VPC Endpoints for AWS service calls.
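The first practice in the list above looks like this in a Python handler. It is a minimal sketch: `create_client` is a hypothetical stand-in for opening a database connection or building an SDK client, and `INIT_COUNT` exists only to show how often the setup runs.

```python
import time

INIT_COUNT = 0  # counts how often the expensive setup runs (illustration only)


def create_client() -> dict:
    """Hypothetical stand-in for opening a DB connection or building an SDK client."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


# Module scope: runs once per execution environment (cold start),
# not once per invocation.
CLIENT = create_client()


def handler(event, context):
    """Lambda-style entry point that reuses the module-level client."""
    return {
        "order_id": event.get("order_id"),
        "client_created_at": CLIENT["created_at"],
    }
```

Two warm invocations in the same environment share one client, so the setup cost is paid only on the cold start.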
Lambda Destinations (Async Only)
Route the result of asynchronous invocations to a next step without polling:
- On Success → SQS, SNS, EventBridge, or another Lambda
- On Failure → SQS, SNS, EventBridge, or another Lambda
2.2 AWS Step Functions
Step Functions orchestrates multi-step workflows as state machines, coordinating Lambda functions and AWS services with built-in error handling, retries, and parallel execution.
| Feature | Standard Workflow | Express Workflow |
|---|---|---|
| Max Duration | 1 year | 5 minutes |
| Execution Semantics | Exactly-once | At-least-once |
| Execution History | Full audit in console | CloudWatch Logs |
| Cost | Per state transition | Per execution duration + requests |
| Use For | Order processing, long ETL, human approval workflows | High-volume, short-duration event processing |
2.3 Container Orchestration — ECS and EKS
ECS Launch Types
| Feature | EC2 Launch Type | Fargate Launch Type |
|---|---|---|
| Who Manages EC2 | You — patch, scale, manage | AWS — fully managed |
| Pricing | Per EC2 instance | Per vCPU/memory/second |
| Use Case | Steady workload, cost control, GPU | Variable load, no management overhead |
ECS Key Concepts
| Concept | Definition |
|---|---|
| Task Definition | Blueprint: container image, CPU, memory, port mappings, environment variables, IAM task role |
| Task | A running instance of a task definition |
| Service | Maintains N running tasks; auto-restarts failed tasks; integrates with ALB |
| Cluster | Logical grouping of tasks and services |
ECS Networking Modes:
- awsvpc (recommended): each task gets its own ENI and security group; best isolation and control
- bridge: shared host network; port mapped via NAT on host
- host: task shares the host's ENI directly; maximum performance, minimal isolation
Exam Tip: ECS tasks have two IAM roles. The Execution Role allows ECS to pull the image from ECR and write logs. The Task Role gives your application code permissions to call AWS services. These are separate and both may be required.
Amazon EKS
Managed Kubernetes control plane. Worker nodes run on EC2 node groups or Fargate. More complex than ECS but portable (standard Kubernetes API) across cloud providers. Use when the team is already Kubernetes-native or when workloads must be portable.
Amazon ECR
Managed container registry integrated with IAM. Features: image scanning (basic on-push or enhanced/continuous via Inspector), lifecycle policies to automatically remove old images, and cross-region/cross-account replication.
3. Load Balancing and API Management
3.1 Elastic Load Balancing — Full Comparison
| Feature | ALB (Application) | NLB (Network) | GWLB (Gateway) |
|---|---|---|---|
| OSI Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) | Layer 3 (IP) |
| Static IP | No | Yes — per AZ | No |
| Content-Based Routing | Path, host header, query string, HTTP header | No | No |
| WebSocket / gRPC | Yes | Yes | No |
| TLS Termination | Yes | Yes | No |
| HTTP → HTTPS Redirect | Yes (built-in rule) | No | No |
| Millions Req/sec | Yes | Yes | Yes |
| Preserve Client IP | Via X-Forwarded-For header | Yes (native) | Yes |
| Sticky Sessions | Yes (cookie-based) | Yes | No |
| Use Case | Web apps, APIs, microservices | Low-latency TCP, static IPs, extreme throughput | Third-party security appliances |
ALB Content-Based Routing
Listener Rules (evaluated top to bottom, first match wins):
├── IF path = /api/* → forward to API target group
├── IF host = admin.co.com → forward to Admin target group
├── IF header X-Version=v2 → forward to V2 target group
├── IF query ?color=blue → weighted: 80% Blue TG, 20% Green TG
└── DEFAULT → forward to Main target group
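The first-match behavior of the listing above can be sketched as an ordered list of conditions. This is a simplified model, not how an ALB is configured: the target group names and the flat request dictionary are assumptions, and the weighted rule is left out for brevity.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]
Rule = Tuple[Callable[[Request], bool], str]

# Ordered rules mirroring the listing; target group names are illustrative.
RULES: List[Rule] = [
    (lambda r: fnmatchcase(r.get("path", ""), "/api/*"), "api-tg"),
    (lambda r: r.get("host") == "admin.co.com", "admin-tg"),
    (lambda r: r.get("x-version") == "v2", "v2-tg"),
]


def route(request: Request, default: str = "main-tg") -> str:
    """Return the target group of the first matching rule, else the default."""
    for condition, target_group in RULES:
        if condition(request):
            return target_group
    return default
```

A request for `/api/users` on `admin.co.com` goes to the API target group, because the path rule is evaluated before the host rule.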
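Target Types: EC2 instances, IP addresses (including on-premises via Direct Connect), or Lambda functions. An ALB can itself be registered as a target of an NLB, which is useful when you need static IPs in front of Layer 7 routing.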
3.2 Amazon API Gateway
API Types
| Type | Best For |
|---|---|
| REST API | Standard HTTP/REST; most features; response caching |
| HTTP API | Lower latency and cost; OIDC and OAuth 2.0 support; simpler routing |
| WebSocket API | Real-time bidirectional communication (chat, live dashboards) |
Key Features
- Throttling — default 10,000 req/sec per account; burst limit 5,000; configurable per stage and per method
- Caching — cache responses at the API Gateway level; configurable TTL (default 300 seconds); reduces backend load
- Usage Plans — throttle and quota per API key; used for API monetization and partner access tiers
- Stages — separate environments (dev/staging/prod) each with independent settings, throttling, and logging
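The rate and burst limits in the throttling bullet follow a token bucket model: the rate is how fast tokens refill, and the burst is the bucket size. The sketch below shows the general algorithm under that assumption; the class name and the explicit `now` timestamps are choices made for the example, not API Gateway internals.

```python
class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now: float) -> bool:
        """Return True if a request at time `now` (seconds) is admitted."""
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a throttled request would receive 429 Too Many Requests
```

With a rate of 2 and a burst of 3, three back-to-back requests pass, the fourth is throttled, and one more is admitted half a second later once a token has refilled.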
API Gateway Authorizers
| Authorizer | How It Works |
|---|---|
| Lambda Authorizer | Custom auth logic in Lambda; returns an IAM policy allow/deny |
| Cognito User Pool | Validates JWT tokens from a Cognito User Pool; no Lambda needed |
| IAM Authorization | Requires AWS Signature V4 signing; for internal service-to-service calls |
3.3 Caching Strategies
Amazon ElastiCache — Redis vs. Memcached
| Feature | Redis | Memcached |
|---|---|---|
| Persistence | Yes (RDB snapshots, AOF) | No |
| Multi-AZ Failover | Yes — automatic | No |
| Pub/Sub | Yes | No |
| Data Structures | Sorted sets, lists, hashes, geospatial | Strings only |
| Cluster Mode (Sharding) | Yes | Yes (multi-threaded horizontal scaling) |
| Transactions | Yes | No |
Choose Redis when you need persistence, HA, pub/sub, rich data structures, or leaderboards. Choose Memcached when you need simple, pure caching with multi-threaded performance and no durability requirements.
Caching Patterns
| Pattern | Behavior | Best For |
|---|---|---|
| Cache-aside (Lazy Loading) | App checks cache → miss → read DB → write to cache | Read-heavy; acceptable stale data window |
| Write-through | Write to cache AND DB simultaneously | No stale data; accepts additional write latency |
| Write-behind (Write-back) | Write to cache; async write to DB | Highest write performance; risk of data loss on failure |
| TTL | Cache entries expire after a set duration | All patterns; balance freshness vs. DB load |
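Cache-aside combined with a TTL can be sketched as below. This is a minimal in-process illustration: `CacheAside` and its `load_from_db` callback are hypothetical names, and a real deployment would hold the entries in ElastiCache rather than in a dictionary.

```python
from typing import Callable, Dict, Tuple


class CacheAside:
    """Lazy-loading cache in front of a slower data store."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def get(self, key: str, now: float) -> str:
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                       # cache hit
        value = self._load(key)                   # miss or expired: read the DB
        self._entries[key] = (value, now + self._ttl)
        return value
```

Reads inside the TTL never touch the database; the first read after expiry reloads the value, which bounds how stale cached data can get.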
Amazon DAX (DynamoDB Accelerator)
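In-memory write-through cache for DynamoDB. It is API-compatible with DynamoDB, so adopting it means switching to the DAX client rather than rewriting application logic. Provides microsecond read latency for frequently accessed items. Does not help write-heavy workloads. Cluster size: one primary node plus up to 10 read replicas; spread nodes across AZs for availability.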
4. Highly Available and Fault-Tolerant Architecture
4.1 Multi-AZ and Multi-Region Design Patterns
Availability Zones
AZs are physically separate groups of one or more data centers within a Region, each with independent power, cooling, and networking. Design all production workloads to span at least 2 AZs (3 recommended for critical applications). Auto Scaling groups automatically rebalance instances across specified AZs.
Disaster Recovery Strategy Comparison
| Strategy | Description | RTO | RPO | Cost |
|---|---|---|---|---|
| Backup & Restore | Periodic backups copied to DR Region; restore on failure | Hours | Hours | $ |
| Pilot Light | Minimal core infrastructure (DB) running in DR; compute off | 10–60 min | Minutes | $$ |
| Warm Standby | Scaled-down running copy of full environment in DR | Minutes | Seconds | $$$ |
| Multi-Site Active/Active | Full production capacity in both Regions; live traffic split | Near-zero | Near-zero | $$$$ |
Key Concept: RPO is the maximum acceptable data loss (measured in time). RTO is the maximum acceptable downtime. Lower RPO and RTO = higher cost. The exam often presents a cost constraint and asks which DR strategy fits.
4.2 Database High Availability
RDS Multi-AZ vs. Read Replicas
| Feature | Multi-AZ Deployment | Read Replica |
|---|---|---|
| Primary Purpose | High availability and automatic failover | Read scaling and DR |
| Replication | Synchronous — zero data loss | Asynchronous — potential lag |
| Readable | No — standby is not accessible | Yes — redirect read queries |
| Automatic Failover | Yes — 1–2 minutes; DNS updates automatically | No — manual promotion |
| Cross-Region | No — standby is in same region only | Yes — cross-region replicas supported |
Exam Tip: Multi-AZ standby is NOT readable. If a question asks about offloading read traffic, the answer is Read Replicas. If a question asks about automatic failover or high availability, the answer is Multi-AZ. These are different features that can (and should) both be used together.
Amazon Aurora HA Architecture
Aurora stores 6 copies of data across 3 AZs automatically (2 copies per AZ). It can sustain writes with 4/6 copies and reads with 3/6 copies, and self-heals corrupted blocks via peer-to-peer replication.
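The quorum numbers translate directly into failure tolerance. The sketch below just does the counting from the figures in the paragraph above; the function name and return shape are assumptions for the example.

```python
COPIES_PER_AZ = 2
TOTAL_AZS = 3
WRITE_QUORUM = 4   # copies needed to accept writes
READ_QUORUM = 3    # copies needed to serve reads


def availability(failed_azs: int, extra_failed_copies: int = 0) -> dict:
    """Report whether reads and writes survive the given failures."""
    healthy = COPIES_PER_AZ * (TOTAL_AZS - failed_azs) - extra_failed_copies
    return {
        "healthy_copies": healthy,
        "writes": healthy >= WRITE_QUORUM,
        "reads": healthy >= READ_QUORUM,
    }
```

Losing a whole AZ leaves 4 copies, so reads and writes both continue. Losing an AZ plus one more copy leaves 3, so reads continue while writes pause until the volume repairs itself.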
| Aurora Feature | Detail |
|---|---|
| Read Replicas | Up to 15; share the cluster storage volume, so replica lag is typically well under 100 ms |
| Aurora Serverless v2 | Scales from 0.5 to 128 ACUs; per-second billing; the classic minimum is 0.5 ACUs (newer engine versions extend the range to 0 to 256 ACUs with auto-pause) |
| Aurora Global Database | Cross-region replication; RPO < 1 second; RTO < 1 minute; up to 5 secondary Regions |
| Writer / Reader Endpoints | Writer always points to primary; Reader load-balances across all replicas |
Amazon DynamoDB High Availability Features
DynamoDB replicates data across 3 AZs by default — no configuration needed.
| Feature | Detail |
|---|---|
| Global Tables | Multi-region, multi-master active-active replication; requires DynamoDB Streams enabled |
| On-Demand Capacity | Auto-scales instantly; no capacity planning; higher cost per request |
| DynamoDB Streams | Captures item-level changes (INSERT, MODIFY, REMOVE); triggers Lambda for event-driven processing |
4.3 EC2 Auto Scaling — Full Reference
Scaling Policy Types
| Policy Type | Behavior | Best For |
|---|---|---|
| Simple | One alarm triggers one fixed action; cooldown period | Basic scaling needs |
| Step | Multiple thresholds; proportional response steps | Graduated response to varying load levels |
| Target Tracking | Maintain a specific metric value automatically | Most use cases; easiest to configure |
| Scheduled | Pre-defined capacity changes at specific times | Predictable load patterns (business hours, batch windows) |
| Predictive | ML-based forecast; provisions capacity proactively before demand | Recurring, cyclical traffic patterns |
Key Concept: Target Tracking is the simplest and recommended default. You specify a target metric value (e.g., 50% CPU) and AWS automatically adjusts capacity to maintain it. Predictive Scaling forecasts from up to 14 days of metric history (at least 24 hours is required) and launches capacity ahead of forecast demand.
Auto Scaling Lifecycle Hooks
Lifecycle hooks allow custom actions during scale-out and scale-in events. The instance is paused in Pending:Wait (scale-out) or Terminating:Wait (scale-in) state.
- EC2_INSTANCE_LAUNCHING hook: configure the instance before it enters service (install agents, run tests)
- EC2_INSTANCE_TERMINATING hook: drain connections, copy logs to S3, or deregister from service discovery before termination
Launch Templates vs. Launch Configurations
| Feature | Launch Template | Launch Configuration |
|---|---|---|
| Versioning | Yes — multiple versions | No — immutable |
| Spot + On-Demand Mix | Yes | No |
| Required for New Features | Yes | No |
| Recommendation | Preferred | Legacy; avoid for new ASGs |
4.4 Route 53 for Availability
Routing Policy Reference
| Policy | Behavior | Use For |
|---|---|---|
| Failover | Route to secondary when health check on primary fails | Active-passive failover |
| Latency-Based | Route to Region with lowest measured latency | Multi-region for global users |
| Weighted | Split traffic by percentage | A/B testing; gradual version migration |
| Geolocation | Route based on user's geographic location | Content localization; regulatory compliance |
| Geoproximity | Route by location with adjustable bias | Fine-tune traffic distribution |
| Multivalue | Return multiple healthy IPs | Simple load distribution — not an ELB replacement |
Route 53 Health Checks
Route 53 health checkers are globally distributed. Supported check types: HTTP, HTTPS, TCP, and string matching (verify response body contains specific text). A Calculated health check combines multiple health checks with AND/OR logic. Use a CloudWatch alarm health check for resources that are not publicly accessible (e.g., internal ALBs).
5. Storage Durability and Resilience
5.1 S3 Durability, Storage Classes, and Replication
S3 Storage Class Durability and Availability
| Storage Class | Durability | Availability | AZs | Notes |
|---|---|---|---|---|
| Standard | 11 nines | 99.99% | ≥ 3 | General purpose; most resilient |
| Standard-IA | 11 nines | 99.9% | ≥ 3 | Infrequent access; retrieval fee |
| One Zone-IA | 11 nines | 99.5% | 1 | Lower cost; risk if AZ fails |
| Glacier Instant | 11 nines | 99.9% | ≥ 3 | Archive; millisecond retrieval |
| Glacier Flexible | 11 nines | 99.99% | ≥ 3 | Archive; minutes–hours retrieval |
| Glacier Deep Archive | 11 nines | 99.99% | ≥ 3 | 12–48 hour retrieval; lowest cost |
| Intelligent-Tiering | 11 nines | 99.9% | ≥ 3 | Auto-moves between access tiers |
Exam Tip: All S3 storage classes are designed for 11 nines (99.999999999%) of durability. One Zone-IA carries the same rating, but it stores data in a single AZ, so that data is lost if the AZ is destroyed. The availability design target differs across classes.
S3 Replication
Both replication types require versioning enabled on both source and destination.
| Feature | CRR (Cross-Region) | SRR (Same-Region) |
|---|---|---|
| Purpose | DR, compliance, reduce latency for distant users | Log aggregation, test/prod separation |
| Latency | Near real-time (asynchronous) | Near real-time (asynchronous) |
| Replicate Existing Objects | No; only new objects after enabling (use S3 Batch Replication for existing objects) | Same |
| Delete Marker Replication | Optional (off by default) | Optional |
5.2 Block and File Storage Resilience
EBS Resilience
EBS volumes are AZ-specific — they exist in one AZ only. For resilience:
- Take snapshots (stored in S3, multi-AZ) at regular intervals
- Copy snapshots to another Region for cross-region DR
- Use AWS Data Lifecycle Manager (DLM) to automate snapshot schedules and retention
EFS Resilience
Amazon EFS automatically replicates data across multiple AZs in a Region (Standard tier). Use EFS Replication to create a read-only replica in a different Region for DR.
FSx for Windows — Multi-AZ
FSx for Windows File Server supports a Multi-AZ deployment with automatic failover between file servers in separate AZs.
5.3 AWS Backup
AWS Backup provides centralized, policy-driven backup management across: EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, Storage Gateway, and S3.
| Feature | Detail |
|---|---|
| Backup Plans | Schedules, retention periods, and lifecycle transition rules |
| Cross-Account Backup | Copy backups to another account for isolation from operational account |
| Cross-Region Backup | Copy backups to another Region for DR compliance |
| Vault Lock (WORM) | Immutable backup vault; compliance mode prevents deletion even by root |
6. Resilience Patterns and Observability
6.1 Architectural Resilience Patterns
Queue-Based Load Leveling
Place an SQS queue between a fast producer and a slow consumer. The producer never waits; the consumer processes at its own pace. Queue depth provides a scaling signal.
[Fast Producer] → [SQS Queue] → [Consumer Workers]
                                       ↑
                          (Auto Scaling based on
                   ApproximateNumberOfMessagesVisible)
Circuit Breaker Pattern
Stop calling a failing downstream service to prevent cascading failures. When the error rate exceeds a threshold, open the circuit (fail fast). Periodically allow a test request through to detect recovery.
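A compact version of the pattern is sketched below. It simplifies the description above in one respect: it trips on a count of consecutive failures rather than an error rate. The class name, the thresholds, and the explicit `now` timestamps are assumptions for the example.

```python
class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # time the circuit opened, or None while closed

    def call(self, func, now: float):
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let this one trial request through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on the failing dependency; after the reset interval a single successful trial call closes it again.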
Retry with Exponential Backoff and Jitter
Do not retry immediately on failure — wait an exponentially increasing interval. Add jitter (random delay) to prevent a "thundering herd" where all retrying clients hit the service simultaneously. AWS SDKs implement this by default.
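One common way to combine the two ideas is the "full jitter" variant, where each delay is drawn uniformly between zero and the capped exponential value. The sketch below illustrates that variant; it is not the code used inside the AWS SDKs, and the function names, base delay, and cap are assumptions for the example.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full-jitter delays: a random value in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(attempts)]


def call_with_retries(func: Callable, attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep):
    """Call func, retrying on exceptions with jittered exponential backoff."""
    delays = backoff_delays(attempts - 1)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
```

The `sleep` parameter is injectable so the retry loop can be exercised in tests without real waiting; production callers would leave the default.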
Bulkhead Pattern
Isolate workloads so a failure in one does not affect others. Use separate SQS queues per consumer type, separate Lambda functions per purpose, and separate ECS services per workload. Avoid monoliths that fail entirely.
6.2 Monitoring and Tracing
Amazon CloudWatch
| Feature | Purpose |
|---|---|
| Metrics | Time-series data from AWS services and custom sources; 1-second to 1-day resolution |
| Logs | Centralized log storage from EC2 (via agent), Lambda, ECS, API Gateway, VPC Flow Logs |
| Alarms | Trigger on threshold or anomaly; actions include SNS notification, Auto Scaling, EC2 reboot |
| Dashboards | Cross-service, cross-region metric visualization |
| Container Insights | Enhanced metrics for ECS/EKS (CPU, memory, disk, network per task/pod) |
| Anomaly Detection | ML model of expected metric range; alarm when actual deviates |
AWS X-Ray
Distributed tracing for microservices and serverless applications. Traces requests end-to-end across Lambda, ECS, EC2, API Gateway, SQS, and DynamoDB. The Service Map provides a visual representation of all components and inter-service latency. Use annotations (indexed) for filtering traces and metadata (non-indexed) for additional context. Sampling controls the fraction of traces collected to manage cost.
7. Exam Tips & Quick Reference
Scenario-to-Answer Mapping
| Scenario Keyword / Requirement | Correct Answer |
|---|---|
| "Decouple components to handle traffic spikes" | SQS queue between producer and consumer |
| "Process messages in strict order, exactly once" | SQS FIFO queue |
| "One event → multiple systems process simultaneously" | SNS topic → fan-out to multiple SQS queues |
| "Migrate from JMS/AMQP broker to AWS" | Amazon MQ (protocol compatibility required) |
| "Serverless multi-step workflow with error handling" | AWS Step Functions |
| "Auto Scaling based on SQS queue depth" | Custom metric (QueueDepth/InServiceInstances) → Target Tracking |
| "Eliminate Lambda cold starts for critical function" | Lambda Provisioned Concurrency |
| "DynamoDB microsecond read latency" | DynamoDB DAX |
| "RDS high availability for production" | RDS Multi-AZ deployment |
| "Offload read queries from RDS primary" | RDS Read Replicas (each has its own endpoint; Aurora adds a load-balanced reader endpoint) |
| "Aurora cross-region DR with RPO < 1 second" | Aurora Global Database |
| "DynamoDB active-active multi-region replication" | DynamoDB Global Tables (requires Streams enabled) |
| "Gradual traffic shift to a new app version" | Weighted routing in Route 53 OR ALB weighted target groups |
| "Windows EC2 needs shared file system" | FSx for Windows File Server (SMB) |
| "Linux EC2 needs shared file system across AZs" | Amazon EFS (NFS) |
| "Retain instance logs before Auto Scaling terminates" | Lifecycle hook (Terminating:Wait) → copy logs to S3/CloudWatch |
| "Real-time distributed tracing across microservices" | AWS X-Ray |
| "Detect metric anomalies automatically" | CloudWatch Anomaly Detection |
| "Scheduled task without managing EC2" | EventBridge Scheduler + Lambda |
| "Long-running workflow up to 1 year" | Step Functions Standard Workflow |
| "High-volume short-duration event workflow" | Step Functions Express Workflow |
| "Automatic failover when primary Region fails" | Route 53 Failover routing + health checks |
| "Pilot Light DR; restore compute in minutes" | Pilot Light pattern (AMIs + Launch Templates pre-configured) |
Common Traps
- Multi-AZ standby not readable: The RDS Multi-AZ standby replica cannot serve read traffic. If the question mentions read scaling, use Read Replicas, not Multi-AZ.
- DynamoDB Global Tables require Streams: Always enable DynamoDB Streams before creating Global Tables — a frequently tested prerequisite.
- Amazon MQ vs. SQS: Only choose MQ when the scenario explicitly mentions existing applications using standard protocols (JMS, AMQP, STOMP). New applications should use SQS/SNS.
- Step Functions Express duration: Express Workflows max out at 5 minutes. Standard Workflows support up to 1 year. Confusing these two is a common exam mistake.
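- SNS does not persist messages: SNS stores nothing for later retrieval. It retries failed deliveries according to the delivery policy, but once retries are exhausted the message is discarded unless the subscription has a DLQ. Subscribe an SQS queue to provide durability.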
- Lambda timeout in VPC: A VPC-attached Lambda function cannot reach the internet without a NAT Gateway. To call AWS services it needs either that NAT Gateway or VPC Endpoints; without one of them, calls hang until the function times out.
- S3 One Zone-IA risk: The 11-nines durability figure assumes the AZ survives; data is lost if the single AZ is destroyed. Do not use One Zone-IA for data that cannot be recreated.
Key Terms — Domain 2
| Term | One-Line Definition |
|---|---|
| Visibility Timeout | Time an SQS message is hidden after receipt; reappears if not deleted within this window |
| DLQ (Dead Letter Queue) | Receives SQS/SNS messages that fail processing after N attempts |
| Fan-Out Pattern | SNS → multiple SQS queues; one event triggers multiple independent consumers |
| Provisioned Concurrency | Pre-warmed Lambda environments; eliminates cold starts; billed while configured, used or not |
| Task Definition | ECS blueprint defining container image, CPU, memory, ports, and IAM roles |
| Task Role | IAM role granting the application code inside an ECS container access to AWS services |
| Target Tracking | ASG policy that maintains a specific CloudWatch metric at a target value |
| Lifecycle Hook | Pauses an EC2 instance during ASG launch or termination for custom automation |
| Multi-AZ | RDS standby in a separate AZ; synchronous replication; automatic failover |
| Read Replica | RDS/Aurora async copy; readable; offloads queries; RDS replicas must be manually promoted (Aurora promotes a replica automatically on failover) |
| Aurora Global Database | Cross-region active-passive Aurora cluster; RPO < 1s; RTO < 1 min |
| DAX | DynamoDB Accelerator; in-memory write-through cache; microsecond reads |
| Circuit Breaker | Pattern that stops calling a failing service to prevent cascading failures |
| RPO | Recovery Point Objective; maximum acceptable data loss measured in time |
| RTO | Recovery Time Objective; maximum acceptable downtime before recovery |
| X-Ray | AWS distributed tracing service; end-to-end request visibility across services |