Reliability and Business Continuity
Topic 2 of 5 · Study notes
AWS Certified CloudOps Engineer - Associate (SOA-C03) — Domain 2: Reliability and Business Continuity
Exam Code: SOA-C03 | Level: Associate
Domain Weight: 22% | Total Domains: 5 | Passing Score: 720/1000
Table of Contents
- High Availability Architecture Design
- Database Reliability
- Backup, Restore and Disaster Recovery
- Exam Tips & Quick Reference
1. High Availability Architecture Design
High availability on AWS requires distributing workloads across multiple Availability Zones, using managed services that provide automatic failover, and designing application tiers to be stateless wherever possible.
1.1 Multi-AZ Architecture Principles
The Core HA Formula
If you need a minimum of N instances at all times and want to survive the failure of 1 Availability Zone:
Minimum total instances = N × (number of AZs) ÷ (number of AZs - 1)
Examples: needing 4 always across 2 AZs requires 8 total (4 per AZ); needing 4 always across 3 AZs requires 6 total (2 per AZ — minimum viable HA).
Exam Tip: "Deploy an Auto Scaling group across three AZs with a minimum of 4 instances" — if one AZ fails, you could have as few as 2 instances. The correct answer is minimum of 6 across 3 AZs to always guarantee 4 surviving.
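The formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `min_total_instances` is illustrative, and it assumes instances are spread evenly across AZs, rounding up per AZ when the division is not exact (a detail the formula itself leaves open).

```python
import math


def min_total_instances(required: int, az_count: int, az_failures: int = 1) -> int:
    """Smallest evenly spread fleet that still has `required` instances
    after `az_failures` Availability Zones are lost."""
    if az_count <= az_failures:
        raise ValueError("need more AZs than tolerated AZ failures")
    surviving_azs = az_count - az_failures
    # Each surviving AZ must carry its share of the required capacity.
    per_az = math.ceil(required / surviving_azs)
    return per_az * az_count


print(min_total_instances(4, 2))  # 8 (4 per AZ)
print(min_total_instances(4, 3))  # 6 (2 per AZ)
```

Both results match the worked examples: losing one of two AZs leaves 4 of 8, and losing one of three leaves 4 of 6.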
Stateless vs. Stateful: Design principle is to make the application tier stateless and persist state in a separate managed service (RDS, DynamoDB, ElastiCache). Stateful instances tied to user sessions are harder to scale and replace.
1.2 Elastic Load Balancing
Load Balancer Comparison
| Feature | ALB | NLB | GWLB |
|---|---|---|---|
| OSI Layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) | Layer 3 (IP) |
| Static IP per AZ | No | Yes | No |
| Content-based routing | Yes (path, host, header) | No | No |
| WebSockets / gRPC | Yes | No | No |
| Millions req/sec | Not ideal | Yes | No |
| Preserve source IP | No (use X-Forwarded-For) | Yes | No |
| Use case | Web apps, microservices | Gaming, IoT, VoIP, low latency | 3rd-party firewalls, IDS/IPS |
ALB Source IP and Access Logs
ALB replaces the client IP with its own IP when forwarding to targets. To get the original client IP, read the X-Forwarded-For header in the application, or use ALB access logs (which record the client address in the client:port field) for forensic investigation of traffic spikes.
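Reading the header in application code might look like the sketch below. It assumes the ALB is the only proxy in front of the application and uses its default behavior of appending the connecting client's address to X-Forwarded-For, so the rightmost entry is the address the ALB itself observed. The function name is illustrative, and entries further left are client-supplied and should not be trusted.

```python
from typing import Optional


def client_ip_seen_by_alb(x_forwarded_for: Optional[str]) -> Optional[str]:
    """Return the rightmost X-Forwarded-For entry, or None if the header is absent.

    Assumes a single ALB in append mode; with more proxies in the chain
    (for example CloudFront), the entry to trust sits further left.
    """
    if not x_forwarded_for:
        return None
    hops = [hop.strip() for hop in x_forwarded_for.split(",") if hop.strip()]
    return hops[-1] if hops else None


print(client_ip_seen_by_alb("198.51.100.1, 203.0.113.7"))  # 203.0.113.7
```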
ALB Sticky Sessions
For CloudFront + ALB sticky sessions (users randomly logged out):
- Configure cookie forwarding in the CloudFront distribution cache behavior → forwards ALB cookies to origin.
- Enable application-based sticky sessions on the ALB target group.
Note: Duration-based cookies alone do not work with CloudFront because CloudFront strips them by default. Both changes are required.
Mobile Users Getting Desktop Content
Problem: CloudFront caches the desktop version and serves it to all users. Root cause: CloudFront caches based on URL only and does not forward the User-Agent header. Fix: configure CloudFront behavior to forward the User-Agent header to the origin.
1.3 EC2 Auto Scaling — Deep Dive
Launch Templates vs. Launch Configurations
Launch configurations are legacy and cannot be modified or versioned; to change one, you must create a new launch configuration and attach it to the group. Launch templates are the recommended approach: they are versioned, support mixed instance types, and support a Spot + On-Demand mix.
Auto Scaling Health Checks
Two types are available: EC2 health checks (instance is healthy if running and not impaired — does NOT know if the application is working) and ELB health checks (instance is healthy if the ALB health check passes — validates application response).
Exam Tip: Auto Scaling shows instance healthy (EC2 check) but ALB shows it unhealthy (application check)? Root cause: Auto Scaling group health check type is set to EC2, not ELB. Fix: change health check type to ELB.
SQS-Based Auto Scaling (Worker Pattern)
For EC2 worker fleets processing SQS queues, ApproximateNumberOfMessagesVisible alone is insufficient because it does not account for the number of workers. Use a custom metric:
BacklogPerInstance = ApproximateNumberOfMessagesVisible / NumberOfInstancesInService
Create this as a custom CloudWatch metric and build a target tracking policy against it.
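The metric calculation itself is a single division. In the sketch below, the function name is illustrative, and the fallback for an empty fleet (reporting the raw backlog as if one worker existed) is an assumption made so the metric stays defined.

```python
def backlog_per_instance(visible_messages: int, in_service_instances: int) -> float:
    """ApproximateNumberOfMessagesVisible divided by NumberOfInstancesInService."""
    if in_service_instances <= 0:
        # Assumption: with no workers, report the whole backlog so that
        # target tracking still sees pressure and scales out.
        return float(visible_messages)
    return visible_messages / in_service_instances


print(backlog_per_instance(1500, 10))  # 150.0
```

A scheduled job could compute this value from the SQS queue attribute and the group's in-service count, then publish it through CloudWatch PutMetricData for the target tracking policy to follow.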
Warm Pools
Pre-initializes instances to eliminate cold-start delay. Instance states in warm pool:
| State | Cost | Speed |
|---|---|---|
| Stopped (recommended) | No compute charges; EBS charges apply | Fast |
| Hibernated | Additional EBS charges for RAM state | Faster |
| Running | Full compute charges even when idle | Fastest |
Lifecycle Hooks
Pause instance launch or termination to perform custom actions:
| Hook Type | Pause State | Use Case |
|---|---|---|
| autoscaling:EC2_INSTANCE_LAUNCHING | Pending:Wait | Configure instance, run tests |
| autoscaling:EC2_INSTANCE_TERMINATING | Terminating:Wait | Copy logs to S3, graceful shutdown |
Exam Tip: "Instances terminated before developers could retrieve application logs" → add a termination lifecycle hook → Lambda function copies logs to S3 → complete lifecycle hook.
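The hook-then-Lambda flow in this tip could be sketched as follows. The handler takes the Auto Scaling client and a log-copy callable as parameters so it runs without AWS credentials. `copy_logs` is a hypothetical helper (how logs actually leave the instance is not covered here), `RecordingAutoScaling` is a stand-in for `boto3.client("autoscaling")`, and the event shape assumes the EventBridge notification for a terminate lifecycle action.

```python
class RecordingAutoScaling:
    """Dry-run stand-in for the Auto Scaling client; records calls only."""

    def __init__(self):
        self.calls = []

    def complete_lifecycle_action(self, **kwargs):
        self.calls.append(kwargs)


def handle_termination_hook(event, autoscaling, copy_logs):
    """Copy logs for the terminating instance, then release the hook."""
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]
    try:
        copy_logs(instance_id)  # hypothetical: ship application logs to S3
    finally:
        # Always release the hook so the instance does not sit in
        # Terminating:Wait until the heartbeat timeout expires.
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=detail["LifecycleHookName"],
            AutoScalingGroupName=detail["AutoScalingGroupName"],
            LifecycleActionToken=detail["LifecycleActionToken"],
            LifecycleActionResult="CONTINUE",
            InstanceId=instance_id,
        )
```

In a real Lambda function, the `autoscaling` argument would be a boto3 Auto Scaling client, whose `complete_lifecycle_action` call accepts these same keyword arguments.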
2. Database Reliability
2.1 Amazon RDS High Availability
RDS Multi-AZ
Multi-AZ uses synchronous replication to a standby replica in a different AZ of the same region. The standby is NOT accessible for reads. Automatic failover takes 1–2 minutes with DNS CNAME automatically redirected to standby. Failover triggers include: AZ failure, primary failure, OS patching, and manual failover.
RDS Read Replicas
Read replicas use asynchronous replication with up to 5 per RDS instance. They CAN be used for reads, can be in a different region (Cross-Region Read Replica), and can be promoted to standalone DB — but promotion takes minutes (I/O replay required). They are NOT the same as Multi-AZ standby.
RDS Encryption Constraint
You cannot encrypt an existing unencrypted RDS instance directly. The correct process is:
- Take a snapshot of the unencrypted instance.
- Copy the snapshot with encryption enabled.
- Restore from the encrypted snapshot to create a new encrypted RDS instance.
- Point the application to the new instance and decommission the old one.
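The first three steps above map onto three RDS API calls with a wait between each. The sketch below assumes a boto3-style RDS client is passed in. The snapshot naming scheme and the `RecordingRds` stand-in (which only records call names so the flow can be dry-run without AWS) are illustrative, not part of any AWS API.

```python
class RecordingRds:
    """Dry-run stand-in for boto3.client("rds"); records method names only."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.calls.append(name)
            return self  # lets get_waiter(...).wait(...) chain

        return record


def encrypt_via_snapshot(rds, source_instance, target_instance, kms_key_id):
    """Snapshot, copy with encryption, then restore to a new instance."""
    plain_snap = f"{source_instance}-plain-snap"          # illustrative names
    encrypted_snap = f"{source_instance}-encrypted-snap"

    rds.create_db_snapshot(
        DBInstanceIdentifier=source_instance, DBSnapshotIdentifier=plain_snap
    )
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=plain_snap)

    # Supplying a KMS key on the copy is what produces an encrypted snapshot.
    rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=plain_snap,
        TargetDBSnapshotIdentifier=encrypted_snap,
        KmsKeyId=kms_key_id,
    )
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=encrypted_snap)

    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=target_instance, DBSnapshotIdentifier=encrypted_snap
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=target_instance)
    return target_instance
```

The final step, repointing the application and decommissioning the old instance, is deliberately left out because it depends on how the application discovers its database endpoint.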
2.2 Amazon Aurora — Comprehensive Reference
Aurora Architecture
Aurora uses a shared storage volume that automatically grows in 10 GB increments (up to 128 TB) and is replicated 6 ways across 3 AZs. Replicas read from the SAME storage volume, so no data is copied to them and replica lag stays small (typically under 100 ms).
Aurora Replica vs. RDS Read Replica
| Feature | Aurora Replica | RDS Read Replica |
|---|---|---|
| Replication method | Storage-level (shared volume) | I/O log shipping |
| Replication lag | Typically < 100 ms | Seconds to minutes |
| Failover | Automatic, instant (< 30 seconds) | Manual promotion (minutes) |
| Max count | 15 | 5 |
| Reader endpoint | Yes (auto load balances) | No built-in |
Aurora Backtracking vs. PITR
| Feature | Backtracking | Point-in-Time Recovery (PITR) |
|---|---|---|
| How it works | Rewinds existing cluster in-place | Creates a NEW cluster from backup |
| Speed | Fast (minutes) | Slower (depends on data size) |
| Same cluster | Yes | No (new cluster created) |
| Max window | Up to 72 hours | Up to 35 days |
| Use case | Accidental data changes in last 72 hours | Older recovery or compliance |
Exam Tip: "Roll back the DB cluster to a specific recovery point within the previous 72 hours; restores must be in the same production DB cluster." → Aurora Backtracking — PITR creates a new cluster, backtracking rewinds in place.
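The decision in the table can be written as a small rule. This sketch uses only the two limits stated above (72 hours and 35 days) and assumes the cluster's backtrack window and backup retention are both configured at those maximums. The function name and return labels are illustrative.

```python
from datetime import timedelta

BACKTRACK_MAX = timedelta(hours=72)  # maximum backtrack window
PITR_MAX = timedelta(days=35)        # maximum backup retention


def choose_aurora_recovery(age: timedelta, same_cluster_required: bool) -> str:
    """Pick a recovery method for a target point `age` in the past."""
    if age <= BACKTRACK_MAX:
        return "backtrack"  # rewinds the existing cluster in place
    if same_cluster_required:
        return "none"       # PITR always creates a new cluster
    if age <= PITR_MAX:
        return "pitr"
    return "none"


print(choose_aurora_recovery(timedelta(hours=10), same_cluster_required=True))  # backtrack
print(choose_aurora_recovery(timedelta(days=5), same_cluster_required=False))   # pitr
```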
Aurora Global Database
Cross-region replication with < 1 second RPO and promotion in < 1 minute during regional failure (RTO). Up to 5 secondary regions; secondary regions are read-only until promoted.
2.3 Amazon DynamoDB Reliability
DynamoDB Global Tables
Global Tables require DynamoDB Streams to be enabled and the same table name in all regions. They provide multi-region, multi-master replication — write to any region and it replicates to all others. RPO is near zero; RTO < 1 minute.
DynamoDB Protection Features
DynamoDB offers point-in-time recovery (PITR) for continuous backups restorable to any second in the last 35 days, on-demand backups for manual full backups stored until explicitly deleted, and deletion protection to prevent accidental table deletion.
2.4 Amazon ElastiCache Reliability
Memcached vs. Redis
| Feature | Memcached | Redis |
|---|---|---|
| Data persistence | No | Yes (RDB/AOF) |
| Pub/Sub | No | Yes |
| Sorted sets, lists | No | Yes |
| Multi-AZ failover | No | Yes (with replication groups) |
Diagnosing High Evictions
The Evictions metric being high means the cache is full and items are being evicted before they expire, causing more cache misses and increased database load.
Solutions: add nodes to the cluster (horizontal scaling, more total memory) or increase individual node size (vertical scaling, more memory per node).
Exam Tip: Wrong answers for high evictions: adding an ELB in front of ElastiCache, adding SQS to decouple, or increasing TTL (which makes evictions worse by keeping items longer).
3. Backup, Restore and Disaster Recovery
3.1 AWS Backup Service
AWS Backup supports EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, S3, and AWS Storage Gateway. Deploy backup policies organization-wide from the management account using AWS Organizations → Backup Policies, then each member account gets backup plans applied automatically. Include cross-region backup copies in the backup plan for disaster recovery coverage.
For EFS, enabling AWS Backup and using a partial restore job lets you recover individual files or directories rapidly without performing a full restore (which creates a new EFS file system).
3.2 Amazon S3 Data Protection
S3 Versioning
Versioning preserves all object versions. Deleting an object creates a delete marker (the object is not actually removed). To permanently delete, you must delete the specific version ID. Once enabled, versioning can only be suspended, not completely disabled.
S3 MFA Delete
With MFA Delete enabled, the following actions require MFA: permanently removing a versioned object and suspending versioning. The following do NOT require MFA: adding objects (PutObject), listing versions, creating delete markers, and enabling versioning.
Exam Tip: To enable MFA Delete, you must use the AWS CLI with root account credentials — not IAM user, not the console.
S3 Object Lock
WORM protection options:
| Mode | Override Allowed? | Who Can Override |
|---|---|---|
| Governance | Yes | Users with special permissions |
| Compliance | No | Nobody — not even root during retention period |
| Legal hold | N/A | Users with s3:PutObjectLegalHold |
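Object Lock requires versioning. Older guidance states that it must be enabled at bucket creation; S3 now also allows enabling it on an existing bucket that has versioning turned on, so treat "cannot be enabled on an existing bucket" as dated.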
Cross-Account S3 — Object Ownership
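Objects uploaded by Account A are owned by Account A, not by the bucket owner (Account B), so IAM users in Account B cannot delete them by default. Fix: have the writer in Account A (a Lambda function in this scenario) include the bucket-owner-full-control ACL when writing:
import boto3

s3 = boto3.client("s3")
data = b"example payload"  # placeholder for the bytes being uploaded

# The ACL grants the bucket owner (Account B) full control of the new object.
s3.put_object(
    Bucket='account-b-bucket',
    Key='myfile.txt',
    Body=data,
    ACL='bucket-owner-full-control'
)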
3.3 Disaster Recovery Strategies
DR spectrum, cheapest to most expensive: Backup & Restore (hours RPO/RTO, cheapest) → Pilot Light (minutes RPO/RTO, low cost) → Warm Standby (minutes RPO/RTO, medium cost) → Multi-Site Active/Active (near-zero RPO/RTO, most expensive)
| Strategy | What Runs Continuously | RTO | RPO |
|---|---|---|---|
| Backup & Restore | Nothing; periodic snapshots to S3 | Hours | Hours |
| Pilot Light | Database with data only | Minutes–Hours | Minutes |
| Warm Standby | Scaled-down but fully functional copy | Minutes | Minutes |
| Multi-Site Active/Active | Both sites fully operational | Near-zero | Near-zero |
3.4 AWS Storage Gateway for Backup
Volume Gateway Types
| Mode | Primary Storage | Backup Location | Use Case |
|---|---|---|---|
| Stored volumes | On-premises (local disk) | Async backup to S3 | All data must be local; cloud is backup |
| Cached volumes | S3 (cloud) | Frequently accessed cached locally | Cloud is primary; local cache for performance |
Exam Tip: "All data must be available locally; use AWS for backup" → Storage Gateway with stored volumes. Cached volumes keep primary data in S3 (cloud primary), which is the opposite of what this scenario requires.
3.5 Amazon CloudFront Reliability
CloudFront Cache Invalidation
When S3 objects are updated with the same filename and users still see old content, the root cause is CloudFront has cached the old version at edge locations.
Fix: create a CloudFront invalidation:
aws cloudfront create-invalidation \
    --distribution-id EDFDVBD6EXAMPLE \
    --paths "/images/*" "/index.html"
Alternative: use versioned filenames (app.v2.js instead of app.js) — no invalidation needed; old files expire naturally.
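One way to produce versioned filenames automatically is to embed a short content hash in the name, so any change to the file yields a new URL. This is a swapped-in variant of the manual `app.v2.js` scheme above, not something the notes prescribe. The function name and the 8-character digest length are illustrative choices.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short SHA-256 prefix before the file extension."""
    p = PurePosixPath(path)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


name = versioned_name("js/app.js", b"console.log('hello');")
print(name)  # js/app.<8 hex characters>.js
```

Because the name changes only when the content changes, unchanged files keep their cached copies at the edge while updated files are fetched fresh.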
CloudFront 502 Bad Gateway
This occurs when CloudFront cannot communicate with the origin. Most common causes: SSL/TLS certificate expired on the origin server, or hostname mismatch between the certificate CN/SAN and the origin hostname CloudFront connects to. Troubleshoot by examining the certificate expiration date directly on the origin.
S3 + CloudFront + Origin Access Identity (OAI)
To restrict S3 bucket access to CloudFront only:
- Create an OAI in CloudFront and associate it with the distribution's S3 origin.
- Update the S3 bucket policy to allow only the OAI principal for s3:GetObject.
- Ensure S3 Block Public Access is enabled.
Direct S3 URLs will return 403; only CloudFront can serve the content.
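The bucket policy from the steps above could be generated as in the sketch below. The bucket name and OAI ID are placeholders and the function name is illustrative. The principal ARN follows the format CloudFront uses for origin access identities.

```python
import json


def oai_bucket_policy(bucket: str, oai_id: str) -> dict:
    """Bucket policy allowing only the given CloudFront OAI to read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontOAIRead",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::cloudfront:user/"
                    f"CloudFront Origin Access Identity {oai_id}"
                },
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Placeholder values for illustration only.
print(json.dumps(oai_bucket_policy("example-bucket", "E2EXAMPLEOAI"), indent=2))
```

The printed JSON is what would be attached to the bucket; with no other Allow statements and Block Public Access enabled, direct S3 requests are denied.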
3.6 Route 53 for High Availability
Failover Routing Configuration
For primary/secondary failover, create Alias records (not CNAME) pointing to ALBs in primary and secondary regions. Set routing policy to Failover, set primary record type to Primary and secondary to Secondary, set Evaluate Target Health to Yes for both, and associate a health check with the primary record.
Key Concept: At the zone apex (example.com), a CNAME record is prohibited by DNS specification (RFC 1034). Always use a Route 53 Alias record at the zone apex to point to ALBs, NLBs, CloudFront, S3 website endpoints, and API Gateway.
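The failover configuration described in this section could be expressed as a Route 53 change batch like the sketch below. The helper name, the SetIdentifier scheme, and all DNS names, hosted zone IDs, and health check IDs are placeholders. The dictionary keys follow the ChangeResourceRecordSets request shape.

```python
def failover_alias_change(name, role, alb_dns_name, alb_zone_id, health_check_id=None):
    """Build one UPSERT change for a failover Alias A record pointing to an ALB."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",  # Alias records to ALBs use type A (or AAAA)
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,   # the ALB's zone ID, not your own
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        failover_alias_change(
            "example.com", "PRIMARY",
            "primary-alb.example.elb.amazonaws.com", "ZEXAMPLEPRIMARY",
            health_check_id="hc-example-primary",
        ),
        failover_alias_change(
            "example.com", "SECONDARY",
            "secondary-alb.example.elb.amazonaws.com", "ZEXAMPLESECONDARY",
        ),
    ]
}
```

A batch like this would be passed as the `ChangeBatch` argument of the Route 53 `change_resource_record_sets` call, along with the ID of the hosted zone that contains example.com.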
Route 53 Routing Policies
| Policy | Routes Based On | When to Use |
|---|---|---|
| Latency | Measured latency to each region | Route to lowest latency region |
| Geolocation | User's geographic location (continent, country, or US state) | European users → eu-central-1 |
| Geoproximity | Location + configurable bias | Expand/shrink geographic coverage |
| Failover | Health check results | Active/passive DR |
| Weighted | Percentage distribution | A/B testing, gradual migration |
Exam Tip: "Extend to new region; route users to lowest latency endpoint without changing URL" → Latency routing — not geolocation, which routes by user location, not measured performance.
3.7 AWS Global Accelerator
Global Accelerator provides 2 static anycast IP addresses (globally fixed; can be whitelisted by clients) and routes traffic through the AWS global network instead of the public internet. It supports ALB, NLB, EC2 instances, and Elastic IP addresses as endpoints.
| Feature | Global Accelerator | Route 53 Latency |
|---|---|---|
| IP addresses | 2 fixed static IPs | DNS names (IPs change) |
| Failover time | < 30 seconds | Up to 5 minutes (DNS TTL) |
| Traffic path | AWS global network | Public internet |
| Use case | Static IP requirement, fastest failover | Standard multi-region DNS routing |
Exam Tip: If the question mentions "requires 2 static IP addresses" OR "fastest failover time," the answer is AWS Global Accelerator.
Exam Tips & Quick Reference
Scenario-to-Answer Mapping
| Scenario Keyword / Requirement | Correct Answer |
|---|---|
| "Encrypt existing unencrypted RDS" | Snapshot → copy with encryption → restore (not in-place) |
| "Encrypt existing unencrypted EFS" | Create new encrypted EFS → migrate data |
| "Enable Multi-AZ on existing RDS" | Modify instance → enable Multi-AZ option |
| "Aurora replication lag → stale cart" | AuroraReplicaLagMaximum high; read replica returning stale data |
| "Rewind Aurora cluster in-place ≤ 72 hrs" | Aurora Backtracking (PITR creates new cluster) |
| "Cross-region DR for RDS" | Cross-Region Read Replica |
| "DynamoDB multi-region replication" | Enable DynamoDB Streams → add Global Table region |
| "ElastiCache high evictions" | Add nodes (horizontal) OR increase node size (vertical) |
| "Mobile users getting desktop version" | Configure CloudFront to forward User-Agent header |
| "CloudFront serving old S3 content" | Create CloudFront invalidation |
| "CloudFront 502 error with custom origin" | SSL certificate expired on origin server |
| "S3 bucket accessible only via CloudFront" | Create OAI → assign to distribution → update bucket policy |
| "Zone apex pointing to ALB" | Route 53 Alias record (not CNAME) |
| "Presigned URL failing to upload" | URL creator does not have s3:PutObject permission |
| "Account B users can't delete objects from Account A" | Writer in Account A (Lambda) must upload objects with the bucket-owner-full-control ACL |
| "2 static IPs for application globally" | AWS Global Accelerator |
| "Fastest failover between regions" | Global Accelerator (< 30 sec) vs. Route 53 (up to 5 min) |
| "All data must be local, cloud is backup" | Storage Gateway stored volumes (not cached) |
| "ASG healthy in console but unhealthy in ALB" | Change ASG health check type from EC2 to ELB |
Common Traps
- RDS Multi-AZ standby is not readable: It exists only for failover; all reads still go to the primary. Read replicas are readable — these are different features.
- Aurora Backtracking vs. PITR: Both restore data but PITR creates a new cluster. If the question says "same cluster" or "in-place," the answer is Backtracking.
- CloudFront 502 vs. 503: 502 = origin SSL/TLS issue (almost always an expired cert). 503 = no healthy targets behind ALB.
- Storage Gateway stored vs. cached: Stored = local primary, cloud backup. Cached = cloud primary, local cache. These are the OPPOSITE of what you might initially assume.
Key Terms — Domain 2
| Term | One-Line Definition |
|---|---|
| RDS Multi-AZ | Synchronous standby replica in different AZ; NOT readable; automatic DNS failover |
| RDS Read Replica | Async replica for read offloading; can be cross-region; manually promoted |
| Aurora Backtracking | Rewinds existing Aurora cluster in-place up to 72 hours; no new cluster created |
| DynamoDB Global Tables | Multi-region, multi-master replication; requires DynamoDB Streams |
| Warm Pool | Pre-initialized EC2 instances that can be attached to ASG instantly on scale-out |
| Lifecycle Hook | Pause during ASG launch or termination to run custom actions |
| OAI (Origin Access Identity) | CloudFront identity used to restrict S3 bucket access to CloudFront only |
| Global Accelerator | AWS service providing 2 static IPs and routing via AWS backbone |
| Object Lock | S3 WORM protection preventing object deletion during retention period |
| MFA Delete | S3 setting requiring MFA to permanently delete object versions |
End of Domain 2. Continue to Domain 3: Deployment, Provisioning & Automation →