Reliability and Business Continuity

AWS Certified CloudOps Engineer - Associate (SOA-C03) — Domain 2: Reliability and Business Continuity

Exam Code: SOA-C03 | Level: Associate
Domain Weight: 22% | Total Domains: 5 | Passing Score: 720/1000

High Availability Architecture Design
Database Reliability
Backup, Restore and Disaster Recovery
Exam Tips & Quick Reference

1. High Availability Architecture Design

High availability on AWS requires distributing workloads across multiple Availability Zones, using managed services that provide automatic failover, and designing application tiers to be stateless wherever possible.

1.1 Multi-AZ Architecture Principles

The Core HA Formula

If you need a minimum of N instances at all times and want to survive the failure of 1 Availability Zone:

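Instances per AZ = ceil(N ÷ (number of AZs - 1))
Minimum total instances = instances per AZ × number of AZs

When N divides evenly by (number of AZs - 1), this equals N × (number of AZs) ÷ (number of AZs - 1).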

Examples: needing 4 always across 2 AZs requires 8 total (4 per AZ); needing 4 always across 3 AZs requires 6 total (2 per AZ — minimum viable HA).

Exam Tip: "Deploy an Auto Scaling group across three AZs with a minimum of 4 instances" — if one AZ fails, you could have as few as 2 instances. The correct answer is minimum of 6 across 3 AZs to always guarantee 4 surviving.
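The arithmetic above can be checked with a small helper. This is a minimal sketch that assumes instances are spread evenly across AZs and that exactly one AZ fails; the function name is hypothetical.

```python
import math


def min_total_instances(required: int, az_count: int) -> int:
    """Smallest evenly spread fleet that still has `required` instances after losing one AZ."""
    if az_count < 2:
        raise ValueError("need at least 2 AZs to survive the loss of one")
    # The surviving (az_count - 1) AZs must carry the full requirement.
    per_az = math.ceil(required / (az_count - 1))
    return per_az * az_count


print(min_total_instances(4, 2))  # 8 (4 per AZ)
print(min_total_instances(4, 3))  # 6 (2 per AZ)
```

Rounding up per AZ matters when the requirement does not divide evenly: needing 5 across 3 AZs gives 3 per AZ, so 9 in total.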

Stateless vs. Stateful: The design principle is to keep the application tier stateless and persist state in a separate managed service (RDS, DynamoDB, ElastiCache). Stateful instances that hold user sessions are harder to scale and replace, because losing the instance loses the session.

1.2 Elastic Load Balancing

Load Balancer Comparison

Feature	ALB	NLB	GLB
OSI Layer	Layer 7 (HTTP/HTTPS)	Layer 4 (TCP/UDP/TLS)	Layer 3 (IP)
Static IP per AZ	No	Yes	No
Content-based routing	Yes (path, host, header)	No	No
WebSockets / gRPC	Yes	No	No
Millions req/sec	Not ideal	Yes	No
Preserve source IP	No (use X-Forwarded-For)	Yes	No
Use case	Web apps, microservices	Gaming, IoT, VoIP, low latency	3rd-party firewalls, IDS/IPS

ALB Source IP and Access Logs

ALB terminates the client connection, so targets see the load balancer's private IP as the source address. To get the original client IP, read the X-Forwarded-For header in the application, or use ALB access logs (the client:port field) for forensic investigation of traffic spikes.
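Reading the header in the application might look like the sketch below. It assumes the ALB is in its default mode, where it appends the connecting client's address to any existing X-Forwarded-For value, so the leftmost entry is the originating client as reported. The function name is hypothetical, and because a client can send its own forged header, the leftmost entry should not be used for security decisions.

```python
from typing import Optional


def client_ip_from_xff(header_value: Optional[str]) -> Optional[str]:
    """Return the leftmost address in an X-Forwarded-For header, or None if absent."""
    if not header_value:
        return None
    # Entries are comma-separated; each proxy appends the address it saw.
    first = header_value.split(",")[0].strip()
    return first or None


print(client_ip_from_xff("203.0.113.7, 10.0.0.12"))  # 203.0.113.7
```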

ALB Sticky Sessions

For CloudFront + ALB sticky sessions (users randomly logged out):

Configure cookie forwarding in the CloudFront distribution cache behavior → forwards ALB cookies to origin.
Enable application-based sticky sessions on the ALB target group.

Note: Duration-based cookies alone do not work with CloudFront because CloudFront strips them by default. Both changes are required.

Mobile Users Getting Desktop Content

Problem: CloudFront caches the desktop version and serves it to all users. Root cause: CloudFront caches based on URL only and does not forward the User-Agent header. Fix: configure CloudFront behavior to forward the User-Agent header to the origin.

1.3 EC2 Auto Scaling — Deep Dive

Launch Templates vs. Launch Configurations

Launch configurations are legacy and immutable: to change one you must create a new launch configuration and point the Auto Scaling group at it. Launch templates are the recommended approach: they are versioned, support mixed instance types, and support a Spot + On-Demand mix.

Auto Scaling Health Checks

Two types are available: EC2 health checks (instance is healthy if running and not impaired — does NOT know if the application is working) and ELB health checks (instance is healthy if the ALB health check passes — validates application response).

Exam Tip: Auto Scaling shows instance healthy (EC2 check) but ALB shows it unhealthy (application check)? Root cause: Auto Scaling group health check type is set to EC2, not ELB. Fix: change health check type to ELB.

SQS-Based Auto Scaling (Worker Pattern)

For EC2 worker fleets processing SQS queues, ApproximateNumberOfMessagesVisible alone is insufficient because it does not account for the number of workers. Use a custom metric:

BacklogPerInstance = ApproximateNumberOfMessagesVisible / NumberOfInstancesInService

Create this as a custom CloudWatch metric and build a target tracking policy against it.
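As a rough illustration of the metric above, the sketch below computes the value and builds the keyword arguments one might pass to CloudWatch `put_metric_data`. The namespace, dimension, and function names are assumptions for illustration, and treating an empty fleet as one instance is a choice made here to avoid dividing by zero.

```python
def backlog_per_instance(visible_messages: int, in_service_instances: int) -> float:
    """ApproximateNumberOfMessagesVisible divided by the number of InService instances."""
    # Assumption: an empty fleet is treated as one instance so the metric stays defined.
    return visible_messages / max(in_service_instances, 1)


def backlog_metric_kwargs(value: float, queue_name: str) -> dict:
    """Keyword arguments shaped for a CloudWatch put_metric_data call."""
    return {
        "Namespace": "Custom/SQSWorkers",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "BacklogPerInstance",
                "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
                "Value": value,
                "Unit": "Count",
            }
        ],
    }


value = backlog_per_instance(1500, 10)
print(value)  # 150.0
print(backlog_metric_kwargs(value, "orders-queue")["MetricData"][0]["MetricName"])
```

A target tracking policy would then hold this metric near a chosen target, for example the acceptable backlog one worker can clear within the latency budget.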

Warm Pools

Pre-initializes instances to eliminate cold-start delay. Instance states in warm pool:

State	Cost	Speed
Stopped (recommended)	No compute charges; EBS charges apply	Fast
Hibernated	Additional EBS charges for RAM state	Faster
Running	Full compute charges even when idle	Fastest

Lifecycle Hooks

Pause instance launch or termination to perform custom actions:

Hook Type	Pause State	Use Case
`autoscaling:EC2_INSTANCE_LAUNCHING`	`Pending:Wait`	Configure instance, run tests
`autoscaling:EC2_INSTANCE_TERMINATING`	`Terminating:Wait`	Copy logs to S3, graceful shutdown

Exam Tip: "Instances terminated before developers could retrieve application logs" → add a termination lifecycle hook → Lambda function copies logs to S3 → complete lifecycle hook.
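The last step of that flow, completing the hook, can be sketched as follows. This assumes the Lambda function is triggered by the EventBridge event that Auto Scaling emits for the lifecycle action, and it only builds the arguments for the Auto Scaling `complete_lifecycle_action` call; the log copy itself is out of scope, and the function name and sample values are hypothetical.

```python
def completion_params(event: dict, succeeded: bool = True) -> dict:
    """Build complete_lifecycle_action arguments from a lifecycle-action event."""
    detail = event["detail"]
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
        # CONTINUE lets the transition proceed; ABANDON skips any remaining hooks.
        "LifecycleActionResult": "CONTINUE" if succeeded else "ABANDON",
    }


sample_event = {
    "detail": {
        "LifecycleHookName": "copy-logs-hook",
        "AutoScalingGroupName": "web-asg",
        "LifecycleActionToken": "00000000-0000-0000-0000-000000000000",
        "EC2InstanceId": "i-0123456789abcdef0",
    }
}
print(completion_params(sample_event))
```

Until the hook is completed or its heartbeat timeout expires, the instance stays in `Terminating:Wait`, which is the window in which the logs can be copied.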

2. Database Reliability

2.1 Amazon RDS High Availability

RDS Multi-AZ

Multi-AZ uses synchronous replication to a standby replica in a different AZ of the same region. The standby is NOT accessible for reads. Automatic failover takes 1–2 minutes with DNS CNAME automatically redirected to standby. Failover triggers include: AZ failure, primary failure, OS patching, and manual failover.

RDS Read Replicas

Read replicas use asynchronous replication with up to 5 per RDS instance. They CAN be used for reads, can be in a different region (Cross-Region Read Replica), and can be promoted to standalone DB — but promotion takes minutes (I/O replay required). They are NOT the same as Multi-AZ standby.

RDS Encryption Constraint

You cannot encrypt an existing unencrypted RDS instance directly. The correct process is:

Take a snapshot of the unencrypted instance.
Copy the snapshot with encryption enabled.
Restore from the encrypted snapshot to create a new encrypted RDS instance.
Point the application to the new instance and decommission the old one.
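The first three steps above map onto three RDS API calls. The sketch below lists them in order as (operation, arguments) pairs rather than calling AWS, so it can be read or run without credentials. The identifier naming scheme and function name are assumptions, and a real script would wait for each snapshot to become available before the next step.

```python
def encryption_migration_plan(source_instance: str, kms_key_id: str) -> list:
    """Ordered RDS operations for moving an unencrypted instance to an encrypted one."""
    plain_snapshot = f"{source_instance}-plain-snap"    # hypothetical naming scheme
    encrypted_snapshot = f"{source_instance}-enc-snap"
    new_instance = f"{source_instance}-encrypted"
    return [
        ("create_db_snapshot", {
            "DBInstanceIdentifier": source_instance,
            "DBSnapshotIdentifier": plain_snapshot,
        }),
        # Supplying a KMS key on the copy is what produces the encrypted snapshot.
        ("copy_db_snapshot", {
            "SourceDBSnapshotIdentifier": plain_snapshot,
            "TargetDBSnapshotIdentifier": encrypted_snapshot,
            "KmsKeyId": kms_key_id,
        }),
        ("restore_db_instance_from_db_snapshot", {
            "DBInstanceIdentifier": new_instance,
            "DBSnapshotIdentifier": encrypted_snapshot,
        }),
    ]


for operation, arguments in encryption_migration_plan("orders-db", "alias/rds-key"):
    print(operation, arguments)
```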

2.2 Amazon Aurora — Comprehensive Reference

Aurora Architecture

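Aurora uses a shared cluster storage volume that grows automatically in 10 GB increments (up to 128 TB) and keeps 6 copies of the data across 3 AZs. Replicas read from the SAME storage volume, so no data is copied to them; the small remaining replica lag (typically under 100 ms) comes from keeping each replica's in-memory cache current.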

Aurora Replica vs. RDS Read Replica

Feature	Aurora Replica	RDS Read Replica
Replication method	Storage-level (shared volume)	Engine-native asynchronous replication (e.g., MySQL binlog, PostgreSQL WAL)
Replication lag	Typically < 100 ms	Seconds to minutes
Failover	Automatic promotion, typically < 30 seconds	Manual promotion (minutes)
Max count	15	5
Reader endpoint	Yes (auto load balances)	No built-in

Aurora Backtracking vs. PITR

Feature	Backtracking	Point-in-Time Recovery (PITR)
How it works	Rewinds existing cluster in-place	Creates a NEW cluster from backup
Speed	Fast (minutes)	Slower (depends on data size)
Same cluster	Yes	No (new cluster created)
Max window	Up to 72 hours	Up to 35 days
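Engine support	Aurora MySQL only; must be enabled when the cluster is created or restored	Aurora MySQL and Aurora PostgreSQL
Use case	Accidental data changes in last 72 hours	Older recovery or compliance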

Exam Tip: "Roll back the DB cluster to a specific recovery point within the previous 72 hours; restores must be in the same production DB cluster." → Aurora Backtracking — PITR creates a new cluster, backtracking rewinds in place.
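The decision in the table and tip above can be written as a small rule. This is a study aid rather than an AWS API: it assumes Backtracking is enabled with the full 72-hour window and that automated backups are retained for the full 35 days; the function name and return labels are hypothetical.

```python
BACKTRACK_WINDOW_HOURS = 72
PITR_WINDOW_HOURS = 35 * 24


def aurora_recovery_option(hours_back: float, must_stay_in_cluster: bool) -> str:
    """Pick 'backtrack', 'pitr', or 'none' for a recovery point `hours_back` hours ago."""
    if hours_back < 0:
        raise ValueError("hours_back must not be negative")
    if hours_back <= BACKTRACK_WINDOW_HOURS:
        return "backtrack"  # rewinds the existing cluster in place
    if hours_back <= PITR_WINDOW_HOURS and not must_stay_in_cluster:
        return "pitr"       # restores into a new cluster
    return "none"


print(aurora_recovery_option(24, must_stay_in_cluster=True))    # backtrack
print(aurora_recovery_option(100, must_stay_in_cluster=False))  # pitr
```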

Aurora Global Database

Cross-region replication with < 1 second RPO and promotion in < 1 minute during regional failure (RTO). Up to 5 secondary regions; secondary regions are read-only until promoted.

2.3 Amazon DynamoDB Reliability

DynamoDB Global Tables

Global Tables require DynamoDB Streams to be enabled and the same table name in all regions. They provide multi-region, multi-active (multi-master) replication: a write to any replica table is propagated to all the others, and conflicting concurrent writes are resolved last-writer-wins. RPO is near zero; RTO < 1 minute.

DynamoDB Protection Features

DynamoDB offers point-in-time recovery (PITR) for continuous backups restorable to any second in the last 35 days, on-demand backups for manual full backups stored until explicitly deleted, and deletion protection to prevent accidental table deletion.

2.4 Amazon ElastiCache Reliability

Memcached vs. Redis

Feature	Memcached	Redis
Data persistence	No	Yes (RDB/AOF)
Pub/Sub	No	Yes
Sorted sets, lists	No	Yes
Multi-AZ failover	No	Yes (with replication groups)

Diagnosing High Evictions

The Evictions metric being high means the cache is full and items are being evicted before they expire, causing more cache misses and increased database load.

Solutions: add nodes to the cluster (horizontal scaling, more total memory) or increase individual node size (vertical scaling, more memory per node).

Exam Tip: Wrong answers for high evictions: adding an ELB in front of ElastiCache, adding SQS to decouple, or increasing TTL (which makes evictions worse by keeping items longer).

3. Backup, Restore and Disaster Recovery

3.1 AWS Backup Service

AWS Backup supports EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, S3, and AWS Storage Gateway. Deploy backup policies organization-wide from the management account using AWS Organizations → Backup Policies, then each member account gets backup plans applied automatically. Include cross-region backup copies in the backup plan for disaster recovery coverage.

For EFS, enabling AWS Backup and running an item-level (partial) restore job lets you recover individual files or directories quickly, without a full restore of the entire file system (a full restore can target either a new or an existing EFS file system).

3.2 Amazon S3 Data Protection

S3 Versioning

Versioning preserves all object versions. Deleting an object creates a delete marker (the object is not actually removed). To permanently delete, you must delete the specific version ID. Once enabled, versioning can only be suspended, not completely disabled.

S3 MFA Delete

With MFA Delete enabled, the following actions require MFA: permanently removing a versioned object and suspending versioning. The following do NOT require MFA: adding objects (PutObject), listing versions, creating delete markers, and enabling versioning.

Exam Tip: To enable MFA Delete, you must use the AWS CLI with root account credentials — not IAM user, not the console.

S3 Object Lock

WORM protection options:

Mode	Override Allowed?	Who Can Override
Governance	Yes	Users with special permissions
Compliance	No	Nobody — not even root during retention period
Legal hold	N/A	Users with `s3:PutObjectLegalHold`

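Object Lock requires versioning. It was originally available only at bucket creation; since late 2023 it can also be enabled on an existing bucket, although older exam questions may still assume the creation-time restriction.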

Cross-Account S3 — Object Ownership

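When a bucket uses ACL-based ownership, objects uploaded by Account A are owned by Account A, not by the bucket owner (Account B). IAM users in Account B cannot delete them by default. One option is to set S3 Object Ownership on the bucket to Bucket owner enforced, which disables ACLs so that the bucket owner owns every object. The fix in the exam scenario is to modify the writer in Account A (here, a Lambda function) to include the bucket-owner-full-control ACL when writing: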

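import boto3

s3 = boto3.client('s3')
data = b'example payload'  # placeholder body for illustration

# Grant the bucket owner (Account B) full control of the uploaded object
s3.put_object(
    Bucket='account-b-bucket',
    Key='myfile.txt',
    Body=data,
    ACL='bucket-owner-full-control'
)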

3.3 Disaster Recovery Strategies

Backup & Restore      →   Pilot Light                     →   Warm Standby       →   Multi-Site Active/Active
Hours RPO/RTO             Minutes RPO, minutes–hours RTO      Minutes RPO/RTO        Near-zero RPO/RTO
Cheapest                  Low cost                            Medium cost            Most expensive

Strategy	What Runs Continuously	RTO	RPO
Backup & Restore	Nothing; periodic snapshots to S3	Hours	Hours
Pilot Light	Core data layer only (replicated database); application servers off until failover	Minutes–Hours	Minutes
Warm Standby	Scaled-down but fully functional copy	Minutes	Minutes
Multi-Site Active/Active	Both sites fully operational	Near-zero	Near-zero

3.4 AWS Storage Gateway for Backup

Volume Gateway Types

Mode	Primary Storage	Backup Location	Use Case
Stored volumes	On-premises (local disk)	Async backup to S3	All data must be local; cloud is backup
Cached volumes	S3 (cloud)	Frequently accessed cached locally	Cloud is primary; local cache for performance

Exam Tip: "All data must be available locally; use AWS for backup" → Storage Gateway with stored volumes. Cached volumes keep primary data in S3 (cloud primary), which is the opposite of what this scenario requires.

3.5 Amazon CloudFront Reliability

CloudFront Cache Invalidation

When S3 objects are updated with the same filename and users still see old content, the root cause is CloudFront has cached the old version at edge locations.

Fix: create a CloudFront invalidation:

aws cloudfront create-invalidation \
  --distribution-id EDFDVBD6EXAMPLE \
  --paths "/images/*" "/index.html"

Alternative: use versioned filenames (app.v2.js instead of app.js) — no invalidation needed; old files expire naturally.
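One common way to produce versioned filenames is to embed a hash of the file content in the object key, so every change yields a new URL that CloudFront has never cached. The sketch below is one possible scheme, not something CloudFront or S3 requires; the function name and the 8-character digest length are arbitrary choices.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_key(key: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the file extension of an object key."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(key)
    # "js/app.js" becomes "js/app.<digest>.js"
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


print(versioned_key("js/app.js", b"console.log('v2');"))
```

The HTML that references the asset has to be updated to the new key on each deploy, so the HTML itself is usually served with a short TTL or invalidated.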

CloudFront 502 Bad Gateway

This occurs when CloudFront cannot communicate with the origin. Most common causes: SSL/TLS certificate expired on the origin server, or hostname mismatch between the certificate CN/SAN and the origin hostname CloudFront connects to. Troubleshoot by examining the certificate expiration date directly on the origin.

S3 + CloudFront + Origin Access Identity (OAI)

To restrict S3 bucket access to CloudFront only:

Create an OAI in CloudFront and associate it with the distribution's S3 origin.
Update the S3 bucket policy to allow only the OAI principal for s3:GetObject.
Ensure S3 Block Public Access is enabled.

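Direct S3 URLs will return 403; only CloudFront can serve the content. Note that OAI is the legacy mechanism: AWS now recommends origin access control (OAC) for new distributions, though exam questions may still refer to OAI.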
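The bucket policy from the second step could look like the document generated below. This is a sketch: the bucket name and OAI ID are placeholders, the statement ID and function name are hypothetical, and the principal ARN follows the format used for CloudFront origin access identities.

```python
import json


def oai_bucket_policy(bucket: str, oai_id: str) -> str:
    """Bucket policy JSON allowing only the given CloudFront OAI to read objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontOAIRead",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::cloudfront:user/"
                           f"CloudFront Origin Access Identity {oai_id}"
                },
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(oai_bucket_policy("example-static-site", "E2EXAMPLEOAIID"))
```

Because the policy grants nothing to anonymous principals, it remains compatible with S3 Block Public Access being enabled on the bucket.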

3.6 Route 53 for High Availability

Failover Routing Configuration

For primary/secondary failover, create Alias records (not CNAME) pointing to ALBs in primary and secondary regions. Set routing policy to Failover, set primary record type to Primary and secondary to Secondary, set Evaluate Target Health to Yes for both, and associate a health check with the primary record.
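The record settings described above can be expressed as entries for a Route 53 `change_resource_record_sets` change batch. The sketch below only builds the dictionaries; the domain, ALB DNS names, hosted zone ID, and health check ID in the usage lines are placeholders, and the function name is hypothetical.

```python
from typing import Optional


def failover_alias_record(name: str, role: str, alb_dns_name: str,
                          alb_hosted_zone_id: str,
                          health_check_id: Optional[str] = None) -> dict:
    """One UPSERT change for a failover Alias A record pointing at an ALB."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_hosted_zone_id,  # the load balancer's zone ID, not your own
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


changes = [
    failover_alias_record("app.example.com", "PRIMARY",
                          "primary-alb.us-east-1.elb.amazonaws.com", "ZPRIMARYEXAMPLE",
                          health_check_id="hc-primary-example"),
    failover_alias_record("app.example.com", "SECONDARY",
                          "secondary-alb.us-west-2.elb.amazonaws.com", "ZSECONDARYEXAMPLE"),
]
print([change["ResourceRecordSet"]["Failover"] for change in changes])
```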

Key Concept: At the zone apex (example.com), a CNAME record is prohibited by DNS specification (RFC 1034). Always use a Route 53 Alias record at the zone apex to point to ALBs, NLBs, CloudFront, S3 website endpoints, and API Gateway.

Route 53 Routing Policies

Policy	Routes Based On	When to Use
Latency	Measured latency to each region	Route to lowest latency region
Geolocation	User's location (continent, country, or US state) inferred from the source IP	European users → eu-central-1
Geoproximity	Location + configurable bias	Expand/shrink geographic coverage
Failover	Health check results	Active/passive DR
Weighted	Percentage distribution	A/B testing, gradual migration

Exam Tip: "Extend to new region; route users to lowest latency endpoint without changing URL" → Latency routing — not geolocation, which routes by user location, not measured performance.

3.7 AWS Global Accelerator

Global Accelerator provides 2 static anycast IP addresses (globally fixed; can be whitelisted by clients) and routes traffic through the AWS global network instead of the public internet. It supports ALB, NLB, EC2 instances, and Elastic IP addresses as endpoints.

Feature	Global Accelerator	Route 53 Latency
IP addresses	2 fixed static IPs	DNS names (IPs change)
Failover time	< 30 seconds	Up to 5 minutes (DNS TTL)
Traffic path	AWS global network	Public internet
Use case	Static IP requirement, fastest failover	Standard multi-region DNS routing

Exam Tip: If the question mentions "requires 2 static IP addresses" OR "fastest failover time," the answer is AWS Global Accelerator.

Exam Tips & Quick Reference

Scenario-to-Answer Mapping

Scenario Keyword / Requirement	Correct Answer
"Encrypt existing unencrypted RDS"	Snapshot → copy with encryption → restore (not in-place)
"Encrypt existing unencrypted EFS"	Create new encrypted EFS → migrate data
"Enable Multi-AZ on existing RDS"	Modify instance → enable Multi-AZ option
"Aurora replication lag → stale cart"	`AuroraReplicaLagMaximum` high; read replica returning stale data
"Rewind Aurora cluster in-place ≤ 72 hrs"	Aurora Backtracking (PITR creates new cluster)
"Cross-region DR for RDS"	Cross-Region Read Replica
"DynamoDB multi-region replication"	Enable DynamoDB Streams → add Global Table region
"ElastiCache high evictions"	Add nodes (horizontal) OR increase node size (vertical)
"Mobile users getting desktop version"	Configure CloudFront to forward User-Agent header
"CloudFront serving old S3 content"	Create CloudFront invalidation
"CloudFront 502 error with custom origin"	SSL certificate expired on origin server
"S3 bucket accessible only via CloudFront"	Create OAI → assign to distribution → update bucket policy
"Zone apex pointing to ALB"	Route 53 Alias record (not CNAME)
"Presigned URL failing to upload"	URL creator does not have s3:PutObject permission
"Account B users can't delete objects from Account A"	Lambda must call PutObjectAcl with bucket-owner-full-control
"2 static IPs for application globally"	AWS Global Accelerator
"Fastest failover between regions"	Global Accelerator (< 30 sec) vs. Route 53 (up to 5 min)
"All data must be local, cloud is backup"	Storage Gateway stored volumes (not cached)
"ASG healthy in console but unhealthy in ALB"	Change ASG health check type from EC2 to ELB

Common Traps

RDS Multi-AZ standby is not readable: It exists only for failover; all reads still go to the primary. Read replicas are readable — these are different features.
Aurora Backtracking vs. PITR: Both restore data but PITR creates a new cluster. If the question says "same cluster" or "in-place," the answer is Backtracking.
CloudFront 502 vs. 503: 502 = origin SSL/TLS issue (almost always an expired cert). 503 = no healthy targets behind ALB.
Storage Gateway stored vs. cached: Stored = local primary, cloud backup. Cached = cloud primary, local cache. These are the OPPOSITE of what you might initially assume.

Key Terms — Domain 2

Term	One-Line Definition
RDS Multi-AZ	Synchronous standby replica in different AZ; NOT readable; automatic DNS failover
RDS Read Replica	Async replica for read offloading; can be cross-region; manually promoted
Aurora Backtracking	Rewinds existing Aurora cluster in-place up to 72 hours; no new cluster created
DynamoDB Global Tables	Multi-region, multi-master replication; requires DynamoDB Streams
Warm Pool	Pre-initialized EC2 instances that can be moved into service in the ASG quickly on scale-out
Lifecycle Hook	Pause during ASG launch or termination to run custom actions
OAI (Origin Access Identity)	CloudFront identity used to restrict S3 bucket access to CloudFront only
Global Accelerator	AWS service providing 2 static IPs and routing via AWS backbone
Object Lock	S3 WORM protection preventing object deletion during retention period
MFA Delete	S3 setting requiring MFA to permanently delete object versions
