
Reliability and Business Continuity

Topic 2 of 5 · Study notes

AWS Certified CloudOps Engineer - Associate (SOA-C03) — Domain 2: Reliability and Business Continuity

Exam Code: SOA-C03  |  Level: Associate
Domain Weight: 22%  |  Total Domains: 5  |  Passing Score: 720/1000


Table of Contents

  1. High Availability Architecture Design
  2. Database Reliability
  3. Backup, Restore and Disaster Recovery
  4. Exam Tips & Quick Reference

1. High Availability Architecture Design

High availability on AWS requires distributing workloads across multiple Availability Zones, using managed services that provide automatic failover, and designing application tiers to be stateless wherever possible.

1.1 Multi-AZ Architecture Principles

The Core HA Formula

If you need a minimum of N instances at all times and want to survive the failure of 1 Availability Zone:

Minimum total instances = N × (number of AZs) ÷ (number of AZs - 1)

Examples: needing 4 always across 2 AZs requires 8 total (4 per AZ); needing 4 always across 3 AZs requires 6 total (2 per AZ — minimum viable HA).

Exam Tip: "Deploy an Auto Scaling group across three AZs with a minimum of 4 instances" — if one AZ fails, you could have as few as 2 instances. The correct answer is minimum of 6 across 3 AZs to always guarantee 4 surviving.
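
The formula above can be checked with a few lines of Python. This is a minimal sketch, assuming instances are spread evenly across AZs and that fractional results are rounded up per AZ (the function name `min_total_instances` is illustrative, not an AWS API):

```python
import math


def min_total_instances(required: int, az_count: int) -> int:
    """Smallest evenly spread fleet that keeps `required` instances after losing one AZ."""
    if az_count < 2:
        raise ValueError("surviving an AZ failure needs at least 2 AZs")
    # The surviving (az_count - 1) AZs must carry the full requirement on their own.
    per_az = math.ceil(required / (az_count - 1))
    return per_az * az_count


print(min_total_instances(4, 2))  # 8 (4 per AZ)
print(min_total_instances(4, 3))  # 6 (2 per AZ)
```

When N does not divide evenly, the per-AZ rounding gives a slightly larger fleet than the raw formula, for example 9 instances for N = 5 across 3 AZs.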

Stateless vs. Stateful: Design principle is to make the application tier stateless and persist state in a separate managed service (RDS, DynamoDB, ElastiCache). Stateful instances tied to user sessions are harder to scale and replace.

1.2 Elastic Load Balancing

Load Balancer Comparison

| Feature | ALB | NLB | GLB |
| --- | --- | --- | --- |
| OSI layer | Layer 7 (HTTP/HTTPS) | Layer 4 (TCP/UDP/TLS) | Layer 3 (IP) |
| Static IP per AZ | No | Yes | No |
| Content-based routing | Yes (path, host, header) | No | No |
| WebSockets / gRPC | Yes | No | No |
| Millions of requests/sec | Not ideal | Yes | No |
| Preserve source IP | No (use X-Forwarded-For) | Yes | No |
| Use case | Web apps, microservices | Gaming, IoT, VoIP, low latency | 3rd-party firewalls, IDS/IPS |

ALB Source IP and Access Logs

ALB replaces the client IP with its own IP when forwarding to targets. To get the original client IP, read the X-Forwarded-For header in the application, or use ALB access logs (which record the client address in the client:port field) for forensic investigation of traffic spikes.
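
As an illustration of reading the header in application code, here is a minimal sketch that takes the left-most entry of X-Forwarded-For. The function name is hypothetical, and it assumes the ALB is the only proxy you trust; the left-most value is client-supplied and can be spoofed, so treat it as informational rather than as an authentication signal.

```python
from typing import Optional


def client_ip_from_xff(header_value: Optional[str]) -> Optional[str]:
    """Return the left-most address in an X-Forwarded-For header, or None."""
    if not header_value:
        return None
    # Format: "client, proxy1, proxy2"; each hop appends the address it received from.
    first = header_value.split(",")[0].strip()
    return first or None


print(client_ip_from_xff("203.0.113.7, 10.0.1.25"))  # 203.0.113.7
```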

ALB Sticky Sessions

For CloudFront + ALB sticky sessions (users randomly logged out):

  1. Configure cookie forwarding in the CloudFront distribution cache behavior → forwards ALB cookies to origin.
  2. Enable application-based sticky sessions on the ALB target group.

Note: Duration-based cookies alone do not work with CloudFront because CloudFront strips them by default. Both changes are required.

Mobile Users Getting Desktop Content

Problem: CloudFront caches the desktop version and serves it to all users. Root cause: CloudFront caches based on URL only and does not forward the User-Agent header. Fix: configure CloudFront behavior to forward the User-Agent header to the origin.

1.3 EC2 Auto Scaling — Deep Dive

Launch Templates vs. Launch Configurations

Launch configurations are legacy and cannot be modified; to change one, you must create a new launch configuration. Launch templates are the recommended approach: they are versioned, support mixed instance types, and support a Spot + On-Demand mix.

Auto Scaling Health Checks

Two types are available: EC2 health checks (instance is healthy if running and not impaired — does NOT know if the application is working) and ELB health checks (instance is healthy if the ALB health check passes — validates application response).

Exam Tip: Auto Scaling shows instance healthy (EC2 check) but ALB shows it unhealthy (application check)? Root cause: Auto Scaling group health check type is set to EC2, not ELB. Fix: change health check type to ELB.

SQS-Based Auto Scaling (Worker Pattern)

For EC2 worker fleets processing SQS queues, ApproximateNumberOfMessagesVisible alone is insufficient because it does not account for the number of workers. Use a custom metric:

BacklogPerInstance = ApproximateNumberOfMessagesVisible / NumberOfInstancesInService

Create this as a custom CloudWatch metric and build a target tracking policy against it.
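
A minimal sketch of the metric calculation follows. The helper names, the `worker-asg` group name, and the choice to treat an empty fleet as one instance are assumptions for illustration; the returned dictionary follows the shape of one `MetricData` entry for CloudWatch `PutMetricData`.

```python
def backlog_per_instance(visible_messages: int, in_service_instances: int) -> float:
    # Treat an empty fleet as one instance so the metric stays defined.
    return visible_messages / max(in_service_instances, 1)


def backlog_metric_datum(visible_messages: int, in_service_instances: int, asg_name: str) -> dict:
    """Build one MetricData entry in the shape CloudWatch PutMetricData expects."""
    return {
        "MetricName": "BacklogPerInstance",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Value": backlog_per_instance(visible_messages, in_service_instances),
        "Unit": "Count",
    }


datum = backlog_metric_datum(1500, 10, "worker-asg")
print(datum["Value"])  # 150.0
```

With boto3, the datum could be published with `cloudwatch.put_metric_data(Namespace="Custom/Workers", MetricData=[datum])`, where the namespace is a placeholder. The target value for the tracking policy is then the backlog each worker can tolerate, roughly acceptable latency divided by average processing time per message.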

Warm Pools

A warm pool keeps pre-initialized instances ready so that scale-out avoids cold-start delay. Instances in the warm pool can be kept in one of three states:

| State | Cost | Speed |
| --- | --- | --- |
| Stopped (recommended) | No compute charges; EBS charges apply | Fast |
| Hibernated | Additional EBS charges for RAM state | Faster |
| Running | Full compute charges even when idle | Fastest |

Lifecycle Hooks

Pause instance launch or termination to perform custom actions:

| Hook Type | Pause State | Use Case |
| --- | --- | --- |
| autoscaling:EC2_INSTANCE_LAUNCHING | Pending:Wait | Configure instance, run tests |
| autoscaling:EC2_INSTANCE_TERMINATING | Terminating:Wait | Copy logs to S3, graceful shutdown |

Exam Tip: "Instances terminated before developers could retrieve application logs" → add a termination lifecycle hook → Lambda function copies logs to S3 → complete lifecycle hook.
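
The termination-hook flow can be sketched as a Lambda-style handler. This is an illustrative outline, not a complete implementation: `copy_logs` is a hypothetical callable (for example, something that triggers an SSM command to sync logs to S3), and `_RecordingClient` is a stand-in so the sketch runs offline. In a real function you would pass `boto3.client("autoscaling")`, whose `complete_lifecycle_action` call takes the parameters shown.

```python
def complete_termination_hook(event: dict, autoscaling_client, copy_logs) -> dict:
    """Copy logs for the terminating instance, then release the lifecycle hook."""
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]
    copy_logs(instance_id)  # hypothetical step that ships logs to S3
    params = {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "LifecycleActionResult": "CONTINUE",  # let termination proceed
        "InstanceId": instance_id,
    }
    autoscaling_client.complete_lifecycle_action(**params)
    return params


class _RecordingClient:
    """Offline stand-in for boto3.client("autoscaling")."""

    def __init__(self):
        self.calls = []

    def complete_lifecycle_action(self, **kwargs):
        self.calls.append(kwargs)


# Sample EventBridge payload with placeholder values.
sample_event = {
    "detail": {
        "EC2InstanceId": "i-0123456789abcdef0",
        "LifecycleHookName": "copy-logs-hook",
        "AutoScalingGroupName": "web-asg",
        "LifecycleActionToken": "example-token",
    }
}
client = _RecordingClient()
copied = []
result = complete_termination_hook(sample_event, client, copied.append)
print(result["LifecycleActionResult"])  # CONTINUE
```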


2. Database Reliability

2.1 Amazon RDS High Availability

RDS Multi-AZ

Multi-AZ uses synchronous replication to a standby replica in a different AZ of the same region. The standby is NOT accessible for reads. Automatic failover takes 1–2 minutes with DNS CNAME automatically redirected to standby. Failover triggers include: AZ failure, primary failure, OS patching, and manual failover.

RDS Read Replicas

Read replicas use asynchronous replication with up to 5 per RDS instance. They CAN be used for reads, can be in a different region (Cross-Region Read Replica), and can be promoted to standalone DB — but promotion takes minutes (I/O replay required). They are NOT the same as Multi-AZ standby.

RDS Encryption Constraint

You cannot encrypt an existing unencrypted RDS instance directly. The correct process is:

  1. Take a snapshot of the unencrypted instance.
  2. Copy the snapshot with encryption enabled.
  3. Restore from the encrypted snapshot to create a new encrypted RDS instance.
  4. Point the application to the new instance and decommission the old one.
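
Steps 1 to 3 map onto three RDS API calls. The sketch below only builds the ordered call plan; the snapshot naming scheme, the `-enc` suffix, and the `alias/rds-key` key alias are assumptions for illustration. The operation names and parameters correspond to the boto3 RDS client methods.

```python
def encryption_migration_plan(instance_id: str, kms_key_id: str) -> list:
    """Return the ordered RDS API calls (name, kwargs) for the snapshot-copy-restore path."""
    plain_snapshot = f"{instance_id}-plain"
    encrypted_snapshot = f"{instance_id}-encrypted"
    return [
        ("create_db_snapshot", {
            "DBInstanceIdentifier": instance_id,
            "DBSnapshotIdentifier": plain_snapshot,
        }),
        # Supplying a KMS key on the copy is what produces an encrypted snapshot.
        ("copy_db_snapshot", {
            "SourceDBSnapshotIdentifier": plain_snapshot,
            "TargetDBSnapshotIdentifier": encrypted_snapshot,
            "KmsKeyId": kms_key_id,
        }),
        ("restore_db_instance_from_db_snapshot", {
            "DBInstanceIdentifier": f"{instance_id}-enc",
            "DBSnapshotIdentifier": encrypted_snapshot,
        }),
    ]


for name, kwargs in encryption_migration_plan("orders-db", "alias/rds-key"):
    print(name, kwargs)
```

Against a real client, each step would be invoked as `getattr(rds, name)(**kwargs)`, waiting for each snapshot to become available before starting the next step.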

2.2 Amazon Aurora — Comprehensive Reference

Aurora Architecture

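Aurora uses a shared storage volume that automatically grows in 10 GB increments (up to 128 TB) and is replicated 6 ways across 3 AZs. Replicas read from the SAME storage volume as the writer, so replica lag is very small (typically under 100 ms) rather than the seconds or minutes seen with log-based replication.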

Aurora Replica vs. RDS Read Replica

| Feature | Aurora Replica | RDS Read Replica |
| --- | --- | --- |
| Replication method | Storage-level (shared volume) | I/O log shipping |
| Replication lag | Typically < 100 ms | Seconds to minutes |
| Failover | Automatic, instant (< 30 seconds) | Manual promotion (minutes) |
| Max count | 15 | 5 |
| Reader endpoint | Yes (auto load balances) | No built-in |

Aurora Backtracking vs. PITR

| Feature | Backtracking | Point-in-Time Recovery (PITR) |
| --- | --- | --- |
| How it works | Rewinds existing cluster in-place | Creates a NEW cluster from backup |
| Speed | Fast (minutes) | Slower (depends on data size) |
| Same cluster | Yes | No (new cluster created) |
| Max window | Up to 72 hours | Up to 35 days |
| Use case | Accidental data changes in last 72 hours | Older recovery or compliance |

Exam Tip: "Roll back the DB cluster to a specific recovery point within the previous 72 hours; restores must be in the same production DB cluster." → Aurora Backtracking — PITR creates a new cluster, backtracking rewinds in place.
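
The 72-hour decision rule can be sketched as a small guard in front of the RDS `BacktrackDBCluster` call. The function name and cluster identifier are hypothetical, and the sketch assumes the cluster was created with the maximum 72-hour backtrack window; a cluster configured with a shorter window would need a smaller limit.

```python
from datetime import datetime, timedelta, timezone

MAX_BACKTRACK_WINDOW = timedelta(hours=72)


def backtrack_request(cluster_id: str, target: datetime, now: datetime) -> dict:
    """Build kwargs for the RDS BacktrackDBCluster call, rejecting out-of-window targets."""
    if target > now:
        raise ValueError("backtrack target is in the future")
    if now - target > MAX_BACKTRACK_WINDOW:
        raise ValueError("target is older than 72 hours; use PITR into a new cluster")
    return {"DBClusterIdentifier": cluster_id, "BacktrackTo": target}


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
print(backtrack_request("prod-cluster", now - timedelta(hours=3), now))
```

With boto3, the returned dictionary could be passed as `rds.backtrack_db_cluster(**request)`.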

Aurora Global Database

Aurora Global Database provides cross-region replication with an RPO of under 1 second and an RTO of under 1 minute (the time to promote a secondary region during a regional failure). It supports up to 5 secondary regions; secondary regions are read-only until promoted.

2.3 Amazon DynamoDB Reliability

DynamoDB Global Tables

Global Tables require DynamoDB Streams to be enabled and the same table name in all regions. They provide multi-region, multi-master replication — write to any region and it replicates to all others. RPO is near zero; RTO < 1 minute.

DynamoDB Protection Features

DynamoDB offers point-in-time recovery (PITR) for continuous backups restorable to any second in the last 35 days, on-demand backups for manual full backups stored until explicitly deleted, and deletion protection to prevent accidental table deletion.

2.4 Amazon ElastiCache Reliability

Memcached vs. Redis

| Feature | Memcached | Redis |
| --- | --- | --- |
| Data persistence | No | Yes (RDB/AOF) |
| Pub/Sub | No | Yes |
| Sorted sets, lists | No | Yes |
| Multi-AZ failover | No | Yes (with replication groups) |

Diagnosing High Evictions

The Evictions metric being high means the cache is full and items are being evicted before they expire, causing more cache misses and increased database load.

Solutions: add nodes to the cluster (horizontal scaling, more total memory) or increase individual node size (vertical scaling, more memory per node).

Exam Tip: Wrong answers for high evictions: adding an ELB in front of ElastiCache, adding SQS to decouple, or increasing TTL (which makes evictions worse by keeping items longer).


3. Backup, Restore and Disaster Recovery

3.1 AWS Backup Service

AWS Backup supports EC2, EBS, RDS, Aurora, DynamoDB, EFS, FSx, S3, and AWS Storage Gateway. To deploy backup policies organization-wide, define them in the management account under AWS Organizations → Backup Policies; the resulting backup plans are then applied to each member account automatically. Include cross-region backup copies in the backup plan for disaster recovery coverage.

For EFS, enabling AWS Backup and using a partial restore job lets you recover individual files or directories rapidly without performing a full restore (which creates a new EFS file system).

3.2 Amazon S3 Data Protection

S3 Versioning

Versioning preserves all object versions. Deleting an object creates a delete marker (the object is not actually removed). To permanently delete, you must delete the specific version ID. Once enabled, versioning can only be suspended, not completely disabled.

S3 MFA Delete

With MFA Delete enabled, the following actions require MFA: permanently removing a versioned object and suspending versioning. The following do NOT require MFA: adding objects (PutObject), listing versions, creating delete markers, and enabling versioning.

Exam Tip: To enable MFA Delete, you must use the AWS CLI with root account credentials — not IAM user, not the console.
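
For reference, the request behind that CLI call has the following shape. This sketch only builds the parameters for S3 `PutBucketVersioning`; the bucket name, MFA device ARN, and code are placeholders, and the call itself must be made with root credentials as noted above.

```python
def mfa_delete_request(bucket: str, mfa_serial: str, mfa_code: str) -> dict:
    """Build kwargs for S3 PutBucketVersioning that turn on versioning and MFA Delete."""
    return {
        "Bucket": bucket,
        # The MFA value is the device serial (or ARN) and the current code, separated by a space.
        "MFA": f"{mfa_serial} {mfa_code}",
        "VersioningConfiguration": {"Status": "Enabled", "MFADelete": "Enabled"},
    }


req = mfa_delete_request(
    "audit-logs",
    "arn:aws:iam::123456789012:mfa/root-account-mfa-device",
    "123456",
)
print(req["MFA"])
```

With boto3, this would be sent as `s3.put_bucket_versioning(**req)`.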

S3 Object Lock

WORM protection options:

| Mode | Override Allowed? | Who Can Override |
| --- | --- | --- |
| Governance | Yes | Users with special permissions |
| Compliance | No | Nobody, not even root, during the retention period |
| Legal hold | N/A | Users with s3:PutObjectLegalHold |

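Object Lock requires versioning and is normally enabled at bucket creation. Older guidance (and many practice questions) states that it cannot be enabled on an existing bucket, although S3 now supports enabling it on existing buckets that have versioning turned on.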

Cross-Account S3 — Object Ownership

Objects uploaded by Account A are owned by Account A, not by the bucket owner (Account B), so IAM users in Account B cannot delete them by default. Fix: have the uploader (in the typical exam scenario, a Lambda function in Account A) set the bucket-owner-full-control ACL when writing:

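import boto3

s3 = boto3.client('s3')
data = b'example payload'  # placeholder object body

# Grant the bucket owner (Account B) full control at upload time
s3.put_object(
    Bucket='account-b-bucket',
    Key='myfile.txt',
    Body=data,
    ACL='bucket-owner-full-control'
)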

3.3 Disaster Recovery Strategies

Backup & Restore    →    Pilot Light    →    Warm Standby    →    Multi-Site Active/Active
  Hours RPO/RTO         Minutes RPO/RTO      Minutes RTO/RPO       Near-zero both
  Cheapest              Low cost             Medium cost           Most expensive

| Strategy | What Runs Continuously | RTO | RPO |
| --- | --- | --- | --- |
| Backup & Restore | Nothing; periodic snapshots to S3 | Hours | Hours |
| Pilot Light | Database with data only | Minutes–Hours | Minutes |
| Warm Standby | Scaled-down but fully functional copy | Minutes | Minutes |
| Multi-Site Active/Active | Both sites fully operational | Near-zero | Near-zero |

3.4 AWS Storage Gateway for Backup

Volume Gateway Types

| Mode | Primary Storage | Backup Location | Use Case |
| --- | --- | --- | --- |
| Stored volumes | On-premises (local disk) | Async backup to S3 | All data must be local; cloud is backup |
| Cached volumes | S3 (cloud) | Frequently accessed data cached locally | Cloud is primary; local cache for performance |

Exam Tip: "All data must be available locally; use AWS for backup" → Storage Gateway with stored volumes. Cached volumes keep primary data in S3 (cloud primary), which is the opposite of what this scenario requires.

3.5 Amazon CloudFront Reliability

CloudFront Cache Invalidation

When S3 objects are updated under the same filename and users still see old content, the root cause is that CloudFront has cached the old version at edge locations.

Fix: create a CloudFront invalidation:

aws cloudfront create-invalidation \
  --distribution-id EDFDVBD6EXAMPLE \
  --paths "/images/*" "/index.html"

Alternative: use versioned filenames (app.v2.js instead of app.js) — no invalidation needed; old files expire naturally.
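
One common way to produce versioned filenames automatically is to embed a short content hash instead of a hand-maintained `v2` suffix. The sketch below uses that swapped-in technique; the function name and the 8-character digest length are arbitrary choices for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    p = PurePosixPath(path)
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old = versioned_name("static/app.js", b"console.log('v1');")
new = versioned_name("static/app.js", b"console.log('v2');")
print(old)
print(new)
```

Because the object key changes whenever the content changes, CloudFront treats each release as a new object and never serves the stale copy under the new name.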

CloudFront 502 Bad Gateway

This occurs when CloudFront cannot communicate with the origin. Most common causes: SSL/TLS certificate expired on the origin server, or hostname mismatch between the certificate CN/SAN and the origin hostname CloudFront connects to. Troubleshoot by examining the certificate expiration date directly on the origin.

S3 + CloudFront + Origin Access Identity (OAI)

To restrict S3 bucket access to CloudFront only:

  1. Create an OAI in CloudFront and associate it with the distribution's S3 origin.
  2. Update the S3 bucket policy to allow only the OAI principal for s3:GetObject.
  3. Ensure S3 Block Public Access is enabled.

Direct S3 URLs will return 403; only CloudFront can serve the content.
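
The bucket policy from step 2 can be sketched as follows. The bucket name and OAI ID are placeholders, and the helper name is hypothetical; the principal ARN follows the format CloudFront uses for origin access identities.

```python
import json


def oai_bucket_policy(bucket: str, oai_id: str) -> str:
    """Return a bucket policy JSON string that lets one CloudFront OAI read objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCloudFrontOAIRead",
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {oai_id}"
            },
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }
    return json.dumps(policy, indent=2)


policy_json = oai_bucket_policy("my-static-site", "E2EXAMPLEOAIID")
print(policy_json)
```

The resulting string could be applied with `s3.put_bucket_policy(Bucket="my-static-site", Policy=policy_json)`.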

3.6 Route 53 for High Availability

Failover Routing Configuration

For primary/secondary failover, create Alias records (not CNAME) pointing to ALBs in primary and secondary regions. Set routing policy to Failover, set primary record type to Primary and secondary to Secondary, set Evaluate Target Health to Yes for both, and associate a health check with the primary record.
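
The failover configuration above corresponds to a change batch with two alias records. This is a minimal sketch: the domain, ALB DNS names, hosted zone IDs, and health check ID are all placeholders, and the helper names are hypothetical. The dictionary layout follows the `ChangeBatch` structure accepted by Route 53 `ChangeResourceRecordSets`.

```python
def failover_alias_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """Build one UPSERT change for a failover alias A record ("PRIMARY" or "SECONDARY")."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's hosted zone, not your own zone
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary, secondary, primary_health_check_id):
    """primary and secondary are (alb_dns_name, alb_hosted_zone_id) tuples."""
    return {"Changes": [
        failover_alias_record(name, "PRIMARY", primary[0], primary[1], primary_health_check_id),
        failover_alias_record(name, "SECONDARY", secondary[0], secondary[1]),
    ]}


batch = failover_change_batch(
    "app.example.com.",
    ("primary-alb.us-east-1.elb.amazonaws.com.", "ZPRIMARYEXAMPLE"),
    ("secondary-alb.us-west-2.elb.amazonaws.com.", "ZSECONDARYEXAMPLE"),
    "hc-example-id",
)
print(len(batch["Changes"]))  # 2
```

With boto3, the batch could be submitted as `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)` using your own hosted zone ID.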

Key Concept: At the zone apex (example.com), a CNAME record is prohibited by DNS specification (RFC 1034). Always use a Route 53 Alias record at the zone apex to point to ALBs, NLBs, CloudFront, S3 website endpoints, and API Gateway.

Route 53 Routing Policies

| Policy | Routes Based On | When to Use |
| --- | --- | --- |
| Latency | Measured latency to each region | Route to lowest latency region |
| Geolocation | User's exact country/continent | European users → eu-central-1 |
| Geoproximity | Location + configurable bias | Expand/shrink geographic coverage |
| Failover | Health check results | Active/passive DR |
| Weighted | Percentage distribution | A/B testing, gradual migration |

Exam Tip: "Extend to new region; route users to lowest latency endpoint without changing URL" → Latency routing — not geolocation, which routes by user location, not measured performance.

3.7 AWS Global Accelerator

Global Accelerator provides 2 static anycast IP addresses (globally fixed; can be whitelisted by clients) and routes traffic through the AWS global network instead of the public internet. It supports ALB, NLB, EC2 instances, and Elastic IP addresses as endpoints.

| Feature | Global Accelerator | Route 53 Latency |
| --- | --- | --- |
| IP addresses | 2 fixed static IPs | DNS names (IPs change) |
| Failover time | < 30 seconds | Up to 5 minutes (DNS TTL) |
| Traffic path | AWS global network | Public internet |
| Use case | Static IP requirement, fastest failover | Standard multi-region DNS routing |

Exam Tip: If the question mentions "requires 2 static IP addresses" OR "fastest failover time," the answer is AWS Global Accelerator.


4. Exam Tips & Quick Reference

Scenario-to-Answer Mapping

| Scenario Keyword / Requirement | Correct Answer |
| --- | --- |
| "Encrypt existing unencrypted RDS" | Snapshot → copy with encryption → restore (not in-place) |
| "Encrypt existing unencrypted EFS" | Create new encrypted EFS → migrate data |
| "Enable Multi-AZ on existing RDS" | Modify instance → enable Multi-AZ option |
| "Aurora replication lag → stale cart" | AuroraReplicaLagMaximum high; read replica returning stale data |
| "Rewind Aurora cluster in-place ≤ 72 hrs" | Aurora Backtracking (PITR creates new cluster) |
| "Cross-region DR for RDS" | Cross-Region Read Replica |
| "DynamoDB multi-region replication" | Enable DynamoDB Streams → add Global Table region |
| "ElastiCache high evictions" | Add nodes (horizontal) OR increase node size (vertical) |
| "Mobile users getting desktop version" | Configure CloudFront to forward User-Agent header |
| "CloudFront serving old S3 content" | Create CloudFront invalidation |
| "CloudFront 502 error with custom origin" | SSL certificate expired on origin server |
| "S3 bucket accessible only via CloudFront" | Create OAI → assign to distribution → update bucket policy |
| "Zone apex pointing to ALB" | Route 53 Alias record (not CNAME) |
| "Presigned URL failing to upload" | URL creator does not have s3:PutObject permission |
| "Account B users can't delete objects from Account A" | Lambda must upload objects with the bucket-owner-full-control ACL |
| "2 static IPs for application globally" | AWS Global Accelerator |
| "Fastest failover between regions" | Global Accelerator (< 30 sec) vs. Route 53 (up to 5 min) |
| "All data must be local, cloud is backup" | Storage Gateway stored volumes (not cached) |
| "ASG healthy in console but unhealthy in ALB" | Change ASG health check type from EC2 to ELB |

Common Traps

  • RDS Multi-AZ standby is not readable: It exists only for failover; all reads still go to the primary. Read replicas are readable — these are different features.
  • Aurora Backtracking vs. PITR: Both restore data but PITR creates a new cluster. If the question says "same cluster" or "in-place," the answer is Backtracking.
  • CloudFront 502 vs. 503: 502 = origin SSL/TLS issue (almost always an expired cert). 503 = no healthy targets behind ALB.
  • Storage Gateway stored vs. cached: Stored = local primary, cloud backup. Cached = cloud primary, local cache. These are the OPPOSITE of what you might initially assume.

Key Terms — Domain 2

| Term | One-Line Definition |
| --- | --- |
| RDS Multi-AZ | Synchronous standby replica in different AZ; NOT readable; automatic DNS failover |
| RDS Read Replica | Async replica for read offloading; can be cross-region; manually promoted |
| Aurora Backtracking | Rewinds existing Aurora cluster in-place up to 72 hours; no new cluster created |
| DynamoDB Global Tables | Multi-region, multi-master replication; requires DynamoDB Streams |
| Warm Pool | Pre-initialized EC2 instances that can be attached to ASG instantly on scale-out |
| Lifecycle Hook | Pause during ASG launch or termination to run custom actions |
| OAI (Origin Access Identity) | CloudFront identity used to restrict S3 bucket access to CloudFront only |
| Global Accelerator | AWS service providing 2 static IPs and routing via AWS backbone |
| Object Lock | S3 WORM protection preventing object deletion during retention period |
| MFA Delete | S3 setting requiring MFA to permanently delete object versions |

End of Domain 2. Continue to Domain 3: Deployment, Provisioning & Automation →

