Monitoring, Logging, Analysis, Remediation, and Performance Optimization

Topic 1 of 5 · Study notes

AWS Certified CloudOps Engineer - Associate (SOA-C03) — Domain 1: Monitoring, Logging, Analysis, Remediation & Performance Optimization

Exam Code: SOA-C03 | Level: Associate
Domain Weight: 22% | Total Domains: 5 | Passing Score: 720/1000

Amazon CloudWatch
AWS CloudTrail
- 2.1 CloudTrail Fundamentals
- 2.2 CloudTrail vs. CloudWatch Logs vs. VPC Flow Logs
Amazon EventBridge
- 3.1 EventBridge Rules
- 3.2 Common EventBridge Patterns
AWS Systems Manager — Monitoring and Remediation
- 4.1 SSM for Monitoring and Remediation
EC2 and EBS Performance Optimization
Cost and Billing Monitoring
- 6.1 Cost Allocation and Reporting
Exam Tips & Quick Reference

1. Amazon CloudWatch

Amazon CloudWatch is the primary observability service for AWS. It collects metrics, logs, and events, enables alarming, and supports automated remediation. The exam tests both default metric coverage and what requires the CloudWatch agent.

1.1 Metrics — Default vs. Custom

AWS services push metrics directly to CloudWatch without any configuration.

Service	Key Default Metrics
EC2	`CPUUtilization`, `NetworkIn`, `NetworkOut`, `DiskReadOps`, `DiskWriteOps`
EBS	`VolumeReadOps`, `VolumeWriteOps`, `VolumeQueueLength`, `VolumeReadBytes`, `VolumeWriteBytes`
ALB	`HealthyHostCount`, `UnHealthyHostCount`, `RequestCount`, `RequestCountPerTarget`, `HTTPCode_ELB_5xx_Count`
RDS	`CPUUtilization`, `DatabaseConnections`, `FreeStorageSpace`, `ReadIOPS`, `WriteIOPS`
SQS	`ApproximateNumberOfMessagesVisible`, `ApproximateAgeOfOldestMessage`, `ApproximateNumberOfMessagesDelayed`
Lambda	`Invocations`, `Errors`, `Duration`, `Throttles`, `ConcurrentExecutions`
Classic Load Balancer	`HealthyHostCount`, `UnHealthyHostCount`, `Latency`, `HTTPCode_Backend_5xx`

The following are NOT available by default and require the CloudWatch agent installed on the EC2 instance: memory utilization (RAM usage, swap usage), disk space and disk utilization (percentage used, bytes available), individual process metrics, custom application log files, Windows Event Logs, and Windows Performance Counters.

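Exam Tip: DiskReadBytes and DiskWriteBytes (like DiskReadOps and DiskWriteOps) are EC2 instance metrics that cover instance store volumes only, NOT EBS volumes. The correct EBS metrics are VolumeReadBytes and VolumeWriteBytes in the AWS/EBS namespace. If an alarm meant to watch an EBS volume uses the EC2 disk metrics, reconfigure it to use the EBS volume metrics.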

1.2 Installing and Configuring the CloudWatch Agent

Installation methods (in order of scalability):

Systems Manager Run Command — best for fleets; no SSH needed; use AWS-ConfigureAWSPackage document.
EC2 user data — runs on first launch only.
Manual SSH — single instances only.

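The EC2 instance role MUST include CloudWatchAgentServerPolicy (allows the agent to publish metrics and logs, and to read agent configuration stored in Parameter Store under names beginning with AmazonCloudWatch-) and AmazonSSMManagedInstanceCore (registers the instance with Systems Manager so Run Command and State Manager can install and configure the agent).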

Agent Configuration File

The configuration file (/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json) controls what is collected. Store it centrally in SSM Parameter Store (AmazonCloudWatch-linux, AmazonCloudWatch-windows) so a fleet of hundreds of instances can share a single configuration — update once, apply everywhere via State Manager.

`procstat` Plugin — Monitor Individual Processes

Used to identify which specific process is consuming CPU or memory during incidents:

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "nginx",
          "measurement": ["cpu_usage", "memory_rss", "pid_count"]
        }
      ]
    }
  }
}
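
When several processes need watching, the procstat section can be generated rather than hand-edited. The following is a minimal sketch, assuming a hypothetical helper named procstat_config and an illustrative second process pattern ("java"); it only builds the JSON shown above and does not talk to AWS.

```python
import json


def procstat_config(patterns, measurements=("cpu_usage", "memory_rss", "pid_count")):
    """Build an agent config dict with one procstat entry per process pattern.

    procstat_config is a hypothetical helper, not part of any AWS SDK.
    """
    return {
        "metrics": {
            "metrics_collected": {
                "procstat": [
                    {"pattern": pattern, "measurement": list(measurements)}
                    for pattern in patterns
                ]
            }
        }
    }


# "java" is an illustrative extra pattern; "nginx" comes from the example above.
config = procstat_config(["nginx", "java"])
print(json.dumps(config, indent=2))
```

The printed JSON can be saved as the agent configuration file or stored in Parameter Store.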

`append_dimensions` — Add Custom Dimensions

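Adds dimensions to collected metrics. The top-level append_dimensions block under metrics accepts only the EC2 keys ImageId, InstanceId, InstanceType, and AutoScalingGroupName; other key-value pairs there are ignored. Custom dimensions such as environment, team, or application go in an append_dimensions block inside an individual plugin section:

{
  "agent": { "metrics_collection_interval": 60 },
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "append_dimensions": {
          "Environment": "Production",
          "Application": "WebApp"
        }
      }
    }
  }
}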

Use SSM Run Command with the append-config action to add metrics to specific instances without rebuilding the configuration from scratch:

amazon-cloudwatch-agent-ctl -a append-config -m ec2 -c ssm:AmazonCloudWatch-DHCP -s

1.3 Alarm Types

Static Threshold Alarms

Alert when a metric crosses a fixed value. Best when the normal range is well-understood. Example: CPU > 80% for 5 minutes.

Anomaly Detection Alarms

Uses ML to model expected behavior based on historical patterns. Creates a band of expected values (upper and lower bounds). Adapts automatically to daily/weekly patterns and seasonal trends.

Exam Tip: When the question says "application team does not know expected usage or growth," use anomaly detection alarms rather than static threshold alarms.
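
To make the idea of an expected band concrete, here is a deliberately simple stand-in that uses a mean and standard deviation over recent history. This is an assumption-laden sketch, not CloudWatch's actual model (which also learns daily and weekly seasonality); the function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (lower, upper) as mean +/- width * standard deviation.

    A simplified stand-in for an anomaly detection band; no seasonality.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - width * spread, centre + width * spread


def is_anomalous(value, history, width=2.0):
    """True when value falls outside the band derived from history."""
    lower, upper = expected_band(history, width)
    return value < lower or value > upper


# Hypothetical CPU utilization samples (percent).
history = [40, 42, 38, 41, 39, 43, 40]
print(is_anomalous(41, history))  # inside the band
print(is_anomalous(90, history))  # far outside the band
```

The point of the sketch is that no fixed threshold is chosen by hand: the band comes from the data, which is why this alarm type suits workloads with an unknown baseline.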

Composite Alarms

Combines multiple alarms using AND, OR, and NOT logic and sends ONE notification only when the whole rule expression evaluates to true. Reduces alarm noise significantly.

AlarmRule: "ALARM(DiskUtilizationAlarm) AND ALARM(DiskReadOpsAlarm)"
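
The rule string above can be assembled and reasoned about locally. This is a minimal sketch, assuming hypothetical helpers alarm_rule and composite_in_alarm that cover only a single AND or OR operator; it mirrors the logic and does not call CloudWatch.

```python
def alarm_rule(alarm_names, operator="AND"):
    """Build an AlarmRule expression such as 'ALARM(A) AND ALARM(B)'."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f"ALARM({name})" for name in alarm_names)


def composite_in_alarm(states, alarm_names, operator="AND"):
    """Evaluate the same logic against a dict of alarm name -> state."""
    checks = [states.get(name) == "ALARM" for name in alarm_names]
    return all(checks) if operator == "AND" else any(checks)


names = ["DiskUtilizationAlarm", "DiskReadOpsAlarm"]
print(alarm_rule(names))

# Only one child alarm is firing, so the AND rule stays quiet.
states = {"DiskUtilizationAlarm": "ALARM", "DiskReadOpsAlarm": "OK"}
print(composite_in_alarm(states, names))
```

With AND, a single noisy child alarm never pages anyone on its own, which is the noise reduction the composite alarm provides.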

Missing Data Treatment

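When an alarm's metric does not report data, choose how to treat it: notBreaching (treat the missing point as within the threshold), breaching (treat it as violating the threshold), ignore (maintain the current alarm state), or missing (the default; if every data point in the evaluation range is missing, the alarm moves to INSUFFICIENT_DATA).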

Exam Tip: When monitoring for a file that SHOULD arrive every hour and trigger a Lambda function: create an alarm on the Sum of Lambda Invocations being less than 1 over a 1-hour period, and set missing data treatment to "breaching". Lambda publishes no Invocations data point when the function never runs, so the missing data itself is the signal that the file never arrived.
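
The four treatments can be compared with a simplified model that evaluates a single period against a "less than threshold" condition. This is a sketch under stated assumptions: real alarms evaluate several data points per evaluation range, and next_state is a hypothetical function name.

```python
def next_state(value, threshold, treat_missing, current_state="OK"):
    """Return the alarm state after one period (simplified model).

    value is the metric data point, or None when no data was reported.
    The alarm condition modelled here is value < threshold.
    """
    if value is None:
        return {
            "breaching": "ALARM",
            "notBreaching": "OK",
            "ignore": current_state,
            "missing": "INSUFFICIENT_DATA",
        }[treat_missing]
    return "ALARM" if value < threshold else "OK"


# No Invocations data point for the hour: the file never arrived.
for treatment in ("breaching", "notBreaching", "ignore", "missing"):
    print(treatment, "->", next_state(None, 1, treatment))
```

Only "breaching" turns the absence of data into an ALARM, which is why it fits the missing-file scenario.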

1.4 CloudWatch Dashboards

Automating Dashboard Deployment

To deploy identical dashboards across every application deployment: create the dashboard manually, export it as JSON (Dashboard → Actions → View/Edit source), then include it in a CloudFormation template:

Resources:
  AppDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${AppName}-dashboard"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "${EC2Instance}"]],
                "region": "${AWS::Region}",
                "title": "EC2 CPU"
              }
            }
          ]
        }
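
The dashboard body is ordinary JSON, so it can also be generated before being pasted into a template. The following is a rough sketch, assuming a hypothetical helper dashboard_body, a placeholder instance ID, and an assumed Region of us-east-1; it produces the same widget shape as the template above.

```python
import json


def dashboard_body(instance_ids, region="us-east-1"):
    """Return a DashboardBody JSON string with one CPU widget per instance.

    The region default is an assumption for illustration only.
    """
    widgets = [
        {
            "type": "metric",
            "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
                "region": region,
                "title": f"EC2 CPU {instance_id}",
            },
        }
        for instance_id in instance_ids
    ]
    return json.dumps({"widgets": widgets})


body = dashboard_body(["i-1234567890abcdef0"])  # placeholder instance ID
print(body)
```

Generating the body keeps every deployment's dashboard identical apart from the resource identifiers passed in.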

Metrics Explorer for Auto Scaling Groups

Instead of manually updating dashboards when new instances launch, use CloudWatch Metrics Explorer filtered by aws:autoscaling:groupName tag. This creates dynamic visualizations that automatically include new instances as they launch — no Lambda function needed.

1.5 Logs and Logs Insights

Every application should send logs to a CloudWatch Logs log group with a retention policy set on the log group (not on individual log streams).

CloudWatch Logs Insights is a purpose-built query language for analyzing logs without exporting them:

# Count Lambda errors by type (select a 7-day time range when running the query)
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by errorType
| sort errorCount desc

# Find the top 5 destination IPs by bytes (run against VPC Flow Logs for the NAT gateway)
fields dstAddr, bytes
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 5
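
For readers less familiar with the query language, the second query's aggregation (sum by dstAddr, sort descending, keep the top entries) can be expressed in plain Python. This is an illustrative sketch over made-up flow records; top_destinations is a hypothetical name and the IP addresses come from documentation ranges.

```python
from collections import Counter


def top_destinations(records, n=5):
    """Sum bytes per destination address and return the n largest totals."""
    totals = Counter()
    for record in records:
        totals[record["dstAddr"]] += record["bytes"]
    return totals.most_common(n)


# Made-up flow log records for illustration.
records = [
    {"dstAddr": "203.0.113.10", "bytes": 500},
    {"dstAddr": "198.51.100.7", "bytes": 1200},
    {"dstAddr": "203.0.113.10", "bytes": 900},
]
print(top_destinations(records))
# [('203.0.113.10', 1400), ('198.51.100.7', 1200)]
```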

Tool	When to Use
Logs Insights	Query directly in CloudWatch Logs; no export needed; faster setup
Athena	Query data exported to S3; better for very large datasets

1.6 Synthetics Canaries

Canary Type	What It Does
Heartbeat monitor	Regularly loads a URL, captures screenshots and HTTP response
API canary	Tests REST API endpoints for availability and correct responses
Broken link checker	Scans all links on a page for broken URLs
GUI workflow builder	Simulates multi-step user journeys (login → add to cart → checkout)

Key Concept: Route 53 health checks perform a simple endpoint availability check. CloudWatch Synthetics performs full browser simulation with screenshots and user journey validation. Use Synthetics when you need to "follow the same routes and actions as a customer."

2. AWS CloudTrail

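CloudTrail records API activity in your AWS account: who made the call (IAM identity), when, from what IP, what was called, and whether it succeeded or failed. Management events are recorded by default; data events (such as S3 object-level activity or Lambda invocations) must be enabled explicitly on a trail.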

2.1 CloudTrail Fundamentals

Type	Coverage
Single-region trail	Only records events in one region
Multi-region trail (recommended)	Records events in all regions; sends all to one S3 bucket
Organization trail	Records events for all accounts in an AWS Organization

Log File Integrity Validation

Enable integrity validation on the trail itself (not on the S3 bucket). Every hour, CloudTrail creates a digest file in S3 containing SHA-256 hashes of every log file delivered in that hour. The digest file itself is signed with CloudTrail's private key.

aws cloudtrail validate-log-file-integrity \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/MyTrail \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z
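
The core of the check is a hash comparison: recompute the SHA-256 of a delivered log file and compare it with the value recorded in the digest. The sketch below shows only that step, under assumptions: the digest entry is a hypothetical two-field dictionary, and verification of the digest's own signature is left out.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def log_file_unchanged(log_bytes: bytes, recorded_hash: str) -> bool:
    """True when the log file still hashes to the value recorded in the digest."""
    return sha256_hex(log_bytes) == recorded_hash.lower()


# Hypothetical log content and digest entry for illustration.
original = b'{"Records": []}'
digest_entry = {"hashAlgorithm": "SHA-256", "hashValue": sha256_hex(original)}

print(log_file_unchanged(original, digest_entry["hashValue"]))
print(log_file_unchanged(b'{"Records": [], "extra": 1}', digest_entry["hashValue"]))
```

Any edit to the log file changes its hash, so a tampered file no longer matches the digest; a deleted file is detected because its digest entry has nothing to match.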

Exam Tip: Enable log file integrity on the trail itself, not on the S3 bucket destination. CloudWatch Logs streaming, S3 versioning, or MFA Delete alone do NOT provide integrity validation.

Investigating Compromised Access Keys

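When an IAM access key is accidentally pushed to a public GitHub repository: go to CloudTrail Event History, filter by the AWS access key attribute, enter the compromised key ID, and review every management API call made with that key in the last 90 days, from what IPs, and at what times. For older activity or data events, query the trail's log files in S3 with Athena.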

2.2 CloudTrail vs. CloudWatch Logs vs. VPC Flow Logs

Source	What It Captures	Query Tool
CloudTrail	AWS API calls (who did what to which resource)	Event History, Athena
CloudWatch Logs	Application logs, OS logs, Lambda logs	Logs Insights
VPC Flow Logs	Network traffic metadata (IP, port, bytes, action)	Athena, Logs Insights
S3 Server Access Logs	HTTP requests to S3 objects (source IP, operation, status)	Athena
ALB Access Logs	HTTP request details including source IP, target, response code	Athena

3. Amazon EventBridge

EventBridge routes AWS events and scheduled triggers to targets like Lambda, SNS, and SSM Automation — enabling event-driven automation without custom polling code.

3.1 EventBridge Rules

Event Pattern Rules

Match and route specific AWS events:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated", "stopped"]
  }
}
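
Matching works by comparing each field in the pattern with the event: a list means "any of these values", and nested objects are matched recursively. The following is a minimal sketch of that exact-value subset only (no prefix, numeric, or anything-but operators); matches is a hypothetical function and the sample event is abbreviated.

```python
def matches(pattern, event):
    """Return True when event satisfies an exact-value event pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that matches.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: expected is a list of allowed values.
            return False
    return True


pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated", "stopped"]},
}
# Abbreviated sample event with a placeholder instance ID.
event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-1234567890abcdef0", "state": "stopped"},
}
print(matches(pattern, event))
```

Fields present in the event but absent from the pattern (such as instance-id here) are ignored, so patterns only need to name what they care about.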

Scheduled Rules

Use cron or rate expressions to trigger Lambda, SSM Automation, and more:

rate(1 hour) — every hour
cron(0 20 * * ? *) — every day at 8 PM UTC
cron(0 9 ? * MON-FRI *) — weekdays at 9 AM UTC
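
EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year), and day of month and day of week cannot both be specified, so one of them is written as ?. The sketch below labels the fields and checks that one rule; parse_cron is a hypothetical helper and performs no other validation.

```python
FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_cron(expression):
    """Split an EventBridge cron(...) expression into named fields."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("expected an expression of the form cron(...)")
    parts = expression[len("cron("):-1].split()
    if len(parts) != len(FIELDS):
        raise ValueError("EventBridge cron expressions have six fields")
    fields = dict(zip(FIELDS, parts))
    # Day-of-month and day-of-week cannot both carry a value.
    if fields["day_of_month"] != "?" and fields["day_of_week"] != "?":
        raise ValueError("use ? in day_of_month or day_of_week")
    return fields


print(parse_cron("cron(0 20 * * ? *)"))
print(parse_cron("cron(0 9 ? * MON-FRI *)"))
```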

EventBridge API Destinations

Allows forwarding events to external HTTPS endpoints (third-party ticketing systems, webhooks). Configure a connection (authentication headers, API keys), create an API destination with the HTTPS endpoint URL, and add an input transformer to reshape the event JSON before sending.

3.2 Common EventBridge Patterns

Business Requirement	EventBridge Pattern
Notify when EC2 starts/stops	EC2 State-change event → SNS topic
Nightly report generation	Scheduled rule → Lambda function
Re-enable CloudTrail when disabled	CloudTrail config change event → SSM Automation runbook
Spot Instance interruption notification	EC2 Spot Interruption Warning → SNS
Auto-remediate non-compliant resources	Config → EventBridge → SSM Automation

4. AWS Systems Manager — Monitoring and Remediation

SSM provides fleet management, remediation, and secure instance access — all without requiring SSH or custom code.

4.1 SSM for Monitoring and Remediation

Key Concept: The exam repeatedly tests "without custom code" scenarios. The expected answer is usually an AWS Config managed rule with an SSM Automation runbook as the remediation action, not EventBridge → Lambda (custom code).

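Common SSM Automation runbooks for remediation: AWS-RestartEC2Instance, AWS-StopEC2Instance, AWS-StartEC2Instance, AWS-TerminateEC2Instance, and AWS-ReleaseElasticIP. AWS-ConfigureAWSPackage is a Run Command document (used to install packages such as the CloudWatch agent) rather than an Automation runbook.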

Session Manager for Audit Logging

Configure Session Manager to stream all session data to CloudWatch Logs. Every command typed and output returned is logged. No SSH, no bastion hosts, no security group inbound rules needed.

State Manager for Consistent Configuration

Use SSM State Manager to continuously enforce desired state across a fleet. Create an association to run AmazonCloudWatch-ManageAgent on a schedule (e.g., every 30 minutes), targeting instances by tag. This ensures the CloudWatch agent stays configured consistently — even after instance replacement.

5. EC2 and EBS Performance Optimization

5.1 EC2 Performance

Burstable Instances (T-Series) and CPU Credits

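T2, T3, T3a, and T4g instances earn CPU credits when running below baseline and spend credits when above. Symptoms of credit exhaustion include CPUCreditBalance falling to zero, CPU pinned at baseline, a gradual increase in response times, and downstream activity such as VolumeReadOps dropping to less than 10% of peak even though requests keep arriving. T2 instances launch in standard mode by default; T3, T3a, and T4g launch in unlimited mode by default.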

Fix: Enable Unlimited mode to allow bursting beyond credit balance:

aws ec2 modify-instance-credit-specification \
  --instance-credit-specification "InstanceId=i-1234567890abcdef0,CpuCredits=unlimited"
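
The credit mechanics can be estimated with simple arithmetic: one CPU credit equals one vCPU running at 100% for one minute, so an instance earns credits at its baseline rate and spends them at its actual utilization. The sketch below assumes an illustrative instance with 2 vCPUs and a 10% baseline; the function names and the starting balance are hypothetical.

```python
def credits_per_hour(vcpus, utilization):
    """Credits used per hour at a given average utilization (0.0 to 1.0).

    One credit = one vCPU at 100% for one minute. Called with the baseline
    utilization, this is also the hourly earn rate.
    """
    return vcpus * utilization * 60


def hours_until_exhausted(balance, vcpus, baseline, utilization):
    """Hours until the credit balance reaches zero, or None if it never does."""
    net_spend = credits_per_hour(vcpus, utilization) - credits_per_hour(vcpus, baseline)
    if net_spend <= 0:
        return None  # at or below baseline: balance holds or grows
    return balance / net_spend


# Illustrative numbers: 2 vCPUs, 10% baseline, 144 credits banked, 40% sustained load.
print(credits_per_hour(2, 0.10))                  # hourly earn rate
print(hours_until_exhausted(144, 2, 0.10, 0.40))  # hours of burst remaining
```

In standard mode the instance drops to baseline once the balance hits zero; in unlimited mode it keeps bursting and the surplus usage is billed instead.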

Auto Scaling — Scaling Policies

Policy	Behavior	Best For
Simple Scaling	One alarm → one action; cooldown applies	Legacy use cases
Step Scaling	Multiple thresholds → proportional steps	Unpredictable traffic needing proportional response
Target Tracking	Maintains a target metric value automatically	Most workloads (e.g., keep CPU at 50%)
Scheduled Scaling	Pre-emptively change capacity at specific times	Predictable peaks (business hours, weekly patterns)
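
Step scaling picks an adjustment based on how far the metric has moved past the alarm threshold. The following is a minimal sketch of that lookup, assuming hypothetical step boundaries and a hypothetical function name step_adjustment; bounds are offsets from the threshold, with None meaning no upper limit.

```python
def step_adjustment(metric_value, threshold, steps):
    """Return the capacity change for the step the breach size falls into.

    steps is a list of (lower_bound, upper_bound, adjustment) tuples, with
    bounds expressed relative to the threshold. Returns 0 when no step applies.
    """
    breach = metric_value - threshold
    for lower, upper, adjustment in steps:
        if breach >= lower and (upper is None or breach < upper):
            return adjustment
    return 0


# Hypothetical policy: threshold at 60% CPU, larger breaches add more instances.
steps = [(0, 10, 1), (10, 20, 2), (20, None, 4)]
for cpu in (50, 65, 75, 95):
    print(cpu, "->", step_adjustment(cpu, 60, steps))
```

This proportional response is what distinguishes step scaling from simple scaling, which applies one fixed action per alarm.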

When scale-out is too slow, diagnose whether the delay is in launching, bootstrapping, or becoming healthy. For long user data scripts, pre-bake AMIs. For long health check grace periods, reduce them. Add a warm pool to keep pre-initialized instances ready to launch instantly.

5.2 EBS Performance Deep Dive

Volume Types

Volume Type	IOPS	Throughput	Key Characteristic
gp2	3 IOPS/GB; max 16,000	250 MB/s	IOPS tied to size; legacy
gp3	Baseline 3,000; max 16,000 (configurable)	125–1,000 MB/s	IOPS independent of size; recommended
io1/io2	Up to 64,000 (io2 Block Express: 256,000)	Up to 4,000 MB/s	High-performance; highest cost
st1	Max 500	Up to 500 MB/s	Sequential access; not bootable
sc1	Max 250	Up to 250 MB/s	Cold data; cheapest EBS option

Diagnosing EBS Problems

High VolumeQueueLength (consistently > 1) indicates I/O requests are waiting due to IOPS saturation. Fix: increase provisioned IOPS by modifying to io2 or increasing gp3 IOPS. On gp2, increase volume size to gain more IOPS or migrate to gp3.

Volume restored from snapshot is slow initially because EBS volumes restored from snapshots have blocks stored in S3 — first access to each block requires a round trip to S3. Fix: enable Fast Snapshot Restore (FSR) on the snapshot so blocks are pre-initialized.

Exam Tip: Need 10,000 IOPS on gp2? You must provision about 3,334 GiB (at 3 IOPS/GiB), which is expensive. On gp3, you configure 10,000 IOPS independently of volume size (subject to the gp3 maximum of 500 IOPS per GiB), which is cheaper. Prefer gp3 for IOPS requirements.
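
The gp2 sizing arithmetic is easy to check: at 3 IOPS per GiB, the required size is the target divided by 3, rounded up, since 3,333 GiB yields only 9,999 IOPS. This sketch uses a hypothetical helper gp2_size_for_iops and ignores gp2's minimum baseline and burst behavior on small volumes.

```python
import math

GP2_IOPS_PER_GIB = 3
GP2_MAX_IOPS = 16_000


def gp2_size_for_iops(target_iops):
    """Smallest gp2 size in GiB whose baseline meets target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 baseline performance tops out at 16,000 IOPS")
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)


print(gp2_size_for_iops(10_000))  # 3334
print(gp2_size_for_iops(16_000))  # 5334
```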

5.3 EFS Performance Deep Dive

Performance Modes

Mode	Ops/Sec Limit	Latency	Best For
General Purpose (default)	7,000 ops/sec	Lower	Most workloads, latency-sensitive apps
Max I/O	Unlimited	Higher	Massively parallel workloads

Note: You cannot switch an existing General Purpose file system to Max I/O in place — you must create a new EFS with Max I/O and migrate data.

Throughput Modes

Mode	How Throughput Is Determined
Bursting	Scales with storage size; earns burst credits when below baseline
Provisioned	You specify throughput independently of storage size

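Exam Tip: PercentIOLimit = 100% means a General Purpose file system is at its operations-per-second limit. The fix is to migrate to a file system created in Max I/O performance mode, not to switch to Provisioned Throughput, which raises throughput (MiB/s) and is the fix when BurstCreditBalance is depleted in Bursting mode.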

5.4 S3 Performance and Cost Optimization

S3 Transfer Acceleration

Routes uploads through CloudFront edge locations → AWS global backbone → S3 bucket. Uses a special endpoint: bucket.s3-accelerate.amazonaws.com. Best for global users uploading to a single centralized S3 bucket.

Multipart Upload: Recommended for objects > 100 MB, required for > 5 GB. Use aws s3 cp (automatically uses multipart) rather than aws s3api put-object (single-part only). Clean up incomplete multipart uploads with an S3 Lifecycle rule → AbortIncompleteMultipartUpload after 7 days.
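
Multipart uploads are bounded by a few limits: at most 10,000 parts per upload and a minimum part size of 5 MiB (the last part may be smaller). The sketch below plans a part size within those limits; plan_parts is a hypothetical helper, and the 8 MiB starting part size is an assumption chosen for illustration.

```python
import math

MIB = 1024 ** 2
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * MIB


def plan_parts(object_size, part_size=8 * MIB):
    """Return (part_size, part_count) that fits within the 10,000-part limit."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("parts other than the last must be at least 5 MiB")
    # Double the part size until the object fits in the allowed number of parts.
    while math.ceil(object_size / part_size) > MAX_PARTS:
        part_size *= 2
    return part_size, math.ceil(object_size / part_size)


size, count = plan_parts(100 * MIB)
print(size // MIB, "MiB parts x", count)   # 8 MiB parts x 13

size, count = plan_parts(1024 ** 4)        # 1 TiB object
print(size // MIB, "MiB parts x", count)   # 128 MiB parts x 8192
```

Each part that was uploaded but never completed or aborted keeps accruing storage charges, which is why the lifecycle rule for incomplete uploads matters.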

S3 Storage Classes

Storage Class	Retrieval	Min Duration	Use Case
Standard	Immediate	None	Frequently accessed data
Standard-IA	Immediate	30 days	Infrequent, needs fast retrieval
One Zone-IA	Immediate	30 days	Infrequent, single AZ, lower cost
Intelligent-Tiering	Immediate	None	Unknown or changing access patterns
Glacier Instant	Immediate	90 days	Archive with instant retrieval requirement
Glacier Flexible	Minutes to hours	90 days	Archive; retrieval under 5 hours
Glacier Deep Archive	12–48 hours	180 days	Long-term compliance archive
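
The Min Duration column has a direct cost effect: an object deleted or transitioned before the minimum is still billed for the full minimum period. A small sketch of that rule, using the values from the table above and a hypothetical helper named billable_days:

```python
# Minimum storage duration in days, taken from the table above.
MIN_DAYS = {
    "Standard": 0,
    "Standard-IA": 30,
    "One Zone-IA": 30,
    "Intelligent-Tiering": 0,
    "Glacier Instant": 90,
    "Glacier Flexible": 90,
    "Glacier Deep Archive": 180,
}


def billable_days(storage_class, days_stored):
    """Days of storage billed for an object kept for days_stored days."""
    return max(days_stored, MIN_DAYS[storage_class])


print(billable_days("Standard-IA", 10))            # 30
print(billable_days("Glacier Deep Archive", 200))  # 200
```

This is why short-lived objects usually belong in Standard even when they are rarely read.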

5.5 RDS Performance Optimization

RDS Performance Insights

Visual dashboard showing database load as Average Active Sessions (AAS), broken down by wait events, SQL statements, hosts, and users. Available for MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora.

Tool	What It Shows
Performance Insights	Database-level metrics (query performance, waits, connections)
Enhanced Monitoring	OS-level metrics (CPU, memory, disk, network at 1–60 second intervals)

RDS Proxy — Connection Pooling

Problem: Lambda functions exhaust the max_connections parameter on RDS because each invocation opens a new connection. Solution: RDS Proxy maintains a pool of persistent connections to the database. Lambda connects to the proxy, which multiplexes thousands of Lambda connections into the pool. Supports MySQL, PostgreSQL, MariaDB, Aurora MySQL, and Aurora PostgreSQL.

Aurora Read Replica vs. RDS Read Replica

Feature	Aurora Replica	RDS Read Replica
Replication lag	Typically < 100 ms	Seconds to minutes
Promotion to primary	Instant (shared storage)	Requires I/O replay (minutes)
Connection endpoint	Reader endpoint (auto load balances)	Separate read endpoint per replica
Max replicas	15	Up to 15 (MySQL, MariaDB, PostgreSQL); 5 (Oracle, SQL Server)
Auto Scaling	Yes	No

6. Cost and Billing Monitoring

6.1 Cost Allocation and Reporting

Cost Allocation Tags

Apply tags to resources (e.g., CostCenter: Engineering), then activate them in the Billing console of the payer/management account as user-defined cost allocation tags. Tags appear in Cost Explorer after activation (up to 24 hours). The aws:createdBy tag is AWS-generated, must also be activated in the Billing console, and shows which IAM identity created each resource.

Exam Tip: Tags must be activated in the payer account. Activating in member accounts has no effect on cost reporting.

Billing and Cost Tools

Tool	Purpose
AWS Budgets	Alert when actual or forecasted spending exceeds a threshold; per-team, per-project budgets
Cost and Usage Report (CUR)	Most granular billing data; hourly/daily/monthly; delivered to S3; queryable with Athena
AWS Compute Optimizer	Analyzes CloudWatch metrics (the last 14 days by default) to produce right-sizing recommendations for EC2, EBS, and Lambda
AWS Health Dashboard (formerly Personal Health Dashboard)	Personalized events affecting YOUR specific resources (retirements, maintenance, availability)

Exam Tips & Quick Reference

Scenario-to-Answer Mapping

Scenario Keyword / Requirement	Correct Answer
"Without custom code"	Config rule + SSM Automation (not Lambda)
"Individual process CPU monitoring"	CloudWatch agent `procstat` plugin
"Disk space / memory on EC2"	CloudWatch agent (not default metrics)
"File didn't arrive — notify"	CloudWatch alarm, Invocations < 1, missing data=breaching
"Don't know expected baseline"	CloudWatch anomaly detection
"Both metric A AND metric B must alarm"	Composite alarm
"Same dashboard across all deployments"	Export dashboard JSON → CloudFormation `DashboardBody`
"All ASG instances in one dashboard"	CloudWatch Metrics Explorer with ASG tag filter
"Verify CloudTrail logs not tampered"	Enable log file integrity validation on trail
"Who made API calls with stolen key"	CloudTrail Event History, filter by access key
"EBS volume slow after snapshot restore"	Enable Fast Snapshot Restore (FSR)
"T-series instance CPU throttled"	Enable Unlimited mode
"EFS PercentIOLimit = 100%"	Migrate to a file system in Max I/O performance mode
"RDS too many connections (Lambda)"	Deploy RDS Proxy
"Daily cost report to S3"	Cost and Usage Report (CUR)
"Alert when spend forecast exceeds budget"	AWS Budgets
"Identify which developer created resources"	Activate `aws:createdBy` tag + Cost Explorer
"Upcoming EC2 hardware maintenance"	AWS Health Dashboard (formerly Personal Health Dashboard)
"Right-size EC2 recommendations"	AWS Compute Optimizer

Common Traps

Default EC2 metrics include disk ops, not disk space: DiskReadOps is available by default; disk space utilization is NOT — it requires the CloudWatch agent.
Anomaly detection vs. static threshold: If the question says the team doesn't know what "normal" looks like, anomaly detection is always preferred over an arbitrary threshold.
Missing data treatment: Only set missing data to "breaching" when the absence of data is itself a problem (file not arriving, Lambda not running). For all other cases use the default.
Cost tags in wrong account: Activating tags in a member account does nothing — always activate in the management/payer account.

Key Terms — Domain 1

Term	One-Line Definition
CloudWatch agent	Software installed on EC2 to collect memory, disk, and process metrics
procstat plugin	CloudWatch agent plugin for per-process metrics (CPU, memory, PID count)
Composite alarm	Single alarm that combines multiple CloudWatch alarms with AND/OR logic
Anomaly detection alarm	ML-based alarm that alerts when a metric deviates from its expected band
CloudWatch Synthetics	Scheduled canary scripts that simulate user browser interactions
Fast Snapshot Restore (FSR)	EBS feature that pre-initializes snapshot blocks to avoid first-access latency
RDS Proxy	Connection pooler between Lambda/apps and RDS that prevents connection exhaustion
Cost Allocation Tags	Resource tags activated in the payer account to appear in billing reports
Compute Optimizer	AWS service that analyzes usage metrics and recommends right-sized instance types
AWS Health Dashboard (formerly Personal Health Dashboard)	Personalized view of AWS service events affecting your specific resources

End of Domain 1. Continue to Domain 2: Reliability and Business Continuity →
