
Monitoring, Logging, Analysis, Remediation, and Performance Optimization


AWS Certified CloudOps Engineer - Associate (SOA-C03) — Domain 1: Monitoring, Logging, Analysis, Remediation & Performance Optimization

Exam Code: SOA-C03  |  Level: Associate
Domain Weight: 22%  |  Total Domains: 5  |  Passing Score: 720/1000


Table of Contents

  1. Amazon CloudWatch
  2. AWS CloudTrail
  3. Amazon EventBridge
  4. AWS Systems Manager — Monitoring and Remediation
  5. EC2 and EBS Performance Optimization
  6. Cost and Billing Monitoring
  7. Exam Tips & Quick Reference

1. Amazon CloudWatch

Amazon CloudWatch is the primary observability service for AWS. It collects metrics, logs, and events, enables alarming, and supports automated remediation. The exam tests both default metric coverage and what requires the CloudWatch agent.

1.1 Metrics — Default vs. Custom

Most AWS services publish a standard set of metrics to CloudWatch automatically, with no agent or configuration required.

Service | Key Default Metrics
EC2 | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps
EBS | VolumeReadOps, VolumeWriteOps, VolumeQueueLength, VolumeReadBytes, VolumeWriteBytes
ALB | HealthyHostCount, UnHealthyHostCount, RequestCount, RequestCountPerTarget, HTTPCode_ELB_5XX_Count
RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadIOPS, WriteIOPS
SQS | ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage, ApproximateNumberOfMessagesDelayed
Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions
Classic Load Balancer (ELB) | HealthyHostCount, UnHealthyHostCount, Latency, HTTPCode_Backend_5XX

The following are NOT available by default and require the CloudWatch agent installed on the EC2 instance: memory utilization (RAM usage, swap usage), disk space and disk utilization (percentage used, bytes available), individual process metrics, custom application log files, Windows Event Logs, and Windows Performance Counters.

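Exam Tip: DiskReadBytes and DiskWriteBytes are EC2 metrics (AWS/EC2 namespace) that report instance store volumes only, NOT EBS volumes. The correct EBS metrics are VolumeReadBytes and VolumeWriteBytes (AWS/EBS namespace). If an alarm meant to watch an EBS volume uses the EC2 disk metrics, reconfigure it to use the EBS volume metrics.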

1.2 Installing and Configuring the CloudWatch Agent

Installation methods (in order of scalability):

  1. Systems Manager Run Command — best for fleets; no SSH needed; use AWS-ConfigureAWSPackage document.
  2. EC2 user data — runs on first launch only.
  3. Manual SSH — single instances only.

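The EC2 instance role MUST include CloudWatchAgentServerPolicy (allows the agent to publish metrics and logs and to read its configuration from Parameter Store parameters whose names begin with AmazonCloudWatch-) and AmazonSSMManagedInstanceCore (allows Systems Manager to manage the instance, which Run Command and State Manager depend on).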

Agent Configuration File

The configuration file (/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json) controls what is collected. Store it centrally in SSM Parameter Store (AmazonCloudWatch-linux, AmazonCloudWatch-windows) so a fleet of hundreds of instances can share a single configuration — update once, apply everywhere via State Manager.

`procstat` Plugin — Monitor Individual Processes

Used to identify which specific process is consuming CPU or memory during incidents:

{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "nginx",
          "measurement": ["cpu_usage", "memory_rss", "pid_count"]
        }
      ]
    }
  }
}

`append_dimensions` — Add Custom Dimensions

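Adds dimensions to collected metrics. The global append_dimensions field under "metrics" accepts only the EC2 dimensions InstanceId, ImageId, InstanceType, and AutoScalingGroupName (other keys are ignored). For custom tags such as environment, team, or application, use append_dimensions inside an individual metrics_collected section:

{
  "agent": { "metrics_collection_interval": 60 },
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "append_dimensions": { "Environment": "Production", "Application": "WebApp" }
      }
    }
  }
}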

Use SSM Run Command with the agent's append-config action to merge additional metrics into the running configuration on specific instances, without rebuilding the config from scratch:

amazon-cloudwatch-agent-ctl -a append-config -m ec2 -c ssm:AmazonCloudWatch-DHCP -s

1.3 Alarm Types

Static Threshold Alarms

Alert when a metric crosses a fixed value. Best when the normal range is well-understood. Example: CPU > 80% for 5 minutes.

Anomaly Detection Alarms

Uses ML to model expected behavior based on historical patterns. Creates a band of expected values (upper and lower bounds). Adapts automatically to daily/weekly patterns and seasonal trends.

Exam Tip: When the question says "application team does not know expected usage or growth," use anomaly detection alarms rather than static threshold alarms.
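
To make the idea of an expected band concrete, the sketch below derives a band from the mean and standard deviation of past samples. This is a deliberate simplification: CloudWatch anomaly detection uses its own models that also account for daily and weekly seasonality. The function names, the sample data, and the width parameter are hypothetical.

```python
from statistics import mean, stdev

def expected_band(history, width=2.0):
    """Return (lower, upper) bounds as mean +/- width * standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread

def is_anomalous(value, history, width=2.0):
    """True when value falls outside the band computed from history."""
    lower, upper = expected_band(history, width)
    return value < lower or value > upper

# Hypothetical CPU samples (percent) from earlier periods.
history = [40, 42, 38, 41, 39, 40, 43, 37]
print(expected_band(history))
print(is_anomalous(41, history))   # inside the band
print(is_anomalous(75, history))   # far above the band
```

A static threshold would need someone to pick the number up front; here the bounds follow whatever the history looks like, which is the property the exam scenario is pointing at.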

Composite Alarms

Combines multiple alarms using AND/OR/NOT logic and sends ONE notification when the combined rule expression evaluates to true, instead of one notification per child alarm. Reduces alarm noise significantly.

AlarmRule: "ALARM(DiskUtilizationAlarm) AND ALARM(DiskReadOpsAlarm)"
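
The rule expression above can be assembled and reasoned about with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical, and it models plain AND/OR over child alarm states rather than the full rule grammar.

```python
def build_alarm_rule(alarm_names, operator="AND"):
    """Join child alarm names into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    terms = [f"ALARM({name})" for name in alarm_names]
    return f" {operator} ".join(terms)

def composite_in_alarm(child_states, operator="AND"):
    """Evaluate the rule against a dict of child alarm name -> state."""
    hits = [state == "ALARM" for state in child_states.values()]
    return all(hits) if operator == "AND" else any(hits)

rule = build_alarm_rule(["DiskUtilizationAlarm", "DiskReadOpsAlarm"])
print(rule)

states = {"DiskUtilizationAlarm": "ALARM", "DiskReadOpsAlarm": "OK"}
print(composite_in_alarm(states, "AND"))  # only one child is in ALARM
print(composite_in_alarm(states, "OR"))
```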

Missing Data Treatment

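When an alarm's metric does not report data, choose how to treat the missing data points: notBreaching (treat as within the threshold), breaching (treat as violating the threshold), ignore (maintain the current alarm state), or missing (the default; the alarm moves to INSUFFICIENT_DATA if every data point in the evaluation range is missing).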

Exam Tip: When monitoring for a file that SHOULD arrive every hour and trigger a Lambda function: create an alarm on the Lambda Invocations metric (Sum < 1 over a 1-hour period) and set missing data treatment to "breaching". Lambda publishes no Invocations data point at all for a period with no invocations, so the missing data itself is the signal that the file never arrived.
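
The effect of each missing-data setting can be seen in a small model. The sketch below is an assumption-laden simplification (a single "less than threshold" comparison where every evaluated point must breach); real CloudWatch evaluation with M-of-N data points and evaluation ranges is more involved, and the function name is hypothetical.

```python
def evaluate_alarm(datapoints, threshold, treat_missing="missing", current_state="OK"):
    """Simplified alarm check: ALARM when every evaluated point is below threshold.

    datapoints: list of numbers, with None marking a period that reported no data.
    """
    flags = []
    for point in datapoints:
        if point is not None:
            flags.append(point < threshold)
        elif treat_missing == "breaching":
            flags.append(True)
        elif treat_missing == "notBreaching":
            flags.append(False)
        # "ignore" and "missing" contribute no flag for an absent point

    if not flags:
        # Every period was missing and nothing was substituted.
        return current_state if treat_missing == "ignore" else "INSUFFICIENT_DATA"
    return "ALARM" if all(flags) else "OK"

no_data = [None]  # the hour in which the function was never invoked
print(evaluate_alarm(no_data, threshold=1, treat_missing="breaching"))
print(evaluate_alarm(no_data, threshold=1, treat_missing="notBreaching"))
print(evaluate_alarm(no_data, threshold=1, treat_missing="missing"))
print(evaluate_alarm([3], threshold=1, treat_missing="breaching"))
```

Only the "breaching" setting turns the silent hour into an ALARM, which is why it is the right choice when the absence of data is itself the failure.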

1.4 CloudWatch Dashboards

Automating Dashboard Deployment

To deploy identical dashboards across every application deployment: create the dashboard manually, export it as JSON (Dashboard → Actions → View/Edit source), then include it in a CloudFormation template:

Resources:
  AppDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${AppName}-dashboard"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "${EC2Instance}"]],
                "region": "${AWS::Region}",
                "title": "EC2 CPU"
              }
            }
          ]
        }
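
When the dashboard body is generated by a script instead of being exported by hand, building it as a data structure and serializing it avoids quoting mistakes. The sketch below is one possible way to do that; the function name and the instance ID are hypothetical, and the output is simply the JSON string that a DashboardBody would contain.

```python
import json

def dashboard_body(instance_id, region, title="EC2 CPU"):
    """Build a one-widget dashboard body as a JSON string."""
    widget = {
        "type": "metric",
        "properties": {
            "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
            "region": region,
            "title": title,
        },
    }
    return json.dumps({"widgets": [widget]}, indent=2)

# Hypothetical instance ID and region, for illustration only.
body = dashboard_body("i-1234567890abcdef0", "us-east-1")
print(body)
```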

Metrics Explorer for Auto Scaling Groups

Instead of manually updating dashboards when new instances launch, use CloudWatch Metrics Explorer filtered by aws:autoscaling:groupName tag. This creates dynamic visualizations that automatically include new instances as they launch — no Lambda function needed.

1.5 Logs and Logs Insights

Every application should send logs to a CloudWatch Logs log group with a retention policy set on the log group (not on individual log streams).

CloudWatch Logs Insights is an interactive query feature with a purpose-built query language for analyzing logs in place, without exporting them:

# Count Lambda errors by type (set the query time range to the last 7 days)
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by errorType
| sort errorCount desc

# Find the top 5 destination IPs by bytes sent through a NAT gateway (VPC Flow Logs)
fields dstAddr, bytes
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 5

Tool | When to Use
Logs Insights | Query directly in CloudWatch Logs; no export needed; faster setup
Athena | Query data exported to S3; better for very large datasets

1.6 Synthetics Canaries

Canary Type | What It Does
Heartbeat monitor | Regularly loads a URL, captures screenshots and HTTP response
API canary | Tests REST API endpoints for availability and correct responses
Broken link checker | Scans all links on a page for broken URLs
GUI workflow builder | Simulates multi-step user journeys (login → add to cart → checkout)

Key Concept: Route 53 health checks perform a simple endpoint availability check. CloudWatch Synthetics performs full browser simulation with screenshots and user journey validation. Use Synthetics when you need to "follow the same routes and actions as a customer."


2. AWS CloudTrail

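CloudTrail records API activity in your AWS account: who made the call (IAM identity), when, from what IP, what was called, and whether it succeeded or failed. Management events are recorded by default; data events (such as S3 object-level operations and Lambda invocations) are recorded only when enabled on a trail.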

2.1 CloudTrail Fundamentals

Type | Coverage
Single-region trail | Only records events in one region
Multi-region trail (recommended) | Records events in all regions; sends all to one S3 bucket
Organization trail | Records events for all accounts in an AWS Organization

Log File Integrity Validation

Enable integrity validation on the trail itself (not on the S3 bucket). Every hour, CloudTrail creates a digest file in S3 containing SHA-256 hashes of every log file delivered in that hour. The digest file itself is signed with CloudTrail's private key.

aws cloudtrail validate-logs \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/MyTrail \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z

Exam Tip: Enable log file integrity on the trail itself, not on the S3 bucket destination. CloudWatch Logs streaming, S3 versioning, or MFA Delete alone do NOT provide integrity validation.
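
The hash comparison at the heart of integrity validation can be illustrated as follows. This is a minimal sketch under stated assumptions: the file names and contents are made up, the "digest" is a plain dictionary, and the signature check that CloudTrail performs on real digest files is omitted entirely.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()

def find_tampered(log_files, recorded_hashes):
    """Return names of files that are missing or whose hash differs from the recorded one."""
    bad = []
    for name, expected in recorded_hashes.items():
        content = log_files.get(name)
        if content is None or sha256_hex(content) != expected:
            bad.append(name)
    return sorted(bad)

# Hypothetical in-memory "log files" standing in for objects in S3.
original = {
    "log-0001.json": b'{"eventName": "RunInstances"}',
    "log-0002.json": b'{"eventName": "StopLogging"}',
}
recorded = {name: sha256_hex(data) for name, data in original.items()}

# Simulate someone rewriting one log file after delivery.
current = dict(original)
current["log-0002.json"] = b'{"eventName": "DescribeInstances"}'

print(find_tampered(original, recorded))  # []
print(find_tampered(current, recorded))   # ['log-0002.json']
```

Because the recorded hashes live in a separately signed digest, editing a log file without detection would also require forging the digest, which is what S3 versioning or MFA Delete alone cannot guarantee.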

Investigating Compromised Access Keys

When an IAM access key is accidentally pushed to a public GitHub repository: go to CloudTrail Event History, filter by AWS access key, enter the compromised key ID, and see every API call made with that key, from what IPs, at what times. Event History covers the last 90 days of management events.

2.2 CloudTrail vs. CloudWatch Logs vs. VPC Flow Logs

Source | What It Captures | Query Tool
CloudTrail | AWS API calls (who did what to which resource) | Event History, Athena
CloudWatch Logs | Application logs, OS logs, Lambda logs | Logs Insights
VPC Flow Logs | Network traffic metadata (IP, port, bytes, action) | Athena, Logs Insights
S3 Server Access Logs | HTTP requests to S3 objects (source IP, operation, status) | Athena
ALB Access Logs | HTTP request details including source IP, target, response code | Athena

3. Amazon EventBridge

EventBridge routes AWS events and scheduled triggers to targets like Lambda, SNS, and SSM Automation — enabling event-driven automation without custom polling code.

3.1 EventBridge Rules

Event Pattern Rules

Match and route specific AWS events:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["terminated", "stopped"]
  }
}
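
How a pattern like this selects events can be shown with a small matcher. The sketch below handles only exact-value matching (a list in the pattern means "any of these values") and nested objects; it does not implement prefix, numeric, or anything-but operators, and the function name and instance ID are hypothetical.

```python
def matches(pattern, event):
    """Simplified matcher: every pattern key must match; lists mean 'any of'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated", "stopped"]},
}

stopped = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-1234567890abcdef0", "state": "stopped"},
}
running = {**stopped, "detail": {"instance-id": "i-1234567890abcdef0", "state": "running"}}

print(matches(pattern, stopped))  # True
print(matches(pattern, running))  # False
```

Fields present in the event but absent from the pattern (such as instance-id here) are ignored, so a pattern only needs to name the fields it cares about.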

Scheduled Rules

Use cron or rate expressions to trigger Lambda, SSM Automation, and more:

  • rate(1 hour): every hour
  • cron(0 20 * * ? *): every day at 8 PM UTC
  • cron(0 9 ? * MON-FRI *): weekdays at 9 AM UTC
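
The six-field layout of these cron expressions (minutes, hours, day of month, month, day of week, year) is easy to get wrong, so a small checker can help when writing them. This sketch only splits the fields and enforces one rule, namely that '?' appears in exactly one of the two day fields; it is an assumption-level simplification, does not validate ranges or names, and the function name is hypothetical.

```python
FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")

def parse_cron(expression):
    """Split 'cron(...)' into named fields and check the '?' rule."""
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("expected the form cron(<six fields>)")
    parts = expression[5:-1].split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 6 fields, got {len(parts)}")
    fields = dict(zip(FIELDS, parts))
    # This sketch requires '?' in exactly one of the two day fields.
    if (fields["day_of_month"] == "?") == (fields["day_of_week"] == "?"):
        raise ValueError("use '?' in exactly one of day_of_month and day_of_week")
    return fields

print(parse_cron("cron(0 20 * * ? *)"))
print(parse_cron("cron(0 9 ? * MON-FRI *)"))
```

A five-field Unix-style expression such as cron(0 9 * * 1-5) fails the field-count check here, which mirrors the most common mistake when porting crontab entries.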

EventBridge API Destinations

Allows forwarding events to external HTTPS endpoints (third-party ticketing systems, webhooks). Configure a connection (authentication headers, API keys), create an API destination with the HTTPS endpoint URL, and add an input transformer to reshape the event JSON before sending.

3.2 Common EventBridge Patterns

Business Requirement | EventBridge Pattern
Notify when EC2 starts/stops | EC2 State-change event → SNS topic
Nightly report generation | Scheduled rule → Lambda function
Re-enable CloudTrail when disabled | CloudTrail config change event → SSM Automation runbook
Spot Instance interruption notification | EC2 Spot Interruption Warning → SNS
Auto-remediate non-compliant resources | Config → EventBridge → SSM Automation

4. AWS Systems Manager — Monitoring and Remediation

SSM provides fleet management, remediation, and secure instance access — all without requiring SSH or custom code.

4.1 SSM for Monitoring and Remediation

Key Concept: The exam repeatedly tests "without custom code" scenarios. The correct answer is usually an AWS Config managed rule plus an SSM Automation document, not EventBridge → Lambda (which requires custom code).

Common SSM Automation runbooks for remediation: AWS-RestartEC2Instance, AWS-StopEC2Instance, AWS-StartEC2Instance, AWS-TerminateEC2Instance, and AWS-ReleaseElasticIP. AWS-ConfigureAWSPackage, by contrast, is a Run Command document used to install packages such as the CloudWatch agent.

Session Manager for Audit Logging

Configure Session Manager to stream all session data to CloudWatch Logs. Every command typed and output returned is logged. No SSH, no bastion hosts, no security group inbound rules needed.

State Manager for Consistent Configuration

Use SSM State Manager to continuously enforce desired state across a fleet. Create an association to run AmazonCloudWatch-ManageAgent on a schedule (e.g., every 30 minutes), targeting instances by tag. This ensures the CloudWatch agent stays configured consistently — even after instance replacement.


5. EC2 and EBS Performance Optimization

5.1 EC2 Performance

Burstable Instances (T-Series) and CPU Credits

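T2, T3, T3a, and T4g instances earn CPU credits while running below their baseline and spend credits while bursting above it. Symptoms of credit exhaustion include CPUCreditBalance falling to zero, CPU pinned at baseline, a gradual increase in response times, and VolumeReadOps dropping to less than 10% of peak despite steady request volume.

Fix: Enable Unlimited mode to allow bursting beyond the credit balance (surplus credits are billed). T2 launches in standard mode by default; T3, T3a, and T4g launch in unlimited mode by default: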

aws ec2 modify-instance-credit-specification \
  --instance-credit-specification "InstanceId=i-1234567890abcdef0,CpuCredits=unlimited"
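
To see why a standard-mode instance ends up pinned at baseline, the credit balance can be simulated minute by minute. The sketch below relies on the definition that one CPU credit equals one vCPU running at 100% for one minute; the earn rate, vCPU count, starting balance, and utilization are hypothetical illustration values, and the throttling model is simplified.

```python
def simulate_credits(start_balance, earn_per_hour, vcpus, utilization_by_minute):
    """Track a credit balance per minute; returns (final_balance, throttled_minutes)."""
    balance = start_balance
    throttled = 0
    for utilization in utilization_by_minute:
        balance += earn_per_hour / 60      # credits earned this minute
        spend = vcpus * utilization        # credits this minute would cost
        if spend > balance:
            throttled += 1                 # standard mode: cannot burst any further
            balance = 0.0
        else:
            balance -= spend
    return balance, throttled

# Hypothetical workload: 2 vCPUs held at 50% for one hour, starting with 30 credits.
final_balance, throttled_minutes = simulate_credits(
    start_balance=30, earn_per_hour=12, vcpus=2, utilization_by_minute=[0.5] * 60
)
print(final_balance, throttled_minutes)
```

In this run the instance spends credits faster than it earns them, exhausts the balance partway through the hour, and is throttled for the remainder. Unlimited mode removes the throttling step at the cost of paying for surplus credits.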

Auto Scaling — Scaling Policies

Policy | Behavior | Best For
Simple Scaling | One alarm → one action; cooldown applies | Legacy use cases
Step Scaling | Multiple thresholds → proportional steps | Unpredictable traffic needing proportional response
Target Tracking | Maintains a target metric value automatically | Most workloads (e.g., keep CPU at 50%)
Scheduled Scaling | Pre-emptively change capacity at specific times | Predictable peaks (business hours, weekly patterns)
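
The intuition behind target tracking is proportional: if the metric is above target, capacity grows by roughly the same ratio. The sketch below captures only that intuition. The real policy manages its own alarms and scales in more conservatively than it scales out, and the function name and numbers are hypothetical.

```python
import math

def desired_capacity(current_capacity, current_metric, target_metric, min_size, max_size):
    """Proportional estimate: scale capacity by how far the metric is from target."""
    raw = current_capacity * current_metric / target_metric
    # Round up so the metric lands at or below target, then clamp to group limits.
    return max(min_size, min(max_size, math.ceil(raw)))

# 4 instances averaging 80% CPU against a 50% target.
print(desired_capacity(4, 80.0, 50.0, min_size=2, max_size=10))  # 7
# 4 instances averaging 20% CPU: the estimate is clamped to the group minimum.
print(desired_capacity(4, 20.0, 50.0, min_size=2, max_size=10))  # 2
```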

When scale-out is too slow, diagnose whether the delay is in launching, bootstrapping, or becoming healthy. For long user data scripts, pre-bake AMIs. For long health check grace periods, reduce them. Add a warm pool to keep pre-initialized instances ready to launch instantly.

5.2 EBS Performance Deep Dive

Volume Types

Volume Type | IOPS | Throughput | Key Characteristic
gp2 | 3 IOPS/GB; max 16,000 | 250 MB/s | IOPS tied to size; legacy
gp3 | Baseline 3,000; max 16,000 (configurable) | 125–1,000 MB/s | IOPS independent of size; recommended
io1/io2 | Up to 64,000 (io2 Block Express: 256,000) | Up to 4,000 MB/s | High-performance; highest cost
st1 | Max 500 | Up to 500 MB/s | Sequential access; not bootable
sc1 | Max 250 | Up to 250 MB/s | Cold data; cheapest EBS option

Diagnosing EBS Problems

High VolumeQueueLength (consistently > 1) indicates I/O requests are waiting due to IOPS saturation. Fix: increase provisioned IOPS by modifying to io2 or increasing gp3 IOPS. On gp2, increase volume size to gain more IOPS or migrate to gp3.

A volume restored from a snapshot is slow at first because its blocks are loaded lazily from S3: the first access to each block requires a round trip to S3. Fix: enable Fast Snapshot Restore (FSR) on the snapshot so that volumes created from it are fully initialized at creation.

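Exam Tip: Need 10,000 IOPS on gp2? At 3 IOPS/GB you must provision about 3,334 GB, which is expensive. On gp3, you configure 10,000 IOPS directly on a much smaller volume (gp3 allows up to 500 IOPS per GB, so 20 GB is enough), which is cheaper. Prefer gp3 when the requirement is IOPS rather than capacity.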
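
The gp2 arithmetic in the tip can be checked with a short calculation. The sketch below uses the 3 IOPS/GB ratio and 16,000 IOPS cap from the table above plus gp2's 100 IOPS floor; the function names are hypothetical.

```python
import math

GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000

def gp2_baseline_iops(size_gib):
    """Baseline IOPS of a gp2 volume of the given size."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))

def gp2_size_for_iops(target_iops):
    """Smallest gp2 size (GiB) whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 cannot exceed 16,000 IOPS at any size")
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)

size = gp2_size_for_iops(10_000)
print(size, gp2_baseline_iops(size))   # 3334 10002
print(gp2_baseline_iops(3333))         # 9999, just short of the target
print(gp2_baseline_iops(20))           # small volumes sit at the 100 IOPS floor
```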

5.3 EFS Performance Deep Dive

Performance Modes

Mode | Ops/Sec Limit | Latency | Best For
General Purpose (default) | 7,000 ops/sec | Lower | Most workloads, latency-sensitive apps
Max I/O | Unlimited | Higher | Massively parallel workloads

Note: You cannot switch an existing General Purpose file system to Max I/O in place — you must create a new EFS with Max I/O and migrate data.

Throughput Modes

Mode | How Throughput Is Determined
Bursting | Scales with storage size; earns burst credits when below baseline
Provisioned | You specify throughput independently of storage size

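Exam Tip: PercentIOLimit applies to General Purpose performance mode and shows how close the file system is to its I/O operations limit. If it sits at or near 100%, the fix is Max I/O performance mode (a new file system plus data migration, as noted above), not Provisioned Throughput. Provisioned Throughput addresses a different bottleneck: throughput (MiB/s) limits, typically signalled by a depleted BurstCreditBalance in Bursting mode.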

5.4 S3 Performance and Cost Optimization

S3 Transfer Acceleration

Routes uploads through CloudFront edge locations → AWS global backbone → S3 bucket. Uses a special endpoint: bucket.s3-accelerate.amazonaws.com. Best for global users uploading to a single centralized S3 bucket.

Multipart Upload: Recommended for objects > 100 MB, required for > 5 GB. Use aws s3 cp (automatically uses multipart) rather than aws s3api put-object (single-part only). Clean up incomplete multipart uploads with an S3 Lifecycle rule → AbortIncompleteMultipartUpload after 7 days.
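
Choosing a part size for a multipart upload is a small arithmetic problem, because an upload may have at most 10,000 parts and every part except the last must be at least 5 MiB. The sketch below plans the part count under those two limits; the 8 MiB default and the function names are assumptions for illustration.

```python
MIB = 1024 * 1024
MAX_PARTS = 10_000
MIN_PART_SIZE = 5 * MIB

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def plan_parts(object_size, part_size=8 * MIB):
    """Return (part_count, part_size), growing part_size to stay within MAX_PARTS."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    if ceil_div(object_size, part_size) > MAX_PARTS:
        part_size = ceil_div(object_size, MAX_PARTS)
    return ceil_div(object_size, part_size), part_size

print(plan_parts(1024 * MIB))          # 1 GiB object: 128 parts of 8 MiB
print(plan_parts(100 * 1024 * MIB))    # 100 GiB object: part size grows to fit 10,000 parts
```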

S3 Storage Classes

Storage Class | Retrieval | Min Duration | Use Case
Standard | Immediate | None | Frequently accessed data
Standard-IA | Immediate | 30 days | Infrequent, needs fast retrieval
One Zone-IA | Immediate | 30 days | Infrequent, single AZ, lower cost
Intelligent-Tiering | Immediate | None | Unknown or changing access patterns
Glacier Instant | Immediate | 90 days | Archive with instant retrieval requirement
Glacier Flexible | Minutes to hours | 90 days | Archive; retrieval under 5 hours
Glacier Deep Archive | 12–48 hours | 180 days | Long-term compliance archive

5.5 RDS Performance Optimization

RDS Performance Insights

Visual dashboard showing database load as Average Active Sessions (AAS), broken down by wait events, SQL statements, hosts, and users. Available for MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora.

Tool | What It Shows
Performance Insights | Database-level metrics (query performance, waits, connections)
Enhanced Monitoring | OS-level metrics (CPU, memory, disk, network at 1–60 second intervals)

RDS Proxy — Connection Pooling

Problem: Lambda functions exhaust the max_connections parameter on RDS because each invocation opens a new connection. Solution: RDS Proxy maintains a pool of persistent connections to the database. Lambda connects to the proxy, which multiplexes thousands of Lambda connections into the pool. Supports MySQL, PostgreSQL, MariaDB, Aurora MySQL, and Aurora PostgreSQL.

Aurora Read Replica vs. RDS Read Replica

Feature | Aurora Replica | RDS Read Replica
Replication lag | Typically < 100 ms | Seconds to minutes
Promotion to primary | Instant (shared storage) | Requires I/O replay (minutes)
Connection endpoint | Reader endpoint (auto load balances) | Separate read endpoint per replica
Max replicas | 15 | 5
Auto Scaling | Yes | No

6. Cost and Billing Monitoring

6.1 Cost Allocation and Reporting

Cost Allocation Tags

Apply tags to resources (e.g., CostCenter: Engineering), then activate them in the Billing console of the payer/management account as user-defined cost allocation tags. Tags appear in Cost Explorer after activation (this can take up to 24 hours). The aws:createdBy tag is AWS-generated and shows which IAM identity created each resource; it must also be activated before it appears in reports.

Exam Tip: Tags must be activated in the payer account. Activating in member accounts has no effect on cost reporting.

Billing and Cost Tools

Tool | Purpose
AWS Budgets | Alert when actual or forecasted spending exceeds a threshold; per-team, per-project budgets
Cost and Usage Report (CUR) | Most granular billing data; hourly/daily/monthly; delivered to S3; queryable with Athena
AWS Compute Optimizer | Analyzes CloudWatch metrics (14+ days) for right-sizing EC2, EBS, Lambda recommendations
Personal Health Dashboard (now part of AWS Health Dashboard) | Personalized events affecting YOUR specific resources (retirements, maintenance, availability)

7. Exam Tips & Quick Reference

Scenario-to-Answer Mapping

Scenario Keyword / Requirement | Correct Answer
"Without custom code" | Config rule + SSM Automation (not Lambda)
"Individual process CPU monitoring" | CloudWatch agent procstat plugin
"Disk space / memory on EC2" | CloudWatch agent (not default metrics)
"File didn't arrive, notify" | CloudWatch alarm on Invocations < 1, missing data = breaching
"Don't know expected baseline" | CloudWatch anomaly detection
"Both metric A AND metric B must alarm" | Composite alarm
"Same dashboard across all deployments" | Export dashboard JSON → CloudFormation DashboardBody
"All ASG instances in one dashboard" | CloudWatch Metrics Explorer with ASG tag filter
"Verify CloudTrail logs not tampered" | Enable log file integrity validation on trail
"Who made API calls with stolen key" | CloudTrail Event History, filter by access key
"EBS volume slow after snapshot restore" | Enable Fast Snapshot Restore (FSR)
"T-series instance CPU throttled" | Enable Unlimited mode
"EFS PercentIOLimit = 100%" | Move to a Max I/O performance mode file system
"RDS too many connections (Lambda)" | Deploy RDS Proxy
"Daily cost report to S3" | Cost and Usage Report (CUR)
"Alert when spend forecast exceeds budget" | AWS Budgets
"Identify which developer created resources" | Activate aws:createdBy tag + Cost Explorer
"Upcoming EC2 hardware maintenance" | Personal Health Dashboard
"Right-size EC2 recommendations" | AWS Compute Optimizer

Common Traps

  • Default EC2 metrics include disk ops, not disk space: DiskReadOps is available by default; disk space utilization is NOT — it requires the CloudWatch agent.
  • Anomaly detection vs. static threshold: If the question says the team doesn't know what "normal" looks like, anomaly detection is always preferred over an arbitrary threshold.
  • Missing data treatment: Only set missing data to "breaching" when the absence of data is itself a problem (file not arriving, Lambda not running). For all other cases use the default.
  • Cost tags in wrong account: Activating tags in a member account does nothing — always activate in the management/payer account.

Key Terms — Domain 1

Term | One-Line Definition
CloudWatch agent | Software installed on EC2 to collect memory, disk, and process metrics
procstat plugin | CloudWatch agent plugin for per-process metrics (CPU, memory, PID count)
Composite alarm | Single alarm that combines multiple CloudWatch alarms with AND/OR logic
Anomaly detection alarm | ML-based alarm that alerts when a metric deviates from its expected band
CloudWatch Synthetics | Scheduled canary scripts that simulate user browser interactions
Fast Snapshot Restore (FSR) | EBS feature that pre-initializes snapshot blocks to avoid first-access latency
RDS Proxy | Connection pooler between Lambda/apps and RDS that prevents connection exhaustion
Cost Allocation Tags | Resource tags activated in the payer account to appear in billing reports
Compute Optimizer | AWS service that analyzes usage metrics and recommends right-sized instance types
Personal Health Dashboard | Personalized view of AWS service events affecting your specific resources

End of Domain 1. Continue to Domain 2: Reliability and Business Continuity →

