Monitoring, Logging, Analysis, Remediation, and Performance Optimization
AWS Certified CloudOps Engineer - Associate (SOA-C03) — Domain 1: Monitoring, Logging, Analysis, Remediation & Performance Optimization
Exam Code: SOA-C03 | Level: Associate
Domain Weight: 20% (Monitoring/Logging/Remediation) + 12% (Cost & Performance) | Total Domains: 6 | Passing Score: 720/1000
Table of Contents
- Amazon CloudWatch
- AWS CloudTrail
- Amazon EventBridge
- AWS Systems Manager — Monitoring and Remediation
- EC2 and EBS Performance Optimization
- Cost and Billing Monitoring
- Exam Tips & Quick Reference
1. Amazon CloudWatch
Amazon CloudWatch is the primary observability service for AWS. It collects metrics, logs, and events, enables alarming, and supports automated remediation. The exam tests both default metric coverage and what requires the CloudWatch agent.
1.1 Metrics — Default vs. Custom
AWS services push metrics directly to CloudWatch without any configuration.
| Service | Key Default Metrics |
|---|---|
| EC2 | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps, DiskWriteOps |
| EBS | VolumeReadOps, VolumeWriteOps, VolumeQueueLength, VolumeReadBytes, VolumeWriteBytes |
| ALB | HealthyHostCount, UnHealthyHostCount, RequestCount, RequestCountPerTarget, HTTPCode_ELB_5xx_Count |
| RDS | CPUUtilization, DatabaseConnections, FreeStorageSpace, ReadIOPS, WriteIOPS |
| SQS | ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage, ApproximateNumberOfMessagesDelayed |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| ELB | HealthyHostCount, UnHealthyHostCount, Latency, HTTPCode_Backend_5xx |
The following are NOT available by default and require the CloudWatch agent installed on the EC2 instance: memory utilization (RAM usage, swap usage), disk space and disk utilization (percentage used, bytes available), individual process metrics, custom application log files, Windows Event Logs, and Windows Performance Counters.
Exam Tip: DiskReadBytes and DiskWriteBytes are EC2-level metrics, NOT EBS-level; they report I/O only for instance store volumes. The correct EBS metrics are VolumeReadBytes and VolumeWriteBytes. If an alarm meant to watch an EBS volume uses the EC2 disk metrics, reconfigure it to use the EBS volume metrics.
1.2 Installing and Configuring the CloudWatch Agent
Installation methods (in order of scalability):
- Systems Manager Run Command: best for fleets; no SSH needed; use the AWS-ConfigureAWSPackage document.
- EC2 user data: runs on first launch only.
- Manual SSH: single instances only.
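The EC2 instance role MUST include CloudWatchAgentServerPolicy (allows the agent to publish metrics and logs and to read its configuration from Parameter Store) and AmazonSSMManagedInstanceCore (allows Systems Manager to manage the instance, for example through Run Command and State Manager).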
Agent Configuration File
The configuration file (/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json) controls what is collected. Store it centrally in SSM Parameter Store (AmazonCloudWatch-linux, AmazonCloudWatch-windows) so a fleet of hundreds of instances can share a single configuration — update once, apply everywhere via State Manager.
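As a rough sketch of what such a shared configuration might contain, the Python below builds a minimal agent config that collects memory and disk-space utilization and serializes it to JSON, ready to be stored in a Parameter Store parameter. The function name, the default interval, and the choice of mount points are illustrative assumptions, not part of the original notes.

```python
import json

def build_agent_config(interval_seconds=60, mount_points=("/",)):
    """Return a minimal CloudWatch agent config (illustrative, not exhaustive)."""
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "metrics_collected": {
                # Memory is not a default EC2 metric, so the agent must collect it.
                "mem": {"measurement": ["mem_used_percent"]},
                # Disk space utilization for the listed mount points.
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": list(mount_points),
                },
            }
        },
    }

config_json = json.dumps(build_agent_config(), indent=2)
print(config_json)
```

The resulting JSON string is what would be saved as the value of a parameter such as AmazonCloudWatch-linux.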
`procstat` Plugin — Monitor Individual Processes
Used to identify which specific process is consuming CPU or memory during incidents:
{
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pattern": "nginx",
          "measurement": ["cpu_usage", "memory_rss", "pid_count"]
        }
      ]
    }
  }
}
`append_dimensions` — Add Custom Dimensions
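Adds dimensions to collected metrics. The top-level append_dimensions field under metrics accepts only the EC2 metadata dimensions (ImageId, InstanceId, InstanceType, AutoScalingGroupName); other keys placed there are ignored. Custom dimensions for tagging with environment, team, or application go in an append_dimensions field inside an individual metrics_collected section:
{
  "agent": { "metrics_collection_interval": 60 },
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "append_dimensions": {
          "Environment": "Production",
          "Application": "WebApp"
        }
      }
    }
  }
}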
Use SSM Run Command with the agent's append-config action (-a append-config) to add additional metrics to specific instances without rebuilding the config from scratch:
amazon-cloudwatch-agent-ctl -a append-config -m ec2 -c ssm:AmazonCloudWatch-DHCP -s
1.3 Alarm Types
Static Threshold Alarms
Alert when a metric crosses a fixed value. Best when the normal range is well-understood. Example: CPU > 80% for 5 minutes.
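To make the "CPU > 80% for 5 minutes" example concrete, here is a minimal sketch that builds the arguments for the CloudWatch PutMetricAlarm API as a plain dictionary. The alarm name, instance ID, and SNS topic ARN are hypothetical placeholders, and the 1-minute period assumes detailed monitoring is enabled on the instance (basic monitoring publishes at 5-minute granularity).

```python
def cpu_alarm_params(instance_id, sns_topic_arn, threshold=80.0):
    """Build arguments for a static-threshold CPU alarm (illustrative values)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,              # seconds per datapoint (assumes detailed monitoring)
        "EvaluationPeriods": 5,    # evaluate the 5 most recent datapoints
        "DatapointsToAlarm": 5,    # all 5 must breach, i.e. 5 minutes above threshold
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = cpu_alarm_params(
    "i-1234567890abcdef0",                              # hypothetical instance
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",    # hypothetical topic
)
print(params["AlarmName"], params["Period"] * params["EvaluationPeriods"], "seconds")
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.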
Anomaly Detection Alarms
Uses ML to model expected behavior based on historical patterns. Creates a band of expected values (upper and lower bounds). Adapts automatically to daily/weekly patterns and seasonal trends.
Exam Tip: When the question says "application team does not know expected usage or growth," use anomaly detection alarms rather than static threshold alarms.
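For comparison with a static alarm, the sketch below builds PutMetricAlarm arguments for an anomaly detection alarm. Instead of a fixed Threshold, the alarm points at a band expression through ThresholdMetricId. The alarm name, the band width of 2 standard deviations, and the evaluation settings are illustrative assumptions.

```python
def anomaly_alarm_params(alarm_name, namespace, metric_name, dimensions, band_width=2):
    """Build arguments for an anomaly detection alarm (no fixed Threshold key)."""
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric leaves the expected band in either direction.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",   # refers to the expression defined below
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }

params = anomaly_alarm_params(
    "cpu-anomaly",   # hypothetical alarm name
    "AWS/EC2",
    "CPUUtilization",
    [{"Name": "InstanceId", "Value": "i-1234567890abcdef0"}],
)
print(params["Metrics"][1]["Expression"])
```

A larger band width makes the band wider and the alarm less sensitive; a smaller value does the opposite.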
Composite Alarms
Combines multiple alarms using AND/OR/NOT logic and sends ONE notification only when the rule expression as a whole evaluates to true (for an AND rule, when every child alarm is in ALARM). Reduces alarm noise significantly.
AlarmRule: "ALARM(DiskUtilizationAlarm) AND ALARM(DiskReadOpsAlarm)"
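A small helper can assemble rule expressions like the one above from a list of child alarm names. This is only a sketch: the function name is made up for illustration, and it assumes alarm names that need no quoting (names containing spaces or special characters would have to be wrapped in double quotes inside ALARM(...)).

```python
def alarm_rule(alarm_names, operator="AND"):
    """Join child alarm names into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f"ALARM({name})" for name in alarm_names)

rule = alarm_rule(["DiskUtilizationAlarm", "DiskReadOpsAlarm"])
print(rule)
```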
Missing Data Treatment
When an alarm's metric does not report data, choose how to treat it: notBreaching (treat as OK), breaching (treat as in alarm), ignore (maintain current state), or missing (evaluation insufficient).
Exam Tip: When monitoring for a file that SHOULD arrive every hour via Lambda trigger: create an alarm on Lambda Invocations <= 0 over a 1-hour period, and set missing data treatment to "breaching". Lambda publishes no Invocations datapoints when the function is never invoked, so missing data means the file never arrived.
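The effect of each missing-data setting can be illustrated with a toy model. The function below evaluates a single period of a "metric below threshold" alarm; it is a deliberate simplification (real CloudWatch evaluates a window of several datapoints), and the function and argument names are invented for this sketch.

```python
def evaluate_period(datapoint, threshold, treat_missing, current_state):
    """Toy one-period evaluation of an alarm that fires when datapoint < threshold.

    datapoint is None when the metric reported nothing for the period.
    """
    if datapoint is not None:
        return "ALARM" if datapoint < threshold else "OK"
    if treat_missing == "breaching":
        return "ALARM"               # missing data counts as a violation
    if treat_missing == "notBreaching":
        return "OK"                  # missing data counts as healthy
    if treat_missing == "ignore":
        return current_state         # keep whatever state the alarm was in
    if treat_missing == "missing":
        return "INSUFFICIENT_DATA"   # nothing to evaluate
    raise ValueError(f"unknown treatment: {treat_missing}")

# No Lambda invocations in the hour means no datapoint at all.
for treatment in ("breaching", "notBreaching", "ignore", "missing"):
    print(treatment, "->", evaluate_period(None, 1, treatment, "OK"))
```

Only the "breaching" setting turns the silent hour into an alarm, which is why it suits the missing-file scenario.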
1.4 CloudWatch Dashboards
Automating Dashboard Deployment
To deploy identical dashboards across every application deployment: create the dashboard manually, export it as JSON (Dashboard → Actions → View/Edit source), then include it in a CloudFormation template:
Resources:
  AppDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: !Sub "${AppName}-dashboard"
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", "${EC2Instance}"]],
                "region": "${AWS::Region}",
                "title": "EC2 CPU"
              }
            }
          ]
        }
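When the dashboard body is generated rather than exported by hand, the same JSON structure can be produced programmatically. The sketch below builds one CPU widget per instance and stacks the widgets vertically; the function name, widget sizes, and default region are assumptions chosen for illustration.

```python
import json

def dashboard_body(instance_ids, region="us-east-1"):
    """Return a dashboard body JSON string with one CPU widget per instance."""
    widgets = []
    for row, instance_id in enumerate(instance_ids):
        widgets.append({
            "type": "metric",
            "x": 0,
            "y": row * 6,      # stack widgets vertically, 6 grid units each
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization", "InstanceId", instance_id]],
                "region": region,
                "stat": "Average",
                "period": 300,
                "title": f"EC2 CPU ({instance_id})",
            },
        })
    return json.dumps({"widgets": widgets})

body = dashboard_body(["i-1234567890abcdef0", "i-0fedcba0987654321"])
print(body)
```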
Metrics Explorer for Auto Scaling Groups
Instead of manually updating dashboards when new instances launch, use CloudWatch Metrics Explorer filtered by aws:autoscaling:groupName tag. This creates dynamic visualizations that automatically include new instances as they launch — no Lambda function needed.
1.5 Logs and Logs Insights
Every application should send logs to a CloudWatch Logs log group with a retention policy set on the log group (not on individual log streams).
CloudWatch Logs Insights is a purpose-built query language for analyzing logs without exporting them:
# Count Lambda errors by type (set the query time range to the last 7 days;
# assumes structured logs that expose an errorType field)
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by errorType
| sort errorCount desc

# Find the top 5 destination IPs by bytes through a NAT gateway (run against VPC Flow Logs)
fields dstAddr, bytes
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 5
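The time range is not part of the query text; it is supplied alongside it. As a hedged sketch, the helper below builds the arguments for the CloudWatch Logs StartQuery API, with the start and end of the range expressed in epoch seconds. The log group name is a hypothetical placeholder and the helper name is invented for this example.

```python
from datetime import datetime, timedelta, timezone

def insights_query_params(log_group, query, days=7, now=None):
    """Build StartQuery arguments covering the last `days` days."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(days=days)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),   # epoch seconds
        "endTime": int(end.timestamp()),
        "queryString": query,
    }

QUERY = """fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by errorType
| sort errorCount desc"""

fixed_now = datetime(2024, 1, 8, tzinfo=timezone.utc)
params = insights_query_params("/aws/lambda/my-function", QUERY, now=fixed_now)
print(params["startTime"], params["endTime"])
```

With boto3 available, these arguments could be passed to `boto3.client("logs").start_query(**params)`, and the results fetched later with `get_query_results`.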
| Tool | When to Use |
|---|---|
| Logs Insights | Query directly in CloudWatch Logs; no export needed; faster setup |
| Athena | Query data exported to S3; better for very large datasets |
1.6 Synthetics Canaries
| Canary Type | What It Does |
|---|---|
| Heartbeat monitor | Regularly loads a URL, captures screenshots and HTTP response |
| API canary | Tests REST API endpoints for availability and correct responses |
| Broken link checker | Scans all links on a page for broken URLs |
| GUI workflow builder | Simulates multi-step user journeys (login → add to cart → checkout) |
Key Concept: Route 53 health checks perform a simple endpoint availability check. CloudWatch Synthetics performs full browser simulation with screenshots and user journey validation. Use Synthetics when you need to "follow the same routes and actions as a customer."
2. AWS CloudTrail
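CloudTrail records API activity in your AWS account: who made the call (IAM identity), when, from what IP, what was called, and whether it succeeded or failed. Management events are recorded by default; data events (such as S3 object-level operations and Lambda invocations) must be enabled explicitly on a trail.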
2.1 CloudTrail Fundamentals
| Type | Coverage |
|---|---|
| Single-region trail | Only records events in one region |
| Multi-region trail (recommended) | Records events in all regions; sends all to one S3 bucket |
| Organization trail | Records events for all accounts in an AWS Organization |
Log File Integrity Validation
Enable integrity validation on the trail itself (not on the S3 bucket). Every hour, CloudTrail creates a digest file in S3 containing SHA-256 hashes of every log file delivered in that hour. The digest file itself is signed with CloudTrail's private key.
aws cloudtrail validate-logs \
  --trail-arn arn:aws:cloudtrail:us-east-1:123456789012:trail/MyTrail \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z
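The core of the check is a hash comparison: recompute the SHA-256 of each log file and compare it with the value recorded in the digest. The sketch below shows only that one step under stated assumptions; it does not verify the digest file's signature or the chain of digests, which the CLI also checks. The entry layout (hashValue, hashAlgorithm) is assumed to follow the digest file's per-log-file records, and the sample data is made up.

```python
import hashlib

def log_file_matches_digest(log_bytes, digest_entry):
    """Return True if the log content hashes to the value recorded in the digest entry."""
    if digest_entry.get("hashAlgorithm") != "SHA-256":
        raise ValueError("this sketch only handles SHA-256 entries")
    actual = hashlib.sha256(log_bytes).hexdigest()
    return actual == digest_entry["hashValue"].lower()

# Made-up log content and a digest entry computed from it.
original = b'{"Records": [{"eventName": "DescribeInstances"}]}'
entry = {
    "hashAlgorithm": "SHA-256",
    "hashValue": hashlib.sha256(original).hexdigest(),
}

print(log_file_matches_digest(original, entry))             # unchanged file
print(log_file_matches_digest(original + b" ", entry))      # modified file
```

Any change to the log content, even a single byte, produces a different hash, which is what makes tampering detectable.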
Exam Tip: Enable log file integrity on the trail itself, not on the S3 bucket destination. CloudWatch Logs streaming, S3 versioning, or MFA Delete alone do NOT provide integrity validation.
Investigating Compromised Access Keys
When an IAM access key is accidentally pushed to a public GitHub repository: go to CloudTrail Event History (which covers the last 90 days of management events), filter by AWS access key, enter the compromised key ID, and see every API call made with that key, from what IPs, at what times.
2.2 CloudTrail vs. CloudWatch Logs vs. VPC Flow Logs
| Source | What It Captures | Query Tool |
|---|---|---|
| CloudTrail | AWS API calls (who did what to which resource) | Event History, Athena |
| CloudWatch Logs | Application logs, OS logs, Lambda logs | Logs Insights |
| VPC Flow Logs | Network traffic metadata (IP, port, bytes, action) | Athena, Logs Insights |
| S3 Server Access Logs | HTTP requests to S3 objects (source IP, operation, status) | Athena |
| ALB Access Logs | HTTP request details including source IP, target, response code | Athena |
3. Amazon EventBridge
EventBridge routes AWS events and scheduled triggers to targets like Lambda, SNS, and SSM Automation — enabling event-driven automation without custom polling code.
3.1 EventBridge Rules
Event Pattern Rules
Match and route specific AWS events:
{
"source": ["aws.ec2"],
"detail-type": ["EC2 Instance State-change Notification"],
"detail": {
"state": ["terminated", "stopped"]
}
}
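The matching logic behind a pattern like this can be sketched in a few lines: every field named in the pattern must be present in the event, and the event's value must be one of the values listed. The matcher below is a simplified model written for illustration; it handles exact-value lists and nested objects only, not the prefix, numeric, or anything-but operators that EventBridge also supports. The sample events are made up.

```python
def matches(pattern, event):
    """Simplified event-pattern match: exact values and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # expected is a list of allowed values for this field.
            return False
    return True

pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated", "stopped"]},
}

stopped = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-1234567890abcdef0", "state": "stopped"},
}
running = {**stopped, "detail": {"instance-id": "i-1234567890abcdef0", "state": "running"}}

print(matches(pattern, stopped), matches(pattern, running))
```

Fields in the event that the pattern does not mention, such as instance-id here, are ignored.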
Scheduled Rules
Use cron or rate expressions to trigger Lambda, SSM Automation, and more:
- rate(1 hour): every hour
- cron(0 20 * * ? *): every day at 8 PM UTC
- cron(0 9 ? * MON-FRI *): weekdays at 9 AM UTC
EventBridge API Destinations
Allows forwarding events to external HTTPS endpoints (third-party ticketing systems, webhooks). Configure a connection (authentication headers, API keys), create an API destination with the HTTPS endpoint URL, and add an input transformer to reshape the event JSON before sending.
3.2 Common EventBridge Patterns
| Business Requirement | EventBridge Pattern |
|---|---|
| Notify when EC2 starts/stops | EC2 State-change event → SNS topic |
| Nightly report generation | Scheduled rule → Lambda function |
| Re-enable CloudTrail when disabled | CloudTrail config change event → SSM Automation runbook |
| Spot Instance interruption notification | EC2 Spot Interruption Warning → SNS |
| Auto-remediate non-compliant resources | Config → EventBridge → SSM Automation |
4. AWS Systems Manager — Monitoring and Remediation
SSM provides fleet management, remediation, and secure instance access — all without requiring SSH or custom code.
4.1 SSM for Monitoring and Remediation
Key Concept: The exam repeatedly tests "without custom code" scenarios. The correct answer is usually an AWS Config managed rule with an SSM Automation document as the remediation action, not EventBridge → Lambda (custom code).
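Common SSM Automation runbooks for remediation: AWS-RestartEC2Instance, AWS-StopEC2Instance, AWS-StartEC2Instance, AWS-TerminateEC2Instance, and AWS-ReleaseElasticIP. AWS-ConfigureAWSPackage, used to install packages such as the CloudWatch agent, is a Run Command document rather than an Automation runbook.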
Session Manager for Audit Logging
Configure Session Manager to stream all session data to CloudWatch Logs. Every command typed and output returned is logged. No SSH, no bastion hosts, no security group inbound rules needed.
State Manager for Consistent Configuration
Use SSM State Manager to continuously enforce desired state across a fleet. Create an association to run AmazonCloudWatch-ManageAgent on a schedule (e.g., every 30 minutes), targeting instances by tag. This ensures the CloudWatch agent stays configured consistently — even after instance replacement.
5. EC2 and EBS Performance Optimization
5.1 EC2 Performance
Burstable Instances (T-Series) and CPU Credits
T2, T3, T3a, and T4g instances earn CPU credits when running below baseline and spend credits when above. Symptoms of credit exhaustion include the CPUCreditBalance metric falling to zero, CPU pinned at baseline, a gradual increase in response times, and I/O metrics such as VolumeReadOps dropping to less than 10% of peak even though request volume has not changed.
Fix: Enable Unlimited mode to allow bursting beyond the credit balance (surplus credits can incur additional charges):
aws ec2 modify-instance-credit-specification \
  --instance-credit-specifications "InstanceId=i-1234567890abcdef0,CpuCredits=unlimited"
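The credit mechanics can be worked through numerically. One CPU credit equals one vCPU running at 100% for one minute, so an instance spends vCPUs × utilization × 60 credits per hour while earning at a fixed hourly rate. The sketch below estimates how long a credit balance lasts in standard mode; the example figures (2 vCPUs, 12 credits earned per hour, a starting balance of 288) are illustrative values in the style of a t3.micro, and the function name is invented.

```python
def hours_until_credits_exhausted(balance, vcpus, utilization_percent, earn_rate_per_hour):
    """Estimate hours until the CPU credit balance hits zero.

    Returns None when the instance earns at least as fast as it spends.
    """
    # 1 credit = 1 vCPU at 100% for 1 minute, so 60 credits per vCPU-hour at 100%.
    spend_per_hour = vcpus * utilization_percent * 60 / 100
    net_drain = spend_per_hour - earn_rate_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain

# At 50% average utilization: spends 60/hour, earns 12/hour, drains 48/hour.
print(hours_until_credits_exhausted(288, 2, 50, 12))
# At the 10% baseline: spends exactly what it earns, so the balance never drains.
print(hours_until_credits_exhausted(288, 2, 10, 12))
```

Once the balance is gone, a standard-mode instance is held at baseline, which is the "CPU pinned at baseline" symptom described above.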
Auto Scaling — Scaling Policies
| Policy | Behavior | Best For |
|---|---|---|
| Simple Scaling | One alarm → one action; cooldown applies | Legacy use cases |
| Step Scaling | Multiple thresholds → proportional steps | Unpredictable traffic needing proportional response |
| Target Tracking | Maintains a target metric value automatically | Most workloads (e.g., keep CPU at 50%) |
| Scheduled Scaling | Pre-emptively change capacity at specific times | Predictable peaks (business hours, weekly patterns) |
When scale-out is too slow, diagnose whether the delay is in launching, bootstrapping, or becoming healthy. For long user data scripts, pre-bake AMIs. For long health check grace periods, reduce them. Add a warm pool to keep pre-initialized instances ready to launch instantly.
5.2 EBS Performance Deep Dive
Volume Types
| Volume Type | IOPS | Throughput | Key Characteristic |
|---|---|---|---|
| gp2 | 3 IOPS/GB; max 16,000 | 250 MB/s | IOPS tied to size; legacy |
| gp3 | Baseline 3,000; max 16,000 (configurable) | 125–1,000 MB/s | IOPS independent of size; recommended |
| io1/io2 | Up to 64,000 (io2 Block Express: 256,000) | Up to 4,000 MB/s | High-performance; highest cost |
| st1 | Max 500 | Up to 500 MB/s | Sequential access; not bootable |
| sc1 | Max 250 | Up to 250 MB/s | Cold data; cheapest EBS option |
Diagnosing EBS Problems
High VolumeQueueLength (consistently > 1) indicates I/O requests are waiting due to IOPS saturation. Fix: increase provisioned IOPS by modifying to io2 or increasing gp3 IOPS. On gp2, increase volume size to gain more IOPS or migrate to gp3.
Volume restored from snapshot is slow initially because EBS volumes restored from snapshots have blocks stored in S3 — first access to each block requires a round trip to S3. Fix: enable Fast Snapshot Restore (FSR) on the snapshot so blocks are pre-initialized.
Exam Tip: Need 10,000 IOPS on gp2? At 3 IOPS/GB you must provision about 3,334 GB, which is expensive. On gp3, you can configure 10,000 IOPS on a volume as small as 20 GB (gp3 allows up to 500 IOPS per GB), which is far cheaper. Prefer gp3 when the requirement is IOPS rather than capacity.
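The gp2 sizing arithmetic can be checked with a short calculation based on the figures in the volume table (3 IOPS per GB, a 16,000 IOPS ceiling) plus the 100 IOPS floor that gp2 applies to small volumes. The function name is an illustrative choice.

```python
import math

GP2_IOPS_PER_GB = 3
GP2_MIN_IOPS = 100       # small gp2 volumes still get 100 IOPS
GP2_MAX_IOPS = 16_000

def gp2_size_for_iops(target_iops):
    """Smallest gp2 volume size (GB) whose baseline meets target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 cannot exceed 16,000 IOPS at any size")
    if target_iops <= GP2_MIN_IOPS:
        return 1         # the minimum volume already provides 100 IOPS
    # Round up: a fractional GB cannot be provisioned.
    return math.ceil(target_iops / GP2_IOPS_PER_GB)

print(gp2_size_for_iops(10_000))
print(gp2_size_for_iops(16_000))
```

Rounding up matters here: 3,333 GB yields only 9,999 IOPS, so 3,334 GB is the smallest size that reaches 10,000.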
5.3 EFS Performance Deep Dive
Performance Modes
| Mode | Ops/Sec Limit | Latency | Best For |
|---|---|---|---|
| General Purpose (default) | 7,000 ops/sec | Lower | Most workloads, latency-sensitive apps |
| Max I/O | Unlimited | Higher | Massively parallel workloads |
Note: You cannot switch an existing General Purpose file system to Max I/O in place — you must create a new EFS with Max I/O and migrate data.
Throughput Modes
| Mode | How Throughput Is Determined |
|---|---|
| Bursting | Scales with storage size; earns burst credits when below baseline |
| Provisioned | You specify throughput independently of storage size |
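Exam Tip: PercentIOLimit = 100% means a General Purpose file system is at its operations-per-second limit. The fix is the Max I/O performance mode, which requires creating a new file system and migrating the data. Provisioned Throughput addresses a different symptom: exhausted burst credits (BurstCreditBalance near zero) or throughput pinned at the permitted limit.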
5.4 S3 Performance and Cost Optimization
S3 Transfer Acceleration
Routes uploads through CloudFront edge locations → AWS global backbone → S3 bucket. Uses a special endpoint: bucket.s3-accelerate.amazonaws.com. Best for global users uploading to a single centralized S3 bucket.
Multipart Upload: Recommended for objects > 100 MB, required for > 5 GB. Use aws s3 cp (automatically uses multipart) rather than aws s3api put-object (single-part only). Clean up incomplete multipart uploads with an S3 Lifecycle rule → AbortIncompleteMultipartUpload after 7 days.
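How an object is split into parts follows from a few S3 limits: parts are at least 5 MiB (except the last), at most 5 GiB, and an upload can have at most 10,000 parts. The sketch below picks a part size and part count under those limits. The 8 MiB default mirrors the AWS CLI's default chunk size; the function name and structure are assumptions for illustration, not how the CLI is actually implemented.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART = 5 * MIB
MAX_PART = 5 * GIB
MAX_PARTS = 10_000

def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def plan_multipart(object_size, preferred_part_size=8 * MIB):
    """Return (part_size, part_count) for a multipart upload of object_size bytes."""
    part_size = max(preferred_part_size, MIN_PART)
    # Grow the part size if the object would otherwise need more than 10,000 parts.
    part_size = max(part_size, ceil_div(object_size, MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object is too large for a single multipart upload")
    return part_size, ceil_div(object_size, part_size)

print(plan_multipart(100 * MIB))     # small object: default part size
print(plan_multipart(1024 * GIB))    # 1 TiB object: part size grows to fit 10,000 parts
```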
S3 Storage Classes
| Storage Class | Retrieval | Min Duration | Use Case |
|---|---|---|---|
| Standard | Immediate | None | Frequently accessed data |
| Standard-IA | Immediate | 30 days | Infrequent, needs fast retrieval |
| One Zone-IA | Immediate | 30 days | Infrequent, single AZ, lower cost |
| Intelligent-Tiering | Immediate | None | Unknown or changing access patterns |
| Glacier Instant | Immediate | 90 days | Archive with instant retrieval requirement |
| Glacier Flexible | Minutes to hours | 90 days | Archive; retrieval under 5 hours |
| Glacier Deep Archive | 12–48 hours | 180 days | Long-term compliance archive |
5.5 RDS Performance Optimization
RDS Performance Insights
Visual dashboard showing database load as Average Active Sessions (AAS), broken down by wait events, SQL statements, hosts, and users. Available for MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Aurora.
| Tool | What It Shows |
|---|---|
| Performance Insights | Database-level metrics (query performance, waits, connections) |
| Enhanced Monitoring | OS-level metrics (CPU, memory, disk, network at 1–60 second intervals) |
RDS Proxy — Connection Pooling
Problem: Lambda functions exhaust the max_connections parameter on RDS because each invocation opens a new connection. Solution: RDS Proxy maintains a pool of persistent connections to the database. Lambda connects to the proxy, which multiplexes thousands of Lambda connections into the pool. Supports MySQL, PostgreSQL, MariaDB, Aurora MySQL, and Aurora PostgreSQL.
Aurora Read Replica vs. RDS Read Replica
| Feature | Aurora Replica | RDS Read Replica |
|---|---|---|
| Replication lag | Typically < 100 ms | Seconds to minutes |
| Promotion to primary | Instant (shared storage) | Requires I/O replay (minutes) |
| Connection endpoint | Reader endpoint (auto load balances) | Separate read endpoint per replica |
| Max replicas | 15 | 5 (up to 15 for MySQL, MariaDB, and PostgreSQL) |
| Auto Scaling | Yes | No |
6. Cost and Billing Monitoring
6.1 Cost Allocation and Reporting
Cost Allocation Tags
Apply tags to resources (e.g., CostCenter: Engineering), then activate them in the Billing console of the payer/management account as user-defined cost allocation tags. Tags appear in Cost Explorer after activation (up to 24 hours). The createdBy tag is AWS-generated and shows which IAM identity created each resource.
Exam Tip: Tags must be activated in the payer account. Activating in member accounts has no effect on cost reporting.
Billing and Cost Tools
| Tool | Purpose |
|---|---|
| AWS Budgets | Alert when actual or forecasted spending exceeds a threshold; per-team, per-project budgets |
| Cost and Usage Report (CUR) | Most granular billing data; hourly/daily/monthly; delivered to S3; queryable with Athena |
| AWS Compute Optimizer | Analyzes CloudWatch metrics (14+ days) for right-sizing EC2, EBS, Lambda recommendations |
| Personal Health Dashboard | Personalized events affecting YOUR specific resources (retirements, maintenance, availability) |
Exam Tips & Quick Reference
Scenario-to-Answer Mapping
| Scenario Keyword / Requirement | Correct Answer |
|---|---|
| "Without custom code" | Config rule + SSM Automation (not Lambda) |
| "Individual process CPU monitoring" | CloudWatch agent procstat plugin |
| "Disk space / memory on EC2" | CloudWatch agent (not default metrics) |
| "File didn't arrive — notify" | CloudWatch alarm, Invocations=0, missing data=breaching |
| "Don't know expected baseline" | CloudWatch anomaly detection |
| "Both metric A AND metric B must alarm" | Composite alarm |
| "Same dashboard across all deployments" | Export dashboard JSON → CloudFormation DashboardBody |
| "All ASG instances in one dashboard" | CloudWatch Metrics Explorer with ASG tag filter |
| "Verify CloudTrail logs not tampered" | Enable log file integrity validation on trail |
| "Who made API calls with stolen key" | CloudTrail Event History, filter by access key |
| "EBS volume slow after snapshot restore" | Enable Fast Snapshot Restore (FSR) |
| "T-series instance CPU throttled" | Enable Unlimited mode |
| "EFS PercentIOLimit = 100%" | Migrate to a new file system in Max I/O mode |
| "RDS too many connections (Lambda)" | Deploy RDS Proxy |
| "Daily cost report to S3" | Cost and Usage Report (CUR) |
| "Alert when spend forecast exceeds budget" | AWS Budgets |
| "Identify which developer created resources" | Activate createdBy tag + Cost Explorer |
| "Upcoming EC2 hardware maintenance" | Personal Health Dashboard |
| "Right-size EC2 recommendations" | AWS Compute Optimizer |
Common Traps
- Default EC2 metrics include disk ops, not disk space: DiskReadOps is available by default; disk space utilization is NOT, because it requires the CloudWatch agent.
- Anomaly detection vs. static threshold: If the question says the team doesn't know what "normal" looks like, anomaly detection is preferred over an arbitrary threshold.
- Missing data treatment: Only set missing data to "breaching" when the absence of data is itself a problem (file not arriving, Lambda not running). For all other cases use the default.
- Cost tags in wrong account: Activating tags in a member account does nothing — always activate in the management/payer account.
Key Terms — Domain 1
| Term | One-Line Definition |
|---|---|
| CloudWatch agent | Software installed on EC2 to collect memory, disk, and process metrics |
| procstat plugin | CloudWatch agent plugin for per-process metrics (CPU, memory, PID count) |
| Composite alarm | Single alarm that combines multiple CloudWatch alarms with AND/OR logic |
| Anomaly detection alarm | ML-based alarm that alerts when a metric deviates from its expected band |
| CloudWatch Synthetics | Scheduled canary scripts that simulate user browser interactions |
| Fast Snapshot Restore (FSR) | EBS feature that pre-initializes snapshot blocks to avoid first-access latency |
| RDS Proxy | Connection pooler between Lambda/apps and RDS that prevents connection exhaustion |
| Cost Allocation Tags | Resource tags activated in the payer account to appear in billing reports |
| Compute Optimizer | AWS service that analyzes usage metrics and recommends right-sized instance types |
| Personal Health Dashboard | Personalized view of AWS service events affecting your specific resources |
End of Domain 1. Continue to Domain 2: Reliability and Business Continuity →