
EC2 Troubleshooting


Hello World!

What is the best way to troubleshoot a server bottleneck in memory, storage, CPU, or network? Coming from a Linux background, I only know how to check the server from inside the OS, and I'm not sure how to approach this on AWS.

Any help is appreciated! Thanks :)

AWS
asked 2 months ago, 47 views
2 Answers

When troubleshooting server bottlenecks on AWS EC2 instances, you can combine your Linux expertise with AWS-specific tools for a comprehensive approach:

  1. System-level monitoring: Continue using familiar Linux tools like top, htop, free -m, vmstat, iotop, and iftop to monitor CPU, memory, disk I/O, and network usage from within the instance.

  2. CloudWatch metrics: AWS CloudWatch provides metrics for EC2 instances without any agent, including CPU utilization, network throughput, and disk operations. However, memory usage and disk-space usage aren't included by default.

  3. CloudWatch agent: Install the CloudWatch agent to collect and monitor detailed system-level metrics, including memory usage, which isn't available in the default EC2 metrics. This is particularly useful for identifying memory constraints.

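  4. Log analysis: Check system and application logs (/var/log/syslog on Debian/Ubuntu or /var/log/messages on Amazon Linux/RHEL, /var/log/dmesg or the dmesg command, and application-specific logs) for clues about performance issues, such as OOM-killer messages or I/O errors.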

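  5. Resource constraints: For T-series instances (like t2.micro, t3.micro), monitor the CPUCreditBalance and CPUCreditUsage metrics. These are burstable instances: they spend CPU credits to burst above a baseline level, so a drained credit balance can look like a sudden CPU bottleneck.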

  6. Instance sizing: Verify if your workload is appropriate for the instance type. Running resource-intensive applications on smaller instances can lead to performance issues.

  7. EC2Rescue tool: For more serious issues, AWS provides EC2Rescue which can help diagnose and troubleshoot problems. This can be run manually or automatically using AWS Systems Manager Automation with the AWSSupport-ExecuteEC2Rescue runbook.

  8. Swap file: If memory is the bottleneck, consider adding a swap file, though this is a temporary solution and might impact performance.
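As a rough sketch of the memory check behind items 2, 3, and 8: the function below derives a used-memory percentage from /proc/meminfo, which is the kind of figure the default EC2 metrics do not report. The function names, the Custom/System namespace, and the MemoryUsedPercent metric name are illustrative assumptions, not anything the CloudWatch agent defines; the agent remains the supported route for ongoing collection.

```shell
# Sketch: compute used-memory percentage from a meminfo-format file.
# Defaults to /proc/meminfo; a path argument allows testing with sample data.
# Assumes the MemAvailable field is present (Linux kernel 3.14 or later).
mem_used_pct() {
  local file="${1:-/proc/meminfo}"
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%.1f\n", (total - avail) * 100 / total
    }
  ' "$file"
}

# Hypothetical helper: publish the value as a custom CloudWatch metric.
# Namespace and metric name are made up for illustration.
publish_mem_metric() {
  local instance_id="$1"
  aws cloudwatch put-metric-data \
    --namespace "Custom/System" \
    --metric-name "MemoryUsedPercent" \
    --unit Percent \
    --value "$(mem_used_pct)" \
    --dimensions "InstanceId=${instance_id}"
}
```

Running mem_used_pct with no argument on the instance prints the current figure; publish_mem_metric needs the AWS CLI and an instance role that allows cloudwatch:PutMetricData.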

Common bottleneck scenarios include:

  • Instances appearing to run normally but becoming unresponsive (often memory-related)
  • Periodic high system load despite normal-looking metrics (could be application-specific tasks or I/O bottlenecks)
  • Slow response times despite low CPU/memory usage (possibly network latency or application code issues)

If issues persist after your investigation, consider upgrading to a larger instance type or optimizing your application code and resource usage.
Sources
Troubleshoot issues with Amazon EC2 instances - Amazon Elastic Compute Cloud
Run the EC2Rescue tool on unreachable instances - AWS Systems Manager
EC2 Ubuntu instance stops working everyday | AWS re:Post
EC2 Server Late response issue | AWS re:Post
AWS EC2 T2 Medium Performance Issue | AWS re:Post

answered 2 months ago
EXPERT
reviewed 2 months ago

Step-by-Step EC2 Bottleneck Troubleshooting Guide

  1. Start with AWS CloudWatch. CloudWatch provides instance-level metrics even if you don't log into the instance:

     • CPUUtilization: if it stays above 80% for long periods, CPU is likely a bottleneck.
     • DiskReadOps / DiskWriteOps: high values might indicate I/O issues. These cover instance store volumes only; for EBS volumes, use the EBS metrics such as VolumeReadOps / VolumeWriteOps.
     • NetworkIn / NetworkOut: check for bandwidth saturation.
     • StatusCheckFailed: shows instance-level issues (hardware or networking).

     Note: Enable detailed monitoring (1-minute granularity) if it's disabled.

  2. Check OS-Level Metrics (inside EC2 Linux). From your Linux background:

     • top, htop, vmstat, iostat, free -m, df -h → CPU, memory, swap, disk I/O usage.
     • netstat, ss, iftop, nethogs → network traffic analysis.

     Example:

         top -o %MEM   # Sort by memory usage
         iotop         # Real-time I/O usage (if installed)
         dstat         # All-in-one overview (needs to be installed)

  3. Enable EC2 Instance-Level Diagnostics. Install the CloudWatch agent to push memory and disk metrics to CloudWatch:

         sudo yum install amazon-cloudwatch-agent
         sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

     Log stream monitoring with CloudWatch Logs is optional but recommended.

Consider SSM Agent for access without SSH.
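For the CloudWatch check in step 1, here is a minimal sketch of pulling CPUUtilization with the AWS CLI and testing whether it stayed above a threshold. The function names are hypothetical, the one-hour window and 5-minute period are arbitrary choices, and the date invocation assumes GNU date (as found on most Linux distributions).

```shell
# Fetch 5-minute average CPUUtilization for the past hour, one value per line.
# Datapoints are not guaranteed to arrive in time order, but order does not
# matter for the all-above-threshold check below.
cpu_averages() {
  local instance_id="$1"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Average \
    --query 'Datapoints[].Average' \
    --output text | tr '\t' '\n'
}

# Read one CPU reading per line on stdin. Exit status 0 only if there is at
# least one reading and every reading exceeds the threshold (default 80).
cpu_sustained_high() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    NF { n++; if ($1 + 0 <= t + 0) low++ }
    END { exit !(n > 0 && low == 0) }
  '
}
```

A possible invocation, with a placeholder instance ID: cpu_averages i-0123456789abcdef0 | cpu_sustained_high 80 && echo "CPU above 80% for the whole hour". Keeping the threshold test separate from the AWS call means it can be fed sar or vmstat output just as easily.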

  4. Review EC2 Instance Type vs. Workload. If resource usage is high: Are you using the right instance family (compute-optimized, memory-optimized, storage-optimized)? Would burstable (T-series) behavior be a limiting factor? Check the CPUCreditBalance metric.

  5. Check EBS Performance. If your app is I/O heavy: Is the EBS volume gp2 or gp3? gp2 has burst behavior, so check VolumeReadOps / VolumeWriteOps and BurstBalance. Upgrade to gp3, io1, or io2 for more consistent IOPS.

  6. Use AWS Compute Optimizer (Free Tool). It can tell you whether the instance is over- or under-provisioned based on recent metrics.

  7. Capture a Performance Snapshot. If troubleshooting something transient: Create a CPU profile (e.g., perf, flame graphs, py-spy for Python apps). Use dstat or sar to log metrics over time.
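To make the gp2 burst behavior in the EBS step concrete: a gp2 volume's baseline is 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000, and volumes whose baseline is below 3,000 IOPS can burst up to 3,000 until BurstBalance runs out. The sketch below computes only the baseline; the function name is made up for illustration.

```shell
# Baseline IOPS for a gp2 volume of the given size in GiB:
# 3 IOPS per GiB, never below 100 and never above 16000.
gp2_baseline_iops() {
  local size_gib="$1"
  local iops=$(( size_gib * 3 ))
  if [ "$iops" -lt 100 ]; then
    iops=100
  fi
  if [ "$iops" -gt 16000 ]; then
    iops=16000
  fi
  echo "$iops"
}

gp2_baseline_iops 100   # prints 300
```

If the VolumeReadOps and VolumeWriteOps rate for a small gp2 volume regularly exceeds this baseline while BurstBalance trends toward zero, the volume is likely the bottleneck. gp3 avoids the issue because it provides a 3,000 IOPS baseline regardless of size.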

answered 2 months ago