1 Introduction

The application logs across microservices are currently centralized in CloudWatch. However, the true potential of this data is only realized when it is proactively monitored and analyzed for insights.

With the increasing complexity of .NET applications, it becomes essential to gather custom metrics that go beyond standard logs. Tools like PerfMon provide deep insights into vital performance indicators such as CPU usage, memory leaks, and exception rates. By integrating these tools with AWS CloudWatch, we can create a robust monitoring ecosystem tailored to our application's specific needs.

2 Amazon CloudWatch Application Insights

2.1 Enable CloudWatch agent

· Install CloudWatch Agent: On EC2 instances where .NET and SQL Server applications are running, install the CloudWatch Agent. This agent collects metrics and logs from servers and sends them to CloudWatch.

· Configure the Agent: Configure the CloudWatch Agent to collect logs and metrics. For .NET applications, we want to collect logs from IIS, application event logs, and custom application logs. For SQL Server, collect SQL Server logs and Windows application event logs.

· Set up and configure CloudWatch Application Insights.

o Application Insights offers two avenues: automatic collection and custom metric integration. For advanced debugging (garbage collection, thread metrics, I/O metrics, ASP.NET Core metrics), we need to configure custom metrics.

o Using the CloudWatch agent for Windows:
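
A minimal sketch of what such an agent configuration might contain, written as a Python script that emits the JSON file. The namespace, log group names, and IIS log path are assumptions for illustration and are not taken from the actual environment; the counters shown are standard Windows performance counters, and the set should be adjusted per application.

```python
import json


def build_agent_config(namespace="Custom/DotNetApp"):
    """Return a CloudWatch agent config dict for a Windows host (illustrative values)."""
    return {
        "metrics": {
            "namespace": namespace,  # assumed custom namespace
            "metrics_collected": {
                "Processor": {
                    "measurement": ["% Processor Time"],
                    "resources": ["_Total"],
                    "metrics_collection_interval": 60,
                },
                "Memory": {
                    "measurement": ["% Committed Bytes In Use"],
                    "metrics_collection_interval": 60,
                },
                ".NET CLR Memory": {
                    "measurement": ["% Time in GC"],
                    "resources": ["*"],
                    "metrics_collection_interval": 60,
                },
            },
        },
        "logs": {
            "logs_collected": {
                "windows_events": {
                    "collect_list": [{
                        "event_name": "Application",
                        "event_levels": ["ERROR", "WARNING"],
                        "log_group_name": "dotnet-app/windows-events",  # assumed name
                        "log_stream_name": "{instance_id}",
                    }]
                },
                "files": {
                    "collect_list": [{
                        # assumed IIS log location; adjust to the actual site
                        "file_path": "C:\\inetpub\\logs\\LogFiles\\W3SVC1\\*.log",
                        "log_group_name": "dotnet-app/iis",  # assumed name
                        "log_stream_name": "{instance_id}",
                    }]
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved as the agent's configuration file.
    print(json.dumps(build_agent_config(), indent=2))
```

Generating the file from code keeps the counter list reviewable in one place; the same JSON could equally be written by hand.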

2.2 Custom metrics

The AWS CloudWatch agent can collect performance counter data from Windows servers and send it to CloudWatch. The following table serves as a guideline for setting up CloudWatch Alarms and notifying support teams.

Metric	Description	Acceptable Threshold	Alert Trigger Condition
CPU Usage	Monitor CPU utilization of microservices and batch jobs	Below 85%	CPU usage > 85% for 5 minutes
Memory Usage (Private Bytes)	Committed process memory (managed and native)	Varies by application	Usage > 90% of allocated memory for 5 minutes
Memory Usage (Virtual Bytes)	Total virtual memory allocated for the process	Varies by application. To be revisited later	Usage > 90% of allocated memory for 5 minutes
Memory Usage (Working Set)	Physical memory consumed by the process	Varies by application. To be revisited later	Usage > 90% of allocated memory for 5 minutes
Garbage Collection (% Time in GC)	Percentage of time spent in garbage collection	Below 10%	> 10% for 5 minutes
Request Queue Length	Number of requests waiting to be processed	Below 50	> 50 for 5 minutes
Response Success Rate	Percentage of successful responses	Above 95%	< 95% for 5 minutes
HTTP Status 5xx	Server error responses	0	Any occurrence
HTTP Status 4xx	Client error responses	Varies by application. To be revisited later	Increase > 50% compared to 24-hour average
Database Connection Errors	Issues connecting to the database	0	Any occurrence
Authentication Failures	Unauthorized access attempts	0	Any occurrence
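
As an illustration of how one row of the table could become an alarm, the sketch below builds the parameters for the "CPU usage > 85% for 5 minutes" condition. The alarm name, namespace, metric name, and dimensions are assumptions that depend on how the agent is configured; the parameter keys follow the CloudWatch `PutMetricAlarm` API.

```python
def cpu_alarm_params(service_name, sns_topic_arn, dimensions,
                     namespace="Custom/DotNetApp",
                     metric_name="Processor % Processor Time"):
    """Build PutMetricAlarm parameters for CPU > 85% over one 5-minute period.

    namespace, metric_name and dimensions are assumed values; they must match
    whatever the CloudWatch agent actually publishes.
    """
    return {
        "AlarmName": f"{service_name}-cpu-high",  # hypothetical naming scheme
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": 300,               # 5 minutes, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # topic that notifies the support team
        "TreatMissingData": "missing",
    }
```

With boto3 installed, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`; the other rows follow the same shape with a different metric, threshold, and comparison operator.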

Message Queue

Service	Metric	Description	Acceptable Threshold	Alert Trigger Condition
SNS	NumberOfNotificationsFailed	The number of notifications that failed to deliver.	0	Any failures
SQS	ApproximateAgeOfOldestMessage	The age of the oldest message in the queue, indicating processing delays.	60 minutes	Messages older than 60 minutes, indicating slow processing or a stalled consumer
SQS	NumberOfMessagesDeleted	The number of messages deleted from the queue, which indicates successful processing.	Consistent with expected volume	Decrease by 50% over 30 minutes compared to the average rate, indicating processing issues
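
Two conditions in the tables above are relative rather than absolute: the 50% drop in NumberOfMessagesDeleted and the 50% rise in HTTP 4xx responses against an average. A minimal sketch of that comparison, assuming the current value and the baseline average have already been fetched from CloudWatch (the function names are hypothetical):

```python
def relative_change(current, baseline):
    """Return the fractional change of current versus baseline (0.5 means +50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def deletion_rate_dropped(current_rate, average_rate, max_drop=0.5):
    """True when the delete rate fell by at least max_drop against the average."""
    return relative_change(current_rate, average_rate) <= -max_drop


def client_errors_spiked(current_count, average_count, max_increase=0.5):
    """True when the 4xx count rose by more than max_increase against the average."""
    return relative_change(current_count, average_count) > max_increase
```

For example, `deletion_rate_dropped(40, 100)` flags a 60% drop, while `client_errors_spiked(150, 100)` does not fire because the increase is exactly 50% and not above it.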

2.3 Tickets and prioritization

The priority for raising tickets based on the metrics provided should be determined by the impact and urgency of each alert condition. An alert must be raised to discovery support based on the alert trigger conditions defined in the tables above.

The priority levels can be reviewed and adjusted based on usage and business impact.

Critical Priority

· CPU Usage: CPU usage > 95% for 5 minutes. High CPU utilization can lead to service degradation and outages.

· Failed Requests: Any occurrence. Indicates a failure in processing requests, directly impacting user transactions. Context log

· HTTP Status 5xx: Any occurrence. Server errors directly affect the availability of services.

· Database Connection Errors: Any occurrence. Indicates problems accessing the database, which could cripple application functionality.

High Priority

· Memory Usage (Private Bytes, Virtual Bytes, Working Set): Usage > 90% of allocated memory for 5 minutes. High memory usage can lead to application crashes or severe performance degradation.

· Garbage Collection (% Time in GC): > 10% for 5 minutes. Excessive garbage collection can indicate memory leaks or inefficient memory use, impacting performance.

Medium Priority

· Request Queue Length: > 50 for 5 minutes. While this indicates a backlog, it may be manageable in the short term but needs to be monitored for worsening trends.

· HTTP Status 4xx: Increase > 50% compared to 24-hour average. These are client errors and may not always indicate a server-side issue, but a significant increase could point to API or user experience problems.

· Message Queue metrics
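
The prioritization above can be kept as a simple lookup so that the ticketing step does not hard-code it in several places. This is only a sketch; the `Priority` enum, the key strings, and the default for unknown alerts are assumptions.

```python
from enum import Enum


class Priority(Enum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3


# Mirrors the lists above; keys are illustrative alert names.
ALERT_PRIORITY = {
    "CPU Usage": Priority.CRITICAL,
    "Failed Requests": Priority.CRITICAL,
    "HTTP Status 5xx": Priority.CRITICAL,
    "Database Connection Errors": Priority.CRITICAL,
    "Memory Usage": Priority.HIGH,
    "Garbage Collection (% Time in GC)": Priority.HIGH,
    "Request Queue Length": Priority.MEDIUM,
    "HTTP Status 4xx": Priority.MEDIUM,
    "Message Queue": Priority.MEDIUM,
}


def priority_for(alert_name, default=Priority.MEDIUM):
    """Return the ticket priority for an alert, falling back to an assumed default."""
    return ALERT_PRIORITY.get(alert_name, default)
```

Keeping the mapping in data makes the periodic review of priority levels a one-line change.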

3 AWS CloudWatch Logs Insights: Common Error Pattern Queries

This section details a set of queries for AWS CloudWatch Logs Insights, designed to efficiently identify and diagnose common error patterns in log data. It includes queries for general errors, .NET exceptions, HTTP status codes, database connection issues, and more, enhancing the monitoring and troubleshooting process.

3.1 CorrelationID

The correlation ID is injected by the application and is used to track the request flow. This ID ties together all actions and requests that are part of the same user action or transaction. Logging it with each request helps in tracing the flow of a request across different services and components.

· X-Correlation-Id for User Actions: The application generates a unique X-Correlation-Id for the entire user action sequence (e.g., submitting a form) at the frontend level (Angular app). This provides a consistent identifier to track the flow of the action across all involved components and services.

· Propagation of Identifiers: The application forwards the X-Correlation-Id through all service calls, including those to subsequent microservices or AWS services (SNS/SQS). This ensures that every transactional step in the process can be traced back to the original user action.
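
A minimal sketch of the propagation step, assuming headers are handled as a plain dictionary. The fallback that generates an ID when none arrives is an assumption (the text above has the frontend generate it), and the helper names are hypothetical. The attribute structure follows the `MessageAttributes` format used by SNS and SQS.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"


def ensure_correlation_id(headers):
    """Return a copy of headers that is guaranteed to carry a correlation ID.

    The incoming ID is reused when present; otherwise a new UUID is generated
    (an assumed fallback for calls that bypass the frontend). Header lookup is
    case-sensitive here for brevity.
    """
    out = dict(headers)
    if not out.get(CORRELATION_HEADER):
        out[CORRELATION_HEADER] = str(uuid.uuid4())
    return out


def to_message_attributes(headers):
    """Build SNS/SQS MessageAttributes that carry the correlation ID downstream."""
    return {
        CORRELATION_HEADER: {
            "DataType": "String",
            "StringValue": headers[CORRELATION_HEADER],
        }
    }
```

A service would call `ensure_correlation_id` on each inbound request, include the value in every log line, and attach `to_message_attributes(...)` when publishing to a topic or queue.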

iii. Proceed Based on Check: If an existing ticket is found, update the log with new occurrence details but skip creating a new ticket. Otherwise, proceed to create a new ticket and log the action.
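
The check described in this step can be sketched as follows. The in-memory dictionary stands in for whatever ticket system or store is actually used, and the `Ticket` type and the error fingerprint key are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    fingerprint: str                      # assumed key identifying the error pattern
    occurrences: list = field(default_factory=list)
    is_open: bool = True


def handle_alert(open_tickets, fingerprint, occurred_at):
    """Update an existing open ticket or create a new one.

    Returns (ticket, created), where created is False when a duplicate
    ticket was avoided.
    """
    ticket = open_tickets.get(fingerprint)
    if ticket is not None and ticket.is_open:
        # Existing ticket found: record the new occurrence, skip creation.
        ticket.occurrences.append(occurred_at)
        return ticket, False
    # No open ticket: create one and log the first occurrence.
    ticket = Ticket(fingerprint, [occurred_at])
    open_tickets[fingerprint] = ticket
    return ticket, True
```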

4.2 Configuration to turn off the notifications.

Provide an option to turn off notifications so that technical teams are not overloaded with alerts until the issue is addressed.
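
One way to model such a switch is a time-boxed mute per alert, so that notifications resume on their own if nobody re-enables them. This is a sketch under that assumption; the class and method names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


class NotificationMute:
    """Tracks which alerts are muted and until when."""

    def __init__(self):
        self._muted_until = {}

    def mute(self, alert_name, duration, now=None):
        """Suppress notifications for alert_name for the given timedelta."""
        now = now or datetime.now(timezone.utc)
        self._muted_until[alert_name] = now + duration

    def unmute(self, alert_name):
        """Re-enable notifications immediately."""
        self._muted_until.pop(alert_name, None)

    def should_notify(self, alert_name, now=None):
        """Return False while the alert is inside its mute window."""
        now = now or datetime.now(timezone.utc)
        until = self._muted_until.get(alert_name)
        return until is None or now >= until
```

For alarms managed directly in CloudWatch, the `DisableAlarmActions` and `EnableAlarmActions` API operations offer a comparable on/off switch without custom code.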

4.3 Clear Logs

Every few hours, copy the error logs to S3 buckets. Clear logs every XX hours. Provide S3 access to QA and selected technical teams.
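
The copy step could use a CloudWatch Logs export task. The sketch below only builds the request parameters for one export window; the bucket, log group, prefix, and 6-hour window are assumptions, while the parameter keys follow the CloudWatch Logs `CreateExportTask` API. Note that an export task copies every event in the time range, so restricting the copy to error entries would need a separate filtering step.

```python
from datetime import datetime, timedelta, timezone


def export_task_params(log_group, bucket, window_end, hours=6, prefix="error-logs"):
    """Build CreateExportTask parameters for the window ending at window_end.

    window_end must be a timezone-aware datetime. All names and the default
    window length are illustrative.
    """
    window_start = window_end - timedelta(hours=hours)
    return {
        "taskName": f"export-{window_end:%Y%m%dT%H%M}",
        "logGroupName": log_group,
        "fromTime": int(window_start.timestamp() * 1000),  # epoch milliseconds
        "to": int(window_end.timestamp() * 1000),
        "destination": bucket,
        "destinationPrefix": prefix,
    }
```

With boto3 installed, a scheduled job could pass the result to `boto3.client("logs").create_export_task(**params)`.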

4.4 Strategies to manage large number of Alerts

If a large number of alerts is generated (for example, HTTP 500 errors), do not raise a ticket for each error. Instead, configure alerts based on error rates or on spikes in HTTP 500 errors above a certain threshold.

· The following Logs Insights query groups exceptions by message within 30-minute windows and counts the occurrences:

fields @message, @timestamp

| filter @message like /Exception/

| stats count(*) as exceptionCount by bin(30m) as timeWindow, @message

| sort timeWindow desc

| limit 100
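
The same windowing idea can be applied in code when deciding whether to raise a single ticket for a spike. This sketch assumes the error timestamps have already been extracted from the logs as timezone-aware datetimes; the function name and threshold are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 30 * 60  # 30-minute windows, matching the query above


def windows_over_threshold(error_timestamps, threshold):
    """Return {window_start: count} for windows with more than threshold errors."""
    counts = Counter(
        int(ts.timestamp()) // WINDOW_SECONDS for ts in error_timestamps
    )
    return {
        datetime.fromtimestamp(bucket * WINDOW_SECONDS, tz=timezone.utc): n
        for bucket, n in counts.items()
        if n > threshold
    }
```

One ticket per returned window (or per spike) replaces one ticket per individual error.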

5 Dashboards

The following dashboards are proposed for observability over the collected metrics.

5.1 Overview Dashboard

Purpose: To provide a high-level view of the overall health of the application.

Widgets:

· CPU and Memory Usage: Display graphs for CPU usage and memory metrics across different microservices.

· Garbage Collection Metrics: Visualize GC heap size and % time in GC.

· Throughput and Error Rates: Show the rate of requests and errors over time, including HTTP status codes.

5.2 Performance Dashboard:

Purpose: To focus on the performance aspects of the application.

Widgets:

· I/O Metrics: Graphs for Disk and Network I/O metrics.

· Thread and Task Metrics: Display thread count, thread pool count, and lock contention rate.

· ASP.NET Core Specific Metrics: Visualize request queue length, failed requests, and response success rate.

5.3 Error and Anomaly Detection Dashboard:

Purpose: To quickly identify and diagnose common error patterns and anomalies.

Widgets:

· Error and Exception Graphs: Filters and graphs for general errors, .NET specific exceptions, and HTTP status codes.

· Database Connection Errors: Display trends and spikes in database connectivity issues.

· Security and Configuration Errors: Graphs for authentication failures, configuration errors, and connectivity issues.

5.4 AWS Services Monitoring Dashboard (SNS and SQS):

Purpose: To monitor the performance and health of the AWS services used by the application.

Widgets:

· SNS Metrics: Display NumberOfNotificationsDelivered, NumberOfNotificationsFailed

· SQS Metrics: Show NumberOfMessagesDeleted, ApproximateNumberOfMessagesVisible
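
As a rough illustration, the SNS and SQS widgets above could be described in a CloudWatch dashboard body like the one built below. The topic name, queue name, region, and layout are assumptions; the metric names and namespaces are the standard ones published by SNS and SQS.

```python
import json


def metric_widget(title, metrics, x, y, region, stat="Sum", period=300):
    """Return one time-series metric widget, half the dashboard width."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "region": region,
            "view": "timeSeries",
        },
    }


def messaging_dashboard_body(topic_name, queue_name, region="us-east-1"):
    """Build the JSON body for an SNS/SQS dashboard (illustrative layout)."""
    widgets = [
        metric_widget("SNS notifications", [
            ["AWS/SNS", "NumberOfNotificationsDelivered", "TopicName", topic_name],
            ["AWS/SNS", "NumberOfNotificationsFailed", "TopicName", topic_name],
        ], x=0, y=0, region=region),
        metric_widget("SQS messages", [
            ["AWS/SQS", "NumberOfMessagesDeleted", "QueueName", queue_name],
            # Queue depth is a gauge, so average it instead of summing.
            ["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName",
             queue_name, {"stat": "Average"}],
        ], x=12, y=0, region=region),
    ]
    return json.dumps({"widgets": widgets})
```

With boto3 installed, the body could be submitted via `boto3.client("cloudwatch").put_dashboard(DashboardName=..., DashboardBody=body)`; the other dashboards would follow the same pattern with the custom namespace metrics.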

6 Conclusion

In conclusion, the integration of AWS monitoring tools such as CloudWatch and Application Insights, coupled with the use of custom metrics, provides a robust framework for observability in the application. This approach centralizes log management and performance monitoring. The suggested dashboards will offer real-time insights into the application's operational health, performance bottlenecks, and potential areas of improvement, helping to ensure high availability and optimal performance of the services. By continuously monitoring these metrics, the team can swiftly respond to issues, maintain service quality, and improve customer satisfaction. Additionally, these insights can guide future enhancements and resource optimization strategies for the platform.

Vamsi Tokala's Blog

Monday, March 18, 2024

Observability in AWS

1 Introduction

2 Amazon CloudWatch Application Insights

2.1 Enable CloudWatch agent

2.2 Custom metrics

2.3 Tickets and prioritization

3 AWS CloudWatch Logs Insights: Common Error Pattern Queries

3.1 CorrelationID

3.2 General Errors and Exceptions

3.3 .NET Specific Exceptions

3.4 HTTP Status Codes (Client and Server Errors)

3.5 Database Connection Errors

3.6 Authentication and Security Issues

3.7 Configuration Errors

3.8 Connectivity and Timeout Issues

3.9 Service or Dependency Failures

4 Raise issue Tickets

4.1 Avoid raising duplicate tickets

4.2 Configuration to turn off the notifications.

4.3 Clear Logs

4.4 Strategies to manage large number of Alerts

5 Dashboards

5.1 Overview Dashboard

5.2 Performance Dashboard:

5.3 Error and Anomaly Detection Dashboard:

5.4 AWS Services Monitoring Dashboard (SNS and SQS):

6 Conclusion
