
Monday, March 18, 2024

Observability in AWS


1        Introduction

The application logs across microservices are currently centralized in CloudWatch.  However, the true potential of this data is only realized when it is proactively monitored and analyzed for insights.

With the increasing complexity of .NET applications, it becomes essential to gather custom metrics that go beyond standard logs. Tools like PerfMon provide deep insights into vital performance indicators such as CPU usage, memory leaks, and exception rates. By integrating these tools with AWS CloudWatch, we can create a robust monitoring ecosystem tailored to our application's specific needs.

2        Amazon CloudWatch Application Insights

2.1       Enable CloudWatch agent  

·        Install CloudWatch Agent: On EC2 instances where .NET and SQL Server applications are running, install the CloudWatch Agent. This agent collects metrics and logs from servers and sends them to CloudWatch.

·        Configure the Agent: Configure the CloudWatch Agent to collect logs and metrics. For .NET applications, we want to collect logs from IIS, application event logs, and custom application logs. For SQL Server, collect SQL Server logs and Windows application event logs.

·        Set up and configure CloudWatch Application Insights.

o   Application Insights offers two avenues: automatic collection and custom metric integration.  For advanced debugging (garbage collection, thread metrics, I/O metrics, ASP.NET Core metrics), we need to configure custom metrics.

o   Using the CloudWatch agent for Windows:
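As a rough illustration of the agent configuration described above, the sketch below builds a CloudWatch agent configuration as a Python dictionary and serializes it to JSON. The namespace, log group names, file paths, and the `w3wp` process instance are assumptions chosen for the example, not values from this environment. The `.NET CLR Memory` counters are assumed to be present on the host (they are published by .NET Framework applications).

```python
import json


def build_agent_config(namespace="Custom/DotNetApp",
                       app_log_path=r"C:\Logs\app\*.log"):
    """Return a CloudWatch agent configuration as a dict.

    All names and paths here are hypothetical placeholders.
    """
    return {
        "metrics": {
            "namespace": namespace,
            # Tag every metric with the instance that produced it.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            # On Windows, each key is a performance counter object.
            "metrics_collected": {
                "Processor": {
                    "measurement": ["% Processor Time"],
                    "resources": ["_Total"],
                    "metrics_collection_interval": 60,
                },
                "Process": {
                    "measurement": ["Private Bytes", "Virtual Bytes", "Working Set"],
                    "resources": ["w3wp"],  # assumed IIS worker process
                },
                ".NET CLR Memory": {
                    "measurement": ["% Time in GC"],
                    "resources": ["*"],
                },
            },
        },
        "logs": {
            "logs_collected": {
                "windows_events": {
                    "collect_list": [{
                        "event_name": "Application",
                        "event_levels": ["ERROR", "WARNING"],
                        "log_group_name": "dotnet/application-events",
                        "log_stream_name": "{instance_id}",
                    }]
                },
                "files": {
                    "collect_list": [
                        {
                            "file_path": r"C:\inetpub\logs\LogFiles\W3SVC1\*.log",
                            "log_group_name": "dotnet/iis",
                            "log_stream_name": "{instance_id}",
                        },
                        {
                            "file_path": app_log_path,
                            "log_group_name": "dotnet/application",
                            "log_stream_name": "{instance_id}",
                        },
                    ]
                },
            }
        },
    }


def render_agent_config(config):
    """Serialize the configuration to the JSON text the agent reads."""
    return json.dumps(config, indent=2)
```

Generating the file from code keeps the counter list in one place, so the same definition can be reused across instances.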

 

2.2       Custom metrics

The AWS CloudWatch agent can collect performance counter data from Windows servers and send it to CloudWatch.  The tables below serve as a guideline for setting up CloudWatch Alarms and notifying support teams.

 

| Metric | Description | Acceptable Threshold | Alert Trigger Condition |
|---|---|---|---|
| CPU Usage | Monitor CPU utilization of microservices and batch jobs | Below 85% | CPU usage > 85% for 5 minutes |
| Memory Usage (Private Bytes) | Committed process memory (managed and native) | Varies by application | Usage > 90% of allocated memory for 5 minutes |
| Memory Usage (Virtual Bytes) | Total virtual memory allocated for the process | Varies by application. To be revisited later | Usage > 90% of allocated memory for 5 minutes |
| Memory Usage (Working Set) | Physical memory consumed by the process | Varies by application. To be revisited later | Usage > 90% of allocated memory for 5 minutes |
| Garbage Collection (% Time in GC) | Percentage of time spent in garbage collection | Below 10% | > 10% for 5 minutes |
| Request Queue Length | Number of requests waiting to be processed | Below 50 | > 50 for 5 minutes |
| Response Success Rate | Percentage of successful responses | Above 95% | < 95% for 5 minutes |
| HTTP Status 5xx | Server error responses | 0 | Any occurrence |
| HTTP Status 4xx | Client error responses | Varies by application. To be revisited later | Increase > 50% compared to 24-hour average |
| Database Connection Errors | Issues connecting to the database | 0 | Any occurrence |
| Authentication Failures | Unauthorized access attempts | 0 | Any occurrence |

 

 

Message Queue

| Service | Metric | Description | Acceptable Threshold | Alert Trigger Condition |
|---|---|---|---|---|
| SNS | NumberOfNotificationsFailed | The number of notifications that failed to deliver. | 0 | Any failures |
| SQS | ApproximateAgeOfOldestMessage | The age of the oldest message in the queue, indicating processing delays. | 60 minutes | Messages older than 60 minutes, indicating slow processing or a stalled consumer |
| SQS | NumberOfMessagesDeleted | The number of messages deleted from the queue, which indicates successful processing. | Consistent with expected volume | Decrease by 50% over 30 minutes compared to the average rate, indicating processing issues |
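To show how rows of these tables could translate into CloudWatch Alarms, here is a minimal sketch that builds the arguments for the CloudWatch `put_metric_alarm` API. The instance ID, queue name, alarm names, and SNS topic ARN are placeholders. The CPU example uses the standard `AWS/EC2` `CPUUtilization` metric rather than an agent counter, and `ApproximateAgeOfOldestMessage` is reported in seconds, so 60 minutes becomes 3600.

```python
def build_alarm(name, namespace, metric, dimensions, threshold,
                comparison="GreaterThanThreshold", statistic="Average",
                period=300, evaluation_periods=1, topic_arn=None):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": statistic,
        "Period": period,                      # seconds; 300 = 5 minutes
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn] if topic_arn else [],
    }


def example_alarms(instance_id, queue_name, topic_arn):
    """Two rows from the tables above, expressed as alarm definitions."""
    return [
        # CPU usage > 85% for 5 minutes
        build_alarm("cpu-high", "AWS/EC2", "CPUUtilization",
                    {"InstanceId": instance_id}, threshold=85,
                    topic_arn=topic_arn),
        # Oldest SQS message older than 60 minutes (metric unit is seconds)
        build_alarm("sqs-oldest-message", "AWS/SQS",
                    "ApproximateAgeOfOldestMessage",
                    {"QueueName": queue_name}, threshold=3600,
                    statistic="Maximum", topic_arn=topic_arn),
    ]


def create_alarms(alarms, cloudwatch=None):
    """Create each alarm; pass a client, or one is created with boto3."""
    if cloudwatch is None:
        import boto3  # imported lazily so the builders work without it
        cloudwatch = boto3.client("cloudwatch")
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
```

Keeping the definitions as plain dictionaries makes it easy to review thresholds against the tables before anything is created in AWS.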

 

2.3       Tickets and prioritization

The priority for raising tickets based on the metrics provided should be determined by the impact and urgency of each alert condition. An alert must be sent to Discovery Support whenever one of the alert trigger conditions defined in the tables above is met.

The priority levels can be reviewed and adjusted based on usage and business impact.

Critical Priority

·        CPU Usage: CPU usage > 95% for 5 minutes. High CPU utilization can lead to service degradation and outages.

·        Failed Requests: Any occurrence. Indicates a failure in processing requests, directly impacting user transactions. Context log

·        HTTP Status 5xx: Any occurrence. Server errors directly affect the availability of services.

·        Database Connection Errors: Any occurrence. Indicates problems accessing the database, which could cripple application functionality.

High Priority

·        Memory Usage (Private Bytes, Virtual Bytes, Working Set): Usage > 90% of allocated memory for 5 minutes. High memory usage can lead to application crashes or severe performance degradation.

·        Garbage Collection (% Time in GC): > 10% for 5 minutes. Excessive garbage collection can indicate memory leaks or inefficient memory use, impacting performance.

Medium Priority

·        Request Queue Length: > 50 for 5 minutes. While this indicates a backlog, it may be manageable in the short term but needs to be monitored for worsening trends.

·        HTTP Status 4xx: Increase > 50% compared to 24-hour average. These are client errors and may not always indicate a server-side issue, but a significant increase could point to API or user experience problems.

·        Message Queue metrics
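The priority levels above could be kept in a small lookup so that every ticket is tagged consistently. The sketch below is one possible shape; the function names and the fallback to "Medium" for alerts not listed above are assumptions.

```python
# Priority per alert, transcribed from the lists above.
PRIORITY_BY_ALERT = {
    "CPU Usage": "Critical",  # > 95% for 5 minutes
    "Failed Requests": "Critical",
    "HTTP Status 5xx": "Critical",
    "Database Connection Errors": "Critical",
    "Memory Usage (Private Bytes)": "High",
    "Memory Usage (Virtual Bytes)": "High",
    "Memory Usage (Working Set)": "High",
    "Garbage Collection (% Time in GC)": "High",
    "Request Queue Length": "Medium",
    "HTTP Status 4xx": "Medium",
    "Message Queue": "Medium",
}


def ticket_priority(alert_name, default="Medium"):
    """Look up the priority for an alert; unknown alerts get the default."""
    return PRIORITY_BY_ALERT.get(alert_name, default)


def ticket_subject(alert_name, detail):
    """Prefix the ticket subject with its priority."""
    return f"[{ticket_priority(alert_name)}] {alert_name}: {detail}"
```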

 

 

3        AWS CloudWatch Logs Insights: Common Error Pattern Queries

This section details a set of queries for AWS CloudWatch Logs Insights, designed to efficiently identify and diagnose common error patterns in log data. It includes queries for general errors, .NET exceptions, HTTP status codes, database connection issues, and more, enhancing the monitoring and troubleshooting process.

3.1       CorrelationID

The correlation ID is injected by the application and is used to track the request flow. This ID ties together all actions and requests that are part of the same user action or transaction. Logging it with each request helps in tracing the flow of a request across different services and components.

·        X-Correlation-Id for User Actions: The application generates a unique X-Correlation-Id for the entire user action sequence (e.g., submitting a form) at the frontend level (Angular app). This provides a consistent identifier to track the flow of the action across all involved components and services.

·        Propagation of Identifiers: The application forwards the X-Correlation-Id through all service calls, including those to subsequent microservices or AWS services (SNS/SQS), so that every transactional step in the process can be traced back to the original user action.

fields @timestamp, @message

| filter @message like /X-Correlation-Id: 20240129140153_86d1715f-b839-45c9-b857-66b27922305f/

| sort @timestamp desc

| limit 100
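The ID in the query above looks like a UTC timestamp followed by a UUID. Assuming that format (it is inferred from the example, not confirmed), generation and propagation could be sketched as follows. The real application does this in Angular and .NET; Python is used here only to illustrate the idea, and the function names are hypothetical.

```python
import uuid
from datetime import datetime, timezone

HEADER = "X-Correlation-Id"


def new_correlation_id(now=None):
    """Create an ID shaped like 20240129140153_<uuid4> (assumed format)."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d%H%M%S}_{uuid.uuid4()}"


def propagate(incoming_headers, outgoing_headers=None):
    """Copy the incoming correlation ID onto an outgoing call.

    If the incoming request carries no ID, a new one is created so that
    downstream log lines can still be tied together.
    """
    outgoing = dict(outgoing_headers or {})
    # HTTP header names are case-insensitive.
    existing = next(
        (value for name, value in incoming_headers.items()
         if name.lower() == HEADER.lower()),
        None,
    )
    outgoing[HEADER] = existing or new_correlation_id()
    return outgoing
```

For SNS or SQS, the same value would typically travel as a message attribute rather than an HTTP header.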

 

3.2       General Errors and Exceptions

fields @timestamp, @message

| filter @message like /ERROR|EXCEPTION/

| sort @timestamp desc

| limit 100

3.3       .NET Specific Exceptions

fields @timestamp, @message

| filter @message like /NullReferenceException|OutOfMemoryException|StackOverflowException/

| sort @timestamp desc

| limit 100

3.4       HTTP Status Codes (Client and Server Errors)

fields @timestamp, @message

| filter @message like /HTTP\/\d\.\d\" 5\d\d|HTTP\/\d\.\d\" 4\d\d/

| sort @timestamp desc

| limit 100

3.5       Database Connection Errors

fields @timestamp, @message

| filter @message like /Connection failed|Connection refused|SQL Exception/

| sort @timestamp desc

| limit 100

3.6       Authentication and Security Issues

fields @timestamp, @message

| filter @message like /unauthorized|forbidden|access denied|authentication failed/

| sort @timestamp desc

| limit 100

3.7       Configuration Errors

fields @timestamp, @message

| filter @message like /Configuration error|invalid configuration|missing configuration/

| sort @timestamp desc

| limit 100

3.8       Connectivity and Timeout Issues

fields @timestamp, @message

| filter @message like /Connection timeout|Connection lost|unable to connect/

| sort @timestamp desc

| limit 100

3.9       Service or Dependency Failures

fields @timestamp, @message

| filter @message like /Service unavailable|dependency failure|failed to connect to service/

| sort @timestamp desc

| limit 100
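The queries in this section share one template and differ only in the filter pattern, so they can also be run programmatically. The sketch below assumes a boto3 CloudWatch Logs client is passed in as `logs`; the log group name, the 30-minute window, and the polling limits are illustrative choices.

```python
import time

# A few of the patterns from the queries above.
ERROR_PATTERNS = {
    "general": "ERROR|EXCEPTION",
    "dotnet": "NullReferenceException|OutOfMemoryException|StackOverflowException",
    "database": "Connection failed|Connection refused|SQL Exception",
}


def build_query(pattern, limit=100):
    """Fill the shared query template with a filter pattern."""
    return "\n".join([
        "fields @timestamp, @message",
        f"| filter @message like /{pattern}/",
        "| sort @timestamp desc",
        f"| limit {limit}",
    ])


def run_query(logs, log_group, pattern, window_seconds=1800,
              poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return rows as plain dicts."""
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - window_seconds,
        endTime=end,
        queryString=build_query(pattern),
    )["queryId"]
    for _ in range(max_polls):
        response = logs.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")
```

A call such as `run_query(boto3.client("logs"), "/app/orders", ERROR_PATTERNS["general"])` would return the matching log lines from the last 30 minutes, where `/app/orders` stands in for a real log group.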

 

4        Raise issue Tickets

Automating email support tickets involves setting up a process in which CloudWatch automatically notifies the Discovery Support team of issues by email.

This can be achieved by using AWS services to send emails based on the results from CloudWatch Logs Insights. We can use Amazon Simple Email Service (SES) to send emails that automatically create or update tickets. Here's how to set it up:

1.      Configure SES with the permissions needed to send emails to Discovery Support.

2.      Create a Lambda function that runs the CloudWatch Logs Insights queries defined in section 3 and processes their results.

3.      Parse the query results to identify exceptions.

4.      Format the email content based on the exceptions found.

5.      Use Amazon SES to send the email to Discovery Support.

6.      Trigger the Lambda function at scheduled intervals, e.g. every 30 minutes.

a.      Multiple Lambda functions can run at different frequencies if required.
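Steps 3 to 5 could look roughly like the sketch below. It assumes the query results arrive in the shape returned by the Logs Insights `get_query_results` API (a list of rows, each a list of field/value cells) and that an SES client is passed in. The sender and recipient addresses, the "Unclassified" bucket, and the function names are hypothetical.

```python
import re
from collections import defaultdict

EXCEPTION_NAME = re.compile(r"\b(\w+Exception)\b")


def find_exceptions(results):
    """Group log lines by the first exception type named in the message."""
    groups = defaultdict(list)
    for row in results:
        cells = {cell["field"]: cell["value"] for cell in row}
        message = cells.get("@message", "")
        match = EXCEPTION_NAME.search(message)
        name = match.group(1) if match else "Unclassified"
        groups[name].append((cells.get("@timestamp", "?"), message.strip()))
    return dict(groups)


def format_email(groups, log_group):
    """Build a plain-text subject and body from the grouped exceptions."""
    total = sum(len(entries) for entries in groups.values())
    subject = f"[{log_group}] {total} error entries in {len(groups)} group(s)"
    lines = []
    for name in sorted(groups):
        lines.append(f"{name} ({len(groups[name])})")
        lines.extend(f"  {ts}  {msg}" for ts, msg in groups[name])
    return subject, "\n".join(lines)


def notify(results, ses, log_group, sender, recipient):
    """Send one email summarizing the results; return the subject or None."""
    groups = find_exceptions(results)
    if not groups:
        return None  # nothing to report, so no ticket is raised
    subject, body = format_email(groups, log_group)
    ses.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
    return subject
```

For step 6, an Amazon EventBridge schedule with the expression `rate(30 minutes)` is one way to invoke the Lambda function that wraps this logic.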

 

 

4.1       Avoid raising duplicate tickets

a.      Create an S3 bucket to store the Lambda ticket log.

b.      Log content: Each log entry should contain identifiers for the tickets created, such as timestamp, ticket description, error/message ID, or any relevant detail that helps in identifying duplicates.

c.      Log check

          i.     The Lambda function reads the log file from S3.

         ii.     Check for existing tickets: Parse the content of the log file to check whether a ticket for the current issue has already been raised. Implement logic based on timestamps, error identifiers, or message content to determine duplication.

        iii.     Proceed based on the check: If an existing ticket is found, update the log with the new occurrence details but skip creating a new ticket. Otherwise, create a new ticket and log the action.
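One way to implement the check above is to reduce each error message to a fingerprint and keep a JSON ticket log in S3. This is a sketch under several assumptions: the fingerprint rule (numbers and GUIDs replaced, text lower-cased), the log layout, and the function names are all illustrative, and `s3` is a boto3 S3 client passed in by the caller.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# GUID-like runs and numbers vary between occurrences of the same error.
VARIABLE_PARTS = re.compile(r"[0-9a-fA-F-]{36}|\d+")


def fingerprint(message):
    """Hash a normalized message so repeats of one error compare equal."""
    normalized = VARIABLE_PARTS.sub("#", message).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def record_occurrence(ticket_log, message, now=None):
    """Update the log in place; return True if a new ticket is needed."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    key = fingerprint(message)
    entry = ticket_log.get(key)
    if entry is not None:
        # Existing ticket: note the new occurrence, skip ticket creation.
        entry["occurrences"] += 1
        entry["last_seen"] = stamp
        return False
    ticket_log[key] = {
        "description": message[:200],
        "first_seen": stamp,
        "last_seen": stamp,
        "occurrences": 1,
    }
    return True


def load_ticket_log(s3, bucket, key):
    """Read the ticket log from S3; a missing object means an empty log."""
    try:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    except s3.exceptions.NoSuchKey:
        return {}
    return json.loads(body)


def save_ticket_log(s3, bucket, key, ticket_log):
    """Write the ticket log back to S3 as JSON."""
    s3.put_object(Bucket=bucket, Key=key,
                  Body=json.dumps(ticket_log).encode("utf-8"))
```

The read-modify-write cycle is not atomic, so this approach fits a single scheduled Lambda better than several functions writing to the same log object.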

 

4.2       Configuration to turn off the notifications.

Provide an option to turn off notifications so that technical teams are not overloaded with alerts while an issue is being addressed.
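A simple way to provide such a switch is through configuration that the notifying code reads before sending anything. The sketch below uses two environment variables, `NOTIFICATIONS_ENABLED` and `NOTIFICATIONS_MUTED_UNTIL`; both names are hypothetical, and the mute timestamp is assumed to be an ISO 8601 value with a UTC offset.

```python
import os
from datetime import datetime, timezone

OFF_VALUES = ("false", "0", "off", "no")


def notifications_enabled(env=None, now=None):
    """Return False while notifications are switched off or muted."""
    env = os.environ if env is None else env
    if env.get("NOTIFICATIONS_ENABLED", "true").strip().lower() in OFF_VALUES:
        return False
    muted_until = env.get("NOTIFICATIONS_MUTED_UNTIL")
    if muted_until:
        # Expected form: 2024-03-18T13:00:00+00:00 (offset required).
        now = now or datetime.now(timezone.utc)
        return now >= datetime.fromisoformat(muted_until)
    return True
```

A mute-until timestamp has the advantage that notifications resume on their own if nobody remembers to switch them back on.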

4.3       Clear Logs

Every few hours, copy the error logs to an S3 bucket. Clear the logs every XX hours. Grant S3 access to QA and selected technical teams.

 

4.4       Strategies to manage large number of Alerts

If a large number of alerts is generated (for example, HTTP 500 errors), then instead of raising a ticket for each error, configure alerts based on error rates or on spikes in HTTP 500 errors above a certain threshold.

·        The following Logs Insights query groups exceptions by message within 30-minute windows and counts the occurrences:

 

fields @message, @timestamp

| filter @message like /Exception/

| stats count(*) as exceptionCount by bin(30m) as timeWindow, @message

| sort timeWindow desc

| limit 100
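The grouped counts from the query above can then drive a single ticket per spiking message instead of one ticket per log line. The sketch below assumes the rows have already been converted to dictionaries keyed by the query's column names, with counts as strings as Logs Insights returns them. The threshold of 10 is an arbitrary example value.

```python
def summarize_spikes(rows, threshold=10):
    """Return one entry per message whose count reached the threshold.

    rows: dicts with "timeWindow", "@message" and "exceptionCount".
    For each message, the window with the highest count is reported.
    """
    spikes = {}
    for row in rows:
        count = int(row["exceptionCount"])
        if count < threshold:
            continue
        message = row["@message"]
        best = spikes.get(message)
        if best is None or count > best["count"]:
            spikes[message] = {
                "message": message,
                "window": row["timeWindow"],
                "count": count,
            }
    # Largest spikes first, so the most urgent issue leads the ticket.
    return sorted(spikes.values(), key=lambda s: s["count"], reverse=True)
```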

5        Dashboards

The following dashboards are proposed for observability of the collected metrics.

5.1       Overview Dashboard

Purpose: To provide a high-level view of the overall health of the application.

Widgets:

·        CPU and Memory Usage: Display graphs for CPU usage and memory metrics across different microservices.

·        Garbage Collection Metrics: Visualize GC heap size and % time in GC.

·        Throughput and Error Rates: Show the rate of requests and errors over time, including HTTP status codes.

5.2       Performance Dashboard

Purpose: To focus on the performance aspects of the application.

Widgets:

·        I/O Metrics: Graphs for Disk and Network I/O metrics.

·        Thread and Task Metrics: Display thread count, thread pool count, and lock contention rate.

·        ASP.NET Core Specific Metrics: Visualize request queue length, failed requests, and response success rate.

5.3       Error and Anomaly Detection Dashboard

Purpose: To quickly identify and diagnose common error patterns and anomalies.

Widgets:

·        Error and Exception Graphs: Filters and graphs for general errors, .NET specific exceptions, and HTTP status codes.

·        Database Connection Errors: Display trends and spikes in database connectivity issues.

·        Security and Configuration Errors: Graphs for authentication failures, configuration errors, and connectivity issues.

5.4       AWS Services Monitoring Dashboard (SNS and SQS)

Purpose: To monitor the performance and health of the AWS services used by the application.

Widgets:

·        SNS Metrics: Display NumberOfNotificationsDelivered, NumberOfNotificationsFailed

·        SQS Metrics: Show NumberOfMessagesDeleted, ApproximateNumberOfMessagesVisible
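As an illustration of how this dashboard could be defined in code, the sketch below builds a dashboard body with one SNS widget and one SQS widget. The topic name, queue name, region, and widget layout are placeholders. The resulting JSON string is what the CloudWatch `put_dashboard` API accepts as `DashboardBody`.

```python
import json


def metric_widget(title, metrics, x, y, region,
                  stat="Sum", period=300, width=12, height=6):
    """Describe one metric graph in CloudWatch dashboard body format."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "region": region,
            "view": "timeSeries",
        },
    }


def messaging_dashboard_body(topic_name, queue_name, region="us-east-1"):
    """Return the dashboard body as a JSON string (names are placeholders)."""
    widgets = [
        metric_widget("SNS notifications", [
            ["AWS/SNS", "NumberOfNotificationsDelivered", "TopicName", topic_name],
            ["AWS/SNS", "NumberOfNotificationsFailed", "TopicName", topic_name],
        ], x=0, y=0, region=region),
        metric_widget("SQS messages", [
            ["AWS/SQS", "NumberOfMessagesDeleted", "QueueName", queue_name],
            ["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", queue_name],
        ], x=12, y=0, region=region),
    ]
    return json.dumps({"widgets": widgets})
```

With a boto3 CloudWatch client, the body could be published through `put_dashboard(DashboardName=..., DashboardBody=...)`. Defining dashboards this way keeps them versioned alongside the alarm definitions.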

 

6        Conclusion

In conclusion, the integration of AWS monitoring tools such as CloudWatch and Application Insights, coupled with the use of custom metrics, provides a robust framework for observability in the application. This approach centralizes log management and performance monitoring. The suggested dashboards will offer real-time insights into the application's operational health, performance bottlenecks, and potential areas of improvement, helping to ensure high availability and optimal performance of the services. By continuously monitoring these metrics, the team can swiftly respond to issues, maintain service quality, and improve customer satisfaction. Additionally, these insights can guide future enhancements and resource optimization strategies for the platform.

Claim Based Authorization

  1.      Claim Based Authorization ·         Token Validation: As requests come into the Ocelot API Gateway, the first step is to validat...