1 Introduction
The application logs across microservices are currently
centralized in CloudWatch. However, the
true potential of this data is only realized when it is proactively monitored
and analyzed for insights.
With the increasing complexity of .NET applications, it becomes
essential to gather custom metrics that go beyond standard logs. Tools like
PerfMon provide deep insights into vital performance indicators such as CPU
usage, memory leaks, and exception rates. By integrating these tools with AWS
CloudWatch, we can create a robust monitoring ecosystem tailored to our
application's specific needs.
2 Amazon CloudWatch Application Insights
2.1 Enable CloudWatch agent
· Install the CloudWatch agent: On EC2 instances where .NET and SQL Server applications are running, install the CloudWatch agent. This agent collects metrics and logs from the servers and sends them to CloudWatch.
· Configure the agent: Configure the CloudWatch agent to collect logs and metrics. For .NET applications, collect IIS logs, application event logs, and custom application logs. For SQL Server, collect SQL Server logs and Windows application event logs.
· Set up and configure CloudWatch Application Insights.
o Application Insights offers two avenues: automatic collection and custom metric integration. Advanced debugging data such as garbage collection, thread, I/O, and ASP.NET Core metrics requires configuring custom metrics.
o Using the CloudWatch agent for Windows:
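As an illustration of the agent setup described above, the following sketch builds a CloudWatch agent configuration for a Windows host and prints it as JSON. The namespace, log group names, and IIS log path are assumptions chosen for the example, not values from this design; the counters shown are only a small subset.

```python
import json


def build_agent_config(log_group_prefix: str, iis_log_path: str) -> dict:
    """Build a CloudWatch agent configuration for a Windows host (illustrative values)."""
    return {
        "logs": {
            "logs_collected": {
                # Plain-text log files such as IIS or custom application logs.
                "files": {
                    "collect_list": [
                        {
                            "file_path": iis_log_path,
                            "log_group_name": f"{log_group_prefix}/iis",
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                },
                # Windows Application event log, errors and warnings only.
                "windows_events": {
                    "collect_list": [
                        {
                            "event_name": "Application",
                            "event_levels": ["ERROR", "WARNING"],
                            "log_group_name": f"{log_group_prefix}/application-events",
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                },
            }
        },
        "metrics": {
            "namespace": "Custom/DotNetApp",  # assumed namespace
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            # Keys are Windows performance counter objects.
            "metrics_collected": {
                "Processor": {
                    "measurement": ["% Processor Time"],
                    "resources": ["_Total"],
                    "metrics_collection_interval": 60,
                },
                ".NET CLR Memory": {
                    "measurement": ["% Time in GC"],
                    "resources": ["*"],
                    "metrics_collection_interval": 60,
                },
            },
        },
    }


config = build_agent_config("/myapp", r"C:\inetpub\logs\LogFiles\W3SVC1\*.log")
print(json.dumps(config, indent=2))
```

The printed JSON would be saved as the agent configuration file on the instance; further counters (thread, I/O, ASP.NET) can be added under `metrics_collected` in the same shape.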
2.2 Custom metrics
The CloudWatch agent can collect performance counter data from Windows servers and send it to CloudWatch.
The following table serves as a guideline for setting up CloudWatch alarms and notifying support teams.
Metric | Description | Acceptable Threshold | Alert Trigger Condition |
CPU Usage | Monitor CPU utilization of microservices and batch jobs | Below 85% | CPU usage > 85% for 5 minutes |
Memory Usage (Private Bytes) | Committed process memory (managed and native) | Varies by application | Usage > 90% of allocated memory for 5 minutes |
Memory Usage (Virtual Bytes) | Total virtual memory allocated for the process | Varies by application. To be revisited later | Usage > 90% of allocated memory for 5 minutes |
Memory Usage (Working Set) | Physical memory consumed by the process | Varies by application. To be revisited later | Usage > 90% of allocated memory for 5 minutes |
Garbage Collection (% Time in GC) | Percentage of time spent in garbage collection | Below 10% | > 10% for 5 minutes |
Request Queue Length | Number of requests waiting to be processed | Below 50 | > 50 for 5 minutes |
Response Success Rate | Percentage of successful responses | Above 95% | < 95% for 5 minutes |
HTTP Status 5xx | Server error responses | 0 | Any occurrence |
HTTP Status 4xx | Client error responses | Varies by application. To be revisited later | Increase > 50% compared to 24-hour average |
Database Connection Errors | Issues connecting to the database | 0 | Any occurrence |
Authentication Failures | Unauthorized access attempts | 0 | Any occurrence |
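A threshold row such as "CPU usage > 85% for 5 minutes" maps naturally onto a CloudWatch alarm. The sketch below only builds the parameters in the shape expected by the PutMetricAlarm API; the alarm name, namespace, metric name, and SNS topic ARN are hypothetical, and the choice of five one-minute datapoints is one possible reading of "for 5 minutes".

```python
def threshold_alarm(name, namespace, metric, threshold, comparison,
                    sns_topic_arn, minutes=5, dimensions=None):
    """Return keyword arguments shaped for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": dimensions or [],
        "Statistic": "Average",
        "Period": 60,                   # one-minute datapoints
        "EvaluationPeriods": minutes,   # look at the last N datapoints
        "DatapointsToAlarm": minutes,   # all N must breach the threshold
        "Threshold": threshold,
        "ComparisonOperator": comparison,
        "TreatMissingData": "missing",
        "AlarmActions": [sns_topic_arn],  # topic that notifies the support team
    }


cpu_alarm = threshold_alarm(
    name="myapp-cpu-high",                   # hypothetical name
    namespace="Custom/DotNetApp",            # assumed agent namespace
    metric="Processor % Processor Time",     # assumed metric name published by the agent
    threshold=85.0,
    comparison="GreaterThanThreshold",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:support-alerts",
)
# With boto3 available, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm)
print(cpu_alarm["AlarmName"], cpu_alarm["Threshold"])
```

The same helper covers the "% Time in GC > 10%" and "Request Queue Length > 50" rows by changing the metric and threshold; the "Response Success Rate < 95%" row would use `LessThanThreshold`.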
Message Queue
Service | Metric | Description | Acceptable Threshold | Alert Trigger Condition |
SNS | NumberOfNotificationsFailed | The number of notifications that failed to deliver. | 0 | Any failures |
SQS | ApproximateAgeOfOldestMessage | The age of the oldest message in the queue, indicating processing delays. | 60 minutes | Messages older than 60 minutes, indicating slow processing or a stalled consumer |
SQS | NumberOfMessagesDeleted | The number of messages deleted from the queue, which indicates successful processing. | Consistent with expected volume | Decrease by 50% over 30 minutes compared to the average rate, indicating processing issues |
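Two of the trigger conditions above are relative rather than absolute ("Increase > 50% compared to 24-hour average" and "Decrease by 50% ... compared to the average rate"). A minimal sketch of that arithmetic follows; the function names and the sample numbers are made up for illustration.

```python
def relative_change(current: float, baseline: float) -> float:
    """Fractional change of current versus baseline (0.5 means +50%)."""
    if baseline == 0:
        # No baseline traffic: any activity counts as an unbounded increase.
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


def increased_by_more_than(current: float, baseline: float, limit: float = 0.5) -> bool:
    """True when current exceeds the baseline by more than `limit` (e.g. 4xx spike)."""
    return relative_change(current, baseline) > limit


def decreased_by_at_least(current: float, baseline: float, limit: float = 0.5) -> bool:
    """True when current has dropped by at least `limit` (e.g. SQS deletes stalling)."""
    return relative_change(current, baseline) <= -limit


# 4xx responses: 24-hour average of 40 per interval, 65 observed now -> +62.5%
print(increased_by_more_than(65, 40))
# SQS deletes: average of 200 per 30 minutes, 90 observed now -> -55%
print(decreased_by_at_least(90, 200))
```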
2.3 Tickets and prioritization
The priority of a ticket raised from the metrics above should be determined by the impact and urgency of the corresponding alert condition.
An alert must be sent to Discovery Support whenever one of the alert trigger conditions defined in the tables above is met.
The priority levels below can be reviewed and adjusted based on usage and business impact.
Critical Priority
· CPU Usage: CPU usage > 95% for 5 minutes. High CPU utilization can lead to service degradation and outages.
· Failed Requests: Any occurrence. Indicates a failure in processing requests, directly impacting user transactions. Context log
· HTTP Status 5xx: Any occurrence. Server errors directly affect the availability of services.
· Database Connection Errors: Any occurrence. Indicates problems accessing the database, which could cripple application functionality.
High Priority
· Memory Usage (Private Bytes, Virtual Bytes, Working Set): Usage > 90% of allocated memory for 5 minutes. High memory usage can lead to application crashes or severe performance degradation.
· Garbage Collection (% Time in GC): > 10% for 5 minutes. Excessive garbage collection can indicate memory leaks or inefficient memory use, impacting performance.
Medium Priority
· Request Queue Length: > 50 for 5 minutes. While this indicates a backlog, it may be manageable in the short term but needs to be monitored for worsening trends.
· HTTP Status 4xx: Increase > 50% compared to 24-hour average. These are client errors and may not always indicate a server-side issue, but a significant increase could point to API or user experience problems.
· Message Queue metrics
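The priority levels above can be expressed as an ordered rule table, so that an alert carrying several breached metrics receives the most severe matching priority. This is a sketch only: the metric keys are hypothetical names, and the "for 5 minutes" duration is assumed to have been evaluated already by the alarm that produced the observation.

```python
from typing import Optional

# (metric key, breach test on the observed value, priority); most severe first.
RULES = [
    ("cpu_percent", lambda v: v > 95, "Critical"),
    ("failed_requests", lambda v: v > 0, "Critical"),
    ("http_5xx", lambda v: v > 0, "Critical"),
    ("db_connection_errors", lambda v: v > 0, "Critical"),
    ("memory_percent_of_allocated", lambda v: v > 90, "High"),
    ("gc_time_percent", lambda v: v > 10, "High"),
    ("request_queue_length", lambda v: v > 50, "Medium"),
    ("http_4xx_increase_percent", lambda v: v > 50, "Medium"),
]


def ticket_priority(observations: dict) -> Optional[str]:
    """Return the highest priority whose rule is breached, or None if none is."""
    for key, breached, priority in RULES:
        if key in observations and breached(observations[key]):
            return priority
    return None


print(ticket_priority({"cpu_percent": 90, "gc_time_percent": 12}))
print(ticket_priority({"http_5xx": 1, "request_queue_length": 80}))
```

Keeping the rules in one list makes later threshold reviews a data change rather than a code change.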
3 AWS CloudWatch Logs Insights: Common Error Pattern Queries
This section details a set of queries for AWS CloudWatch Logs
Insights, designed to efficiently identify and diagnose common error
patterns in log data. It includes queries for general errors, .NET exceptions,
HTTP status codes, database connection issues, and more, enhancing the
monitoring and troubleshooting process.
3.1 CorrelationID
A correlation ID is injected by the application and used to track the request flow. This ID ties together all actions and requests that are part of the same user action or transaction. Logging it with each request helps in tracing the flow of a request across different services and components.
· X-Correlation-Id for User Actions: The application generates a unique X-Correlation-Id for the entire user action sequence (e.g., submitting a form) at the frontend level (Angular app). This provides a consistent identifier to track the flow of the action across all involved components and services.
· Propagation of Identifiers: The application forwards the X-Correlation-Id through all service calls, including those to subsequent microservices or AWS services (SNS/SQS), so that every transactional step in the process can be traced back to the original user action.
fields @timestamp, @message
| filter @message like /X-Correlation-Id: 20240129140153_86d1715f-b839-45c9-b857-66b27922305f/
| sort @timestamp desc
| limit 100
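The sketch below shows one way a service could reuse or create a correlation ID and build the lookup query for it. The ID format (UTC timestamp, underscore, UUID) is inferred from the sample ID in the query above and is an assumption, as are the function names; a plain dict stands in for HTTP headers, so header-name case handling is left out.

```python
import uuid
from datetime import datetime, timezone

HEADER = "X-Correlation-Id"


def new_correlation_id(now=None) -> str:
    """Create an ID shaped like the sample: yyyymmddHHMMSS_<uuid4>."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d%H%M%S}_{uuid.uuid4()}"


def outgoing_headers(incoming: dict) -> dict:
    """Reuse the caller's correlation ID when present, otherwise start a new one."""
    return {HEADER: incoming.get(HEADER) or new_correlation_id()}


def correlation_query(correlation_id: str, limit: int = 100) -> str:
    """Build the Logs Insights query that finds all entries for one correlation ID."""
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{HEADER}: {correlation_id}/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


headers = outgoing_headers({})  # no inbound ID, so a new one is generated
print(correlation_query(headers[HEADER]))
```

Every downstream call (HTTP request, SNS message attribute, SQS message attribute) would carry the same `headers[HEADER]` value so that the query returns the full request path.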
3.2 General Errors and Exceptions
fields @timestamp, @message
| filter @message like /ERROR|EXCEPTION/
| sort @timestamp desc
| limit 100
3.3 .NET-Specific Exceptions
fields @timestamp, @message
| filter @message like /NullReferenceException|OutOfMemoryException|StackOverflowException/
| sort @timestamp desc
| limit 100
3.4 HTTP Status Codes (Client and Server Errors)
fields @timestamp, @message
| filter @message like /HTTP\/\d\.\d\" 5\d\d|HTTP\/\d\.\d\" 4\d\d/
| sort @timestamp desc
| limit 100
3.5 Database Connection Errors
fields @timestamp, @message
| filter @message like /Connection failed|Connection refused|SQL Exception/
| sort @timestamp desc
| limit 100
3.6 Authentication and Security Issues
fields @timestamp, @message
| filter @message like /unauthorized|forbidden|access denied|authentication failed/
| sort @timestamp desc
| limit 100
3.7 Configuration Errors
fields @timestamp, @message
| filter @message like /Configuration error|invalid configuration|missing configuration/
| sort @timestamp desc
| limit 100
3.8 Connectivity and Timeout Issues
fields @timestamp, @message
| filter @message like /Connection timeout|Connection lost|unable to connect/
| sort @timestamp desc
| limit 100
3.9 Service or Dependency Failures
fields @timestamp, @message
| filter @message like /Service unavailable|dependency failure|failed to connect to service/
| sort @timestamp desc
| limit 100
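Since the queries in this section share one shape, they can be generated from a pattern catalog and run programmatically. The sketch below assumes a CloudWatch Logs client (for example `boto3.client("logs")`) is passed in; it relies on the `start_query` and `get_query_results` operations. The catalog keys, log group name, and polling limits are illustrative choices.

```python
import time

# A subset of the patterns from this section, keyed by a hypothetical short name.
ERROR_PATTERNS = {
    "general": "ERROR|EXCEPTION",
    "dotnet": "NullReferenceException|OutOfMemoryException|StackOverflowException",
    "database": "Connection failed|Connection refused|SQL Exception",
}


def build_query(pattern: str, limit: int = 100) -> str:
    """Wrap a regex pattern in the common fields/filter/sort/limit query."""
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{pattern}/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


def run_query(logs_client, log_group, query, start, end,
              poll_seconds=1.0, max_polls=60):
    """Start a query (start/end in epoch seconds), poll until done, return rows as dicts."""
    query_id = logs_client.start_query(
        logGroupName=log_group, startTime=start, endTime=end, queryString=query
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [{c["field"]: c["value"] for c in row} for row in response["results"]]
        if response["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {response['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")


print(build_query(ERROR_PATTERNS["database"]))
```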
4 Raise Issue Tickets
Automating email support tickets involves setting up a process in which CloudWatch automatically notifies the Discovery Support team of issues by email.
This can be achieved by using AWS services to send emails based on the results of CloudWatch Logs Insights queries. We can use Amazon Simple Email Service (SES) to send emails that automatically create or update tickets.
Here's how to set it up:
1. Configure SES with the necessary permissions to send emails to Discovery Support.
2. Create a Lambda function that runs the CloudWatch Logs Insights queries defined in section 3 and acts on their results.
3. Parse the query results to identify exceptions.
4. Format the email content based on the exceptions found.
5. Use Amazon SES to send the email to Discovery Support.
6. Trigger the Lambda function at scheduled intervals, e.g. every 30 minutes.
a. We can have multiple Lambda functions that run at different frequencies if required.
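Steps 3 to 5 above could look roughly like the following inside the Lambda function. This is a sketch under assumptions: query rows are taken to be dicts keyed by field name, the subject format and addresses are invented, and an SES client (for example `boto3.client("ses")`) is passed in so that its `send_email` operation can be called.

```python
def summarize(rows):
    """Step 3: count occurrences of each distinct log message."""
    counts = {}
    for row in rows:
        message = row.get("@message", "").strip()
        counts[message] = counts.get(message, 0) + 1
    return counts


def format_email(counts, window="last 30 minutes"):
    """Step 4: build a subject and a plain-text body, most frequent message first."""
    total = sum(counts.values())
    subject = f"[Alert] {total} error log entries in the {window}"
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    body = "\n".join(f"{n} x {message}" for message, n in ordered)
    return subject, body


def send_ticket_email(ses_client, sender, recipient, subject, body):
    """Step 5: send the email through SES using the injected client."""
    return ses_client.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )


sample_rows = [
    {"@message": "NullReferenceException in OrderService"},
    {"@message": "SQL Exception: timeout"},
    {"@message": "NullReferenceException in OrderService"},
]
subject, body = format_email(summarize(sample_rows))
print(subject)
print(body)
```

Step 6 (the schedule) lives outside the function code, typically as a scheduled rule that invokes the Lambda function at the chosen interval.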
4.1 Avoid raising duplicate tickets
a. Create an S3 bucket to store the Lambda logs.
b. Log content: Each log entry should contain identifiers for the tickets created, such as timestamp, ticket description, error/message ID, or any relevant detail that helps in identifying duplicates.
c. Log check
i. The Lambda function reads the log file from S3.
ii. Check for existing tickets: Parse the content of the log file to check whether a ticket for the current issue has already been raised. Implement logic based on timestamps, error identifiers, or message content to determine duplication.
iii. Proceed based on the check: If an existing ticket is found, update the log with the new occurrence details but skip creating a new ticket. Otherwise, create a new ticket and log the action.
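The duplicate check described above can be sketched as a fingerprint lookup against a JSON-serializable ledger. Reading and writing the ledger in S3 is left out here; the fingerprint rule (masking UUIDs and numbers) and the 24-hour suppression window are assumptions made for the example.

```python
import hashlib
import re
from datetime import datetime, timedelta, timezone

_UUID = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"


def fingerprint(message: str) -> str:
    """Stable key for an error: UUIDs and numbers are masked so repeats match."""
    normalized = re.sub(_UUID, "#", message)
    normalized = re.sub(r"\d+", "#", normalized).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def should_raise_ticket(ledger: dict, message: str, now: datetime,
                        window: timedelta = timedelta(hours=24)) -> bool:
    """Update the ledger in place; True only if no ticket exists within the window."""
    key = fingerprint(message)
    entry = ledger.get(key)
    if entry and now - datetime.fromisoformat(entry["ticket_raised_at"]) < window:
        # Existing ticket: record the new occurrence, skip ticket creation.
        entry["occurrences"] += 1
        entry["last_seen"] = now.isoformat()
        return False
    ledger[key] = {
        "ticket_raised_at": now.isoformat(),
        "last_seen": now.isoformat(),
        "occurrences": 1,
        "sample": message,
    }
    return True


ledger = {}
t0 = datetime(2024, 1, 29, 14, 0, tzinfo=timezone.utc)
print(should_raise_ticket(ledger, "Timeout after 30 s for order 1001", t0))
print(should_raise_ticket(ledger, "Timeout after 45 s for order 2002",
                          t0 + timedelta(minutes=30)))
```

Because the ledger holds only strings and integers, it can be stored in the S3 bucket as a single JSON document and reloaded at the start of each Lambda run.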
4.2 Configuration to turn off the notifications
Provide an option to turn off notifications so that technical teams are not overloaded with alerts until the issue is addressed.
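One simple way to provide such a switch is a pair of environment variables on the Lambda function that are checked before any email is sent. The variable names `ALERTS_ENABLED` and `ALERTS_MUTED_UNTIL` are hypothetical, and the mute timestamp is assumed to be an ISO 8601 value with an explicit UTC offset.

```python
import os
from datetime import datetime, timezone


def notifications_enabled(env=os.environ, now=None) -> bool:
    """False when alerts are switched off or muted until a future time."""
    if env.get("ALERTS_ENABLED", "true").strip().lower() in ("false", "0", "off"):
        return False
    muted_until = env.get("ALERTS_MUTED_UNTIL")
    if muted_until:
        now = now or datetime.now(timezone.utc)
        # Expected form: 2024-01-29T18:00:00+00:00 (offset required).
        return now >= datetime.fromisoformat(muted_until)
    return True


now = datetime(2024, 1, 29, 17, 0, tzinfo=timezone.utc)
print(notifications_enabled({"ALERTS_MUTED_UNTIL": "2024-01-29T18:00:00+00:00"}, now))
```

A time-limited mute avoids the risk of notifications staying off after the issue has been resolved.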
4.3 Clear Logs
Every few hours, copy the error logs to an S3 bucket. Clear the logs every XX hours. Provide S3 access to QA and a few technical teams.
4.4 Strategies to manage a large number of alerts
If a large number of alerts is generated (for example, HTTP 500 errors), then instead of raising a ticket for each error, configure alerts based on error rates or on spikes in HTTP 500 errors over a certain threshold.
· This Logs Insights query groups exceptions by message within 30-minute windows and counts the occurrences:
fields @message, @timestamp
| filter @message like /Exception/
| stats count(*) as exceptionCount by bin(30m) as timeWindow, @message
| sort timeWindow desc
| limit 100
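The grouped results of the query above can then drive a single rate-based alert per window instead of one ticket per error. In this sketch the rows are assumed to be dicts keyed by the field names used in the query, with counts arriving as strings; the threshold of 100 and the sample values are arbitrary.

```python
def windows_over_threshold(rows, threshold):
    """Sum exceptionCount per time window; return windows whose total exceeds threshold."""
    totals = {}
    for row in rows:
        window = row["timeWindow"]
        totals[window] = totals.get(window, 0) + int(row["exceptionCount"])
    return {window: total for window, total in totals.items() if total > threshold}


sample = [
    {"timeWindow": "2024-01-29 14:00:00.000", "@message": "HTTP 500 A", "exceptionCount": "80"},
    {"timeWindow": "2024-01-29 14:00:00.000", "@message": "HTTP 500 B", "exceptionCount": "45"},
    {"timeWindow": "2024-01-29 13:30:00.000", "@message": "HTTP 500 A", "exceptionCount": "12"},
]
print(windows_over_threshold(sample, threshold=100))
```

Only the 14:00 window (125 exceptions in total) would produce a ticket here; the quieter 13:30 window is ignored.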
5 Dashboards
The following dashboards are proposed for observability of the collected metrics.
5.1 Overview Dashboard
Purpose: To provide a high-level view of the overall health of the application.
Widgets:
· CPU and Memory Usage: Display graphs for CPU usage and memory metrics across different microservices.
· Garbage Collection Metrics: Visualize GC heap size and % time in GC.
· Throughput and Error Rates: Show the rate of requests and errors over time, including HTTP status codes.
5.2 Performance Dashboard
Purpose: To focus on the performance aspects of the application.
Widgets:
· I/O Metrics: Graphs for disk and network I/O metrics.
· Thread and Task Metrics: Display thread count, thread pool count, and lock contention rate.
· ASP.NET Core Specific Metrics: Visualize request queue length, failed requests, and response success rate.
5.3 Error and Anomaly Detection Dashboard
Purpose: To quickly identify and diagnose common error patterns and anomalies.
Widgets:
· Error and Exception Graphs: Filters and graphs for general errors, .NET-specific exceptions, and HTTP status codes.
· Database Connection Errors: Display trends and spikes in database connectivity issues.
· Security and Configuration Errors: Graphs for authentication failures, configuration errors, and connectivity issues.
5.4 AWS Services Monitoring Dashboard (SNS and SQS)
Purpose: To monitor the performance and health of the AWS services used by the application.
Widgets:
· SNS Metrics: Display NumberOfNotificationsDelivered, NumberOfNotificationsFailed
· SQS Metrics: Show NumberOfMessagesDeleted, ApproximateNumberOfMessagesVisible
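Dashboards such as the SNS and SQS one above can be defined as a JSON body and created through the CloudWatch PutDashboard API. The sketch below only builds that body; the topic name, queue name, region, widget titles, and layout are placeholder assumptions.

```python
import json


def metric_widget(title, metrics, x, y, stat="Sum", region="us-east-1"):
    """One time-series widget on the 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,   # rows of [namespace, metric, dimension name, dimension value]
            "stat": stat,
            "period": 300,
            "region": region,
            "view": "timeSeries",
        },
    }


def messaging_dashboard(topic_name: str, queue_name: str) -> dict:
    """Dashboard body with one SNS widget and one SQS widget side by side."""
    return {
        "widgets": [
            metric_widget("SNS delivery", [
                ["AWS/SNS", "NumberOfNotificationsDelivered", "TopicName", topic_name],
                ["AWS/SNS", "NumberOfNotificationsFailed", "TopicName", topic_name],
            ], x=0, y=0),
            metric_widget("SQS throughput and backlog", [
                ["AWS/SQS", "NumberOfMessagesDeleted", "QueueName", queue_name],
                ["AWS/SQS", "ApproximateNumberOfMessagesVisible", "QueueName", queue_name],
            ], x=12, y=0),
        ]
    }


dashboard_body = json.dumps(messaging_dashboard("order-events", "order-queue"))
# With boto3 available, this could be applied as:
#   boto3.client("cloudwatch").put_dashboard(
#       DashboardName="aws-services-monitoring", DashboardBody=dashboard_body)
print(dashboard_body)
```

The other proposed dashboards would follow the same pattern, with the agent's custom namespace and metric names in place of the `AWS/SNS` and `AWS/SQS` entries.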
6 Conclusion
In conclusion, integrating AWS monitoring tools such as CloudWatch and Application Insights with custom metrics provides a robust framework for observability in the application. This approach centralizes log management and performance monitoring. The suggested dashboards offer real-time insights into the application's operational health, performance bottlenecks, and potential areas of improvement, supporting high availability and optimal performance of the services. By continuously monitoring these metrics, the team can swiftly respond to issues, maintain service quality, and improve customer satisfaction. Additionally, these insights can guide future enhancements and resource optimization strategies for the platform.