Sequence list with relevant compliance and data protection controls to implement data governance in the system
Vamsi Tokala's Blog
Things I learn about and want to share.
Wednesday, August 28, 2024
Friday, April 26, 2024
Claim Based Authorization
1. Claim Based Authorization
· Token Validation: As requests come into the Ocelot API Gateway, the first step is to validate the JWT token issued by FAMS. This validation checks the token's integrity and authenticity.
· Fetch User Claims: Once the token is validated, Ocelot communicates with the admin microservice to retrieve the claims related to the user's roles and permissions. This is crucial for implementing fine-grained access control based on the roles associated with the token's user.
· Validate Token
    o Add custom middleware in Ocelot to intercept incoming requests. Extract the JWT token from the Authorization header. Validate the token's signature, issuer, and expiration using FAMS's KID (same as the H2M token validation strategy).
· Retrieve User Claims
    o After successful token validation, extract the user identifier from the token (the claim that identifies the user).
    o Make an API call from Ocelot to the admin microservice, passing the user identifier to fetch the corresponding roles and permissions.
    o The admin microservice should respond with the necessary claims, which define what actions the user is authorized to perform.
· Enforce Authorization
    o Use the fetched claims to enforce authorization policies within Ocelot. This can be done through route rules in the Ocelot configuration.
    o Based on the claims, decide whether to forward the request to downstream services or reject it.
· Caching
    o Cache roles and permissions in Ocelot if they do not change frequently, to reduce the number of requests to the admin microservice.
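The retrieve, enforce, and cache steps above can be sketched in a few lines. This is a minimal, language-neutral illustration in Python rather than Ocelot middleware; the route table, the permission names, and the `fetch_claims` callback (standing in for the call to the admin microservice) are all hypothetical.

```python
import time
from typing import Callable, Dict, FrozenSet, Optional, Tuple

# Hypothetical route rules: route prefix -> permission claim required to call it.
ROUTE_RULES: Dict[str, str] = {
    "/samples": "samples.read",
    "/admin": "admin.manage",
}


class ClaimsCache:
    """Caches each user's permissions for ttl_seconds to spare the admin service."""

    def __init__(self, fetch_claims: Callable[[str], FrozenSet[str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_claims          # stands in for the admin microservice call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def get(self, user_id: str) -> FrozenSet[str]:
        now = self._clock()
        cached = self._entries.get(user_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        claims = self._fetch(user_id)
        self._entries[user_id] = (now, claims)
        return claims


def required_permission(path: str) -> Optional[str]:
    """Return the permission for the longest matching route prefix, if any."""
    matches = [prefix for prefix in ROUTE_RULES if path.startswith(prefix)]
    return ROUTE_RULES[max(matches, key=len)] if matches else None


def is_authorized(cache: ClaimsCache, user_id: str, path: str) -> bool:
    """Forward only when the route is known and the user holds its permission."""
    needed = required_permission(path)
    if needed is None:
        return False                        # unknown routes are rejected by default
    return needed in cache.get(user_id)
```

The time-to-live is the trade-off to tune: a longer value saves calls to the admin microservice, but a revoked role stays effective until the cached entry expires.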
2. Cross Zone Authorization
Users who are allowed to make a cross zone call will have a role defined in the admin microservice (or in IAM). That scope is added to the Authorization header, which is then used to permit the cross zone API call; otherwise the request is rejected in its own zone.
For a cross zone call, add a custom Boolean claim indicating cross zone access.
· Ocelot receives the cross zone request with the role and extracts the JWT token.
· Ocelot forwards the token to the authorization service.
· The authorization service validates the token and checks the cross zone permission.
· The authorization service allows or denies the request.
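The decision made by the authorization service could look roughly like the sketch below. The claim names `cross_zone_access` and `zone` are hypothetical (the text only says that a Boolean custom claim is added), and treating same-zone calls as always allowed is an assumption about the intended behavior.

```python
from typing import Any, Mapping

# Hypothetical claim names; the post only says a Boolean custom claim is added.
CROSS_ZONE_CLAIM = "cross_zone_access"
ZONE_CLAIM = "zone"


def allow_request(claims: Mapping[str, Any], target_zone: str) -> bool:
    """Allow same-zone calls; allow cross zone calls only with the Boolean flag."""
    home_zone = claims.get(ZONE_CLAIM)
    if home_zone is None:
        return False                      # no zone information: deny
    if home_zone == target_zone:
        return True                       # request stays in the caller's own zone
    # Require a real Boolean True, not just any truthy value such as "false".
    return claims.get(CROSS_ZONE_CLAIM) is True
```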
Monday, April 15, 2024
Barcode Printing Solution
Tuesday, March 26, 2024
Authentication in depth in Microservices architecture
I had posted an article on two tier authentication earlier. Please read the post below for details
https://vamsitokala.blogspot.com/2023/09/two-tier-authentication-in.html
I am adding some more insights on the Authentication process in microservices architecture
1. Angular Client (User) to Microservice Communication
For client-to-microservice interactions where the client is an Angular web application with FAMS as the IDP:
· Client Authenticates Directly with FAMS:
    o The Angular UI initiates the authentication process with FAMS. This involves redirecting the user to the login page of the IDP (intranet).
    o Upon authentication, the IDP issues an ID token, an access token, and a refresh token.
· Sending Requests to Microservices:
    o Client Sends Request with Access Token: When making requests to the microservices, the client includes the access token in the Authorization header.
    o These requests are directed to the API gateway (Ocelot), which acts as the entry point to the microservices architecture.
· API Gateway Validates Token:
    o The API gateway (Ocelot) intercepts the incoming request, validates the token (checks its expiration and verifies its signature against the keys from the IDP's JWKS endpoint) and, if it is valid, forwards the request to the appropriate microservice.
    o The microservices themselves might not need to validate the token again; they trust the API gateway's authentication process.
· Cache
    o Caching Public Keys: Cache the keys from FAMS's JWKS endpoint at the API gateway to minimize network calls for key retrieval during token validation.
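The public-key caching mentioned above can be sketched as a small cache keyed by the token's `kid`. This is an illustration in Python, not Ocelot code; `fetch_jwks` is a hypothetical callback standing in for an HTTP GET of the FAMS JWKS endpoint, and the one-hour lifetime is an arbitrary assumption.

```python
import time
from typing import Any, Callable, Dict, Optional

Jwk = Dict[str, Any]


class JwksCache:
    """Keeps signing keys by `kid`; refetches on expiry or on an unknown `kid`."""

    def __init__(self, fetch_jwks: Callable[[], Dict[str, Any]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_jwks            # stands in for an HTTP GET of the JWKS URL
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, Jwk] = {}
        self._loaded_at: Optional[float] = None

    def _refresh(self) -> None:
        document = self._fetch()
        # A JWKS document is {"keys": [ {"kid": ..., "kty": ..., ...}, ... ]}
        self._keys = {k["kid"]: k for k in document.get("keys", []) if "kid" in k}
        self._loaded_at = self._clock()

    def get_key(self, kid: str) -> Optional[Jwk]:
        expired = (self._loaded_at is None
                   or self._clock() - self._loaded_at >= self._ttl)
        if expired or kid not in self._keys:
            self._refresh()                 # also covers key rotation at the IDP
        return self._keys.get(kid)
```

Refetching on an unknown `kid` lets the gateway pick up rotated keys without a restart. In a real gateway that refetch would probably need rate limiting, since tokens carrying made-up key IDs would otherwise trigger a network call each time.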
2. Microservice to Microservice (Service-to-Service) Communication
For microservice-to-microservice communication that requires server-to-server authentication without user context:
· Microservices Act as Clients to Cognito:
    o In this scenario, a microservice acts as a "client" and uses the OAuth Client Credentials flow to authenticate with AWS Cognito and obtain an access token.
· Obtaining a Token Before Reaching the API Gateway
    o A microservice that needs to make a request to another microservice through the API gateway first obtains a token from Cognito using its own credentials (client ID and secret).
· Include the Token in API Calls:
    o The service then makes the API call through the API gateway, including the access token in the Authorization header.
· API Gateway Validates the Token:
    o As with user-initiated requests, the API gateway validates the token before routing the request to the target microservice.
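As a rough sketch of the Client Credentials exchange, the request to the Cognito token endpoint can be built as shown below. The domain prefix, region, and scope are placeholders, the helper names are hypothetical, and the URL pattern assumes a Cognito-hosted domain prefix rather than a custom domain. The request is only constructed here, not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(domain: str, region: str, client_id: str,
                        client_secret: str, scope: str = "") -> urllib.request.Request:
    """Build (but do not send) a Client Credentials request for a Cognito domain."""
    url = f"https://{domain}.auth.{region}.amazoncognito.com/oauth2/token"
    form = {"grant_type": "client_credentials"}
    if scope:
        form["scope"] = scope
    # The app client authenticates with HTTP Basic: base64("client_id:client_secret").
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        url,
        data=urllib.parse.urlencode(form).encode(),
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def bearer_header(access_token: str) -> dict:
    """Header the calling service attaches when it goes through the API gateway."""
    return {"Authorization": f"Bearer {access_token}"}
```

Passing the request to `urllib.request.urlopen` would return a JSON body whose `access_token` field is then sent through `bearer_header` on calls to the gateway. Caching that token until shortly before it expires avoids a round trip to Cognito on every call.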
3. Implementation Considerations
By centralizing authentication logic at the API gateway level, you can streamline security management and ensure consistent authentication across all services.
· Secure Storage of Credentials
    o Services using the Client Credentials flow must securely store their AWS Cognito credentials (client ID and secret) in AWS Secrets Manager.
· Configuration in AWS Cognito
    o Create a User Pool: all microservices are part of a single eLIMS system sharing the same user base, so a single user pool might be sufficient.
    o Configure each microservice as an App Client in the Cognito user pool
        § Enable the OAuth flow for each client for M2M
· Token Validation
    o IDP (intranet) Token Validation
        § Validate the token signature against the JWKS endpoint for FAMS
        § Check:
            · Token expiration (exp)
            · Issuer (iss)
            · Audience (aud): the audience should match the app client ID
    o Cognito Token Validation
        § When AWSCognito:Authority is set, the JWKS is retrieved from AWS Cognito automatically and the token is validated implicitly
        § In some cases we want to disable audience validation. For example, if we have a microservices architecture where multiple APIs use the same JWT for authentication, we might set ValidateAudience to false to allow any of the APIs to accept the token.
        § In general, it's recommended to keep ValidateAudience set to true to ensure the token is intended for the correct recipient, but the final decision depends on the specific use case and security requirements.
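The expiration, issuer, and audience checks listed above can be sketched as a single function over the decoded token payload. This is a Python illustration, not the .NET middleware itself; the function name and its `validate_audience` parameter (mirroring the ValidateAudience setting) are hypothetical, and signature verification against the JWKS is assumed to have been done before this step.

```python
import time
from typing import Any, Mapping, Optional


def validate_claims(payload: Mapping[str, Any], expected_issuer: str,
                    expected_audience: Optional[str] = None,
                    validate_audience: bool = True,
                    now: Optional[float] = None) -> None:
    """Raise ValueError unless exp, iss and (optionally) aud are acceptable.

    Signature verification against the JWKS is assumed to have happened already.
    """
    current = time.time() if now is None else now
    exp = payload.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ValueError("token is expired or has no exp claim")
    if payload.get("iss") != expected_issuer:
        raise ValueError("unexpected issuer")
    if validate_audience:
        aud = payload.get("aud")
        # The aud claim may be a single string or a list of strings.
        audiences = aud if isinstance(aud, list) else [aud]
        if expected_audience is None or expected_audience not in audiences:
            raise ValueError("audience does not match the app client id")
```

Calling it with `validate_audience=False` corresponds to the shared-token case described above, where any of the APIs may accept the token.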
Monday, March 18, 2024
Observability in AWS
1 Introduction
The application logs across microservices are currently
centralized in CloudWatch. However, the
true potential of this data is only realized when it is proactively monitored
and analyzed for insights.
With the increasing complexity of .NET applications, it becomes
essential to gather custom metrics that go beyond standard logs. Tools like
PerfMon provide deep insights into vital performance indicators such as CPU
usage, memory leaks, and exception rates. By integrating these tools with AWS
CloudWatch, we can create a robust monitoring ecosystem tailored to our
application's specific needs.
2 Amazon CloudWatch Application Insights
2.1 Enable CloudWatch agent
· Install CloudWatch Agent: On EC2 instances where .NET and SQL Server applications are running, install the CloudWatch Agent. This agent collects metrics and logs from servers and sends them to CloudWatch.
· Configure the Agent: Configure the CloudWatch Agent to collect logs and metrics. For .NET applications, we want to collect logs from IIS, application event logs, and custom application logs. For SQL Server, collect SQL Server logs and Windows application event logs.
· Set up and configure CloudWatch Application Insights.
    o Application Insights offers two avenues: automatic collection and custom metric integration. For advanced debugging (garbage collection, thread metrics, I/O metrics, ASP.NET Core metrics), we need to configure custom metrics.
    o Using the CloudWatch agent for Windows:
2.2 Custom metrics
The AWS CloudWatch agent can collect performance counter data from Windows servers and send it to CloudWatch.
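As a hedged illustration, the metrics section of a CloudWatch agent configuration for Windows performance counters can be generated as below. The namespace and the choice of counters are assumptions made for this sketch, and the ".NET CLR Memory" counter object applies to .NET Framework processes, so it may not be available for every workload.

```python
import json

# Hypothetical selection of Windows performance counters; adjust per application.
# The ".NET CLR Memory" object exists for .NET Framework processes; it is used
# here only as an illustration of the "% Time in GC" metric.
agent_config = {
    "metrics": {
        "namespace": "eLIMS/Custom",      # hypothetical namespace
        "metrics_collected": {
            "Processor": {
                "measurement": ["% Processor Time"],
                "resources": ["_Total"],
                "metrics_collection_interval": 60,
            },
            "Process": {
                "measurement": ["Private Bytes", "Virtual Bytes", "Working Set"],
                "resources": ["*"],
                "metrics_collection_interval": 60,
            },
            ".NET CLR Memory": {
                "measurement": ["% Time in GC"],
                "resources": ["*"],
                "metrics_collection_interval": 60,
            },
        },
    }
}


def render_config(config: dict) -> str:
    """Serialize the configuration as the JSON file the agent reads."""
    return json.dumps(config, indent=2)
```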
The tables below serve as a guideline for setting up CloudWatch Alarms and notifying support teams.
Metric | Description | Acceptable Threshold | Alert Trigger Condition
CPU Usage | Monitor CPU utilization of microservices and batch jobs | Below 85% | CPU usage > 85% for 5 minutes
Memory Usage (Private Bytes) | Committed process memory (managed and native) | Varies by application | Usage > 90% of allocated memory for 5 minutes
Memory Usage (Virtual Bytes) | Total virtual memory allocated for the process | Varies by application. To be revisited later | Usage > 90% of allocated memory for 5 minutes
Memory Usage (Working Set) | Physical memory consumed by the process | Varies by application. To be revisited later | Usage > 90% of allocated memory for 5 minutes
Garbage Collection (% Time in GC) | Percentage of time spent in garbage collection | Below 10% | > 10% for 5 minutes
Request Queue Length | Number of requests waiting to be processed | Below 50 | > 50 for 5 minutes
Response Success Rate | Percentage of successful responses | Above 95% | < 95% for 5 minutes
HTTP Status 5xx | Server error responses | 0 | Any occurrence
HTTP Status 4xx | Client error responses | Varies by application. To be revisited later | Increase > 50% compared to 24-hour average
Database Connection Errors | Issues connecting to the database | 0 | Any occurrence
Authentication Failures | Unauthorized access attempts | 0 | Any occurrence
Message Queue
Service | Metric | Description | Acceptable Threshold | Alert Trigger Condition
SNS | NumberOfNotificationsFailed | The number of notifications that failed to deliver. | 0 | Any failures
SQS | ApproximateAgeOfOldestMessage | The age of the oldest message in the queue, indicating processing delays. | 60 minutes | Messages older than 60 minutes, indicating slow processing or a stalled consumer
SQS | NumberOfMessagesDeleted | The number of messages deleted from the queue, which indicates successful processing. | Consistent with expected volume | Decrease by 50% over 30 minutes compared to the average rate, indicating processing issues
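Two of the message queue rows can be turned into alarm definitions along the following lines. This sketch only builds the keyword arguments in the shape of CloudWatch's PutMetricAlarm API; the queue name, topic name, and topic ARN are hypothetical, and the five-minute evaluation period is an assumption.

```python
from typing import Any, Dict, List


def alarm_params(name: str, namespace: str, metric: str, dimension: Dict[str, str],
                 threshold: float, period_seconds: int, periods: int,
                 statistic: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments in the shape CloudWatch's PutMetricAlarm expects."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimension.items()],
        "Statistic": statistic,
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],   # e.g. a topic that notifies support
    }


# Hypothetical queue, topic and ARN names.
SUPPORT_TOPIC = "arn:aws:sns:eu-west-1:123456789012:support-alerts"

ALARMS: List[Dict[str, Any]] = [
    # SQS: oldest message older than 60 minutes (the metric is in seconds).
    alarm_params("sqs-oldest-message", "AWS/SQS", "ApproximateAgeOfOldestMessage",
                 {"QueueName": "orders-queue"}, threshold=60 * 60,
                 period_seconds=300, periods=1, statistic="Maximum",
                 sns_topic_arn=SUPPORT_TOPIC),
    # SNS: any failed notification.
    alarm_params("sns-failed-notifications", "AWS/SNS", "NumberOfNotificationsFailed",
                 {"TopicName": "orders-topic"}, threshold=0,
                 period_seconds=300, periods=1, statistic="Sum",
                 sns_topic_arn=SUPPORT_TOPIC),
]
```

With boto3 available, each dictionary could be passed as `put_metric_alarm(**params)` on a CloudWatch client.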
2.3 Tickets and prioritization
The priority for raising tickets based on the metrics provided should be determined by the impact and urgency of each alert condition. An alert has to be sent to Discovery Support based on the alert trigger conditions defined in the tables above. The priority levels can be reviewed and adjusted based on usage and business impact.
Critical Priority
· CPU Usage: CPU usage > 95% for 5 minutes. High CPU utilization can lead to service degradation and outages.
· Failed Requests: Any occurrence. Indicates a failure in processing requests, directly impacting user transactions. Context log
· HTTP Status 5xx: Any occurrence. Server errors directly affect the availability of services.
· Database Connection Errors: Any occurrence. Indicates problems accessing the database, which could cripple application functionality.
High Priority
· Memory Usage (Private Bytes, Virtual Bytes, Working Set): Usage > 90% of allocated memory for 5 minutes. High memory usage can lead to application crashes or severe performance degradation.
· Garbage Collection (% Time in GC): > 10% for 5 minutes. Excessive garbage collection can indicate memory leaks or inefficient memory use, impacting performance.
Medium Priority
· Request Queue Length: > 50 for 5 minutes. While this indicates a backlog, it may be manageable in the short term but needs to be monitored for worsening trends.
· HTTP Status 4xx: Increase > 50% compared to 24-hour average. These are client errors and may not always indicate a server-side issue, but a significant increase could point to API or user experience problems.
· Message Queue metrics
3 AWS CloudWatch Logs Insights: Common Error Pattern Queries
This section details a set of queries for AWS CloudWatch Logs
Insights, designed to efficiently identify and diagnose common error
patterns in log data. It includes queries for general errors, .NET exceptions,
HTTP status codes, database connection issues, and more, enhancing the
monitoring and troubleshooting process.
3.1 CorrelationID
A correlation ID is injected by the application and is used to track the request flow. This ID ties together all actions and requests that are part of the same user action or transaction. Logging it with each request helps in tracing the flow of a request across different services and components.
· X-Correlation-Id for User Actions: The application generates a unique X-Correlation-Id for the entire user action sequence (e.g., submitting a form) at the frontend level (Angular app). This provides a consistent identifier to track the flow of the action across all involved components and services.
· Propagation of Identifiers: The application forwards the X-Correlation-Id through all service calls, including those to subsequent microservices or AWS services (SNS/SQS), so that every transactional step in the process can be traced back to the original user action.
fields @timestamp, @message
| filter @message like /X-Correlation-Id: 20240129140153_86d1715f-b839-45c9-b857-66b27922305f/
| sort @timestamp desc
| limit 100
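The generation and propagation rules above can be sketched on the service side as follows. The ID layout (timestamp, underscore, UUID) is inferred from the sample in the query; using UTC for the timestamp and the helper names are assumptions of this sketch.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Mapping, Optional

HEADER = "X-Correlation-Id"


def new_correlation_id(now: Optional[datetime] = None) -> str:
    """Build an ID shaped like the sample above: <UTC timestamp>_<uuid4>."""
    moment = now or datetime.now(timezone.utc)
    return f"{moment:%Y%m%d%H%M%S}_{uuid.uuid4()}"


def outgoing_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Reuse the caller's correlation ID, or start a new one at the edge."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in incoming.items():
        if name.lower() == HEADER.lower() and value:
            return {HEADER: value}
    return {HEADER: new_correlation_id()}


def log_line(correlation_id: str, message: str) -> str:
    """Prefix each log message so the Logs Insights filter above can match it."""
    return f"{HEADER}: {correlation_id} {message}"
```

For SNS or SQS hops, the same value would travel as a message attribute instead of an HTTP header.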
3.2 General Errors and Exceptions
fields @timestamp, @message
| filter @message like /ERROR|EXCEPTION/
| sort @timestamp desc
| limit 100
3.3 .NET Specific Exceptions
fields @timestamp, @message
| filter @message like /NullReferenceException|OutOfMemoryException|StackOverflowException/
| sort @timestamp desc
| limit 100
3.4 HTTP Status Codes (Client and Server Errors)
fields @timestamp, @message
| filter @message like /HTTP\/\d\.\d\" 5\d\d|HTTP\/\d\.\d\" 4\d\d/
| sort @timestamp desc
| limit 100
3.5 Database Connection Errors
fields @timestamp, @message
| filter @message like /Connection failed|Connection refused|SQL Exception/
| sort @timestamp desc
| limit 100
3.6 Authentication and Security Issues
fields @timestamp, @message
| filter @message like /unauthorized|forbidden|access denied|authentication failed/
| sort @timestamp desc
| limit 100
3.7 Configuration Errors
fields @timestamp, @message
| filter @message like /Configuration error|invalid configuration|missing configuration/
| sort @timestamp desc
| limit 100
3.8 Connectivity and Timeout Issues
fields @timestamp, @message
| filter @message like /Connection timeout|Connection lost|unable to connect/
| sort @timestamp desc
| limit 100
3.9 Service or Dependency Failures
fields @timestamp, @message
| filter @message like /Service unavailable|dependency failure|failed to connect to service/
| sort @timestamp desc
| limit 100
4 Raise issue Tickets
Automating email support tickets involves setting up a process where CloudWatch automatically notifies the Discovery Support team of issues through email.
This can be achieved by using AWS services to send emails based on the results from CloudWatch Logs Insights. We can use Amazon Simple Email Service (SES) to send emails that automatically create or update tickets.
Here's how to set it up:
1. Configure SES with the necessary permissions to send emails to Discovery Support
2. Create a Lambda function that processes the CloudWatch Logs Insights results for the queries defined in section 3
3. Parse the query results to identify exceptions
4. Format the email content based on the exceptions found
5. Use Amazon SES to send the mail to Discovery Support
6. Trigger the Lambda functions at scheduled intervals, e.g. every 30 minutes
    a. We can have multiple Lambda functions which can run at different frequencies if required.
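Steps 3 and 4 above (parsing the results and formatting the email) might look like the sketch below. It assumes the row shape returned by the Logs Insights GetQueryResults API, where each row is a list of field/value pairs; the function names, subject format, and line limit are hypothetical.

```python
from typing import Dict, List

# Shape returned by the Logs Insights GetQueryResults API:
# a list of rows, each row a list of {"field": ..., "value": ...} pairs.
Row = List[Dict[str, str]]


def rows_to_dicts(results: List[Row]) -> List[Dict[str, str]]:
    """Flatten each result row into a plain {field: value} dictionary."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def build_email(query_name: str, results: List[Row], max_lines: int = 20) -> Dict[str, str]:
    """Return a subject and body summarizing the matched log events."""
    events = rows_to_dicts(results)
    lines = [f"{e.get('@timestamp', '?')}  {e.get('@message', '').strip()}"
             for e in events[:max_lines]]
    if len(events) > max_lines:
        lines.append(f"... and {len(events) - max_lines} more")
    return {
        "subject": f"[{query_name}] {len(events)} matching log events",
        "body": "\n".join(lines),
    }
```

The returned subject and body would then be handed to SES in step 5. Capping the number of lines keeps a noisy half hour from producing an unreadable email.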
4.1 Avoid raising duplicate tickets
a. Create an S3 bucket to store the Lambda logs
b. Log Content: Each log entry should contain identifiers for the tickets created, such as timestamp, ticket description, error/message ID, or any relevant detail that helps in identifying duplicates.
c. Log check
    i. The Lambda function reads the log file from S3
    ii. Check for Existing Tickets: Parse the content of the log file to check if a ticket for the current issue has already been raised. Implement logic based on timestamps, error identifiers, or message content to determine duplication.
    iii. Proceed Based on Check: If an existing ticket is found, update the log with the new occurrence details but skip creating a new ticket. Otherwise, proceed to create a new ticket and log the action.
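One possible form of the duplicate check is sketched below. The ticket log is modeled as an in-memory dictionary that would be loaded from and written back to the S3 object. Fingerprinting messages by masking digits is an assumption of this sketch, not something the process above prescribes.

```python
import hashlib
import json
import re
from typing import Dict


def fingerprint(message: str) -> str:
    """Hash a message after masking digits so repeats of one error collide."""
    normalized = re.sub(r"\d+", "#", message.strip().lower())
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]


def record_occurrence(ticket_log: Dict[str, dict], message: str, timestamp: str) -> bool:
    """Update the log in place; return True when a new ticket should be raised."""
    key = fingerprint(message)
    entry = ticket_log.get(key)
    if entry is not None:
        entry["occurrences"] += 1          # known issue: log it, skip the ticket
        entry["last_seen"] = timestamp
        return False
    ticket_log[key] = {"description": message.strip(), "first_seen": timestamp,
                       "last_seen": timestamp, "occurrences": 1}
    return True


def dump_log(ticket_log: Dict[str, dict]) -> str:
    """Serialize the log, e.g. for writing back to the S3 object."""
    return json.dumps(ticket_log, indent=2, sort_keys=True)
```

Masking digits means that "request 101" and "request 202" failures map to the same ticket. A stricter or looser normalization may fit better depending on how the log messages are worded.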
4.2 Configuration to turn off the notifications
Provide an option to turn off notifications, so that technical teams are not overloaded with alerts until the issue is addressed.
4.3 Clear Logs
Every few hours, copy the error logs to S3 buckets. Clear the logs every XX hours. Provide access to the S3 bucket to QA and a few tech teams.
4.4 Strategies to manage a large number of Alerts
If a large number of alerts is generated (for example HTTP 500 errors), then instead of raising a ticket for each error, configure alerts based on error rates or spikes in HTTP 500 errors over a certain threshold.
· This query groups exceptions by message within 30-minute windows, counting occurrences in Logs Insights
fields @message, @timestamp
| filter @message like /Exception/
| stats count(*) as exceptionCount by bin(30m) as timeWindow, @message
| sort timeWindow desc
| limit 100
5 Dashboards
A few dashboards are proposed for observability on the metrics collected.
5.1 Overview Dashboard
Purpose: To provide a high-level view of the overall health of the application.
Widgets:
· CPU and Memory Usage: Display graphs for CPU usage and memory metrics across different microservices.
· Garbage Collection Metrics: Visualize GC heap size and % time in GC.
· Throughput and Error Rates: Show the rate of requests and errors over time, including HTTP status codes.
5.2 Performance Dashboard
Purpose: To focus on the performance aspects of the application.
Widgets:
· I/O Metrics: Graphs for disk and network I/O metrics.
· Thread and Task Metrics: Display thread count, thread pool count, and lock contention rate.
· ASP.NET Core Specific Metrics: Visualize request queue length, failed requests, and response success rate.
5.3 Error and Anomaly Detection Dashboard
Purpose: To quickly identify and diagnose common error patterns and anomalies.
Widgets:
· Error and Exception Graphs: Filters and graphs for general errors, .NET specific exceptions, and HTTP status codes.
· Database Connection Errors: Display trends and spikes in database connectivity issues.
· Security and Configuration Errors: Graphs for authentication failures, configuration errors, and connectivity issues.
5.4 AWS Services Monitoring Dashboard (SNS and SQS)
Purpose: To monitor the performance and health of the AWS services used by the application.
Widgets:
· SNS Metrics: Display NumberOfNotificationsDelivered, NumberOfNotificationsFailed
· SQS Metrics: Show NumberOfMessagesDeleted, ApproximateNumberOfMessagesVisible
6 Conclusion
In conclusion, the integration of AWS monitoring tools such as CloudWatch and Application Insights, coupled with the use of custom metrics, provides a robust framework for observability in the application. This approach centralizes log management and performance monitoring. The suggested dashboards will offer real-time insights into the application's operational health, performance bottlenecks, and potential areas of improvement, helping to ensure high availability and optimal performance of the services. By continuously monitoring these metrics, the team can swiftly respond to issues, maintain service quality, and improve customer satisfaction. Additionally, these insights can guide future enhancements and resource optimization strategies for the platform.