Sunday, June 21, 2026

This Blog Has Migrated to Substack

To better share deep technical blueprints, system design frameworks, and enterprise architecture insights, I have officially moved all my technical writing over to a new, modern publication platform.

My new home is at: vamsitokala.substack.com

What to Expect on the New Platform:

  • Better Technical Layouts: Clearer system design breakdowns, visual workflows, and architectural patterns.

  • Direct Updates: You can subscribe for free to get new architecture breakdowns sent straight to your inbox without having to check back manually.

  • Upcoming Deep-Dives: I will be continuing my focus on enterprise data layers, cloud optimization, and building governed AI patterns for heavily regulated environments.

The complete archive, including my earlier write-ups, has already been fully migrated and is live right now.

Head over to the new link, explore the updated architecture blueprints, and join the free subscription list to stay connected:

👉 Click Here to Visit the New Publication

Wednesday, June 17, 2026

Designing a Governed AI Assistant for eLIMS

When I started thinking about this solution, the problem was very simple.

In eLIMS, users have a lot of data, but getting quick answers is not always easy. A user may want to know something like:

Show me studies not completed on time.

To answer this, the system has to check study details, planned completion date, completion data, actual completion date, user access, legal entity, and then decide whether the study is on time, delayed, or missing information.

So the question I had was:

Can we allow users to ask this in simple English, but still keep the backend safe and controlled?

That is the main problem this solution is solving.

Overall Design



The Main Design Decision

The most important decision I made was:

AI should not directly query the database.

AI is only used to understand what the user is asking.

It creates a plan.

That plan may be right or wrong. So I do not execute it directly. First, I send it to the validator.

The backend executes the plan only if the validator says it is safe.

This is the main difference between this solution and a normal chatbot.

A chatbot may answer directly.
Here, AI only prepares the plan.
The application still controls what can run.

Why I Designed It This Way

In a system like eLIMS, we cannot allow AI to freely access data.

There are roles.
There are legal entities.
There are allowed services.
There are allowed fields.
There are audit needs.
There should not be any write operation from this assistant.

So I kept a clear boundary.

AI can understand the question.
Validator decides whether the plan is allowed.
Execution engine decides what data to read.
Audit service records what happened.

This keeps the user experience simple, but the backend is still controlled.

What Happens When a User Asks a Question

Let us take this example:

Show me studies not completed on time.

The Angular UI sends this question to the .NET API.

The API asks the plan generator to convert this question into a structured plan. The plan may say that the study service and the core lab service are needed, and that the output should include delayed or indeterminate studies.

But before doing anything with data, the validator checks the plan.

It checks:


This validator is very important. Even if AI gives a wrong service name or tries to use a field that is not allowed, the request will stop there.

No data will be read.
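
The validation step above can be sketched roughly as follows. This is only an illustration in Python, not the actual eLIMS code (which runs on .NET). The names `Plan`, `ALLOWED_FIELDS`, and `validate_plan`, as well as the service and field names, are assumptions made up for this example.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list: service name -> fields the assistant may read.
ALLOWED_FIELDS = {
    "study": {"study_id", "planned_completion_date", "legal_entity"},
    "core_lab": {"study_id", "actual_completion_date"},
}


@dataclass
class Plan:
    operation: str  # only "read" is accepted by the validator
    fields_by_service: dict = field(default_factory=dict)


def validate_plan(plan: Plan) -> list:
    """Return a list of problems; an empty list means the plan may run."""
    problems = []
    # The assistant is read-only, so any other operation is rejected.
    if plan.operation != "read":
        problems.append(f"operation not allowed: {plan.operation}")
    for service, requested in plan.fields_by_service.items():
        allowed = ALLOWED_FIELDS.get(service)
        if allowed is None:
            problems.append(f"unknown service: {service}")
            continue
        for name in requested:
            if name not in allowed:
                problems.append(f"field not allowed: {service}.{name}")
    return problems
```

If `validate_plan` returns any problem, the request stops before the execution engine is called, so a wrong service name or a disallowed field never reaches the data layer.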

Execution Logic

Once the plan is valid, the execution engine takes over.

It checks whether the user has the right role and legal entity access.

After that, it reads the approved data and applies the business logic.

For the current use case, it checks study planned completion date and TestP actual completion date.

The classification is simple:

Case | Result
Actual completion date is before or equal to planned date | On Time
Actual completion date is after planned date | Delayed
Planned date or actual completion date is missing | Indeterminate

I kept Indeterminate as a separate category because missing data should not be hidden. In real systems, missing data is also an important signal.
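
The classification rule can be written as a small function. This is a minimal sketch in Python; the function name `classify_study` and its parameters are hypothetical and only illustrate the three cases described above.

```python
from datetime import date
from typing import Optional


def classify_study(planned: Optional[date], actual: Optional[date]) -> str:
    """Classify a study from its planned and actual completion dates."""
    # Missing data is reported explicitly instead of being guessed.
    if planned is None or actual is None:
        return "Indeterminate"
    # Completing on the planned date itself still counts as on time.
    return "On Time" if actual <= planned else "Delayed"
```

Checking for missing dates first means a study can never be reported as On Time or Delayed based on incomplete information.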

Why Service Registry Is Used

I did not want service names and fields to be hardcoded everywhere.

So I introduced a service registry.

The service registry tells the system:


This makes the design easier to extend.

If we later want to add a sample service or a protocol service, we do not have to change the full design. We add the service contract, allowed fields, and purpose. Then the same pattern can continue.
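
One possible shape for such a registry is sketched below. It is an illustration only: `ServiceContract`, `ServiceRegistry`, and the example services and fields are assumed names, not the real eLIMS contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceContract:
    name: str
    purpose: str
    allowed_fields: frozenset


class ServiceRegistry:
    """Single place that knows which services and fields may be used."""

    def __init__(self):
        self._contracts = {}

    def register(self, contract: ServiceContract) -> None:
        if contract.name in self._contracts:
            raise ValueError(f"service already registered: {contract.name}")
        self._contracts[contract.name] = contract

    def is_field_allowed(self, service: str, field_name: str) -> bool:
        contract = self._contracts.get(service)
        return contract is not None and field_name in contract.allowed_fields


registry = ServiceRegistry()
registry.register(ServiceContract(
    "study", "Study details and planned dates",
    frozenset({"study_id", "planned_completion_date"})))

# Extending the assistant later is one more registration, not a redesign.
registry.register(ServiceContract(
    "sample", "Sample tracking (hypothetical future service)",
    frozenset({"sample_id", "status"})))
```

The validator and the execution engine can both ask the registry instead of carrying their own hardcoded lists.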

Why Audit Is Important

I also added audit as part of the flow.

For every request, we should know:

What did the user ask?
What plan was generated?
Which validation checks passed?
Which services were called?
What result was returned?

This is useful for support, debugging, and review.

In production systems, we cannot just show an answer and forget how it was created. We need traceability.
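
As a rough sketch, one audit entry per request could carry the answers to the five questions above. The `AuditRecord` fields and the `to_audit_line` helper are hypothetical names chosen for this example; where the line is stored (database, log pipeline) is left open.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    user_id: str
    question: str           # what the user asked
    plan: dict              # the plan that was generated
    checks_passed: list     # which validation checks passed
    services_called: list   # which services were called
    result_summary: str     # what result was returned
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


def to_audit_line(record: AuditRecord) -> str:
    """Serialize one audit record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)
```

Writing the record as one JSON line keeps it easy to search later during support or review.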

Final Architecture Pattern

The full pattern is this:





This is the pattern I wanted to prove with this solution.

Not AI directly reading everything.
Not a chatbot giving random answers.
Not a hardcoded report.

It is a controlled assistant.

The user gets a simple way to ask questions.
The system keeps control of validation, access, execution, and audit.

That is the balance I wanted in this design.

My View

For enterprise applications, AI should be used carefully.

It should help users ask better questions and get faster insights. But it should not bypass the application rules.

In this solution, AI is useful because it understands the user’s question. But the actual responsibility still stays with the application.

That is why I designed the eLIMS Insight Assistant as:

Question → Plan → Validate → Authorize → Execute → Audit → Result

This keeps it simple for the user and safe for the system.



Saturday, June 6, 2026

How I Review an FSD: A Practical Framework from Enterprise Delivery

For me, an FSD review is not just a document review. It is an early architecture checkpoint to identify design gaps, integration risks, data issues, security gaps, performance concerns, and production support challenges.

A good solution architect should not only ask, “Can we build this?”
We should ask, “Can we build this correctly, securely, scalably, testably, and supportably?”


1. Understand the Business Problem First

Before reviewing screens, fields, buttons, or validations, I first check:

  • Why is this requirement needed?

  • Which user or business problem is it solving?

  • What process does it support?

  • What happens if this feature is not delivered?

  • Is the requirement solving the real problem or only automating an unclear process?


2. Validate the Domain Logic

The FSD should be aligned with the business/domain model.

I check:

  • Are the right business entities defined?

  • Are relationships between entities clear?

  • Is the lifecycle/status flow explained?

  • Are exceptions and edge cases covered?

  • Is historical data or versioning required?

In regulated or scientific systems, even a small configuration change can impact calculation, audit, reporting, or approval.


3. Confirm Source of Truth

This is very important in microservices.

The FSD should clearly answer:

  • Which module owns the data?

  • Who can create or update it?

  • Which modules only consume it?

  • Is data fetched live, synced, cached, or stored as a snapshot?

  • What happens when the source data changes?

Lack of source-of-truth clarity can lead to duplicate data, inconsistent behavior, and production issues.


4. Review Data Model and Traceability

Even if the requirement looks simple, the data design should support future needs.

I check:

  • Mandatory and optional fields

  • Status and lifecycle

  • Audit requirements

  • History/versioning

  • Deactivation versus deletion

  • Reporting and downstream usage

A simple question I ask is:

After six months, can we still explain why the system behaved in a particular way?


5. Evaluate Integration Impact

A small change in one module may impact many systems.

I check:

  • APIs impacted

  • Events/messages required

  • Synchronous vs asynchronous flow

  • Retry and failure handling

  • Cache impact

  • Downstream reports or jobs

  • Impact on other microservices

This helps avoid surprises during development, UAT, and production.


6. Review Non-Functional Requirements

Functional requirements are not enough.

I also check:

  • Performance and volume

  • Security and authorization

  • Audit and compliance

  • Error handling

  • Monitoring and logging

  • Data retention

  • Production support needs

Example: If the FSD says export data, I ask:

  • How many records?

  • Should it be async?

  • Who can download?

  • Where will the file be stored?

  • How long should it be retained?


7. Check Security and Access Control

Access should not be defined only at UI level.

I check:

  • Page access

  • API access

  • Record-level access

  • Field-level access

  • Workflow-step access

  • Export/report access

What the UI hides, the backend must also protect.
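
A small sketch of what backend protection can look like for field-level access. The role names, field names, and the `filter_record` helper are assumptions for illustration only.

```python
# Hypothetical mapping: role -> fields that role is allowed to read.
READABLE_FIELDS = {
    "viewer": {"id", "status"},
    "manager": {"id", "status", "cost"},
}


def filter_record(role: str, record: dict) -> dict:
    """Return only the fields the role may see; unknown roles see nothing."""
    allowed = READABLE_FIELDS.get(role, set())
    return {key: value for key, value in record.items() if key in allowed}
```

Because the filtering happens in the API response, a user who calls the API directly gets the same restricted view as a user who goes through the UI.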


8. Make Sure It Is Testable and Supportable

The FSD should help QA and support teams.

I check whether it supports:

  • Positive and negative test cases

  • Integration testing

  • Regression testing

  • Security testing

  • Performance testing

  • Audit testing

  • Production troubleshooting

A feature is not complete just because it works in UAT. It should also be monitorable, supportable, and recoverable in production.


Final Thought

A solution architect reviews an FSD to reduce ambiguity early.

The goal is to help BA, FA, developers, QA, DBA, DevOps, security, and support teams move with the same understanding.

A good FSD review prevents rework, reduces production risk, and ensures the solution is practical for real enterprise delivery.

Friday, March 20, 2026

Understanding How Certificates Are Used in Applications


Certificates are used to establish trust in secure communication. In simple terms, they help prove identity when two systems connect.

1. Server certificate

This is the most common use.

The server presents the certificate to the client so the client knows it is talking to the right system.

Example use cases

  • public website

  • internal web portal

  • REST API endpoint

  • CDN custom domain

  • load balancer HTTPS endpoint

Example names

  • portal.company.com

  • api.company.com

  • admin.internal.company.com

If a user opens https://portal.company.com, the server or load balancer presents the certificate. This is normal server-side TLS.

Where it may be installed

  • Application Load Balancer

  • CloudFront

  • API Gateway

  • IIS / nginx / Apache

  • Network Load Balancer with TLS listener


2. Client certificate

Here, the client presents a certificate to the server.

This is used when the server wants to verify who the calling system is.

Example use cases

  • machine-to-machine integration

  • secure partner API access

  • device authentication

  • VPN authentication

  • service account authentication without username/password

Example names

  • integration-client.pfx

  • partner-api-client-cert

  • device-auth-cert

If an order processing service calls a supplier API and sends a client certificate during the HTTPS connection, that is client authentication.
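
On the calling side, this can look roughly like the sketch below, which uses Python's standard `ssl` module. The helper name `build_client_context` and any file paths are hypothetical. Note that `load_cert_chain` expects PEM files, so a `.pfx` file would have to be converted first.

```python
import ssl
from typing import Optional


def build_client_context(cert_file: Optional[str] = None,
                         key_file: Optional[str] = None) -> ssl.SSLContext:
    """TLS context for outbound calls, optionally with a client certificate."""
    # The default context already verifies the server certificate and hostname.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    if cert_file:
        # Attaching a certificate and key here is what turns an ordinary
        # HTTPS client into one that can authenticate itself to the server.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The resulting context can be passed to an HTTPS client, for example `urllib.request.urlopen(url, context=ctx)`. The certificate is only sent if the server asks for one during the handshake.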


3. Mutual TLS (mTLS)

In mTLS, both sides present certificates.

  • server proves its identity to client

  • client also proves its identity to server

Example use cases

  • B2B integrations

  • secure internal service-to-service calls

  • healthcare or banking APIs

  • zero-trust internal APIs

Example names

  • payments-api.company.com

  • partner-gateway.vendor.com

  • inventory-service.internal.company.com

If inventory-service calls partner-gateway.vendor.com and both sides validate certificates, that is mTLS.
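
On the server side, the difference between normal TLS and mTLS is whether the server requires and verifies a client certificate. A minimal sketch with Python's standard `ssl` module is shown below; the helper name `build_mtls_server_context` and its parameters are assumptions, and real deployments often configure this on a load balancer or gateway instead.

```python
import ssl
from typing import Optional


def build_mtls_server_context(cert_file: Optional[str] = None,
                              key_file: Optional[str] = None,
                              client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Server-side TLS context that rejects clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # CERT_REQUIRED on a server context is what makes the TLS "mutual".
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # The server's own certificate (PEM), presented to every client.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file:
        # CA bundle (PEM) used to validate the certificates clients present.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx
```

Without the `verify_mode` line, the same context would be plain server TLS: the server proves its identity, but any client can connect.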


Where certificates can be installed

On a Load Balancer

Used when TLS terminates at the load balancer.

Example use cases

  • one entry point for many websites

  • centralized HTTPS management

  • host-based routing for multiple apps

Example names

  • shop.company.com

  • careers.company.com

  • support.company.com

A load balancer presents the right certificate based on the hostname the client requests (using SNI).


On the Application Server

Used when TLS terminates directly on the server.

Example use cases

  • legacy applications

  • internal admin tools

  • direct server-hosted portals

  • applications not behind a centralized ingress

Example names

  • reports.internal.company.local

  • admin-node-07.corp.local

  • legacy-app.company.local

The certificate is installed directly on the server and bound in IIS or nginx.


In the Windows Certificate Store

Applications or IIS can load certificates from the Windows certificate store.

Example use cases

  • IIS-hosted website

  • .NET Windows service

  • internal scheduler calling external API

Example names

  • client-auth-service-cert

  • portal-web-cert

  • erp-integration-cert


As Files

Certificates may also exist as:

  • .pfx

  • .pem

  • .crt

  • .key

  • .jks

Example use cases

  • Linux web servers

  • Java applications

  • containerized services

  • outbound secure API integrations

Example names

  • server.crt

  • server.key

  • client-auth.pfx

  • service-keystore.jks


In Secrets Manager or Config

Some applications load certificates at runtime from secret or config stores.

Example use cases

  • microservices

  • containerized apps

  • outbound client-auth integrations

  • automated batch jobs

Example names

  • PAYMENT_GATEWAY_CLIENT_CERT

  • MTLS_CERT_PATH

  • PARTNER_API_KEYSTORE


How to know what a certificate is doing

If it is attached to:

  • load balancer HTTPS listener

  • CDN custom domain

  • API custom domain

  • IIS HTTPS binding

then it is usually being used as a server certificate.

If the application is configured with:

  • .pfx

  • keystore

  • thumbprint

  • ClientCertificates

  • X509Certificate2

then it may be used as a client certificate.


Important note

A certificate's Extended Key Usage (EKU) extension may list both:

  • Server Authentication

  • Client Authentication

But that does not mean both are actually used.

The real question is:
Is the certificate being used only for server TLS, or also for client authentication?


Quick examples

Example 1: Public website

  • portal.company.com

  • certificate attached to load balancer

  • users access over HTTPS

Result:
server TLS only

Example 2: Internal portal

  • admin.internal.company.com

  • certificate installed in IIS

  • client certificates not required

Result:
server TLS only

Example 3: Secure partner integration

  • order service calls partner API

  • app loads partner-client.pfx

  • cert attached to outbound HTTPS client

Result:
client certificate usage

Example 4: B2B mutual TLS

  • partner-gateway.vendor.com

  • both systems exchange and validate certificates

Result:
mTLS


Final takeaway

Certificates can be installed on load balancers, CDNs, API gateways, servers, applications, secret stores, or appliances. Their role depends on who presents the certificate and where TLS terminates.

In short:

  • server presents certificate → server TLS

  • client presents certificate → client authentication

  • both present certificates → mTLS

Monday, March 16, 2026

Investigating S3 Cost Growth in AWS – How We Identified the Issue and What We Found

S3 Investigation Summary

We started this analysis because S3 cost was not coming down as expected, even after cleanup activities.

Cost signal

Recent S3 monthly cost stayed around:

  • Oct 2025: ~$5.8k

  • Nov 2025: ~$5.8k

  • Dec 2025: ~$4.7k

  • Jan 2026: ~$5.3k

  • Feb 2026: ~$5.9k

So S3 was still running at roughly $5k–$6k/month.

How we validated it

We did not rely only on the folder view in the S3 console.
We validated using:

  • bucket size metrics

  • Storage Lens

  • storage class split

  • prefix drill-down

This confirmed:

  • bucket size is around 260–270 TB

  • most of the data is still in Standard

  • storage is concentrated under a non-prod backup prefix
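
A quick sanity check that this bucket size is consistent with the monthly cost. This is a back-of-the-envelope sketch: the flat rate of $0.022 per GB-month is an assumption for illustration, and real S3 Standard pricing varies by region and volume tier and excludes request and transfer charges.

```python
def estimate_standard_monthly_cost(size_tb: float,
                                   usd_per_gb_month: float = 0.022) -> float:
    """Rough monthly storage cost for data held in S3 Standard."""
    # Assumption: 1 TB is treated as 1024 GB for billing purposes.
    return size_tb * 1024 * usd_per_gb_month


low = estimate_standard_monthly_cost(260)
high = estimate_standard_monthly_cost(270)
print(f"Estimated range: ${low:,.0f} to ${high:,.0f} per month")
```

Under that assumed rate, 260–270 TB in Standard lands in the same range as the observed bill, which supports the view that storage volume, not requests or transfer, is driving the cost.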

Key finding

This is not mainly a versioning issue, and it is not caused by data sitting in another region.
The main problem is:

  • large backup data retained in Standard

  • under a non-prod backup path

  • with a likely retention / lifecycle gap

Incomplete multipart upload issue

One additional contributor may be incomplete multipart uploads.

In this case, large SQL backup files are uploaded to S3 in parts.
If a backup/upload job fails in the middle and the upload is not completed or aborted, S3 keeps the uploaded parts.

That means:

  • no final usable backup object

  • but storage is still consumed

  • and cost continues in Standard storage

This usually happens when:

  • backup job fails midway

  • retry starts a new upload

  • old partial upload is not cleaned up

Fix direction

Main actions:

  • review retention for non-prod backup data

  • apply / correct lifecycle rules for affected prefixes

  • move older backups from Standard to Glacier / archive

  • enable a lifecycle rule to abort incomplete multipart uploads

  • validate whether old backup copies can be deleted
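
Several of these actions can be expressed as a single S3 lifecycle rule. The sketch below only builds the configuration as a Python dictionary in the shape the S3 lifecycle API expects. The prefix `nonprod/backups/` and the day counts (30, 180, 7) are placeholders, not values from our environment, and must be agreed with the backup owners first.

```python
import json


def backup_lifecycle_rule(prefix: str, archive_after_days: int,
                          expire_after_days: int,
                          abort_mpu_after_days: int) -> dict:
    """Build one lifecycle rule for a backup prefix."""
    return {
        "ID": "backup-retention-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Move older backups out of Standard into an archive class.
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        # Delete backups once the retention period is over.
        "Expiration": {"Days": expire_after_days},
        # Clean up parts left behind by failed multipart uploads.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_mpu_after_days},
    }


config = {"Rules": [backup_lifecycle_rule("nonprod/backups/", 30, 180, 7)]}
print(json.dumps(config, indent=2))
```

With boto3, a dictionary like this is what would be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`. Applying it replaces the bucket's existing lifecycle configuration, so existing rules need to be included in the same call.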

Expected saving

Based on the current S3 run-rate, the expected saving is roughly:

  • up to ~$3k/month, if retention can be reduced further

Short root cause statement

Root cause: S3 growth is mainly driven by large non-prod backup data remaining in Standard storage longer than required. In addition, failed multipart backup uploads may be leaving orphaned uploaded parts in S3, adding to storage without creating final usable backup files.
