Sunday, July 12, 2026

Why Late-Stage Release Scope Control Fails (And How to Fix It)


Enforcing strict scope control right before a major release often backfires: it creates bottlenecks, tanks team morale, and introduces more risk just when you need stability.

I’ve broken down the structural reasons why these late-stage interventions fail, along with the framework you should use instead to keep your timelines predictable.

👉 Read the full article on Substack

Sunday, June 21, 2026

This Blog Has Migrated to Substack

To better share deep technical blueprints, system design frameworks, and enterprise architecture insights, I have officially moved all my technical writing over to a new, modern publication platform.

My new home is at: vamsitokala.substack.com

What to Expect on the New Platform:

Better Technical Layouts: Clearer system design breakdowns, visual workflows, and architectural patterns.
Direct Updates: You can subscribe for free to get new architecture breakdowns sent straight to your inbox without having to check back manually.
Upcoming Deep-Dives: I will be continuing my focus on enterprise data layers, cloud optimization, and building governed AI patterns for heavily regulated environments.

The complete archive, including my earlier write-ups, has already been fully migrated and is live right now.

Head over to the new link, explore the updated architecture blueprints, and join the free subscription list to stay connected:

👉 Click Here to Visit the New Publication

Wednesday, June 17, 2026

Designing a Governed AI Assistant for eLIMS

When I started thinking about this solution, the problem was very simple.

In eLIMS, users have a lot of data, but getting quick answers is not always easy. A user may want to know something like:

Show me studies not completed on time.

To answer this, the system has to check the study details, planned completion date, actual completion date, user access, and legal entity, and then decide whether the study is on time, delayed, or missing information.

So the question I had was:

Can we allow users to ask this in simple English, but still keep the backend safe and controlled?

That is the main problem this solution is solving.

Overall Design

The Main Design Decision

The most important decision I made was:

AI should not directly query the database.

AI is only used to understand what the user is asking.

It creates a plan.

That plan may be right or wrong. So I do not execute it directly. First, I send it to the validator.

The backend executes the plan only if the validator says it is safe.

This is the main difference between this solution and a normal chatbot.

A chatbot may answer directly.
Here, AI only prepares the plan.
The application still controls what can run.

Why I Designed It This Way

In a system like eLIMS, we cannot allow AI to freely access data.

There are roles.
There are legal entities.
There are allowed services.
There are allowed fields.
There are audit needs.
There should not be any write operation from this assistant.

So I kept a clear boundary.

AI can understand the question.
Validator decides whether the plan is allowed.
Execution engine decides what data to read.
Audit service records what happened.

This keeps the user experience simple, but the backend is still controlled.

What Happens When a User Asks a Question

Let us take this example:

Show me studies not completed on time.

The Angular UI sends this question to the .NET API.

The API asks the plan generator to convert this question into a structured plan. The plan may say that the study service and the core lab service are needed, and that the output should include delayed or indeterminate studies.

But before doing anything with data, the validator checks the plan.

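It checks that every service named in the plan is an allowed service, that every requested field is an allowed field, and that the plan contains no write operation.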

This validator is very important. Even if AI gives a wrong service name or tries to use a field that is not allowed, the request will stop there.

No data will be read.
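
A minimal sketch of this validation step, assuming the plan arrives as a plain dictionary. The keys `operation`, `services`, and the service and field names are hypothetical and only illustrate the idea; they are not the actual eLIMS contracts.

```python
# Hypothetical allow-list: service name -> fields the assistant may read.
ALLOWED_FIELDS = {
    "study": {"study_id", "planned_completion_date"},
    "core_lab": {"study_id", "actual_completion_date"},
}


def validate_plan(plan: dict) -> list:
    """Return a list of problems; an empty list means the plan may run."""
    problems = []

    # The assistant is read-only, so any other operation is rejected.
    if plan.get("operation") != "read":
        problems.append("only read operations are allowed")

    for service, fields in plan.get("services", {}).items():
        if service not in ALLOWED_FIELDS:
            problems.append(f"unknown service: {service}")
            continue
        for field in fields:
            if field not in ALLOWED_FIELDS[service]:
                problems.append(f"field not allowed: {service}.{field}")

    return problems


good_plan = {
    "operation": "read",
    "services": {
        "study": ["study_id", "planned_completion_date"],
        "core_lab": ["actual_completion_date"],
    },
}
bad_plan = {"operation": "read", "services": {"billing": ["amount"]}}

print(validate_plan(good_plan))  # []
print(validate_plan(bad_plan))   # ['unknown service: billing']
```

Returning a list of problems instead of a single boolean makes the rejection reason easy to log and show to support teams.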

Execution Logic

Once the plan is valid, the execution engine takes over.

It checks whether the user has the right role and legal entity access.

After that, it reads the approved data and applies the business logic.

For the current use case, it checks study planned completion date and TestP actual completion date.

The classification is simple:

Case	Result
Actual completion date is before or equal to planned date	On Time
Actual completion date is after planned date	Delayed
Planned date or actual completion date is missing	Indeterminate

I kept Indeterminate as a separate result because missing data should not be hidden. In real systems, missing data is also an important signal.
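
The classification rules can be sketched as a small function. This is an illustration of the rules as described here, assuming both dates are available as Python `date` values or `None` when missing; the function name is hypothetical.

```python
from datetime import date
from typing import Optional


def classify_study(planned: Optional[date], actual: Optional[date]) -> str:
    """Classify a study by comparing actual and planned completion dates."""
    # Missing data is reported explicitly instead of being hidden.
    if planned is None or actual is None:
        return "Indeterminate"
    return "On Time" if actual <= planned else "Delayed"


print(classify_study(date(2026, 6, 1), date(2026, 5, 30)))  # On Time
print(classify_study(date(2026, 6, 1), date(2026, 6, 2)))   # Delayed
print(classify_study(date(2026, 6, 1), None))               # Indeterminate
```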

Why Service Registry Is Used

I did not want service names and fields to be hardcoded everywhere.

So I introduced a service registry.

The service registry tells the system which services are available, what each one is for, and which fields are allowed.

This makes the design easier to extend.

Tomorrow, if we want to add a sample service or a protocol service, we do not have to change the full design. We add the service contract, allowed fields, and purpose. Then the same pattern can continue.
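
One possible shape for such a registry is sketched below. The class names, service names, and fields are assumptions made for illustration; the real contracts would carry more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceContract:
    name: str
    purpose: str
    allowed_fields: frozenset


class ServiceRegistry:
    def __init__(self):
        self._contracts = {}

    def register(self, contract: ServiceContract) -> None:
        self._contracts[contract.name] = contract

    def is_allowed(self, service: str, field: str) -> bool:
        # Unknown services and unlisted fields are both refused.
        contract = self._contracts.get(service)
        return contract is not None and field in contract.allowed_fields


registry = ServiceRegistry()
registry.register(ServiceContract(
    "study", "Study details and planned dates",
    frozenset({"study_id", "planned_completion_date"}),
))
# Adding a new service later is one more registration, not a redesign.
registry.register(ServiceContract(
    "sample", "Sample tracking",
    frozenset({"sample_id", "status"}),
))

print(registry.is_allowed("sample", "status"))   # True
print(registry.is_allowed("sample", "owner"))    # False
print(registry.is_allowed("protocol", "title"))  # False
```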

Why Audit Is Important

I also added audit as part of the flow.

For every request, we should know:

What did the user ask?
What plan was generated?
Which validation checks passed?
Which services were called?
What result was returned?

This is useful for support, debugging, and review.

In production systems, we cannot just show an answer and forget how it was created. We need traceability.
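
As a rough sketch, one audit entry could capture the questions above in a single record. The field names, the `user_id` value, and the JSON serialization are assumptions for illustration, not the actual audit schema.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    user_id: str
    question: str
    plan: dict
    checks_passed: list
    services_called: list
    result_summary: str
    # Timestamp is set when the record is created, in UTC.
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = AuditRecord(
    user_id="u123",
    question="Show me studies not completed on time.",
    plan={"operation": "read", "services": ["study", "core_lab"]},
    checks_passed=["service_allowed", "fields_allowed", "read_only"],
    services_called=["study", "core_lab"],
    result_summary="3 delayed, 1 indeterminate",
)
print(record.to_json())
```

Storing the generated plan next to the question is what lets a reviewer later explain why a given answer was produced.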

Final Architecture Pattern

This is the pattern I wanted to prove with this solution.

Not AI directly reading everything.
Not a chatbot giving random answers.
Not a hardcoded report.

It is a controlled assistant.

The user gets a simple way to ask questions.
The system keeps control of validation, access, execution, and audit.

That is the balance I wanted in this design.

My View

For enterprise applications, AI should be used carefully.

It should help users ask better questions and get faster insights. But it should not bypass the application rules.

In this solution, AI is useful because it understands the user’s question. But the actual responsibility still stays with the application.

That is why I designed the eLIMS Insight Assistant as:

Question → Plan → Validate → Authorize → Execute → Audit → Result

This keeps it simple for the user and safe for the system.
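
The flow can be sketched end to end with stand-in functions for each stage. Everything here is hypothetical: the role name, the plan shape, and the returned rows only show how the stages connect, with the AI step reduced to a stub.

```python
def generate_plan(question: str) -> dict:
    # Stand-in for the AI step: it only proposes a plan, it never touches data.
    return {"operation": "read", "services": ["study", "core_lab"]}


def validate(plan: dict) -> bool:
    allowed = {"study", "core_lab"}
    return plan.get("operation") == "read" and set(plan.get("services", [])) <= allowed


def authorize(user: dict) -> bool:
    return "study_viewer" in user.get("roles", [])


def execute(plan: dict) -> list:
    # Stand-in for reading approved data and applying the business rules.
    return [{"study_id": "S-1", "status": "Delayed"}]


def handle_question(user: dict, question: str, audit_log: list) -> dict:
    plan = generate_plan(question)
    if not validate(plan):
        outcome = {"status": "rejected", "reason": "plan failed validation"}
    elif not authorize(user):
        outcome = {"status": "denied", "reason": "user is not authorized"}
    else:
        outcome = {"status": "ok", "rows": execute(plan)}
    # Every request is audited, including rejected and denied ones.
    audit_log.append({
        "user": user["id"],
        "question": question,
        "plan": plan,
        "status": outcome["status"],
    })
    return outcome


audit_log = []
viewer = {"id": "u1", "roles": ["study_viewer"]}
guest = {"id": "u2", "roles": []}

print(handle_question(viewer, "Show me studies not completed on time.", audit_log))
print(handle_question(guest, "Show me studies not completed on time.", audit_log))
print(len(audit_log))  # 2
```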

Saturday, June 6, 2026

How I Review an FSD: A Practical Framework from Enterprise Delivery


For me, an FSD review is not just a document review. It is an early architecture checkpoint to identify design gaps, integration risks, data issues, security gaps, performance concerns, and production support challenges.

A good solution architect should not only ask, “Can we build this?”
We should ask, “Can we build this correctly, securely, scalably, testably, and supportably?”

1. Understand the Business Problem First

Before reviewing screens, fields, buttons, or validations, I first check:

Why is this requirement needed?
Which user or business problem is it solving?
What process does it support?
What happens if this feature is not delivered?
Is the requirement solving the real problem or only automating an unclear process?

2. Validate the Domain Logic

The FSD should be aligned with the business/domain model.

I check:

Are the right business entities defined?
Are relationships between entities clear?
Is the lifecycle/status flow explained?
Are exceptions and edge cases covered?
Is historical data or versioning required?

In regulated or scientific systems, even a small configuration change can impact calculation, audit, reporting, or approval.

3. Confirm Source of Truth

This is very important in microservices.

The FSD should clearly answer:

Which module owns the data?
Who can create or update it?
Which modules only consume it?
Is data fetched live, synced, cached, or stored as a snapshot?
What happens when the source data changes?

Lack of source-of-truth clarity can lead to duplicate data, inconsistent behavior, and production issues.

4. Review Data Model and Traceability

Even if the requirement looks simple, the data design should support future needs.

I check:

Mandatory and optional fields
Status and lifecycle
Audit requirements
History/versioning
Deactivation versus deletion
Reporting and downstream usage

A simple question I ask is:

After six months, can we still explain why the system behaved in a particular way?

5. Evaluate Integration Impact

A small change in one module may impact many systems.

I check:

APIs impacted
Events/messages required
Synchronous vs asynchronous flow
Retry and failure handling
Cache impact
Downstream reports or jobs
Impact on other microservices

This helps avoid surprises during development, UAT, and production.

6. Review Non-Functional Requirements

Functional requirements are not enough.

I also check:

Performance and volume
Security and authorization
Audit and compliance
Error handling
Monitoring and logging
Data retention
Production support needs

Example: If the FSD says export data, I ask:

How many records?
Should it be async?
Who can download?
Where will the file be stored?
How long should it be retained?

7. Check Security and Access Control

Access should not be defined only at UI level.

I check:

Page access
API access
Record-level access
Field-level access
Workflow-step access
Export/report access

What the UI hides, the backend must also protect.
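
A small sketch of what backend protection can look like, independent of what the UI shows. The roles, field names, and the `legal_entity` attribute are assumptions chosen for illustration.

```python
# Hypothetical mapping: role -> fields the API may return for a study record.
FIELD_ACCESS = {
    "analyst": {"study_id", "status"},
    "manager": {"study_id", "status", "budget"},
}


def get_study_for(user: dict, record: dict) -> dict:
    """Apply record-level and field-level checks on the server side."""
    # Record-level access: the user must belong to the record's legal entity.
    if record["legal_entity"] not in user["legal_entities"]:
        raise PermissionError("no access to this record")
    # Field-level access: return only the fields this role may see.
    visible = FIELD_ACCESS.get(user["role"], set())
    return {key: value for key, value in record.items() if key in visible}


record = {"study_id": "S-1", "status": "Delayed", "budget": 120000, "legal_entity": "LE-1"}
analyst = {"role": "analyst", "legal_entities": {"LE-1"}}

print(get_study_for(analyst, record))  # {'study_id': 'S-1', 'status': 'Delayed'}
```

Even if a client calls the API directly and skips the UI, the response contains only what the role is allowed to see.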

8. Make Sure It Is Testable and Supportable

The FSD should help QA and support teams.

I check whether it supports:

Positive and negative test cases
Integration testing
Regression testing
Security testing
Performance testing
Audit testing
Production troubleshooting

A feature is not complete only because it works in UAT. It should also be monitorable, supportable, and recoverable in production.

Final Thought

A solution architect reviews an FSD to reduce ambiguity early.

The goal is to help BA, FA, developers, QA, DBA, DevOps, security, and support teams move with the same understanding.

A good FSD review prevents rework, reduces production risk, and ensures the solution is practical for real enterprise delivery.

Friday, March 20, 2026

Understanding How Certificates Are Used in Applications

Certificates are used to establish trust in secure communication. In simple terms, they help prove identity when two systems connect.

1. Server certificate

This is the most common use.

The server presents the certificate to the client so the client knows it is talking to the right system.

Example use cases

public website
internal web portal
REST API endpoint
CDN custom domain
load balancer HTTPS endpoint

Example names

portal.company.com
api.company.com
admin.internal.company.com

If a user opens https://portal.company.com, the server or load balancer presents the certificate. This is normal server-side TLS.

Where it may be installed

Application Load Balancer
CloudFront
API Gateway
IIS / nginx / Apache
Network Load Balancer with TLS listener

2. Client certificate

Here, the client presents a certificate to the server.

This is used when the server wants to verify who the calling system is.

Example use cases

machine-to-machine integration
secure partner API access
device authentication
VPN authentication
service account authentication without username/password

Example names

integration-client.pfx
partner-api-client-cert
device-auth-cert

If an order processing service calls a supplier API and sends a client certificate during the HTTPS connection, that is client authentication.

3. Mutual TLS (mTLS)

In mTLS, both sides present certificates.

server proves its identity to client
client also proves its identity to server

Example use cases

B2B integrations
secure internal service-to-service calls
healthcare or banking APIs
zero-trust internal APIs

Example names

payments-api.company.com
partner-gateway.vendor.com
inventory-service.internal.company.com

If inventory-service calls partner-gateway.vendor.com and both sides validate certificates, that is mTLS.
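
The difference between the three cases can be sketched with Python's standard `ssl` module. The function names and file parameters are hypothetical. The certificate paths are optional here only so the sketch can run without real certificate files; a real server always needs its own certificate and key.

```python
import ssl
from typing import Optional


def build_client_context(cert_file: Optional[str] = None,
                         key_file: Optional[str] = None) -> ssl.SSLContext:
    """Context for an outbound HTTPS call; it always verifies the server."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    if cert_file is not None:
        # Presenting a client certificate is what enables client authentication.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def build_server_context(cert_file: Optional[str] = None,
                         key_file: Optional[str] = None,
                         client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Context for a TLS listener; client certificates are optional."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if cert_file is not None:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if client_ca_file is not None:
        # Requiring and verifying a client certificate turns server TLS into mTLS.
        ctx.load_verify_locations(cafile=client_ca_file)
        ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


client_ctx = build_client_context()
server_ctx = build_server_context()
print(client_ctx.verify_mode == ssl.CERT_REQUIRED)  # client verifies the server
print(server_ctx.verify_mode == ssl.CERT_NONE)      # server does not ask for a client cert
```

Passing `client_ca_file` on the server side and `cert_file` on the client side is the combination that corresponds to mTLS.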

Where certificates can be installed

On a Load Balancer

Used when TLS terminates at the load balancer.

Example use cases

one entry point for many websites
centralized HTTPS management
host-based routing for multiple apps

Example names

shop.company.com
careers.company.com
support.company.com

A load balancer presents the right certificate based on the hostname.

On the Application Server

Used when TLS terminates directly on the server.

Example use cases

legacy applications
internal admin tools
direct server-hosted portals
applications not behind a centralized ingress

Example names

reports.internal.company.local
admin-node-07.corp.local
legacy-app.company.local

The certificate is installed directly on the server and bound in IIS or nginx.

In the Windows Certificate Store

Applications or IIS can load certificates from the Windows certificate store.

Example use cases

IIS-hosted website
.NET Windows service
internal scheduler calling external API

Example names

client-auth-service-cert
portal-web-cert
erp-integration-cert

As Files

Certificates may also exist as:

.pfx
.pem
.crt
.key
.jks

Example use cases

Linux web servers
Java applications
containerized services
outbound secure API integrations

Example names

server.crt
server.key
client-auth.pfx
service-keystore.jks

In Secrets Manager or Config

Some applications load certificates at runtime from secret or config stores.

Example use cases

microservices
containerized apps
outbound client-auth integrations
automated batch jobs

Example names

PAYMENT_GATEWAY_CLIENT_CERT
MTLS_CERT_PATH
PARTNER_API_KEYSTORE

How to know what a certificate is doing

If it is attached to:

load balancer HTTPS listener
CDN custom domain
API custom domain
IIS HTTPS binding

then it is usually being used as a server certificate.

If the application is configured with:

.pfx
keystore
thumbprint
ClientCertificates
X509Certificate2

then it may be used as a client certificate.

Important note

A certificate's Extended Key Usage (EKU) may list both:

Server Authentication
Client Authentication

But that does not mean both are actually used.

The real question is:
Is the certificate being used only for server TLS, or also for client authentication?
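
The gap between what a certificate permits and how it is actually used can be sketched as follows. The two OIDs are the standard EKU identifiers for server and client authentication; the function and its two boolean inputs are hypothetical and stand in for whatever your own inventory records.

```python
SERVER_AUTH_OID = "1.3.6.1.5.5.7.3.1"  # id-kp-serverAuth
CLIENT_AUTH_OID = "1.3.6.1.5.5.7.3.2"  # id-kp-clientAuth


def certificate_roles(eku_oids, bound_to_https_listener, attached_to_outbound_client):
    """Compare what a certificate is permitted to do with what it is used for."""
    permitted = set()
    if SERVER_AUTH_OID in eku_oids:
        permitted.add("server")
    if CLIENT_AUTH_OID in eku_oids:
        permitted.add("client")

    used = set()
    if bound_to_https_listener and "server" in permitted:
        used.add("server")
    if attached_to_outbound_client and "client" in permitted:
        used.add("client")
    return {"permitted": permitted, "used": used}


roles = certificate_roles(
    eku_oids={SERVER_AUTH_OID, CLIENT_AUTH_OID},
    bound_to_https_listener=True,
    attached_to_outbound_client=False,
)
print(sorted(roles["permitted"]))  # ['client', 'server']
print(sorted(roles["used"]))       # ['server']
```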

Quick examples

Example 1: Public website

portal.company.com
certificate attached to load balancer
users access over HTTPS

Result:
server TLS only

Example 2: Internal portal

admin.internal.company.com
certificate installed in IIS
client certificates not required

Result:
server TLS only

Example 3: Secure partner integration

order service calls partner API
app loads partner-client.pfx
cert attached to outbound HTTPS client

Result:
client certificate usage

Example 4: B2B mutual TLS

partner-gateway.vendor.com
both systems exchange and validate certificates

Result:
mTLS

Final takeaway

Certificates can be installed on load balancers, CDNs, API gateways, servers, applications, secret stores, or appliances. Their role depends on who presents the certificate and where TLS terminates.

In short:

server presents certificate → server TLS
client presents certificate → client authentication
both present certificates → mTLS

Monday, March 16, 2026

Investigating S3 Cost Growth in AWS – How We Identified the Issue and What We Found

S3 Investigation Summary

We started this analysis because S3 cost was not decreasing as expected, even after cleanup activities.

Cost signal

Recent S3 monthly cost stayed around:

Oct 2025: ~$5.8k
Nov 2025: ~$5.8k
Dec 2025: ~$4.7k
Jan 2026: ~$5.3k
Feb 2026: ~$5.9k

So S3 was still running at roughly $5k–$6k/month.
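
As a quick check of that range, the rounded monthly figures above can be averaged. The values are the approximate numbers from this post, not exact billing data.

```python
# Approximate monthly S3 cost in USD, taken from the figures above.
monthly_cost_usd = {
    "Oct 2025": 5800,
    "Nov 2025": 5800,
    "Dec 2025": 4700,
    "Jan 2026": 5300,
    "Feb 2026": 5900,
}

average = sum(monthly_cost_usd.values()) / len(monthly_cost_usd)
lowest = min(monthly_cost_usd.values())
highest = max(monthly_cost_usd.values())

print(f"average: ${average:,.0f}/month")    # average: $5,500/month
print(f"range: ${lowest:,} to ${highest:,}")  # range: $4,700 to $5,900
```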

How we validated it

We did not rely only on the folder view in the S3 console.
We validated using:

bucket size metrics
Storage Lens
storage class split
prefix drill-down

This confirmed:

bucket size is around 260–270 TB
most of the data is still in Standard
storage is concentrated under a non-prod backup prefix

Key finding

This is not mainly a versioning issue or an issue in another region.
The main problem is:

large backup data retained in Standard
under a non-prod backup path
with a likely retention or lifecycle gap

Incomplete multipart upload issue

One additional contributor may be incomplete multipart uploads.

In this case, large SQL backup files are uploaded to S3 in parts.
If a backup/upload job fails in the middle and the upload is not completed or aborted, S3 keeps the uploaded parts.

That means:

no final usable backup object
but storage is still consumed
and cost continues in Standard storage

This usually happens when:

backup job fails midway
retry starts a new upload
old partial upload is not cleaned up

Fix direction

Main actions:

review retention for non-prod backup data
apply / correct lifecycle rules for affected prefixes
move older backups from Standard to Glacier / archive
enable a lifecycle rule to abort incomplete multipart uploads
validate whether old backup copies can be deleted
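
A lifecycle rule covering these actions could look like the sketch below. The prefix, rule ID, and day counts are placeholders to be agreed with the backup owners, not the values we applied. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`; the call itself is left out because it needs credentials and a real bucket.

```python
import json

BACKUP_PREFIX = "nonprod/backups/"  # hypothetical prefix

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "nonprod-backup-retention",
            "Status": "Enabled",
            "Filter": {"Prefix": BACKUP_PREFIX},
            # Move older backups out of Standard after 30 days (placeholder).
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            # Delete backups after 180 days (placeholder retention).
            "Expiration": {"Days": 180},
            # Clean up parts left behind by failed multipart uploads.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

Keeping transition, expiration, and multipart cleanup in one prefix-scoped rule makes it easier to review retention for the non-prod backup path in one place.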

Expected saving

Based on current S3 run-rate, expected saving is roughly:

up to ~$3k/month, if retention can be reduced further

Short root cause statement

Root cause: S3 growth is mainly driven by large non-prod backup data remaining in Standard storage longer than required. In addition, failed multipart backup uploads may be leaving orphaned uploaded parts in S3, adding to storage without creating final usable backup files.