Vendor-Agnostic AI Architecture

Your AI vendor is a dependency, not a destiny.

In 2025 a large federal customer had to migrate AI vendors mid-program. The teams that survived it had built one thing: a model-agnostic architecture, where the provider is a swappable part. This page is the full blueprint — the provider gateway, the model-selection guardrails, and the data-security layers (multi-tenant isolation, encryption in transit with mTLS, and encryption at rest) — decomposed far enough that a high-schooler and a 25-year veteran both finish it thinking “nothing was left hand-wavy.”

Provider Portability · Multi-Tenant Data Integrity · Encryption + mTLS · Runnable Code

1 neutral interface, any provider

0 vendor SDK imports in app code

3 independent walls per tenant

2 encryption states: transit + rest

100% network hops under TLS

mTLS on every internal + egress hop

🎯The whole idea in one sentence — then we break it down

🔑The abstraction statement

Build the system so that which AI vendor you use and whose data is flowing through it are both decisions the architecture enforces — not things a developer has to remember to get right on every call.

That sentence is the entire page. Everything below is the decomposition — because an abstraction you cannot break down into concrete, configured, diagrammed parts is just a slogan. We will take it apart into six layers, and for each one show the actual code, the actual config, and the exact place a private key or a tenant ID physically lives.

🔀

The Model Gateway

One provider-neutral interface. App code never imports a vendor SDK. Each vendor is a thin adapter.

⚖️

The Decision Matrix

How the gateway picks a provider per request — capability, cost, latency, data residency, compliance, fallback.

🛡️

Guardrails

The gate around the model — input filters for injection and jailbreaks, output filters for PII leaks and policy.

🏢

Multi-Tenant Integrity

Tenant, workspace, and role injected once at the controller — never passed by hand, never droppable.

🔐

Encryption in Transit

TLS on every hop, mTLS internally and on egress. Where each private key lives, and how it is configured.

🗄️

Encryption at Rest

AES-256-GCM, envelope encryption, per-tenant data keys in a KMS — and crypto-shredding for real deletion.

📊The picture, before the prose

Here is the full request lifecycle — one user request, from the edge of the network to the model and back. Watch where identity gets attached, where the encryption changes shape, and where the data finally comes to rest. Every box below has its own section.

Figure: animated data-flow diagram of a model-agnostic AI request lifecycle. A request packet enters at the edge over TLS 1.3 and reaches a controller that injects tenant ID, workspace ID, and role from the verified token into a request-scoped context. It passes through input guardrails (prompt-injection, jailbreak, PII scan), then a model gateway with a decision matrix that routes to one of four interchangeable provider adapters (Anthropic, OpenAI, AWS Bedrock, Google Vertex). The egress hop to the provider uses mTLS with a client certificate whose private key never leaves the pod. The response flows back through output guardrails (PII leak, policy, hallucination check) and is persisted with AES-256-GCM encryption at rest using a per-tenant data key, with the root key held in a KMS, plus an append-only audit log.

Caption: One request, six layers. Identity is injected once at the controller and travels with the request; the provider is chosen by the gateway and is fully swappable; encryption changes shape (TLS → mTLS → AES-at-rest) but never has a gap.

If the image has not rendered yet, here is the same lifecycle as plain text — this is the spine of the whole page:

   User / Agent
        │   TLS 1.3  (encryption in transit — Layer 5)
        ▼
 ┌──────────────────────────────────────────────────────┐
 │ EDGE / LOAD BALANCER          terminates public TLS   │
 └──────────────────────────────────────────────────────┘
        │   mTLS  (client certificate from here inward)
        ▼
 ┌──────────────────────────────────────────────────────┐
 │ CONTROLLER  — identity injected ONCE  (Layer 4)       │
 │   tenant_id · workspace_id · role                     │
 │   read from the verified token → request-scoped ctx   │
 └──────────────────────────────────────────────────────┘
        ▼
 ┌──────────────────────────────────────────────────────┐
 │ INPUT GUARDRAILS  (Layer 3)                           │
 │   prompt-injection · jailbreak · PII scan             │
 └──────────────────────────────────────────────────────┘
        ▼
 ┌──────────────────────────────────────────────────────┐
 │ MODEL GATEWAY  — one neutral interface  (Layer 1)     │
 │   decision matrix picks a provider      (Layer 2)     │
 │   ┌─────────┐ ┌────────┐ ┌─────────┐ ┌──────────────┐ │
 │   │Anthropic│ │ OpenAI │ │ Bedrock │ │ Vertex / ... │ │
 │   └─────────┘ └────────┘ └─────────┘ └──────────────┘ │
 └──────────────────────────────────────────────────────┘
        │   mTLS egress — client cert, private key
        ▼   stays in the pod / HSM, never on the wire
   [  LLM PROVIDER  ]
        ▼
 ┌──────────────────────────────────────────────────────┐
 │ OUTPUT GUARDRAILS  (Layer 3)                          │
 │   PII leak · policy violation · hallucination check   │
 └──────────────────────────────────────────────────────┘
        ▼
 ┌──────────────────────────────────────────────────────┐
 │ PERSIST + AUDIT  (Layer 6)                            │
 │   AES-256-GCM at rest · per-tenant data key           │
 │   KMS holds the root key · append-only audit log      │
 └──────────────────────────────────────────────────────┘

🧭Layer 0 — why your vendor is a dependency, not a destiny

Before the layers, the motivation — because if you do not feel the problem, the solution looks like over-engineering.

When you call an LLM provider, you are taking a hard dependency on a company you do not control. That company can — and routinely does — have an outage, change its prices, deprecate the exact model your prompts were tuned for, change its data-handling terms, or lose a compliance certification your customer requires. None of those are hypothetical. All of them happened to someone in the last year.

⚠️The lock-in tax is paid at the worst possible time

If your application code is full of openai.chat.completions.create(...) calls, then “switch providers” is a multi-week refactor touching every file that talks to a model. You will be asked to do that refactor precisely when there is an outage, a price shock, or a compliance deadline — i.e. under maximum pressure. The vendor-agnostic architecture moves that cost forward and shrinks it: switching becomes a config change plus, at most, one new adapter.

This is not anti-vendor. You will still pick the best model for each job — that is Layer 2. The point is that the choice stays yours, reversible, and cheap to change. The same instinct that says “do not hard-code your database vendor into every query” says “do not hard-code your model vendor into every prompt.”

🔀Layer 1 — the model gateway (the provider abstraction)

The model gateway is one internal module with one job: expose a provider-neutral way to call an LLM. Application code calls the gateway. The gateway — and only the gateway — knows that Anthropic, OpenAI, Bedrock, and Vertex exist.

Step one is a neutral request and response shape. It is deliberately small — the common denominator of what every provider can do:

📄 gateway/types.ts — the provider-neutral contract

typescript

// The shape the WHOLE application speaks. No vendor words in here.
export interface ModelRequest {
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[]
  maxTokens: number
  temperature?: number
  // a capability tag, NOT a vendor model name — Layer 2 turns this into a real model
  task: 'cheap-extract' | 'general' | 'hard-reasoning' | 'long-context'
}

export interface ModelResponse {
  text: string
  provider: string          // which vendor actually served it (for logs/audit)
  model: string             // the concrete model id that was used
  usage: { inputTokens: number; outputTokens: number; costUsd: number }
}

// Every vendor adapter implements exactly this. That's the entire seam.
export interface ModelProvider {
  readonly name: string
  complete(req: ModelRequest): Promise<ModelResponse>
}


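Because the contract is this small, a test double can stand in for a real vendor. The sketch below is an illustration under assumptions, not part of the original codebase: `StubProvider` and its word-count "token" estimate are hypothetical, and the three interfaces are repeated only so the block runs on its own.

```typescript
// Interfaces repeated from gateway/types.ts so this sketch is self-contained.
interface ModelRequest {
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[]
  maxTokens: number
  temperature?: number
  task: 'cheap-extract' | 'general' | 'hard-reasoning' | 'long-context'
}
interface ModelResponse {
  text: string
  provider: string
  model: string
  usage: { inputTokens: number; outputTokens: number; costUsd: number }
}
interface ModelProvider {
  readonly name: string
  complete(req: ModelRequest): Promise<ModelResponse>
}

// Hypothetical test double: no network, no SDK, deterministic output.
class StubProvider implements ModelProvider {
  readonly name = 'stub'
  readonly calls: ModelRequest[] = []   // every request received, for assertions
  private readonly reply: string

  constructor(reply: string) {
    this.reply = reply
  }

  async complete(req: ModelRequest): Promise<ModelResponse> {
    this.calls.push(req)
    // Crude estimate (whitespace-separated words); a stand-in, not a real tokenizer.
    const words = (s: string) => s.split(/\s+/).filter(Boolean).length
    const inputTokens = req.messages.reduce((n, m) => n + words(m.content), 0)
    return {
      text: this.reply,
      provider: this.name,
      model: `stub-${req.task}`,
      usage: { inputTokens, outputTokens: words(this.reply), costUsd: 0 },
    }
  }
}
```

Anything written against `ModelProvider` can be exercised with a stub like this, which is one practical check that no application code depends on a vendor-specific detail.
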
Step two is one thin adapter per vendor. Each adapter translates the neutral shape into that vendor's wire format and back. The adapters are the only files in the codebase that import a vendor SDK:

📄 gateway/adapters/anthropic.ts — one of several adapters

typescript

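import Anthropic from '@anthropic-ai/sdk'
import { ModelProvider, ModelRequest, ModelResponse } from '../types'
import { resolveModel } from '../decision-matrix'
import { priceFor } from '../pricing'   // assumed helper module (not shown on this page)

export class AnthropicAdapter implements ModelProvider {
  readonly name = 'anthropic'
  // With no arguments the SDK reads ANTHROPIC_API_KEY from the environment;
  // populate that variable from the KMS-backed secret mount at startup.
  private client = new Anthropic()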

  async complete(req: ModelRequest): Promise<ModelResponse> {
    const model = resolveModel('anthropic', req.task)   // Layer 2 decides the concrete model
    const system = req.messages.find(m => m.role === 'system')?.content
    const turns  = req.messages.filter(m => m.role !== 'system')

    const res = await this.client.messages.create({
      model,
      max_tokens: req.maxTokens,
      temperature: req.temperature ?? 0.7,
      system,
      messages: turns as Anthropic.MessageParam[],
    })

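    // Type guard so TypeScript narrows each block to a text block before reading .text
    const text = res.content
      .filter((b): b is Anthropic.TextBlock => b.type === 'text')
      .map(b => b.text)
      .join('')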
    return {
      text,
      provider: this.name,
      model,
      usage: {
        inputTokens: res.usage.input_tokens,
        outputTokens: res.usage.output_tokens,
        costUsd: priceFor(model, res.usage),
      },
    }
  }
}

// gateway/adapters/openai.ts, bedrock.ts, vertex.ts are the same shape —
// import the vendor SDK, map the neutral request in, map the response out.


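The adapter calls `priceFor`, which this page does not show. One plausible shape is a lookup table of per-million-token rates. Everything in this sketch is illustrative: the model ids and dollar figures are placeholders rather than real vendor prices, and the `Usage` field names simply mirror the `res.usage` object read in the adapter above.

```typescript
// Token counts in the shape the adapter above reads from the vendor response.
interface Usage { input_tokens: number; output_tokens: number }

// USD per one million tokens. PLACEHOLDER ids and numbers, not a real price list.
const RATES: Record<string, { inputPerM: number; outputPerM: number }> = {
  'example-small-model': { inputPerM: 1, outputPerM: 5 },
  'example-large-model': { inputPerM: 15, outputPerM: 75 },
}

function priceFor(model: string, usage: Usage): number {
  const rate = RATES[model]
  // Fail loudly on an unknown model: a silent 0 would corrupt cost accounting.
  if (!rate) throw new Error(`No price configured for model ${model}`)
  const usd =
    (usage.input_tokens / 1_000_000) * rate.inputPerM +
    (usage.output_tokens / 1_000_000) * rate.outputPerM
  // Round to 6 decimal places so per-call costs sum predictably in reports.
  return Math.round(usd * 1e6) / 1e6
}
```

Keeping prices in one table means a vendor price change is a data edit, and the cost column of the decision matrix can read from the same source.
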
Step three is the gateway itself. It owns the cross-cutting concerns so every call gets them for free: provider selection, fallback, cost accounting, and tenant tagging. Retries and timeouts belong here too, although the listing below does not show them:

📄 gateway/index.ts — the one entry point the app calls

typescript

import { ModelRequest, ModelResponse, ModelProvider } from './types'
import { AnthropicAdapter } from './adapters/anthropic'
import { OpenAIAdapter } from './adapters/openai'
import { BedrockAdapter } from './adapters/bedrock'
import { VertexAdapter } from './adapters/vertex'
import { chooseProvider } from './decision-matrix'   // Layer 2
import { tenantContext } from '../context/tenant'    // Layer 4
import { auditLog } from '../audit'                  // Layer 6

const providers: Record<string, ModelProvider> = {
  anthropic: new AnthropicAdapter(),
  openai:    new OpenAIAdapter(),
  bedrock:   new BedrockAdapter(),
  vertex:    new VertexAdapter(),
}

export async function complete(req: ModelRequest): Promise<ModelResponse> {
  const { tenantId, workspaceId } = tenantContext.get()   // ambient: see Layer 4

  // Layer 2 returns an ordered list: [primary, ...fallbacks]
  const order = chooseProvider(req, { tenantId })

  let lastErr: unknown
  for (const name of order) {
    let res: ModelResponse
    try {
      res = await providers[name].complete(req)
    } catch (err) {
      lastErr = err            // outage / rate limit / deprecation → try the next provider
      continue
    }
    // Audit OUTSIDE the try: if the audit write fails, surface that error instead of
    // silently re-sending the prompt to the next provider.
    await auditLog({ tenantId, workspaceId, provider: name, model: res.model, usage: res.usage })
    return res
  }
  throw new Error('All providers failed: ' + String(lastErr))
}


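The fallback loop above treats every thrown error as a reason to try the next provider. A caller error such as a malformed request will usually fail everywhere, so it can be worth classifying errors first and bounding each attempt with a timeout. This is a sketch under assumptions: `ProviderError`, `TimeoutError`, `shouldFallBack`, and `withTimeout` are hypothetical names, and it assumes each adapter throws errors that carry an HTTP-style `status`.

```typescript
// Hypothetical error shape: assumes adapters attach the HTTP status when there is one.
class ProviderError extends Error {
  readonly status?: number
  constructor(message: string, status?: number) {
    super(message)
    this.name = 'ProviderError'
    this.status = status
  }
}

class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Provider call exceeded ${ms} ms`)
    this.name = 'TimeoutError'
  }
}

// True when trying the NEXT provider could plausibly help.
function shouldFallBack(err: unknown): boolean {
  if (err instanceof TimeoutError) return true
  if (err instanceof ProviderError) {
    if (err.status === undefined) return true   // network-level failure, no response at all
    if (err.status === 408 || err.status === 429) return true   // timeout / rate limited
    if (err.status === 404) return true         // often a retired model id
    return err.status >= 500                    // provider-side outage
  }
  return false   // bugs and other 4xx caller errors: fail fast
}

// Rejects with TimeoutError if `work` has not settled within `ms`.
// Note: this stops waiting; it does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms)
  })
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer))
}
```

In the gateway loop this would read roughly as `res = await withTimeout(providers[name].complete(req), 30_000)` in the `try`, and `if (!shouldFallBack(err)) throw err` at the top of the `catch`. The 30-second figure is only an example.
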
🔑Adding AWS Bedrock is a connector, not a redesign

When a new provider is required — say a customer mandates AWS Bedrock for FedRAMP reasons — you write one new file: gateway/adapters/bedrock.ts, implementing the same ModelProvider interface. You register it in the providers map. You add it to the decision matrix. Zero application files change, because no application file ever imported a vendor SDK in the first place. That is the whole payoff of Layer 1.

⚖️Layer 2 — the model-selection decision matrix (the guardrails on which model)

The gateway can reach four providers. Which one should serve a given request? That is not a vibe — it is a checklist. This is what the LinkedIn mock interview was really asking: what criteria do you use to choose a model, and what are the guardrails on that choice?

Here is the matrix. Every row is a question the router asks before a single token is spent:

Criterion	The question it answers	Example consequence
Capability fit	Is this task simple extraction, or hard multi-step reasoning?	cheap-extract → Haiku / Flash; hard-reasoning → Opus / o-series
Cost per token	What does this task cost at each provider, at expected volume?	A 10M-call/day classifier on a frontier model is a budget fire
Latency budget	Is a human waiting on this, or is it a background job?	Interactive chat → fast model; nightly batch → cheapest model
Context window	Does the prompt plus retrieved context fit the model's window?	A 400K-token document review forces a long-context model
Data residency	Is this tenant's data allowed to leave a region or a boundary?	An EU or GovCloud tenant pins to an in-boundary provider only
Compliance posture	Does this provider hold the certification this tenant requires?	FedRAMP / SOC 2 / HIPAA — a missing cert removes a provider
Provider health	Is the primary provider currently degraded or rate-limiting?	Circuit-breaker open → skip straight to the fallback
Fallback chain	If the chosen provider fails mid-request, who is next?	Always return an ordered list, never a single provider

In code, the matrix is one small function: request and tenant in, ordered provider list out. Its only other inputs are the tenant policy and the circuit-breaker state, so stubbing those two makes it trivially testable, and a model-selection decision you cannot unit-test is a decision you cannot defend in an audit.

📄 gateway/decision-matrix.ts

typescript

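import { ModelRequest } from './types'
import { getTenantPolicy } from '../tenancy/policy'   // per-tenant residency + compliance rules
import { circuitState } from './health'
// Assumed module (not shown on this page): the provider catalog and its scoring helpers.
import { ALL_PROVIDERS, capabilityScore, costFor } from './catalog'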

// Returns an ORDERED list: [primary, ...fallbacks]. Never a single value.
export function chooseProvider(
  req: ModelRequest,
  ctx: { tenantId: string },
): string[] {
  const policy = getTenantPolicy(ctx.tenantId)   // e.g. { residency: 'us-gov', certs: ['fedramp'] }

  // 1. HARD FILTER — drop any provider this tenant is not allowed to use.
  //    Residency and compliance are guardrails, not preferences: they cannot be traded away.
  let eligible = ALL_PROVIDERS.filter(p =>
    p.regions.includes(policy.residency) &&
    policy.certs.every(c => p.certifications.includes(c)),
  )

  // 2. HEALTH FILTER — drop providers whose circuit breaker is currently open.
  eligible = eligible.filter(p => circuitState(p.name) !== 'open')

  // 3. RANK the survivors by capability fit, then latency, then cost.
  const ranked = eligible.sort((a, b) =>
    capabilityScore(b, req.task) - capabilityScore(a, req.task) ||
    a.p50LatencyMs - b.p50LatencyMs ||
    costFor(a, req) - costFor(b, req),
  )

  if (ranked.length === 0) {
    throw new Error(`No compliant provider for tenant ${ctx.tenantId}`)  // fail closed
  }
  return ranked.map(p => p.name)   // primary first, fallbacks after
}


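The Anthropic adapter imports `resolveModel` from this module, but the listing above does not include it. A minimal sketch of what it might look like follows: a lookup from provider and capability tag to a concrete model id that fails closed when nothing is configured. The table contents are placeholders, not real model ids.

```typescript
type Task = 'cheap-extract' | 'general' | 'hard-reasoning' | 'long-context'

// PLACEHOLDER ids: substitute the model ids your providers currently publish.
const MODEL_TABLE: Record<string, Partial<Record<Task, string>>> = {
  anthropic: {
    'cheap-extract': 'anthropic-small-placeholder',
    general: 'anthropic-medium-placeholder',
    'hard-reasoning': 'anthropic-large-placeholder',
    'long-context': 'anthropic-large-placeholder',
  },
  openai: {
    'cheap-extract': 'openai-small-placeholder',
    general: 'openai-medium-placeholder',
    'hard-reasoning': 'openai-reasoning-placeholder',
    // No 'long-context' entry: in this sketch the provider is not offered for that task.
  },
}

function resolveModel(provider: string, task: Task): string {
  const model = MODEL_TABLE[provider]?.[task]
  // Fail closed: never guess a model id on behalf of the caller.
  if (!model) throw new Error(`No model configured for ${provider} / ${task}`)
  return model
}
```

Because the mapping is data, a model deprecation becomes a one-line table edit rather than a change to prompts or application code.
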
💡Capability and cost are preferences. Residency and compliance are guardrails.

Notice the two-stage shape: a hard filter first, then a ranking. The hard filter encodes the things that are never negotiable — a GovCloud tenant's data does not go to a non-GovCloud provider to save money, ever. The ranking encodes the things you optimize once the non-negotiables are satisfied. Mixing those two into one weighted score is the classic mistake: it lets a cost saving silently outvote a compliance requirement. Keep them separate, and the system fails closed — if nothing is compliant, it errors rather than guessing.
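The filter-then-rank shape can be exercised without a database or live health checks by passing the policy and the set of open circuits in as arguments. This is a self-contained variant for illustration only: `rankProviders`, the `ProviderInfo` fields beyond those used above, and the three catalog entries with their scores are all invented.

```typescript
type Task = 'cheap-extract' | 'general' | 'hard-reasoning' | 'long-context'

interface ProviderInfo {
  name: string
  regions: string[]
  certifications: string[]
  p50LatencyMs: number
  capability: Record<Task, number>   // higher is better; illustrative scores
}
interface TenantPolicy { residency: string; certs: string[] }

function rankProviders(
  catalog: ProviderInfo[],
  policy: TenantPolicy,
  task: Task,
  openCircuits: ReadonlySet<string>,
): string[] {
  // Stage 1: hard filter on residency, certifications, and health. No scoring here.
  const eligible = catalog.filter(p =>
    p.regions.includes(policy.residency) &&
    policy.certs.every(c => p.certifications.includes(c)) &&
    !openCircuits.has(p.name),
  )
  if (eligible.length === 0) throw new Error('No compliant provider available')   // fail closed

  // Stage 2: rank the survivors. Capability first, latency as the tie-breaker.
  return [...eligible]
    .sort((a, b) => b.capability[task] - a.capability[task] || a.p50LatencyMs - b.p50LatencyMs)
    .map(p => p.name)
}

// Invented catalog for the example.
const exampleCatalog: ProviderInfo[] = [
  { name: 'alpha', regions: ['us', 'eu'], certifications: ['soc2'], p50LatencyMs: 400,
    capability: { 'cheap-extract': 2, general: 3, 'hard-reasoning': 5, 'long-context': 4 } },
  { name: 'bravo', regions: ['us', 'us-gov'], certifications: ['soc2', 'fedramp'], p50LatencyMs: 600,
    capability: { 'cheap-extract': 3, general: 3, 'hard-reasoning': 3, 'long-context': 3 } },
  { name: 'charlie', regions: ['us-gov'], certifications: ['fedramp'], p50LatencyMs: 300,
    capability: { 'cheap-extract': 3, general: 2, 'hard-reasoning': 2, 'long-context': 2 } },
]
```

With this catalog, `alpha` scores highest on hard reasoning yet never appears for a `us-gov` tenant, because it is removed before any score is compared.
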

🛡️Layer 3 — guardrails (the gate around the model)

Guardrails are the code that sits between the user and the model, and again between the model and anything the model's output touches. The model is powerful and gullible; the guardrails are the seatbelt.

⬇️

Input guardrails — before the model

Prompt-injection detection — is the user (or a retrieved document) trying to override the system prompt?
Jailbreak detection — known patterns that try to unlock disallowed behavior.
PII / secret scanning — strip or block card numbers, SSNs, credentials before they ever reach a third-party provider.
Topic + scope limits — keep the request inside what this product is allowed to answer.

⬆️

Output guardrails — after the model

PII leak check — did the model echo back sensitive data it should not have, or data from the wrong tenant?
Policy / safety filter — does the output violate content policy or a regulatory rule?
Hallucination check — for grounded tasks, is every claim supported by the retrieved context?
Schema validation — if structured output was promised, enforce it before it reaches a downstream system.

In the lifecycle, guardrails wrap the gateway call — nothing reaches a provider unchecked, and nothing leaves the model unchecked:

📄 guardrails/wrap.ts — guardrails wrap the gateway, not the other way round

typescript

import { complete } from '../gateway'
import { ModelRequest, ModelResponse } from '../gateway/types'
import { scanInput, scanOutput, GuardrailError } from './checks'
import { tenantContext } from '../context/tenant'

export async function guardedComplete(req: ModelRequest): Promise<ModelResponse> {
  const { tenantId } = tenantContext.get()

  // ── INPUT GATE ────────────────────────────────────────────────
  const inVerdict = await scanInput(req, tenantId)
  if (inVerdict.blocked) {
    throw new GuardrailError('input', inVerdict.reason)   // never reaches a provider
  }

  // ── THE MODEL CALL (Layers 1 + 2) ─────────────────────────────
  const res = await complete(inVerdict.sanitizedRequest)

  // ── OUTPUT GATE ───────────────────────────────────────────────
  const outVerdict = await scanOutput(res, tenantId)
  if (outVerdict.blocked) {
    throw new GuardrailError('output', outVerdict.reason)  // never reaches the user
  }
  return outVerdict.sanitizedResponse
}


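`scanInput` comes from `./checks`, which this page does not show. One of its checks, the PII scan, could look roughly like the sketch below. The patterns are deliberately simple and will miss many real-world formats; `redactPii`, `passesLuhn`, and the placeholder tokens are hypothetical names.

```typescript
// Luhn checksum: filters out digit runs that merely look like card numbers.
function passesLuhn(digits: string): boolean {
  let sum = 0
  let double = false
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48   // '0' is char code 48
    if (double) {
      d *= 2
      if (d > 9) d -= 9
    }
    sum += d
    double = !double
  }
  return digits.length > 0 && sum % 10 === 0
}

interface RedactionResult { text: string; findings: string[] }

function redactPii(input: string): RedactionResult {
  const findings: string[] = []

  // US SSN in the common dashed 3-2-4 form only.
  let text = input.replace(/\b\d{3}-\d{2}-\d{4}\b/g, () => {
    findings.push('ssn')
    return '[REDACTED_SSN]'
  })

  // 13 to 19 digits, optionally separated by single spaces or dashes, that pass Luhn.
  text = text.replace(/\b\d(?:[ -]?\d){12,18}\b/g, match => {
    const digits = match.replace(/[ -]/g, '')
    if (!passesLuhn(digits)) return match   // leave order numbers and the like alone
    findings.push('card')
    return '[REDACTED_CARD]'
  })

  return { text, findings }
}
```

Returning the findings alongside the redacted text lets the caller choose between sanitizing the request and blocking it outright, which matches the `blocked` / `sanitizedRequest` verdict used in `guardedComplete`.
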
💡Guardrails are provider-agnostic too

Because guardrails wrap the gateway and not a specific vendor, they keep working unchanged when you switch providers. The injection scanner does not care whether the model underneath is Claude or GPT or a Bedrock-hosted model — it inspects the neutral request and the neutral response. One more thing the abstraction buys you.

🏢Layer 4 — multi-tenant data integrity (the part that actually leaks)

This is the layer the mock interview pushed hardest on, and rightly so. In a multi-tenant system, the catastrophic failure is not downtime — it is Tenant A seeing Tenant B's data. And here is the uncomfortable truth about how that happens:

⚠️Cross-tenant leaks are almost never a missing check — they are a forgotten one

Nobody writes SELECT * FROM documents on purpose. What happens is: a query is written correctly with WHERE tenant_id = ?, and then six months later someone adds a new query in a hurry and forgets the filter — or forgets to thread the tenantId argument through the fourth function in the call chain. The leak is a dropped parameter. So the architecture's job is to make the parameter impossible to drop.

The rule: inject, don't pass

Do not pass tenantId, workspaceId, and role as ordinary function arguments that every layer has to remember to forward. Instead, the controller establishes them once, at the very edge of the request, from the verified token — and injects them into a request-scoped context that every lower layer reads automatically.

🪪

tenant_id

Which customer organization. The hard partition — data never crosses it.

📁

workspace_id

Which project / team inside that customer. A softer partition for scoping within a tenant.

🎭

role

What this user may do (RBAC) — admin, member, read-only — also injected, also enforced below.

Step one — the controller reads identity from the verified token and injects it. This is the only place identity is set:

📄 controllers/chat.ts — identity is established ONCE, here

typescript

import { Router } from 'express'
import { verifyToken } from '../auth'
import { tenantContext } from '../context/tenant'
import { guardedComplete } from '../guardrails/wrap'

export const chatRouter = Router()

chatRouter.post('/chat', async (req, res, next) => {
  try {
    // 1. Verify the token. This is the trust boundary: nothing before it is trusted.
    const claims = await verifyToken(req.headers.authorization)
    //    claims = { tenantId, workspaceId, role, userId }, cryptographically verified

    // 2. INJECT identity into the request-scoped context, then run the handler INSIDE it.
    //    Everything downstream of tenantContext.run() can read this, and nothing can
    //    run a query without it, because the data layer refuses to (see step 3).
    await tenantContext.run(claims, async () => {
      const answer = await guardedComplete({
        messages: req.body.messages,
        maxTokens: 1024,
        task: 'general',
      })
      res.json({ answer: answer.text })
    })
  } catch (err) {
    // Express 4 does not catch rejections from async handlers; forward them explicitly
    // so a failed token check or a guardrail block reaches the error middleware.
    next(err)
  }
})


Step two — the context itself. In Node this is AsyncLocalStorage (Python: contextvars; Java: a request-scoped bean). It is “ambient” — available to any code running inside the request, without being handed down explicitly:

📄 context/tenant.ts — the ambient, request-scoped identity

typescript

import { AsyncLocalStorage } from 'node:async_hooks'

export interface TenantClaims {
  tenantId: string
  workspaceId: string
  role: 'admin' | 'member' | 'readonly'
  userId: string
}

const als = new AsyncLocalStorage<TenantClaims>()

export const tenantContext = {
  run: <T>(claims: TenantClaims, fn: () => Promise<T>) => als.run(claims, fn),

  // Throws if called outside a request scope. That throw is a FEATURE:
  // it means no code path can quietly run "tenant-less".
  get(): TenantClaims {
    const claims = als.getStore()
    if (!claims) throw new Error('No tenant context — query attempted outside a request scope')
    return claims
  },
}


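Two properties of the ambient context are worth seeing run: overlapping requests never observe each other's identity, and the injected role can be enforced from the same store. The sketch below repeats a minimal `tenantContext` so it is self-contained; `requireRole`, `whoAmI`, and `demo` are hypothetical helpers added for illustration.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks'

type Role = 'admin' | 'member' | 'readonly'
interface TenantClaims { tenantId: string; workspaceId: string; role: Role; userId: string }

const als = new AsyncLocalStorage<TenantClaims>()

const tenantContext = {
  run: <T>(claims: TenantClaims, fn: () => T): T => als.run(claims, fn),
  get(): TenantClaims {
    const claims = als.getStore()
    if (!claims) throw new Error('No tenant context')
    return claims
  },
}

// Hypothetical helper: role enforcement reads the same ambient identity.
const ROLE_RANK: Record<Role, number> = { readonly: 0, member: 1, admin: 2 }
function requireRole(minimum: Role): void {
  const { role } = tenantContext.get()
  if (ROLE_RANK[role] < ROLE_RANK[minimum]) {
    throw new Error(`Role ${role} may not perform this action`)
  }
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Deep in the call stack: no tenantId parameter anywhere in the signature.
async function whoAmI(delayMs: number): Promise<string> {
  await sleep(delayMs)   // yield so the two requests interleave
  return tenantContext.get().tenantId
}

// Two overlapping "requests"; each must still see only its own tenant.
async function demo(): Promise<string[]> {
  const a: TenantClaims = { tenantId: 'tenant-a', workspaceId: 'w1', role: 'admin', userId: 'u1' }
  const b: TenantClaims = { tenantId: 'tenant-b', workspaceId: 'w2', role: 'readonly', userId: 'u2' }
  return Promise.all([
    tenantContext.run(a, () => whoAmI(20)),
    tenantContext.run(b, () => whoAmI(5)),
  ])
}
```

Request B finishes first here, while request A is still suspended, and each call still resolves to its own tenant id because the store follows the asynchronous call chain rather than a shared global.
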
Step three — the data layer reads the tenant from the context itself. There is no public way to query without a tenant filter, because the function that would let you do that does not exist:

📄 db/scoped.ts — the ONLY way to reach the database

typescript

import { pool } from './pool'
import { tenantContext } from '../context/tenant'

// Every query goes through here. There is no exported "raw query" function.
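export async function scopedQuery<T>(sql: string, params: unknown[] = []): Promise<T[]> {
  const { tenantId } = tenantContext.get()    // ambient: cannot be forgotten, cannot be faked

  const client = await pool.connect()
  try {
    // A transaction-local setting only exists inside a transaction, so open one first.
    await client.query('BEGIN')
    // Wall 1 (application): bind the tenant into a transaction-local setting on THIS connection.
    // SET LOCAL cannot take bind parameters; set_config(name, value, is_local) can.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId])
    // Wall 2 (database): Postgres row-level security reads that same setting (see below).
    const result = await client.query(sql, params)
    await client.query('COMMIT')
    return result.rows as T[]
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  } finally {
    client.release()
  }
}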


📄 migrations/001_rls.sql — Wall 2, enforced by the database itself

sql

-- Row-level security: even a query with NO tenant filter only sees its own tenant's rows.
-- This is the independent second wall — it holds even if application code has a bug.
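ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS by default; FORCE applies the policy to the owner as well.
-- Superusers and roles with BYPASSRLS still bypass it, so the app must not connect as one.
ALTER TABLE documents FORCE ROW LEVEL SECURITY;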

CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- Now: SELECT * FROM documents  -- returns ONLY the current tenant's rows.
-- The filter is no longer the developer's responsibility — it is the database's.

🔑Three independent walls — and why 'independent' is the whole point

A request now passes three walls that each stop a cross-tenant leak on their own:

Wall 1 — the ambient context. Application code physically cannot run a query without a tenant, because scopedQuery reads it from tenantContext.get(), which throws when absent.
Wall 2 — Postgres row-level security. Even a query that somehow forgot its WHERE clause only sees the current tenant's rows, because the database enforces it.
Wall 3 — per-tenant encryption keys (Layer 6). Even raw bytes read off disk for the wrong tenant are ciphertext, because each tenant's data is encrypted under its own key.

“Independent” means a single bug breaches at most one wall. You do not get a cross-tenant leak from one mistake — you would need three simultaneous, unrelated failures. That is what defense-in-depth actually means, made concrete.

And the model call inherits all of this for free: back in Layer 1, the gateway read tenantContext.get() to tag every provider call and every audit row with the tenant. Prompts, logs, and cost accounting stay partitioned by the same identity that partitions the database — because it is the same identity, injected once.

🔐Layer 5 — encryption in transit (and exactly where the private key lives)

Encryption in transit means: while data is moving across a network, anyone who taps the wire sees ciphertext. The statement is easy. The engineering is in the details — which hops, what kind of TLS, and where the keys physically live. Let us not leave any of that hand-wavy.

Every hop, no exceptions

There is no hop — however “internal” — that travels in plaintext. The network perimeter is not a substitute for encryption; that assumption is exactly what zero-trust architecture exists to kill.

HOP                                   PROTECTION         WHO PROVES IDENTITY
─────────────────────────────────────────────────────────────────────────────
browser  → load balancer              TLS 1.3            server only
load balancer → controller            mTLS               server + client
controller → gateway / guardrails     mTLS               server + client
gateway  → LLM provider (egress)      mTLS               server + client
service  → database                   TLS 1.3 + cert     server (+ client cert)

The public edge uses ordinary TLS 1.3: the browser checks the server's certificate, while the browser itself stays anonymous at the TLS layer (the user authenticates with a token, separately). From the load balancer inward, every hop upgrades to mTLS (mutual TLS), where the client also presents a certificate and the server refuses the connection without one.

How mTLS actually works — and where each private key lives

This is the question the interview asked directly. In a client-certificate exchange there are two key pairs, and the entire security of the scheme rests on where the private halves live and never leave:

🖥️

The server's key pair

Public certificate: sent to the client during the handshake. Travels the wire freely — it is public.

Private key: never leaves the server. Mounted into the pod from a secrets store or a hardware security module, file permissions 0400, readable only by the server process's user. Not in the container image. Not in source control. Not in an environment variable. Not in a log line.

📦

The client's key pair

Public certificate: sent to the server during the handshake so the server can verify the caller. Also public, also fine on the wire.

Private key: never leaves the client. Same treatment — mounted secret or HSM, 0400, process-readable only. The client uses it to sign one value during the handshake, proving it holds the key without ever transmitting it.

🔑The one sentence to remember about private keys

Only public certificates ever cross the wire. Private keys are mounted, never sent. A private key's entire life is: generated inside a boundary (a KMS, an HSM, or a sealed secrets process), delivered to exactly one workload as a read-only mount, used in memory, and destroyed when the pod dies. If a private key has ever been in a git repo, a Slack message, a Dockerfile, or a log — it is not a private key anymore, it is a public one, and the certificate must be revoked.

The mTLS handshake, step by step — note that the private keys are used but never sent:

CLIENT                                                    SERVER
  │                                                          │
  │ ── 1. ClientHello ─────────────────────────────────────▶ │
  │                                                          │
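  │ ◀── 2. ServerHello + server's PUBLIC certificate ──────  │
  │        + a signature made WITH the server's PRIVATE key  │
  │        + "I require a client certificate too"            │
  │                                                          │
  │   3. verify server cert against the trusted CA bundle    │
  │      and its signature (does it hold the private key?)   │
  │      (is this really who I meant to call?)               │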
  │                                                          │
  │ ── 4. client's PUBLIC certificate ─────────────────────▶ │
  │ ── 5. a signature made WITH the client's PRIVATE key ──▶ │
  │      (the private key never leaves — only the signature) │
  │                                                          │
  │                          6. verify client cert against   │
  │                             the trusted CA bundle, and   │
  │                             verify the signature         │
  │                                                          │
  │ ◀════ 7. encrypted channel open — both sides proven ════▶│

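Steps 5 and 6 are the heart of the exchange: a signature proves possession of a private key that is never transmitted. The snippet below isolates that one idea with Node's `crypto` module. It is a teaching sketch, not a TLS implementation: Ed25519 is chosen only for brevity, the random `transcript` stands in for the handshake data a real TLS stack would sign, and the key pair is generated inline instead of being read from a mounted secret.

```typescript
import { generateKeyPairSync, randomBytes, sign, verify } from 'node:crypto'

// The client's key pair. In production the private half comes from the mounted secret;
// it is generated here only so the example runs on its own.
const clientKeys = generateKeyPairSync('ed25519')

// Stand-in for the handshake transcript both sides have seen so far.
const transcript = randomBytes(32)

// Client side: sign with the PRIVATE key. Only `signature` goes on the wire.
const signature = sign(null, transcript, clientKeys.privateKey)

// Server side: verify using only the PUBLIC key from the client's certificate.
const accepted = verify(null, transcript, clientKeys.publicKey, signature)

// An impostor holding a different private key cannot produce a valid signature.
const impostorKeys = generateKeyPairSync('ed25519')
const forged = sign(null, transcript, impostorKeys.privateKey)
const forgeryRejected = !verify(null, transcript, clientKeys.publicKey, forged)

// A signature is bound to its transcript: replaying it in another handshake fails.
const replayRejected = !verify(null, randomBytes(32), clientKeys.publicKey, signature)

console.log({ accepted, forgeryRejected, replayRejected })
```

The server learns that the caller holds the private key matching the certificate, and nothing else. That is why a leaked certificate is harmless while a leaked private key forces revocation.
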
And the configuration — concretely, this is all it is. Issue a cert + key pair per service from an internal certificate authority, mount them, point the client and server at the file paths, and pin the CA bundle so only internally-issued certificates are trusted:

📄 k8s/gateway-deployment.yaml — the private key is a read-only mount, nothing more

yaml

# The cert + key pair is issued by the internal CA (cert-manager / Vault PKI)
# and delivered as a secret. The pod MOUNTS it — the key is never in the image.
volumes:
  - name: mtls-certs
    secret:
      secretName: gateway-client-cert   # contains tls.crt, tls.key, ca.crt
      defaultMode: 0400                 # read-only, owner-only

containers:
  - name: model-gateway
    volumeMounts:
      - name: mtls-certs
        mountPath: /etc/mtls
        readOnly: true                  # the workload cannot even modify it


📄 gateway/http-client.ts — point the client at the mounted paths

typescript

import { Agent } from 'undici'
import { readFileSync } from 'node:fs'

// The private key is READ from the mount at startup, held in memory, and used to
// sign handshakes. It is never logged, never serialized, never sent.
export const mtlsAgent = new Agent({
  connect: {
    cert: readFileSync('/etc/mtls/tls.crt'),   // our PUBLIC certificate — ok to present
    key:  readFileSync('/etc/mtls/tls.key'),   // our PRIVATE key — used in memory only
    ca:   readFileSync('/etc/mtls/ca.crt'),    // pin: trust ONLY internally-issued certs
    rejectUnauthorized: true,                  // refuse any peer we cannot verify
  },
})

// Every egress call from the gateway uses this agent. In a service mesh
// (Istio, Linkerd) the sidecar does all of the above and the app code stays clean —
// but the model is identical: mounted key, public cert on the wire, pinned CA.


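The listing above is the client half. The hop table also requires the receiving service to demand a client certificate, and in plain Node that comes down to two TLS options. The sketch below assumes no service mesh is terminating mTLS for you; `buildServerTlsOptions` is a hypothetical helper, and the file reader is injected so the function can be checked without real certificates on disk.

```typescript
import type { TlsOptions } from 'node:tls'

// `read` is injected so the function can be tested without real files on disk.
function buildServerTlsOptions(
  dir: string,
  read: (path: string) => Buffer,
): TlsOptions {
  return {
    cert: read(`${dir}/tls.crt`),   // this service's PUBLIC certificate
    key: read(`${dir}/tls.key`),    // this service's PRIVATE key, held in memory only
    ca: read(`${dir}/ca.crt`),      // trust ONLY certificates issued by the internal CA
    requestCert: true,              // ask every caller for a client certificate
    rejectUnauthorized: true,       // drop the connection if it is missing or untrusted
    minVersion: 'TLSv1.3',
  }
}

// Usage (not executed here, because it needs the mounted files):
//   import { createServer } from 'node:https'
//   import { readFileSync } from 'node:fs'
//   const opts = buildServerTlsOptions('/etc/mtls', p => readFileSync(p))
//   createServer(opts, handler).listen(8443)
```

With `requestCert` and `rejectUnauthorized` both set, a caller without a certificate signed by the pinned CA is refused during the handshake, before any request handler runs.
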
🗄️Layer 6 — encryption at rest (and what 'where does the key live' means here)

Encryption at rest means: while data sits in storage — the database, the object store, the backups, the logs — it is ciphertext. A stolen disk or a leaked snapshot is useless without the keys. Again the statement is simple; the engineering is in which keys, held where, and who can ask them to do what.

Envelope encryption — the professional pattern

Naive encryption uses one key for everything; if it leaks, everything is exposed, and rotating it means re-encrypting the world. Envelope encryption fixes both problems with a hierarchy of keys:

┌─────────────────────────────────────────────────────────────────┐
│  KMS  (hardware security module)                                 │
│                                                                   │
│   ROOT KEY  ── never leaves the KMS hardware. Ever.               │
│      │        the application cannot read it — it can only ASK   │
│      │        the KMS to use it, and every ask is logged.        │
│      ▼                                                            │
│   wraps (encrypts) each tenant's DATA KEY                         │
└──────┬────────────────────────────────────────────────────────────┘
       │   the app stores only the WRAPPED (encrypted) data key
       ▼
 ┌───────────────────────┐   ┌───────────────────────┐
 │ Tenant A   data key   │   │ Tenant B   data key   │   ... one per tenant
 │ (stored wrapped)      │   │ (stored wrapped)      │
 └──────────┬────────────┘   └──────────┬────────────┘
            │ to read Tenant A's data:  │
            │  1. ask KMS to unwrap A's data key (in memory only)
            │  2. AES-256-GCM decrypt the rows with it
            │  3. discard the unwrapped key
            ▼
   ciphertext rows in Postgres / object store / backups

So “where does the key live” has a precise answer at this layer:

The root key lives inside the KMS / HSM hardware and never comes out. The application never holds it — it can only request operations (“unwrap this data key”), and each request is authenticated, authorized, and audit-logged.
Each tenant data key is stored only in its wrapped (encrypted) form, next to the data. On its own it is useless — it has to be unwrapped by the KMS, in memory, for each use, then discarded.
The application holds neither key at rest. At most it holds an unwrapped data key in memory for the duration of one operation.
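
The hierarchy above can be sketched end to end without a cloud account. This is an illustration under stated assumptions, not the production path: `FakeKms` is a hypothetical in-process stand-in for the KMS (a real root key would never sit in application memory), and `wrappedKeys`, `createTenantKey`, and `withDataKey` are names invented for the example. Only `node:crypto` is used.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto'

type Sealed = { iv: Buffer; tag: Buffer; ciphertext: Buffer }

// AES-256-GCM encrypt with a fresh 96-bit nonce per call.
function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12)
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()])
  return { iv, tag: cipher.getAuthTag(), ciphertext }
}

// AES-256-GCM decrypt; final() throws if the key or the auth tag is wrong.
function open(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv('aes-256-gcm', key, sealed.iv)
  decipher.setAuthTag(sealed.tag)
  return Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()])
}

// ASSUMPTION: in-process stand-in for the KMS. Callers can only ask it to
// wrap or unwrap; they never read the root key itself.
class FakeKms {
  private readonly rootKey = randomBytes(32)
  wrap(dataKey: Buffer): Sealed { return seal(this.rootKey, dataKey) }
  unwrap(wrapped: Sealed): Buffer { return open(this.rootKey, wrapped) }
}

const kms = new FakeKms()
const wrappedKeys = new Map<string, Sealed>()   // what the app stores at rest

function createTenantKey(tenantId: string): void {
  const dataKey = randomBytes(32)
  wrappedKeys.set(tenantId, kms.wrap(dataKey))  // only the wrapped form is kept
  dataKey.fill(0)
}

// Unwrap for exactly one operation, then zero the key.
function withDataKey<T>(tenantId: string, use: (key: Buffer) => T): T {
  const wrapped = wrappedKeys.get(tenantId)
  if (!wrapped) throw new Error(`no data key for tenant ${tenantId}`)
  const dataKey = kms.unwrap(wrapped)
  try { return use(dataKey) } finally { dataKey.fill(0) }
}

export function encryptField(tenantId: string, value: string): Sealed {
  return withDataKey(tenantId, (key) => seal(key, Buffer.from(value, 'utf8')))
}

export function decryptField(tenantId: string, sealed: Sealed): string {
  return withDataKey(tenantId, (key) => open(key, sealed).toString('utf8'))
}

createTenantKey('tenant-a')
createTenantKey('tenant-b')
const row = encryptField('tenant-a', 'quarterly forecast')
const roundTrip = decryptField('tenant-a', row)
```

Decrypting `row` with Tenant B's key fails GCM authentication, which is the per-tenant wall expressed in code.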

📄 crypto/envelope.ts — the app asks the KMS; it never holds the root key

typescript

import { KMSClient, DecryptCommand } from '@aws-sdk/client-kms'
import { createDecipheriv } from 'node:crypto'
import { tenantContext } from '../context/tenant'
import { getWrappedDataKey } from './keystore'

const kms = new KMSClient()

// Decrypt a stored field for the CURRENT tenant (identity comes from Layer 4).
export async function decryptField(ciphertext: Buffer, iv: Buffer, tag: Buffer): Promise<string> {
  const { tenantId } = tenantContext.get()

  // 1. Fetch this tenant's data key — but it is WRAPPED; we cannot use it yet.
  const wrapped = await getWrappedDataKey(tenantId)

  // 2. Ask the KMS to unwrap it. The ROOT key does the work inside the KMS;
  //    we get back the data key in memory only. We never see the root key.
  const { Plaintext: dataKey } = await kms.send(new DecryptCommand({ CiphertextBlob: wrapped }))
  if (!dataKey) throw new Error(`KMS returned no plaintext data key for tenant ${tenantId}`)

  try {
    // 3. AES-256-GCM decrypt the field with the unwrapped data key.
    const decipher = createDecipheriv('aes-256-gcm', dataKey, iv)
    decipher.setAuthTag(tag)
    const plain = Buffer.concat([decipher.update(ciphertext), decipher.final()])
    return plain.toString('utf8')
  } finally {
    // 4. Discard the unwrapped key, even when final() throws on a bad auth tag.
    //    It lived in memory for one operation.
    dataKey.fill(0)
  }
}


🔑Per-tenant keys give you crypto-shredding — real, provable deletion

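Because each tenant has its own data key, deleting a tenant for real does not mean hunting down every row and every backup. You destroy the key instead. For that to hold, the key you destroy must be the only way back to the plaintext: in practice a per-tenant wrapping key inside the KMS, because wrapped data keys stored next to the data are copied into the same snapshots and backups as the data. Once that key is gone, every copy of the tenant's data, including snapshots and backups you cannot even reach, is permanently unreadable ciphertext. Managed KMS services usually enforce a waiting period before a key is actually destroyed (AWS KMS requires 7 to 30 days), but the key can be disabled immediately. That is crypto-shredding, and for “right to be forgotten” and federal data-handling requirements it is the difference between a deletion you can prove and one you can only claim.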
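
A minimal sketch of the effect, under simplifying assumptions: `tenantKeys` is a hypothetical in-memory map standing in for per-tenant keys held by the KMS, and `shredTenant` is an invented name. The point it illustrates is that a copy of the ciphertext that was never deleted still becomes unreadable once the key is gone, while other tenants are untouched.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto'

// ASSUMPTION: in-memory stand-in for per-tenant keys managed by a KMS.
const tenantKeys = new Map<string, Buffer>()

type Row = { tenantId: string; iv: Buffer; tag: Buffer; ciphertext: Buffer }

function encryptRow(tenantId: string, value: string): Row {
  let key = tenantKeys.get(tenantId)
  if (!key) {
    key = randomBytes(32)
    tenantKeys.set(tenantId, key)
  }
  const iv = randomBytes(12)
  const cipher = createCipheriv('aes-256-gcm', key, iv)
  const ciphertext = Buffer.concat([cipher.update(value, 'utf8'), cipher.final()])
  return { tenantId, iv, tag: cipher.getAuthTag(), ciphertext }
}

function decryptRow(row: Row): string {
  const key = tenantKeys.get(row.tenantId)
  if (!key) throw new Error(`key for ${row.tenantId} has been destroyed`)
  const decipher = createDecipheriv('aes-256-gcm', key, row.iv)
  decipher.setAuthTag(row.tag)
  return Buffer.concat([decipher.update(row.ciphertext), decipher.final()]).toString('utf8')
}

// Crypto-shredding: destroy the key, leave the ciphertext wherever it is.
function shredTenant(tenantId: string): void {
  const key = tenantKeys.get(tenantId)
  if (key) key.fill(0)
  tenantKeys.delete(tenantId)
}

const liveRow = encryptRow('tenant-a', 'tenant A document')
const backupCopy: Row = { ...liveRow }          // a copy we never get to delete
const otherTenantRow = encryptRow('tenant-b', 'tenant B document')

const readableBefore = decryptRow(backupCopy)   // works while the key exists
shredTenant('tenant-a')
```

After `shredTenant('tenant-a')`, `decryptRow(backupCopy)` throws, and `decryptRow(otherTenantRow)` still succeeds.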

The layers of at-rest encryption stack — each is independent, like the tenancy walls:

💽

Full-disk

The volume itself is encrypted. Stops a physically stolen disk. Always on.

🗃️

Database TDE

Transparent data encryption — the DB files are ciphertext. Stops a leaked snapshot.

🔒

Field-level

The most sensitive columns — prompts, documents, PII — encrypted per-tenant on top.

🧩Putting it together — one request, all six layers

Walk a single request through everything we built. This is the animated diagram at the top, narrated:

Edge. The request arrives over TLS 1.3 (Layer 5). The load balancer terminates public TLS and re-originates the connection as mTLS for everything inward.
Controller. The token is verified. Identity — tenant_id, workspace_id, role — is injected once into the request-scoped context (Layer 4). Nothing downstream has to be handed identity; it is ambient and enforced.
Input guardrails. The request is scanned for prompt injection, jailbreaks, and PII (Layer 3). If it fails, it never reaches a provider.
Gateway + decision matrix. The gateway (Layer 1) asks the decision matrix (Layer 2) for an ordered provider list — hard-filtered by the tenant's residency and compliance policy, then ranked by capability, latency, and cost. It calls the primary provider through a vendor adapter.
Egress. The call leaves over mTLS with a client certificate; the private key that proves the gateway's identity was mounted into the pod and never touches the wire (Layer 5). If the provider is down, the gateway falls through to the next one on the list — no code change, no incident.
Output guardrails. The response is scanned for PII leaks, policy violations, and hallucinations before the user ever sees it (Layer 3).
Persist + audit. Anything stored is written as AES-256-GCM ciphertext under the tenant's own data key, whose root key never leaves the KMS (Layer 6). An append-only audit row — tenant, workspace, provider, model, token usage, cost — is written under the same injected identity.
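
The gateway steps in that walk (hard-filter, rank, fall through) can be sketched as follows. This is an illustration, not the gateway's actual code: the `Provider` and `TenantPolicy` shapes, the provider names, and the `fedramp-high` label are all made up for the example, and ranking uses cost alone where the text also ranks by capability and latency.

```typescript
type Provider = {
  name: string
  regions: string[]            // where the provider may process data
  certifications: string[]     // compliance attestations it holds
  costPer1kTokens: number
  call: (prompt: string) => Promise<string>
}

type TenantPolicy = { region: string; requiredCertifications: string[] }

// Hard filters first (residency, compliance), ranking second (cost only here).
export function orderProviders(all: Provider[], policy: TenantPolicy): Provider[] {
  return all
    .filter((p) => p.regions.includes(policy.region))
    .filter((p) => policy.requiredCertifications.every((c) => p.certifications.includes(c)))
    .sort((a, b) => a.costPer1kTokens - b.costPer1kTokens)
}

// Try each eligible provider in order; fall through on failure.
export async function callWithFallback(
  ordered: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = []
  for (const p of ordered) {
    try {
      return { provider: p.name, text: await p.call(prompt) }
    } catch (err) {
      errors.push(`${p.name}: ${err instanceof Error ? err.message : String(err)}`)
    }
  }
  throw new Error(`no eligible provider succeeded: ${errors.join('; ')}`)
}

// Demo data: one provider fails the residency filter, one is down, one works.
const providerDown = async (): Promise<string> => { throw new Error('503') }

const demoProviders: Provider[] = [
  { name: 'cheap-offshore', regions: ['eu-west'], certifications: [],
    costPer1kTokens: 0.1, call: async (p) => `cheap:${p}` },
  { name: 'secondary', regions: ['us-gov'], certifications: ['fedramp-high'],
    costPer1kTokens: 0.9, call: async (p) => `secondary:${p}` },
  { name: 'primary', regions: ['us-gov'], certifications: ['fedramp-high'],
    costPer1kTokens: 0.5, call: providerDown },
]

const demoPolicy: TenantPolicy = { region: 'us-gov', requiredCertifications: ['fedramp-high'] }
```

With this data, `orderProviders` drops `cheap-offshore` despite its lower cost, and `callWithFallback` answers from `secondary` after `primary` throws. Filtering before sorting is the design choice that matters: a cheaper provider can never outrank a policy violation.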

💡Notice what never appeared in application code

In that whole lifecycle, the application's feature code never imported a vendor SDK, never passed a tenant ID by hand, never touched a private key, and never built an unscoped query. Every one of those was the architecture's job, not the developer's memory. That is the abstraction statement from the top, fully decomposed.

🎯

Leadership Takeaway

The reason to build it this way is not elegance — it is that the two most expensive failures in an AI system, vendor lock-in and a cross-tenant data leak, both come from the same root cause: a critical decision left to a developer to remember on every call. Make the provider a swappable adapter and make tenant identity ambient-and-enforced, and you have not added process — you have removed the two ways the system was most likely to hurt you. For a regulated or federal program, that is not a nice-to-have; it is the price of entry. And it is cheaper to build in from the first diagram than to retrofit under the pressure of an outage or an audit.

🌅If you remember nothing else

🔑The page in four sentences

Put one model gateway between your app and every AI vendor, so switching providers is a config change and adding one is a single adapter. Choose the provider per request with a decision matrix that hard-filters on residency and compliance before it ever optimizes for cost. Make tenant identity injected once at the controller and ambient everywhere below, backed by row-level security and per-tenant keys, so a cross-tenant leak needs three simultaneous failures, not one. Encrypt every hop in transit (TLS, then mTLS, with private keys mounted and never sent) and every byte at rest (AES-256-GCM, envelope-encrypted, root key sealed in a KMS) — so the vendor is a dependency you control, and the data is safe whether it is moving or still.

Related Architecture

🏛️

Model Committee — Routing Patterns →

The four routing patterns behind Layer 2 — rule-based, classifier, cascading, and parallel adversarial — plus the LLM Council.

🔌

MCP Server Pattern →

The same abstraction instinct applied to tools — wrap a REST API as an MCP server so any agent consumes it natively.

🏗️

Enterprise Agent Platform — Anatomy →

Where the gateway, guardrails, and tenancy layers sit inside a full production agent platform.

🔭

Observability & Evals →

The audit and measurement layer — how every gateway call gets traced, costed, and regression-tested.

Published 2026-05-14 · Sam Muthu · sammuthu.com