🎯Overview 🌌Cosmic Managed AI 📚Foundations 🏛️Model Committee 🧭RAG Anatomy 🗂️Enterprise RAG 🛠️Agent Frameworks 🏗️Platform Anatomy 🔭Observability & Evals 🧪Local LLM Field Notes 💻Polyglot 🏭AI Factory ⌨️CosmicKeys 📊WatchAlgo ⚖️AI Underwriting

Enterprise Agent Platform Anatomy

Anatomy of an Enterprise Agent Platform — control plane, data plane, guardrails

Every senior AI JD now lists “control plane,” “guardrails,” “RBAC,” “audit logging,” “model routing,” “observability,” and “governance” as if they form a single platform category. They do — but most teams only implement three or four of the layers and call it done. This page is the full architectural view: what each layer is, why it exists, how the layers interact, and where Microsoft, AWS, and custom stacks fit into the same mental model.

Control plane + data plane10 architectural layersVendor-neutralWorking code + diagrams

🧭Why “control plane” is the right mental model

Kubernetes taught a generation of engineers the most important distinction in distributed systems architecture: the control plane configures the system; the data plane runs the traffic. The control plane is where operators declare intent (“these pods should be running”); the data plane is where the actual workload executes. The two layers scale independently, have different availability requirements, and have different failure modes.

The same separation applies to enterprise agent platforms, and the best teams treat it as the load-bearing architectural decision. Model versions, prompt versions, tool allowlists, RBAC policies, governance rules, and audit log retention all live in the control plane. Inference calls, agent orchestration, caching, rate limiting, and safety checks all live in the data plane. Guardrails are a cross-cutting concern. Observability collects from the data plane and feeds back into the control plane.

Most early agent deployments skip this distinction and wire everything into one monolith. That works for a demo. It breaks the moment you need to roll out a new prompt version without redeploying code, enforce data-classification-based RBAC, prove regulatory compliance to an auditor, or route traffic between three different model vendors based on cost. At that point teams retrofit the control plane, which is always more expensive than building it from the start.

🔑The one-sentence architectural claim

An enterprise agent platform is a control plane (config, policy, registries, audit) + a data plane (routing, orchestration, inference, cache) + guardrails as a cross-cutting layer + observability as a feedback loop. Every other architectural decision follows from that decomposition.

🗺️The full architecture — one diagram

Read the diagram top to bottom. The control plane holds the configuration and policy. Input guardrails sanitize what flows into the data plane. The data plane handles the actual runtime — model routing on the left, agent orchestration in the middle, inference and caching on the right. Output guardrails validate and enforce policy on outputs. Observability collectors feed traces, evaluations, cost, and latency data back up to the control plane, closing the loop.

Enterprise Agent Platform — Control Plane + Data Plane + Guardrails

💡How to read this diagram

The five boxes in the control plane are what operators and platform engineers configure. The three boxes in the data plane are what runs when a user or system makes a request. The orange bands are guardrails — they sit between the control plane and data plane on the way in, and between the data plane and the user on the way out. The purple band at the bottom is observability — it spans the full width because it collects from every layer. The dotted arrows show the control feedback loop: observability data informs the control plane, which updates policies, which change data-plane behavior on the next request.

⚙️The control plane — what actually lives there

Five components. Each one is its own service with its own durable store, its own RBAC model, and its own API. Treat them as infrastructure, not as code embedded in the agent.

📚

1. Model Registry

Versioned catalog of every model the platform can route to

Claude 4.7 OpusGPT-5Gemini 2.5 ProInternal fine-tuned (Llama 3.1 70B)Safety classifiers (LlamaGuard)

Each registered model has metadata: provider, endpoint, cost per 1M tokens, capabilities (reasoning, tool-use, vision, long-context), regulatory approval status (can it process PII? PCI? PHI?), and rollout tier (production, canary, experimental). The router (data plane) queries the registry to decide which model serves each request. New model versions get registered before they're routable, so rollout is a separate operation from deployment.

✍️

2. Prompt Registry

Versioned, role-scoped prompts with A/B testing

system-prompt/triage/v3.2system-prompt/triage/v3.3-canarygolden-set/triage/2026-04

Prompts are configuration, not code. Treat them with the same rigor as database schemas: versioned, reviewed, tested against a golden set before rollout, deployable independently of code. Modern prompt registries (Langfuse Prompts, LangSmith Prompt Hub, Braintrust prompts) let you A/B test a canary prompt against production traffic, compare win rates on an eval set, and promote the winner. Without this layer, prompt changes require code deploys and everyone is afraid to touch them.

🛠️

3. Tool Registry

Catalog of callable tools with schemas, RBAC, and audit

MCP server: splunk-queryMCP server: iam-graph-lookupInternal: pam-session-fetchInternal: zelle-transaction-freeze

The Model Context Protocol (MCP) standardized how LLMs discover and call tools — think of it as “USB for agent tool use.” The tool registry lists every tool the platform exposes, its input/output schema, which agents are authorized to call it, and what audit event gets emitted on invocation. High-risk tools (anything that can write or take action, like freezing transactions or disabling accounts) live here with extra scrutiny: mandatory human-approval gates, rate limits, and kill-switches.

🔐

4. Policy & RBAC

Who can invoke what models, what tools, on what data

role: fraud-analyst → model: Claude Opus + tools: [splunk, iam-graph]role: intern → model: Haiku only, no tool accessdata class: PII → vendor: internal only

Three dimensions of policy: identity-based (who is making the request — user, service account, agent), resource-based (which model, which tools, which data), and context-based (what is the data classification, what is the time of day, what is the risk score). Modern platforms evaluate all three per request using policy engines like Open Policy Agent (OPA), Cedar, or custom policy languages. For regulated environments, this is where you enforce “PII data may only be processed by internal models, never by external vendor APIs.”

📋

5. Audit Log

Every decision traceable, with full provenance chain

Request ID + user + timestampModel + version + prompt versionTools invoked + arguments + resultsGuardrail verdicts + confidence scores

Regulators don't just want to know “the AI made a decision.” They want to know which model, which prompt version, what context it retrieved, what tools it called, what guardrails approved it, and who authorized the overall workflow. The audit log captures the full provenance chain, stored immutably with retention that matches regulatory requirements (7 years for SOX, longer for some banking regulations). OCC and FFIEC examiners will ask for this during AI audits in 2026 and beyond. If the audit trail can't reconstruct a decision end-to-end, the AI system is not production-ready for regulated workloads.

⚠️The common mistake

Teams often treat audit logging as a post-hoc add-on — write logs to stdout, ship them to Splunk, call it done. That fails regulatory review. Audit logs need structured schemas, correlation IDs that span the full request lifecycle, and immutable storage. Retrofitting this after the platform is in production is painful; building it into the control plane from day one is nearly free.

⚡The data plane — runtime traffic

This is what runs when a user or system actually makes a request. Three major components — router, orchestrator, inference — plus caching and rate-limiting as cross-cutting concerns.

🚦

Model Router

Decides which model serves each request

Cost-based: Opus for critical, Haiku for bulkCapability-based: reasoning vs classificationFallback chain: primary → secondary → localA/B testing: 5% traffic to new model

The router is where most of the operational leverage lives. A good router routes requests to the cheapest capable model by default, falls back on vendor failures, enforces per-tenant rate limits, applies cost budgets (“this tenant has $500/month; throttle at 80%”), and supports canary deployments (“5% of traffic goes to the new model; if eval win-rate exceeds 0.8, promote to 100%”). Teams that skip the router end up hard-coding model names in application code, which makes every model upgrade a code deploy across every service.

🧠

Agent Orchestrator

Single-agent or multi-agent execution with tool use

Single-agent loop (tool-using)Multi-agent: orchestrator + specialistsAdversarial review: LLM Council patternState management across steps

The orchestrator is the loop that drives the agent: call the model, parse the response, execute tool calls, feed results back, repeat until termination. Single-agent works for simple tool-using workflows. Multi-agent is worth the orchestration overhead when you need either parallel execution (embarrassingly parallel work) or adversarial review (where the cost of an error is high enough to justify running 2-3 models and requiring consensus). The LLM Council pattern I built for the AI Factory is a specific multi-agent design: Claude proposes, Gemini and Codex review in parallel, consensus required before action. It's a structural answer to hallucinations in high-stakes decisions.

⚡

Inference + Cache

Actually call the LLM (or a cached response)

Vendor APIs (Anthropic, OpenAI, Google)Self-hosted LLMs (vLLM, TGI)Prompt cache (Anthropic native)Semantic cache (Redis + embeddings)Batch queue for non-urgent jobs

The final layer. Vendor APIs for frontier models, self-hosted for regulatory or cost reasons, prompt caching for repeated prefix tokens (Anthropic's prompt caching gives ~10× cost reduction on cache hits), semantic caching for “this query is similar to one I've seen before” (requires embedding the request and searching a cache), and batch queues for workloads that don't need real-time latency (50% cost savings on OpenAI and Anthropic batch APIs). At WatchAlgo scale (30+ hour autonomous runs), the caching layer is the difference between a $200 job and a $2,000 job.

🛡️Guardrails — where they sit and why they're separate

Guardrails are not part of the agent. They are a separate layer between the request and the LLM (input side) and between the LLM and the response (output side). Keeping them separate means they can be updated, audited, and failure-tested independently of the agent's logic.

Input guardrails

🛡️ Presidio — Microsoft's PII detection and redaction. Names, SSNs, phone numbers, emails get replaced with placeholders before the prompt reaches the LLM.
🦙 LlamaGuard — Meta's small safety classifier. Scores prompts for unsafe content (jailbreaks, toxic requests, out-of-policy asks) before they reach the frontier model.
💉 Rebuff — prompt injection defense. Detects when external text (e.g., log messages, document content) is trying to manipulate the LLM.
Lakera Guard — commercial equivalent of Rebuff + LlamaGuard, often used in regulated environments.

Output guardrails

🧱 Guardrails AI — structured output validation. Enforces JSON schemas, constrains tone, strips PII from responses, catches hallucinations at the shape layer.
🛡️ NeMo Guardrails — NVIDIA's policy enforcement DSL. Programmable rails that can block responses based on content policies, business rules, or regulatory constraints.
Custom validators — schema checks, confidence thresholds, consistency checks against retrieved context.
Human-in-the-loop — the ultimate output guardrail. For high-stakes actions, the LLM's recommendation never executes without a human signoff.

💬Why separate, not embedded

If you embed guardrails in the agent's prompt (“don't output PII”), you're trusting the model to self-regulate — and the model fails that trust roughly 1-5% of the time. Separating the guardrail into a dedicated layer that runs deterministically on every input and output reduces failure rates by 10-100×. It's the same reason we don't ask application code to self-enforce database-level constraints: separation of concerns is an architectural virtue, not a stylistic one.

🔭Observability — the feedback loop

Observability is not optional in regulated environments. Every LLM call, every tool call, every guardrail verdict, every cost, every latency — captured and traced end-to-end. This data feeds three downstream needs: regulatory audit (reproduce any decision), continuous evaluation (measure quality over time), and cost/performance optimization (find the slow and expensive paths).

The tooling landscape is covered in depth at sammuthu.com/ai-ml/observability-evals — Langfuse self-hosted for data-residency-sensitive environments, LangSmith for LangChain-first stacks, Arize Phoenix for OpenTelemetry-native deployments, Braintrust for evaluation-first workflows. For this page, the key point is architectural: observability collectors span the entire data plane and feed back into the control plane. When a prompt version underperforms on the eval set, the control plane knows to roll back. When a model vendor has elevated latency, the router knows to shift traffic. The feedback loop only works if the collectors are comprehensive and the control plane can act on the signal.

python

from langfuse import Langfuse, observe
from anthropic import Anthropic

langfuse = Langfuse()  # reads LANGFUSE_HOST from env
client = Anthropic()

@observe(as_type="generation")
def triage_agent(alert: dict, retrieved_context: list) -> dict:
    """Triage agent — entry point for the LLM Council."""
    messages = build_triage_prompt(alert, retrieved_context)
    response = client.messages.create(
        model="claude-opus-4-7",  # from model registry
        max_tokens=4096,
        messages=messages,
    )
    return parse_triage_output(response)

@observe()
def llm_council_run(alert: dict) -> dict:
    """Parent trace — all child calls automatically nest in Langfuse UI."""
    context = retrieve_from_rag(alert)  # traced
    draft = triage_agent(alert, context)  # traced
    reviews = [
        review_agent(draft, reviewer_model="gemini-2.5-pro"),  # traced
        review_agent(draft, reviewer_model="gpt-5-codex"),  # traced
    ]
    return reach_consensus(draft, reviews)  # traced

↕ Scroll

Langfuse instrumentation at the orchestrator layer — every LLM call, tool call, and decision becomes a span in a traceable tree.

⚖️Responsible AI — the compliance overlay

“Responsible AI” is often treated as a governance checkbox — write a policy document, run an annual review, done. That framing fails. In an enterprise agent platform, responsible-AI requirements are architectural: they constrain which models can serve which requests, which data classes can flow to which vendors, what audit detail must be captured, and what explainability the system must produce.

Five dimensions that matter for 2026 and beyond:

🎯 Fairness

Bias detection on outcomes across protected attributes. Requires sampling + ground-truth labeling. Not solved by “the model is unbiased” — has to be measured per deployment.

🔍 Explainability

For every decision: what model, what prompt, what retrieved context, what tools, what reasoning. Required for regulatory audit. Structured output + audit log makes this possible.

🔒 Privacy

Data minimization (don't send more than needed), data classification enforcement (PII routes to internal models only), data residency (EU data stays in EU).

🛡️ Safety

Guardrails layer. Jailbreak detection, PII redaction, toxic-content filtering, prompt injection defense. Layered, not single-point.

👤 Human oversight

High-stakes actions require human approval. Low-stakes actions can auto-execute within pre-approved policy. Defining “high-stakes” is the architecture decision.

📊 Accountability

Clear ownership: who is responsible when the system makes a wrong decision? The control plane's RBAC and audit log establish this chain.

🔑The architectural claim

Responsible AI is not a policy document. It is the set of constraints that shape the control plane, data plane, and guardrail architecture. If you built the platform without those constraints in mind, no amount of governance review will retrofit them cheaply.

🪟Microsoft agent ecosystem — concrete mapping

The Microsoft stack for enterprise agents has matured substantially in 2025-2026 and deserves a specific walkthrough, both because it's a common enterprise choice and because it names its components differently from the generic architecture above. The mapping is clean once you see it.

Azure AI Foundry — the control plane

Microsoft's unified platform for agent development and governance. Holds the model catalog (Azure OpenAI deployments, Meta Llama, Mistral, internal fine-tunes), the prompt catalog, the tool catalog (via MCP and Azure Function integrations), evaluation datasets, responsible-AI policies, and content-filter configurations. This is where the RBAC, audit, and governance configuration lives. If a JD mentions “Azure AI Foundry experience,” they mean the control plane.

Azure OpenAI Service + Azure AI Model Catalog — the model routing + inference layer

Azure OpenAI Service hosts the OpenAI frontier models (GPT-5, Codex) inside Azure tenancy with data-residency guarantees. The broader Azure AI Model Catalog adds Llama, Mistral, and specialty models. Together they form the inference layer plus Microsoft-flavored model routing (the “deployment” abstraction is effectively a routing entry). Prompt caching, content filtering, and rate limiting are built in at this layer.

Semantic Kernel — the orchestration framework

Microsoft's open-source agent orchestration library. Handles single-agent tool-using loops and multi-agent coordination, with native integration to Azure OpenAI and the tool catalog. Comparable to LangGraph or AutoGen from the generic landscape, but Microsoft-native and production-hardened. When a JD says “Microsoft agent ecosystem — orchestration, tool integration,” they usually mean Semantic Kernel.

Copilot Studio — the agent authoring layer

Low-code agent builder for business users. Sits on top of Azure AI Foundry and Semantic Kernel. Not relevant for senior-engineer roles building the platform, but worth knowing because many enterprise AI deployments mix Copilot Studio (business user agents) with custom Semantic Kernel agents (engineering-built agents) — and the platform has to support both.

Azure Content Safety + Prompt Shields — the guardrails layer

Azure's native guardrail service. Content Safety handles toxicity, bias, PII. Prompt Shields handles jailbreak and prompt injection detection. Both are callable as REST APIs before and after the LLM call, integrated with Foundry policies. Comparable to LlamaGuard + Rebuff from the generic landscape, but Microsoft-native and SLA-backed.

Azure Monitor + Application Insights — observability

Native tracing and telemetry. Semantic Kernel instruments via OpenTelemetry, so traces flow into Azure Monitor with LLM-specific semantic conventions. For teams standardized on Microsoft monitoring, this is the path of least resistance. For multi-cloud or Azure-skeptical teams, Langfuse self-hosted or Arize Phoenix work equally well.

🏢The same architecture across three stacks

Microsoft, AWS, and a custom-harness approach all implement the same architectural pattern with different named components. This table is the Rosetta Stone — given a requirement, you can find the right component in whichever stack you're working in.

Architectural Layer

Microsoft

AWS Bedrock

Custom / Open-source

Control plane

Azure AI Foundry

Amazon Bedrock Studio

Langfuse + custom registries

Model registry

Azure AI Model Catalog

Bedrock model catalog

YAML + tagging

Prompt registry

Foundry Prompt Flow

Bedrock Prompt Management

Langfuse Prompts / LangSmith

Tool registry

Foundry Agents + MCP

Bedrock Agent Actions

MCP server + internal APIs

Policy & RBAC

Entra ID + Foundry policies

IAM + Bedrock guardrails

OPA / Cedar

Audit log

Azure Monitor + Purview

CloudTrail + Bedrock logs

OpenTelemetry + immutable store

Model router

AOAI deployments

Bedrock routing

LiteLLM / custom router

Orchestrator

Semantic Kernel

Strands Agents SDK

LangGraph / AutoGen / custom

Input guardrails

Prompt Shields + Content Safety

Bedrock Guardrails

Presidio + LlamaGuard + Rebuff

Output guardrails

Content Safety + Foundry evals

Bedrock Guardrails

Guardrails AI + NeMo Guardrails

Observability

Azure Monitor + App Insights

CloudWatch + Bedrock logs

Langfuse / Phoenix / LangSmith

💡The practical truth

Most real-world enterprise deployments are hybrid: buy the control plane from a vendor (Microsoft or AWS), build the agents using the vendor's orchestration SDK, but layer custom observability and custom guardrails on top for portability. Locking in fully to one vendor's control plane makes multi-cloud migration expensive; building everything custom is slower than most teams can justify. The middle path — vendor control plane + portable agent code + portable observability — is where most production deployments land.

🏭Custom vs off-the-shelf — an opinionated take

The build-vs-buy question at the platform layer has a clear answer and a murky answer. Clear answer: buy the control plane. Registries, RBAC, audit logs, policy engines — these are commodity capabilities where the vendors have done the work and regulatory review is already done. Building these from scratch is pointless.

Murky answer: the orchestration and guardrails layer. Here the trade-off is real. Off-the-shelf orchestrators (Semantic Kernel, LangGraph, CrewAI) cover the common patterns but can fight you when you need something non-standard — like the LLM Council adversarial-consensus pattern I built for the AI Factory. That pattern isn't well-served by any off-the-shelf orchestrator because the consensus logic is custom and the failure modes are specific to the workload. For that specific use case, a 200-line custom harness beats adopting a framework.

My default architectural recommendation for enterprise teams in 2026:

Control plane: buy (Azure AI Foundry, AWS Bedrock Studio, or equivalent)
Orchestration: default to the vendor's SDK (Semantic Kernel, Strands, LangGraph) unless you have a proven non-standard pattern
Guardrails: layer the vendor's first-party guardrails (Prompt Shields, Bedrock Guardrails) with at least one portable OSS layer (LlamaGuard or Guardrails AI) for defense in depth
Observability: portable from day one (Langfuse self-hosted or OpenTelemetry-first), not vendor-locked

The AI Factory I built for WatchAlgo is an example of going custom on every layer — which was the right choice for a one-person team shipping a specific multi-agent pattern, and would be the wrong choice for a 50-engineer enterprise team that needs audit, governance, and multi-tenant RBAC on day one. Context matters more than ideology.

Related Architecture

🏭

AI Factory — the custom-harness case study →

The three-layer agentic framework I built at Zen Algorithms — what “custom” looks like when you commit to it.

🛠️

Agent Frameworks Comparison →

Companion piece — LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, and custom harness honestly compared.

🔭

Observability, Evals & Safety Tooling →

Langfuse, LangSmith, Braintrust, LlamaGuard, Guardrails AI — the tooling landscape referenced throughout this page.

🧭

Enterprise RAG Anatomy →

The 15-step production RAG pipeline — the retrieval layer that grounds enterprise agents in organizational knowledge.