Enterprise Agent Platform Anatomy

Anatomy of an Enterprise Agent Platform: control plane, data plane, guardrails

Every senior AI JD now lists terms like “RBAC,” “audit logging,” and “governance” as if they form a single platform category. They do, but most teams only implement three or four of the layers and call it done. This page is the full architectural view: what each layer is, why it exists, how the layers interact, and where Microsoft, AWS, and custom stacks fit into the same mental model.

Control plane + data plane · 10 architectural layers · Vendor-neutral · Working code + diagrams

🧭Why β€œcontrol plane” is the right mental model

Kubernetes taught a generation of engineers the most important distinction in distributed systems architecture: the configures the system; the runs the traffic. The is where operators declare intent (β€œthese pods should be running”); the is where the actual workload executes. The two layers scale independently, have different availability requirements, and have different failure modes.

The same separation applies to enterprise agent platforms, and the best teams treat it as the load-bearing architectural decision. Model versions, prompt versions, tool allowlists, policies, governance rules, and audit log retention all live in the control plane. Inference calls, agent orchestration, caching, rate limiting, and safety checks all live in the data plane. Guardrails are a cross-cutting concern. Observability collects from the data plane and feeds back into the control plane.

Most early agent deployments skip this distinction and wire everything into one monolith. That works for a demo. It breaks the moment you need to roll out a new prompt version without redeploying code, enforce data-classification-based policies, prove regulatory compliance to an auditor, or route traffic between three different model vendors based on cost. At that point teams retrofit the control plane, which is almost always more expensive than building it from the start.

The one-sentence architectural claim
An enterprise agent platform is a control plane (config, policy, registries, audit) + a data plane (routing, orchestration, inference, cache) + guardrails as a cross-cutting layer + observability as a feedback loop. Every other architectural decision follows from that decomposition.

The full architecture: one diagram

Read the diagram top to bottom. The control plane holds the configuration and policy. Input guardrails sanitize what flows into the data plane. The data plane handles the actual runtime: model routing on the left, agent orchestration in the middle, inference and caching on the right. Output guardrails validate and enforce policy on outputs. Observability collectors feed traces, evals, cost, and latency data back up to the control plane, closing the loop.

Diagram: Enterprise Agent Platform (Control Plane + Data Plane + Guardrails)

CONTROL PLANE · configuration, policy, registries, audit
  • Model Registry: Claude, GPT, Gemini, internal fine-tuned; versioned + tagged
  • Prompt Registry: versioned, A/B-tested, role-scoped; golden + canary
  • Tool Registry: MCP + internal APIs, schema + RBAC; per-agent allowlist
  • Policy & RBAC: who can call what, data class gating; PII / PCI / SOX
  • Audit Log: every call, tool use, decision + provenance; OCC / FFIEC ready
INPUT GUARDRAILS
  • Presidio (PII redaction) · LlamaGuard (jailbreak) · Rebuff (prompt injection)
DATA PLANE · runtime traffic, orchestration, inference
  • Model Router: cost-based routing, capability match, fallback chain, rate limiting, cost controls (Opus critical · Haiku bulk)
  • Agent Orchestrator: single-agent + multi-agent patterns (Orchestrator: Claude Opus · Reviewer A: Gemini · Reviewer B: Codex / GPT)
  • Inference + Cache: vendor APIs, self-hosted LLMs, prompt cache, semantic cache, batch queue (cache hit: 10× cheaper)
OUTPUT GUARDRAILS
  • Guardrails AI (schema validation) · NeMo Guardrails (policy enforcement)
OBSERVABILITY COLLECTORS
  • Langfuse / LangSmith / Arize Phoenix → traces, evals, cost, latency
How to read this diagram
The five boxes in the control plane are what operators and platform engineers configure. The three boxes in the data plane are what runs when a user or system makes a request. The orange bands are guardrails: they sit between the control plane and data plane on the way in, and between the data plane and the user on the way out. The purple band at the bottom is observability; it spans the full width because it collects from every layer. The dotted arrows show the control feedback loop: observability data informs the control plane, which updates policies, which change data-plane behavior on the next request.

The control plane: what actually lives there

Five components. Each one is its own service with its own durable store, its own model, and its own API. Treat them as infrastructure, not as code embedded in the agent.

1. Model Registry

Versioned catalog of every model the platform can route to

Claude 4.7 Opus · GPT-5 · Gemini 2.5 Pro · Internal fine-tuned (Llama 3.1 70B) · Safety classifiers (LlamaGuard)

Each registered model has metadata: provider, endpoint, cost per 1M tokens, capabilities (reasoning, tool-use, vision, long-context), regulatory approval status (can it process PII or PCI data?), and rollout tier (production, canary, experimental). The router (in the data plane) queries the registry to decide which model serves each request. New model versions get registered before they're routable, so rollout is a separate operation from deployment.
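As a rough illustration of the registry lookup described above, here is a minimal in-memory sketch. The `ModelEntry` and `ModelRegistry` names, the placeholder model names, and the cost figures are all assumptions made for the example, not a real product API; a production registry would sit behind its own service and durable store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    provider: str
    cost_per_mtok_usd: float        # cost per 1M tokens (made-up figures below)
    capabilities: frozenset
    approved_data_classes: frozenset
    tier: str                       # "production", "canary", or "experimental"


class ModelRegistry:
    def __init__(self) -> None:
        self._entries = {}

    def register(self, entry: ModelEntry) -> None:
        # Registering makes a model known; it only becomes routable via its tier.
        self._entries[entry.name] = entry

    def routable(self, capability: str, data_class: str, tiers=("production",)) -> list:
        """Return models that match capability, data class, and tier, cheapest first."""
        matches = [
            e for e in self._entries.values()
            if capability in e.capabilities
            and data_class in e.approved_data_classes
            and e.tier in tiers
        ]
        return sorted(matches, key=lambda e: e.cost_per_mtok_usd)


registry = ModelRegistry()
registry.register(ModelEntry("big-vendor-model", "vendor-a", 15.0,
                             frozenset({"reasoning", "tool-use"}), frozenset({"public"}), "production"))
registry.register(ModelEntry("small-vendor-model", "vendor-a", 1.0,
                             frozenset({"classification"}), frozenset({"public"}), "production"))
registry.register(ModelEntry("internal-llama", "self-hosted", 4.0,
                             frozenset({"reasoning", "classification"}),
                             frozenset({"public", "pii"}), "production"))
registry.register(ModelEntry("next-vendor-model", "vendor-a", 12.0,
                             frozenset({"reasoning"}), frozenset({"public"}), "canary"))

pii_candidates = [e.name for e in registry.routable("reasoning", "pii")]
```

Because the canary entry is registered but excluded by the default `tiers` argument, rollout stays a separate decision from registration, which is the property the paragraph above describes.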

✍️

2. Prompt Registry

Versioned, role-scoped prompts with A/B testing

system-prompt/triage/v3.2system-prompt/triage/v3.3-canarygolden-set/triage/2026-04

Prompts are configuration, not code. Treat them with the same rigor as database schemas: versioned, reviewed, tested against a golden set before rollout, deployable independently of code. Modern prompt registries ( Prompts, Prompt Hub, prompts) let you A/B test a canary prompt against production traffic, compare win rates on an set, and promote the winner. Without this layer, prompt changes require code deploys and everyone is afraid to touch them.
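The canary mechanics can be sketched in a few lines. This is a hypothetical in-memory `PromptRegistry`, not the API of any of the tools named above; the prompt texts are placeholders, and hashing the request ID into a 0-99 bucket is just one common way to get a stable traffic split.

```python
import hashlib
from typing import Optional


class PromptRegistry:
    """Hypothetical in-memory registry: versioned prompts plus a canary split."""

    def __init__(self) -> None:
        self._prompts = {}   # "name/version" -> prompt text
        self._rollout = {}   # name -> {"stable", "canary", "canary_pct"}

    def publish(self, name: str, version: str, text: str) -> None:
        self._prompts[f"{name}/{version}"] = text

    def set_rollout(self, name: str, stable: str,
                    canary: Optional[str] = None, canary_pct: int = 0) -> None:
        self._rollout[name] = {"stable": stable, "canary": canary, "canary_pct": canary_pct}

    def resolve(self, name: str, request_id: str) -> tuple:
        """Pick a version for this request; the same request_id always gets the same one."""
        rollout = self._rollout[name]
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        use_canary = rollout["canary"] is not None and bucket < rollout["canary_pct"]
        version = rollout["canary"] if use_canary else rollout["stable"]
        return version, self._prompts[f"{name}/{version}"]

    def promote(self, name: str) -> None:
        """Make the canary the new stable version once it has won on the golden set."""
        rollout = self._rollout[name]
        if rollout["canary"] is not None:
            rollout.update(stable=rollout["canary"], canary=None, canary_pct=0)


prompts = PromptRegistry()
prompts.publish("system-prompt/triage", "v3.2", "You are a triage assistant.")
prompts.publish("system-prompt/triage", "v3.3-canary", "You are a careful triage assistant.")
prompts.set_rollout("system-prompt/triage", stable="v3.2", canary="v3.3-canary", canary_pct=5)

version, text = prompts.resolve("system-prompt/triage", request_id="req-001")
```

Promotion and rollback are then registry operations (`promote`, or `set_rollout` with `canary_pct=0`) rather than code deploys.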

3. Tool Registry

Catalog of callable tools with schemas, RBAC, and audit

MCP server: splunk-query · MCP server: iam-graph-lookup · Internal: pam-session-fetch · Internal: zelle-transaction-freeze

The Model Context Protocol (MCP) standardized how agents discover and call tools; think of it as “USB for agent tools.” The tool registry lists every tool the platform exposes, its input/output schema, which agents are authorized to call it, and what audit event gets emitted on invocation. High-risk tools (anything that can write or take action, like freezing transactions or disabling accounts) live here with extra scrutiny: mandatory human-approval gates, rate limits, and kill-switches.
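A minimal sketch of the authorization path for a tool call, assuming a hypothetical `ToolRegistry` class and made-up agent names (`triage-agent`, `fraud-agent`). It covers the allowlist, the human-approval gate, the kill-switch, and the audit event; rate limiting is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    name: str
    input_schema: dict
    allowed_agents: frozenset
    high_risk: bool = False   # write/action tools require human approval
    enabled: bool = True      # kill-switch: set to False to disable the tool


class ToolRegistry:
    def __init__(self) -> None:
        self._tools = {}
        self.audit_events = []

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def kill(self, tool: str) -> None:
        self._tools[tool].enabled = False

    def authorize(self, agent: str, tool: str, human_approved: bool = False) -> bool:
        """Deny unknown, disabled, or non-allowlisted tools; gate high-risk ones."""
        spec = self._tools.get(tool)
        allowed = (
            spec is not None
            and spec.enabled
            and agent in spec.allowed_agents
            and (human_approved or not spec.high_risk)
        )
        # Every decision, allowed or not, emits an audit event.
        self.audit_events.append({"agent": agent, "tool": tool, "allowed": allowed})
        return allowed


tools = ToolRegistry()
tools.register(ToolSpec("splunk-query", {"query": "string"},
                        frozenset({"triage-agent", "fraud-agent"})))
tools.register(ToolSpec("zelle-transaction-freeze", {"transaction_id": "string"},
                        frozenset({"fraud-agent"}), high_risk=True))

can_query = tools.authorize("triage-agent", "splunk-query")
can_freeze = tools.authorize("fraud-agent", "zelle-transaction-freeze")
```

Here `can_freeze` is false until a caller passes `human_approved=True`, and `tools.kill(...)` overrides even an approved call.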

4. Policy & RBAC

Who can invoke what models, what tools, on what data

role: fraud-analyst → model: Claude Opus + tools: [splunk, iam-graph] · role: intern → model: Haiku only, no tool access · data class: PII → vendor: internal only

Three dimensions of policy: identity-based (who is making the request: user, service account, or agent), resource-based (which model, which tools, which data), and context-based (what is the data classification, what is the time of day, what is the risk score). Modern platforms evaluate all three per request using policy engines like Open Policy Agent (OPA), Cedar, or custom policy languages. For regulated environments, this is where you enforce “PII data may only be processed by internal models, never by external vendor APIs.”
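To make the per-request evaluation concrete, here is a plain-Python sketch of a deny-by-default check. It stands in for what a policy engine such as OPA or Cedar would evaluate; the `AccessRequest` type, the policy tables, and the model identifiers (including `internal-llama`) are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    role: str         # identity-based: who is asking
    model: str        # resource-based: which model
    tools: tuple      # resource-based: which tools
    data_class: str   # context-based: e.g. "public" or "pii"


# Hypothetical policy tables; a real deployment would load these from the control plane.
ROLE_POLICY = {
    "fraud-analyst": {"models": {"claude-opus", "internal-llama"},
                      "tools": {"splunk", "iam-graph"}},
    "intern": {"models": {"claude-haiku"}, "tools": set()},
}
INTERNAL_MODELS = {"internal-llama"}


def evaluate(req: AccessRequest) -> tuple:
    """Deny by default; return (allowed, reason) so the reason can be audited."""
    policy = ROLE_POLICY.get(req.role)
    if policy is None:
        return False, "unknown role"
    if req.model not in policy["models"]:
        return False, "model not allowed for role"
    if not set(req.tools) <= policy["tools"]:
        return False, "tool not allowed for role"
    if req.data_class == "pii" and req.model not in INTERNAL_MODELS:
        return False, "PII may only be processed by internal models"
    return True, "allowed"


decision = evaluate(AccessRequest("fraud-analyst", "claude-opus", ("splunk",), "pii"))
```

Returning the reason alongside the verdict is a design choice that pays off later: the same string can be written straight into the audit log.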

5. Audit Log

Every decision traceable, with full provenance chain

Request ID + user + timestamp · Model + version + prompt version · Tools invoked + arguments + results · Guardrail verdicts + confidence scores

Regulators don't just want to know “the AI made a decision.” They want to know which model, which prompt version, what context it retrieved, what tools it called, which guardrails approved it, and who authorized the overall workflow. The audit log captures the full provenance chain, stored immutably with retention that matches regulatory requirements (7 years for SOX, longer for some banking regulations). OCC and FFIEC examiners will ask for this during AI audits in 2026 and beyond. If the audit trail can't reconstruct a decision end-to-end, the AI system is not production-ready for regulated workloads.

⚠️The common mistake
Teams often treat audit logging as a post-hoc add-on β€” write logs to stdout, ship them to Splunk, call it done. That fails regulatory review. Audit logs need structured schemas, correlation IDs that span the full request lifecycle, and immutable storage. Retrofitting this after the platform is in production is painful; building it into the from day one is nearly free.
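One way to get structured, tamper-evident records is a hash chain, sketched below. This is an illustrative technique rather than a requirement stated in the text: the `AuditLog` class and its field names are hypothetical, timestamps are omitted to keep the example deterministic, and a real deployment would still write to immutable (WORM-style) storage.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._records = []

    def append(self, correlation_id: str, event: str, detail: dict) -> dict:
        """Append a structured record that commits to the previous record's hash."""
        prev_hash = self._records[-1]["hash"] if self._records else self.GENESIS
        body = {"correlation_id": correlation_id, "event": event,
                "detail": detail, "prev_hash": prev_hash}
        record = {**body, "hash": _digest(body)}
        self._records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered record breaks the chain."""
        prev = self.GENESIS
        for record in self._records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["prev_hash"] != prev or _digest(body) != record["hash"]:
                return False
            prev = record["hash"]
        return True

    def trace(self, correlation_id: str) -> list:
        """Reconstruct one request's lifecycle from its correlation ID."""
        return [r for r in self._records if r["correlation_id"] == correlation_id]


log = AuditLog()
log.append("req-42", "model_call", {"model": "claude-opus", "prompt_version": "v3.2"})
log.append("req-42", "tool_call", {"tool": "splunk-query", "args": {"query": "error"}})
log.append("req-43", "guardrail_verdict", {"guardrail": "pii", "passed": True})

chain_ok = log.verify()
events_for_42 = [r["event"] for r in log.trace("req-42")]
```

The correlation ID is what lets `trace` rebuild a single decision end-to-end, which is the property an examiner will test first.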

⚑The data plane β€” runtime traffic

This is what runs when a user or system actually makes a request. Three major components β€” router, orchestrator, inference β€” plus caching and rate-limiting as cross-cutting concerns.

🚦

Model Router

Decides which model serves each request

Cost-based: Opus for critical, Haiku for bulkCapability-based: reasoning vs classificationFallback chain: primary β†’ secondary β†’ localA/B testing: 5% traffic to new model

The router is where most of the operational leverage lives. A good router routes requests to the cheapest capable model by default, falls back on vendor failures, enforces per-tenant rate limits, applies cost budgets (β€œthis tenant has $500/month; throttle at 80%”), and supports canary deployments (β€œ5% of traffic goes to the new model; if win-rate exceeds 0.8, promote to 100%”). Teams that skip the router end up hard-coding model names in application code, which makes every model upgrade a code deploy across every service.
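The core routing decision can be sketched as "cheapest capable model first, fall back on failure". In the example below, the `route` function, the `AllModelsFailed` exception, and the model names are hypothetical, and interpreting "throttle at 80%" as "restrict the tenant to the cheapest model" is an assumption; a real router might queue or reject requests instead.

```python
from typing import Callable


class AllModelsFailed(Exception):
    """Raised when every candidate in the fallback chain has failed."""


def route(candidates: list, call: Callable[[str], str],
          spent_usd: float, budget_usd: float, throttle_at: float = 0.8) -> tuple:
    """candidates: (model_name, cost) pairs that already passed capability and policy checks.

    Returns (model_name, response) from the first candidate that succeeds.
    """
    ordered = sorted(candidates, key=lambda c: c[1])       # cheapest first
    if spent_usd >= throttle_at * budget_usd:
        ordered = ordered[:1]                              # over threshold: cheapest only
    errors = []
    for name, _cost in ordered:
        try:
            return name, call(name)
        except Exception as exc:                           # vendor outage, timeout, ...
            errors.append(f"{name}: {exc}")
    raise AllModelsFailed("; ".join(errors))


def fake_call(model_name: str) -> str:
    """Stand-in for a vendor API call; the cheapest model is 'down' in this demo."""
    if model_name == "small-model":
        raise RuntimeError("vendor outage")
    return f"answer from {model_name}"


chain = [("large-model", 15.0), ("small-model", 1.0), ("local-model", 4.0)]
served_by, answer = route(chain, fake_call, spent_usd=100.0, budget_usd=500.0)
```

Because application code only ever calls `route`, swapping or upgrading a model becomes a change to the candidate list rather than a code deploy.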

🧠

Agent Orchestrator

Single-agent or multi-agent execution with tool use

Single-agent loop (tool-using)Multi-agent: orchestrator + specialistsAdversarial review: LLM Council patternState management across steps

The orchestrator is the loop that drives the agent: call the model, parse the response, execute , feed results back, repeat until termination. Single-agent works for simple tool-using workflows. Multi-agent is worth the orchestration overhead when you need either parallel execution (embarrassingly parallel work) or adversarial review (where the cost of an error is high enough to justify running 2-3 models and requiring consensus). The pattern I built for the is a specific multi-agent design: Claude proposes, and Codex review in parallel, consensus required before action. It's a structural answer to in high-stakes decisions.
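The adversarial-review shape (one proposer, parallel reviewers, consensus before action) can be sketched generically. This is not the author's harness: `council_decide` is a hypothetical name, the reviewers are plain functions standing in for model calls, and escalating to a human on disagreement is an assumed fallback.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Reviewer = Callable[[dict], bool]   # returns True to approve the draft


def council_decide(draft: dict, reviewers: list,
                   required_approvals: Optional[int] = None) -> dict:
    """Run all reviewers in parallel; act only if enough of them approve."""
    if not reviewers:
        raise ValueError("at least one reviewer is required")
    needed = len(reviewers) if required_approvals is None else required_approvals
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        votes = list(pool.map(lambda review: review(draft), reviewers))
    approved = sum(votes) >= needed
    return {
        "action": draft["action"] if approved else "escalate_to_human",
        "votes": votes,
        "approved": approved,
    }


# Stand-in reviewers; in practice each would call a different model.
def checks_evidence(draft: dict) -> bool:
    return len(draft.get("evidence", [])) >= 2


def checks_severity(draft: dict) -> bool:
    return draft.get("severity") in {"high", "critical"}


draft = {"action": "freeze_transaction", "severity": "high",
         "evidence": ["login from new device", "velocity spike"]}
result = council_decide(draft, [checks_evidence, checks_severity])
```

Defaulting `required_approvals` to unanimity keeps the failure mode conservative: a single dissenting reviewer turns an automated action into a human review.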

⚑

Inference + Cache

Actually call the LLM (or a cached response)

Vendor APIs (Anthropic, OpenAI, Google)Self-hosted LLMs (vLLM, TGI)Prompt cache (Anthropic native)Semantic cache (Redis + embeddings)Batch queue for non-urgent jobs

The final layer. Vendor APIs for , self-hosted for regulatory or cost reasons, for repeated prefix tokens (Anthropic's gives ~10Γ— cost reduction on cache hits), semantic caching for β€œthis query is similar to one I've seen before” (requires the request and searching a cache), and batch queues for workloads that don't need real-time latency (50% cost savings on OpenAI and Anthropic batch APIs). At WatchAlgo scale (30+ hour autonomous runs), the caching layer is the difference between a $200 job and a $2,000 job.
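A semantic cache boils down to "embed the query, find the nearest stored query, reuse its response if it is close enough". The sketch below uses a toy bag-of-words `embed` function and a linear scan purely so it runs without dependencies; both are stand-ins for a real embedding model and a vector store, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries = []   # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self._entries.append((embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the cached response for the most similar query, if similar enough."""
        q = embed(query)
        best = max(self._entries, key=lambda entry: cosine(q, entry[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put("reset password for user alice", "Password reset link sent.")

hit = cache.get("reset password for user alice please")
miss = cache.get("freeze the zelle transaction")
```

The threshold deserves care in practice: near-identical queries that differ only in an entity (a different user name, say) can score high, so a semantic cache should be scoped per tenant and kept away from requests where a wrong reuse would matter.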

Guardrails: where they sit and why they're separate

Guardrails are not part of the agent. They are a separate layer between the request and the model (input side) and between the model and the response (output side). Keeping them separate means they can be updated, audited, and failure-tested independently of the agent's logic.

Input guardrails
  • Presidio: Microsoft's PII detection and redaction. Names, SSNs, phone numbers, emails get replaced with placeholders before the prompt reaches the model.
  • LlamaGuard: Meta's small safety classifier. Scores prompts for unsafe content (jailbreaks, toxic requests, out-of-policy asks) before they reach the model.
  • Rebuff: prompt-injection defense. Detects when external text (e.g., log messages, document content) is trying to manipulate the model.
  • Lakera Guard: commercial equivalent of Rebuff + LlamaGuard, often used in regulated environments.
Output guardrails
  • Guardrails AI: schema validation. Enforces JSON schemas, constrains tone, strips PII from responses, catches hallucinations at the shape layer.
  • NeMo Guardrails: NVIDIA's policy enforcement DSL. Programmable rails that can block responses based on content policies, business rules, or regulatory constraints.
  • Custom validators: schema checks, confidence thresholds, consistency checks against retrieved context.
  • Human-in-the-loop: the ultimate output guardrail. For high-stakes actions, the agent's recommendation never executes without a human signoff.
Why separate, not embedded
If you embed guardrails in the agent's prompt (“don't output PII”), you're trusting the model to self-regulate, and the model fails that trust roughly 1-5% of the time. Separating the guardrails into a dedicated layer that runs deterministically on every input and output reduces failure rates by 10-100×. It's the same reason we don't ask application code to self-enforce database-level constraints: separation of concerns is an architectural virtue, not a stylistic one.
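A deterministic output check of the kind argued for above might look like the following. It is a deliberately small sketch using regexes and a hand-rolled field check, not the API of Presidio or Guardrails AI; the `REQUIRED_FIELDS` schema and the two PII patterns are assumptions for the example and are far from complete PII coverage.

```python
import json
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
REQUIRED_FIELDS = {"severity": str, "summary": str}   # hypothetical output schema


def check_output(raw: str) -> dict:
    """Validate the model's raw output and redact PII; runs the same way every time."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "not valid JSON", "output": None}
    if not isinstance(data, dict):
        return {"ok": False, "reason": "expected a JSON object", "output": None}
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return {"ok": False, "reason": f"missing or mistyped field: {field}", "output": None}
    redacted = dict(data)
    redacted["summary"] = EMAIL.sub("<EMAIL>", SSN.sub("<SSN>", data["summary"]))
    return {"ok": True, "reason": "passed", "output": redacted}


verdict = check_output(
    '{"severity": "high", "summary": "User 123-45-6789 (a.b@example.com) flagged"}'
)
```

The point of the sketch is the placement, not the patterns: the check runs outside the model, on every response, so it can be tested and audited without touching the agent's prompt.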

Observability: the feedback loop

Observability is not optional in regulated environments. Every LLM call, every tool call, every guardrail verdict, every cost, every latency: captured and traced end-to-end. This data feeds three downstream needs: regulatory audit (reproduce any decision), continuous evaluation (measure quality over time), and cost/performance optimization (find the slow and expensive paths).

The tooling landscape is covered in depth at sammuthu.com/ai-ml/observability- (Langfuse self-hosted for data-residency-sensitive environments, LangSmith for LangChain-first stacks, and Arize Phoenix for OpenTelemetry-native deployments). For this page, the key point is architectural: observability collectors span the entire data plane and feed back into the control plane. When a prompt version underperforms on the eval set, the control plane knows to roll back. When a model vendor has elevated latency, the router knows to shift traffic. The feedback loop only works if the collectors are comprehensive and the control plane can act on the signal.

python
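# Langfuse Python SDK v3 import path (v2 exposed the decorator from langfuse.decorators).
from langfuse import Langfuse, observe
from anthropic import Anthropic

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment.
langfuse = Langfuse()
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# build_triage_prompt, parse_triage_output, retrieve_from_rag, review_agent, and
# reach_consensus are application helpers defined elsewhere. Each one appears as its
# own span only if it is also decorated with @observe.

@observe(as_type="generation")
def triage_agent(alert: dict, retrieved_context: list) -> dict:
    """Triage agent: entry point for the LLM Council."""
    messages = build_triage_prompt(alert, retrieved_context)
    response = client.messages.create(
        model="claude-opus-4-7",  # in production, resolved from the model registry
        max_tokens=4096,
        messages=messages,
    )
    return parse_triage_output(response)

@observe()
def llm_council_run(alert: dict) -> dict:
    """Parent trace: decorated child calls nest under it in the Langfuse UI."""
    context = retrieve_from_rag(alert)    # child span if decorated
    draft = triage_agent(alert, context)  # child generation span
    reviews = [
        review_agent(draft, reviewer_model="gemini-2.5-pro"),  # child span if decorated
        review_agent(draft, reviewer_model="gpt-5-codex"),     # child span if decorated
    ]
    return reach_consensus(draft, reviews)  # child span if decorated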

Langfuse instrumentation at the orchestrator layer: every LLM call, tool call, and decision becomes a span in a traceable tree.
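The "control plane acts on the signal" half of the feedback loop can be as simple as a comparison between canary and stable eval scores. The function below is a hypothetical decision rule; the name `next_action`, the minimum sample size, and the 0.02 margin are assumptions, and a real rollout would likely add a statistical significance test.

```python
from statistics import mean


def next_action(stable_scores: list, canary_scores: list,
                min_samples: int = 50, margin: float = 0.02) -> str:
    """Decide what the control plane should do with a canary prompt or model.

    Scores are per-request eval results in [0, 1] collected by observability.
    """
    if not stable_scores or len(canary_scores) < min_samples:
        return "keep_collecting"          # not enough signal to act on yet
    delta = mean(canary_scores) - mean(stable_scores)
    if delta >= margin:
        return "promote_canary"
    if delta <= -margin:
        return "rollback_canary"
    return "hold"


stable = [0.8] * 100
decision_for_worse_canary = next_action(stable, [0.7] * 60)
decision_for_better_canary = next_action(stable, [0.9] * 60)
```

Wiring this rule to the prompt or model registry is what closes the loop: observability produces the scores, and the registry operation (promote or roll back) changes data-plane behavior on the next request.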

Responsible AI: the compliance overlay

“Responsible AI” is often treated as a governance checkbox: write a policy document, run an annual review, done. That framing fails. In an enterprise agent platform, responsible-AI requirements are architectural: they constrain which models can serve which requests, which data classes can flow to which vendors, what audit detail must be captured, and what explainability the system must produce.

Six dimensions that matter for 2026 and beyond:

🎯 Fairness

Bias detection on outcomes across protected attributes. Requires sampling + ground-truth labeling. Not solved by β€œthe model is unbiased” β€” has to be measured per deployment.

πŸ” Explainability

For every decision: what model, what prompt, what retrieved context, what tools, what reasoning. Required for regulatory audit. + audit log makes this possible.

πŸ”’ Privacy

Data minimization (don't send more than needed), data classification enforcement (PII routes to internal models only), data residency (EU data stays in EU).

πŸ›‘οΈ Safety

layer. detection, PII redaction, toxic-content filtering, defense. Layered, not single-point.

πŸ‘€ Human oversight

High-stakes actions require human approval. Low-stakes actions can auto-execute within pre-approved policy. Defining β€œhigh-stakes” is the architecture decision.

πŸ“Š Accountability

Clear ownership: who is responsible when the system makes a wrong decision? The 's and audit log establish this chain.

The architectural claim
Responsible AI is not a policy document. It is the set of constraints that shape the control plane, data plane, and guardrails architecture. If you built the platform without those constraints in mind, no amount of governance review will retrofit them cheaply.

Microsoft agent ecosystem: concrete mapping

The Microsoft stack for enterprise agents has matured substantially in 2025-2026 and deserves a specific walkthrough, both because it's a common enterprise choice and because it names its components differently from the generic architecture above. The mapping is clean once you see it.

Azure AI Foundry: the control plane

Microsoft's unified platform for agent development and governance. Holds the model catalog (Azure OpenAI deployments, Meta Llama, Mistral, internal fine-tunes), the prompt catalog, the tool catalog (via MCP and Azure Function integrations), datasets, responsible-AI policies, and content-filter configurations. This is where the RBAC, audit, and governance configuration lives. If a JD mentions “Azure AI Foundry experience,” they mean the control plane.

Azure OpenAI + Azure AI Model Catalog: the model router + inference layer

Azure OpenAI hosts the OpenAI models (GPT-5, Codex) inside Azure tenancy with data-residency guarantees. The broader Azure AI Model Catalog adds Llama, Mistral, and specialty models. Together they form the inference layer plus Microsoft-flavored model routing (the “deployment” abstraction is effectively a routing entry). RBAC, content filtering, and rate limiting are built in at this layer.

Semantic Kernel: the orchestration framework

Microsoft's open-source agent orchestration library. Handles single-agent tool-using loops and multi-agent coordination, with native integration to Azure OpenAI and the tool catalog. Comparable to LangGraph or AutoGen from the generic landscape, but Microsoft-native and production-hardened. When a JD says “Microsoft agent ecosystem: orchestration, tool integration,” they usually mean Semantic Kernel.

Copilot Studio: the agent authoring layer

Low-code agent authoring for business users. Sits on top of Azure AI Foundry and Semantic Kernel. Not relevant for senior-engineer roles building the platform, but worth knowing because many enterprise AI deployments mix Copilot Studio (business user agents) with custom Semantic Kernel agents (engineering-built agents), and the platform has to support both.

Azure Content Safety + Prompt Shields: the guardrails layer

Azure's native guardrails service. Content Safety handles toxicity, bias, PII. Prompt Shields handles prompt injection and jailbreak detection. Both are callable as REST APIs before and after the LLM call, integrated with Foundry policies. Comparable to LlamaGuard + Rebuff from the generic landscape, but Microsoft-native and SLA-backed.

Azure Monitor + Application Insights: observability

Native tracing and telemetry. Semantic Kernel instruments via OpenTelemetry, so traces flow into Azure Monitor with LLM-specific semantic conventions. For teams standardized on Microsoft monitoring, this is the path of least resistance. For multi-cloud or Azure-skeptical teams, self-hosted Langfuse or Arize Phoenix work equally well.

🏒The same architecture across three stacks

Microsoft, AWS, and a custom-harness approach all implement the same architectural pattern with different named components. This table is the Rosetta Stone β€” given a requirement, you can find the right component in whichever stack you're working in.

| Architectural Layer | Microsoft | AWS Bedrock | Custom / Open-source |
| --- | --- | --- | --- |
| Control plane | Azure AI Foundry | Amazon Bedrock Studio | Langfuse + custom registries |
| Model registry | Azure AI Model Catalog | Bedrock model catalog | YAML + tagging |
| Prompt registry | Foundry Prompt Flow | Bedrock Prompt Management | Langfuse Prompts / LangSmith |
| Tool registry | Foundry Agents + MCP | Bedrock Agent Actions | MCP server + internal APIs |
| Policy & RBAC | Entra ID + Foundry policies | IAM + Bedrock guardrails | OPA / Cedar |
| Audit log | Azure Monitor + Purview | CloudTrail + Bedrock logs | OpenTelemetry + immutable store |
| Model router | AOAI deployments | Bedrock routing | LiteLLM / custom router |
| Orchestrator | Semantic Kernel | Strands Agents SDK | LangGraph / AutoGen / custom |
| Input guardrails | Prompt Shields + Content Safety | Bedrock Guardrails | Presidio + LlamaGuard + Rebuff |
| Output guardrails | Content Safety + Foundry evals | Bedrock Guardrails | Guardrails AI + NeMo Guardrails |
| Observability | Azure Monitor + App Insights | CloudWatch + Bedrock logs | Langfuse / Phoenix / LangSmith |
The practical truth
Most real-world enterprise deployments are hybrid: buy the control plane from a vendor (Microsoft or AWS), build the agents using the vendor's orchestration SDK, but layer custom observability and custom guardrails on top for portability. Locking in fully to one vendor's stack makes multi-cloud migration expensive; building everything custom is slower than most teams can justify. The middle path (vendor control plane + portable agent code + portable observability) is where most production deployments land.

🏭Custom vs off-the-shelf β€” an opinionated take

The build-vs-buy question at the platform layer has a clear answer and a murky answer. Clear answer: buy the . Registries, , audit logs, policy engines β€” these are commodity capabilities where the vendors have done the work and regulatory review is already done. Building these from scratch is pointless.

Murky answer: the orchestration and layer. Here the trade-off is real. Off-the-shelf orchestrators (Semantic Kernel, LangGraph, ) cover the common patterns but can fight you when you need something non-standard β€” like the adversarial-consensus pattern I built for the . That pattern isn't well-served by any off-the-shelf orchestrator because the consensus logic is custom and the failure modes are specific to the workload. For that specific use case, a 200-line custom harness beats adopting a framework.

My default architectural recommendation for enterprise teams in 2026:

  • Control plane: buy (Azure AI Foundry, AWS Bedrock Studio, or equivalent)
  • Orchestration: default to the vendor's SDK (Semantic Kernel, Strands, LangGraph) unless you have a proven non-standard pattern
  • Guardrails: layer the vendor's first-party guardrails (Prompt Shields, Bedrock Guardrails) with at least one portable OSS layer (LlamaGuard or Guardrails AI) for defense in depth
  • Observability: portable from day one (Langfuse self-hosted or OpenTelemetry-first), not vendor-locked

The LLM Council I built for WatchAlgo is an example of going custom on every layer, which was the right choice for a one-person team shipping a specific multi-agent pattern, and would be the wrong choice for a 50-engineer enterprise team that needs audit, governance, and multi-tenant RBAC on day one. Context matters more than ideology.