Every senior AI JD now lists βcontrol plane,β βguardrails,β βRBAC,β βaudit logging,β βmodel routing,β βobservability,β and βgovernanceβ as if they form a single platform category. They do β but most teams only implement three or four of the layers and call it done. This page is the full architectural view: what each layer is, why it exists, how the layers interact, and where Microsoft, AWS, and custom stacks fit into the same mental model.
Kubernetes taught a generation of engineers the most important distinction in distributed systems architecture: the control plane configures the system; the data plane runs the traffic. The control plane is where operators declare intent (βthese pods should be runningβ); the data plane is where the actual workload executes. The two layers scale independently, have different availability requirements, and have different failure modes.
The same separation applies to enterprise agent platforms, and the best teams treat it as the load-bearing architectural decision. Model versions, prompt versions, tool allowlists, RBAC policies, governance rules, and audit log retention all live in the control plane. Inference calls, agent orchestration, caching, rate limiting, and safety checks all live in the data plane. Guardrails are a cross-cutting concern. Observability collects from the data plane and feeds back into the control plane.
Most early agent deployments skip this distinction and wire everything into one monolith. That works for a demo. It breaks the moment you need to roll out a new prompt version without redeploying code, enforce data-classification-based RBAC, prove regulatory compliance to an auditor, or route traffic between three different model vendors based on cost. At that point teams retrofit the control plane, which is always more expensive than building it from the start.
Read the diagram top to bottom. The control plane holds the configuration and policy. Input guardrails sanitize what flows into the data plane. The data plane handles the actual runtime β model routing on the left, agent orchestration in the middle, inference and caching on the right. Output guardrails validate and enforce policy on outputs. Observability collectors feed traces, evaluations, cost, and latency data back up to the control plane, closing the loop.
Five components. Each one is its own service with its own durable store, its own RBAC model, and its own API. Treat them as infrastructure, not as code embedded in the agent.
Versioned catalog of every model the platform can route to
Each registered model has metadata: provider, endpoint, cost per 1M tokens, capabilities (reasoning, tool-use, vision, long-context), regulatory approval status (can it process PII? PCI? PHI?), and rollout tier (production, canary, experimental). The router (data plane) queries the registry to decide which model serves each request. New model versions get registered before they're routable, so rollout is a separate operation from deployment.
Versioned, role-scoped prompts with A/B testing
Prompts are configuration, not code. Treat them with the same rigor as database schemas: versioned, reviewed, tested against a golden set before rollout, deployable independently of code. Modern prompt registries (Langfuse Prompts, LangSmith Prompt Hub, Braintrust prompts) let you A/B test a canary prompt against production traffic, compare win rates on an eval set, and promote the winner. Without this layer, prompt changes require code deploys and everyone is afraid to touch them.
Catalog of callable tools with schemas, RBAC, and audit
The Model Context Protocol (MCP) standardized how LLMs discover and call tools β think of it as βUSB for agent tool use.β The tool registry lists every tool the platform exposes, its input/output schema, which agents are authorized to call it, and what audit event gets emitted on invocation. High-risk tools (anything that can write or take action, like freezing transactions or disabling accounts) live here with extra scrutiny: mandatory human-approval gates, rate limits, and kill-switches.
Who can invoke what models, what tools, on what data
Three dimensions of policy: identity-based (who is making the request β user, service account, agent), resource-based (which model, which tools, which data), and context-based (what is the data classification, what is the time of day, what is the risk score). Modern platforms evaluate all three per request using policy engines like Open Policy Agent (OPA), Cedar, or custom policy languages. For regulated environments, this is where you enforce βPII data may only be processed by internal models, never by external vendor APIs.β
Every decision traceable, with full provenance chain
Regulators don't just want to know βthe AI made a decision.β They want to know which model, which prompt version, what context it retrieved, what tools it called, what guardrails approved it, and who authorized the overall workflow. The audit log captures the full provenance chain, stored immutably with retention that matches regulatory requirements (7 years for SOX, longer for some banking regulations). OCC and FFIEC examiners will ask for this during AI audits in 2026 and beyond. If the audit trail can't reconstruct a decision end-to-end, the AI system is not production-ready for regulated workloads.
This is what runs when a user or system actually makes a request. Three major components β router, orchestrator, inference β plus caching and rate-limiting as cross-cutting concerns.
Decides which model serves each request
The router is where most of the operational leverage lives. A good router routes requests to the cheapest capable model by default, falls back on vendor failures, enforces per-tenant rate limits, applies cost budgets (βthis tenant has $500/month; throttle at 80%β), and supports canary deployments (β5% of traffic goes to the new model; if eval win-rate exceeds 0.8, promote to 100%β). Teams that skip the router end up hard-coding model names in application code, which makes every model upgrade a code deploy across every service.
Single-agent or multi-agent execution with tool use
The orchestrator is the loop that drives the agent: call the model, parse the response, execute tool calls, feed results back, repeat until termination. Single-agent works for simple tool-using workflows. Multi-agent is worth the orchestration overhead when you need either parallel execution (embarrassingly parallel work) or adversarial review (where the cost of an error is high enough to justify running 2-3 models and requiring consensus). The LLM Council pattern I built for the AI Factory is a specific multi-agent design: Claude proposes, Gemini and Codex review in parallel, consensus required before action. It's a structural answer to hallucinations in high-stakes decisions.
Actually call the LLM (or a cached response)
The final layer. Vendor APIs for frontier models, self-hosted for regulatory or cost reasons, prompt caching for repeated prefix tokens (Anthropic's prompt caching gives ~10Γ cost reduction on cache hits), semantic caching for βthis query is similar to one I've seen beforeβ (requires embedding the request and searching a cache), and batch queues for workloads that don't need real-time latency (50% cost savings on OpenAI and Anthropic batch APIs). At WatchAlgo scale (30+ hour autonomous runs), the caching layer is the difference between a $200 job and a $2,000 job.
Guardrails are not part of the agent. They are a separate layer between the request and the LLM (input side) and between the LLM and the response (output side). Keeping them separate means they can be updated, audited, and failure-tested independently of the agent's logic.
Observability is not optional in regulated environments. Every LLM call, every tool call, every guardrail verdict, every cost, every latency β captured and traced end-to-end. This data feeds three downstream needs: regulatory audit (reproduce any decision), continuous evaluation (measure quality over time), and cost/performance optimization (find the slow and expensive paths).
The tooling landscape is covered in depth at sammuthu.com/ai-ml/observability-evals β Langfuse self-hosted for data-residency-sensitive environments, LangSmith for LangChain-first stacks, Arize Phoenix for OpenTelemetry-native deployments, Braintrust for evaluation-first workflows. For this page, the key point is architectural: observability collectors span the entire data plane and feed back into the control plane. When a prompt version underperforms on the eval set, the control plane knows to roll back. When a model vendor has elevated latency, the router knows to shift traffic. The feedback loop only works if the collectors are comprehensive and the control plane can act on the signal.
from langfuse import Langfuse, observe
from anthropic import Anthropic
langfuse = Langfuse() # reads LANGFUSE_HOST from env
client = Anthropic()
@observe(as_type="generation")
def triage_agent(alert: dict, retrieved_context: list) -> dict:
"""Triage agent β entry point for the LLM Council."""
messages = build_triage_prompt(alert, retrieved_context)
response = client.messages.create(
model="claude-opus-4-7", # from model registry
max_tokens=4096,
messages=messages,
)
return parse_triage_output(response)
@observe()
def llm_council_run(alert: dict) -> dict:
"""Parent trace β all child calls automatically nest in Langfuse UI."""
context = retrieve_from_rag(alert) # traced
draft = triage_agent(alert, context) # traced
reviews = [
review_agent(draft, reviewer_model="gemini-2.5-pro"), # traced
review_agent(draft, reviewer_model="gpt-5-codex"), # traced
]
return reach_consensus(draft, reviews) # tracedLangfuse instrumentation at the orchestrator layer β every LLM call, tool call, and decision becomes a span in a traceable tree.
βResponsible AIβ is often treated as a governance checkbox β write a policy document, run an annual review, done. That framing fails. In an enterprise agent platform, responsible-AI requirements are architectural: they constrain which models can serve which requests, which data classes can flow to which vendors, what audit detail must be captured, and what explainability the system must produce.
Five dimensions that matter for 2026 and beyond:
Bias detection on outcomes across protected attributes. Requires sampling + ground-truth labeling. Not solved by βthe model is unbiasedβ β has to be measured per deployment.
For every decision: what model, what prompt, what retrieved context, what tools, what reasoning. Required for regulatory audit. Structured output + audit log makes this possible.
Data minimization (don't send more than needed), data classification enforcement (PII routes to internal models only), data residency (EU data stays in EU).
Guardrails layer. Jailbreak detection, PII redaction, toxic-content filtering, prompt injection defense. Layered, not single-point.
High-stakes actions require human approval. Low-stakes actions can auto-execute within pre-approved policy. Defining βhigh-stakesβ is the architecture decision.
Clear ownership: who is responsible when the system makes a wrong decision? The control plane's RBAC and audit log establish this chain.
The Microsoft stack for enterprise agents has matured substantially in 2025-2026 and deserves a specific walkthrough, both because it's a common enterprise choice and because it names its components differently from the generic architecture above. The mapping is clean once you see it.
Microsoft's unified platform for agent development and governance. Holds the model catalog (Azure OpenAI deployments, Meta Llama, Mistral, internal fine-tunes), the prompt catalog, the tool catalog (via MCP and Azure Function integrations), evaluation datasets, responsible-AI policies, and content-filter configurations. This is where the RBAC, audit, and governance configuration lives. If a JD mentions βAzure AI Foundry experience,β they mean the control plane.
Azure OpenAI Service hosts the OpenAI frontier models (GPT-5, Codex) inside Azure tenancy with data-residency guarantees. The broader Azure AI Model Catalog adds Llama, Mistral, and specialty models. Together they form the inference layer plus Microsoft-flavored model routing (the βdeploymentβ abstraction is effectively a routing entry). Prompt caching, content filtering, and rate limiting are built in at this layer.
Microsoft's open-source agent orchestration library. Handles single-agent tool-using loops and multi-agent coordination, with native integration to Azure OpenAI and the tool catalog. Comparable to LangGraph or AutoGen from the generic landscape, but Microsoft-native and production-hardened. When a JD says βMicrosoft agent ecosystem β orchestration, tool integration,β they usually mean Semantic Kernel.
Low-code agent builder for business users. Sits on top of Azure AI Foundry and Semantic Kernel. Not relevant for senior-engineer roles building the platform, but worth knowing because many enterprise AI deployments mix Copilot Studio (business user agents) with custom Semantic Kernel agents (engineering-built agents) β and the platform has to support both.
Azure's native guardrail service. Content Safety handles toxicity, bias, PII. Prompt Shields handles jailbreak and prompt injection detection. Both are callable as REST APIs before and after the LLM call, integrated with Foundry policies. Comparable to LlamaGuard + Rebuff from the generic landscape, but Microsoft-native and SLA-backed.
Native tracing and telemetry. Semantic Kernel instruments via OpenTelemetry, so traces flow into Azure Monitor with LLM-specific semantic conventions. For teams standardized on Microsoft monitoring, this is the path of least resistance. For multi-cloud or Azure-skeptical teams, Langfuse self-hosted or Arize Phoenix work equally well.
Microsoft, AWS, and a custom-harness approach all implement the same architectural pattern with different named components. This table is the Rosetta Stone β given a requirement, you can find the right component in whichever stack you're working in.
The build-vs-buy question at the platform layer has a clear answer and a murky answer. Clear answer: buy the control plane. Registries, RBAC, audit logs, policy engines β these are commodity capabilities where the vendors have done the work and regulatory review is already done. Building these from scratch is pointless.
Murky answer: the orchestration and guardrails layer. Here the trade-off is real. Off-the-shelf orchestrators (Semantic Kernel, LangGraph, CrewAI) cover the common patterns but can fight you when you need something non-standard β like the LLM Council adversarial-consensus pattern I built for the AI Factory. That pattern isn't well-served by any off-the-shelf orchestrator because the consensus logic is custom and the failure modes are specific to the workload. For that specific use case, a 200-line custom harness beats adopting a framework.
My default architectural recommendation for enterprise teams in 2026:
The AI Factory I built for WatchAlgo is an example of going custom on every layer β which was the right choice for a one-person team shipping a specific multi-agent pattern, and would be the wrong choice for a 50-engineer enterprise team that needs audit, governance, and multi-tenant RBAC on day one. Context matters more than ideology.