An architectural thesis for what a Managed AI service should look like when it is built by infrastructure-first players β vendor-agnostic, model-agnostic, framework-agnostic, vertically integrated from energy to inference. Not a hyperscaler-vertical product. A category prediction about where Managed AI is going by 2027, and the operating principles for whoever wants to win that future. CMAS is the open thesis; the implementation patterns referenced here are published across the Cosmic AI architecture catalog.
This page is an architecture thesis under the Cosmic portfolio brand β a different category from open-source projects (Mnemos) and live consumer products (CosmicKeys, WatchAlgo). Those are things shipped. This is a thing argued.
The audience: AI infrastructure operators in the vendor-agnostic / infra-first AI cloud category, platform architects evaluating Managed AI build vs. buy, and engineers thinking about where to spend the next five years.
The Managed AI category is being shaped right now, in 2026, by hyperscaler defaults that will define the next decade if nobody articulates a credible alternative. Vertex AI, Bedrock, and Azure ML each represent the right answer for customers who already live deep inside one hyperscaler cloud. They are not the right answer for the growing set of workloads that need vendor portability, sovereign deployment, or compute economics that hyperscaler retail pricing cannot reach.
The opening is real and time-bounded. Infrastructure-first operators who own the energy + datacenter + compute stack have a structural advantage they have not yet converted into a service-layer offering at parity with hyperscaler completeness. The next 18 months are when that conversion happens β or doesn't.
CMAS is the architectural shape of what the conversion looks like. It is published as an open thesis so operators can fork it, modify it, and ship the version that fits their constraint set. The architectural commitment β vendor-agnostic, infra-first, BYO-everything β is the part that should not move.
Managed AI sits between two layers most readers already understand:
Raw compute by itself is a commodity wrapped around a megawatt-hour. The application by itself doesn't know which model to call or how to evaluate the answer. The Managed AI layer is where the differentiation lives β and it is where most of today's vendor lock-in is being silently established.
The Managed AI layer used to be a βnice to haveβ you'd assemble yourself out of LangChain, a vector DB, a Prometheus dashboard, and a lot of glue. As enterprises scale past their first ten production AI workloads, that glue stops being optional. They either build a real service layer in-house, or they buy one from a hyperscaler. The third option β buy one from an infrastructure-first operator that is vendor-agnostic by design β is the one that doesn't fully exist yet at scale. That's the gap CMAS describes.
Three industry-shape forces converge into a 2025β2027 opening for vendor-agnostic Managed AI. The order matters: sovereignty is the primary enterprise hook; model commoditization makes vendor-agnosticism technically viable; energy is the scaling advantage that makes the economics work.
Enterprises that won't put their crown-jewel data into a hyperscaler's first-party LLM β regulatory, competitive, geopolitical, or trust reasons β are a growing, not shrinking, segment. They want the service layer (gateway, observability, RAG) but they want to bring their own models, run them in jurisdictions they choose, and audit every inference call. Hyperscalers can offer this in pieces. An infra-first operator with a vendor-agnostic service layer can offer it as the default. This isn't a 2026 trend β it is a structural enterprise requirement that will only get sharper as data-residency regimes (EU AI Act, sovereign-cloud mandates, sector-specific compliance) mature.
Open-weight frontier-class models (Llama, Qwen, DeepSeek, Mistral, and successors) are closing the capability gap with closed frontier models for an expanding set of workloads. The value migration is moving from βwho has the best modelβ toward βwho runs the model best for this workload.β That favors operators whose value is at the orchestration + execution layer, not at the model layer. This is the technical enabler β without strong open-weight models, vendor-agnosticism would be aspirational; with them, it is the natural shape of a non-hyperscaler Managed AI service.
GPU supply is not the chokepoint at scale; megawatt-hours delivered to the right rack at the right time is. Operators who own or co-locate the energy source (stranded gas, geothermal, behind-the-meter solar, dedicated nuclear PPAs) have a structural cost and capacity advantage that hyperscalers β who must buy retail power on long-cycle contracts β cannot replicate quickly. That advantage rolls up the stack: cheaper compute β cheaper service layer β cheaper Managed AI. Energy is a scaling enabler, not the primary decisive variable. Customers do not choose a Managed AI service because it is powered by geothermal energy. They choose it because the service layer is sovereign, vendor-agnostic, and economically aligned with their workload. Energy economics is how the operator can afford to offer all three.
A first-principles walkthrough of what an enterprise customer actually does to go from βI have a use case and an open-weight modelβ to βI am serving inference at scale.β This is the thesis from the customer's side of the counter.
Customer signs in via their own identity provider (BYOI β Okta / Azure AD / Workload Identity Federation). They land in a tenant scoped by their organization ID. No new account creation; the operator never holds a password.
A short form: which deployment envelope does this workload require β operator-hosted multi-tenant, single-tenant dedicated, sovereign in-country, or air-gapped customer-premise? The chosen envelope determines every downstream constraint. The customer never re-chooses; the operator never blurs.
Three paths: BYOM (upload weights or point at a private registry; the operator's provisioning pipeline packages them into a container with the right inference runtime), operator-hosted open-weight catalog (Llama, Qwen, DeepSeek, Mistral β already containerized, on warm capacity), or external frontier model (Anthropic, OpenAI, Google β inference routes through the operator's gateway, runs on the external provider). The choice is exposed as a policy decision, not a vendor decision.
The customer declares the policy that governs routing per request: cost ceiling, latency ceiling, privacy tier, jurisdiction, fall-back rules. Example: βfor this workload, prefer hosted Llama-70B; if unavailable within 200ms, fall back to Anthropic Claude; never route to a US-region model for EU customer requests.β The gateway enforces the policy. The application never names a model directly.
POST /v1/chat/completions against the operator's gateway endpoint. The application code is OpenAI-compatible at the surface, with a richer native API on top for the orchestration features the application actually uses.
For high-volume or high-cost workloads, the customer attaches an eval suite: golden inputs, scoring rubric, accepted-output examples. The platform replays this against alternative models continuously and surfaces substitution candidates when a cheaper model is passing the eval. (See the Eval-Anchored Model Substitution section below for the workflow detail.)
Every inference call writes to the audit log: provider, model, route reason, eval score, latency, cost. The customer gets one invoice per billing period covering all routes (local + external). The operator handles upstream provider settlement underneath. (See The Economic Model section for the multi-party billing flow.)
A Managed AI service layer that earns βvendor-agnosticβ as a claim β not a marketing word β consists of five capability pillars. Each is architecturally independent; each can be swapped without rewriting the others.
Runs the model, whoever made it, wherever it sits. The service layer never depends on a specific accelerator family, runtime, or cluster orchestrator. Workload portability across accelerator classes (current-gen NVIDIA, AMD, TPU, near-future ASICs), inference runtimes (TensorRT-LLM, vLLM, SGLang, llama.cpp), and generation transitions. Energy-aware scheduling. Auto-scale-to-zero for idle workloads. Capacity reservations survive accelerator generation changes β workload portability is the SLA, not GPU model. The abstraction boundary is named explicitly: customers code against a portable runtime contract; they do not name accelerator SKUs in their code, ever.
Where every inference call enters the system. The single architectural commitment that determines whether the service layer is genuinely vendor-agnostic or just claims to be. One API surface (OpenAI-compatible floor + richer native extensions). Pluggable backends β first-party hosted, BYOM, external providers. Model routing by policy. Per-request observability with provider, model, route reason, cost, latency recorded. Identity-aware policy gates before the model is even chosen.
Deep-dive β Model-Agnostic Architecture β the routing, security, and pluggable-provider patterns that make this pillar real instead of aspirational.
Where multi-step, multi-model, multi-tool workflows live. The layer that turns βmodel callβ into βagent that does the job.β First-class agent runtime with durable execution, retries, state persistence, idempotency. Tool calling with permission boundaries. Multi-model orchestration. Long-running workloads (30+ hour autonomous runs). Human-in-the-loop checkpoints by policy.
Deep-dives β AI Factory (three-layer operating model, 12 quality gates) and Model Committee (LLM Council pattern for multi-model consensus).
Separates a Managed AI service from a chat wrapper. Offline eval suites per workload. Online eval with traffic shadow-replayed against candidate models, drift detection, regression alerts. Cross-model scoring for like-for-like comparison. Cost-per-task and quality-per-task tracked together. Audit trail joining query β retrieved context β model output β eval score.
Deep-dive β Observability & Evals β production-grade patterns for keeping models honest.
The structural commitment that the customer never depends on the operator for anything they can sensibly bring themselves. BYOM (model weights), BYOR (retrieval β vector DB, embedding model, reranker), BYOF (framework β LangGraph, LlamaIndex, Vercel AI SDK, in-house), BYOI (identity β your SSO, your IdP), BYOK (keys β customer-managed encryption). What BYO buys the customer: portability. What BYO buys the operator: a defensible non-lock-in story hyperscalers structurally can't offer.
The five capability pillars describe what the service can do. The control plane defines who owns what, who is allowed to do what, and what the operator is contractually accountable for. This is what makes Managed AI managed rather than rented. Without an explicit control plane, the architecture is an AI-platform component stack; with one, it is a service.
Governance and security are the qualification gates for the category, not features. The operator's product definition of Managed AI must include:
For the enterprise customers CMAS targets, these guarantees are the reason to choose Managed over self-hosted.
Trust boundary starts at identity. Customer identity is federated from the customer's own IdP (BYOI). Per-request identity scoping ensures every inference call carries the human or service identity initiating it. Tenant isolation enforced at namespace, network, and storage layers β not just application layer.
Policy declares what the workload may do (which models, jurisdictions, cost ceilings, fall-back behaviors). The gateway enforces policy before any inference happens. Every enforcement decision β pass, fall-back, denial β is written to a tamper-evident audit log scoped to the customer's tenant.
Operator owns the provisioning pipeline (model packaging, container orchestration, hardware allocation, deployment-envelope enforcement). Customer owns the policy that drives provisioning. Lifecycle transitions are operator-managed but customer-visible β published in advance, audited after.
Four first-class envelopes: operator-hosted multi-tenant (shared with tenant isolation), single-tenant dedicated (dedicated cluster + reservation), sovereign in-country (customer-specified jurisdiction, no data egress), and air-gapped customer-premise (operator stack on customer hardware, signed offline releases). Sovereignty without explicit envelopes stays rhetorical.
This is the workflow that converts the architectural commitment to vendor-agnosticism into a customer-visible operational advantage. It is the single most economically valuable pattern a CMAS-shaped service can offer.
The same workflow was applied (at meta-level) to building WatchAlgo, a content factory generating algorithm-learning solutions and visualizations. The reference pattern was established for the first ~50 problems using an expensive frontier coding agent. Once the eval bar was concrete (does the visualization correctly animate the algorithm? does the code in all three flavors run?), the remaining 1500+ problems were generated using meaningfully cheaper models that consistently passed the eval. Total token spend for the scaled-out problems was a fraction of the reference cost.
This case study validates the workflow. It does not validate the broader market thesis β that's the job of the predictions in the 2027 Bet section below.
A CMAS-shaped Managed AI service composed end-to-end:
The architecture is an economic argument as much as a technical one. Three things make this economic model structurally different from the hyperscaler reseller model.
A CMAS-shaped service charges the customer at the vendor-agnostic level, not the model-or-provider level. Three primitive options:
Whatever the primitive, the contract is between the customer and the operator. The customer never sees an Anthropic invoice or an OpenAI invoice β they pay the operator.
The structural shape of the multi-party flow. The operator is the merchant of record for every inference call, regardless of where the inference physically happens:
This is the merchant-of-record pattern applied to AI inference. It is structurally what enterprises want when they buy managed rather than raw.
| Advantage | What it means in practice |
|---|---|
| Power-aware compute pricing | Operators who own or co-locate energy can price compute based on actual marginal cost, not retail-electricity-plus-margin |
| Capacity that survives accelerator transitions | Workload portability is a real SLA when the operator isn't married to one accelerator vendor's roadmap |
| No first-party model lock-in pressure | When the operator doesn't make a model, they have no economic reason to route customers toward one |
| Geographic + jurisdictional flexibility | Sovereign-cloud asks, data-residency requirements, and air-gapped deployments are easier to honor without hyperscaler global-network entanglement |
| Customer outcomes β revenue lock-in | Hyperscaler Managed AI margin depends on stickiness; infra-first Managed AI margin depends on the customer staying because the service is good, not because leaving is painful |
Hyperscalers also resell external models in their Managed AI services. The structural difference: a hyperscaler's reseller margin is partly subsidized by their first-party model business, which creates an incentive to steer customers toward the first-party model. A CMAS operator with no first-party model has no such incentive β every model is just another backend. The pricing, the routing, and the eval-anchored substitution recommendations all operate on the customer's quality + cost interests, not the operator's first-party promotion interests.
Three predictions about how the Managed AI market shape changes over the next ~30 months. Each derives from specific pillars + the control plane.
Enabled by: Pillars 1 + 2 + 5 (compute + gateway + BYO)
By 2027, βwe run on one cloud's Managed AI serviceβ becomes a procurement disqualifier for any workload above a certain size. Customers will demand model portability and inference-runtime portability as a procurement default β not as an advanced negotiation point. Vendor-agnostic operators are positioned to win this turn; hyperscalers will respond with βopenβ SKUs that are still functionally locked.
Enabled by: Pillar 1 (compute) + the energy substrate
For workloads with steady high-volume inference (most production AI in 2027 β chat, search, RAG, agentic), the energy cost per million tokens becomes a larger line item than the model license. Operators with structurally cheap energy will price the layer below cost levels hyperscalers can't reach. The value migrates downward in the stack.
Enabled by: Pillars 3 + 4 (orchestration + eval) + the control plane
At the same time the value migrates downward to energy, the margin migrates upward to orchestration and evaluation. Anybody can rent a GPU. The operators who own the agentic runtime, the eval framework, the observability story, and the trust boundary are the ones who get paid for the difference between βI have GPUsβ and βI have a Managed AI service.β This is where the high-skill engineering work concentrates.
Owns the bottom of the stack, owns the top of the stack, and is genuinely agnostic in the middle. That profile does not describe any of today's dominant Managed AI providers. It describes the operators that will exist if anyone makes them exist.
The architecture should feel adoptable, not all-at-once. Three starting points:
Start at the gateway. Put the operator's vendor-agnostic gateway in front of the hyperscaler's first-party models for one or two workloads. The customer's application code keeps working (gateway is OpenAI-compatible). The operator earns the right to the next workload by demonstrating eval-anchored substitution savings on the first. Migration is workload-by-workload, never big-bang.
Adopt the gateway + orchestration pillars first. The customer keeps their existing compute provider but starts using the operator's service layer. As capacity needs grow, the customer can migrate workloads to operator-hosted compute and decommission their internal glue progressively. The control plane is the second adoption step, not the first.
Adopt the orchestration + eval pillars first (replace the LangChain / vector-DB / Prometheus stack with operator-managed equivalents). The customer keeps their LLM provider choice initially; the gateway and BYO commitments let them substitute models later. Eval-anchored substitution is the migration mechanism β once the eval is locked, the model behind it can change without rewriting the application.
In all three paths, the customer never has to commit to the whole CMAS architecture before they have evidence it works. The architecture is adoptable in increments.
Why this thesis is published as Cosmic-brand architecture: the Managed AI category is being shaped right now, in 2026, by hyperscaler defaults that will define the next decade if nobody articulates a credible alternative. The alternative isn't a single product β it's an architectural specification any infrastructure-first operator could implement, and a checklist customers can use when evaluating Managed AI offerings.
What I want from operators reading this: disagreement, refinement, and execution. The thesis is wrong in details that only operators inside the build see. The pattern catalog cross-linked from this page is published so others can fork it, modify it, and ship the version that fits their constraint set. The architectural commitment β vendor-agnostic, infra-first, BYO-everything, with an explicit trust boundary and a clean billing topology β is the part that should not move.
What this is NOT: a product launch, a hyperscaler critique, or a consulting offer. I run a small AI-native lab; I publish architecture and I ship products. CMAS is the third kind of artifact: a category prediction with enough specificity that it's falsifiable. If by 2027 the Managed AI category has not unbundled in the direction this thesis describes, the thesis is wrong and I will say so. If it has, the operators who built CMAS-shaped services will be the ones who matter.
Seven architectural decisions any operator building CMAS-shaped has to make. These are intentionally not prescribed; they depend on the operator's specific constraints. They are the in-person conversation hooks.