A field guide to the eight model families that run production AI in 2026 (LLM, SLM, MoE, MLM, LAM, VLM, SAM, LCM) and the routing layer that composes them into coherent systems.
For most of the last three years, "AI model" meant one thing: a giant transformer trained on the entire internet, answering everything from legal questions to poetry requests to code generation. GPT-4, Claude, Gemini: all variations on the same theme. One model, one API, one way to do things.
That era is ending. In 2026, building production AI is no longer about finding "the best model." It is about choosing the right model for each specific task inside your system, and building routing layers that dynamically compose them.
Everyone knows LLMs. They are the giant transformers that sparked the AI boom: hundreds of billions to trillions of parameters, trained on the entire internet, capable of reasoning across almost any text domain. GPT-4/5, Claude Opus, Claude Sonnet, Gemini Ultra, LLaMA 3/4, Mistral Large, DeepSeek-V3. These are the heavy hitters.
If LLMs are the giant thinkers in the data center, SLMs are the nimble workers running on your phone. Small Language Models are compact transformers (1–10 billion parameters) optimized to run locally on consumer devices without touching the cloud.
Phi-3 (Microsoft), Gemma 2 (Google), Llama 3.2 1B/3B (Meta), TinyLlama, Qwen-2 small. These are the SLMs that became the foundation of Apple Intelligence and Gemini Nano running directly on Pixel phones.
Mixture of Experts is an architectural insight that took decades to get right: if your model needs to be big to handle a wide range of tasks, but most individual tasks don't require the full model, why activate the whole thing for every query?
A MoE model has hundreds of billions to trillions of parameters total, but a small router network decides, for each input, which subset of specialized sub-networks ("experts") should actually run. Mixtral 8x7B, Mixtral 8x22B, DeepSeek-V3 (671B total, 37B active), Qwen-MoE. GPT-4 is rumored to be MoE under the hood.
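The routing idea can be sketched in a few lines of plain Python. This is a toy, not any production MoE's API: the gate is a linear scorer and the "experts" are ordinary functions standing in for feed-forward sub-networks.

```python
import math

def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Score every expert with a linear gate, keep only the top-k,
    and mix their outputs weighted by renormalized gate probabilities."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = _softmax(scores)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top_k)   # renormalize over the survivors
    return sum(probs[i] / norm * experts[i](x) for i in top_k)
```

With k=2 out of, say, 64 experts, only a few percent of the expert parameters run per token, which is the whole economic point of the architecture.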
Before ChatGPT made the world care about generation, there was a different generation of language models focused on understanding. Masked Language Models (BERT, RoBERTa, DeBERTa, DistilBERT) are the workhorses of text classification, extraction, and search.
Instead of predicting the next word, MLMs take a sentence with certain words "masked" and predict the masked words using both preceding and following context. This bidirectional attention makes them exceptional at understanding text: classifying it, extracting entities from it, or converting it into a vector for search.
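The bidirectional-versus-causal distinction can be made concrete with attention masks. A minimal sketch, where `mask[i][j] == 1` means token i may attend to token j:

```python
def causal_mask(n):
    # Decoder-style (GPT): token i sees only itself and earlier tokens,
    # because it must predict the NEXT token without peeking ahead.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Encoder-style (BERT): every token sees the whole sentence, so a
    # masked word is predicted from both left and right context.
    return [[1] * n for _ in range(n)]
```

For "The [MASK] sat on the mat", the bidirectional mask lets the [MASK] position attend to "sat on the mat" as well as "The", which is exactly what a next-token model cannot do.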
An LLM is a thinker. Give it a problem, and it tells you how to solve it. A Large Action Model is a doer. Give it a problem, and it actually solves it: by navigating a web browser, clicking buttons, filling forms, calling APIs, and executing the multi-step sequence of actions required to complete the task in the real world.
The technical distinction is that LAMs are trained not just to predict the next token, but to predict the next action in a sequence of interactions with some interface. Rabbit R1's LAM, Adept ACT-1, OpenAI Operator (Computer-Using Agent), Anthropic Claude Computer Use: these are the production LAMs as of 2026.
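The next-action framing implies a different inference loop: instead of one prompt and one completion, the model alternates predict, execute, and observe until the task is done. A hypothetical sketch; `predict_action` and the environment interface stand in for whatever the real products expose.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # e.g. "click", "type", "done"
    target: str = ""
    value: str = ""

def run_task(goal, environment, predict_action, max_steps=10):
    """Alternate predict -> execute -> observe until the model emits 'done'.

    `environment` is any object with observe() and execute(action);
    `predict_action(goal, observation, trace)` stands in for the LAM."""
    observation = environment.observe()
    trace = []
    for _ in range(max_steps):
        action = predict_action(goal, observation, trace)
        trace.append(action)
        if action.kind == "done":
            break
        observation = environment.execute(action)
    return trace
```

The `max_steps` cap matters in practice: an action model that never emits "done" would otherwise click forever.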
A Vision Language Model is an LLM that can also see. It is a transformer trained on both text and images simultaneously, so the model can understand visual content in the same way it understands text. GPT-4o, GPT-4V, Claude 3.5 Sonnet with vision, Gemini Ultra/Pro, LLaVA, Qwen2-VL, Molmo.
Meta's Segment Anything Model is not a language model at all. It is a pure computer vision model trained to do one thing at superhuman precision: identify and isolate individual objects in an image, down to the pixel.
Show SAM a photo of a kitchen and it produces a mask for every single object: the coffee cup, the spoon, the cutting board, each individual apple in the fruit bowl. SAM doesn't understand what any of these objects are (it has no semantic knowledge), but it knows where each object is, pixel by pixel, with an accuracy no other model family approaches.
Traditional diffusion models (Stable Diffusion, Midjourney, DALL-E) work by starting with random noise and iteratively denoising it over 50–200 steps to produce a final image. Each step requires a forward pass. For real-time applications, that is too slow.
Latent Consistency Models are a distillation technique that collapses those denoising steps into just 2–4. The model produces high-quality images in a fraction of the time. SDXL Turbo, SDXL Lightning, LCM-LoRA, Flux Schnell, SDXS.
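The speedup is straightforward step arithmetic: per-step cost stays roughly constant, so cutting the step count cuts wall-clock time almost proportionally. The 30ms-per-step figure below is an assumed illustration, not a benchmark.

```python
def generation_ms(steps, ms_per_step=30):
    # One denoising step is one forward pass through the backbone,
    # so total latency scales linearly with the step count.
    return steps * ms_per_step

full_diffusion = generation_ms(50)  # 50-step baseline
lcm = generation_ms(4)              # consistency-distilled
speedup = full_diffusion / lcm      # ~12x under these assumptions
```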
Knowing what each model family does is necessary but not sufficient. The higher-order skill is composing them into coherent systems that route dynamically based on query characteristics. When a query arrives, the routing layer decides which specialist should handle it. Cheap queries go to small fast models. Complex queries escalate. Image queries go to VLMs. Action queries go to LAMs. Precision vision goes to SAM pipelines.
The simplest approach: hand-coded rules. Easy to build, transparent, deterministic. Also brittle: any query that doesn't fit the rules falls through to a default, which is usually the expensive LLM that was supposed to be the fallback.
# ──────────────────────────────────────────────────────────────
# Rule-based router: no training data, no latency overhead,
# transparent logic. Best for narrow products with predictable
# query distributions.
# ──────────────────────────────────────────────────────────────
from dataclasses import dataclass
from typing import Literal

ModelFamily = Literal["slm", "llm", "moe", "vlm", "lam", "sam", "lcm", "mlm"]


@dataclass
class Query:
    text: str
    has_image: bool = False
    wants_action: bool = False         # set by upstream intent classifier
    needs_pixel_mask: bool = False     # set when request is "cut this out"
    is_generative_image: bool = False  # set for "draw me a ..." prompts


def route(q: Query) -> ModelFamily:
    # Hard rules first: these short-circuit everything else.
    if q.needs_pixel_mask:
        return "sam"  # pixel-perfect isolation
    if q.is_generative_image:
        return "lcm"  # fast image generation
    if q.has_image:
        return "vlm"  # any other image input
    # Action before reasoning: LAM beats LLM when the user wants
    # something DONE, not explained.
    if q.wants_action:
        return "lam"
    # Text path: short factual -> SLM, long complex -> LLM/MoE.
    token_estimate = len(q.text.split())
    if token_estimate < 20 and _looks_factual(q.text):
        return "slm"  # on-device, <200ms
    # Embedding/classification jobs ride the MLM family.
    if _is_classification_task(q.text):
        return "mlm"
    # Default escalation: MoE if available, otherwise dense LLM.
    return "moe" if _moe_endpoint_healthy() else "llm"


def _looks_factual(text: str) -> bool:
    return text.strip().endswith("?") and not any(
        marker in text.lower()
        for marker in ("explain", "analyze", "why", "how does", "compare")
    )


def _is_classification_task(text: str) -> bool:
    return text.startswith(("classify:", "sentiment:", "intent:"))


def _moe_endpoint_healthy() -> bool:
    # In production this checks a circuit breaker; stubbed here.
    return True
Rule-based router: explicit conditions decide the target model family.
A small MLM (like a fine-tuned DistilBERT) looks at the query and decides which downstream model to invoke. Handles the long tail better than hand-coded rules. Adds ~10ms per query, negligible compared to the downstream model call.
# ──────────────────────────────────────────────────────────────
# Classifier router: one small MLM decides which specialist
# handles the query. ~10ms overhead; handles the long tail
# better than hand-coded rules.
# ──────────────────────────────────────────────────────────────
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ROUTER_MODEL = "distilbert-base-uncased"  # fine-tuned offline
LABELS = ["slm", "llm", "moe", "vlm", "lam", "sam", "lcm", "mlm"]

# Load once at process start; hold in memory.
_tokenizer = AutoTokenizer.from_pretrained(ROUTER_MODEL)
_model = AutoModelForSequenceClassification.from_pretrained(
    ROUTER_MODEL,
    num_labels=len(LABELS),
)
_model.eval()


def classify_route(query: str, confidence_floor: float = 0.65) -> str:
    """Return the predicted target model family, or 'llm' as safe default."""
    with torch.inference_mode():
        inputs = _tokenizer(
            query,
            return_tensors="pt",
            truncation=True,
            max_length=128,
        )
        logits = _model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]
    best_idx = int(torch.argmax(probs).item())
    best_conf = float(probs[best_idx].item())
    # Below the confidence floor we escalate to the generalist LLM
    # rather than risk a wrong specialist pick on an ambiguous query.
    if best_conf < confidence_floor:
        return "llm"
    return LABELS[best_idx]


# Example usage in a request handler
def handle_request(query: str) -> str:
    target = classify_route(query)
    # dispatch_to_family is the thin adapter that talks to each model
    # provider (Anthropic, OpenAI, Mistral, on-device runtime, etc.)
    return dispatch_to_family(target, query)
Classifier router: a DistilBERT fine-tuned on query-to-model-family pairs, served locally on CPU.
Try the cheap model first. If confidence is high and the answer passes a quality check, return it. If not, escalate to the next-larger model. Optimizes for cost automatically: most queries land on the cheap tier, and only the hard ones escalate.
# ──────────────────────────────────────────────────────────────
# Cascading router: escalate through a cost-ascending cascade,
# stopping at the first model whose answer clears the gate.
# ──────────────────────────────────────────────────────────────
from dataclasses import dataclass

from anthropic import Anthropic

client = Anthropic()


@dataclass
class CascadeStep:
    model: str
    confidence_gate: float  # min confidence to accept this answer
    cost_per_1k_in: float   # illustrative $/1k input tokens
    cost_per_1k_out: float  # illustrative $/1k output tokens


CASCADE: list[CascadeStep] = [
    CascadeStep("claude-haiku-4-5", confidence_gate=0.85,
                cost_per_1k_in=0.0008, cost_per_1k_out=0.004),
    CascadeStep("claude-sonnet-4-6", confidence_gate=0.90,
                cost_per_1k_in=0.003, cost_per_1k_out=0.015),
    CascadeStep("claude-opus-4-6", confidence_gate=1.00,  # final step
                cost_per_1k_in=0.015, cost_per_1k_out=0.075),
]


def cascade_answer(query: str, system_prompt: str) -> dict:
    """Walk the cascade until an answer clears its step's confidence gate."""
    last_answer = None
    last_model = None
    confidence = 0.0  # defined even if the cascade is empty
    for step in CASCADE:
        response = client.messages.create(
            model=step.model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": query}],
        )
        answer = response.content[0].text
        confidence = _judge_quality(query, answer, step.model)
        last_answer, last_model = answer, step.model
        # Clears the gate: return immediately, don't pay for escalation.
        if confidence >= step.confidence_gate:
            return {
                "model": step.model,
                "answer": answer,
                "confidence": confidence,
                "escalations_used": CASCADE.index(step),
            }
    # Exhausted the cascade: return the last (most expensive) answer anyway.
    return {
        "model": last_model,
        "answer": last_answer,
        "confidence": confidence,
        "escalations_used": len(CASCADE) - 1,
    }


def _judge_quality(query: str, answer: str, model: str) -> float:
    """Lightweight self-evaluation signal. In production this is either
    a small judge model or a rule-based post-hoc validator."""
    if not answer or len(answer) < 20:
        return 0.0
    if any(refusal in answer.lower() for refusal in ("i can't", "i don't know")):
        return 0.3
    return 0.92  # placeholder; real systems use an embedding-based judge
Cascading router: cheap model first, escalate only when the confidence gate fails. Optimizes aggregate cost automatically.
Send the query to multiple models simultaneously and either pick the best answer, combine them, or use adversarial review to catch errors. This is the LLM Council pattern I use at Zen Algorithms: Claude as the primary writer, Gemini and Codex running in parallel as adversarial reviewers. Expensive per query, but produces the highest-quality output because every response is checked by two independent second opinions before it ships.
# ──────────────────────────────────────────────────────────────
# LLM Council: parallel adversarial review pattern.
# Claude writes. Gemini and Codex critique in parallel.
# Writer revises until both reviewers approve, or max rounds hit.
# ──────────────────────────────────────────────────────────────
import os
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

from anthropic import Anthropic
from openai import OpenAI
import google.generativeai as genai

anthropic_client = Anthropic()
openai_client = OpenAI()
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-2.5-pro")


@dataclass
class Review:
    reviewer: str
    approved: bool
    feedback: str


def llm_council_write(
    prompt: str,
    system: str = "You are a rigorous technical writer.",
    max_rounds: int = 3,
) -> str:
    """Iterate until both reviewers approve or max_rounds is reached."""
    draft = _claude_write(prompt, system)
    for round_num in range(max_rounds):
        # Parallel adversarial review: Gemini and Codex run concurrently.
        with ThreadPoolExecutor(max_workers=2) as pool:
            gemini_future = pool.submit(_gemini_critique, prompt, draft)
            codex_future = pool.submit(_codex_critique, prompt, draft)
            gemini_review = gemini_future.result()
            codex_review = codex_future.result()
        if gemini_review.approved and codex_review.approved:
            return draft  # consensus reached
        # Writer revises against both reviewers' feedback simultaneously.
        draft = _claude_revise(draft, gemini_review, codex_review)
    return draft  # ship with residual warnings after max_rounds


def _claude_write(prompt: str, system: str) -> str:
    response = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def _gemini_critique(prompt: str, draft: str) -> Review:
    critique_prompt = (
        f"Original request:\n{prompt}\n\nDraft:\n{draft}\n\n"
        "Find every factual error, logical gap, or missing requirement. "
        "End with exactly one line: 'VERDICT: APPROVE' or 'VERDICT: REJECT'."
    )
    response = gemini_model.generate_content(critique_prompt)
    text = response.text
    return Review(
        reviewer="gemini",
        approved="VERDICT: APPROVE" in text,
        feedback=text,
    )


def _codex_critique(prompt: str, draft: str) -> Review:
    critique_prompt = (
        f"Original request:\n{prompt}\n\nDraft:\n{draft}\n\n"
        "Review as an adversarial technical reviewer. "
        "End with exactly one line: 'VERDICT: APPROVE' or 'VERDICT: REJECT'."
    )
    response = openai_client.chat.completions.create(
        model="gpt-5-codex",
        messages=[{"role": "user", "content": critique_prompt}],
    )
    text = response.choices[0].message.content or ""
    return Review(
        reviewer="codex",
        approved="VERDICT: APPROVE" in text,
        feedback=text,
    )


def _claude_revise(draft: str, g: Review, c: Review) -> str:
    response = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                f"Current draft:\n{draft}\n\n"
                f"Gemini said:\n{g.feedback}\n\n"
                f"Codex said:\n{c.feedback}\n\n"
                "Revise the draft to address both reviewers' concerns. "
                "Return only the revised draft."
            ),
        }],
    )
    return response.content[0].text
LLM Council: Claude writes, Gemini and Codex review in parallel, and the writer revises until consensus. The production pattern behind the AI Factory.
Three concrete examples of production systems that route dynamically between model families. Every one of them looks like a single AI product from the outside and a composed routing layer from the inside.
Single-line autocomplete → a small, fast code model (3–7B params, <100ms). Writing a new function → Claude Sonnet. Reviewing an architectural decision → Claude Opus or GPT-5. The routing decision is based on the scope of the requested change. Cheap queries pay cheap model costs. Complex queries pay complex model costs. The user experience feels seamless because the routing is invisible.
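A hypothetical sketch of that scope-based decision; the tier names are illustrative stand-ins, not any product's actual configuration:

```python
def route_code_task(task_kind: str) -> str:
    """Map the scope of the requested change to a model tier."""
    routes = {
        "autocomplete": "local-code-slm",       # 3-7B, <100ms, runs locally
        "new_function": "mid-tier-llm",         # e.g. a Sonnet-class model
        "architecture_review": "frontier-llm",  # e.g. an Opus-class model
    }
    # Unknown scope: default to the mid tier rather than the cheapest,
    # so ambiguous requests don't silently get a weak model.
    return routes.get(task_kind, "mid-tier-llm")
```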
Tier 1: ~3B SLM on-device on the Neural Engine for writing assistance, tone rewriting, short summarization. Zero latency, zero cloud cost, zero privacy exposure. Tier 2: Private Cloud Compute for queries that exceed the SLM's capability, still with privacy guarantees but larger models. Tier 3: ChatGPT handoff with explicit user consent for frontier-level queries. Three tiers of routing, each picked based on query complexity and privacy requirements. The most architecturally sophisticated consumer AI deployment in the world as of 2026.
At Zen Algorithms, I built a content generation pipeline for 3,247 algorithm problem solutions across three programming languages in three narrative flavors: nearly thirty thousand outputs. A single model for all of them would have been expensive and slow.
Result: 1,600+ AI-authored solutions with autonomous 30+ hour runs, at a fraction of the cost of running everything on the top-tier model. Documented in full at /ai-ml/ai-factory.
With eight model families and four routing patterns, the number of possible system designs is large. Here is the framework I use when architecting a new AI system from scratch.
What are the actual queries your system needs to handle? Group by complexity (simple/medium/hard), modality (text/image/action/hybrid), and latency requirement (real-time/near-real-time/batch). This gives you the query distribution, the input to every subsequent decision.
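The triage step can be sketched as a bucketing function over those three axes. The thresholds below are assumptions to make the idea concrete, not recommendations:

```python
from collections import Counter

def triage(query: dict) -> tuple:
    """Bucket one query by (complexity, modality, latency class)."""
    words = len(query.get("text", "").split())
    complexity = "simple" if words < 20 else "medium" if words < 100 else "hard"
    if query.get("wants_action"):
        modality = "action"
    elif query.get("has_image"):
        modality = "image"
    else:
        modality = "text"
    return (complexity, modality, query.get("latency", "near-real-time"))

def query_distribution(queries) -> Counter:
    """The histogram that drives every subsequent design decision."""
    return Counter(triage(q) for q in queries)
```

Run this over a week of production traffic and the resulting histogram tells you which model families you actually need and in what proportion.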
Simple text → SLM or MoE. Complex text → LLM. Images → VLM. Actions → LAM. Pixel-precise vision → SAM pipelines. Fast image gen → LCM. Search/classification → MLM. Not a strict mapping (some queries could go to multiple families), but mapping them gives you the shape of the system.
Rule-based for narrow products, classifier-based for broad ones, cascading for cost optimization, parallel (LLM Council) for quality-critical paths. Most systems end up hybrid. The key question: what is your highest-priority constraint, cost, latency, privacy, or quality?
Every production AI system needs graceful degradation. Primary model down, rate-limited, or returning low-quality answers: what happens next? Most systems have a chain: primary fails → try secondary → still fails → return a safe default with an apology. Designing fallback chains is not optional. It is the difference between a production system and a demo.
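A minimal sketch of such a chain, assuming each backend is a callable that raises on outage or rate limit:

```python
def answer_with_fallback(query, backends, safe_default="Sorry, try again later."):
    """Walk (name, callable) pairs in priority order; first success wins."""
    for name, call in backends:
        try:
            answer = call(query)
            if answer:                # empty output counts as a soft failure
                return name, answer
        except Exception:
            continue                  # down or rate-limited: try the next one
    return "default", safe_default    # nothing worked: degrade gracefully
```

Returning the backend name alongside the answer matters: it is the signal that tells your observability layer how often the chain is actually being exercised.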
Every query should emit structured logs: which route was taken, which model was invoked, how long it took, how much it cost, what the quality signal was. Without this instrumentation you cannot tune the routing layer, and an untuned routing layer degrades over time as query patterns shift and models evolve.
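A sketch of one such log record; the field names are illustrative, and the point is simply that every decision the router makes becomes queryable data:

```python
import json
import time

def routing_log_record(route, model, started_at, cost_usd, quality_signal):
    """Serialize one routing decision as a structured JSON log line."""
    return json.dumps({
        "route": route,              # which branch the router took
        "model": model,              # which model actually ran
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "cost_usd": cost_usd,
        "quality": quality_signal,   # judge score, thumbs, validator pass
    })
```

Emit one of these per query and the routing layer's cost, latency, and quality curves fall straight out of a log aggregation query.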
The AI architect's highest-leverage skill in 2026 is not model selection, not prompt engineering, and not LLM fine-tuning. It is the ability to design routing layers that dynamically compose multiple specialized models into coherent production systems.
The products that win in 2026 will not be the ones that call the most expensive API for every query. They will be the ones whose routing layers make intelligent decisions about which specialist to invoke for each sub-task, and whose cost, latency, and quality curves are all optimized simultaneously because the right work goes to the right model.
"If the absolute smartest AI systems in the world are no longer monolithic geniuses, but rather committees of highly specialized workers managed by an orchestrator, does that mean the future of human work isn't about trying to be the smartest specialist in the room, but rather learning how to become the ultimate routing layer for the AI tools in your own life?"
– closing exchange, Why Modern AI Is a Committee (two-host conversation, April 2026)