
Enterprise RAG — Anatomy

When you press Enter, who actually decides what gets retrieved?

Most RAG tutorials show one three-box diagram and stop. This page is the walkthrough I wish existed when I was learning — the full production pipeline with chunking, embeddings, hybrid search, and re-ranking — one end-to-end example query, 15 numbered steps, and a clear answer to who decides what at every hop. It's the page to read before the vertical case studies.

Architecture + Sequence · 15-step walkthrough · Naive vs Agentic · Misconceptions cleared

🛤️RAG Learning Path: read in order to build a production RAG system

✓ 📚 Foundations: the concepts & mental models
🧭 RAG Anatomy (you are here): full production pipeline
🗂️ Vertical Examples (next →): domain case studies

Not building a RAG system? The Model Committee deep-dive is a parallel track covering the eight specialized model families and routing patterns — read it after Foundations instead of RAG Anatomy if model composition is what you're after.

🧭The question behind the question

Ask a room of engineers “how does RAG work” and you'll get three boxes on a whiteboard: a vector database, an LLM, and an arrow between them. That diagram is not wrong — it's just so compressed that it hides every decision that matters.

The question I keep getting asked — by recruiters, by candidates, by anyone building their first RAG system — is deceptively simple:

💬The clarifying question

When you use a vector database for a custom knowledge base, who actually decides what to retrieve? Is it the LLM? Is it the vector database? Something else?

The honest answer is: neither of them, in the sense people usually mean. Retrieval is a pipeline of mechanical operations, and different components make different decisions at different points. The vector database never decides anything cognitive; it runs math. The LLM may or may not decide, depending on which architecture you picked. The real deciders are components most tutorials barely mention: the embedding model, the metadata filter, the BM25 keyword index, and the reranker.

The rest of this page is a rigorous walkthrough of exactly who decides what, using one canonical query traced through a real production architecture.

🎯The canonical query

Example query we will trace end-to-end

“What's our parental leave policy for California employees?”

This example is chosen deliberately. It exercises metadata filtering (CA vs other states), exact-match terminology (“parental leave” is a specific HR term), policy freshness (was this updated recently?), and has a clear wrong-answer-equals-compliance-problem stake. It's the kind of query where every step of the pipeline matters.

🎭The cast of actors

Before we walk the flow, meet every component that participates. Pay attention to the “decides” column — this is the layer the conventional diagrams compress away.

👤

User

HR employee

Chooses the question wording. That's their entire contribution — everything downstream is deterministic given this input.

💻

Browser / Client

Next.js UI

Sends the message and renders the streamed response. Stateless — it doesn't know anything about retrieval.

⚙️

App Server / Orchestrator

Next.js API + FastAPI

Decides the pipeline topology. Owns every hop. Receives the query, scopes the retrieval filter, coordinates parallel calls, assembles the prompt. This is the conductor.

🔐

Auth Service

Okta / Auth0 / internal SSO

Decides whether the session is valid. First security boundary — no downstream calls run if auth fails.

👥

HRIS

Workday / BambooHR / Rippling

Source of employee attributes used for filter scoping (work state, department, employment type). Usually cached in the app DB for latency.

🧬

Embedding Model

text-embedding-3-small / voyage-3

Decides what counts as 'semantically similar.' This is the hidden decider most people forget exists. Chosen at design time — long before any user shows up.

🗃️

Vector Database

pgvector / Qdrant / Weaviate

Runs approximate nearest-neighbor math. Decides NOTHING cognitive. Given a query vector and a filter, it returns the top-k nearest candidates by cosine similarity. That's a math operation, not a decision.

🔍

BM25 Index

Postgres tsvector / Elasticsearch

Exact-keyword scoring. Decides based on term frequency, inverse document frequency, and document length — no semantic understanding at all. Runs in parallel with vector search.

⚖️

Reranker

Cohere Rerank-v3 / BGE-reranker

A SECOND machine-learning model that cross-encodes each (query, chunk) pair and re-scores them. This is where retrieval actually gets precise. Most of your top-5 quality comes from this step, not the vector DB.

🛡️

Guardrails

PII scrub + refusal filter

Pre-assembly (redact PII before embedding) and post-generation (scrub PII from the LLM response). Non-blocking but critical for compliance.

🤖

LLM

Claude Sonnet 4.5

In naive RAG: decides only the OUTPUT TEXT given the handed-in context. In agentic RAG: additionally decides when and how to call the retrieval tool.

📜

Audit Log

append-only Postgres table

Records who asked what and which chunks were returned. Decides nothing — pure observability. Must be non-blocking: a log failure must NEVER block a user response.

🏗️The full architecture — take it in at a glance

This is the “oh, that's how it works” diagram. Trace the numbered arrows from 1 to 15 in sequence and you'll see the whole request path without reading a single word of prose. The components are grouped into five bands by architectural responsibility — user, application, identity, retrieval, generation — and the retrieval layer has all four pieces (embedding, vector DB, keyword index, reranker) visually clustered to reinforce that they are one subsystem, not four unrelated services.

Enterprise RAG — Full Architecture (Naive RAG)

🔑Read the bands, not just the arrows

The reason this diagram uses layered bands instead of a linear left-to-right flow is to encode architectural thinking. “Retrieval Layer” is a concept — grouping the four retrieval components visually teaches that embedding + vector DB + BM25 + reranker are one subsystem. Candidates who understand this intuitively are the ones who can actually debug production RAG. Candidates who think of them as four independent services usually can't.

⏱️The sequence — trace every step in time order

Where the architectural diagram above encodes structure, this sequence diagram encodes time. Each vertical lane is one actor; time flows top to bottom. The numbered arrows are the same 15 steps — but now you can see exactly who calls whom in what order, which calls happen in parallel, and where the self-loops on the App Server lane represent internal work (fusion, prompt assembly, post-processing).

Enterprise RAG — Sequence Diagram (time flows top to bottom)

💡Why both diagrams?

The architectural diagram is for the 10-second gut-check (“do I understand the shape”). The sequence diagram is for the 2-minute trace (“do I understand the order”). A portfolio page without both is missing one of the two ways senior engineers evaluate system designs.

📋The 15 steps, in detail

Now the detailed walk. Each step lists the actor, what goes in, what comes out, who decides, and what happens when it fails. The “why this matters” note appears on the steps where the common mental model diverges from what actually happens.

User submits the question

User → Browser

Input

Typed text: "What's our parental leave policy for California employees?"

Output

Nothing yet — the message lives in the client state

Who decides

The user (word choice is the only decision)

Fail mode

App server (pure code, no external call)

Input

Two ranked lists of 50 chunks each

Output

One fused list (~70 unique chunks after dedup) scored by Reciprocal Rank Fusion

Who decides

The fusion formula — score(doc) = Σ 1 / (k + rank_i(doc)) across both lists

Fail mode

Deterministic math, no failure mode
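The fusion formula above is small enough to write out in full. This is a minimal sketch, assuming 1-based ranks and the conventional smoothing constant k=60; the function name and the sample chunk ids in the second list are illustrative, not taken from any particular library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk ids with Reciprocal Rank Fusion.

    score(doc) = sum over lists of 1 / (k + rank_i(doc)), with 1-based ranks.
    A chunk that is missing from a list simply contributes nothing for it.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy rankings standing in for the vector-search and BM25 result lists.
vector_hits = ["c-7322", "c-7321", "c-7323"]
bm25_hits = ["c-7321", "c-9001", "c-7322"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks that appear in both lists rise to the top even when neither list ranked them first, which is the property that makes RRF a reasonable default for combining scores that are not on the same scale.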

App server sends top 50 fused candidates to the reranker

App server → Reranker model

Input

Query text + 50 candidate chunks as text

Output

50 chunks with precision scores, reordered by cross-encoder relevance

Who decides

The reranker is a SECOND machine-learning model — specifically a cross-encoder that looks at each (query, chunk) pair TOGETHER and scores them. This is fundamentally different from the embedding model, which looks at query and chunk separately.

Fail mode

Reranker timeout → fall back to fused top-K without rerank (degraded precision, still usable)

★ Why this matters

App server (pure code + optional guardrail model)

Input

Streamed LLM tokens

Output

Cleaned response with inline citations linked back to source chunks

Who decides

Post-processing rules: PII scrub, refusal detection, citation injection

Fail mode

PII leak detected → substitute safe fallback; alert on-call

App server writes the audit log

App server → Audit log

Input

user_id, query, retrieved_chunk_ids, llm_response, timestamp, latency_ms

Output

Confirmation (or dead-letter queue if log is down)

Who decides

Logging policy — which fields are logged, retention, PII handling

Fail mode

Log failure must NOT block the user response. Critical design invariant: audit is best-effort, never blocking.
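The "best-effort, never blocking" invariant can be sketched as a wrapper that swallows sink failures and parks the record for later replay. This is an illustration under stated assumptions: `write_audit`, `broken_sink`, and the in-memory `dead_letter` queue are hypothetical stand-ins, and a real service would also run the write off the request path (for example as a background task) so a slow sink cannot add latency.

```python
import json
import queue
import time

# Hypothetical in-memory stand-in for a real dead-letter queue.
dead_letter = queue.Queue()


def write_audit(record, sink):
    """Try to write one audit record; never raise to the caller.

    Returns True on success. On any sink failure the record is serialized
    into the dead-letter queue so it can be replayed later.
    """
    try:
        sink(record)
        return True
    except Exception:
        dead_letter.put(json.dumps(record))
        return False


def broken_sink(record):
    # Simulates the audit table being unavailable.
    raise ConnectionError("audit table unavailable")


record = {
    "user_id": "u-1",
    "query": "parental leave policy for California employees",
    "retrieved_chunk_ids": ["c-7321", "c-7322"],
    "llm_response": "(response text)",
    "timestamp": time.time(),
    "latency_ms": 840,
}

ok = write_audit(record, broken_sink)  # the user response proceeds either way
```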

App server streams the response to the browser

App server → User

Input

Cleaned tokens + final citations

Output

Rendered in the chat UI with clickable source links

Who decides

UI rendering rules

Fail mode

Connection drop → client retry logic replays from the last received token

🎯Agentic RAG — what changes when the LLM takes the driver's seat

Everything above is naive RAG — a pre-programmed pipeline where the LLM is the last step and plays no role in deciding what gets retrieved. In agentic RAG, the LLM is placed at the top of the call stack and given the retrieval pipeline as a tool. It decides when to call it, what query to pass, and — crucially — whether it needs to call it again after seeing the first results.

Agentic RAG — the same pipeline, re-wired with the LLM at the top

The important insight: agentic RAG isn't a different retrieval system. It's the same retrieval system (same embedding model, same vector DB, same BM25, same reranker — steps 5-10 are literally unchanged) with an LLM placed at the top of the call stack. The vector database doesn't know the caller is now an LLM instead of app code, and doesn't care.

What agentic RAG adds

✨ LLM chooses the query phrasing — often better than the user's literal words
✨ LLM chooses the filter values — e.g., recognizes “CA” from context and sets applies_to: [CA]
✨ Multi-hop retrieval — LLM can issue a second, refined query if the first batch was insufficient
✨ No retrieval when unneeded — LLM can answer greetings without touching the vector DB

The trade-offs

⚠️ Higher latency — every tool call adds a round trip
⚠️ Harder evaluation — the retrieval path is no longer deterministic
⚠️ LLM can over-search — without good system prompts, some models call the tool on every turn
⚠️ Cost grows with complexity — multi-hop conversations pay the LLM cost per hop

🔑The four hidden deciders most tutorials undersell

If you were only allowed to remember four things from this page, remember these. These are the components that actually determine whether your retrieval is precise, and they're the ones most tutorials treat as footnotes.

🧬

1. The embedding model

Defines what “similar” means in your whole system. Chosen once at design time, baked into every chunk you've indexed. Switching it means re-embedding your entire corpus. Choose wrong and no amount of reranking can save you.

🎚️

2. The metadata filter

The filter applied to the vector search before ANN runs. Owns permission scoping and freshness filtering. In enterprise RAG this is where access control lives — filter failure equals data leak. The vector DB is dumb; the filter is the brain.

🔍

3. BM25 keyword search

The lexical counterweight to semantic search. Catches exact-match cases (acronyms, error codes, proper nouns, SKU numbers) where embeddings famously fuzz. Running BM25 in parallel with vector search is table-stakes in production; vector-only is a beginner mistake on enterprise corpora.

⚖️

4. The reranker

A second, slower, more precise ML model that cross-encodes each (query, chunk) pair. The reason you can afford a slow model here is that you're only running it on the top-50 candidates the first pass surfaced. Most of your production top-5 precision comes from this step — not the vector DB.
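To make decider 2 concrete, here is a minimal sketch of filter-then-rank, assuming each chunk carries an `applies_to` list and that similarity is plain cosine. It uses brute-force scoring over three toy chunks instead of a real ANN index, and the vectors are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy index: two-dimensional vectors are illustrative only.
CHUNKS = [
    {"chunk_id": "c-7321", "applies_to": ["US"], "vector": [0.9, 0.1]},
    {"chunk_id": "c-7322", "applies_to": ["CA"], "vector": [0.8, 0.2]},
    {"chunk_id": "c-8100", "applies_to": ["NY"], "vector": [1.0, 0.0]},
]


def filtered_search(query_vec, allowed_scopes, top_k=2):
    """Apply the metadata filter first, then rank only the survivors."""
    candidates = [c for c in CHUNKS if set(c["applies_to"]) & allowed_scopes]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["chunk_id"] for c in ranked[:top_k]]


# A California employee may see US-wide and CA-specific policy chunks.
hits = filtered_search([1.0, 0.0], {"CA", "US"})
```

The NY chunk is the nearest vector to the query, yet it never appears in the results: the filter removed it before similarity was computed. That ordering is what lets the filter act as an access-control boundary.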

🔗The handoff — embedding model and LLM are decoupled

The single most common point of confusion in RAG: 'does my embedding model need to match my LLM?' Short answer: no. Here's why.

People assume that because the embedding model converts text to vectors, and the LLM also processes text, the two must somehow “speak the same language”, as if you'd need OpenAI embeddings to use OpenAI LLMs, or Cohere embeddings to use a Cohere reranker. This is false. Embedding models and LLMs live in entirely separate parts of the pipeline and never see each other's outputs.

🔑The rule, in one sentence

The embedding model used at INGEST time and the embedding model used at QUERY time must be the same model and same version. The LLM that receives the retrieved chunks is independent of both — pair any LLM with any embedding model.

What gets stored, and what gets passed to the LLM

The piece most people miss: your vector DB stores BOTH the vector AND the original text of each chunk. The vector is the index used to find the chunk. The original text is the payload sent to the LLM at generation time. The LLM never sees vectors at all — it only ever reads plain text.

What lives in your vector DB (one row per chunk)

chunk_id | vector (used to find the chunk) | original text (sent to the LLM)
c-7321 | [0.014, -0.221, 0.078, ... ] (1024 floats) | “Eligible employees may take up to 12 weeks of unpaid leave under FMLA...”
c-7322 | [0.018, -0.215, 0.080, ... ] (1024 floats) | “California provides additional paid family leave benefits via SDI...”
c-7323 | [-0.005, 0.142, -0.063, ... ] (1024 floats) | “Employees on parental leave continue to accrue PTO at their normal rate...”

The vector finds the chunk (via cosine similarity); the text is what gets handed to the LLM. In the simplest layout both sit on the same row, as in the table here. Some teams put the text in a metadata field on the vector record; others store it in a separate document table and look it up by chunk_id. Either pattern works. What matters is that the original text is preserved and is what the LLM ultimately reads.

The two-stage flow, with the matching rule highlighted

Stage 1 — Ingest (one-time per chunk)

Chunk text

“Eligible employees may take 12 weeks of unpaid leave...”

↓

voyage-3-large

embedding model (chosen once at design time)

↓

Chunk vector

[0.014, -0.221, 0.078, ...]

↓

stored alongside the original text

↓

Vector DB row

{chunk_id: c-7321, vector: [...], text: "Eligible..."}

Stage 2 — Query (every user turn)

User question

“What is parental leave?”

↓

voyage-3-large

⚠ SAME model as ingest — only matching rule

↓

Query vector → vector DB → top match

chunk_id: c-7321 (cosine 0.91)

↓

look up the ORIGINAL TEXT of c-7321

↓

Prompt to LLM (plain text only)

user: What is parental leave?
context: “Eligible employees...”

↓

Claude Sonnet 4.5

ANY LLM — never sees vectors

↓

Response

“Eligible employees may take up to 12 weeks...”

Three things to notice from the flow above

🔒

1. Same embedding model at ingest AND query

Different embedding models produce vectors in incompatible geometric spaces. text-embedding-3-large vectors and BGE-M3 vectors aren't comparable — they live in different “universes.” Switching the embedding model means re-embedding the entire corpus. This is the only matching rule in the whole pipeline.

📦

2. Vector DB stores BOTH vector and text

The vector is the search index — it's used to find the chunk. The original text is the payload — it's what you pass to the LLM. Most teams put the text in a metadata field on the vector record, or look it up by chunk_id from a separate document store. The text is never lost.

🤖

3. LLM is fully independent

The LLM receives only TEXT — the user's question and the original text of the retrieved chunks. It has no awareness of vectors, no awareness of which embedding model was used, no ability to query the vector DB itself. Pair any LLM with any embedding model.
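The three observations can be seen together in a small sketch. Everything here is an assumption made for illustration: `toy-embedder-v1` is a made-up word-count embedder standing in for a real model, and `Row`, `ingest`, and `build_prompt` are hypothetical names. The point is only the shape: vectors are used for search, the stored text is what reaches the prompt, and the query side refuses to run against rows embedded by a different model.

```python
import math
import re
from dataclasses import dataclass

EMBED_MODEL = "toy-embedder-v1"  # hypothetical model name
VOCAB = ["leave", "parental", "unpaid", "pto", "california", "weeks"]


def embed(text):
    """Toy embedding: word counts over a tiny fixed vocabulary."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


@dataclass
class Row:
    chunk_id: str
    vector: list
    text: str
    embedding_model: str


def ingest(chunk_id, text):
    # Stage 1: store the vector AND the original text, tagged with the model.
    return Row(chunk_id, embed(text), text, EMBED_MODEL)


def build_prompt(question, index, query_model=EMBED_MODEL):
    # The only matching rule: query-time model must equal ingest-time model.
    if any(row.embedding_model != query_model for row in index):
        raise ValueError("query and ingest embedding models differ")
    query_vec = embed(question)
    best = max(index, key=lambda row: cosine(query_vec, row.vector))
    # Stage 2: the LLM prompt contains plain text only, never the vector.
    return f"user: {question}\ncontext: {best.text}"


index = [
    ingest("c-7321", "Eligible employees may take up to 12 weeks of unpaid leave under FMLA..."),
    ingest("c-7322", "California provides additional paid family leave benefits via SDI..."),
    ingest("c-7323", "Employees on parental leave continue to accrue PTO at their normal rate..."),
]
prompt = build_prompt("What is parental leave?", index)
```

With this toy embedder the best match is c-7323 because it shares the most vocabulary words with the question; a real embedding model would rank by meaning rather than word overlap.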

Concrete combinations that work in production

Every one of these is a valid production pairing. What works for your system depends on cost, latency, and quality — not on whether the embedding model and LLM share a vendor:

✅ OpenAI text-embedding-3-large + Anthropic Claude Sonnet 4.5: mix vendors freely; cross-vendor pairings like this are common in production
✅ voyage-3-large + OpenAI GPT-5: specialty embedding from one vendor, frontier LLM from another
✅ BGE-M3 (open-source, self-hosted) + Google Gemini 2.5 Pro: open-source embedding for cost control + cloud LLM for quality
✅ text-embedding-3-small + Llama 3.3 70B (self-hosted): cheap embedding via API + open-source LLM for full data sovereignty
✅ Cohere Embed v3 + Claude Haiku 4.5: no constraint forces these to match vendor or model family

⚠️The one combination that DOES NOT work

Mixing embedding models within the same index. If you embedded half your corpus with voyage-3-large and the other half with text-embedding-3-large, the cosine similarities between query and chunk are meaningless — the two halves live in incompatible geometric spaces. You'd retrieve effectively random chunks from one half. The fix: keep one embedding model per index. To migrate models, build a parallel index, A/B test, then switch the read path atomically.

💡So why does the embedding model selection matter so much?

Because the embedding model decides what similar means in your search space — and that decision is made before the LLM is in the picture. If the embedding model puts “parental leave” far from “maternity benefits,” the LLM never sees the right chunk to begin with — no amount of LLM quality can recover what retrieval missed. That's why the next section goes deep on embedding model selection: the LLM is downstream of this decision and depends on it being right.

🧬Choosing the embedding model — three real use cases

The single most consequential design decision in any RAG system, and the one most architecture diagrams render as one unlabeled box.

The four hidden deciders above told you the embedding model defines what “similar” means in your whole system. That's the headline. The follow-up question — the one that actually shapes your architecture — is which embedding model do you pick for which workload? Picking wrong is one of the most expensive mistakes in production RAG because the choice is baked into every chunk you've ingested. Switching means re-embedding the entire corpus.

Below are three workloads I see often, each with a different recommended embedding model and a different reason. The point isn't the specific model name (those change every quarter — see the section below) but the decision pattern: read your domain shape, then choose accordingly.

💡Why this section exists

Every RAG tutorial says “use text-embedding-3 from OpenAI.” That's fine for a tutorial. In production, the embedding model that's right for HR policy retrieval is wrong for security operations, and both are wrong for real-time user-transaction decisioning. The shape of your domain — formal language vs technical jargon vs structured artifacts vs latency budget — determines the choice.

👥

Use case 1 — HR Policy Knowledge Base

Domain shape

Long-form policy text in formal English. Sentences like “Eligible employees may take up to 12 weeks of unpaid leave under FMLA.” Retrieval is forgiving — paraphrases (“maternity benefits” ↔ “parental leave”) are common, and exact-match isn't critical beyond legal terminology. Latency budget: 1-2 seconds.

Recommended embedding

General-purpose web-trained model at 1024-1536 dimensions: voyage-3-large (Voyage AI, 1024-dim by default) or text-embedding-3-large (OpenAI; natively 3072-dim, shortened to 1024 or 1536 through the API's dimensions parameter). Pair with a cross-encoder reranker (Cohere Rerank v3 or BGE Reranker v2-gemma) on the top-50.

Why this choice

✦ Both models are trained on web-scale general English where HR-policy language is well-represented — you get strong synonym and paraphrase matching out of the box.
✦ 1024-1536 dimensions are sufficient for an HR corpus of 50K-500K chunks. Going higher (3072-dim) buys imperceptible recall gain at the cost of a 2-3× larger index and proportionally more distance computation per search.
✦ Reranker is non-negotiable: HR queries paraphrase heavily, so the first-pass retrieval is high-recall but noisy. The reranker trims to the top-5 that actually answer the asker's question.
✦ Compliance bonus: general English models don't need fine-tuning, which means no PII leaving your premises and no model-card review with legal.

What you'd be wrong to pick

A code-tuned model (CodeBERT, voyage-code-3). Code embeddings prioritize structural similarity (token shape, syntax) over semantic meaning of natural language — they collapse the distinction between “leave” and “leaving” because they care about tokens. HR queries demand semantic generalization, not structural alignment.

🛡️

Use case 2 — Security Operations (SecOps)

Domain shape

Short, dense, technical artifacts. CVE IDs, IOCs (indicators of compromise — file hashes, IP addresses, domain names), MITRE ATT&CK technique IDs (T1059, T1547), malware family names (Emotet, Qakbot), TTPs. Retrieval is precision-critical: “CVE-2024-1234” must NOT match “CVE-2024-1235.” Latency budget: 100-500ms during alert triage.

Recommended embedding

A model that produces both dense and sparse representations — BGE-M3 is the strongest open-source choice because it emits dense + sparse + ColBERT-style multi-vector outputs in a single pass. Mature SOCs fine-tune BAAI/bge-large-en-v1.5 on their own (alert, related-alert) pairs from the SIEM.

Why this choice

✦ General-purpose embeddings collapse semantically distinct hashes and IPs into a single “alphanumeric blob” cluster — the model has no signal to tell CVE-2024-1234 from CVE-2024-1235 because both look like “long alphanumeric tokens.” You retrieve the wrong vulnerabilities and triage collapses.
✦ BGE-M3's sparse component captures exact terms (CVE numbers, T-codes, hashes) at the embedding layer itself — so the vector channel scores them correctly without relying entirely on BM25 backup.
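✦ The hybrid-search backbone is even more important here than for HR. Run BGE-M3 + BM25 and fuse with RRF biased toward BM25, for example by giving the BM25 list a smaller smoothing constant (k=40) than the vector list (k=60, the usual default), so exact-match wins ties.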
✦ Fine-tuning on internal alert pairs is the highest-precision option for mature SOCs — you teach the model that “alert A and alert B are part of the same incident” in your specific environment.

What you'd be wrong to pick

text-embedding-3-small alone. It's cheap, fast, general-purpose — and useless for SecOps. It will treat every CVE ID as semantically near every other CVE ID. You'll triage the wrong incident, sometimes catastrophically. This is one of the rare workloads where “just use OpenAI” is wrong.

⚡

Use case 3 — User Transactions (Real-Time Decisioning)

Domain shape

Structured-ish, short, latency-critical. “User u-8821 viewed product p-1247, last purchase 2026-04-18, segment: high-LTV, cart abandoned 2 hours ago.” The retrieval question is “given this user's recent context, what's the most relevant offer or product to surface?” Latency budget: sub-100ms p99 because this runs on the page-render path. Volume: tens of millions of queries per day.

Recommended embedding

A SMALL, FAST embedding — text-embedding-3-small (1536-dim, ~30ms p50) or self-hosted BGE-small-en-v1.5 (384-dim, sub-10ms p99 on a single GPU). Aggressive prompt and embedding caching is the second half of the answer.

Why this choice

✦ Real-time decisioning has a hard sub-100ms budget, and the embedding call is only one of 4-6 hops in the request path. Saving 50ms on embedding latency lets you afford the reranker that actually drives precision.
✦ The semantic space here doesn't need to be sophisticated — the dimensions that drive the decision (recency, segment, behavior) live in metadata, not in the vector. The vector is for “find similar offer text” or “find similar user-profile narratives.”
✦ Self-hosting BGE-small on a single GPU gives you sub-10ms p99 latency at the cost of operational complexity — worth it at high QPS where the OpenAI bill alone pays for the GPU within weeks.
✦ Cache aggressively: the same user vector doesn't change between page views — embed once, cache for the session. Product descriptions don't change at all — embed at ingest, never re-embed.
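The "embed once, cache for the session" idea can be sketched as a thin wrapper around whatever embedding call you use. This is a minimal in-memory illustration: `EmbeddingCache` and `fake_embed` are hypothetical, there is no TTL or eviction, and the model name is part of the cache key on the assumption that vectors from different models must never be mixed.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache keyed by (model name, text)."""

    def __init__(self, embed_fn, model_name):
        self._embed_fn = embed_fn
        self._model = model_name
        self._store = {}
        self.misses = 0  # number of times the underlying embedder was called

    def get(self, text):
        raw_key = f"{self._model}\x00{text}".encode("utf-8")
        key = hashlib.sha256(raw_key).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def fake_embed(text):
    # Stand-in for a real embedding call; returns a tiny deterministic vector.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed, "toy-embedder-v1")
profile = "segment: high-LTV, cart abandoned 2 hours ago"
first = cache.get(profile)   # calls the embedder
second = cache.get(profile)  # served from the cache
```

In a real deployment the dictionary would typically be replaced by a shared cache with an expiry policy, since a session's context does eventually change.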

What you'd be wrong to pick

text-embedding-3-large at 3072 dimensions. Latency is too high (50-100ms per call) and the precision gain is invisible because the real signal lives in the metadata filter, not in the cosine similarity. You pay 5× the cost and 3× the latency for a quality difference no user perceives.

📈

Models improve every quarter: your architecture must absorb that

Embedding models improve roughly every quarter. text-embedding-ada-002 (Dec 2022) was OpenAI's default for about 13 months. text-embedding-3-large (Jan 2024) raised the MTEB average by roughly 4 points (61.0 to 64.6 in OpenAI's announcement). voyage-3 (September 2024) reported further gains on retrieval benchmarks. BGE-M3 (early 2024) brought multilingual + multi-function to open-source. Each generation outperforms the prior on specific axes: long-document retrieval, multilingual, code, finance, biomedical.

That tempo has a direct architectural consequence: you cannot pick the “best” embedding model once and ship. You need an architecture that lets you re-evaluate every six months and migrate when a meaningfully better model appears. Concretely, three patterns:

🏷️

1. Versioned embeddings

Every chunk in your vector DB carries an embedding_model_version field. Adopting a new model means re-embedding the corpus into a parallel index, running an offline A/B (recall@10, MRR, NDCG on a held-out eval set), and only flipping the read path when the new index demonstrably wins on YOUR domain.

🚦

2. Routing-aware retrieval

The retrieval layer routes different query types to different embedding models. Code queries → code-tuned embeddings. Domain-specific → domain-tuned. General → general. Each route owns its own index. The router is a lightweight intent classifier (small LLM call or rule-based) that runs before vector search.

🧪

3. Offline eval is the gate

Never adopt a new embedding model based on public benchmark scores alone. Build a domain-specific eval set (50-500 query-passage pairs labeled by SMEs from your corpus) and a test harness that measures recall@10, MRR, NDCG. MTEB wins don't always translate to your domain — your eval is the only ground truth.
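The three metrics named in these patterns are short enough to implement directly. This is a minimal sketch assuming binary relevance labels (a chunk is either relevant to the query or not); function names are illustrative, and MRR is simply the mean of `reciprocal_rank` over all queries in the eval set.

```python
import math


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant chunk ids that appear in the top k."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k=10):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# One labeled query: the two relevant chunks were retrieved at ranks 2 and 3.
retrieved = ["c-9001", "c-7321", "c-7322"]
relevant = {"c-7321", "c-7322"}

recall = recall_at_k(retrieved, relevant)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant)
```

Running the same labeled queries against the current index and the candidate index, then comparing the averaged metrics, is the offline A/B described in pattern 1.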

🔀

Model routing — does the LLM router need a matching embedding router?

Most teams already route queries across multiple LLMs (frontier for hard, mid-tier for medium, small for greetings or refusals). The natural follow-up: should we also route across multiple embedding indexes? Three architectures, in order of operational complexity:

Tier 1 — Single index, multi-LLM

One embedding model. One vector DB. One BM25 index. The router only changes which LLM consumes the retrieved chunks. Most production teams start here and stay here for 1-2 years. Simple to operate, sufficient for most domains.

Tier 2 — Hybrid (one general + specialized)

One general-purpose index for the broad case, plus specialized indexes for domains where general embeddings demonstrably fail (security IOCs, code, biomedical). A domain classifier routes the query to the right index. Most enterprises with 3-5 domains land here.

Tier 3 — Multi-index, multi-model

Each domain has its own embedding model and index. Code → code-embed. Security → BGE-M3 with sparse. HR → voyage-3-large. The router decides both LLM and embedding index. Highest precision, highest ops cost. Worth it for 5+ distinct domains with quality bars.

🔑The decision rule

Start with Tier 1. Move to Tier 2 ONLY when you have measured evidence (offline eval set + production logs) that one or more domains are systematically failing the general embedding. Move to Tier 3 ONLY when you have 5+ distinct domains, each with its own eval harness and SMEs. Most teams who jump straight to Tier 3 spend months on operational pain for precision gains they could have gotten with a better reranker.
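A Tier 2 router can start as a handful of rules before it ever becomes a classifier. The sketch below is one hypothetical rule-based version: the regular expressions and the index names `security-index` and `general-index` are assumptions for illustration, and a production router would be tuned against the offline eval set.

```python
import re

# Patterns that suggest a security-operations query (illustrative, not exhaustive).
SECURITY_PATTERNS = [
    re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),  # CVE identifiers
    re.compile(r"\bT\d{4}(?:\.\d{3})?\b"),               # MITRE ATT&CK ids such as T1059
    re.compile(r"\b[a-fA-F0-9]{32,64}\b"),               # strings shaped like file hashes
]


def route_index(query):
    """Send security-shaped queries to the specialized index, the rest to general."""
    if any(pattern.search(query) for pattern in SECURITY_PATTERNS):
        return "security-index"
    return "general-index"


cve_route = route_index("Is CVE-2024-1234 exploitable in our environment?")
hr_route = route_index("What's our parental leave policy for California employees?")
attack_route = route_index("Show alerts mapped to T1059 this week")
```

Starting with rules keeps the router cheap and explainable; logging which rule fired per query also produces the production evidence the decision rule asks for before adding more indexes.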

🤖

LLM selection — the second half of RAG quality

Most RAG conversations focus on the embedding model. But the LLM that actually answers the question shapes RAG outcomes just as profoundly. The chunks you retrieve are only half the story — the other half is whether the LLM uses them faithfully.

Faithfulness

Frontier models (Claude Sonnet 4.5, GPT-5) cite the chunks. Smaller models (Haiku, GPT-4o-mini) are more likely to ignore retrieved context and answer from training data when the chunks don't perfectly align. For high-stakes RAG (HR compliance, security), faithfulness matters more than fluency.

Long-context attention

When you stuff 20+ chunks into a prompt, models suffer “lost in the middle”: chunks at the start and end get more attention than the middle. Frontier models with native long context (on the order of 1M tokens for some Claude 4.x and Gemini 2.5 Pro configurations) tend to handle this better than older models with retrofitted context expansion.

Refusal calibration

A high-faithfulness model says “I don't have enough information” when chunks are insufficient. A low-faithfulness model fabricates. For regulated domains (compliance, finance, security), refusal calibration is the difference between a useful product and a liability.

Cost-per-RAG-call

A typical RAG call sends 5-10K tokens of context. At Claude Opus pricing ($15/M input), 1M queries/month at the 10K-token end is 10B input tokens ≈ $150K/month. Same volume on Haiku 4.5 ($1/M) ≈ $10K. At 5K tokens per call both figures halve, and output tokens are extra. The cost gap forces a routing decision: use the frontier model for the queries where faithfulness matters; cheap model for everything else.
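The cost figures can be reproduced with one line of arithmetic. This sketch counts input tokens only and takes the per-million prices as quoted in the text; they are inputs to the calculation, not verified list prices.

```python
def monthly_input_cost(queries_per_month, context_tokens, price_per_million_usd):
    """Input-token spend per month; output tokens are not included."""
    total_tokens = queries_per_month * context_tokens
    return total_tokens / 1_000_000 * price_per_million_usd


QUERIES = 1_000_000

# Upper end of the 5-10K context range.
frontier_high = monthly_input_cost(QUERIES, 10_000, 15.0)
small_high = monthly_input_cost(QUERIES, 10_000, 1.0)

# Lower end of the range.
frontier_low = monthly_input_cost(QUERIES, 5_000, 15.0)
small_low = monthly_input_cost(QUERIES, 5_000, 1.0)
```

The ratio between the two models stays at 15× regardless of context size, which is why routing by query difficulty usually saves more than trimming context does.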

Tool-calling for agentic RAG

Not all LLMs are equally good at tool use. Claude and GPT-5 are excellent. Smaller open-source models often need fine-tuning to reliably emit valid tool-call JSON. If you're doing agentic RAG (LLM decides when to retrieve), the LLM choice matters more than for naive RAG.

Streaming + citation injection

Some models stream tokens cleanly while citing sources mid-response (Claude is particularly strong here). Others batch-emit citations at the end, which is uglier in UI. Test the citation pattern in your specific UX before committing — this is a product-quality decision, not just a model-quality one.

💬The compound effect

Embedding model quality determines whether the right chunks are retrieved. LLM quality determines whether those chunks are used faithfully. As an illustration (round numbers, not measurements): a 90%-correct embedding paired with a low-faithfulness LLM might produce 60%-correct answers, while a 70%-correct embedding paired with a high-faithfulness LLM might produce 65%, but with honest “I don't know” refusals on the missing 30%, which is what compliance demands. The pairing matters more than either model in isolation.

❓Misconceptions cleared

A short catalog of common confusions about how RAG systems work. Each one is answered in one or two sentences — the kind of quick clarity that separates someone who has shipped RAG from someone who has only read about it.

Who decides what to retrieve — the LLM or the vector database?

In naive RAG: neither, really. The application code decides (by writing the pipeline), the embedding model decides (by defining similarity), and the vector DB runs math. The LLM is a passive consumer of whatever was handed to it.

In agentic RAG: the LLM decides when and how to call the retrieval tool — but the vector DB still just runs math under the hood.

Does the embedding model need to match the LLM? Can I use any LLM with any embedding model?

No to the first, yes to the second: they are completely decoupled. The embedding model is used to convert your chunks (at ingest) and your query (at retrieval) into vectors for similarity search. The LLM only ever receives the original TEXT of the retrieved chunks, never the vectors themselves. You can pair OpenAI embeddings with Anthropic Claude, Voyage embeddings with GPT-5, or BGE-M3 with Gemini. Any combination works.

The ONE matching rule: the embedding model used at ingest must be the same model and same version as the one used at query time. Different embedding models produce vectors in incompatible geometric spaces, so switching means re-embedding the entire corpus. The LLM choice is independent of all of this — it just reads text.
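A cheap guard for the matching rule is to record which embedding model built the index and refuse to query with anything else. A minimal sketch, assuming you store this metadata alongside the index; EmbeddingSpec, check_compatible, and the field names are illustrative, not part of any vector database's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Identity of the embedding model that produced a set of vectors."""
    model: str
    version: str
    dimensions: int


class EmbeddingMismatchError(ValueError):
    """Raised when query-time and ingest-time embedding models differ."""


def check_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    # Dataclass equality compares model, version, and dimensions together.
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index was built with {index_spec}, but queries use {query_spec}; "
            "re-embed the corpus before switching models"
        )
```

Calling check_compatible once at service startup turns a silent relevance collapse into a loud configuration error.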

When the LLM gets the retrieved chunks, does it receive vectors or text?

Plain text only. The vector DB typically stores BOTH the vector and the original text of each chunk on the same row. The vector is the search index used to find the chunk; the original text is the payload. At generation time, the app server retrieves the chunks by vector similarity, looks up their original text, and inserts that text into the LLM's prompt as ordinary prose. The LLM has no concept of vectors at all; to it, the retrieved chunks just look like context the application decided to include.
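The split between index and payload can be made concrete with a toy row type and a prompt builder. This is a sketch under assumptions: ChunkRow, its fields, and the prompt wording are made up for illustration, and a real schema would carry more metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkRow:
    chunk_id: str
    vector: List[float]  # used only for similarity search
    text: str            # the payload that reaches the LLM
    source: str


def build_prompt(question: str, rows: List[ChunkRow]) -> str:
    """Assemble the LLM prompt from chunk text; vectors are never included."""
    context = "\n\n".join(
        f"[{i}] ({row.source}) {row.text}" for i, row in enumerate(rows, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Nothing in build_prompt touches row.vector, which is the whole point: by the time the LLM is called, the vectors have already done their job.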

Is the vector database an LLM? Do I need both?

No and yes. A vector database is a specialized data store that runs approximate nearest-neighbor math over high-dimensional vectors. It has no language understanding, no reasoning, and no ability to generate text. You need both: the vector DB holds your documents, the LLM answers the question, and a pipeline connects them.

Why do I need an embedding model if I already have an LLM?

Because the LLM can't search billions of documents in its head. The embedding model's job is to convert every document (and every query) into a vector so that cheap math — cosine similarity — can find the few relevant chunks. The LLM then reasons over those chunks. Trying to “just feed everything to the LLM” fails on context length, cost, and latency.
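The "cheap math" is small enough to write out. Below is a brute-force sketch in plain Python: it scores every chunk against the query and keeps the best k. Production vector databases use approximate nearest-neighbor indexes instead of scanning everything, so treat this as the definition of the operation, not how it is implemented at scale. The function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda chunk_id: cosine_similarity(query_vec, chunk_vecs[chunk_id]),
        reverse=True,
    )
    return ranked[:k]
```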

Does the LLM know what's in my knowledge base?

No. The LLM only sees what's in the current prompt. If you retrieved 5 chunks and put them in the prompt, the LLM knows about those 5 chunks — and nothing else. It has no persistent awareness of your corpus, and it can't “look something up” unless you give it a tool to do so (that's agentic RAG).

When does retrieval happen — every token, every turn, or once per conversation?

In naive RAG, retrieval happens once per user turn, right before the LLM is called. In agentic RAG, it happens zero or more times per turn — the LLM decides. It never happens per-token (that would be absurdly expensive).
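The difference in control flow fits in a few lines. In this sketch, retrieve, decide, and generate are hypothetical callables standing in for the search pipeline, the LLM's tool-use decision, and the LLM's final answer; max_calls is an assumed safety cap, not something a particular framework mandates.

```python
from typing import Callable, List, Optional

Retrieve = Callable[[str], List[str]]
Generate = Callable[[str, List[str]], str]
# Returns a search query to run next, or None when the model is ready to answer.
Decide = Callable[[str, List[str]], Optional[str]]


def naive_turn(question: str, retrieve: Retrieve, generate: Generate) -> str:
    """Naive RAG: the application retrieves exactly once per user turn."""
    chunks = retrieve(question)
    return generate(question, chunks)


def agentic_turn(question: str, retrieve: Retrieve, decide: Decide,
                 generate: Generate, max_calls: int = 3) -> str:
    """Agentic RAG: the model chooses zero or more retrievals per turn."""
    chunks: List[str] = []
    for _ in range(max_calls):
        query = decide(question, chunks)
        if query is None:
            break  # the model decided it has enough context
        chunks.extend(retrieve(query))
    return generate(question, chunks)
```

Note that retrieve is the same function in both cases. Only the caller changes: application code in naive_turn, the model's decision in agentic_turn.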

If the retrieved chunks are irrelevant, does the LLM know to ignore them?

Not reliably. LLMs tend to use whatever context they're given, even when it doesn't answer the question — a phenomenon called “context contamination.” This is precisely why the reranker step matters: if you hand the LLM a clean top-5, you avoid this problem. If you hand it a noisy top-50, it may confidently cite irrelevant material.
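The shape of the rerank-and-trim step looks roughly like this. The scorer is passed in as a function; overlap_score below is a deliberately crude word-overlap stand-in so the example runs without a model, whereas a real system would typically plug in a cross-encoder or a hosted rerank API. The keep and min_score defaults are assumptions to tune against your own eval set.

```python
from typing import Callable, List


def rerank(query: str,
           candidates: List[str],
           score: Callable[[str, str], float],
           keep: int = 5,
           min_score: float = 0.0) -> List[str]:
    """Score each candidate against the query, drop weak ones, keep the best."""
    scored = [(score(query, chunk), chunk) for chunk in candidates]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def overlap_score(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)
```

The min_score threshold matters as much as keep: returning two strong chunks is usually better than padding the prompt out to five with weak ones.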

What's the difference between RAG, fine-tuning, and long context?

RAG inserts relevant knowledge into the prompt at query time. Great for facts that change (policies, pricing, docs). Cheap, updatable.

Fine-tuning updates the model's weights with your data. Great for style, tone, format adherence. Bad for facts that change. Expensive.

Long context shoves everything into the prompt without retrieval. Works for small corpora, fails on latency and cost above ~100K tokens, and still suffers from “lost in the middle” attention problems.

Is hybrid search running two separate searches on every query?

Yes. Vector search and BM25 run in parallel, then their results are fused with a technique called Reciprocal Rank Fusion (RRF). Both searches typically hit the same data store (e.g., Postgres with pgvector + tsvector), so the overhead is one extra query against that store, not a round trip to a second system.
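RRF itself is only a few lines: each document earns 1 / (k + rank) from every list it appears in, and the sums are sorted. A minimal sketch follows; k = 60 is the conventional default constant, and the tie-break on document id is an assumption added here to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]  # ranked by vector similarity
bm25_hits = ["b", "d", "a"]    # ranked by keyword score
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Because RRF uses ranks and not raw scores, it does not care that cosine similarities and BM25 scores live on different scales. Documents that appear in both lists ("a" and "b" here) rise above documents that appear in only one.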

If the vector database is down, can the LLM still answer?

It depends on how you designed the pipeline. A robust production system degrades to BM25-only search if the vector DB is down, and degrades further to “I'm unable to look that up right now” if both are down. The LLM itself never fails because of the vector DB — it just has less context to work with.
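The degradation ladder can be sketched as a wrapper around the two searches. Everything here is illustrative: vector_search, keyword_search, and generate are hypothetical callables, and the broad except clause is a simplification. A production version would catch specific connection and timeout errors and log each failure.

```python
from typing import Callable, List, Tuple

Search = Callable[[str], List[str]]

FALLBACK_MESSAGE = "I'm unable to look that up right now."


def retrieve_with_fallback(query: str, vector_search: Search,
                           keyword_search: Search) -> Tuple[str, List[str]]:
    """Run both searches, tolerate failures, and report which mode was used."""
    results = {}
    for name, search in (("vector", vector_search), ("keyword", keyword_search)):
        try:
            results[name] = search(query)
        except Exception:
            continue  # simplification: treat any error as "this backend is down"
    if "vector" in results and "keyword" in results:
        mode = "hybrid"
    elif "keyword" in results:
        mode = "keyword_only"
    elif "vector" in results:
        mode = "vector_only"
    else:
        mode = "unavailable"
    # Merge hits and de-duplicate while preserving order.
    merged = [chunk for hits in results.values() for chunk in hits]
    return mode, list(dict.fromkeys(merged))


def answer(query: str, vector_search: Search, keyword_search: Search,
           generate: Callable[[str, List[str]], str]) -> str:
    mode, chunks = retrieve_with_fallback(query, vector_search, keyword_search)
    if mode == "unavailable":
        return FALLBACK_MESSAGE  # never call the LLM with no grounding
    return generate(query, chunks)
```

Returning the mode alongside the chunks makes it easy to emit a metric, so a quiet slide into keyword-only operation shows up on a dashboard and not just in answer quality.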

Why does chunk size matter?

Because embeddings average semantic meaning across the whole chunk. A 100-page PDF embedded as one vector is a muddy average of 100 topics — useless. A 10-word chunk is too narrow to contain an answer. Production systems usually land on 500–1000 tokens with 10% overlap between adjacent chunks, tuned against an offline eval set.
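Fixed-size chunking with overlap is a sliding window. The sketch below works on any list of tokens; the defaults (800 tokens, 10% overlap) are just one point inside the range mentioned above, and in practice the tokens would come from the embedding model's own tokenizer instead of the whitespace split you might use for a quick test.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 800,
                 overlap_ratio: float = 0.10) -> List[Sequence[T]]:
    """Split tokens into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = max(chunk_size - overlap, 1)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

For a 2,000-token document with the defaults, this yields three chunks starting at tokens 0, 720, and 1440, with each adjacent pair sharing 80 tokens. Real pipelines often layer structure-aware splitting (headings, paragraphs) on top of this, but the window arithmetic stays the same.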

🎯The one-sentence summary

🔑If you remember nothing else

In naive RAG, the application code decides the pipeline, the embedding model decides similarity, the vector DB runs math, and the LLM only sees what the code handed it. In agentic RAG, the LLM gets a retrieval tool and decides when to call it — but the vector DB, embedding model, and reranker still do the actual finding. Either way, the four hidden deciders (embedding model, metadata filter, BM25, reranker) matter more than the vector DB itself.

🚀Now you're ready for the vertical case studies

This page is the generic anatomy — the template every vertical case study is built on. The vertical pages show how the same 15-step pipeline is specialized for a specific domain with domain-specific components, permissions, evaluation, and gotchas.

👥

HR Knowledge Base →

Access control is the hard part. Every retrieved chunk must be filtered by the asker's jurisdiction, role, and reporting chain — before the vector search runs.

🗂️

Enterprise RAG Hub →

Back to the hub to see all vertical case studies — HR, customer transactions, educational content, and more verticals added as new domains become relevant to the work.

👥

Next in the learning journey

Vertical case studies — HR Knowledge Base →

You now understand the generic pipeline. The next step is seeing how it specializes for a real industry. HR Knowledge Base is the first shipped vertical — with the filter-first retrieval pattern, jurisdictional versioning, and access control story that most naive implementations get wrong. The same 15 steps, but specialized for the compliance and permission requirements of a real HR deployment.
