Back to AI/ML Overview
Enterprise RAG β€” Anatomy

When you press Enter, who actually decides what gets retrieved?

Most tutorials show one three-box diagram and stop. This page is the walkthrough I wish existed when I was learning β€” the full production pipeline with , , , and β€” one end-to-end example query, 15 numbered steps, and a clear answer to who decides what at every hop. It's the page to read before the vertical case studies.

Architecture + Sequence15-step walkthroughNaive vs AgenticMisconceptions cleared
πŸ›€οΈRAG Learning Pathβ€”Read in order to build a production RAG system
Not building a RAG system? The Model Committee deep-dive is a parallel track covering the eight specialized model families and routing patterns β€” read it after Foundations instead of RAG Anatomy if model composition is what you're after.

🧭The question behind the question

Ask a room of engineers β€œhow does work” and you'll get three boxes on a whiteboard: a , an , and an arrow between them. That diagram is not wrong β€” it's just so compressed that it hides every decision that matters.

The question I keep getting asked β€” by recruiters, by candidates, by anyone building their first system β€” is deceptively simple:

πŸ’¬The clarifying question

When you use a for a custom knowledge base, who actually decides what to retrieve? Is it the ? Is it the ? Something else?

The honest answer is: neither of them, in the sense people usually mean. Retrieval is a pipeline of mechanical operations, and different components make different decisions at different points. The never decides anything cognitive β€” it runs math. The may or may not decide, depending on which architecture you picked. The real deciders are components most tutorials barely mention: the model, the metadata filter, and the .

The rest of this page is a rigorous walkthrough of exactly who decides what, using one canonical query traced through a real production architecture.

🎯The canonical query

Example query we will trace end-to-end
β€œWhat's our parental leave policy for California employees?”
This example is chosen deliberately. It exercises metadata filtering (CA vs other states), exact-match terminology (β€œparental leave” is a specific HR term), policy freshness (was this updated recently?), and has a clear wrong-answer-equals-compliance-problem stake. It's the kind of query where every step of the pipeline matters.

🎭The cast of actors

Before we walk the flow, meet every component that participates. Pay attention to the β€œdecides” column β€” this is the layer the conventional diagrams compress away.

πŸ‘€
User
HR employee
Chooses the question wording. That's their entire contribution β€” everything downstream is deterministic given this input.
πŸ’»
Browser / Client
Next.js UI
Sends the message and renders the streamed response. Stateless β€” it doesn't know anything about retrieval.
βš™οΈ
App Server / Orchestrator
Next.js API + FastAPI
Decides the pipeline topology. Owns every hop. Receives the query, scopes the retrieval filter, coordinates parallel calls, assembles the prompt. This is the conductor.
πŸ”
Auth Service
Okta / Auth0 / internal SSO
Decides whether the session is valid. First security boundary β€” no downstream calls run if auth fails.
πŸ‘₯
HRIS
Workday / BambooHR / Rippling
Source of employee attributes used for filter scoping (work state, department, employment type). Usually cached in the app DB for latency.
🧬
Embedding Model
text-embedding-3-small / voyage-3
Decides what counts as 'semantically similar.' This is the hidden decider most people forget exists. Chosen at design time β€” long before any user shows up.
πŸ—ƒοΈ
Vector Database
pgvector / Qdrant / Weaviate
Runs approximate nearest-neighbor math. Decides NOTHING cognitive. Given a query vector and a filter, it returns the top-k nearest candidates by cosine similarity. That's a math operation, not a decision.
πŸ”
BM25 Index
Postgres tsvector / Elasticsearch
Exact-keyword scoring. Decides based on term frequency, inverse document frequency, and document length β€” no semantic understanding at all. Runs in parallel with vector search.
βš–οΈ
Reranker
Cohere Rerank-v3 / BGE-reranker
A SECOND machine-learning model that cross-encodes each (query, chunk) pair and re-scores them. This is where retrieval actually gets precise. Most of your top-5 quality comes from this step, not the vector DB.
πŸ›‘οΈ
Guardrails
PII scrub + refusal filter
Pre-assembly (redact PII before embedding) and post-generation (scrub PII from the LLM response). Non-blocking but critical for compliance.
πŸ€–
LLM
Claude Sonnet 4.5
In naive RAG: decides only the OUTPUT TEXT given the handed-in context. In agentic RAG: additionally decides when and how to call the retrieval tool.
πŸ“œ
Audit Log
append-only Postgres table
Records who asked what and which chunks were returned. Decides nothing β€” pure observability. Must be non-blocking: a log failure must NEVER block a user response.

πŸ—οΈThe full architecture β€” take it in at a glance

This is the β€œoh, that's how it works” diagram. Trace the numbered arrows from 1 to 15 in sequence and you'll see the whole request path without reading a single word of prose. The components are grouped into five bands by architectural responsibility β€” user, application, identity, retrieval, generation β€” and the retrieval layer has all four pieces (, , keyword index, ) visually clustered to reinforce that they are one subsystem, not four unrelated services.

Enterprise RAG β€” Full Architecture (Naive RAG)
USER LAYERAPPLICATION LAYERIDENTITY & CONTEXTRETRIEVAL LAYERGENERATION LAYERπŸ‘€UserHR employeeπŸ’»BrowserNext.js chat UIβš™οΈApp Server / OrchestratorNext.js API + FastAPIπŸ›‘οΈGuardrailsPII scrub + refusal filterπŸ”Auth ServiceOkta / Auth0πŸ‘₯HRISWorkday / Rippling🧬Embedding Modeltext-embedding-3-smallπŸ—ƒοΈVector DBpgvector / QdrantπŸ”BM25 IndexPostgres tsvectorβš–οΈRerankerCohere Rerank-v3πŸ€–LLMClaude Sonnet 4.5πŸ“œAudit Logappend-only123456789101112131415LEGENDRequest pathParallel callResponse pathObservability (non-blocking)NStep number β€” follow 1β†’15 to trace the full flow
πŸ”‘Read the bands, not just the arrows
The reason this diagram uses layered bands instead of a linear left-to-right flow is to encode architectural thinking. β€œRetrieval Layer” is a concept β€” grouping the four retrieval components visually teaches that + + + are one subsystem. Candidates who understand this intuitively are the ones who can actually debug production . Candidates who think of them as four independent services usually can't.

⏱️The sequence β€” trace every step in time order

Where the architectural diagram above encodes structure, this sequence diagram encodes time. Each vertical lane is one actor; time flows top to bottom. The numbered arrows are the same 15 steps β€” but now you can see exactly who calls whom in what order, which calls happen in parallel, and where the self-loops on the App Server lane represent internal work (fusion, prompt assembly, post-processing).

Enterprise RAG β€” Sequence Diagram (time flows top to bottom)
πŸ‘€UserπŸ’»Browserβš™οΈAppπŸ”AuthπŸ‘₯HRIS🧬EmbedπŸ—ƒοΈVectorπŸ”BM25βš–οΈRerankπŸ€–LLM1types question2POST /api/chat3validate session4fetch attributes5embed query6ANN search (with filter)7BM25 keyword search8fuse with RRF9rerank top-5010return top-5 chunks11assemble prompt12generate answer13stream tokens14post-process + audit15stream to userPurple arrows = request path (1–12) β€’ Amber arrows = response path (13–15) β€’ Self-loops = internal app work
πŸ’‘Why both diagrams?
The architectural diagram is for the 10-second gut-check (β€œdo I understand the shape”). The sequence diagram is for the 2-minute trace (β€œdo I understand the order”). A portfolio page without both is missing one of the two ways senior engineers evaluate system designs.

πŸ“‹The 15 steps, in detail

Now the detailed walk. Each step lists the actor, what goes in, what comes out, who decides, and what happens when it fails. The β€œwhy this matters” note appears on the steps where the common mental model diverges from what actually happens.

1

User submits the question

User β†’ Browser
Input
Typed text: "What\u2019s our parental leave policy for California employees?"
Output
Nothing yet β€” the message lives in the client state
Who decides
The user (word choice is the only decision)
Fail mode
Empty string β€” client-side validation rejects before dispatch
2

Browser POSTs to /api/chat

Browser β†’ App server
Input
JSON body with the message + session_id + httpOnly session cookie
Output
HTTP 200 with a Server-Sent Events stream for token streaming
Who decides
App code defines the endpoint contract
Fail mode
Network error β†’ client shows retry UI; no retrieval has happened so no state to clean up
3

App server validates the session

App server β†’ Auth service
Input
Session token from the httpOnly cookie
Output
User identity: user_id, email, groups, access claims
Who decides
Auth service decides if the token is valid; app code decides this endpoint requires auth
Fail mode
401 Unauthorized β†’ client prompts re-login. No downstream calls run. This is the first security boundary.
β˜… Why this matters
Most naive architectures put auth last. In enterprise RAG, auth must be FIRST β€” every downstream step (especially retrieval) depends on knowing who the asker is.
4

App server fetches employee attributes for filter scoping

App server β†’ HRIS (or cached user table)
Input
user_id = u-8821
Output
work_state: CA, employment_type: FTE, department: Engineering, hire_date
Who decides
App code decides WHICH attributes are needed to scope the retrieval filter
Fail mode
HRIS outage β†’ fall back to 'global policies only' (safer to under-serve than leak)
β˜… Why this matters
This is the step most tutorials skip. Retrieval without pre-filtering is a permissions disaster. The reader must see that filtering is established BEFORE the vector search, not after.
5

App server embeds the query text

App server β†’ Embedding model
Input
Plain text: "What\u2019s our parental leave policy for California employees?"
Output
1,536-dim float32 vector
Who decides
The embedding model β€” chosen at design time β€” defines the semantic space. If you picked a general-purpose model, 'parental leave' and 'maternity benefits' are nearby; if you picked a finance-tuned model, they may not be.
Fail mode
Embedding API outage β†’ fall back to BM25-only search (degraded but functional)
β˜… Why this matters
Notice: the LLM has not been touched yet. The vector DB has not been touched yet. The 'similarity' decision is happening right now, at Step 5, inside a completely separate model that most people don't even think about. If you picked the wrong embedding model, steps 6-15 cannot compensate.
6

App server runs vector search WITH the pre-scoped filter

App server β†’ Vector DB
Input
query_vector + top_k: 50 + filter: { doc_type: policy, applies_to: [global, CA], effective_date: <= today }
Output
50 chunks with cosine-similarity scores, ordered high-to-low
Who decides
The vector DB runs ANN math (HNSW or IVF) β€” decides NOTHING cognitive. The filter decides which chunks are even candidates, and that filter came from Step 4.
Fail mode
DB timeout β†’ partial results or retry; app degrades to BM25 on second failure
β˜… Why this matters
This is the step people mean when they say 'retrieval,' but 80% of the correctness is coming from the filter β€” not the similarity math. The filter is the real hero; the vector search is fast first-pass arithmetic.
7

App server runs BM25 keyword search in parallel

App server β†’ BM25 index
Input
Raw query text + the same metadata filter as Step 6
Output
50 chunks scored by term frequency / inverse document frequency
Who decides
BM25 formula decides purely on word overlap and document length normalization
Fail mode
Skip if index unavailable; vector results are still usable
β˜… Why this matters
If someone asks "what\u2019s our PL policy?" (using the internal acronym PL for parental leave), the embedding model may not map "PL" to parental leave, but BM25 will exact-match any document containing "PL". This is why vector-only search is a trap for enterprise corpora full of acronyms.
8

App server fuses vector + BM25 results

App server (pure code, no external call)
Input
Two ranked lists of 50 chunks each
Output
One fused list (~70 unique chunks after dedup) scored by Reciprocal Rank Fusion
Who decides
The fusion formula β€” score(doc) = Ξ£ 1 / (k + rank_i(doc)) across both lists
Fail mode
Deterministic math, no failure mode
9

App server sends top 50 fused candidates to the reranker

App server β†’ Reranker model
Input
Query text + 50 candidate chunks as text
Output
50 chunks with precision scores, reordered by cross-encoder relevance
Who decides
The reranker is a SECOND machine-learning model β€” specifically a cross-encoder that looks at each (query, chunk) pair TOGETHER and scores them. This is fundamentally different from the embedding model, which looks at query and chunk separately.
Fail mode
Reranker timeout β†’ fall back to fused top-K without rerank (degraded precision, still usable)
β˜… Why this matters
This is where retrieval actually gets precise. The embedding model is a fast, imprecise first pass over millions of chunks. The reranker is a slow, precise second pass over the top 50. Most production RAG systems live or die on reranker quality.
10

App server selects top N reranked chunks (typically 5–8)

App server (pure code)
Input
50 reranked chunks
Output
Top 5 chunks + their metadata (source URL, doc title, section, effective date)
Who decides
App code's configured N, tuned against an offline eval set
Fail mode
N/A
11

App server assembles the prompt

App server (pure code)
Input
User question + top 5 chunks + versioned system prompt template
Output
Final messages array ready for the LLM API
Who decides
The prompt template β€” chosen at design time, typically versioned and evaluated in an offline harness
Fail mode
Context length overflow β†’ truncate oldest chunks or summarize older turns
β˜… Why this matters
The reader must see that the LLM is about to receive a PRE-BUILT prompt. It didn't see the chunks before this moment and it can't ask for different ones.
12

App server calls the LLM

App server β†’ LLM API
Input
messages array: [system prompt, user message with embedded context]
Output
Streaming text tokens
Who decides
The LLM decides only the OUTPUT TEXT, conditioned on the prompt. It has no ability to say 'this context is wrong, give me different chunks.'
Fail mode
LLM API error β†’ generic 'please try again' + log for investigation
β˜… Why this matters
This is the step where most people think the LLM is "doing retrieval." It isn\u2019t. The LLM received a prompt that included 5 chunks of text. It has no idea those chunks came from a vector database β€” they look like prose in the prompt. It has no handle to query more. All the "deciding what to retrieve" happened in Steps 4-10, none of it was the LLM.
13

App server post-processes the LLM output through guardrails

App server (pure code + optional guardrail model)
Input
Streamed LLM tokens
Output
Cleaned response with inline citations linked back to source chunks
Who decides
Post-processing rules: PII scrub, refusal detection, citation injection
Fail mode
PII leak detected β†’ substitute safe fallback; alert on-call
14

App server writes the audit log

App server β†’ Audit log
Input
user_id, query, retrieved_chunk_ids, llm_response, timestamp, latency_ms
Output
Confirmation (or dead-letter queue if log is down)
Who decides
Logging policy β€” which fields are logged, retention, PII handling
Fail mode
Log failure must NOT block the user response. Critical design invariant: audit is best-effort, never blocking.
15

App server streams the response to the browser

App server β†’ User
Input
Cleaned tokens + final citations
Output
Rendered in the chat UI with clickable source links
Who decides
UI rendering rules
Fail mode
Connection drop β†’ client retry logic replays from the last received token

🎯Agentic RAG β€” what changes when the LLM takes the driver's seat

Everything above is naive β€” a pre-programmed pipeline where the is the last step and plays no role in deciding what gets retrieved. In , the is placed at the top of the call stack and given the retrieval pipeline as a tool. It decides when to call it, what query to pass, and β€” crucially β€” whether it needs to call it again after seeing the first results.

Agentic RAG β€” the same pipeline, re-wired with the LLM at the top
πŸ€–LLM as OrchestratorDecides when & how to retrieve(tool = search_knowledge_base)5atool_use: search("CA parental")10atool_result: top-5 chunksRetrieval Pipeline (unchanged)🧬 Embed β†’ πŸ—ƒοΈ Vector DB β†’ πŸ” BM25 β†’ βš–οΈ RerankerExact same steps 5-10 as naive RAG β€” the vector DB doesn't knowthe caller is now an LLM instead of app code.10bmulti-hopre-query ifnot enoughcontextWHAT ACTUALLY CHANGED FROM NAIVE RAGβ€’ App code no longer sequences retrieval. The LLM does, via a declared tool.β€’ Multi-hop is free: LLM inspects tool_result, decides if it needs another search with a refined query.

The important insight: isn't a different retrieval system. It's the same retrieval system (same model, same , same , same β€” steps 5-10 are literally unchanged) with an placed at the top of the call stack. The doesn't know the caller is now an instead of app code, and doesn't care.

What adds
  • ✨ chooses the query phrasing β€” often better than the user's literal words
  • ✨ chooses the filter values β€” e.g., recognizes β€œCA” from context and sets applies_to: [CA]
  • ✨ Multi-hop retrieval β€” can issue a second, refined query if the first batch was insufficient
  • ✨ No retrieval when unneeded β€” can answer greetings without touching the
The trade-offs
  • ⚠️ Higher latency β€” every adds a round trip
  • ⚠️ Harder β€” the retrieval path is no longer deterministic
  • ⚠️ can over-search β€” without good system prompts, some models call the tool on every turn
  • ⚠️ Cost grows with complexity β€” multi-hop conversations pay the cost per hop

πŸ”‘The four hidden deciders most tutorials undersell

If you were only allowed to remember four things from this page, remember these. These are the components that actually determine whether your retrieval is precise, and they're the ones most tutorials treat as footnotes.

🧬

1. The model

Defines what β€œsimilar” means in your whole system. Chosen once at design time, baked into every you've indexed. Switching it means re- your entire corpus. Choose wrong and no amount of can save you.

🎚️

2. The metadata filter

The filter applied to the vector search before runs. Owns permission scoping and freshness filtering. In enterprise this is where access control lives β€” filter failure equals data leak. The is dumb; the filter is the brain.

πŸ”

3.

The lexical counterweight to . Catches exact-match cases (acronyms, error codes, proper nouns, SKU numbers) where famously fuzz. Running in parallel with vector search is table-stakes in production; vector-only is a beginner mistake on enterprise corpora.

βš–οΈ

4. The

A second, slower, more precise ML model that cross-encodes each (query, ) pair. The reason you can afford a slow model here is that you're only running it on the top-50 candidates the first pass surfaced. Most of your production top-5 precision comes from this step β€” not the .

πŸ”—The handoff β€” embedding model and LLM are decoupled

The single most common point of confusion in RAG: 'does my embedding model need to match my LLM?' Short answer: no. Here's why.

A confusion I hear constantly: people assume that because the converts text to vectors, and the also processes text, the two must somehow β€œspeak the same language” β€” like you'd need OpenAI to use OpenAI , or Cohere to use a Cohere . This is false. models and live in entirely separate parts of the pipeline and never see each other's outputs.

πŸ”‘The rule, in one sentence

The used at INGEST time and the model used at QUERY time must be the same model and same version. The that receives the retrieved is independent of both β€” pair any with any model.

What gets stored, and what gets passed to the

The piece most people miss: your stores BOTH the vector AND the original text of each . The vector is the index used to find the . The original text is the payload sent to the at generation time. The never sees vectors at all β€” it only ever reads plain text.

What lives in your (one row per )
chunk_id
vector (used to find the )
original text (sent to the )
c-7321
[0.014, -0.221, 0.078, ... ] (1024 floats)
β€œEligible employees may take up to 12 weeks of unpaid leave under FMLA...”
c-7322
[0.018, -0.215, 0.080, ... ] (1024 floats)
β€œCalifornia provides additional paid family leave benefits via SDI...”
c-7323
[-0.005, 0.142, -0.063, ... ] (1024 floats)
β€œEmployees on parental leave continue to accrue PTO at their normal rate...”

The vector and the text are both stored on the same row. The vector finds the (via cosine similarity); the text is what gets handed to the . Some teams put the text in a metadata field on the vector record; others store it in a separate document table and look it up by chunk_id. Either pattern works β€” what matters is that the original text is preserved and is what the ultimately reads.

The two-stage flow, with the matching rule highlighted

Stage 1 β€” Ingest (one-time per )
text
β€œEligible employees may take 12 weeks of unpaid leave...”
↓
voyage-3-large
model (chosen once at design time)
↓
vector
[0.014, -0.221, 0.078, ...]
↓
stored alongside the original text
↓
row
{chunk_id: c-7321, vector: [...], text: "Eligible..."}
Stage 2 β€” Query (every user turn)
User question
β€œWhat is parental leave?”
↓
voyage-3-large
⚠ SAME model as ingest β€” only matching rule
↓
Query vector β†’ β†’ top match
chunk_id: c-7321 (cosine 0.91)
↓
look up the ORIGINAL TEXT of c-7321
↓
Prompt to (plain text only)
user: What is parental leave?
context: β€œEligible employees...”
↓
Claude Sonnet 4.5
ANY β€” never sees vectors
↓
Response
β€œEligible employees may take up to 12 weeks...”

Three things to notice from the flow above

πŸ”’

1. Same model at ingest AND query

Different models produce vectors in incompatible geometric spaces. text--3-large vectors and BGE-M3 vectors aren't comparable β€” they live in different β€œuniverses.” Switching the model means re- the entire corpus. This is the only matching rule in the whole pipeline.

πŸ“¦

2. stores BOTH vector and text

The vector is the search index β€” it's used to find the . The original text is the payload β€” it's what you pass to the . Most teams put the text in a metadata field on the vector record, or look it up by chunk_id from a separate document store. The text is never lost.

πŸ€–

3. is fully independent

The receives only TEXT β€” the user's question and the original text of the retrieved . It has no awareness of vectors, no awareness of which model was used, no ability to query the itself. Pair any with any model.

Concrete combinations that work in production

Every one of these is a valid production pairing. What works for your system depends on cost, latency, and quality β€” not on whether the model and share a vendor:

  • βœ… OpenAI text--3-large + Anthropic Claude Sonnet 4.5 β€” mix vendors freely; this is the most common production pairing
  • βœ… voyage-3-large + OpenAI GPT-5 β€” specialty from one vendor, from another
  • βœ… BGE-M3 (open-source, self-hosted) + 2.5 Pro β€” open-source for cost control + cloud for quality
  • βœ… text--3-small + Llama 3.3 70B (self-hosted) β€” cheap via API + open-source for full data sovereignty
  • βœ… Cohere Embed v3 + Claude Haiku 4.5 β€” no constraint forces these to match vendor or model family
⚠️The one combination that DOES NOT work

Mixing models within the same index. If you embedded half your corpus with voyage-3-large and the other half with text--3-large, the cosine similarities between query and are meaningless β€” the two halves live in incompatible geometric spaces. You'd retrieve effectively random from one half. The fix: keep one model per index. To migrate models, build a parallel index, A/B test, then switch the read path atomically.

πŸ’‘So why does the embedding model selection matter so much?

Because the model decides what similar means in your search space β€” and that decision is made before the is in the picture. If the model puts β€œparental leave” far from β€œmaternity benefits,” the never sees the right to begin with β€” no amount of quality can recover what retrieval missed. That's why the next section goes deep on model selection: the is downstream of this decision and depends on it being right.

🧬Choosing the embedding model β€” three real use cases

The single most consequential design decision in any RAG system, and the one most architecture diagrams render as one unlabeled box.

The four hidden deciders above told you the model defines what β€œsimilar” means in your whole system. That's the headline. The follow-up question β€” the one that actually shapes your architecture β€” is which model do you pick for which workload? Picking wrong is one of the most expensive mistakes in production because the choice is baked into every you've ingested. Switching means re- the entire corpus.

Below are three workloads I see often, each with a different recommended model and a different reason. The point isn't the specific model name (those change every quarter β€” see the section below) but the decision pattern: read your domain shape, then choose accordingly.

πŸ’‘Why this section exists
Every tutorial says β€œuse text--3 from OpenAI.” That's fine for a tutorial. In production, the model that's right for HR policy retrieval is wrong for security operations, and both are wrong for real-time user-transaction decisioning. The shape of your domain β€” formal language vs technical jargon vs structured artifacts vs latency budget β€” determines the choice.
πŸ‘₯

Use case 1 β€” HR Policy Knowledge Base

Domain shape

Long-form policy text in formal English. Sentences like β€œEligible employees may take up to 12 weeks of unpaid leave under FMLA.” Retrieval is forgiving β€” paraphrases (β€œmaternity benefits” ↔ β€œparental leave”) are common, and exact-match isn't critical beyond legal terminology. Latency budget: 1-2 seconds.

Recommended

General-purpose web-trained model at 1024-1536 dimensions β€” text--3-large (OpenAI) or voyage-3-large (Voyage AI). Pair with a (Cohere Rerank v3 or BGE v2-gemma) on the top-50.

Why this choice
  • ✦ Both models are trained on web-scale general English where HR-policy language is well-represented β€” you get strong synonym and paraphrase matching out of the box.
  • ✦ 1024-1536 dimensions are sufficient for an HR corpus of 50K-500K . Going higher (3072-dim) buys imperceptible recall gain at the cost of 2Γ— index size and 2Γ— latency.
  • ✦ is non-negotiable: HR queries paraphrase heavily, so the first-pass retrieval is high-recall but noisy. The trims to the top-5 that actually answer the asker's question.
  • ✦ Compliance bonus: general English models don't need , which means no PII leaving your premises and no model-card review with legal.
What you'd be wrong to pick

A code-tuned model (CodeBERT, voyage-code-3). Code prioritize structural similarity (token shape, syntax) over semantic meaning of natural language β€” they collapse the distinction between β€œleave” and β€œleaving” because they care about tokens. HR queries demand semantic generalization, not structural alignment.

πŸ›‘οΈ

Use case 2 β€” Security Operations (SecOps)

Domain shape

Short, dense, technical artifacts. CVE IDs, IOCs (indicators of compromise β€” file hashes, IP addresses, domain names), MITRE ATT&CK technique IDs (T1059, T1547), malware family names (Emotet, Qakbot), TTPs. Retrieval is precision-critical: β€œCVE-2024-1234” must NOT match β€œCVE-2024-1235.” Latency budget: 100-500ms during alert triage.

Recommended

A model that produces both dense and sparse representations β€” BGE-M3 is the strongest open-source choice because it emits dense + sparse + ColBERT-style multi-vector outputs in a single pass. Mature SOCs fine-tune BAAI/bge-large-en-v1.5 on their own (alert, related-alert) pairs from the SIEM.

Why this choice
  • ✦ General-purpose collapse semantically distinct hashes and IPs into a single β€œalphanumeric blob” cluster β€” the model has no signal to tell CVE-2024-1234 from CVE-2024-1235 because both look like β€œlong alphanumeric tokens.” You retrieve the wrong vulnerabilities and triage collapses.
  • ✦ BGE-M3's sparse component captures exact terms (CVE numbers, T-codes, hashes) at the layer itself β€” so the vector channel scores them correctly without relying entirely on backup.
  • ✦ The hybrid-search backbone is even more important here than for HR. Run BGE-M3 + , fuse with biased toward (k=40 vs k=60 default) so exact-match wins ties.
  • ✦ on internal alert pairs is the highest-precision option for mature SOCs β€” you teach the model that β€œalert A and alert B are part of the same incident” in your specific environment.
What you'd be wrong to pick

text--3-small alone. It's cheap, fast, general-purpose β€” and useless for SecOps. It will treat every CVE ID as semantically near every other CVE ID. You'll triage the wrong incident, sometimes catastrophically. This is one of the rare workloads where β€œjust use OpenAI” is wrong.

⚑

Use case 3 β€” User Transactions (Real-Time Decisioning)

Domain shape

Structured-ish, short, latency-critical. β€œUser u-8821 viewed product p-1247, last purchase 2026-04-18, segment: high-LTV, cart abandoned 2 hours ago.” The retrieval question is β€œgiven this user's recent context, what's the most relevant offer or product to surface?” Latency budget: sub-100ms p99 because this runs on the page-render path. Volume: tens of millions of queries per day.

Recommended

A SMALL, FAST β€” text--3-small (1536-dim, ~30ms p50) or self-hosted BGE-small-en-v1.5 (384-dim, sub-10ms p99 on a single GPU). Aggressive prompt and caching is the second half of the answer.

Why this choice
  • ✦ Real-time decisioning has a hard sub-100ms budget, and the call is only one of 4-6 hops in the request path. Saving 50ms on latency lets you afford the that actually drives precision.
  • ✦ The semantic space here doesn't need to be sophisticated β€” the dimensions that drive the decision (recency, segment, behavior) live in metadata, not in the vector. The vector is for β€œfind similar offer text” or β€œfind similar user-profile narratives.”
  • ✦ Self-hosting BGE-small on a single GPU gives you sub-10ms p99 latency at the cost of operational complexity β€” worth it at high QPS where the OpenAI bill alone pays for the GPU within weeks.
  • ✦ Cache aggressively: the same user vector doesn't change between page views β€” embed once, cache for the session. Product descriptions don't change at all β€” embed at ingest, never re-embed.
What you'd be wrong to pick

text--3-large at 3072 dimensions. Latency is too high (50-100ms per call) and the precision gain is invisible because the real signal lives in the metadata filter, not in the cosine similarity. You pay 5Γ— the cost and 3Γ— the latency for a quality difference no user perceives.

πŸ“ˆ

Models improve every month β€” your architecture must absorb that

models improve roughly every quarter. text--ada-002 (Dec 2022) was state-of-the-art for 18 months. text--3-large (Jan 2024) jumped MTEB by 4 points. voyage-3 (mid-2024) added another 2-3. BGE-M3 (early 2024) brought multilingual + multi-function to open-source. Each generation outperforms the prior on specific axes β€” long-document retrieval, multilingual, code, finance, biomedical.

That tempo has a direct architectural consequence: you cannot pick the β€œbest” model once and ship. You need an architecture that lets you re-evaluate every six months and migrate when a meaningfully better model appears. Concretely, three patterns:

🏷️

1. Versioned

Every in your carries an embedding_model_version field. Adopting a new model means re- the corpus into a parallel index, running an offline A/B (recall@10, MRR, NDCG on a held-out set), and only flipping the read path when the new index demonstrably wins on YOUR domain.

🚦

2. Routing-aware retrieval

The retrieval layer routes different query types to different models. Code queries β†’ code-tuned . Domain-specific β†’ domain-tuned. General β†’ general. Each route owns its own index. The router is a lightweight intent classifier (small call or rule-based) that runs before .

πŸ§ͺ

3. Offline is the gate

Never adopt a new model based on public benchmark scores alone. Build a domain-specific set (50-500 query-passage pairs labeled by SMEs from your corpus) and a test harness that measures recall@10, MRR, NDCG. MTEB wins don't always translate to your domain β€” your is the only ground truth.

πŸ”€

β€” does the router need a matching router?

Most teams already route queries across multiple (frontier for hard, mid-tier for medium, small for greetings or refusals). The natural follow-up: should we also route across multiple indexes? Three architectures, in order of operational complexity:

Tier 1 β€” Single index, multi-

One model. One . One index. The router only changes which consumes the retrieved . Most production teams start here and stay here for 1-2 years. Simple to operate, sufficient for most domains.

Tier 2 β€” Hybrid (one general + specialized)

One general-purpose index for the broad case, plus specialized indexes for domains where general demonstrably fail (security IOCs, code, biomedical). A domain classifier routes the query to the right index. Most enterprises with 3-5 domains land here.

Tier 3 β€” Multi-index, multi-model

Each domain has its own model and index. Code β†’ code-embed. Security β†’ BGE-M3 with sparse. HR β†’ voyage-3-large. The router decides both and index. Highest precision, highest ops cost. Worth it for 5+ distinct domains with quality bars.

πŸ”‘The decision rule

Start with Tier 1. Move to Tier 2 ONLY when you have measured evidence (offline set + production logs) that one or more domains are systematically failing the general . Move to Tier 3 ONLY when you have 5+ distinct domains, each with its own harness and SMEs. Most teams who jump straight to Tier 3 spend months on operational pain for precision gains they could have gotten with a better .

πŸ€–

selection β€” the second half of quality

Most conversations focus on the model. But the that actually answers the question shapes outcomes just as profoundly. The you retrieve are only half the story β€” the other half is whether the uses them faithfully.

Faithfulness

(Claude Sonnet 4.5, GPT-5) cite the . Smaller models (Haiku, GPT-4o-mini) are more likely to ignore retrieved context and answer from training data when the don't perfectly align. For high-stakes (HR compliance, security), faithfulness matters more than fluency.

Long-context attention

When you stuff 20+ into a prompt, models suffer β€œlost in the middle” β€” at the start and end get more attention than the middle. with native long context (Claude 4.x at 1M tokens, Pro at 2M) handle this better than older models with retrofitted context expansion.

Refusal calibration

A high-faithfulness model says β€œI don't have enough information” when are insufficient. A low-faithfulness model fabricates. For regulated domains (compliance, finance, security), refusal calibration is the difference between a useful product and a liability.

Cost-per--call

A typical call sends 5-10K tokens of context. At Claude Opus pricing ($15/M input), 1M queries/month β‰ˆ $150K/month. Same volume on Haiku 4.5 ($1/M) β‰ˆ $10K. The cost gap forces a routing decision: use the for the queries where faithfulness matters; cheap model for everything else.

Tool-calling for

Not all are equally good at . Claude and GPT-5 are excellent. Smaller open-source models often need to reliably emit valid tool-call JSON. If you're doing ( decides when to retrieve), the choice matters more than for naive .

Streaming + citation injection

Some models stream tokens cleanly while citing sources mid-response (Claude is particularly strong here). Others batch-emit citations at the end, which is uglier in UI. Test the citation pattern in your specific UX before committing β€” this is a product-quality decision, not just a model-quality one.

πŸ’¬The compound effect

model quality determines whether the right are retrieved. quality determines whether those are used faithfully. A 90%-correct paired with a low-faithfulness produces 60%-correct answers. A 70%-correct paired with a high-faithfulness produces 65% β€” but with honest β€œI don't know” refusals on the missing 30%, which is what compliance demands. The pairing matters more than either model in isolation.

❓Misconceptions cleared

A short catalog of common confusions about how systems work. Each one is answered in one or two sentences β€” the kind of quick clarity that separates someone who has shipped from someone who has only read about it.

Who decides what to retrieve β€” the LLM or the vector database?

In naive RAG: neither, really. The application code decides (by writing the pipeline), the embedding model decides (by defining similarity), and the vector DB runs math. The LLM is a passive consumer of whatever was handed to it.

In agentic RAG: the LLM decides when and how to call the retrieval tool β€” but the vector DB still just runs math under the hood.

Does the embedding model need to match the LLM? Can I use any LLM with any embedding model?

No, they are completely decoupled. The embedding model is used to convert your chunks (at ingest) and your query (at retrieval) into vectors for similarity search. The LLM only ever receives the original TEXT of the retrieved chunks β€” never the vectors themselves. You can pair OpenAI embeddings with Anthropic Claude, voyage embeddings with GPT-5, BGE-M3 with Gemini β€” any combination works.

The ONE matching rule: the embedding model used at ingest must be the same model and same version as the one used at query time. Different embedding models produce vectors in incompatible geometric spaces, so switching means re-embedding the entire corpus. The LLM choice is independent of all of this β€” it just reads text.

When the LLM gets the retrieved chunks, does it receive vectors or text?

Plain text only. The vector DB stores BOTH the vector and the original text of each chunk on the same row. The vector is the search index used to find the chunk; the original text is the payload. At generation time, the app server retrieves the chunks by vector similarity, looks up their original text, and embeds that text into the LLM's prompt as ordinary prose. The LLM has no concept of vectors at all β€” to it, the retrieved chunks just look like context the application decided to include.

Is the vector database an LLM? Do I need both?

No and yes. A vector database is a specialized data store that runs approximate nearest-neighbor math over high-dimensional vectors. It has no language understanding, no reasoning, and no ability to generate text. You need both: the vector DB holds your documents, the LLM answers the question, and a pipeline connects them.

Why do I need an embedding model if I already have an LLM?

Because the LLM can't search billions of documents in its head. The embedding model's job is to convert every document (and every query) into a vector so that cheap math β€” cosine similarity β€” can find the few relevant chunks. The LLM then reasons over those chunks. Trying to β€œjust feed everything to the LLM” fails on context length, cost, and latency.

Does the LLM know what's in my knowledge base?

No. The LLM only sees what's in the current prompt. If you retrieved 5 chunks and put them in the prompt, the LLM knows about those 5 chunks β€” and nothing else. It has no persistent awareness of your corpus, and it can't β€œlook something up” unless you give it a tool to do so (that's agentic RAG).

When does retrieval happen β€” every token, every turn, or once per conversation?

In naive RAG, retrieval happens once per user turn, right before the LLM is called. In agentic RAG, it happens zero or more times per turn β€” the LLM decides. It never happens per-token (that would be absurdly expensive).

If the retrieved chunks are irrelevant, does the LLM know to ignore them?

Not reliably. LLMs tend to use whatever context they're given, even when it doesn't answer the question β€” a phenomenon called β€œcontext contamination.” This is precisely why the reranker step matters: if you hand the LLM a clean top-5, you avoid this problem. If you hand it a noisy top-50, it may confidently cite irrelevant material.

What's the difference between RAG, fine-tuning, and long context?

RAG inserts relevant knowledge into the prompt at query time. Great for facts that change (policies, pricing, docs). Cheap, updatable.

Fine-tuning updates the model's weights with your data. Great for style, tone, format adherence. Bad for facts that change. Expensive.

Long context shoves everything into the prompt without retrieval. Works for small corpora, fails on latency and cost above ~100K tokens, and still suffers from β€œlost in the middle” attention problems.

Is hybrid search running two separate searches on every query?

Yes β€” vector search and BM25 run in parallel, then their results are fused with a technique called Reciprocal Rank Fusion (RRF). Both searches typically hit the same data store (e.g., Postgres with pgvector + tsvector), so the overhead is one extra query, not two separate trips.

If the vector database is down, can the LLM still answer?

It depends on how you designed the pipeline. A robust production system degrades to BM25-only search if the vector DB is down, and degrades further to β€œI'm unable to look that up right now” if both are down. The LLM itself never fails because of the vector DB β€” it just has less context to work with.

Why does chunk size matter?

Because embeddings average semantic meaning across the whole chunk. A 100-page PDF embedded as one vector is a muddy average of 100 topics β€” useless. A 10-word chunk is too narrow to contain an answer. Production systems usually land on 500–1000 tokens with 10% overlap between adjacent chunks, tuned against an offline eval set.

🎯The one-sentence summary

πŸ”‘If you remember nothing else

In naive , the application code decides the pipeline, the model decides similarity, the runs math, and the only sees what the code handed it. In , the gets a retrieval tool and decides when to call it β€” but the , model, and still do the actual finding. Either way, the four hidden deciders ( model, metadata filter, , ) matter more than the itself.

πŸš€Now you're ready for the vertical case studies

This page is the generic anatomy β€” the template every vertical case study is built on. The vertical pages show how the same 15-step pipeline is specialized for a specific domain with domain-specific components, permissions, , and gotchas.