Enterprise RAG β€” Anatomy

When you press Enter, who actually decides what gets retrieved?

Most RAG tutorials show one three-box diagram and stop. This page is the walkthrough I wish existed when I was learning β€” the full production pipeline, one end-to-end example query, 15 numbered steps, and a clear answer to who decides what at every hop. It's the page to read before the vertical case studies.

Architecture + sequence diagrams · 15-step walkthrough · Naive vs agentic · Misconceptions cleared

🛤️ RAG Learning Path — read in order to build a production RAG system
Not building a RAG system? The Model Committee deep-dive is a parallel track covering the eight specialized model families and routing patterns β€” read it after Foundations instead of RAG Anatomy if model composition is what you're after.

🧭The question behind the question

Ask a room of engineers β€œhow does RAG work” and you'll get three boxes on a whiteboard: a vector database, an LLM, and an arrow between them. That diagram is not wrong β€” it's just so compressed that it hides every decision that matters.

The question I keep getting asked β€” by recruiters, by candidates, by anyone building their first RAG system β€” is deceptively simple:

πŸ’¬The clarifying question

When you use a vector database for a custom knowledge base, who actually decides what to retrieve? Is it the LLM? Is it the vector database? Something else?

The honest answer is: neither of them, in the sense people usually mean. Retrieval is a pipeline of mechanical operations, and different components make different decisions at different points. The vector database never decides anything cognitive β€” it runs math. The LLM may or may not decide, depending on which architecture you picked. The real deciders are components most tutorials barely mention: the embedding model, the metadata filter, and the reranker.

The rest of this page is a rigorous walkthrough of exactly who decides what, using one canonical query traced through a real production architecture.

🎯The canonical query

Example query we will trace end-to-end
β€œWhat's our parental leave policy for California employees?”
This example is chosen deliberately. It exercises metadata filtering (CA vs other states), exact-match terminology (β€œparental leave” is a specific HR term), policy freshness (was this updated recently?), and has a clear wrong-answer-equals-compliance-problem stake. It's the kind of query where every step of the pipeline matters.

🎭The cast of actors

Before we walk the flow, meet every component that participates. Pay attention to the β€œdecides” column β€” this is the layer the conventional diagrams compress away.

πŸ‘€
User
HR employee
Chooses the question wording. That's their entire contribution β€” everything downstream is deterministic given this input.
πŸ’»
Browser / Client
Next.js UI
Sends the message and renders the streamed response. Stateless β€” it doesn't know anything about retrieval.
βš™οΈ
App Server / Orchestrator
Next.js API + FastAPI
Decides the pipeline topology. Owns every hop. Receives the query, scopes the retrieval filter, coordinates parallel calls, assembles the prompt. This is the conductor.
πŸ”
Auth Service
Okta / Auth0 / internal SSO
Decides whether the session is valid. First security boundary β€” no downstream calls run if auth fails.
πŸ‘₯
HRIS
Workday / BambooHR / Rippling
Source of employee attributes used for filter scoping (work state, department, employment type). Usually cached in the app DB for latency.
🧬
Embedding Model
text-embedding-3-small / voyage-3
Decides what counts as 'semantically similar.' This is the hidden decider most people forget exists. Chosen at design time β€” long before any user shows up.
πŸ—ƒοΈ
Vector Database
pgvector / Qdrant / Weaviate
Runs approximate nearest-neighbor math. Decides NOTHING cognitive. Given a query vector and a filter, it returns the top-k nearest candidates by cosine similarity. That's a math operation, not a decision.
πŸ”
BM25 Index
Postgres tsvector / Elasticsearch
Exact-keyword scoring. Decides based on term frequency, inverse document frequency, and document length β€” no semantic understanding at all. Runs in parallel with vector search.
βš–οΈ
Reranker
Cohere Rerank-v3 / BGE-reranker
A SECOND machine-learning model that cross-encodes each (query, chunk) pair and re-scores them. This is where retrieval actually gets precise. Most of your top-5 quality comes from this step, not the vector DB.
πŸ›‘οΈ
Guardrails
PII scrub + refusal filter
Pre-assembly (redact PII before embedding) and post-generation (scrub PII from the LLM response). Non-blocking but critical for compliance.
πŸ€–
LLM
Claude Sonnet 4.5
In naive RAG: decides only the OUTPUT TEXT given the handed-in context. In agentic RAG: additionally decides when and how to call the retrieval tool.
πŸ“œ
Audit Log
append-only Postgres table
Records who asked what and which chunks were returned. Decides nothing β€” pure observability. Must be non-blocking: a log failure must NEVER block a user response.

πŸ—οΈThe full architecture β€” take it in at a glance

This is the β€œoh, that's how it works” diagram. Trace the numbered arrows from 1 to 15 in sequence and you'll see the whole request path without reading a single word of prose. The components are grouped into five bands by architectural responsibility β€” user, application, identity, retrieval, generation β€” and the retrieval layer has all four pieces (embedding, vector DB, keyword index, reranker) visually clustered to reinforce that they are one subsystem, not four unrelated services.

Enterprise RAG β€” Full Architecture (Naive RAG)
[Diagram — five layered bands, with numbered arrows 1→15 tracing the request path:
USER LAYER: 👤 User (HR employee) · 💻 Browser (Next.js chat UI)
APPLICATION LAYER: ⚙️ App Server / Orchestrator (Next.js API + FastAPI) · 🛡️ Guardrails (PII scrub + refusal filter)
IDENTITY & CONTEXT: 🔐 Auth Service (Okta / Auth0) · 👥 HRIS (Workday / Rippling)
RETRIEVAL LAYER: 🧬 Embedding Model (text-embedding-3-small) · 🗃️ Vector DB (pgvector / Qdrant) · 🔍 BM25 Index (Postgres tsvector) · ⚖️ Reranker (Cohere Rerank-v3)
GENERATION LAYER: 🤖 LLM (Claude Sonnet 4.5) · 📜 Audit Log (append-only)
Legend: request path · parallel call · response path · observability (non-blocking). Follow the step numbers 1→15 to trace the full flow.]
πŸ”‘Read the bands, not just the arrows
The reason this diagram uses layered bands instead of a linear left-to-right flow is to encode architectural thinking. β€œRetrieval Layer” is a concept β€” grouping the four retrieval components visually teaches that embedding + vector DB + BM25 + reranker are one subsystem. Candidates who understand this intuitively are the ones who can actually debug production RAG. Candidates who think of them as four independent services usually can't.

⏱️The sequence β€” trace every step in time order

Where the architectural diagram above encodes structure, this sequence diagram encodes time. Each vertical lane is one actor; time flows top to bottom. The numbered arrows are the same 15 steps β€” but now you can see exactly who calls whom in what order, which calls happen in parallel, and where the self-loops on the App Server lane represent internal work (fusion, prompt assembly, post-processing).

Enterprise RAG β€” Sequence Diagram (time flows top to bottom)
πŸ‘€UserπŸ’»Browserβš™οΈAppπŸ”AuthπŸ‘₯HRIS🧬EmbedπŸ—ƒοΈVectorπŸ”BM25βš–οΈRerankπŸ€–LLM1types question2POST /api/chat3validate session4fetch attributes5embed query6ANN search (with filter)7BM25 keyword search8fuse with RRF9rerank top-5010return top-5 chunks11assemble prompt12generate answer13stream tokens14post-process + audit15stream to userPurple arrows = request path (1–12) β€’ Amber arrows = response path (13–15) β€’ Self-loops = internal app work
πŸ’‘Why both diagrams?
The architectural diagram is for the 10-second gut-check (β€œdo I understand the shape”). The sequence diagram is for the 2-minute trace (β€œdo I understand the order”). A portfolio page without both is missing one of the two ways senior engineers evaluate system designs.

πŸ“‹The 15 steps, in detail

Now the detailed walk. Each step lists the actor, what goes in, what comes out, who decides, and what happens when it fails. The β€œwhy this matters” note appears on the steps where the common mental model diverges from what actually happens.

1

User submits the question

User β†’ Browser
Input
Typed text: "What's our parental leave policy for California employees?"
Output
Nothing yet β€” the message lives in the client state
Who decides
The user (word choice is the only decision)
Fail mode
Empty string β€” client-side validation rejects before dispatch
2

Browser POSTs to /api/chat

Browser β†’ App server
Input
JSON body with the message + session_id + httpOnly session cookie
Output
HTTP 200 with a Server-Sent Events stream for token streaming
Who decides
App code defines the endpoint contract
Fail mode
Network error β†’ client shows retry UI; no retrieval has happened so no state to clean up
3

App server validates the session

App server β†’ Auth service
Input
Session token from the httpOnly cookie
Output
User identity: user_id, email, groups, access claims
Who decides
Auth service decides if the token is valid; app code decides this endpoint requires auth
Fail mode
401 Unauthorized β†’ client prompts re-login. No downstream calls run. This is the first security boundary.
β˜… Why this matters
Most naive architectures put auth last. In enterprise RAG, auth must be FIRST β€” every downstream step (especially retrieval) depends on knowing who the asker is.
4

App server fetches employee attributes for filter scoping

App server β†’ HRIS (or cached user table)
Input
user_id = u-8821
Output
work_state: CA, employment_type: FTE, department: Engineering, hire_date
Who decides
App code decides WHICH attributes are needed to scope the retrieval filter
Fail mode
HRIS outage β†’ fall back to 'global policies only' (safer to under-serve than leak)
β˜… Why this matters
This is the step most tutorials skip. Retrieval without pre-filtering is a permissions disaster. The reader must see that filtering is established BEFORE the vector search, not after.
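As a sketch, the filter scoping in this step is nothing more than a pure function from HRIS attributes to a filter object that will constrain the vector search. Field names (`doc_type`, `applies_to`, `effective_date_lte`) and the helper are illustrative, not a real API:

```python
def build_retrieval_filter(employee):
    """Scope retrieval to documents this employee may see, BEFORE any search runs."""
    return {
        "doc_type": "policy",
        # Global policies plus the employee's own work state; nothing else
        # is even a candidate for the vector search in Step 6.
        "applies_to": ["global", employee["work_state"]],
        "effective_date_lte": "today",
    }

employee = {"user_id": "u-8821", "work_state": "CA",
            "employment_type": "FTE", "department": "Engineering"}
print(build_retrieval_filter(employee))
# → {'doc_type': 'policy', 'applies_to': ['global', 'CA'], 'effective_date_lte': 'today'}
```

The point of keeping it a pure function: it is testable offline, and a bug here is a permissions bug, not a relevance bug.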
5

App server embeds the query text

App server β†’ Embedding model
Input
Plain text: "What's our parental leave policy for California employees?"
Output
1,536-dim float32 vector
Who decides
The embedding model β€” chosen at design time β€” defines the semantic space. If you picked a general-purpose model, 'parental leave' and 'maternity benefits' are nearby; if you picked a finance-tuned model, they may not be.
Fail mode
Embedding API outage β†’ fall back to BM25-only search (degraded but functional)
β˜… Why this matters
Notice: the LLM has not been touched yet. The vector DB has not been touched yet. The 'similarity' decision is happening right now, at Step 5, inside a completely separate model that most people don't even think about. If you picked the wrong embedding model, steps 6-15 cannot compensate.
6

App server runs vector search WITH the pre-scoped filter

App server β†’ Vector DB
Input
query_vector + top_k: 50 + filter: { doc_type: policy, applies_to: [global, CA], effective_date: <= today }
Output
50 chunks with cosine-similarity scores, ordered high-to-low
Who decides
The vector DB runs ANN math (HNSW or IVF) β€” decides NOTHING cognitive. The filter decides which chunks are even candidates, and that filter came from Step 4.
Fail mode
DB timeout β†’ partial results or retry; app degrades to BM25 on second failure
β˜… Why this matters
This is the step people mean when they say 'retrieval,' but 80% of the correctness is coming from the filter β€” not the similarity math. The filter is the real hero; the vector search is fast first-pass arithmetic.
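To make "fast first-pass arithmetic" concrete, here is a minimal brute-force sketch of what the vector DB computes: cosine similarity over the chunks that already passed the metadata filter. Toy 2-dimensional vectors for illustration; a real index uses approximate structures (HNSW/IVF) to get the same ranking without scanning everything:

```python
import math

def cosine(a, b):
    """Plain cosine similarity — the only 'decision' the vector DB makes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, candidates, k=2):
    """Brute-force nearest neighbors over chunks that already passed the filter."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors; production vectors are 1,536-dim floats from the embedding model.
candidates = [("chunk-a", [1.0, 0.0]), ("chunk-b", [0.0, 1.0]), ("chunk-c", [0.6, 0.8])]
print(top_k([1.0, 0.0], candidates, k=2))  # chunk-a first, then chunk-c
```

Notice there is no language understanding anywhere in this code. Everything "semantic" was already baked into the vectors by the embedding model in Step 5.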
7

App server runs BM25 keyword search in parallel

App server β†’ BM25 index
Input
Raw query text + the same metadata filter as Step 6
Output
50 chunks scored by term frequency / inverse document frequency
Who decides
BM25 formula decides purely on word overlap and document length normalization
Fail mode
Skip if index unavailable; vector results are still usable
β˜… Why this matters
If someone asks "what's our PL policy?" (using the internal acronym PL for parental leave), the embedding model may not map "PL" to parental leave, but BM25 will exact-match any document containing "PL". This is why vector-only search is a trap for enterprise corpora full of acronyms.
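A stripped-down BM25 scorer makes the exact-match behavior visible: the acronym "PL" scores only on documents that literally contain it. This is a toy sketch of the formula, not the Postgres tsvector or Elasticsearch implementation:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Toy BM25: relevance rests entirely on literal term overlap."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            if df == 0:
                continue  # term appears nowhere in the corpus
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

docs = ["our PL policy grants 16 weeks of parental leave",
        "dental coverage and vision benefits"]
print(bm25_scores("PL policy", docs))  # only the doc containing "PL" scores above zero
```

No embedding model could guarantee this behavior; the exact-match guarantee is precisely why BM25 runs in parallel rather than being replaced.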
8

App server fuses vector + BM25 results

App server (pure code, no external call)
Input
Two ranked lists of 50 chunks each
Output
One fused list (~70 unique chunks after dedup) scored by Reciprocal Rank Fusion
Who decides
The fusion formula β€” score(doc) = Ξ£ 1 / (k + rank_i(doc)) across both lists
Fail mode
Deterministic math, no failure mode
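The fusion formula above translates directly into a few lines of deterministic code (k = 60 is the conventional RRF damping constant):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; documents appearing in both lists rise.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk-a", "chunk-b", "chunk-c"]   # ranked list from Step 6
bm25_hits = ["chunk-b", "chunk-d"]                # ranked list from Step 7
print(rrf_fuse([vector_hits, bm25_hits]))  # chunk-b wins: it appears in both lists
```

RRF is popular precisely because it needs no score normalization: it only looks at ranks, so cosine similarities and BM25 scores never have to be made comparable.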
9

App server sends top 50 fused candidates to the reranker

App server β†’ Reranker model
Input
Query text + 50 candidate chunks as text
Output
50 chunks with precision scores, reordered by cross-encoder relevance
Who decides
The reranker is a SECOND machine-learning model β€” specifically a cross-encoder that looks at each (query, chunk) pair TOGETHER and scores them. This is fundamentally different from the embedding model, which looks at query and chunk separately.
Fail mode
Reranker timeout β†’ fall back to fused top-K without rerank (degraded precision, still usable)
β˜… Why this matters
This is where retrieval actually gets precise. The embedding model is a fast, imprecise first pass over millions of chunks. The reranker is a slow, precise second pass over the top 50. Most production RAG systems live or die on reranker quality.
10

App server selects top N reranked chunks (typically 5–8)

App server (pure code)
Input
50 reranked chunks
Output
Top 5 chunks + their metadata (source URL, doc title, section, effective date)
Who decides
App code's configured N, tuned against an offline eval set
Fail mode
N/A
11

App server assembles the prompt

App server (pure code)
Input
User question + top 5 chunks + versioned system prompt template
Output
Final messages array ready for the LLM API
Who decides
The prompt template β€” chosen at design time, typically versioned and evaluated in an offline harness
Fail mode
Context length overflow β†’ truncate oldest chunks or summarize older turns
β˜… Why this matters
The reader must see that the LLM is about to receive a PRE-BUILT prompt. It didn't see the chunks before this moment and it can't ask for different ones.
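A minimal sketch of the assembly step. The chunk fields and template wording are illustrative; the point is that the messages array is fully built before the LLM is ever called:

```python
def assemble_prompt(question, chunks, template_version="v3"):
    """Build the final messages array. The LLM will see ONLY what lands here."""
    context = "\n\n".join(
        f"[{c['title']} §{c['section']} (effective {c['effective_date']})]\n{c['text']}"
        for c in chunks
    )
    system = (
        "You are an HR policy assistant. Answer ONLY from the provided context "
        "and cite the source section for every claim. "
        f"(prompt template {template_version})"
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```

By the time this function returns, every retrieval decision is already frozen into plain text inside the user message.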
12

App server calls the LLM

App server β†’ LLM API
Input
messages array: [system prompt, user message with embedded context]
Output
Streaming text tokens
Who decides
The LLM decides only the OUTPUT TEXT, conditioned on the prompt. It has no ability to say 'this context is wrong, give me different chunks.'
Fail mode
LLM API error β†’ generic 'please try again' + log for investigation
β˜… Why this matters
This is the step where most people think the LLM is "doing retrieval." It isn't. The LLM received a prompt that included 5 chunks of text. It has no idea those chunks came from a vector database — they look like prose in the prompt. It has no handle to query more. All the "deciding what to retrieve" happened in Steps 4-10, none of it was the LLM.
13

App server post-processes the LLM output through guardrails

App server (pure code + optional guardrail model)
Input
Streamed LLM tokens
Output
Cleaned response with inline citations linked back to source chunks
Who decides
Post-processing rules: PII scrub, refusal detection, citation injection
Fail mode
PII leak detected β†’ substitute safe fallback; alert on-call
14

App server writes the audit log

App server β†’ Audit log
Input
user_id, query, retrieved_chunk_ids, llm_response, timestamp, latency_ms
Output
Confirmation (or dead-letter queue if log is down)
Who decides
Logging policy β€” which fields are logged, retention, PII handling
Fail mode
Log failure must NOT block the user response. Critical design invariant: audit is best-effort, never blocking.
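The non-blocking invariant can be sketched as a wrapper that swallows every audit failure. The writer callable is a stand-in for the real database insert:

```python
import logging

def audit_write_best_effort(write_row, record):
    """Audit write that must NEVER raise into the request path."""
    try:
        write_row(record)
    except Exception:
        # Swallow, log, dead-letter; never re-raise toward the user response.
        logging.exception("audit write failed; record queued for dead-letter")

def handle_response(response_text, write_row, record):
    audit_write_best_effort(write_row, record)  # best-effort side channel
    return response_text                        # user response proceeds regardless

def broken_writer(record):
    raise ConnectionError("audit DB is down")

# Even with the audit store down, the user still gets their answer.
print(handle_response("Here is the policy summary.", broken_writer, {"user_id": "u-8821"}))
```

In a real service this would be a background task or a queue, but the invariant is the same: the audit path and the response path must never share a failure mode.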
15

App server streams the response to the browser

App server β†’ User
Input
Cleaned tokens + final citations
Output
Rendered in the chat UI with clickable source links
Who decides
UI rendering rules
Fail mode
Connection drop β†’ client retry logic replays from the last received token

🎯Agentic RAG β€” what changes when the LLM takes the driver's seat

Everything above is naive RAG β€” a pre-programmed pipeline where the LLM is the last step and plays no role in deciding what gets retrieved. In agentic RAG, the LLM is placed at the top of the call stack and given the retrieval pipeline as a tool. It decides when to call it, what query to pass, and β€” crucially β€” whether it needs to call it again after seeing the first results.

Agentic RAG β€” the same pipeline, re-wired with the LLM at the top
πŸ€–LLM as OrchestratorDecides when & how to retrieve(tool = search_knowledge_base)5atool_use: search("CA parental")10atool_result: top-5 chunksRetrieval Pipeline (unchanged)🧬 Embed β†’ πŸ—ƒοΈ Vector DB β†’ πŸ” BM25 β†’ βš–οΈ RerankerExact same steps 5-10 as naive RAG β€” the vector DB doesn't knowthe caller is now an LLM instead of app code.10bmulti-hopre-query ifnot enoughcontextWHAT ACTUALLY CHANGED FROM NAIVE RAGβ€’ App code no longer sequences retrieval. The LLM does, via a declared tool.β€’ Multi-hop is free: LLM inspects tool_result, decides if it needs another search with a refined query.

The important insight: agentic RAG isn't a different retrieval system. It's the same retrieval system (same embedding model, same vector DB, same BM25, same reranker β€” steps 5-10 are literally unchanged) with an LLM placed at the top of the call stack. The vector database doesn't know the caller is now an LLM instead of app code, and doesn't care.

What agentic RAG adds
  • ✨ LLM chooses the query phrasing β€” often better than the user's literal words
  • ✨ LLM chooses the filter values β€” e.g., recognizes β€œCA” from context and sets applies_to: [CA]
  • ✨ Multi-hop retrieval β€” LLM can issue a second, refined query if the first batch was insufficient
  • ✨ No retrieval when unneeded β€” LLM can answer greetings without touching the vector DB
The trade-offs
  • ⚠️ Higher latency β€” every tool call adds a round trip
  • ⚠️ Harder evaluation β€” the retrieval path is no longer deterministic
  • ⚠️ LLM can over-search β€” without good system prompts, some models call the tool on every turn
  • ⚠️ Cost grows with complexity β€” multi-hop conversations pay the LLM cost per hop
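For concreteness, here is what declaring the retrieval pipeline as a tool can look like, using the Anthropic Messages API tool format. The tool name, description wording, and fields are illustrative, not this system's actual schema:

```python
# Hypothetical tool declaration. A good description is the main lever
# against the "over-search" trade-off: it tells the model when NOT to call.
search_tool = {
    "name": "search_knowledge_base",
    "description": (
        "Search company HR policy documents. Call this for questions about "
        "benefits, leave, or policies; do NOT call it for greetings or small talk."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query, rephrased for retrieval",
            },
            "applies_to": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Jurisdiction filter values, e.g. ['global', 'CA']",
            },
        },
        "required": ["query"],
    },
}
```

When the model emits a `tool_use` block naming this tool, the app server runs the unchanged steps 5-10 and returns the top-5 chunks as the `tool_result`.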

πŸ”‘The four hidden deciders most tutorials undersell

If you were only allowed to remember four things from this page, remember these. These are the components that actually determine whether your retrieval is precise, and they're the ones most tutorials treat as footnotes.

🧬

1. The embedding model

Defines what β€œsimilar” means in your whole system. Chosen once at design time, baked into every chunk you've indexed. Switching it means re-embedding your entire corpus. Choose wrong and no amount of reranking can save you.

🎚️

2. The metadata filter

The filter applied to the vector search before ANN runs. Owns permission scoping and freshness filtering. In enterprise RAG this is where access control lives β€” filter failure equals data leak. The vector DB is dumb; the filter is the brain.

πŸ”

3. BM25 keyword search

The lexical counterweight to semantic search. Catches exact-match cases (acronyms, error codes, proper nouns, SKU numbers) that embedding similarity famously blurs. Running BM25 in parallel with vector search is table-stakes in production; vector-only is a beginner mistake on enterprise corpora.

βš–οΈ

4. The reranker

A second, slower, more precise ML model that cross-encodes each (query, chunk) pair. The reason you can afford a slow model here is that you're only running it on the top-50 candidates the first pass surfaced. Most of your production top-5 precision comes from this step β€” not the vector DB.

❓Misconceptions cleared

A short catalog of common confusions about how RAG systems work. Each one is answered in one or two sentences β€” the kind of quick clarity that separates someone who has shipped RAG from someone who has only read about it.

Who decides what to retrieve β€” the LLM or the vector database?

In naive RAG: neither, really. The application code decides (by writing the pipeline), the embedding model decides (by defining similarity), and the vector DB runs math. The LLM is a passive consumer of whatever was handed to it.

In agentic RAG: the LLM decides when and how to call the retrieval tool β€” but the vector DB still just runs math under the hood.

Is the vector database an LLM? Do I need both?

No and yes. A vector database is a specialized data store that runs approximate nearest-neighbor math over high-dimensional vectors. It has no language understanding, no reasoning, and no ability to generate text. You need both: the vector DB holds your documents, the LLM answers the question, and a pipeline connects them.

Why do I need an embedding model if I already have an LLM?

Because the LLM can't search billions of documents in its head. The embedding model's job is to convert every document (and every query) into a vector so that cheap math β€” cosine similarity β€” can find the few relevant chunks. The LLM then reasons over those chunks. Trying to β€œjust feed everything to the LLM” fails on context length, cost, and latency.

Does the LLM know what's in my knowledge base?

No. The LLM only sees what's in the current prompt. If you retrieved 5 chunks and put them in the prompt, the LLM knows about those 5 chunks β€” and nothing else. It has no persistent awareness of your corpus, and it can't β€œlook something up” unless you give it a tool to do so (that's agentic RAG).

When does retrieval happen β€” every token, every turn, or once per conversation?

In naive RAG, retrieval happens once per user turn, right before the LLM is called. In agentic RAG, it happens zero or more times per turn β€” the LLM decides. It never happens per-token (that would be absurdly expensive).

If the retrieved chunks are irrelevant, does the LLM know to ignore them?

Not reliably. LLMs tend to use whatever context they're given, even when it doesn't answer the question β€” a phenomenon called β€œcontext contamination.” This is precisely why the reranker step matters: if you hand the LLM a clean top-5, you avoid this problem. If you hand it a noisy top-50, it may confidently cite irrelevant material.

What's the difference between RAG, fine-tuning, and long context?

RAG inserts relevant knowledge into the prompt at query time. Great for facts that change (policies, pricing, docs). Cheap, updatable.

Fine-tuning updates the model's weights with your data. Great for style, tone, format adherence. Bad for facts that change. Expensive.

Long context shoves everything into the prompt without retrieval. Works for small corpora, fails on latency and cost above ~100K tokens, and still suffers from β€œlost in the middle” attention problems.

Is hybrid search running two separate searches on every query?

Yes β€” vector search and BM25 run in parallel, then their results are fused with a technique called Reciprocal Rank Fusion (RRF). Both searches typically hit the same data store (e.g., Postgres with pgvector + tsvector), so the overhead is one extra query, not two separate trips.

If the vector database is down, can the LLM still answer?

It depends on how you designed the pipeline. A robust production system degrades to BM25-only search if the vector DB is down, and degrades further to β€œI'm unable to look that up right now” if both are down. The LLM itself never fails because of the vector DB β€” it just has less context to work with.

Why does chunk size matter?

Because embeddings average semantic meaning across the whole chunk. A 100-page PDF embedded as one vector is a muddy average of 100 topics β€” useless. A 10-word chunk is too narrow to contain an answer. Production systems usually land on 500–1000 tokens with 10% overlap between adjacent chunks, tuned against an offline eval set.
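A minimal fixed-size chunker with overlap shows the two knobs involved. The defaults are illustrative (roughly 10% overlap); production chunkers usually also respect section and sentence boundaries:

```python
def chunk_tokens(tokens, size=800, overlap=80):
    """Fixed-size chunking with overlap; adjacent chunks share `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window already covers the tail
            break
    return chunks

# Tiny numbers to make the overlap visible: each chunk repeats the
# last token of the previous one.
print(chunk_tokens(list(range(10)), size=4, overlap=1))
# → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap exists so an answer straddling a chunk boundary still appears whole in at least one chunk; both knobs should be tuned against the same offline eval set mentioned above.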

🎯The one-sentence summary

πŸ”‘If you remember nothing else

In naive RAG, the application code decides the pipeline, the embedding model decides similarity, the vector DB runs math, and the LLM only sees what the code handed it. In agentic RAG, the LLM gets a retrieval tool and decides when to call it β€” but the vector DB, embedding model, and reranker still do the actual finding. Either way, the four hidden deciders (embedding model, metadata filter, BM25, reranker) matter more than the vector DB itself.

πŸš€Now you're ready for the vertical case studies

This page is the generic anatomy β€” the template every vertical case study is built on. The vertical pages show how the same 15-step pipeline is specialized for a specific domain with domain-specific components, permissions, evaluation, and gotchas.