Most RAG tutorials show one three-box diagram and stop. This page is the walkthrough I wish existed when I was learning: the full production pipeline, one end-to-end example query, 15 numbered steps, and a clear answer to who decides what at every hop. It's the page to read before the vertical case studies.
Ask a room of engineers “how does RAG work” and you'll get three boxes on a whiteboard: a vector database, an LLM, and an arrow between them. That diagram is not wrong; it's just so compressed that it hides every decision that matters.
The question I keep getting asked (by recruiters, by candidates, by anyone building their first RAG system) is deceptively simple:
When you use a vector database for a custom knowledge base, who actually decides what to retrieve? Is it the LLM? Is it the vector database? Something else?
The honest answer is: neither of them, in the sense people usually mean. Retrieval is a pipeline of mechanical operations, and different components make different decisions at different points. The vector database never decides anything cognitive; it runs math. The LLM may or may not decide, depending on which architecture you picked. The real deciders are components most tutorials barely mention: the embedding model, the metadata filter, and the reranker.
The rest of this page is a rigorous walkthrough of exactly who decides what, using one canonical query traced through a real production architecture.
Before we walk the flow, meet every component that participates. Pay attention to the “decides” column: this is the layer the conventional diagrams compress away.
This is the “oh, that's how it works” diagram. Trace the numbered arrows from 1 to 15 in sequence and you'll see the whole request path without reading a single word of prose. The components are grouped into five bands by architectural responsibility (user, application, identity, retrieval, generation), and the retrieval layer has all four pieces (embedding, vector DB, keyword index, reranker) visually clustered to reinforce that they are one subsystem, not four unrelated services.
Where the architectural diagram above encodes structure, this sequence diagram encodes time. Each vertical lane is one actor; time flows top to bottom. The numbered arrows are the same 15 steps, but now you can see exactly who calls whom in what order, which calls happen in parallel, and where the self-loops on the App Server lane represent internal work (fusion, prompt assembly, post-processing).
Now the detailed walk. Each step lists the actor, what goes in, what comes out, who decides, and what happens when it fails. The “why this matters” note appears on the steps where the common mental model diverges from what actually happens.
Everything above is naive RAG: a pre-programmed pipeline where the LLM is the last step and plays no role in deciding what gets retrieved. In agentic RAG, the LLM is placed at the top of the call stack and given the retrieval pipeline as a tool. It decides when to call it, what query to pass, and, crucially, whether it needs to call it again after seeing the first results.
The important insight: agentic RAG isn't a different retrieval system. It's the same retrieval system (same embedding model, same vector DB, same BM25, same reranker; steps 5-10 are literally unchanged) with an LLM placed at the top of the call stack. The vector database doesn't know the caller is now an LLM instead of app code, and doesn't care.
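The control-flow difference can be made concrete with a short sketch. Everything here is a stand-in: `fake_llm` plays the role of a tool-calling LLM, and `retrieve` collapses the whole steps-5-10 pipeline into a dictionary lookup. The point is only the shape of the loop: the LLM picks the query and decides when to stop; the pipeline still does the finding.

```python
def retrieve(query: str) -> list[str]:
    # Stand-in for the full retrieval pipeline (embed -> ANN -> BM25 ->
    # fuse -> rerank). It neither knows nor cares who called it.
    corpus = {
        "refund policy": ["Refunds are issued within 14 days."],
        "shipping": ["Standard shipping takes 3-5 business days."],
    }
    return corpus.get(query, [])

def fake_llm(question: str, context: list[str]) -> dict:
    # Stand-in for a tool-calling LLM: requests retrieval until it has
    # context, then answers. A real LLM emits structured tool calls.
    if not context:
        return {"action": "retrieve", "query": "refund policy"}
    return {"action": "answer", "text": f"Based on: {context[0]}"}

def agentic_rag(question: str, max_tool_calls: int = 3) -> str:
    context: list[str] = []
    for _ in range(max_tool_calls):
        step = fake_llm(question, context)
        if step["action"] == "answer":
            return step["text"]
        # The LLM chose the query; the pipeline still does the finding.
        context.extend(retrieve(step["query"]))
    return "I couldn't find enough context."

answer = agentic_rag("How long do refunds take?")
```

Swap `fake_llm` for naive RAG's hard-coded "retrieve once, then generate" and you change who decides, without touching a single retrieval component.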
If you were only allowed to remember four things from this page, remember these. These are the components that actually determine whether your retrieval is precise, and they're the ones most tutorials treat as footnotes.
Defines what “similar” means in your whole system. Chosen once at design time, baked into every chunk you've indexed. Switching it means re-embedding your entire corpus. Choose wrong and no amount of reranking can save you.
The filter applied to the vector search before ANN runs. Owns permission scoping and freshness filtering. In enterprise RAG this is where access control lives β filter failure equals data leak. The vector DB is dumb; the filter is the brain.
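To make “filter before ANN” concrete, here is a sketch of how the permission scope can live inside the retrieval query itself. The table and column names (`chunks`, `acl_group`, `embedding`) are hypothetical; the syntax follows Postgres with pgvector, where `<=>` is the cosine-distance operator.

```python
def build_filtered_ann_query(allowed_groups: list[str], top_k: int = 50):
    # The ACL and freshness predicates sit in WHERE, so unauthorized or
    # stale chunks are excluded before the similarity ranking happens.
    sql = """
        SELECT chunk_id, content
        FROM chunks
        WHERE acl_group = ANY(%(groups)s)             -- permission scope
          AND updated_at > now() - interval '1 year'  -- freshness
        ORDER BY embedding <=> %(query_vec)s          -- cosine-distance ANN
        LIMIT %(k)s
    """
    params = {"groups": allowed_groups, "k": top_k}
    return sql, params

sql, params = build_filtered_ann_query(["finance", "all-staff"])
```

Note the ordering: the `WHERE` clause constrains the candidate set, and only then does `ORDER BY ... <=> ...` rank it. Forget the filter and every user searches every document, which is exactly the data-leak failure mode described above.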
The lexical counterweight to semantic search. Catches exact-match cases (acronyms, error codes, proper nouns, SKU numbers) where embeddings famously fuzz. Running BM25 in parallel with vector search is table-stakes in production; vector-only is a beginner mistake on enterprise corpora.
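For intuition about why exact tokens win here, a minimal BM25 scorer is easy to write, though production systems use Elasticsearch, OpenSearch, or Postgres `tsvector` rather than hand-rolled code. `k1` and `b` are the conventional default constants.

```python
import math

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75):
    # Classic BM25: rare terms get high IDF, term frequency saturates,
    # and scores are normalized by document length.
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = [0.0] * n
    for term in query.lower().split():
        df = sum(term in d for d in tokenized)
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, d in enumerate(tokenized):
            f = d.count(term)
            scores[i] += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(d) / avgdl))
    return scores

docs = [
    "error E4031 means the auth token expired",
    "our embedding model maps text to vectors",
]
scores = bm25_scores("E4031", docs)
```

An embedding might place both documents vaguely near a query about errors; BM25 gives the document containing the literal token `E4031` a nonzero score and the other exactly zero, which is the exact-match behavior the paragraph above describes.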
A second, slower, more precise ML model that cross-encodes each (query, chunk) pair. The reason you can afford a slow model here is that you're only running it on the top-50 candidates the first pass surfaced. Most of your production top-5 precision comes from this step, not the vector DB.
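The shape of the step is simple: score every (query, candidate) pair, keep the best few. In this sketch `cross_encode_score` is a deliberately dumb stand-in (word overlap); a real system would call an actual cross-encoder model, for example one loaded via the sentence-transformers library.

```python
def cross_encode_score(query: str, chunk: str) -> float:
    # Placeholder for the cross-encoder's relevance score. A real model
    # reads query and chunk *together*, which is why it is slower and
    # more precise than comparing two independently computed embeddings.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scored = [(cross_encode_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

candidates = [
    "the cafeteria menu changes weekly",
    "refunds are processed within 14 days of purchase",
    "refunds require the original receipt",
]
top = rerank("how many days until refunds are processed", candidates, top_n=2)
```

The economics are the point: 50 slow scorings per query is affordable; 50 million (one per indexed chunk) is not. That is why the cheap first pass and the precise second pass coexist.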
A short catalog of common confusions about how RAG systems work. Each one is answered in one or two sentences: the kind of quick clarity that separates someone who has shipped RAG from someone who has only read about it.
In naive RAG: neither, really. The application code decides (by writing the pipeline), the embedding model decides (by defining similarity), and the vector DB runs math. The LLM is a passive consumer of whatever was handed to it.
In agentic RAG: the LLM decides when and how to call the retrieval tool, but the vector DB still just runs math under the hood.
No and yes. A vector database is a specialized data store that runs approximate nearest-neighbor math over high-dimensional vectors. It has no language understanding, no reasoning, and no ability to generate text. You need both: the vector DB holds your documents, the LLM answers the question, and a pipeline connects them.
Because the LLM can't search billions of documents in its head. The embedding model's job is to convert every document (and every query) into a vector so that cheap math (cosine similarity) can find the few relevant chunks. The LLM then reasons over those chunks. Trying to “just feed everything to the LLM” fails on context length, cost, and latency.
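“Cheap math” is literal: cosine similarity is one dot product and two norms per comparison. The toy 4-dimensional vectors below stand in for the several-hundred-dimensional vectors a real embedding model produces; the directions are invented for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec            = [0.9, 0.1, 0.0, 0.2]
chunk_about_refunds  = [0.8, 0.2, 0.1, 0.1]  # points in a similar direction
chunk_about_logging  = [0.0, 0.1, 0.9, 0.7]  # points somewhere else entirely

sim_refunds = cosine(query_vec, chunk_about_refunds)
sim_logging = cosine(query_vec, chunk_about_logging)
```

Because each comparison is this cheap (and ANN indexes avoid even doing most of them), the few relevant chunks can be pulled from millions in milliseconds, leaving the expensive LLM call for the final reasoning step only.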
No. The LLM only sees what's in the current prompt. If you retrieved 5 chunks and put them in the prompt, the LLM knows about those 5 chunks and nothing else. It has no persistent awareness of your corpus, and it can't “look something up” unless you give it a tool to do so (that's agentic RAG).
In naive RAG, retrieval happens once per user turn, right before the LLM is called. In agentic RAG, it happens zero or more times per turn; the LLM decides. It never happens per-token (that would be absurdly expensive).
Not reliably. LLMs tend to use whatever context they're given, even when it doesn't answer the question, a phenomenon called “context contamination.” This is precisely why the reranker step matters: if you hand the LLM a clean top-5, you avoid this problem. If you hand it a noisy top-50, it may confidently cite irrelevant material.
RAG inserts relevant knowledge into the prompt at query time. Great for facts that change (policies, pricing, docs). Cheap, updatable.
Fine-tuning updates the model's weights with your data. Great for style, tone, format adherence. Bad for facts that change. Expensive.
Long context shoves everything into the prompt without retrieval. Works for small corpora, fails on latency and cost above ~100K tokens, and still suffers from “lost in the middle” attention problems.
Yes: vector search and BM25 run in parallel, then their results are fused with a technique called Reciprocal Rank Fusion (RRF). Both searches typically hit the same data store (e.g., Postgres with pgvector + tsvector), so the overhead is one extra query, not two separate trips.
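RRF itself fits in a few lines: each ranked list contributes 1 / (k + rank) per document, scores are summed across lists, and documents that rank high in both lists fuse to the top. `k = 60` is the conventional constant from the original RRF paper.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: sum 1/(k + rank) across every ranking a
    # document appears in, then sort by the fused score.
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits   = ["doc_b", "doc_d", "doc_a"]  # lexical ranking

fused = rrf([vector_hits, bm25_hits])
```

Note that RRF only looks at rank positions, never raw scores, which is exactly why it can fuse a cosine-similarity ranking with a BM25 ranking without any score normalization.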
It depends on how you designed the pipeline. A robust production system degrades to BM25-only search if the vector DB is down, and degrades further to “I'm unable to look that up right now” if both are down. The LLM itself never fails because of the vector DB; it just has less context to work with.
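One way to sketch that degradation ladder: try each backend independently, keep whatever succeeds, and return a sentinel when everything is down so the app can answer honestly. `BackendDown` and the search callables are stand-ins for real client errors and client calls.

```python
class BackendDown(Exception):
    """Stand-in for a vector DB or keyword index being unreachable."""

def hybrid_retrieve(query, vector_search, keyword_search):
    # Query each backend independently so one outage never takes out
    # the other. An empty result set means "tell the user honestly".
    results = []
    for search in (vector_search, keyword_search):
        try:
            results += search(query)
        except BackendDown:
            continue  # skip the downed backend, keep the rest
    return results or None

def dead(_query):
    raise BackendDown

# Vector DB down, keyword index up: lexical-only degradation.
partial = hybrid_retrieve("refund policy", dead, lambda q: ["bm25 hit"])

# Both down: the caller gets None and can apologize instead of crashing.
nothing = hybrid_retrieve("refund policy", dead, dead)
```

The LLM call sits after this function, which is why it never fails because of the vector DB: at worst it receives less (or no) retrieved context.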
Because embeddings average semantic meaning across the whole chunk. A 100-page PDF embedded as one vector is a muddy average of 100 topics: useless. A 10-word chunk is too narrow to contain an answer. Production systems usually land on 500–1000 tokens with 10% overlap between adjacent chunks, tuned against an offline eval set.
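A word-level approximation of that chunking rule looks like this. Words stand in for tokens to keep the sketch dependency-free; a real pipeline counts tokens with the embedding model's own tokenizer, and the window and overlap values would be tuned against the eval set mentioned above.

```python
def chunk(text: str, size: int = 500, overlap_frac: float = 0.10) -> list[str]:
    # Sliding window: each chunk is `size` words, and consecutive
    # windows share `overlap_frac` of their length so a sentence cut
    # at a boundary still appears whole in one of the two chunks.
    words = text.split()
    step = max(1, int(size * (1 - overlap_frac)))  # advance 90% per window
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
pieces = chunk(doc, size=500)  # windows: 0-499, 450-949, 900-999
```

Each adjacent pair of chunks shares 50 words (10% of 500), which is the overlap doing its job: no fact that straddles a boundary is lost to both chunks.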
In naive RAG, the application code decides the pipeline, the embedding model decides similarity, the vector DB runs math, and the LLM only sees what the code handed it. In agentic RAG, the LLM gets a retrieval tool and decides when to call it, but the vector DB, embedding model, and reranker still do the actual finding. Either way, the four hidden deciders (embedding model, metadata filter, BM25, reranker) matter more than the vector DB itself.
This page is the generic anatomy: the template every vertical case study is built on. The vertical pages show how the same 15-step pipeline is specialized for a specific domain with domain-specific components, permissions, evaluation, and gotchas.