Most RAG tutorials show one three-box diagram and stop. This page is the walkthrough I wish existed when I was learning: the full production pipeline with chunking, embeddings, hybrid search, and re-ranking, traced through one end-to-end example query in 15 numbered steps, with a clear answer to who decides what at every hop. It's the page to read before the vertical case studies.
Ask a room of engineers "how does RAG work" and you'll get three boxes on a whiteboard: a vector database, an LLM, and an arrow between them. That diagram is not wrong; it's just so compressed that it hides every decision that matters.
The question I keep getting asked (by recruiters, by candidates, by anyone building their first RAG system) is deceptively simple:
When you use a vector database for a custom knowledge base, who actually decides what to retrieve? Is it the LLM? Is it the vector database? Something else?
The honest answer is: neither of them, in the sense people usually mean. Retrieval is a pipeline of mechanical operations, and different components make different decisions at different points. The vector database never decides anything cognitive; it runs math. The LLM may or may not decide, depending on which architecture you picked. The real deciders are components most tutorials barely mention: the embedding model, the metadata filter, the BM25 keyword index, and the reranker.
The rest of this page is a rigorous walkthrough of exactly who decides what, using one canonical query traced through a real production architecture.
Before we walk the flow, meet every component that participates. Pay attention to the "decides" column: this is the layer the conventional diagrams compress away.
This is the "oh, that's how it works" diagram. Trace the numbered arrows from 1 to 15 in sequence and you'll see the whole request path without reading a single word of prose. The components are grouped into five bands by architectural responsibility (user, application, identity, retrieval, generation), and the retrieval layer has all four pieces (embedding, vector DB, keyword index, reranker) visually clustered to reinforce that they are one subsystem, not four unrelated services.
Where the architectural diagram above encodes structure, this sequence diagram encodes time. Each vertical lane is one actor; time flows top to bottom. The numbered arrows are the same 15 steps, but now you can see exactly who calls whom in what order, which calls happen in parallel, and where the self-loops on the App Server lane represent internal work (fusion, prompt assembly, post-processing).
Now the detailed walk. Each step lists the actor, what goes in, what comes out, who decides, and what happens when it fails. The "why this matters" note appears on the steps where the common mental model diverges from what actually happens.
Everything above is naive RAG: a pre-programmed pipeline where the LLM is the last step and plays no role in deciding what gets retrieved. In agentic RAG, the LLM is placed at the top of the call stack and given the retrieval pipeline as a tool. It decides when to call it, what query to pass, and, crucially, whether it needs to call it again after seeing the first results.
The important insight: agentic RAG isn't a different retrieval system. It's the same retrieval system (same embedding model, same vector DB, same BM25, same reranker; steps 5-10 are literally unchanged) with an LLM placed at the top of the call stack. The vector database doesn't know the caller is now an LLM instead of app code, and doesn't care.
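To make the difference in call-stack position concrete, here is a minimal sketch. Everything in it is hypothetical: `retrieve` is a toy stand-in for the unchanged retrieval pipeline, and `scripted_decide` plays the role of the LLM with a hard-coded policy, so no real model API is assumed.

```python
from typing import Callable


def retrieve(query: str) -> list[str]:
    """Toy stand-in for the retrieval pipeline (steps 5-10); identical for both callers."""
    corpus = {
        "parental leave": "Eligible employees may take up to 12 weeks of leave.",
        "expense policy": "Expenses over $500 need manager approval.",
    }
    return [text for topic, text in corpus.items() if topic in query.lower()]


def naive_rag(question: str, llm: Callable[[str], str]) -> str:
    # App code retrieves exactly once; the LLM is the last step and has no say.
    chunks = retrieve(question)
    return llm(f"Context: {chunks}\nQuestion: {question}")


def agentic_rag(
    question: str,
    decide: Callable[[str, list[str]], dict],
    max_calls: int = 3,
) -> str:
    # The LLM (represented by `decide`) sits on top and picks each next action.
    gathered: list[str] = []
    for _ in range(max_calls):
        action = decide(question, gathered)
        if action["type"] == "answer":
            return action["text"]
        gathered.extend(retrieve(action["query"]))  # same pipeline, new caller
    return "I could not find enough information."


def scripted_decide(question: str, gathered: list[str]) -> dict:
    """Hypothetical hard-coded policy standing in for a tool-calling LLM."""
    if not gathered:
        return {"type": "search", "query": "parental leave"}
    return {"type": "answer", "text": f"Based on {len(gathered)} chunk(s): {gathered[0]}"}
```

Note that `retrieve` has no idea which of the two functions called it, which is the point of the paragraph above.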
If you were only allowed to remember four things from this page, remember these. These are the components that actually determine whether your retrieval is precise, and they're the ones most tutorials treat as footnotes.
The embedding model defines what "similar" means in your whole system. It is chosen once at design time and baked into every chunk you've indexed. Switching it means re-embedding your entire corpus. Choose wrong and no amount of reranking can save you.
The metadata filter is applied to the vector search before ANN runs. It owns permission scoping and freshness filtering. In enterprise RAG this is where access control lives: filter failure equals data leak. The vector DB is dumb; the filter is the brain.
BM25 is the lexical counterweight to semantic search. It catches exact-match cases (acronyms, error codes, proper nouns, SKU numbers) where embeddings tend to blur near-identical strings together. Running BM25 in parallel with vector search is table stakes in production; vector-only is a beginner mistake on enterprise corpora.
The reranker is a second, slower, more precise ML model that cross-encodes each (query, chunk) pair. The reason you can afford a slow model here is that you're only running it on the top-50 candidates the first pass surfaced. Most of your production top-5 precision comes from this step, not the vector DB.
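The ordering of these deciders (filter first, cheap vector pass second, slow reranker last) can be sketched as follows. This is an illustration under loud assumptions: the `Chunk` fields are hypothetical, the search is exhaustive cosine similarity rather than ANN, and `overlap_reranker` is a toy word-overlap scorer standing in for a real cross-encoder.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    chunk_id: str
    text: str
    vector: list[float]
    allowed_groups: set[str]  # hypothetical ACL metadata field


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_reranker(query: str) -> Callable[[Chunk], float]:
    """Toy stand-in for a cross-encoder: counts words shared with the query."""
    words = set(query.lower().split())
    return lambda chunk: len(words & set(chunk.text.lower().split()))


def search(
    query_vec: list[float],
    chunks: list[Chunk],
    user_groups: set[str],
    first_pass_k: int,
    rerank_score: Callable[[Chunk], float],
    final_k: int,
) -> list[Chunk]:
    # 1. Metadata filter first: chunks the user may not see are never scored.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    # 2. Cheap first pass: vector similarity over the visible set only.
    by_similarity = sorted(visible, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    candidates = by_similarity[:first_pass_k]
    # 3. Slow second pass runs on the few survivors, not the whole corpus.
    return sorted(candidates, key=rerank_score, reverse=True)[:final_k]
```

Because the filter runs before scoring, a restricted chunk cannot leak even when it is the closest vector to the query.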
The single most common point of confusion in RAG: 'does my embedding model need to match my LLM?' Short answer: no. Here's why.
A confusion I hear constantly: people assume that because the embedding model converts text to vectors, and the LLM also processes text, the two must somehow "speak the same language", as if you'd need OpenAI embeddings to use OpenAI LLMs, or Cohere embeddings to use a Cohere reranker. This is false. Embedding models and LLMs live in entirely separate parts of the pipeline and never see each other's outputs.
The embedding model used at INGEST time and the embedding model used at QUERY time must be the same model and same version. The LLM that receives the retrieved chunks is independent of both: pair any LLM with any embedding model.
The piece most people miss: your vector DB stores BOTH the vector AND the original text of each chunk. The vector is the index used to find the chunk. The original text is the payload sent to the LLM at generation time. The LLM never sees vectors at all; it only ever reads plain text.
The vector and the text are both stored on the same row. The vector finds the chunk (via cosine similarity); the text is what gets handed to the LLM. Some teams put the text in a metadata field on the vector record; others store it in a separate document table and look it up by chunk_id. Either pattern works; what matters is that the original text is preserved and is what the LLM ultimately reads.
Different embedding models produce vectors in incompatible geometric spaces. text-embedding-3-large vectors and BGE-M3 vectors aren't comparable; they live in different "universes." Switching the embedding model means re-embedding the entire corpus. This is the only matching rule in the whole pipeline.
The vector is the search index: it's used to find the chunk. The original text is the payload: it's what you pass to the LLM. Most teams put the text in a metadata field on the vector record, or look it up by chunk_id from a separate document store. The text is never lost.
The LLM receives only TEXT: the user's question and the original text of the retrieved chunks. It has no awareness of vectors, no awareness of which embedding model was used, no ability to query the vector DB itself. Pair any LLM with any embedding model.
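A small sketch of the vector-plus-text row described above, and of prompt assembly that uses only the text. The `Record` shape and the prompt wording are assumptions for illustration, not any particular vector DB's schema.

```python
from dataclasses import dataclass


@dataclass
class Record:
    chunk_id: str
    vector: list[float]  # search index: used only to find the chunk
    text: str            # payload: the only part the LLM ever reads


def build_prompt(question: str, retrieved: list[Record]) -> str:
    # Vectors are deliberately dropped here; the LLM receives plain text.
    context = "\n\n".join(f"[{r.chunk_id}] {r.text}" for r in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Since `build_prompt` never touches `vector`, swapping the embedding model changes which records arrive here, but not what the LLM is able to read.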
Every one of these is a valid production pairing. What works for your system depends on cost, latency, and quality, not on whether the embedding model and LLM share a vendor:
Mixing embedding models within the same index. If you embedded half your corpus with voyage-3-large and the other half with text-embedding-3-large, the cosine similarities between query and chunk are meaningless, because the two halves live in incompatible geometric spaces. You'd retrieve effectively random chunks from one half. The fix: keep one embedding model per index. To migrate models, build a parallel index, A/B test, then switch the read path atomically.
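One way to enforce "one embedding model per index" is to have the index reject any vector tagged with a different model version. The sketch below is a hypothetical in-memory wrapper, not a real vector DB client; the class and method names are invented for illustration.

```python
class EmbeddingModelMismatch(ValueError):
    """Raised when a vector comes from a different model than the index expects."""


class VersionedIndex:
    """Hypothetical in-memory index pinned to a single embedding model version."""

    def __init__(self, embedding_model_version: str) -> None:
        self.embedding_model_version = embedding_model_version
        self.rows: dict[str, tuple[list[float], str]] = {}

    def _check(self, version: str) -> None:
        if version != self.embedding_model_version:
            raise EmbeddingModelMismatch(
                f"index expects {self.embedding_model_version!r}, got {version!r}"
            )

    def upsert(self, chunk_id: str, vector: list[float], text: str,
               embedding_model_version: str) -> None:
        # Refuse writes from any other model so the vector space stays uniform.
        self._check(embedding_model_version)
        self.rows[chunk_id] = (vector, text)

    def check_query(self, embedding_model_version: str) -> None:
        # Call before searching: the query must be embedded with the ingest-time model.
        self._check(embedding_model_version)
```

A migration would then build a second `VersionedIndex` for the new model and switch the read path once it wins the A/B test.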
Because the embedding model decides what similar means in your search space, and that decision is made before the LLM is in the picture. If the embedding model puts "parental leave" far from "maternity benefits," the LLM never sees the right chunk to begin with, and no amount of LLM quality can recover what retrieval missed. That's why the next section goes deep on embedding model selection: the LLM is downstream of this decision and depends on it being right.
The single most consequential design decision in any RAG system, and the one most architecture diagrams render as one unlabeled box.
The four hidden deciders above told you the embedding model defines what "similar" means in your whole system. That's the headline. The follow-up question, the one that actually shapes your architecture, is which embedding model do you pick for which workload? Picking wrong is one of the most expensive mistakes in production RAG because the choice is baked into every chunk you've ingested. Switching means re-embedding the entire corpus.
Below are three workloads I see often, each with a different recommended embedding model and a different reason. The point isn't the specific model name (those change every quarter; see the section below) but the decision pattern: read your domain shape, then choose accordingly.
Long-form policy text in formal English. Sentences like "Eligible employees may take up to 12 weeks of unpaid leave under FMLA." Retrieval is forgiving: paraphrases ("maternity benefits" for "parental leave") are common, and exact-match isn't critical beyond legal terminology. Latency budget: 1-2 seconds.
Use a general-purpose web-trained model at 1024-1536 dimensions: text-embedding-3-large (OpenAI, shortened from its native 3072 dimensions via the API's dimensions parameter) or voyage-3-large (Voyage AI). Pair with a cross-encoder reranker (Cohere Rerank v3 or BGE Reranker v2-gemma) on the top-50.
Avoid a code-tuned model (CodeBERT, voyage-code-3) for this workload. Code embeddings prioritize structural similarity (token shape, syntax) over the semantic meaning of natural language; they collapse the distinction between "leave" and "leaving" because they care about tokens. HR queries demand semantic generalization, not structural alignment.
Short, dense, technical artifacts. CVE IDs, IOCs (indicators of compromise: file hashes, IP addresses, domain names), MITRE ATT&CK technique IDs (T1059, T1547), malware family names (Emotet, Qakbot), TTPs. Retrieval is precision-critical: "CVE-2024-1234" must NOT match "CVE-2024-1235." Latency budget: 100-500ms during alert triage.
Use a model that produces both dense and sparse representations. BGE-M3 is the strongest open-source choice because it emits dense + sparse + ColBERT-style multi-vector outputs in a single pass. Mature SOCs fine-tune BAAI/bge-large-en-v1.5 on their own (alert, related-alert) pairs from the SIEM.
Avoid text-embedding-3-small alone. It's cheap, fast, general-purpose, and useless for SecOps. It will treat every CVE ID as semantically near every other CVE ID. You'll triage the wrong incident, sometimes catastrophically. This is one of the rare workloads where "just use OpenAI" is wrong.
Structured-ish, short, latency-critical. "User u-8821 viewed product p-1247, last purchase 2026-04-18, segment: high-LTV, cart abandoned 2 hours ago." The retrieval question is "given this user's recent context, what's the most relevant offer or product to surface?" Latency budget: sub-100ms p99 because this runs on the page-render path. Volume: tens of millions of queries per day.
Use a SMALL, FAST embedding: text-embedding-3-small (1536-dim, ~30ms p50) or self-hosted BGE-small-en-v1.5 (384-dim, sub-10ms p99 on a single GPU). Aggressive prompt and embedding caching is the second half of the answer.
Avoid text-embedding-3-large at 3072 dimensions. Latency is too high (50-100ms per call) and the precision gain is invisible because the real signal lives in the metadata filter, not in the cosine similarity. You pay 5× the cost and 3× the latency for a quality difference no user perceives.
Embedding models improve roughly every quarter. text-embedding-ada-002 (Dec 2022) was state-of-the-art for roughly a year. text-embedding-3-large (Jan 2024) jumped MTEB by 4 points. voyage-3 (mid-2024) added another 2-3. BGE-M3 (early 2024) brought multilingual + multi-function to open-source. Each generation outperforms the prior on specific axes: long-document retrieval, multilingual, code, finance, biomedical.
That tempo has a direct architectural consequence: you cannot pick the "best" embedding model once and ship. You need an architecture that lets you re-evaluate every six months and migrate when a meaningfully better model appears. Concretely, three patterns:
Every chunk in your vector DB carries an embedding_model_version field. Adopting a new model means re-embedding the corpus into a parallel index, running an offline A/B (recall@10, MRR, NDCG on a held-out eval set), and only flipping the read path when the new index demonstrably wins on YOUR domain.
The retrieval layer routes different query types to different embedding models. Code queries → code-tuned embeddings. Domain-specific → domain-tuned. General → general. Each route owns its own index. The router is a lightweight intent classifier (small LLM call or rule-based) that runs before vector search.
Never adopt a new embedding model based on public benchmark scores alone. Build a domain-specific eval set (50-500 query-passage pairs labeled by SMEs from your corpus) and a test harness that measures recall@10, MRR, NDCG. MTEB wins don't always translate to your domain; your eval is the only ground truth.
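The three metrics named above are short enough to sketch directly. This is a minimal harness under stated assumptions: relevance is binary, each eval item is a (query, set of relevant chunk IDs) pair with at least one relevant ID, and `retriever` is any hypothetical callable that returns ranked chunk IDs for a query.

```python
import math
from typing import Callable


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the labeled-relevant chunks that appear in the top k.
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    # 1 / position of the first relevant chunk, or 0 if none was retrieved.
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    # Binary-relevance NDCG: discounted gain divided by the best achievable gain.
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(p + 1) for p in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def evaluate(
    retriever: Callable[[str], list[str]],
    eval_set: list[tuple[str, set[str]]],
    k: int = 10,
) -> dict[str, float]:
    # Assumes every eval item has a non-empty set of relevant chunk IDs.
    runs = [(retriever(query), relevant) for query, relevant in eval_set]
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }
```

Running `evaluate` once per candidate index on the same eval set gives the side-by-side numbers an offline A/B needs.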
Most teams already route queries across multiple LLMs (frontier for hard, mid-tier for medium, small for greetings or refusals). The natural follow-up: should we also route across multiple embedding indexes? Three architectures, in order of operational complexity:
One embedding model. One vector DB. One BM25 index. The router only changes which LLM consumes the retrieved chunks. Most production teams start here and stay here for 1-2 years. Simple to operate, sufficient for most domains.
One general-purpose index for the broad case, plus specialized indexes for domains where general embeddings demonstrably fail (security IOCs, code, biomedical). A domain classifier routes the query to the right index. Most enterprises with 3-5 domains land here.
Each domain has its own embedding model and index. Code → code-embed. Security → BGE-M3 with sparse. HR → voyage-3-large. The router decides both LLM and embedding index. Highest precision, highest ops cost. Worth it for 5+ distinct domains, each with its own quality bar.
Start with Tier 1. Move to Tier 2 ONLY when you have measured evidence (offline eval set + production logs) that one or more domains are systematically failing the general embedding. Move to Tier 3 ONLY when you have 5+ distinct domains, each with its own eval harness and SMEs. Most teams who jump straight to Tier 3 spend months on operational pain for precision gains they could have gotten with a better reranker.
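A rule-based router for the tiers above might look like the sketch below. The regular expressions and index names are illustrative assumptions, not a vetted classifier; the one design point it tries to show is that a query falls back to the general index whenever no specialized index exists, so Tier 1 is simply the case where only "general" is available.

```python
import re

# Illustrative patterns only: CVE IDs and MITRE ATT&CK technique IDs.
SECURITY_PATTERN = re.compile(r"\b(CVE-\d{4}-\d{4,}|T\d{4}(?:\.\d{3})?)\b", re.IGNORECASE)
# Illustrative code signals: a call like foo(), a def statement, or a traceback.
CODE_PATTERN = re.compile(r"\w+\(\)|\bdef \w+\(|\bTraceback\b")


def route(query: str, available_indexes: set[str]) -> str:
    """Pick an index name for the query, defaulting to the general index."""
    if "security" in available_indexes and SECURITY_PATTERN.search(query):
        return "security"
    if "code" in available_indexes and CODE_PATTERN.search(query):
        return "code"
    return "general"
```

Moving from Tier 1 to Tier 2 then means adding one name to `available_indexes` once the eval set shows the general index failing for that domain.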
Most RAG conversations focus on the embedding model. But the LLM that actually answers the question shapes RAG outcomes just as profoundly. The chunks you retrieve are only half the story; the other half is whether the LLM uses them faithfully.
Frontier models (Claude Sonnet 4.5, GPT-5) tend to stay grounded in the retrieved chunks and cite them. Smaller models (Haiku, GPT-4o-mini) are more likely to ignore retrieved context and answer from training data when the chunks don't perfectly align. For high-stakes RAG (HR compliance, security), faithfulness matters more than fluency.
When you stuff 20+ chunks into a prompt, models suffer from "lost in the middle": chunks at the start and end get more attention than the middle. Frontier models with native long context (Claude 4.x at 1M tokens, Gemini 2.5 Pro at 2M) handle this better than older models with retrofitted context expansion.
A high-faithfulness model says "I don't have enough information" when chunks are insufficient. A low-faithfulness model fabricates. For regulated domains (compliance, finance, security), refusal calibration is the difference between a useful product and a liability.
A typical RAG call sends 5-10K tokens of context. At Claude Opus pricing ($15/M input) and the 10K upper end, 1M queries/month comes to about $150K/month. The same volume on Haiku 4.5 ($1/M) comes to about $10K. The cost gap forces a routing decision: use the frontier model for the queries where faithfulness matters and a cheap model for everything else.
Not all LLMs are equally good at tool use. Claude and GPT-5 are excellent. Smaller open-source models often need fine-tuning to reliably emit valid tool-call JSON. If you're doing agentic RAG (LLM decides when to retrieve), the LLM choice matters more than for naive RAG.
Some models stream tokens cleanly while citing sources mid-response (Claude is particularly strong here). Others batch-emit citations at the end, which is uglier in UI. Test the citation pattern in your specific UX before committing; this is a product-quality decision, not just a model-quality one.
Embedding model quality determines whether the right chunks are retrieved. LLM quality determines whether those chunks are used faithfully. As a rough illustration: a 90%-correct embedding paired with a low-faithfulness LLM produces 60%-correct answers. A 70%-correct embedding paired with a high-faithfulness LLM produces 65%, but with honest "I don't know" refusals on the missing 30%, which is what compliance demands. The pairing matters more than either model in isolation.
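The cost arithmetic earlier in this section (1M queries/month, 10K context tokens per query, $15/M versus $1/M input pricing) is easy to reproduce, and the same function shows what a routed split would cost. The 20% frontier share used in the usage note is a made-up figure for illustration, and output-token cost is ignored.

```python
def monthly_input_cost(
    queries_per_month: float,
    context_tokens_per_query: float,
    usd_per_million_input_tokens: float,
) -> float:
    # Input-token cost only; output tokens are not modeled in this sketch.
    total_tokens = queries_per_month * context_tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


def routed_monthly_cost(
    queries_per_month: float,
    context_tokens_per_query: float,
    frontier_price: float,
    cheap_price: float,
    frontier_share: float,
) -> float:
    # frontier_share is the fraction of queries sent to the frontier model.
    frontier = monthly_input_cost(
        queries_per_month * frontier_share, context_tokens_per_query, frontier_price
    )
    cheap = monthly_input_cost(
        queries_per_month * (1 - frontier_share), context_tokens_per_query, cheap_price
    )
    return frontier + cheap
```

With the figures from the text, `monthly_input_cost(1_000_000, 10_000, 15.0)` gives 150,000 and the $1/M model gives 10,000; sending a hypothetical 20% of traffic to the frontier model lands at about 38,000.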
A short catalog of common confusions about how RAG systems work. Each one is answered in one or two sentences: the kind of quick clarity that separates someone who has shipped RAG from someone who has only read about it.
In naive RAG: neither, really. The application code decides (by writing the pipeline), the embedding model decides (by defining similarity), and the vector DB runs math. The LLM is a passive consumer of whatever was handed to it.
In agentic RAG: the LLM decides when and how to call the retrieval tool, but the vector DB still just runs math under the hood.
No, they are completely decoupled. The embedding model is used to convert your chunks (at ingest) and your query (at retrieval) into vectors for similarity search. The LLM only ever receives the original TEXT of the retrieved chunks, never the vectors themselves. You can pair OpenAI embeddings with Anthropic Claude, voyage embeddings with GPT-5, BGE-M3 with Gemini; any combination works.
The ONE matching rule: the embedding model used at ingest must be the same model and same version as the one used at query time. Different embedding models produce vectors in incompatible geometric spaces, so switching means re-embedding the entire corpus. The LLM choice is independent of all of this; it just reads text.
Plain text only. The vector DB stores BOTH the vector and the original text of each chunk on the same row. The vector is the search index used to find the chunk; the original text is the payload. At generation time, the app server retrieves the chunks by vector similarity, looks up their original text, and inserts that text into the LLM's prompt as ordinary prose. The LLM has no concept of vectors at all; to it, the retrieved chunks just look like context the application decided to include.
No and yes. A vector database is a specialized data store that runs approximate nearest-neighbor math over high-dimensional vectors. It has no language understanding, no reasoning, and no ability to generate text. You need both: the vector DB holds your documents, the LLM answers the question, and a pipeline connects them.
Because the LLM can't search billions of documents in its head. The embedding model's job is to convert every document (and every query) into a vector so that cheap math (cosine similarity) can find the few relevant chunks. The LLM then reasons over those chunks. Trying to "just feed everything to the LLM" fails on context length, cost, and latency.
No. The LLM only sees what's in the current prompt. If you retrieved 5 chunks and put them in the prompt, the LLM knows about those 5 chunks and nothing else. It has no persistent awareness of your corpus, and it can't "look something up" unless you give it a tool to do so (that's agentic RAG).
In naive RAG, retrieval happens once per user turn, right before the LLM is called. In agentic RAG, it happens zero or more times per turn; the LLM decides. It never happens per-token (that would be absurdly expensive).
Not reliably. LLMs tend to use whatever context they're given, even when it doesn't answer the question, a phenomenon called "context contamination." This is precisely why the reranker step matters: if you hand the LLM a clean top-5, you avoid this problem. If you hand it a noisy top-50, it may confidently cite irrelevant material.
RAG inserts relevant knowledge into the prompt at query time. Great for facts that change (policies, pricing, docs). Cheap, updatable.
Fine-tuning updates the model's weights with your data. Great for style, tone, format adherence. Bad for facts that change. Expensive.
Long context shoves everything into the prompt without retrieval. Works for small corpora, fails on latency and cost above ~100K tokens, and still suffers from "lost in the middle" attention problems.
Yes: vector search and BM25 run in parallel, then their results are fused with a technique called Reciprocal Rank Fusion (RRF). Both searches typically hit the same data store (e.g., Postgres with pgvector + tsvector), so the overhead is one extra query, not two separate trips.
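Reciprocal Rank Fusion itself is only a few lines: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The sketch below uses k = 60, the constant commonly used for RRF; the alphabetical tie-break is an assumption added here to make the output deterministic.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more for ranking high in more of the lists.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["c", "a", "d"]` returns `["a", "c", "b", "d"]`: "a" and "c" appear in both lists and beat the documents that only one search found. Because RRF uses ranks rather than raw scores, cosine similarities and BM25 scores never need to be put on a common scale.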
It depends on how you designed the pipeline. A robust production system degrades to BM25-only search if the vector DB is down, and degrades further to "I'm unable to look that up right now" if both are down. The LLM itself never fails because of the vector DB; it just has less context to work with.
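The degradation ladder described above can be sketched as a small wrapper. The search callables and mode names are hypothetical, fusion of the two result lists is omitted, and the "vector_only" branch is an assumption added for symmetry since the text only covers the vector DB being down.

```python
from typing import Callable

Search = Callable[[str], list[str]]
UNAVAILABLE_MESSAGE = "I'm unable to look that up right now."


def retrieve_with_fallback(
    query: str, vector_search: Search, keyword_search: Search
) -> tuple[str, list[str]]:
    """Return (mode, chunk_ids), degrading instead of raising when a backend fails."""
    results: dict[str, list[str]] = {}
    for name, search in (("vector", vector_search), ("bm25", keyword_search)):
        try:
            results[name] = search(query)
        except Exception:
            # Sketch only: real code would catch specific errors and log them.
            continue
    if "vector" in results and "bm25" in results:
        return "hybrid", results["vector"] + results["bm25"]  # fusion step omitted
    if "bm25" in results:
        return "bm25_only", results["bm25"]
    if "vector" in results:
        return "vector_only", results["vector"]  # assumed; not covered in the text
    return "unavailable", []
```

When the mode comes back as "unavailable", the application would answer with `UNAVAILABLE_MESSAGE` rather than calling the LLM with no context.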
Because embeddings average semantic meaning across the whole chunk. A 100-page PDF embedded as one vector is a muddy average of 100 topics, which is useless. A 10-word chunk is too narrow to contain an answer. Production systems usually land on 500-1000 tokens with 10% overlap between adjacent chunks, tuned against an offline eval set.
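Fixed-size chunking with overlap is a sliding window over the token sequence. The sketch below takes an already-tokenized list, so the tokenizer is left as an assumption; the defaults of 800 tokens and 10% overlap are just one point inside the range mentioned above.

```python
def chunk_tokens(
    tokens: list, chunk_size: int = 800, overlap_ratio: float = 0.10
) -> list[list]:
    """Split tokens into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

Each chunk starts `step` tokens after the previous one, so the final `overlap` tokens of one chunk reappear at the start of the next, which keeps a sentence that straddles a boundary retrievable from at least one side.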
In naive RAG, the application code decides the pipeline, the embedding model decides similarity, the vector DB runs math, and the LLM only sees what the code handed it. In agentic RAG, the LLM gets a retrieval tool and decides when to call it, but the vector DB, embedding model, and reranker still do the actual finding. Either way, the four hidden deciders (embedding model, metadata filter, BM25, reranker) matter more than the vector DB itself.
This page is the generic anatomy: the template every vertical case study is built on. The vertical pages show how the same 15-step pipeline is specialized for a specific domain with domain-specific components, permissions, evaluation, and gotchas.