Key takeaways
- RAG connects a language model to an external knowledge base, so the model can ground its answers in specific documents rather than relying only on training memory.
- The term was introduced in a 2020 paper by Lewis et al. at Facebook AI Research, though the general idea predates the name.
- RAG has two halves: a retriever (fetches relevant documents) and a generator (writes the answer using them).
- It is the simplest practical way to make an LLM answer questions about private data — product manuals, internal policies, recent news — without retraining the model.
- RAG reduces but does not eliminate hallucinations. The quality of retrieval is usually the bottleneck.
The problem RAG solves
A large language model is frozen at training time. It knows what was in its training corpus, which cuts off months or years before the model is deployed. It cannot know your company’s Q3 earnings report, the latest documentation for your product, or a private customer-support history. It also has a tendency to hallucinate — produce fluent, confident-sounding answers that are factually wrong.

RAG addresses both problems by borrowing a trick from information retrieval. Before the model writes its answer, a retrieval system searches a knowledge base for the most relevant documents. Those documents are pasted into the prompt as context. The model now has the facts it needs sitting in front of it, and can cite them directly in its answer.
How RAG works, step by step
Step 1: build the knowledge base
Collect the documents you want the model to be able to draw from — product docs, Slack archives, customer-support tickets, a wiki, whatever. Split each document into chunks (typically 200 to 1,000 tokens each). Pass each chunk through an embedding model to produce a dense vector that captures its semantic content. Store the chunk plus its vector in a vector database — see our vector databases explainer.
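The indexing step above can be sketched in a few lines. Everything below is a toy stand-in: the "embedding" just hashes words into buckets, and the "vector database" is an in-memory list — a real pipeline would call an embedding model and a proper vector store instead.

```python
import hashlib
import math

def chunk(text, size=200, overlap=40):
    """Split a document into overlapping word-based chunks.
    (Real pipelines usually count model tokens, not words.)"""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text, dim=64):
    """Toy stand-in for an embedding model: hash each word into one
    of `dim` buckets, then L2-normalise. A real system would call an
    embedding model such as text-embedding-3 or BGE here."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# In-memory stand-in for a vector database: (chunk, vector) pairs.
index = []
for doc in ["RAG grounds a language model in documents retrieved "
            "from a knowledge base at query time."]:
    for piece in chunk(doc, size=8, overlap=2):
        index.append((piece, embed(piece)))
```

The overlap parameter is what keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.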
Step 2: retrieve at query time
When the user asks a question, embed the question with the same embedding model. Do a nearest-neighbour search in the vector database to find the top-K chunks most semantically similar to the question. K is typically between 3 and 20. Optionally apply a reranker — a second model that scores the candidates more precisely — to push the best matches to the top.
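The retrieval step can be sketched the same way. Here the "vector database" is a plain Python list and nearest-neighbour search is a brute-force cosine scan; a production system would use an approximate-nearest-neighbour index and the sample policy sentences are, of course, invented.

```python
import hashlib
import math

def embed(text, dim=256):
    """Toy stand-in for an embedding model: strip punctuation, hash
    words into buckets, L2-normalise. Same scheme must be used for
    both documents and queries."""
    vec = [0.0] * dim
    for word in text.lower().split():
        word = word.strip(".,?!")
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(question, index, k=3):
    """Embed the question, score every stored chunk by cosine
    similarity, and return the top-K chunk texts."""
    q = embed(question)
    # Vectors are unit-normalised, so the dot product IS the cosine.
    scored = sorted(index,
                    key=lambda item: sum(a * b for a, b in zip(q, item[1])),
                    reverse=True)
    return [text for text, _ in scored[:k]]

index = [(text, embed(text)) for text in [
    "The parental leave policy grants 16 weeks of paid leave.",
    "Quarterly earnings are reported in the finance wiki.",
    "Laptops are refreshed every three years.",
]]
top = retrieve("How long is parental leave?", index, k=1)
```

A reranker would slot in between the `sorted` call and the final slice: over-retrieve (say K=50), then let the second model reorder that candidate set.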
Step 3: generate with context
Construct a prompt that includes the retrieved chunks plus the user’s question and an instruction like “Answer the question using only the information in the context below. If the context does not contain the answer, say so.” The language model reads this augmented prompt and produces a grounded answer, often with inline citations pointing back to the source chunks.
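A sketch of the prompt-assembly step. The instruction wording and the bracketed citation format are illustrative choices, not a fixed standard:

```python
def build_prompt(question, chunks):
    """Assemble the augmented prompt that is sent to the LLM."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(chunks))
    return (
        "Answer the question using only the information in the context "
        "below. If the context does not contain the answer, say so. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

chunks = ["The parental leave policy grants 16 weeks of paid leave."]
prompt = build_prompt("How long is parental leave?", chunks)
# `prompt` can now be sent to any chat or completion API.
```

Numbering the chunks is what makes inline citations possible: the model can emit "[1]" and the UI can resolve it back to the source document.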
Why it works
RAG exploits a division of labour between two components that are each good at different things. Retrievers are good at finding relevant information but terrible at writing coherent prose. Language models are the opposite — fluent writers, but unreliable about facts. RAG lets each do what it does best. The model does not need to memorize the knowledge base; it needs only to synthesize from the retrieved context.
This also makes RAG efficient. Updating the knowledge base means re-indexing documents, not retraining the model. A company can ship new policies on Monday and have its RAG system answer questions about them by Monday afternoon, without touching the model weights.
RAG vs. fine-tuning
Both RAG and fine-tuning adapt an LLM to your domain, but they solve different problems. Fine-tuning bakes knowledge or style into the model weights — good for teaching the model a new tone of voice, a specialized vocabulary, or a consistent output format. RAG grounds the model in retrievable facts — good for up-to-date information, domain-specific knowledge, and citability. In practice many production systems use both. See our fine-tuning comparison for details.
Common challenges
Retrieval quality dominates
If the retriever surfaces the wrong chunks, the generator cannot recover. Most improvements to a RAG system come from making retrieval smarter: better chunk sizing, hybrid search that combines keyword matching with semantic search, metadata filtering, query rewriting (rephrasing the user's question before searching), and reranking.
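One common way to implement hybrid search is reciprocal-rank fusion (RRF), which merges the ranked lists from a keyword engine and a vector index without having to calibrate their incompatible raw scores. A minimal sketch (the doc IDs are made up):

```python
def rrf(rankings, k=60):
    """Reciprocal-rank fusion: merge several ranked lists into one.
    Each ranking is a list of doc IDs, best first. A document scores
    1/(k + rank) in every list it appears in; k=60 is the value
    conventionally used in the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits  = ["doc3", "doc1", "doc7"]   # e.g. from BM25
semantic_hits = ["doc1", "doc5", "doc3"]   # e.g. from vector search
fused = rrf([keyword_hits, semantic_hits])
```

Documents that appear high in both lists (`doc1`, `doc3`) float to the top, which is exactly the behaviour you want from hybrid retrieval.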
Context-window stuffing
Naive RAG pipelines simply dump the top-K chunks into the prompt. This wastes tokens, introduces noise, and can push the model's context window to its limit. Techniques like map-reduce summarization and hierarchical retrieval help manage long contexts, while "needle in a haystack" testing measures whether the model can actually use a fact buried deep in a long prompt.
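The simplest guard against context stuffing is a token budget: keep the best-ranked chunks until the budget is spent, then stop. The 4-characters-per-token estimate below is a rough heuristic for illustration; a real pipeline would count with the model's own tokenizer.

```python
def fit_to_budget(chunks, budget_tokens=1500):
    """Greedily keep the highest-ranked chunks that fit a token budget."""
    kept, used = [], 0
    for text in chunks:  # chunks arrive best-first from the retriever
        cost = len(text) // 4  # rough heuristic: ~4 characters per token
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

Because the list is already sorted by relevance, truncating from the tail discards the least useful context first.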
Hallucinations still happen
Even with good retrieved context, LLMs sometimes ignore the context and generate from memory. Careful prompt design (“Answer ONLY using the context below”) helps, as does showing the retrieved chunks in the UI so users can verify.
Chunking the documents
How you split documents profoundly affects retrieval quality. Too-small chunks lack context; too-large chunks dilute the semantic signal. Overlapping chunks help preserve context across chunk boundaries. Structural chunking (by heading, section, or table) often beats fixed-size chunking for structured documents.
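A sketch of structural chunking for markdown-style documents: split at headings so each chunk is one self-contained section rather than an arbitrary window of tokens.

```python
import re

def chunk_by_heading(markdown_text):
    """Structural chunking: split a markdown document at h1-h3
    headings so each chunk is one self-contained section."""
    # Zero-width split: cut immediately *before* each heading line.
    parts = re.split(r"(?m)^(?=#{1,3} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Refunds\nRefunds take 5 days.\n## Exceptions\nGift cards are final."
sections = chunk_by_heading(doc)
```

Each section keeps its heading, which doubles as useful metadata (and a natural citation label) at retrieval time.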
When to use RAG
- Building a Q&A system over private or proprietary documents.
- Adding fresh information an LLM could not know from training.
- Providing citations so users can verify answers.
- Reducing hallucinations on factual questions.
- Handling a large knowledge base where fine-tuning would be impractical.
When not to use RAG
- Teaching the model a new output style or format (use fine-tuning).
- Simple prompts where the LLM’s training knowledge is sufficient.
- Fully structured data better served by a SQL query.
- Real-time, low-latency use cases where retrieval adds too much overhead.
The RAG ecosystem
Practical RAG stacks typically combine an embedding model (OpenAI text-embedding-3, Cohere, or open alternatives like BGE), a vector database (Pinecone, Weaviate, Qdrant, pgvector), a generator LLM (GPT, Claude, Gemini, Llama), and orchestration tooling (LangChain, LlamaIndex, or custom code). Major vendors now offer turnkey RAG services — NVIDIA's NeMo Retriever, Amazon Bedrock Knowledge Bases, Azure AI Search — that bundle the pieces together. For more context on the LLMs that drive RAG, see our large language models coverage.
Frequently asked questions
Is RAG just glorified search with AI on top?
In a sense, yes — and that framing is a feature, not a bug. RAG explicitly separates the retrieval step from the generation step, so you can evaluate and improve each independently. Search decides what information to surface; the LLM decides how to phrase the answer. The value added by the LLM is the ability to synthesize across multiple retrieved passages, resolve ambiguity, and produce natural-language answers in whatever style or format you want — tasks that classical search engines cannot do.
Does RAG eliminate hallucinations?
No, but it reduces them significantly for fact-based questions. Hallucinations can still creep in if the retrieved context is ambiguous, if the LLM decides the context is incomplete and fills in from memory, or if the user asks something the knowledge base does not cover. Good RAG systems include confidence signals, explicit “I don’t know” fallbacks, and user-visible citations so humans can verify claims. Combining RAG with techniques like answer verification or multi-step agent loops further reduces error rates.
Do I need a vector database to do RAG?
Not strictly. Keyword search (BM25), full-text search engines (Elasticsearch, OpenSearch), or even grep-style retrieval can back a RAG system. Vector search adds semantic matching — finding documents about the same topic even when they use different words — which is why it has become the default for RAG. In practice, the best RAG systems use hybrid retrieval that combines keyword and vector search to get the best of both.
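As a concrete illustration that no vector database is required, here is a minimal keyword retriever: plain term-frequency scoring with a simple inverse-document-frequency weight — in the spirit of BM25 but much cruder. The sample documents are invented.

```python
import math
from collections import Counter

def keyword_retrieve(question, docs, k=2):
    """Minimal keyword retrieval: term frequency weighted by a simple
    IDF factor. No vectors, no embeddings, no external index."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    # Document frequency: how many docs contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        tf = Counter(tokens)
        return sum(tf[term] * math.log(1 + n / df[term])
                   for term in question.lower().split() if term in df)

    ranked = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
    return [docs[i] for i in ranked[:k]]

docs = [
    "shipping is free over 50 dollars",
    "returns accepted within 30 days",
    "our office is in berlin",
]
best = keyword_retrieve("free shipping", docs, k=1)
```

The rest of the pipeline — prompt assembly and generation — is identical regardless of which retriever sits in front.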