How language models search external document databases to generate more precise and current answers
RAG extends the knowledge of LLMs beyond their training data. Instead of relying only on facts stored in model weights, relevant documents are retrieved at runtime and added to the context. This reduces hallucinations and enables up-to-date answers.
RAG is the practical application of context mechanisms: Long context windows enable more retrieved documents, and efficient KV-caches make this economical.
ChatGPT with browsing, Claude with document upload, and most enterprise LLM applications use RAG. It's the bridge between static model knowledge and dynamic data sources.
Large language models are trained on data up to a certain cutoff date. They don't automatically have access to the latest information, proprietary documents, or specialized databases.
When they need to answer questions about information outside their training data, they tend to hallucinate: they produce plausible-sounding but incorrect answers.
Solution: Retrieval-Augmented Generation (RAG) combines the power of language models with external knowledge bases. The model can retrieve relevant documents and base its answer on them.
The process starts with a user question, which will be transformed into the same vector representation as the documents in the database.
Note: At this point, the question is still just text – not a numerical vector.
An embedding model converts the question into a high-dimensional vector. This vector captures the semantic meaning of the question in a vector space.
• Sentence-BERT (384-768 dims)
• Multilingual E5 (1024 dims)
• nomic-embed-text (768 dims)
Query and documents must be embedded with the same model to be comparable.
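A minimal sketch of this step, assuming the sentence-transformers library with the all-MiniLM-L6-v2 model (a 384-dimensional Sentence-BERT variant; both choices are illustrative):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Query and documents must be embedded with the same model.
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim Sentence-BERT variant

documents = [
    "RAG retrieves relevant documents at query time.",
    "KV caching reduces the cost of long contexts.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

query = "How does retrieval-augmented generation work?"
query_vector = model.encode(query, normalize_embeddings=True)

print(query_vector.shape)  # (384,)
```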
The vector database calculates the similarity between the query vector and all stored document vectors. This is done using distance metrics like Cosine Similarity.
Complexity: O(n) with linear search, or roughly O(log n) with approximate nearest-neighbor indexes (e.g., FAISS, Annoy).
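A linear-search sketch with NumPy (the O(n) case), continuing the variables from the embedding example above; with L2-normalized vectors, cosine similarity reduces to a dot product:

```python
import numpy as np

def top_k_cosine(query_vector, doc_vectors, k=3):
    """Linear O(n) search: cosine similarity of the query against every document."""
    scores = doc_vectors @ query_vector  # dot product = cosine for normalized vectors
    top_idx = np.argsort(scores)[::-1][:k]
    return top_idx, scores[top_idx]

top_idx, top_scores = top_k_cosine(query_vector, doc_vectors, k=3)
```

An ANN index such as FAISS or Annoy replaces this exhaustive scan for larger corpora.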
After retrieval, the top-k documents (usually k=3-5) are sorted by similarity. Optionally, a second ranking with a cross-encoder follows for more accurate ordering.
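A reranking sketch, assuming the CrossEncoder class from sentence-transformers and the ms-marco-MiniLM-L-6-v2 cross-encoder (an illustrative model choice); it continues the variables from the retrieval sketch above:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, document) pair jointly - slower but more accurate
# than comparing precomputed embeddings.
candidates = [documents[i] for i in top_idx]
pair_scores = reranker.predict([(query, doc) for doc in candidates])

# Re-sort the top-k candidates by cross-encoder score.
reranked = [doc for _, doc in sorted(zip(pair_scores, candidates),
                                     key=lambda pair: pair[0], reverse=True)]
```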
Dense retrieval: semantic similarity. Captures synonyms, but misses exact matches such as the error code "Error TS-999".
Sparse retrieval (BM25): keyword-based. Captures exact matches, but misses synonyms.
The original prompt is augmented with the retrieved documents. The model now receives context to provide an informed answer.
Token overhead: 3-5 documents at 256-512 tokens each ≈ 768-2,560 additional tokens per query.
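A sketch of the augmentation step; the prompt template is illustrative, and `query` and `reranked` come from the sketches above:

```python
def build_augmented_prompt(query, retrieved_docs):
    """Prepend the retrieved documents as numbered context before the question."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved_docs))
    return (
        "Answer the question using only the context below. "
        "Cite sources as [1], [2], ...\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_augmented_prompt(query, reranked)
```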
The language model generates an answer based on the augmented prompt. The answer should be more precise and current since it's based on external sources.
Advantages:
✓ Current information
✓ Source citations
✓ Fewer hallucinations

Disadvantages:
✗ Retrieval latency
✗ Faulty chunks
✗ Lost-in-the-Middle
Modern RAG systems combine multiple techniques for better results:
| Technique | Approach | Advantages | Disadvantages |
|---|---|---|---|
| Dense | Semantic vectors | Captures synonyms | Misses exact matches |
| Sparse (BM25) | Keyword matching | Exact matches | Misses synonyms |
| Hybrid | Dense + Sparse combined (RRF) | Both advantages | More complex implementation |
| Reranker | Cross-Encoder on Top-k | More precise ranking | Additional latency |
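A hybrid-retrieval sketch combining BM25 and dense scores with Reciprocal Rank Fusion (RRF). It assumes the rank_bm25 package for the sparse side and reuses `documents`, `query`, `doc_vectors`, and `query_vector` from the earlier sketches; k=60 is the commonly used RRF constant:

```python
# pip install rank_bm25
import numpy as np
from rank_bm25 import BM25Okapi

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Sparse ranking: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
sparse_ranking = list(np.argsort(bm25.get_scores(query.lower().split()))[::-1])

# Dense ranking: cosine similarity, as in the retrieval step above.
dense_ranking = list(np.argsort(doc_vectors @ query_vector)[::-1])

# Documents ranked high by either method rise to the top of the fused list.
fused_ids = rrf_fuse([sparse_ranking, dense_ranking])
hybrid_top_k = [documents[i] for i in fused_ids[:3]]
```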
Fascinating: Language models often pay less attention to the middle of long contexts. Relevant information placed there can be ignored ("Lost in the Middle").
Solution: Place critical documents at the beginning or end of the context. Alternatively, use "Found-in-the-Middle" techniques such as document reordering.
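A reordering sketch for this mitigation: the best-ranked documents are placed at the edges of the context and the weakest in the middle (a common heuristic, not a fixed algorithm):

```python
def reorder_for_position_bias(ranked_docs):
    """Alternate documents to the front and back so the strongest ones
    end up at the start and end of the context, the weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Ranks 1..5 become [1, 3, 5, 4, 2]: the two best documents sit at the edges.
ordered_docs = reorder_for_position_bias(reranked)
```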
Here is a simplified example of a RAG implementation:
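(A minimal end-to-end sketch, assuming sentence-transformers for embeddings; `generate` stands in for any LLM call and is purely illustrative.)

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer


class SimpleRAG:
    """Minimal RAG pipeline: embed, retrieve, augment, generate."""

    def __init__(self, documents, model_name="all-MiniLM-L6-v2"):
        self.documents = documents
        self.embedder = SentenceTransformer(model_name)
        # Index all documents once up front.
        self.doc_vectors = self.embedder.encode(documents, normalize_embeddings=True)

    def retrieve(self, query, k=3):
        """Return the k most similar documents (cosine similarity via dot product)."""
        query_vector = self.embedder.encode(query, normalize_embeddings=True)
        scores = self.doc_vectors @ query_vector
        top_idx = np.argsort(scores)[::-1][:k]
        return [self.documents[i] for i in top_idx]

    def answer(self, query, generate, k=3):
        """Augment the prompt with retrieved context and pass it to an LLM."""
        context = "\n\n".join(self.retrieve(query, k))
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:"
        )
        return generate(prompt)  # generate() is any LLM call, e.g. an API client


# Usage (the lambda is a placeholder for a real LLM call):
rag = SimpleRAG([
    "RAG retrieves relevant documents at query time.",
    "KV caching reduces the cost of long contexts.",
])
print(rag.answer("What does RAG do?", generate=lambda p: p))
```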
RAG is ideal for:
RAG is not ideal for: