The Problem: Knowledge Boundary and Hallucinations

Large language models are trained on data up to a fixed cutoff date. Out of the box, they have no access to the latest information, proprietary documents, or specialized databases.

When asked about information outside their training data, they tend to hallucinate: they produce plausible-sounding but incorrect answers.

Solution: Retrieval-Augmented Generation (RAG) combines language models with external knowledge bases. At query time, the system retrieves relevant documents and the model grounds its answer in them.

Figure: Embedding Space & Retrieval (documents are embedded as vectors; the top-k closest to the query are selected)
1. Input: User Question

The process starts with a user question, which will be projected into the same vector space as the documents in the database.

Input: "What is Grouped Query Attention?"

Note: At this point, the question is still just text – not a numerical vector.

2. Embedding: Question → Vector

An embedding model converts the question into a high-dimensional vector that captures its semantic meaning.

Embedding Model: sentence-transformers/multilingual-e5-large
Output Shape: (1, 1024)
Query Vector: [-0.23, 0.45, ..., 0.12]

Popular Embedding Models

• Sentence-BERT (384-768 dims)
• Multilingual E5 (1024 dims)
• nomic-embed-text (768 dims)

Important

Query and documents must be embedded with the same model to be comparable.
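
A minimal sketch of this step with the sentence-transformers library (an assumed choice, matching the model family named above; on the Hugging Face Hub the model is usually published as intfloat/multilingual-e5-large). Note that E5 models expect a "query: " prefix for queries and "passage: " for documents.

# Sketch: embed the user question with sentence-transformers (assumed setup).
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("intfloat/multilingual-e5-large")

query = "What is Grouped Query Attention?"
query_embedding = embedding_model.encode(
    ["query: " + query],            # E5 models expect the "query: " prefix
    normalize_embeddings=True,      # unit length -> dot product equals cosine similarity
)
print(query_embedding.shape)        # (1, 1024), the output shape shown above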

3. Retrieval: Finding Similar Documents

The vector database computes the similarity between the query vector and all stored document vectors, typically using cosine similarity.

Cosine Similarity(query, doc) = (query · doc) / (||query|| × ||doc||)
Value range: -1 (opposite) to +1 (identical)
Threshold: > 0.5 is usually considered relevant

Complexity: O(n) for exhaustive linear search, or roughly O(log n) with approximate nearest-neighbor indexes (e.g., FAISS, Annoy).
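
A brute-force version of this computation with NumPy (variable names are illustrative: query_embedding comes from the embedding sketch above, doc_embeddings stands for the pre-computed document matrix). With unit-normalized vectors the cosine similarity reduces to a dot product, which is also what an inner-product index such as FAISS's IndexFlatIP computes at scale.

# Sketch: cosine similarity between the query vector and all document vectors.
import numpy as np

def cosine_similarity(query_vec, doc_matrix):
    # query_vec: shape (d,); doc_matrix: shape (n_docs, d)
    query_vec = query_vec / np.linalg.norm(query_vec)
    doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_matrix @ query_vec            # shape (n_docs,), values in [-1, 1]

scores = cosine_similarity(query_embedding[0], doc_embeddings)
top_k = np.argsort(scores)[::-1][:3]         # indices of the 3 most similar documents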

4. Ranking: Top-k Documents

After retrieval, the top-k documents (usually k=3-5) are sorted by similarity. Optionally, a second ranking with a cross-encoder follows for more accurate ordering.

Stage 1 (Bi-Encoder): fast, returns top-100
Stage 2 (Cross-Encoder): precise, narrows to top-10
Stage 3 (Reranker): optional, final ordering
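
A sketch of the cross-encoder stage with sentence-transformers; the model id is one common public choice, not something prescribed by the text, and documents, query and top_k reuse names from the surrounding sketches.

# Sketch: rerank bi-encoder candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # assumed model choice

candidates = [documents[i] for i in top_k]        # hits from the bi-encoder stage
pairs = [(query, doc) for doc in candidates]
ce_scores = reranker.predict(pairs)               # one relevance score per (query, doc) pair

order = sorted(range(len(candidates)), key=lambda i: ce_scores[i], reverse=True)
top_k_docs = [candidates[i] for i in order[:3]]   # keep the best few for the prompt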

Dense Retrieval

Semantic similarity. Captures synonyms, but misses exact matches ("Error TS-999").

Sparse Retrieval (BM25)

Keyword-based. Captures exact matches, but misses synonyms.
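
For comparison, a sparse retrieval sketch with the rank_bm25 package (an assumed choice; BM25 is also what engines like Elasticsearch/OpenSearch use). Tokenization here is a naive whitespace split.

# Sketch: keyword-based retrieval with BM25.
from rank_bm25 import BM25Okapi

tokenized_docs = [doc.lower().split() for doc in documents]   # naive tokenization
bm25 = BM25Okapi(tokenized_docs)

bm25_scores = bm25.get_scores(query.lower().split())
# An exact term such as "Error TS-999" scores highly here even without semantic overlap.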

5. Augmentation: Extending the Prompt

The original prompt is augmented with the retrieved documents. The model now receives context to provide an informed answer.

System: You are an assistant...
Context:
[Doc 1] Grouped Query Attention (GQA) is...
[Doc 2] GQA reduces the KV-Cache by...
[Doc 3] Models with GQA: Llama 3...
Question: What is Grouped Query Attention?

Token overhead: 3-5 documents at 256-512 tokens each add roughly 768-2560 extra tokens per query.
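
A sketch of the prompt assembly; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and top_k_docs reuses the name from the reranking sketch.

# Sketch: build the augmented prompt and roughly estimate the token overhead.
def build_prompt(query, docs):
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(docs))
    return (
        "System: You are an assistant. Answer only from the given context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

augmented_prompt = build_prompt(query, top_k_docs)
approx_tokens = len(augmented_prompt) // 4        # crude ~4 characters per token heuristic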

6. Generation: Answer with Context

The language model generates an answer based on the augmented prompt. The answer should be more accurate and up to date, since it is grounded in external sources.

Output: "Grouped Query Attention is a variant of Multi-Head Attention where multiple Query heads share a Key-Value head. This reduces the KV-Cache by up to 8x, while quality remains nearly intact..."

Advantages

✓ Current info
✓ Source citations
✓ Fewer hallucinations

Challenges

✗ Latency (retrieval)
✗ Faulty chunks
✗ Lost-in-the-Middle

7. Extensions: Reranking & Hybrid Search

Modern RAG systems combine multiple techniques for better results:

Technique | Approach | Advantages | Disadvantages
Dense | Semantic vectors | Captures synonyms | Misses exact matches
Sparse (BM25) | Keyword matching | Exact matches | Misses synonyms
Hybrid | Dense + Sparse combined (RRF) | Both advantages | More complex implementation
Reranker | Cross-Encoder on top-k | More precise ranking | Additional latency
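
A sketch of the hybrid combination via Reciprocal Rank Fusion (RRF), reusing the dense and BM25 scores from the earlier sketches; k = 60 is the constant commonly used for RRF.

# Sketch: Reciprocal Rank Fusion of a dense ranking and a BM25 ranking.
import numpy as np

def rrf_fuse(rankings, k=60):
    # rankings: list of rankings, each a list of document ids sorted best-first
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

dense_ranking = list(np.argsort(scores)[::-1])         # from the dense retrieval sketch
sparse_ranking = list(np.argsort(bm25_scores)[::-1])   # from the BM25 sketch
hybrid_ranking = rrf_fuse([dense_ranking, sparse_ranking])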

8. Lost-in-the-Middle Problem

Fascinating: language models often pay little attention to the middle of a long context. Relevant information placed there can effectively be ignored.

Experiment: [Irrelevant Docs] + [Target Doc] + [Irrelevant Docs]
Accuracy by position of the target document:
Beginning: 85%
Middle: 45% ⚠️
End: 80%

Solution: Place critical documents at the beginning or end of the context. Alternatively, use "found-in-the-middle" techniques such as document reordering, as sketched below.
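
A sketch of such a reordering heuristic: interleave the ranked documents so that the strongest ones land at the edges of the context and the weakest in the middle (this mirrors what reordering utilities in common RAG frameworks do).

# Sketch: reorder ranked docs so the most relevant sit at the start and end of the context.
def reorder_for_context(ranked_docs):
    # ranked_docs is sorted best-first; alternate between the front and the back.
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] (best to worst) -> ["A", "C", "E", "D", "B"]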

Practical RAG Pipeline

Here is a simplified example of a RAG implementation:

# embedding_model, vector_db and llm are placeholders for your chosen libraries.

# 1. Embed documents (once)
documents = [
    "Grouped Query Attention (GQA) is...",
    "The KV-Cache is stored as O(n)...",
    "Llama 3 uses Multi-Head Attention...",
]
doc_embeddings = embedding_model.encode(documents)
vector_db.index(doc_embeddings)

# 2. Process query (per request)
query = "What is Grouped Query Attention?"
query_embedding = embedding_model.encode(query)

# 3. Retrieval
top_k_docs = vector_db.search(query_embedding, k=3)

# 4. Augmentation
augmented_prompt = f"""
Context:
{chr(10).join(top_k_docs)}

Question: {query}
"""

# 5. Generation
response = llm.generate(augmented_prompt)

Summary: When to Use RAG?

RAG is ideal for:

• Questions about current information beyond the model's training cutoff
• Proprietary or internal documents and specialized databases
• Use cases where answers must cite verifiable sources

RAG is not ideal for:

• Knowledge that is already well covered by the model's training data
• Strictly latency-sensitive applications where the retrieval step adds too much overhead
• Settings without a curated knowledge base, where noisy or faulty chunks would mislead the model