The Problem: Knowledge Boundary and Hallucinations

Large language models are trained on data up to a fixed cutoff date. Out of the box, they have no access to the latest information, proprietary documents, or specialized databases.

When asked about information outside their training data, they tend to hallucinate: they produce plausible-sounding but incorrect answers.

Solution: Retrieval-Augmented Generation (RAG) combines language models with external knowledge bases. At query time, the system retrieves relevant documents and the model grounds its answer in them.

Figure: Embedding Space & Retrieval (documents are embedded as vectors; the top-k closest to the query are selected)
1. Input: User Question

The process starts with a user question, which will be projected into the same vector space as the documents in the database.

Input: "What is Grouped Query Attention?"

Note: At this point, the question is still just text – not a numerical vector.

2. Embedding: Question → Vector

An embedding model converts the question into a high-dimensional vector that captures its semantic meaning.

Embedding Model: sentence-transformers/multilingual-e5-large
Output Shape: (1, 1024)
Query Vector: [-0.23, 0.45, ..., 0.12]

Popular Embedding Models

• Sentence-BERT (384-768 dims)
• Multilingual E5 (1024 dims)
• nomic-embed-text (768 dims)

Important

Query and documents must be embedded with the same model to be comparable.
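
A minimal sketch of this step with the sentence-transformers library (an assumed choice, matching the model family named above; on the Hugging Face Hub the model is usually published as intfloat/multilingual-e5-large). Note that E5 models expect a "query: " prefix for queries and "passage: " for documents.

# Sketch: embed the user question with sentence-transformers (assumed setup).
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("intfloat/multilingual-e5-large")

query = "What is Grouped Query Attention?"
query_embedding = embedding_model.encode(
    ["query: " + query],            # E5 models expect the "query: " prefix
    normalize_embeddings=True,      # unit length -> dot product equals cosine similarity
)
print(query_embedding.shape)        # (1, 1024), the output shape shown above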

3. Retrieval: Finding Similar Documents

The vector database computes the similarity between the query vector and all stored document vectors, typically using cosine similarity.

Cosine Similarity(query, doc) = (query · doc) / (||query|| × ||doc||)
Value range: -1 (opposite) to +1 (identical)
Threshold: > 0.5 is usually considered relevant

Complexity: O(n) for exhaustive linear search, or roughly O(log n) with approximate nearest-neighbor indexes (e.g., FAISS, Annoy).
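
A brute-force version of this computation with NumPy (variable names are illustrative: query_embedding comes from the embedding sketch above, doc_embeddings stands for the pre-computed document matrix). With unit-normalized vectors the cosine similarity reduces to a dot product, which is also what an inner-product index such as FAISS's IndexFlatIP computes at scale.

# Sketch: cosine similarity between the query vector and all document vectors.
import numpy as np

def cosine_similarity(query_vec, doc_matrix):
    # query_vec: shape (d,); doc_matrix: shape (n_docs, d)
    query_vec = query_vec / np.linalg.norm(query_vec)
    doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_matrix @ query_vec            # shape (n_docs,), values in [-1, 1]

scores = cosine_similarity(query_embedding[0], doc_embeddings)
top_k = np.argsort(scores)[::-1][:3]         # indices of the 3 most similar documents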

4. Ranking: Top-k Documents

After retrieval, the top-k documents (usually k=3-5) are sorted by similarity. Optionally, a second ranking with a cross-encoder follows for more accurate ordering.

Stage 1 (Bi-Encoder): fast, returns top-100
Stage 2 (Cross-Encoder): precise, narrows to top-10
Stage 3 (Reranker): optional, final ordering
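
A sketch of the cross-encoder stage with sentence-transformers; the model id is one common public choice, not something prescribed by the text, and documents, query and top_k reuse names from the surrounding sketches.

# Sketch: rerank bi-encoder candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # assumed model choice

candidates = [documents[i] for i in top_k]        # hits from the bi-encoder stage
pairs = [(query, doc) for doc in candidates]
ce_scores = reranker.predict(pairs)               # one relevance score per (query, doc) pair

order = sorted(range(len(candidates)), key=lambda i: ce_scores[i], reverse=True)
top_k_docs = [candidates[i] for i in order[:3]]   # keep the best few for the prompt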

Dense Retrieval

Semantic similarity. Captures synonyms, but misses exact matches ("Error TS-999").

Sparse Retrieval (BM25)

Keyword-based. Captures exact matches, but misses synonyms.
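
For comparison, a sparse retrieval sketch with the rank_bm25 package (an assumed choice; BM25 is also what engines like Elasticsearch/OpenSearch use). Tokenization here is a naive whitespace split.

# Sketch: keyword-based retrieval with BM25.
from rank_bm25 import BM25Okapi

tokenized_docs = [doc.lower().split() for doc in documents]   # naive tokenization
bm25 = BM25Okapi(tokenized_docs)

bm25_scores = bm25.get_scores(query.lower().split())
# An exact term such as "Error TS-999" scores highly here even without semantic overlap.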

5. Augmentation: Extending the Prompt

The original prompt is augmented with the retrieved documents. The model now receives context to provide an informed answer.

System: You are an assistant...
Context:
[Doc 1] Grouped Query Attention (GQA) is...
[Doc 2] GQA reduces the KV-Cache by...
[Doc 3] Models with GQA: Llama 3...
Question: What is Grouped Query Attention?

Token overhead: 3-5 documents at 256-512 tokens each add roughly 768-2560 extra tokens per query.
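
A sketch of the prompt assembly; the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and top_k_docs reuses the name from the reranking sketch.

# Sketch: build the augmented prompt and roughly estimate the token overhead.
def build_prompt(query, docs):
    context = "\n\n".join(f"[Doc {i + 1}] {doc}" for i, doc in enumerate(docs))
    return (
        "System: You are an assistant. Answer only from the given context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

augmented_prompt = build_prompt(query, top_k_docs)
approx_tokens = len(augmented_prompt) // 4        # crude ~4 characters per token heuristic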

6. Generation: Answer with Context

The language model generates an answer based on the augmented prompt. The answer should be more accurate and up to date, since it is grounded in external sources.

Output: "Grouped Query Attention is a variant of Multi-Head Attention where multiple Query heads share a Key-Value head. This reduces the KV-Cache by up to 8x, while quality remains nearly intact..."

Advantages

✓ Current info
✓ Source citations
✓ Fewer hallucinations

Challenges

✗ Latency (retrieval)
✗ Faulty chunks
✗ Lost-in-the-Middle

7. Extensions: Reranking & Hybrid Search

Modern RAG systems combine multiple techniques for better results:

Technique | Approach | Advantages | Disadvantages
Dense | Semantic vectors | Captures synonyms | Misses exact matches
Sparse (BM25) | Keyword matching | Exact matches | Misses synonyms
Hybrid | Dense + Sparse combined (RRF) | Both advantages | More complex implementation
Reranker | Cross-Encoder on top-k | More precise ranking | Additional latency
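
A sketch of the hybrid combination via Reciprocal Rank Fusion (RRF), reusing the dense and BM25 scores from the earlier sketches; k = 60 is the constant commonly used for RRF.

# Sketch: Reciprocal Rank Fusion of a dense ranking and a BM25 ranking.
import numpy as np

def rrf_fuse(rankings, k=60):
    # rankings: list of rankings, each a list of document ids sorted best-first
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

dense_ranking = list(np.argsort(scores)[::-1])         # from the dense retrieval sketch
sparse_ranking = list(np.argsort(bm25_scores)[::-1])   # from the BM25 sketch
hybrid_ranking = rrf_fuse([dense_ranking, sparse_ranking])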

8. Lost-in-the-Middle Problem

Fascinating: language models often pay little attention to the middle of a long context. Relevant information placed there can effectively be ignored.

Experiment: [Irrelevant Docs] + [Target Doc] + [Irrelevant Docs]
Accuracy by position of the target document:
Beginning: 85%
Middle: 45% ⚠️
End: 80%

Solution: Place critical documents at the beginning or end of the context. Alternatively, use "found-in-the-middle" techniques such as document reordering, as sketched below.
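
A sketch of such a reordering heuristic: interleave the ranked documents so that the strongest ones land at the edges of the context and the weakest in the middle (this mirrors what reordering utilities in common RAG frameworks do).

# Sketch: reorder ranked docs so the most relevant sit at the start and end of the context.
def reorder_for_context(ranked_docs):
    # ranked_docs is sorted best-first; alternate between the front and the back.
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Example: ["A", "B", "C", "D", "E"] (best to worst) -> ["A", "C", "E", "D", "B"]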

Practical RAG Pipeline

Here is a simplified example of a RAG implementation:

# embedding_model, vector_db and llm are placeholders for your chosen libraries.

# 1. Embed documents (once)
documents = [
    "Grouped Query Attention (GQA) is...",
    "The KV-Cache is stored as O(n)...",
    "Llama 3 uses Multi-Head Attention...",
]
doc_embeddings = embedding_model.encode(documents)
vector_db.index(doc_embeddings)

# 2. Process query (per request)
query = "What is Grouped Query Attention?"
query_embedding = embedding_model.encode(query)

# 3. Retrieval
top_k_docs = vector_db.search(query_embedding, k=3)

# 4. Augmentation
augmented_prompt = f"""
Context:
{chr(10).join(top_k_docs)}

Question: {query}
"""

# 5. Generation
response = llm.generate(augmented_prompt)

Summary: When to Use RAG?

RAG is ideal for:

• Questions about current information beyond the model's training cutoff
• Proprietary or internal documents and specialized databases
• Use cases where answers must cite verifiable sources

RAG is not ideal for:

• Knowledge that is already well covered by the model's training data
• Strictly latency-sensitive applications where the retrieval step adds too much overhead
• Settings without a curated knowledge base, where noisy or faulty chunks would mislead the model