Fig. 1 | 2D projection of the embedding space with 50 documents (blue points). Red point = Query. Orange lines = k nearest neighbors (determined by Euclidean distance).
Displayed metrics: Avg Distance (k Neighbors), Max Distance (k Neighbors), Retrieval Success Rate, Most Similar Doc (cosine).

Key Insights

1. k-NN Retrieval: RAG systems typically retrieve the k = 3, 5, or 10 nearest neighbors. The balance between too few (relevant documents missed) and too many (context overload) is critical.
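The retrieval step above can be sketched in a few lines of NumPy; the toy embeddings and dimensions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 8))   # 50 toy document embeddings, d=8
query = rng.normal(size=8)        # toy query embedding

def knn(query, docs, k=5):
    """Return indices of the k nearest documents by Euclidean distance."""
    dists = np.linalg.norm(docs - query, axis=1)
    return np.argsort(dists)[:k]  # ascending: nearest first

top5 = knn(query, docs, k=5)
```

In production you would replace the brute-force scan with an approximate index (e.g. HNSW), but the interface stays the same: query in, k document indices out.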
2. Embedding Quality Is Everything: When embeddings are well trained (on relevant data), semantically similar documents cluster together; poor embeddings lead to random retrieval.
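This contrast can be simulated directly: synthetic "well-trained" embeddings cluster by topic, so a topic-0 query retrieves a topic-0 document, while unstructured random embeddings return an arbitrary one. All names here (`topics`, `good`, `bad`, `nearest`) are illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(3)

# "Well-trained" embeddings: 5 topics x 10 docs, each doc near its topic centroid.
topics = rng.normal(size=(5, 32))
good = np.repeat(topics, 10, axis=0) + 0.1 * rng.normal(size=(50, 32))

# "Poor" embeddings: no structure at all.
bad = rng.normal(size=(50, 32))

def nearest(query, docs):
    """Index of the most similar document by cosine similarity."""
    sims = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))

query = topics[0] + 0.1 * rng.normal(size=32)
hit = nearest(query, good)   # lands in indices 0-9 (the topic-0 cluster)
miss = nearest(query, bad)   # essentially arbitrary
```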
3. Curse of Dimensionality: In high dimensions (d = 768 or d = 1536), distances become unintuitive: all points are approximately equally far apart. This is a known problem for dense retrieval.
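The distance-concentration effect is easy to observe empirically. This sketch (random Gaussian points, dimensions chosen to match the text) measures the relative spread of distances to the origin, which shrinks as d grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(d, n=1000):
    """Relative spread of distances: (max - min) / mean over n random points."""
    points = rng.normal(size=(n, d))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.mean()

# Spread collapses as dimensionality grows: distances concentrate.
for d in (2, 768, 1536):
    print(d, round(distance_spread(d), 3))
```

This is why nearest-neighbor distances in d = 768+ spaces are only meaningful relative to each other, not as absolute similarity scores.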
4. Reranking Stage: Plain k-NN can be suboptimal. Modern RAG systems use a two-stage pipeline: Stage 1 (fast dense retrieval, top-100) → Stage 2 (slower cross-encoder reranker, top-10).
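The two-stage pipeline looks roughly like this sketch. The `rerank_score` function is a stand-in: a real system would call a cross-encoder that reads each (query, document) text pair jointly, which is why it is only applied to the top-100 candidates:

```python
import numpy as np

rng = np.random.default_rng(1)
doc_embs = rng.normal(size=(10_000, 64))  # toy corpus embeddings
query_emb = rng.normal(size=64)

# Stage 1: fast dense retrieval -- one dot product per document, keep top-100.
scores = doc_embs @ query_emb
top100 = np.argsort(scores)[::-1][:100]

def rerank_score(query_emb, doc_emb):
    """Placeholder for a cross-encoder; scores one candidate at a time."""
    return float(np.dot(query_emb, doc_emb))

# Stage 2: slower reranker over only the 100 candidates, keep top-10.
reranked = sorted(top100, key=lambda i: rerank_score(query_emb, doc_embs[i]),
                  reverse=True)
top10 = reranked[:10]
```

The design point: the expensive model sees 100 pairs instead of 10,000, so its cost is bounded regardless of corpus size.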
5. Negative Sampling: When training embeddings, hard-negative mining is important: documents that are similar to the query but wrong should be explicitly pushed apart during training. Random negatives provide little training signal.
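A minimal sketch of the idea, assuming a standard triplet margin objective (the function names and toy data are illustrative): the hardest negative is the non-relevant document closest to the anchor, and the loss is zero only once it sits at least `margin` farther away than the positive.

```python
import numpy as np

rng = np.random.default_rng(7)

def hardest_negative(anchor, positives_idx, corpus):
    """Pick the negative closest to the anchor -- the most confusable document."""
    dists = np.linalg.norm(corpus - anchor, axis=1)
    dists[positives_idx] = np.inf  # exclude the true positives
    return int(np.argmin(dists))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the negative at least `margin` farther from the anchor than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

corpus = rng.normal(size=(100, 16))
anchor = corpus[0]
positive = anchor + 0.01 * rng.normal(size=16)  # near-duplicate relevant doc
neg_idx = hardest_negative(anchor, [0], corpus)
loss = triplet_loss(anchor, positive, corpus[neg_idx])
```

A random negative is usually already far from the anchor, so `d_pos - d_neg + margin` is negative and the loss (and gradient) is zero; hard negatives keep the loss active.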
6. Practical Implication: For your RAG pipeline, use pretrained dense embeddings (bge-large-en, voyage-2) rather than random initializations. Fine-tuning on your own domain typically yields a further 5-10% performance gain.