Fig. 1 | 2D projection of the embedding space with 50 documents (blue points). Red point = Query. Orange lines = k nearest neighbors (determined by Euclidean distance).
Displayed metrics: Avg Distance (k Neighbors), Max Distance (k Neighbors), Retrieval Success Rate, Most Similar Doc (cosine).

Key Insights

1. k-NN Retrieval: RAG systems typically retrieve the k = 3, 5, or 10 nearest neighbors. The balance between too few (relevant documents missed) and too many (context overload) is critical.
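The retrieval step above can be sketched in a few lines of NumPy; the toy embeddings and dimensions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 8))   # 50 toy document embeddings, d=8
query = rng.normal(size=8)        # toy query embedding

def knn(query, docs, k=5):
    """Return indices of the k nearest documents by Euclidean distance."""
    dists = np.linalg.norm(docs - query, axis=1)
    return np.argsort(dists)[:k]  # ascending: nearest first

top5 = knn(query, docs, k=5)
```

In production you would replace the brute-force scan with an approximate index (e.g. HNSW), but the interface stays the same: query in, k document indices out.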
2. Embedding Quality Is Everything: When embeddings are well trained (on relevant data), semantically similar documents cluster together; poor embeddings lead to random retrieval.
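This contrast can be simulated directly: synthetic "well-trained" embeddings cluster by topic, so a topic-0 query retrieves a topic-0 document, while unstructured random embeddings return an arbitrary one. All names here (`topics`, `good`, `bad`, `nearest`) are illustrative, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(3)

# "Well-trained" embeddings: 5 topics x 10 docs, each doc near its topic centroid.
topics = rng.normal(size=(5, 32))
good = np.repeat(topics, 10, axis=0) + 0.1 * rng.normal(size=(50, 32))

# "Poor" embeddings: no structure at all.
bad = rng.normal(size=(50, 32))

def nearest(query, docs):
    """Index of the most similar document by cosine similarity."""
    sims = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))

query = topics[0] + 0.1 * rng.normal(size=32)
hit = nearest(query, good)   # lands in indices 0-9 (the topic-0 cluster)
miss = nearest(query, bad)   # essentially arbitrary
```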
3. Curse of Dimensionality: In high dimensions (d = 768 or d = 1536), distances become unintuitive: all points are approximately equally far apart. This is a known problem for dense retrieval.
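The distance-concentration effect is easy to observe empirically. This sketch (random Gaussian points, dimensions chosen to match the text) measures the relative spread of distances to the origin, which shrinks as d grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_spread(d, n=1000):
    """Relative spread of distances: (max - min) / mean over n random points."""
    points = rng.normal(size=(n, d))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.mean()

# Spread collapses as dimensionality grows: distances concentrate.
for d in (2, 768, 1536):
    print(d, round(distance_spread(d), 3))
```

This is why nearest-neighbor distances in d = 768+ spaces are only meaningful relative to each other, not as absolute similarity scores.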
4. Reranking Stage: Plain k-NN can be suboptimal. Modern RAG systems use a two-stage pipeline: Stage 1 (fast dense retrieval, top-100) → Stage 2 (slower cross-encoder reranker, top-10).
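The two-stage pipeline looks roughly like this sketch. The `rerank_score` function is a stand-in: a real system would call a cross-encoder that reads each (query, document) text pair jointly, which is why it is only applied to the top-100 candidates:

```python
import numpy as np

rng = np.random.default_rng(1)
doc_embs = rng.normal(size=(10_000, 64))  # toy corpus embeddings
query_emb = rng.normal(size=64)

# Stage 1: fast dense retrieval -- one dot product per document, keep top-100.
scores = doc_embs @ query_emb
top100 = np.argsort(scores)[::-1][:100]

def rerank_score(query_emb, doc_emb):
    """Placeholder for a cross-encoder; scores one candidate at a time."""
    return float(np.dot(query_emb, doc_emb))

# Stage 2: slower reranker over only the 100 candidates, keep top-10.
reranked = sorted(top100, key=lambda i: rerank_score(query_emb, doc_embs[i]),
                  reverse=True)
top10 = reranked[:10]
```

The design point: the expensive model sees 100 pairs instead of 10,000, so its cost is bounded regardless of corpus size.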
5. Negative Sampling: When training embeddings, hard-negative mining is important: documents that are similar to the query but wrong should be explicitly pushed apart during training. Random negatives provide little training signal.
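A minimal sketch of the idea, assuming a standard triplet margin objective (the function names and toy data are illustrative): the hardest negative is the non-relevant document closest to the anchor, and the loss is zero only once it sits at least `margin` farther away than the positive.

```python
import numpy as np

rng = np.random.default_rng(7)

def hardest_negative(anchor, positives_idx, corpus):
    """Pick the negative closest to the anchor -- the most confusable document."""
    dists = np.linalg.norm(corpus - anchor, axis=1)
    dists[positives_idx] = np.inf  # exclude the true positives
    return int(np.argmin(dists))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the negative at least `margin` farther from the anchor than the positive."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

corpus = rng.normal(size=(100, 16))
anchor = corpus[0]
positive = anchor + 0.01 * rng.normal(size=16)  # near-duplicate relevant doc
neg_idx = hardest_negative(anchor, [0], corpus)
loss = triplet_loss(anchor, positive, corpus[neg_idx])
```

A random negative is usually already far from the anchor, so `d_pos - d_neg + margin` is negative and the loss (and gradient) is zero; hard negatives keep the loss active.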
6. Practical Implication: For your RAG pipeline, use pretrained dense embeddings (bge-large-en, voyage-2) rather than random initializations. Fine-tuning on your own domain typically yields a further 5-10% performance gain.