Attention Pattern by Layer

[Interactive figure: attention over a 2000-token input sequence, ordered System → Query → Documents, colored from red (low attention) to green (high). Summary statistics: System Prompt attention 25%, Query attention 30%, Documents attention 45%, U-curve score 0.68.]
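Segment-level shares like the ones above can be computed from any attention matrix by summing the attention each segment's tokens receive. A minimal sketch, assuming a row-normalized (seq_len × seq_len) attention matrix and illustrative segment boundaries:

```python
import numpy as np

def segment_attention_shares(attn, boundaries):
    """Aggregate a (seq_len x seq_len) attention matrix into per-segment
    shares. attn[i, j] is how much query token i attends to key token j;
    rows are assumed to sum to 1 (softmax output).

    boundaries: dict mapping segment name -> (start, end) token ranges
    (end exclusive). Names and ranges here are illustrative.
    """
    # Average attention each key position receives across all query tokens.
    received = attn.mean(axis=0)              # shape: (seq_len,)
    total = received.sum()
    return {name: float(received[s:e].sum() / total)
            for name, (s, e) in boundaries.items()}

# Toy example: a 2000-token sequence split System -> Query -> Documents.
# (Random attention stands in for a real model's attention maps.)
rng = np.random.default_rng(0)
attn = rng.random((2000, 2000))
attn /= attn.sum(axis=1, keepdims=True)       # normalize rows like softmax
shares = segment_attention_shares(
    attn, {"system": (0, 200), "query": (200, 500), "docs": (500, 2000)})
```

With a real model, `attn` would come from one layer's (head-averaged) attention weights; comparing `shares` across layers reveals the layer-dependent pattern discussed below.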

Why This Matters

🎯
Lost-in-the-Middle Phenomenon
Documents in the middle of the context window receive less attention; the first and last documents are favored, producing a classic U-curve pattern.
📍
System Prompt Position
Placing the system prompt at the beginning does not guarantee it high attention; the query is often attended to more strongly. Modern prompting techniques address this.
🔄
Layer-Dependent
Early layers focus on syntactic structure (e.g. system-prompt formatting tokens); later layers focus on semantics (documents and query). Each layer range plays a different role.
RAG Impact
In RAG setups with 20+ retrieved documents, those in the middle can be effectively ignored even when highly relevant. Ranking and ordering are critical.
💡
Mitigation Strategies
Place the system prompt at the end, put the most important documents first or last, or restate key instructions in the query. These approaches vary in their success rates.
📊
Empirical Evidence
LLaMA, GPT-4, and Claude all show similar U-curve patterns, suggesting an architectural phenomenon rather than a model-specific quirk.
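The "most important docs first/last" mitigation from the cards above can be sketched as a simple sandwich ordering: alternate relevance-ranked documents between the front and back of the context, so the weakest ones land in the low-attention middle. This is one heuristic among several, not the only mitigation:

```python
def sandwich_order(docs_by_relevance):
    """Reorder retrieved documents so the most relevant land at the start
    and end of the context, pushing the least relevant into the middle,
    where attention is weakest. Input must be sorted most -> least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Docs ranked 1 (most relevant) .. 6 (least relevant):
print(sandwich_order([1, 2, 3, 4, 5, 6]))  # -> [1, 3, 5, 6, 4, 2]
```

Ranks 1 and 2 end up at the two extremes, where the U-curve concentrates attention, while ranks 5 and 6 absorb the middle penalty.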

Key Insights

1. U-Curve is real: Liu et al. (2024) show empirically that documents at the start or end of the context yield ~80% accuracy, while those in the middle drop to ~50%. With 30 retrieved docs, the middle is practically lost.
2. The system prompt competes for attention: Even with the system prompt at the start, the model often focuses on the query and documents instead. The system prompt alone is not enough; key instructions must be repeated.
3. Layer-wise differences: Early layers (roughly 1-20) attend to token-level syntax such as system-prompt formatting; late layers (60+) attend to semantic content (query and documents). Different parts of the stack have different jobs.
4. Recency bias: The last tokens receive roughly 15-20% more attention than middle tokens, which is why "put the conclusion at the end" prompt tricks work better than expected.
5. RAG consequences are large: With kNN retrieval (top-10 docs), documents at positions 4-7 can receive <20% attention. Document ranking is more critical than retrieval itself.
6. Training can't fix this: Even supervised fine-tuning on long contexts does not remove the U-curve. It appears to be an architecture-level limitation, not a data-level one.
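A U-curve score like the 0.68 reported in the figure can be defined in several ways; one simple sketch is to compare the ends of a per-position series (attention or accuracy) against its middle. The formula below is illustrative only; the article's exact score definition is not given, so the 0.68 may be computed differently:

```python
import numpy as np

def u_curve_score(values):
    """Quantify how U-shaped a per-position series is: the contrast
    between the outer quartiles and the middle half, normalized to
    (-1, 1). A score > 0 means the ends dominate the middle
    (lost-in-the-middle); 0 means flat. Illustrative definition.
    """
    v = np.asarray(values, dtype=float)
    q = len(v) // 4
    ends = np.concatenate([v[:q], v[-q:]]).mean()   # outer quartiles
    middle = v[q:-q].mean()                          # middle half
    return float((ends - middle) / (ends + middle))

# Accuracy by document position, mirroring the ~80% ends / ~50% middle
# pattern from Liu et al. (2024):
acc = [0.80, 0.70, 0.55, 0.50, 0.50, 0.55, 0.70, 0.80]
score = u_curve_score(acc)
```

A flat series yields a score near 0, while the U-shaped accuracies above produce a clearly positive score, confirming that the ends outperform the middle.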