[Figure 1 panel data: attention scale Low–High; Documents: ~12–18% of attention; U-Curve Score: 7.2]
Fig. 1 | U-shaped attention distribution shows strong attention at the beginning (System Prompt) and end (Query), while information in the middle (Documents) receives weaker attention. Layer-dependent differences are clearly visible.
Sequence structure: System Prompt → User Query → Retrieved Docs (1-3) → Query repeated
📊
U-Curve is a Real Phenomenon
The U-curve is not a measurement artifact: it is reproducible in models with 32K, 100K, and larger context windows. Early layers show a stronger U-curve.
⚠️
RAG Consequences are Significant
Retrieved documents in the middle of the context receive only 12-15% of attention. Critical information must therefore be placed at the beginning or end.
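One way to act on this is to reorder retrieved documents so the most relevant ones land at the edges of the context and the least relevant ones fall into the weakly-attended middle. The sketch below is a minimal, hypothetical implementation of that idea; the function name and alternating strategy are illustrative, not from the source.

```python
def edge_reorder(docs_ranked):
    """Place highest-ranked documents at the edges of the context,
    pushing the lowest-ranked ones into the weakly-attended middle.

    docs_ranked: list of documents sorted by descending relevance.
    """
    front, back = [], []
    for i, doc in enumerate(docs_ranked):
        # Alternate: even ranks go toward the front, odd ranks toward the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document sits at the very end.
    return front + back[::-1]

docs = ["doc_A", "doc_B", "doc_C", "doc_D", "doc_E"]  # best → worst
print(edge_reorder(docs))  # → ['doc_A', 'doc_C', 'doc_E', 'doc_D', 'doc_B']
```

With this layout the top two documents occupy the first and last slots, matching the two high-attention regions of the U-curve.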
🔍
System Prompts Compete for Attention
A long system prompt (e.g., Claude: ~16K words) consumes 20-25% of the attention budget, even when the user input is more important.
📈
Layer-wise Differences
Early layers (layer 4): U-curve score 7.8. Middle layers (layer 32): 6.5. Late layers (layer 64): 5.2. Upper layers focus more on global structure.
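The source does not define how the U-curve score is computed, but a plausible formulation is the ratio of mean attention mass at the sequence edges to mean mass in the middle. The sketch below implements that assumed definition on a synthetic U-shaped distribution; the function name, `edge_frac` parameter, and scoring formula are all illustrative assumptions.

```python
import numpy as np

def u_curve_score(attn, edge_frac=0.1):
    """Hypothetical U-curve score: mean attention mass on the first and
    last `edge_frac` of positions divided by mean mass in the middle.
    A flat distribution scores ~1.0; a U-shape scores > 1.0.

    attn: 1-D array of per-position attention mass.
    """
    n = len(attn)
    k = max(1, int(n * edge_frac))
    edges = np.concatenate([attn[:k], attn[-k:]]).mean()
    middle = attn[k:-k].mean()
    return edges / middle

# Synthetic U-shaped distribution over 100 positions.
pos = np.linspace(-1, 1, 100)
attn = 0.2 + pos**4          # high at both ends, low in the middle
attn /= attn.sum()           # normalize to an attention distribution
print(round(u_curve_score(attn), 2))
```

Under this definition, the layer-wise numbers above (7.8 → 6.5 → 5.2) would mean the edge/middle imbalance shrinks as depth increases.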
Recency Bias at the End
Query tokens at the end get +15-20% more attention than at the beginning. This helps models prioritize recent requests.
🚫
Training Cannot Fix the U-Curve
Even models fine-tuned on long sequences show the U-curve. It is structurally anchored in the attention architecture.
Model        | Context | U-Curve           | Solution
GPT-4        | 128K    | Strong (6.8)      | Place documents at the front
Claude 3.5   | 200K    | Medium-weak (5.5) | Question-answering format
Llama 3 70B  | 128K    | Strong (7.0)      | Hybrid position engineering
Mistral 8×7B | 32K     | Weak (4.2)        | Less susceptible due to SWA (sliding-window attention)