The Phenomenon: The U-Curve of Attention

Despite large context windows (32K, 100K+ tokens), LLMs show a surprising behavior: they tend to neglect information in the middle of the context and focus on the beginning and the end.

This leads to a characteristic U-shaped attention distribution (the U-curve): information at the beginning is processed well, information in the middle is effectively forgotten, and information at the end is attended to again.

Fig. 1 | The U-curve: attention and information processing as a function of position in the context. Beginning ✓, middle ✗, end ✓.

What does this mean in practice?

Why is this a problem?

In RAG pipelines or long-context QA, critical information can be in the middle of a document – exactly where the model doesn't look.

System Prompts Benefit

System prompts at the beginning are processed well. This is one of the reasons why placing instructions at the beginning is important.

Causes of the U-Curve

The U-shaped attention arises from two factors:

1. Attention Masking Techniques

Transformers use causal attention masking: each token can attend only to preceding tokens. Because the first tokens are visible to every later token, they accumulate attention across the whole sequence (they act as "attention sinks"), while the most recent tokens benefit from recency. Both effects bias attention structurally toward the beginning and the end.
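Causal masking is easy to visualize as a lower-triangular boolean matrix over positions. A minimal NumPy sketch of why early positions receive attention from the whole sequence:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: entry [i, j] is True iff token i may attend to token j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Token 0 is visible to every later token, so early positions can
# accumulate attention from the whole sequence.
visibility = mask.sum(axis=0)  # how many tokens can see each position
print(visibility)  # → [4 3 2 1]
```

The first position is visible to all four tokens, the last only to itself: the structural asymmetry the text describes.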

2. Training Data Biases

Training data has biased patterns: in natural documents, the most important information tends to sit at the beginning (titles, abstracts, lead paragraphs) or at the end (summaries, conclusions), while the middle carries supporting detail.

The model implicitly learns that beginning and end are more important. This trained bias manifests as the U-curve.

The Problem Mechanistically
Causal Masking + Training Bias
→ Structural bias in attention patterns
→ U-shaped attention distribution
→ Middle information gets "lost"

Result: Long-context capability is an illusion.
Models can process long contexts,
but only actively use beginning/end.

Practical Demonstration: Document Retrieval

Fig. 2 | RAG scenario: a relevant document placed at the beginning is processed correctly; in the middle, the model ignores it; at the end, the model attends to it again.

Scenario: Question Answering via Retrieval

Prompt Structure:
1. Multiple documents (from Retrieval)
2. User question at the end

Problem: When the relevant document is in the middle:
→ Model doesn't find the answer
→ "I don't know" or hallucinations

Solution: Arrange documents strategically
→ Most important at beginning/end
→ Less important in the middle
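The strategic arrangement above can be sketched as a simple reordering, assuming the documents arrive already ranked by relevance (the function and variable names here are illustrative, not from any specific library):

```python
def arrange_for_u_curve(docs_by_relevance: list) -> list:
    """Place the most relevant documents at the beginning and end of the
    context and the least relevant in the middle, matching the U-curve."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: best doc to the front, second best to the back, etc.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]

docs = ["doc1", "doc2", "doc3", "doc4", "doc5"]  # doc1 = most relevant
print(arrange_for_u_curve(docs))
# → ['doc1', 'doc3', 'doc5', 'doc4', 'doc2']
```

The least relevant document ends up in the middle, the two most relevant at the two high-attention ends.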

Impact on RAG and Long-Context Systems

The RAG Problem

In Retrieval-Augmented Generation (RAG) pipelines, the U-curve becomes particularly problematic:

Scenario                  Document Position   Success Rate   Implication
Document at beginning     Position 0%         ~95%           Is attended to and processed
Document in the middle    Position 50%        ~50%           Is often ignored
Document at the end       Position 100%       ~90%           Is attended to (just before the question)

Problem: Naive Ranking

Standard retrieval ranks documents by relevance alone. But the top-k documents should end up at the beginning or end of the context, not in the middle!

Solution: Position-Aware Ranking

Found-in-the-Middle Calibration: Rank by relevance AND position. Consider the U-curve.
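One way to sketch ranking by relevance and position: assign the most relevant documents to the context slots with the highest attention weight under a U-shaped profile. The profile values and all names below are illustrative assumptions, not the published calibration method:

```python
def assign_to_slots(docs_with_scores: list, attention_profile: list) -> list:
    """Place the most relevant documents into the slots the model attends
    to most. `attention_profile` is an assumed U-shaped weight per slot."""
    ranked_docs = sorted(docs_with_scores, key=lambda x: -x[1])
    # Slot indices ordered from most-attended to least-attended.
    slot_order = sorted(range(len(attention_profile)),
                        key=lambda i: -attention_profile[i])
    layout = [None] * len(attention_profile)
    for (doc, _score), slot in zip(ranked_docs, slot_order):
        layout[slot] = doc
    return layout

profile = [1.0, 0.5, 0.2, 0.5, 0.9]  # high at both ends, low in the middle
docs = [("a", 0.9), ("b", 0.7), ("c", 0.5), ("d", 0.3), ("e", 0.1)]
print(assign_to_slots(docs, profile))
# → ['a', 'c', 'e', 'd', 'b']
```

The least relevant document lands in the least-attended middle slot; the two best documents occupy the two ends.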

Found-in-the-Middle Calibration

Approach: Arrange retrieval results so that important documents don't end up in the middle.

Found-in-the-Middle Strategy
Traditional RAG:
Ranking: Top-1 (relevant) → Middle (next best) → Bottom
→ Middle documents end up in LLM context middle!

Found-in-the-Middle:
Position: Beginning (Top-1) + End (Top-2-5) + Middle (Less important)
→ Best relevance is positioned where LLM looks

Result: ~15% improvement in answer quality with the same retrieved documents; only their order changes.
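The layout described above (Top-1 at the beginning, Top-2 to Top-5 at the end, the rest in the middle) can be sketched directly. This is a sketch of the positional reordering only, not of the full calibration method:

```python
def found_in_the_middle_layout(ranked_docs: list, k: int = 5) -> list:
    """Top-1 at the beginning, Top-2..Top-k at the end,
    everything less important in the middle."""
    top, rest = ranked_docs[:k], ranked_docs[k:]
    return top[:1] + rest + top[1:]

ranked = [f"d{i}" for i in range(1, 9)]  # d1 = most relevant
print(found_in_the_middle_layout(ranked))
# → ['d1', 'd6', 'd7', 'd8', 'd2', 'd3', 'd4', 'd5']
```

The less relevant d6 to d8 absorb the low-attention middle, while all top-5 hits sit where the model actually looks.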

Solutions for Lost-in-the-Middle

1. Position-Aware Ranking (RAG)

Rerank retrieved documents by relevance and position so the most relevant ones land at the beginning or end of the context.

2. Prompt Design Strategies

Put instructions at the beginning and the question at the end; keep critical information out of the middle.

3. Alternative: Position Shuffling

Run the same query over several document orderings and aggregate the answers, so no document is always stuck in the middle.

4. Architecture Improvements (Future)

New training strategies and attention mechanisms that flatten the U-curve (still at the research stage).
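The position-shuffling idea (option 3 above) can be sketched as querying the model over several random document orderings and majority-voting the answers. `answer_fn` stands in for a hypothetical LLM call and is not a real API:

```python
import random
from collections import Counter

def shuffled_vote(answer_fn, docs: list, question: str,
                  n_runs: int = 3, seed: int = 0) -> str:
    """Ask the same question over several random document orderings and
    majority-vote the answers, so no document is always in the middle.
    `answer_fn(docs, question)` is a hypothetical LLM call."""
    rng = random.Random(seed)
    answers = []
    for _ in range(n_runs):
        order = docs[:]
        rng.shuffle(order)
        answers.append(answer_fn(order, question))
    return Counter(answers).most_common(1)[0][0]
```

The cost is n_runs model calls per question, which is why this remains an alternative rather than the default.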

Short-term (Practical)

Position-aware RAG and prompt design. Avoid critical information in the middle.

Long-term (Research)

New training strategies and architectures can reduce or eliminate the U-curve.

System Prompts and the U-Curve

A practical reason why system prompts are positioned at the beginning: They fall in the high-attention beginning region of the U-curve!

Why does this work?

Prompt Structure in Practice:

[SYSTEM PROMPT] ← High attention (beginning)
[User Context/Documents] ← Mixed
[User Question] ← High attention (end)

This structure optimally utilizes the U-curve!
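A minimal sketch of assembling a prompt in this structure (names are illustrative):

```python
def build_prompt(system: str, documents: list, question: str) -> str:
    """Assemble the prompt so the high-attention regions of the U-curve
    hold the instructions (beginning) and the question (end)."""
    parts = [system]          # high attention: beginning
    parts.extend(documents)   # mixed attention: middle
    parts.append(question)    # high attention: end
    return "\n\n".join(parts)

print(build_prompt("You are a helpful assistant.",
                   ["[Doc 1] ...", "[Doc 2] ..."],
                   "Question: What does Doc 1 say?"))
```

Whatever ends up in `documents` should itself be ordered position-aware, as described above.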

Key Insights

1️⃣ U-Shaped Attention

LLMs show structurally higher attention at the beginning and end, not in the middle – despite large context windows.

2️⃣ Training and Architecture Effect

Combination of causal masking and training data biases creates the U-curve. Not easy to fix.

3️⃣ Practical Consequences

Long contexts are less useful than they appear. Only beginning and end are actively used.

4️⃣ RAG Problem

Standard retrieval ranking ignores position. Found-in-the-Middle Calibration: +15% through better positioning.

5️⃣ Design Implication

System prompts at top, question at end = best position. Critical info not in the middle.

6️⃣ Future: Fixable?

Research on position shuffling and new architectures. But not yet standard in production.