Attention Pattern by Layer

[Interactive figure: attention over a 2000-token input sequence, ordered System → Query → Documents, colored from red (low attention) to green (high). Summary statistics: System Prompt attention 25%, Query attention 30%, Documents attention 45%, U-curve score 0.68.]
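Segment-level shares like the ones above can be computed from any attention matrix by summing the attention each segment's tokens receive. A minimal sketch, assuming a row-normalized (seq_len × seq_len) attention matrix and illustrative segment boundaries:

```python
import numpy as np

def segment_attention_shares(attn, boundaries):
    """Aggregate a (seq_len x seq_len) attention matrix into per-segment
    shares. attn[i, j] is how much query token i attends to key token j;
    rows are assumed to sum to 1 (softmax output).

    boundaries: dict mapping segment name -> (start, end) token ranges
    (end exclusive). Names and ranges here are illustrative.
    """
    # Average attention each key position receives across all query tokens.
    received = attn.mean(axis=0)              # shape: (seq_len,)
    total = received.sum()
    return {name: float(received[s:e].sum() / total)
            for name, (s, e) in boundaries.items()}

# Toy example: a 2000-token sequence split System -> Query -> Documents.
# (Random attention stands in for a real model's attention maps.)
rng = np.random.default_rng(0)
attn = rng.random((2000, 2000))
attn /= attn.sum(axis=1, keepdims=True)       # normalize rows like softmax
shares = segment_attention_shares(
    attn, {"system": (0, 200), "query": (200, 500), "docs": (500, 2000)})
```

With a real model, `attn` would come from one layer's (head-averaged) attention weights; comparing `shares` across layers reveals the layer-dependent pattern discussed below.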

Why This Matters

🎯
Lost-in-the-Middle Phenomenon
Documents in the middle of the context window receive less attention; the first and last documents are favored, producing a classic U-curve pattern.
📍
System Prompt Position
Placing the system prompt at the beginning does not guarantee it high attention; the query is often attended to more strongly. Modern prompting techniques address this.
🔄
Layer-Dependent
Early layers focus on syntactic structure (e.g. system-prompt formatting tokens); later layers focus on semantics (documents and query). Each layer range plays a different role.
RAG Impact
In RAG setups with 20+ retrieved documents, those in the middle can be effectively ignored even when highly relevant. Ranking and ordering are critical.
💡
Mitigation Strategies
Place the system prompt at the end, put the most important documents first or last, or restate key instructions in the query. These approaches vary in their success rates.
📊
Empirical Evidence
LLaMA, GPT-4, and Claude all show similar U-curve patterns, suggesting an architectural phenomenon rather than a model-specific quirk.
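The "most important docs first/last" mitigation from the cards above can be sketched as a simple sandwich ordering: alternate relevance-ranked documents between the front and back of the context, so the weakest ones land in the low-attention middle. This is one heuristic among several, not the only mitigation:

```python
def sandwich_order(docs_by_relevance):
    """Reorder retrieved documents so the most relevant land at the start
    and end of the context, pushing the least relevant into the middle,
    where attention is weakest. Input must be sorted most -> least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Docs ranked 1 (most relevant) .. 6 (least relevant):
print(sandwich_order([1, 2, 3, 4, 5, 6]))  # -> [1, 3, 5, 6, 4, 2]
```

Ranks 1 and 2 end up at the two extremes, where the U-curve concentrates attention, while ranks 5 and 6 absorb the middle penalty.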

Key Insights

1. U-Curve is real: Liu et al. (2024) show empirically that documents at the start or end of the context yield ~80% accuracy, while those in the middle drop to ~50%. With 30 retrieved docs, the middle is practically lost.
2. The system prompt competes for attention: Even with the system prompt at the start, the model often focuses on the query and documents instead. The system prompt alone is not enough; key instructions must be repeated.
3. Layer-wise differences: Early layers (roughly 1-20) attend to token-level syntax such as system-prompt formatting; late layers (60+) attend to semantic content (query and documents). Different parts of the stack have different jobs.
4. Recency bias: The last tokens receive roughly 15-20% more attention than middle tokens, which is why "put the conclusion at the end" prompt tricks work better than expected.
5. RAG consequences are large: With kNN retrieval (top-10 docs), documents at positions 4-7 can receive <20% attention. Document ranking is more critical than retrieval itself.
6. Training can't fix this: Even supervised fine-tuning on long contexts does not remove the U-curve. It appears to be an architecture-level limitation, not a data-level one.
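A U-curve score like the 0.68 reported in the figure can be defined in several ways; one simple sketch is to compare the ends of a per-position series (attention or accuracy) against its middle. The formula below is illustrative only; the article's exact score definition is not given, so the 0.68 may be computed differently:

```python
import numpy as np

def u_curve_score(values):
    """Quantify how U-shaped a per-position series is: the contrast
    between the outer quartiles and the middle half, normalized to
    (-1, 1). A score > 0 means the ends dominate the middle
    (lost-in-the-middle); 0 means flat. Illustrative definition.
    """
    v = np.asarray(values, dtype=float)
    q = len(v) // 4
    ends = np.concatenate([v[:q], v[-q:]]).mean()   # outer quartiles
    middle = v[q:-q].mean()                          # middle half
    return float((ends - middle) / (ends + middle))

# Accuracy by document position, mirroring the ~80% ends / ~50% middle
# pattern from Liu et al. (2024):
acc = [0.80, 0.70, 0.55, 0.50, 0.50, 0.55, 0.70, 0.80]
score = u_curve_score(acc)
```

A flat series yields a score near 0, while the U-shaped accuracies above produce a clearly positive score, confirming that the ends outperform the middle.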