The Problem: Deep Networks

Why can't we just stack more layers? Historically, there were two fundamental problems:

• Vanishing gradients: With many layers, gradients shrink on their way back through the network, so the early layers barely receive a learning signal.
• Shifting activations: Every layer changes the distribution of its outputs, so deeper layers constantly chase a moving target and training becomes unstable.

💡 Solution: Residual Connections (Skip-Connections) + Normalization. Together they enable deep networks (50-100+ layers) with stable training.

Residual Connections – Skip Connections

A Residual Connection is a direct connection that routes the input around a transformation:

Residual Connection Formula:
x' = x + f(x)

Where:
• x = Input
• f(x) = Transformation (e.g., Attention or FFN)
• x' = Output with additional signal from input
Fig. 1 | Residual Connections: Why skip connections enable deep networks
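
In code, this is a single addition in the forward pass. A minimal sketch, assuming PyTorch and a simple linear layer standing in for the transformation f (in a real block, f would be Attention or the FFN):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps any transformation f and adds the input back: x' = x + f(x)."""
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # the skip path carries x through unchanged

# Usage: a linear layer stands in for f; shapes are (batch, tokens, d_model).
block = ResidualBlock(nn.Linear(512, 512))
x = torch.randn(2, 10, 512)
print(block(x).shape)  # torch.Size([2, 10, 512])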

Why Residual Connections Work:

• Gradient highway: The identity path carries gradients backward past the transformation, so even the earliest layers receive a usable learning signal.
• Easier learning target: Each layer only has to learn the change f(x) relative to its input, not the complete transformation. If a layer has nothing useful to contribute, f(x) ≈ 0 and the block simply passes x through.
• Information preservation: The original input is always carried along, no matter what the transformation does to it.

Layer Normalization – Stabilization

Normalization means bringing the activations of each token to mean 0 and standard deviation 1. This significantly stabilizes training.

Standard LayerNorm:

Layer Normalization Formula:
LayerNorm(x) = γ ⊙ (x - μ) / √(σ² + ε) + β

Where:
• μ = Mean over all d dimensions per token
• σ² = Variance over d dimensions
• γ, β = Trainable scale and shift parameters
• ε = Small constant (e.g., 1e-6) for numerical stability
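
A minimal sketch of this formula, assuming PyTorch; the hand-written version is checked against the built-in torch.nn.LayerNorm to show that they agree:

import torch
import torch.nn as nn

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    mu = x.mean(dim=-1, keepdim=True)                   # μ: mean over the d dimensions, per token
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # σ²: variance over the d dimensions, per token
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

d_model = 8
x = torch.randn(2, 4, d_model)                          # (batch, tokens, d_model)
gamma, beta = torch.ones(d_model), torch.zeros(d_model) # freshly initialized scale and shift

reference = nn.LayerNorm(d_model, eps=1e-6)             # PyTorch's built-in LayerNorm
print(torch.allclose(layer_norm(x, gamma, beta), reference(x), atol=1e-5))  # True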

RMSNorm – Modern Variant:

Llama, Mistral, and other modern models use RMSNorm – a simplified version without mean subtraction:

RMSNorm Formula:
RMSNorm(x) = γ ⊙ x / √(mean(x²) + ε)

Advantages:
• Faster (fewer operations)
• Model quality on par with LayerNorm
• Less memory
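
A corresponding sketch of RMSNorm, again assuming PyTorch; the class mirrors the Llama-style definition but is written here only for illustration:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scales each token by its root mean square; no mean subtraction, no shift parameter β."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))  # trainable scale γ
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)  # √(mean(x²) + ε)
        return self.gamma * x / rms

x = torch.randn(2, 4, 8)      # (batch, tokens, d_model)
print(RMSNorm(8)(x).shape)    # torch.Size([2, 4, 8])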
Fig. 2 | Comparison: Post-LayerNorm vs. Pre-LayerNorm architecture and their position in the block

Post-LayerNorm (Original)

  • ❌ Needs a warmup phase (gradually ramping the learning rate up at the start of training)
  • ❌ Unstable with many layers
  • ✅ Output normalized
  • ✅ Historical standard (original Transformer)

Pre-LayerNorm (Modern)

  • ✅ Stable training without warmup
  • ✅ Better convergence with many layers
  • ✅ Simpler hyperparameter tuning
  • ✅ Standard in GPT-3, Llama, Claude
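
The only difference is where the normalization sits relative to the residual addition. A schematic sketch, assuming PyTorch, with a linear layer standing in for the Attention or FFN sub-layer:

import torch
import torch.nn as nn

d_model = 512
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for Attention or FFN
x = torch.randn(2, 10, d_model)

# Post-LayerNorm (original Transformer): normalize AFTER adding the residual
post_ln_out = norm(x + sublayer(x))

# Pre-LayerNorm (modern): normalize only the sub-layer input; the residual path stays untouched
pre_ln_out = x + sublayer(norm(x))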

Transformer Block: Interplay

A modern Transformer block uses Residual Connections AND Pre-LayerNorm together. Here is the complete data flow:

Fig. 3 | Complete Transformer block with Pre-LayerNorm + Residual Connections (Llama style)
Block Logic (Pre-LayerNorm Style):
h₁ = x + Attention(RMSNorm(x))
h₂ = h₁ + SwiGLU(RMSNorm(h₁))
return h₂
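
A runnable sketch of this block, assuming PyTorch. The attention sub-layer uses the built-in nn.MultiheadAttention without a causal mask, the SwiGLU width is chosen arbitrarily, and nn.RMSNorm requires a recent PyTorch release (the RMSNorm class sketched above works identically), so this illustrates the structure rather than reproducing a real Llama block:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated FFN: silu(W_gate x) * (W_up x), projected back to d_model with W_down."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TransformerBlock(nn.Module):
    """Pre-LayerNorm block: normalize, transform, add the residual; once for attention, once for the FFN."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)   # PyTorch >= 2.4; any RMSNorm implementation works
        self.norm2 = nn.RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm_x = self.norm1(x)
        attn_out, _ = self.attn(norm_x, norm_x, norm_x)  # self-attention on the normalized input
        h1 = x + attn_out                                # h₁ = x + Attention(RMSNorm(x))
        h2 = h1 + self.ffn(self.norm2(h1))               # h₂ = h₁ + SwiGLU(RMSNorm(h₁))
        return h2

block = TransformerBlock(d_model=512, n_heads=8, d_ff=1536)
x = torch.randn(2, 10, 512)    # (batch, tokens, d_model)
print(block(x).shape)          # torch.Size([2, 10, 512])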

Block Components:

Component | Function | Why needed
RMSNorm(x) | Normalizes the input to RMS = 1 | Stabilizes the Attention input, prevents numerical instability
Attention(norm_x) | Computes interactions between tokens across heads | Semantically connects different tokens
+ (Residual) | Adds the original input x back | Gradient highway, preserves the original information
RMSNorm(h₁) | Normalizes before the FFN | Stabilizes the FFN input
SwiGLU(norm_h₁) | Non-linear projection with gating | Increases model capacity, learnable gating
+ (Residual) | Adds h₁ back | Gradient highway, preserves the Attention output
Section 5: Key Insights

Core Insights

1. Residuals Are Not "Waste"

Skip connections are not a weakness or backup plan. They are a primary design principle: The network only needs to learn the change, not the complete transformation.

2. Normalization = Training Stabilizer

Without normalization, the activation distributions drift from update to update, so the network constantly has to re-adapt to them. With normalization, the distribution remains stable and learning becomes more efficient.

3. Pre-LayerNorm Is a Game-Changer

Pre-normalization makes training possible without a warmup phase. This is not just a practical convenience; it also enables deeper models, and all modern models use Pre-LayerNorm.

4. RMSNorm Shows: Simpler Is Often Better

RMSNorm omits the mean subtraction, yet it is just as effective as LayerNorm while being slightly faster. This shows that not every mathematical subtlety is necessary: what matters is what works empirically.

5. The Combination Is Essential

Residuals WITHOUT normalization = unstable. Normalization WITHOUT residuals = limited depth. Together: Stable, deep, efficient networks (50-100+ layers).

6. Depth Has Costs

More layers mean more parameters and more compute. But with Residuals + Normalization, scaling is predictable and stable instead of chaotic, as it would be without these techniques.

Modern Models & Configuration

All large modern language models use Residual Connections + Pre-LayerNorm. Here are the typical configurations:

Model | Normalization | Norm Placement | Layers | d_model
GPT-2 | LayerNorm | Pre | 12-48 | 768-1600
GPT-3 | LayerNorm | Pre | 96 | 12,288
PaLM | RMSNorm | Pre | 118 | 18,432
Llama 2 70B | RMSNorm | Pre | 80 | 8,192
Llama 3 70B | RMSNorm | Pre | 80 | 8,192
Claude 3 | LayerNorm (presumably) | Pre | ~100+ | ~8K-10K
Mistral 7B | RMSNorm | Pre | 32 | 4,096
🔍 Observation: All modern models use Pre-LayerNorm. Post-LayerNorm was the placement of the original Transformer, but GPT-2 already moved the normalization in front of each sub-layer, and no recent large model has gone back. This is a clear sign of evolution: Pre-LayerNorm is superior in practice.