How skip connections make deep networks trainable and normalization ensures stability
Residual Connections and Layer Normalization are the unsung heroes of deep networks. They enable training models with 100+ layers by keeping gradients stable.
Residuals and LayerNorm are the "infrastructure" that makes Attention (4-5) and FFN (6) trainable. In the complete Transformer block (Step 8) you'll see how everything works together.
Without residuals, gradients vanish in deep networks. The skip connection x + f(x) ensures that gradients can flow directly through the network. Pre-LayerNorm (normalization before Attention/FFN) is the modern standard because it is more stable than Post-LN, and RMSNorm is more efficient than classic LayerNorm.
Why can't we just stack more layers? Historically, there were two fundamental problems: gradients that vanish on their way back through many layers, and activation distributions that drift from layer to layer.
A Residual Connection (skip connection) is a direct path that routes the input around a transformation: the output is x + f(x) instead of just f(x).
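A minimal sketch in PyTorch (the class name `SimpleResidualBlock` and the two-layer MLP inside it are illustrative placeholders, not taken from any specific model): the block returns x + f(x), so during backpropagation the gradient has an identity path around f.

```python
import torch
import torch.nn as nn

class SimpleResidualBlock(nn.Module):
    """Wraps an arbitrary transformation f and adds the skip connection x + f(x)."""
    def __init__(self, d_model: int):
        super().__init__()
        # f(x): any transformation; here a small two-layer MLP as a stand-in
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The addition creates a direct identity path for the gradient:
        # d(output)/dx = I + d(f(x))/dx
        return x + self.f(x)

x = torch.randn(2, 16, 512)           # (batch, tokens, d_model)
out = SimpleResidualBlock(512)(x)     # same shape as x: torch.Size([2, 16, 512])
```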
Normalization means bringing the activations of each token to mean 0 and standard deviation 1. This significantly stabilizes training.
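Written out by hand, the normalization step looks like the sketch below (PyTorch ships the same operation as `nn.LayerNorm`); mean and standard deviation are computed per token, i.e. over the feature dimension:

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize each token vector to mean 0 and standard deviation 1."""
    mean = x.mean(dim=-1, keepdim=True)                 # per-token mean over d_model
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # per-token variance
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 16, 512) * 7 + 3                     # badly scaled activations
y = layer_norm(x)
print(y.mean(dim=-1).abs().max(), y.std(dim=-1).mean()) # ≈ 0 and ≈ 1
```

The learnable scale and shift parameters (γ, β) that `nn.LayerNorm` adds on top are omitted here for brevity.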
Llama, Mistral, and other modern models use RMSNorm – a simplified version without mean subtraction:
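A minimal RMSNorm sketch (the interface is loosely modeled on the Llama reference code, but this class is illustrative): instead of subtracting the mean, the vector is only rescaled by its root mean square, followed by a learnable gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square, no mean subtraction."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean(x^2)); note: no mean is subtracted
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 512)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```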
A modern Transformer block uses Residual Connections AND Pre-LayerNorm together. Here is the complete data flow:
| Component | Function | Why needed |
|---|---|---|
| RMSNorm(x) | Normalizes input to RMS=1 | Stabilizes Attention input, prevents numerical instability |
| Attention(norm_x) | Computes head interactions | Semantically connects different tokens |
| + (Residual) | Adds original input | Gradient highway, preserves original information |
| RMSNorm(h₁) | Normalizes before FFN | Stabilizes FFN input |
| SwiGLU(norm_h) | Non-linear projection with gating | Increases model capacity, learnable gating |
| + (Residual) | Adds h₁ back | Gradient highway, preserves Attention output |
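Putting the table together as code, here is a sketch of the data flow. Everything below is illustrative: `nn.MultiheadAttention` stands in for the model-specific attention, the gated MLP is a simplified SwiGLU, and the compact RMSNorm mirrors the sketch above, so details differ from any particular production model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        return self.weight * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class PreNormTransformerBlock(nn.Module):
    """Pre-LN block: normalize -> transform -> add residual, twice."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 1376):
        super().__init__()
        # d_ff ≈ 8/3 · d_model is a common SwiGLU sizing; the exact value here is arbitrary
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(d_model)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) RMSNorm(x) -> Attention(norm_x) -> + residual
        norm_x = self.attn_norm(x)
        attn_out, _ = self.attn(norm_x, norm_x, norm_x)
        h1 = x + attn_out                                   # first gradient highway
        # 2) RMSNorm(h1) -> SwiGLU(norm_h) -> + residual
        norm_h = self.ffn_norm(h1)
        ffn_out = self.w_down(F.silu(self.w_gate(norm_h)) * self.w_up(norm_h))
        return h1 + ffn_out                                 # second gradient highway

x = torch.randn(2, 16, 512)
print(PreNormTransformerBlock()(x).shape)  # torch.Size([2, 16, 512])
```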
Skip connections are not a weakness or backup plan. They are a primary design principle: the network only needs to learn the change relative to its input (the residual), not the complete transformation.
Without normalization, activation distributions shift from update to update, and the network has to keep re-adapting to them. With normalization, the distribution stays stable and learning becomes more efficient.
Pre-normalization allows training with little or no learning-rate warmup. This is not just a practical convenience; it also enables deeper models, and all modern models use Pre-LayerNorm.
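The difference between the two variants is purely the ordering of the operations. A schematic sketch (function names are placeholders):

```python
# Post-LN (original Transformer, 2017): normalize AFTER the residual addition.
# The gradient must pass through every LayerNorm on its way back, which is why
# Post-LN training typically needs learning-rate warmup.
def post_ln_sublayer(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LN (modern standard): normalize BEFORE the sublayer, add the residual untouched.
# The identity path x + ... is never renormalized, so gradients flow through unchanged.
def pre_ln_sublayer(x, sublayer, norm):
    return x + sublayer(norm(x))
```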
RMSNorm omits mean subtraction, which makes it cheaper to compute, yet it is just as effective in practice. This shows that not every mathematical subtlety is necessary; what matters is what works empirically.
Residuals WITHOUT normalization = unstable. Normalization WITHOUT residuals = limited depth. Together: Stable, deep, efficient networks (50-100+ layers).
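A quick way to see the first half of this is to compare the gradient that reaches the input of a deep stack with and without skip connections. The toy experiment below uses plain linear+tanh layers; the exact numbers vary, but the plain stack's gradient is typically orders of magnitude smaller than the residual stack's.

```python
import torch
import torch.nn as nn

def input_gradient_norm(depth: int = 50, d: int = 64, residual: bool = True) -> float:
    """Backpropagate through `depth` layers and return the gradient norm at the input."""
    torch.manual_seed(0)  # identical initialization for both runs
    layers = [nn.Sequential(nn.Linear(d, d), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(8, d, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print("with residuals:   ", input_gradient_norm(residual=True))
print("without residuals:", input_gradient_norm(residual=False))
```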
More layers = more parameters, more compute. But with Residuals + Normalization, scaling is predictable and stable, not chaotic as it would be without these techniques.
All large modern language models use Residual Connections + Pre-LayerNorm. Here are the typical configurations:
| Model | Normalization | Residual Type | Layers | d_model |
|---|---|---|---|---|
| GPT-2 | LayerNorm | Pre | 12-48 | 768-1600 |
| GPT-3 | LayerNorm | Pre | 96 | 12,288 |
| PaLM | RMSNorm | Pre | 118 | 18,432 |
| Llama 2 70B | RMSNorm | Pre | 80 | 8,192 |
| Llama 3 70B | RMSNorm | Pre | 80 | 8,192 |
| Claude 3 | LayerNorm (presumably) | Pre | ~100+ | ~8K-10K |
| Mistral 7B | RMSNorm | Pre | 32 | 4,096 |