How skip connections make deep networks trainable and normalization ensures stability
Residual Connections and Layer Normalization are the unsung heroes of deep networks. They enable training models with 100+ layers by keeping gradients stable.
Residuals and LayerNorm are the "infrastructure" that makes Attention (4-5) and FFN (6) trainable. In the complete Transformer block (Step 8) you'll see how everything works together.
Without residuals, gradients vanish in deep networks. The skip connection x + f(x) ensures that gradients can flow directly through the network. Pre-LayerNorm (normalization before Attention/FFN) is the modern standard because it is more stable than Post-LN, and RMSNorm is more efficient than classic LayerNorm.
Why can't we just stack more layers? Historically, there were two fundamental problems: gradients that vanish on their way back through many layers, and activation distributions that drift from layer to layer.
A Residual Connection (skip connection) is a direct path that routes the input around a transformation: the output is x + f(x) instead of just f(x).
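A minimal sketch in PyTorch (the class name `SimpleResidualBlock` and the two-layer MLP inside it are illustrative placeholders, not taken from any specific model): the block returns x + f(x), so during backpropagation the gradient has an identity path around f.

```python
import torch
import torch.nn as nn

class SimpleResidualBlock(nn.Module):
    """Wraps an arbitrary transformation f and adds the skip connection x + f(x)."""
    def __init__(self, d_model: int):
        super().__init__()
        # f(x): any transformation; here a small two-layer MLP as a stand-in
        self.f = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The addition creates a direct identity path for the gradient:
        # d(output)/dx = I + d(f(x))/dx
        return x + self.f(x)

x = torch.randn(2, 16, 512)           # (batch, tokens, d_model)
out = SimpleResidualBlock(512)(x)     # same shape as x: torch.Size([2, 16, 512])
```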
Normalization means bringing the activations of each token to mean 0 and standard deviation 1. This significantly stabilizes training.
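Written out by hand, the normalization step looks like the sketch below (PyTorch ships the same operation as `nn.LayerNorm`); mean and standard deviation are computed per token, i.e. over the feature dimension:

```python
import torch

def layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize each token vector to mean 0 and standard deviation 1."""
    mean = x.mean(dim=-1, keepdim=True)                 # per-token mean over d_model
    var = x.var(dim=-1, keepdim=True, unbiased=False)   # per-token variance
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 16, 512) * 7 + 3                     # badly scaled activations
y = layer_norm(x)
print(y.mean(dim=-1).abs().max(), y.std(dim=-1).mean()) # ≈ 0 and ≈ 1
```

The learnable scale and shift parameters (γ, β) that `nn.LayerNorm` adds on top are omitted here for brevity.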
Llama, Mistral, and other modern models use RMSNorm – a simplified version without mean subtraction:
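A minimal RMSNorm sketch (the interface is loosely modeled on the Llama reference code, but this class is illustrative): instead of subtracting the mean, the vector is only rescaled by its root mean square, followed by a learnable gain.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square, no mean subtraction."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean(x^2)); note: no mean is subtracted
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 512)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```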
A modern Transformer block uses Residual Connections AND Pre-LayerNorm together. Here is the complete data flow:
| Component | Function | Why needed |
|---|---|---|
| RMSNorm(x) | Normalizes input to RMS=1 | Stabilizes Attention input, prevents numerical instability |
| Attention(norm_x) | Computes head interactions | Semantically connects different tokens |
| + (Residual) | Adds original input | Gradient highway, preserves original information |
| RMSNorm(h₁) | Normalizes before FFN | Stabilizes FFN input |
| SwiGLU(norm_h) | Non-linear projection with gating | Increases model capacity, learnable gating |
| + (Residual) | Adds h₁ back | Gradient highway, preserves Attention output |
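Putting the table together as code, here is a sketch of the data flow. Everything below is illustrative: `nn.MultiheadAttention` stands in for the model-specific attention, the gated MLP is a simplified SwiGLU, and the compact RMSNorm mirrors the sketch above, so details differ from any particular production model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        return self.weight * x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class PreNormTransformerBlock(nn.Module):
    """Pre-LN block: normalize -> transform -> add residual, twice."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 1376):
        super().__init__()
        # d_ff ≈ 8/3 · d_model is a common SwiGLU sizing; the exact value here is arbitrary
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(d_model)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) RMSNorm(x) -> Attention(norm_x) -> + residual
        norm_x = self.attn_norm(x)
        attn_out, _ = self.attn(norm_x, norm_x, norm_x)
        h1 = x + attn_out                                   # first gradient highway
        # 2) RMSNorm(h1) -> SwiGLU(norm_h) -> + residual
        norm_h = self.ffn_norm(h1)
        ffn_out = self.w_down(F.silu(self.w_gate(norm_h)) * self.w_up(norm_h))
        return h1 + ffn_out                                 # second gradient highway

x = torch.randn(2, 16, 512)
print(PreNormTransformerBlock()(x).shape)  # torch.Size([2, 16, 512])
```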
Skip connections are not a weakness or backup plan. They are a primary design principle: the network only needs to learn the change relative to its input (the residual), not the complete transformation.
Without normalization, activation distributions shift from update to update, and the network has to keep re-adapting to them. With normalization, the distribution stays stable and learning becomes more efficient.
Pre-normalization allows training with little or no learning-rate warmup. This is not just a practical convenience; it also enables deeper models, and all modern models use Pre-LayerNorm.
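The difference between the two variants is purely the ordering of the operations. A schematic sketch (function names are placeholders):

```python
# Post-LN (original Transformer, 2017): normalize AFTER the residual addition.
# The gradient must pass through every LayerNorm on its way back, which is why
# Post-LN training typically needs learning-rate warmup.
def post_ln_sublayer(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LN (modern standard): normalize BEFORE the sublayer, add the residual untouched.
# The identity path x + ... is never renormalized, so gradients flow through unchanged.
def pre_ln_sublayer(x, sublayer, norm):
    return x + sublayer(norm(x))
```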
RMSNorm omits mean subtraction, which makes it cheaper to compute, yet it is just as effective in practice. This shows that not every mathematical subtlety is necessary; what matters is what works empirically.
Residuals WITHOUT normalization = unstable. Normalization WITHOUT residuals = limited depth. Together: Stable, deep, efficient networks (50-100+ layers).
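A quick way to see the first half of this is to compare the gradient that reaches the input of a deep stack with and without skip connections. The toy experiment below uses plain linear+tanh layers; the exact numbers vary, but the plain stack's gradient is typically orders of magnitude smaller than the residual stack's.

```python
import torch
import torch.nn as nn

def input_gradient_norm(depth: int = 50, d: int = 64, residual: bool = True) -> float:
    """Backpropagate through `depth` layers and return the gradient norm at the input."""
    torch.manual_seed(0)  # identical initialization for both runs
    layers = [nn.Sequential(nn.Linear(d, d), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(8, d, requires_grad=True)
    h = x
    for layer in layers:
        h = h + layer(h) if residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

print("with residuals:   ", input_gradient_norm(residual=True))
print("without residuals:", input_gradient_norm(residual=False))
```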
More layers = more parameters, more compute. But with Residuals + Normalization, scaling is predictable and stable, not chaotic as it would be without these techniques.
All large modern language models use Residual Connections + Pre-LayerNorm. Here are the typical configurations:
| Model | Normalization | Residual Type | Layers | d_model |
|---|---|---|---|---|
| GPT-2 | LayerNorm | Pre | 12-48 | 768-1600 |
| GPT-3 | LayerNorm | Pre | 96 | 12,288 |
| PaLM | RMSNorm | Pre | 118 | 18,432 |
| Llama 2 70B | RMSNorm | Pre | 80 | 8,192 |
| Llama 3 70B | RMSNorm | Pre | 80 | 8,192 |
| Claude 3 | LayerNorm (presumably) | Pre | ~100+ | ~8K-10K |
| Mistral 7B | RMSNorm | Pre | 32 | 4,096 |