The Vanishing Gradient Problem

In deep neural networks, gradients are multiplied at each layer during backpropagation. Without skip connections, repeated multiplication by values smaller than 1 causes the gradients to shrink exponentially, so the lower layers barely learn. Residual connections create a "Gradient Highway" that lets gradients flow directly to earlier layers.

Fig. 1 | Side-by-side comparison of gradient flow during backpropagation. Left ("Without Residual Connections"): a traditional deep network shows the severe vanishing gradient problem, with gradients vanishing exponentially. Right ("With Residual Connections"): a residual network with skip connections maintains strong gradient flow through all layers via the Gradient Highway. Gradient strength in the visualization is color-coded as strong (≥ 0.7), medium (0.3 - 0.7), weak (0.1 - 0.3), or vanishing (< 0.1).

Why Gradients Vanish

In traditional backpropagation, gradients are multiplied at each layer:

  • The gradient passes through each layer's weights and the derivative of its activation function
  • In deep networks, this means many multiplications by values smaller than 1
  • The result is exponential decay: for example, 0.8^10 ≈ 0.107
  • The lowest layers therefore receive barely any learning signal (see the sketch below)
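
A minimal PyTorch sketch of this decay (the depth of 20, width of 32, sigmoid activations, and dummy loss are arbitrary illustrative choices): it stacks plain linear + sigmoid layers, runs one backward pass, and prints each layer's weight-gradient norm, which shrinks sharply toward the input.

```python
# Illustration of vanishing gradients in a plain (non-residual) network.
# Depth, width, activation, and loss are arbitrary choices for demonstration.
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 20, 32
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
plain_net = nn.Sequential(*layers)

x = torch.randn(8, width)
loss = plain_net(x).pow(2).mean()  # dummy loss; only the gradients matter
loss.backward()

# Gradient norm per linear layer: it decreases sharply toward the input (layer 0).
for i, module in enumerate(plain_net):
    if isinstance(module, nn.Linear):
        print(f"layer {i:2d}: grad norm = {module.weight.grad.norm().item():.2e}")
```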

The Solution: Residual Connections

Skip connections create a "Gradient Highway":

  • A direct path for gradients through all layers
  • Formula: H(x) = F(x) + x instead of just F(x), where F(x) is the block's learned transformation
  • Because ∂H/∂x = ∂F/∂x + I, the gradient can pass through the skip connection unchanged
  • This enables training of networks with 100+ layers (see the sketch below)
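
A minimal sketch of such a residual block, assuming PyTorch; the two-layer form of F(x), the ReLU activation, and the width are illustrative choices rather than a specific published architecture.

```python
# Minimal residual block: H(x) = F(x) + x.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        # F(x): a small learned transformation (illustrative two-layer form).
        self.f = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x):
        # The "+ x" term is the skip connection. Its local derivative is the
        # identity, so during backpropagation the incoming gradient reaches the
        # block input unchanged, no matter how small the gradient through f is.
        return self.f(x) + x

block = ResidualBlock(32)
x = torch.randn(8, 32)
print(block(x).shape)  # torch.Size([8, 32])
```

Note that F(x) and x must have the same shape for the addition; architectures that change the width between blocks typically add a projection on the skip path.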