What is Layer Normalization?

LayerNorm stabilizes the training of deep networks by normalizing activations across the feature dimension rather than across the batch. For each individual vector it computes the mean and variance and rescales the vector to zero mean and unit variance. The trainable parameters γ (scale) and β (shift) then allow the model to learn the optimal output distribution.
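
A minimal NumPy sketch of this computation (illustrative only; γ initialized to 1, β to 0, and ε = 10⁻⁵, as described under Parameters below):

  import numpy as np

  def layer_norm(x, gamma, beta, eps=1e-5):
      # Statistics per vector, over the feature (last) axis
      mean = x.mean(axis=-1, keepdims=True)
      var = x.var(axis=-1, keepdims=True)
      x_hat = (x - mean) / np.sqrt(var + eps)
      # Learned scale and shift restore expressive freedom
      return gamma * x_hat + beta

  d_model = 8
  x = np.random.randn(2, 4, d_model)              # (batch, seq_len, d_model)
  y = layer_norm(x, np.ones(d_model), np.zeros(d_model))
  print(y.mean(axis=-1), y.var(axis=-1))          # ≈ 0 and ≈ 1 per vector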

LayerNorm vs BatchNorm

LayerNorm:

  • Normalizes over features (dimensions)
  • Each sample individually
  • Independent of batch size
  • Ideal for sequences (NLP)

BatchNorm:

  • Normalizes over batch
  • All samples together
  • Depends on batch size
  • Ideal for CNNs (Vision)
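
In code, the difference comes down to which axis the statistics are computed over; a rough sketch (ignoring ε, the learned scale/shift, and BatchNorm's running statistics):

  import numpy as np

  x = np.random.randn(32, 512)                    # (batch, features)

  # LayerNorm: statistics per sample, over the feature axis
  ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

  # BatchNorm (training mode): statistics per feature, over the batch axis
  bn = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)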

RMSNorm (Modern)

Simplified variant of LayerNorm, used in Llama, Mistral, and many modern LLMs:

RMSNorm(x) = (x / RMS(x)) · γ
RMS(x) = √(Σᵢ xᵢ² / n)
  • No mean subtraction (only RMS)
  • No shift parameter β
  • ~10-15% faster than LayerNorm
  • Same quality in practice
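
Translated directly into code (a sketch; adding a small eps inside the square root is a common numerical-stability tweak, analogous to LayerNorm's ε):

  import numpy as np

  def rms_norm(x, gamma, eps=1e-6):
      # No mean subtraction, no β: only rescale by the root mean square
      rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
      return (x / rms) * gamma

  d_model = 8
  x = np.random.randn(2, d_model)
  print(rms_norm(x, np.ones(d_model)))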

Why Normalization Matters

  • Stabilizes training of deep networks
  • Reduces Internal Covariate Shift
  • Enables higher learning rates
  • Improves gradient flow
  • Reduces dependency on initialization

Parameters

LayerNorm has 2 × d_model trainable parameters:

  • γ (gamma): Scale parameter, usually initialized to 1
  • β (beta): Shift parameter, usually initialized to 0
  • ε (epsilon): Small constant (~10⁻⁵) for numerical stability (a fixed hyperparameter, not trained)

Example: With d_model = 512, LayerNorm has 2 × 512 = 1024 trainable parameters.
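
The count is easy to verify, for example with PyTorch's nn.LayerNorm (a quick sketch, assuming PyTorch is installed):

  import torch.nn as nn

  ln = nn.LayerNorm(512)                          # d_model = 512
  print(sum(p.numel() for p in ln.parameters()))  # 1024 = 512 (γ) + 512 (β)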