Step-by-step visualization of how LayerNorm normalizes, scales, and shifts a vector
Layer Normalization is a core building block for stable training of deep networks. This live demo shows every computation step: mean calculation, variance, normalization, and finally Scale (γ) and Shift (β), the trainable parameters that let the model learn the optimal activation distribution.
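To make these steps concrete, here is a minimal NumPy sketch of the same computation; the input vector and the γ/β initializations are illustrative, not taken from the demo.

```python
import numpy as np

x = np.array([2.0, -1.0, 3.0, 0.0])      # input vector (one token's activations, made up)
gamma = np.ones_like(x)                   # Scale (γ), trainable, initialized to 1
beta = np.zeros_like(x)                   # Shift (β), trainable, initialized to 0
eps = 1e-5                                # small constant for numerical stability

mean = x.mean()                           # Step 1: mean over the features
var = x.var()                             # Step 2: variance over the features
x_hat = (x - mean) / np.sqrt(var + eps)   # Step 3: normalize to mean 0, variance 1
y = gamma * x_hat + beta                  # Step 4: scale and shift

print(mean, var)                          # 1.0 2.5
print(x_hat)                              # ≈ [ 0.632, -1.265,  1.265, -0.632]
print(y)                                  # identical to x_hat while γ=1 and β=0
```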
This page complements the conceptual explanation of Residual & LayerNorm with an interactive calculation. Entering your own vectors and following each normalization step live helps develop intuitive understanding of the mathematics.
Without normalization, activations in deep networks would explode or vanish. LayerNorm (or RMSNorm) appears before and/or after every attention and FFN sublayer: typically 2 × number of layers, i.e. roughly 64-256 normalizations per forward pass.
LayerNorm stabilizes training of deep networks by normalizing activations over the features (not over the batch). It computes mean and variance for each individual vector and transforms it to mean 0 and variance 1. The trainable parameters γ (Scale) and β (Shift) then allow the model to learn the optimal distribution.
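Written compactly, this is the standard LayerNorm transformation (ε is a small constant for numerical stability):

```latex
y_i = \gamma_i \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i,
\qquad
\mu = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} x_i,
\qquad
\sigma^2 = \frac{1}{d_{\text{model}}} \sum_{i=1}^{d_{\text{model}}} (x_i - \mu)^2
```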
LayerNorm: computes mean and variance over the feature dimension of each individual vector, independently of the other examples in the batch; it therefore works with any batch size and with variable-length sequences.
BatchNorm: computes mean and variance for each feature over the batch dimension, so its statistics depend on the batch composition, which is why it is rarely used in Transformers (see the sketch below).
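The core difference is the axis over which mean and variance are computed; a small sketch with made-up numbers:

```python
import numpy as np

# Illustrative batch of 3 vectors with d_model = 4 (values are made up).
X = np.array([[2.0, -1.0, 3.0, 0.0],
              [1.0,  4.0, -2.0, 5.0],
              [0.5, -0.5, 1.5, 2.5]])
eps = 1e-5

# LayerNorm: statistics per row (per example), over the feature axis.
ln = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)

# BatchNorm: statistics per column (per feature), over the batch axis.
bn = (X - X.mean(axis=0, keepdims=True)) / np.sqrt(X.var(axis=0, keepdims=True) + eps)

print(ln.mean(axis=1))  # ≈ 0 for every example, independent of the batch
print(bn.mean(axis=0))  # ≈ 0 for every feature, but depends on the whole batch
```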
RMSNorm is a simplified variant of LayerNorm, used in Llama, Mistral, and many modern LLMs: it drops the mean subtraction and the β shift and rescales each vector only by its root mean square, keeping γ as the single trainable parameter vector.
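A minimal sketch of RMSNorm, assuming the common formulation with ε inside the square root and γ as the only learned parameter:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # No mean subtraction, no β shift: rescale by the root mean square only.
    rms = np.sqrt(np.mean(x * x) + eps)
    return gamma * x / rms

x = np.array([2.0, -1.0, 3.0, 0.0])   # illustrative input
gamma = np.ones_like(x)               # Scale (γ), the only trainable parameter
print(rms_norm(x, gamma))             # ≈ [ 1.069, -0.535,  1.604,  0.0]
```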
LayerNorm has 2 × d_model trainable parameters: the scale vector γ and the shift vector β, each with d_model entries.
Example: with d_model = 512, LayerNorm has 2 × 512 = 1024 parameters.
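The count can be verified with PyTorch's nn.LayerNorm, which stores γ as weight and β as bias:

```python
import torch.nn as nn

ln = nn.LayerNorm(512)                              # γ (weight) and β (bias), each of size 512
n_params = sum(p.numel() for p in ln.parameters())
print(n_params)                                     # 1024
```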