What is Layer Normalization?

LayerNorm stabilizes the training of deep networks by normalizing activations across the feature dimension rather than across the batch. For each individual vector it computes the mean and variance and rescales the vector to zero mean and unit variance. The trainable parameters γ (scale) and β (shift) then allow the model to learn the optimal output distribution.
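
A minimal NumPy sketch of this computation (illustrative only; γ initialized to 1, β to 0, and ε = 10⁻⁵, as described under Parameters below):

  import numpy as np

  def layer_norm(x, gamma, beta, eps=1e-5):
      # Statistics per vector, over the feature (last) axis
      mean = x.mean(axis=-1, keepdims=True)
      var = x.var(axis=-1, keepdims=True)
      x_hat = (x - mean) / np.sqrt(var + eps)
      # Learned scale and shift restore expressive freedom
      return gamma * x_hat + beta

  d_model = 8
  x = np.random.randn(2, 4, d_model)              # (batch, seq_len, d_model)
  y = layer_norm(x, np.ones(d_model), np.zeros(d_model))
  print(y.mean(axis=-1), y.var(axis=-1))          # ≈ 0 and ≈ 1 per vector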

LayerNorm vs BatchNorm

LayerNorm:

  • Normalizes over features (dimensions)
  • Each sample individually
  • Independent of batch size
  • Ideal for sequences (NLP)

BatchNorm:

  • Normalizes over batch
  • All samples together
  • Depends on batch size
  • Ideal for CNNs (Vision)
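
In code, the difference comes down to which axis the statistics are computed over; a rough sketch (ignoring ε, the learned scale/shift, and BatchNorm's running statistics):

  import numpy as np

  x = np.random.randn(32, 512)                    # (batch, features)

  # LayerNorm: statistics per sample, over the feature axis
  ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

  # BatchNorm (training mode): statistics per feature, over the batch axis
  bn = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)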

RMSNorm (Modern)

Simplified variant of LayerNorm, used in Llama, Mistral, and many modern LLMs:

RMSNorm(x) = (x / RMS(x)) · γ
RMS(x) = √(Σᵢ xᵢ² / n)
  • No mean subtraction (only RMS)
  • No shift parameter β
  • ~10-15% faster than LayerNorm
  • Same quality in practice
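
Translated directly into code (a sketch; adding a small eps inside the square root is a common numerical-stability tweak, analogous to LayerNorm's ε):

  import numpy as np

  def rms_norm(x, gamma, eps=1e-6):
      # No mean subtraction, no β: only rescale by the root mean square
      rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
      return (x / rms) * gamma

  d_model = 8
  x = np.random.randn(2, d_model)
  print(rms_norm(x, np.ones(d_model)))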

Why Normalization Matters

  • Stabilizes training of deep networks
  • Reduces Internal Covariate Shift
  • Enables higher learning rates
  • Improves gradient flow
  • Reduces dependency on initialization

Parameters

LayerNorm has 2 × d_model trainable parameters:

  • γ (gamma): Scale parameter, usually initialized to 1
  • β (beta): Shift parameter, usually initialized to 0
  • ε (epsilon): Small constant (~10⁻⁵) for numerical stability (a fixed hyperparameter, not trained)

Example: With d_model = 512, LayerNorm has 2 × 512 = 1024 trainable parameters.
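
The count is easy to verify, for example with PyTorch's nn.LayerNorm (a quick sketch, assuming PyTorch is installed):

  import torch.nn as nn

  ln = nn.LayerNorm(512)                          # d_model = 512
  print(sum(p.numel() for p in ln.parameters()))  # 1024 = 512 (γ) + 512 (β)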