
Comparison: Activation Functions

All Functions Compared
Fig. 1 | The activation functions ReLU, GELU, and Swish overlaid. The current input value (set by the slider) is marked with a vertical line.
Output for Current Input
Fig. 2 | The output values for the current input value, color-coded for each function.

Functions in Detail

ReLU (Rectified Linear Unit)
f(x) = max(0, x)
The activation function used in the feed-forward layers of the original Transformer. Simple: negative values are set to 0, positive values pass through unchanged.
Used in: Transformer (2017), most earlier deep networks
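
A minimal NumPy sketch of ReLU (illustrative only, not part of the original page):

```python
import numpy as np

def relu(x):
    # Negative inputs are clamped to 0; positive inputs pass through unchanged.
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```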
GELU (Gaussian Error Linear Unit)
f(x) = x · Φ(x)
Smoother alternative to ReLU. Uses the cumulative distribution function of the standard normal distribution (Φ). Lets small negative values pass through, which gives non-zero gradients where ReLU would be flat.
Used in: BERT, GPT-2, GPT-3
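
A small sketch of GELU, assuming NumPy and SciPy are available; it shows both the exact form via the error function and the tanh approximation popularized by the BERT/GPT-2 code (function names here are illustrative):

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Phi(x): cumulative distribution function of the standard normal distribution
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    # Tanh approximation used e.g. in the original BERT/GPT-2 implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(gelu_exact(x))  # small negative values leak through (e.g. ~-0.154 at x = -0.5)
print(gelu_tanh(x))   # very close to the exact values
```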
Swish (also SiLU)
f(x) = x · σ(x)
Smooth, self-gating function. The σ(x) term (sigmoid) acts like a gating mechanism: for strongly negative x the gate is nearly closed, for positive x it is open.
Used in: EfficientNet, certain Transformer variants
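
A minimal NumPy sketch of Swish/SiLU (illustrative; β = 1 corresponds to SiLU):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # beta = 1 gives SiLU; the sigmoid term acts as a soft gate on x.
    return x * sigmoid(beta * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(swish(x))  # gate nearly closed for negative x, open for positive x
```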
SwiGLU (Swish Gated Linear Unit)
f(x) = (Swish(xW) ⊗ xV)W₂
Modern gated variant with two parallel paths: the Swish path is multiplied element-wise with a linear path. It requires three weight matrices instead of two, but offers better expressivity.
Used in: Llama, PaLM, Mistral 7B, and other modern state-of-the-art models
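
A rough sketch of a SwiGLU feed-forward block following the formula above. The weight shapes, the scale factor 0.02, and the hidden size of roughly 8/3 × d_model are illustrative assumptions, not taken from any specific model:

```python
import numpy as np

def swish(x):
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, W, V, W2):
    # (Swish(xW) ⊗ xV) W2: the Swish path gates the linear path element-wise.
    return (swish(x @ W) * (x @ V)) @ W2

# Illustrative sizes: hidden dimension ≈ 8/3 * d_model, rounded
d_model, d_hidden = 512, 1368
rng = np.random.default_rng(0)
W  = rng.standard_normal((d_model, d_hidden)) * 0.02
V  = rng.standard_normal((d_model, d_hidden)) * 0.02
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02
x  = rng.standard_normal((4, d_model))  # a batch of 4 token vectors

print(swiglu_ffn(x, W, V, W2).shape)    # (4, 512)
```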

Key Insights

1. ReLU is simple but problematic: It "kills" negative activations completely (the dead ReLU problem). Neurons can end up never being activated during training and stop learning.
2. GELU is smoother: Thanks to the Gaussian CDF, small negative values can still flow through. This gives better gradients during backpropagation, especially in deep networks.
3. Swish combines gating with smoothness: The sigmoid term acts as a soft "on/off switch", while the function stays continuously differentiable. The result is better gradient flow than with ReLU.
4. SwiGLU is the modern standard: The dual-path architecture with gating has become standard in modern LLMs. It requires more parameters (three matrices instead of two), but the quality gains justify the overhead.
5. Parameter trade-off: SwiGLU compensates for the third matrix by reducing the hidden dimension (about 2.67× instead of 4× d_model), so the total parameter count stays roughly the same while performance improves (see the sketch after this list).
6. Activation behavior: Observe how differently the functions behave: ReLU is piecewise linear, while GELU and Swish are smooth curves that dip slightly below zero for small negative inputs. These differences influence training stability and convergence.
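
A back-of-the-envelope check of the parameter trade-off from insight 5, with an illustrative d_model of 4096:

```python
d_model = 4096  # illustrative, Llama-scale

# Classic Transformer FFN: two matrices, hidden size 4 * d_model
ffn_params = 2 * d_model * (4 * d_model)

# SwiGLU FFN: three matrices, hidden size reduced to about 8/3 * d_model
swiglu_hidden = int(8 * d_model / 3)
swiglu_params = 3 * d_model * swiglu_hidden

print(f"{ffn_params:,} vs. {swiglu_params:,}")  # both ≈ 134 million parameters
```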