
Comparison: Activation Functions

All Functions Compared
Fig. 1 | The activation functions ReLU, GELU, and Swish overlaid. The current input value (set by the slider) is marked with a vertical line.
Output for Current Input
Fig. 2 | The output values for the current input value, color-coded for each function.

Functions in Detail

ReLU (Rectified Linear Unit)
f(x) = max(0, x)
The activation function used in the feed-forward layers of the original Transformer. Simple: negative values are set to 0, positive values pass through unchanged.
Used in: Transformer (2017), most earlier deep networks
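
A minimal NumPy sketch of ReLU (illustrative only, not part of the original page):

```python
import numpy as np

def relu(x):
    # Negative inputs are clamped to 0; positive inputs pass through unchanged.
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```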
GELU (Gaussian Error Linear Unit)
f(x) = x · Φ(x)
Smoother alternative to ReLU. Uses the cumulative distribution function of the standard normal distribution (Φ). Lets small negative values pass through, which gives non-zero gradients where ReLU would be flat.
Used in: BERT, GPT-2, GPT-3
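
A small sketch of GELU, assuming NumPy and SciPy are available; it shows both the exact form via the error function and the tanh approximation popularized by the BERT/GPT-2 code (function names here are illustrative):

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # Phi(x): cumulative distribution function of the standard normal distribution
    phi = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    # Tanh approximation used e.g. in the original BERT/GPT-2 implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(gelu_exact(x))  # small negative values leak through (e.g. ~-0.154 at x = -0.5)
print(gelu_tanh(x))   # very close to the exact values
```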
Swish (also SiLU)
f(x) = x · σ(x)
Smooth, self-gating function. The σ(x) term (sigmoid) acts like a gating mechanism: for strongly negative x the gate is nearly closed, for positive x it is open.
Used in: EfficientNet, certain Transformer variants
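
A minimal NumPy sketch of Swish/SiLU (illustrative; β = 1 corresponds to SiLU):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # beta = 1 gives SiLU; the sigmoid term acts as a soft gate on x.
    return x * sigmoid(beta * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(swish(x))  # gate nearly closed for negative x, open for positive x
```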
SwiGLU (Swish Gated Linear Unit)
f(x) = (Swish(xW) ⊗ xV)W₂
Modern gated variant with two parallel paths: the Swish path is multiplied element-wise with a linear path. It requires three weight matrices instead of two, but offers better expressivity.
Used in: Llama, PaLM, Mistral 7B, and other modern state-of-the-art models
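
A rough sketch of a SwiGLU feed-forward block following the formula above. The weight shapes, the scale factor 0.02, and the hidden size of roughly 8/3 × d_model are illustrative assumptions, not taken from any specific model:

```python
import numpy as np

def swish(x):
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, W, V, W2):
    # (Swish(xW) ⊗ xV) W2: the Swish path gates the linear path element-wise.
    return (swish(x @ W) * (x @ V)) @ W2

# Illustrative sizes: hidden dimension ≈ 8/3 * d_model, rounded
d_model, d_hidden = 512, 1368
rng = np.random.default_rng(0)
W  = rng.standard_normal((d_model, d_hidden)) * 0.02
V  = rng.standard_normal((d_model, d_hidden)) * 0.02
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02
x  = rng.standard_normal((4, d_model))  # a batch of 4 token vectors

print(swiglu_ffn(x, W, V, W2).shape)    # (4, 512)
```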

Key Insights

1. ReLU is simple but problematic: It "kills" negative activations completely (the dead ReLU problem). Neurons can end up never being activated during training and stop learning.
2. GELU is smoother: Thanks to the Gaussian CDF, small negative values can still flow through. This gives better gradients during backpropagation, especially in deep networks.
3. Swish combines gating with smoothness: The sigmoid term acts as a soft "on/off switch", while the function stays continuously differentiable. The result is better gradient flow than with ReLU.
4. SwiGLU is the modern standard: The dual-path architecture with gating has become standard in modern LLMs. It requires more parameters (three matrices instead of two), but the quality gains justify the overhead.
5. Parameter trade-off: SwiGLU compensates for the third matrix by reducing the hidden dimension (about 2.67× instead of 4× d_model), so the total parameter count stays roughly the same while performance improves (see the sketch after this list).
6. Activation behavior: Observe how differently the functions behave: ReLU is piecewise linear, while GELU and Swish are smooth curves that dip slightly below zero for small negative inputs. These differences influence training stability and convergence.
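
A back-of-the-envelope check of the parameter trade-off from insight 5, with an illustrative d_model of 4096:

```python
d_model = 4096  # illustrative, Llama-scale

# Classic Transformer FFN: two matrices, hidden size 4 * d_model
ffn_params = 2 * d_model * (4 * d_model)

# SwiGLU FFN: three matrices, hidden size reduced to about 8/3 * d_model
swiglu_hidden = int(8 * d_model / 3)
swiglu_params = 3 * d_model * swiglu_hidden

print(f"{ffn_params:,} vs. {swiglu_params:,}")  # both ≈ 134 million parameters
```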