SwiGLU Activation Function

Output = SwiGLU(x) = Swish(xW + b) βŠ™ (xV + c), where Swish(z) = z Β· Οƒ(z) and βŠ™ is elementwise multiplication.
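A minimal sketch in PyTorch (an assumption; the note itself names no framework). torch.nn.functional.silu is PyTorch's Swish with Ξ² = 1, and the weight and bias names mirror the formula above.

```python
import torch
import torch.nn.functional as F

def swiglu(x, W, b, V, c):
    # Swish(xW + b) forms the gate; (xV + c) is the linear value path.
    return F.silu(x @ W + b) * (x @ V + c)

# Toy shapes: batch of 4 vectors, d_in = 8, d_hidden = 16.
x = torch.randn(4, 8)
W, V = torch.randn(8, 16), torch.randn(8, 16)
b, c = torch.zeros(16), torch.zeros(16)
out = swiglu(x, W, b, V, c)  # shape (4, 16)
```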

Gate Visualization

Gate = Οƒ(xV + c) ∈ (0, 1): each gate element is a soft on/off weight that scales the corresponding feature.
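A quick numerical check of the gate's range (a sketch; the random x, V, c below are placeholders, not values from the note):

```python
import torch

x = torch.randn(32, 8)
V, c = torch.randn(8, 16), torch.zeros(16)

gate = torch.sigmoid(x @ V + c)  # one soft on/off weight per feature
print(gate.min().item(), gate.max().item())  # always strictly inside (0, 1)
```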

Gated Linear Units (GLU)

GLU extends a linear transformation with a multiplicative gate: Output = (xW + b) βŠ™ Οƒ(xV + c). The sigmoid gate learns which features are relevant and scales each one accordingly.
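A sketch of GLU as a module (the class name GLULayer is illustrative and distinct from torch.nn.GLU, which instead splits a single projection in half):

```python
import torch
import torch.nn as nn

class GLULayer(nn.Module):
    """Gated Linear Unit: (xW + b) βŠ™ Οƒ(xV + c)."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.value = nn.Linear(d_in, d_out)  # xW + b
        self.gate = nn.Linear(d_in, d_out)   # xV + c

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.value(x) * torch.sigmoid(self.gate(x))

layer = GLULayer(8, 16)
y = layer(torch.randn(4, 8))  # shape (4, 16)
```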

SwiGLU vs GELU

GELU: smooth, but it applies the same fixed nonlinearity to every feature, with no explicit selection. SwiGLU: adds a learned gate, so the model can explicitly decide, per feature, what to pass through (see the sketch below).
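A side-by-side sketch of the two feed-forward blocks, assuming the usual Transformer FFN structure (class and attribute names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GELUFFN(nn.Module):
    """Standard FFN: up-project, GELU, down-project; no explicit gate."""

    def __init__(self, d: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d, d_hidden)
        self.down = nn.Linear(d_hidden, d)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SwiGLUFFN(nn.Module):
    """Gated FFN: a Swish-activated gate multiplies a linear value path."""

    def __init__(self, d: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d, d_hidden)
        self.up_proj = nn.Linear(d, d_hidden)
        self.down = nn.Linear(d_hidden, d)

    def forward(self, x):
        return self.down(F.silu(self.gate_proj(x)) * self.up_proj(x))
```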

Empirical Gains

SwiGLU in Transformers: roughly 3-5% quality gain at the same model size. PaLM and LLaMA, among other modern LLMs, use SwiGLU in their feed-forward layers. Gating increases the layer's effective capacity.

Gating Mechanism

Two separate linear transformations feed the block: one is passed through Swish and acts as a learned gate, the other stays linear and carries the values (the sigmoid of the GLU gate is absorbed into Swish(z) = z Β· Οƒ(z)). The combination Swish(x₁) βŠ™ xβ‚‚ allows nuanced, input-dependent feature selection.
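A tiny numeric illustration of how Swish behaves as a soft gate (values in the comment are approximate):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-5.0, -1.0, 0.0, 1.0, 5.0])
print(F.silu(z))  # β‰ˆ [-0.03, -0.27, 0.00, 0.73, 4.97]
# Strongly negative pre-activations are squashed toward 0 (gate closed),
# large positive ones pass through almost unchanged (gate open), so
# Swish(x₁) βŠ™ xβ‚‚ rescales each feature of xβ‚‚ by an input-dependent amount.
```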

Computational Cost

Compared with a GELU FFN, SwiGLU adds a second up-projection, so its first layer costs roughly 2Γ— the linear operations at the same hidden width. In practice the hidden dimension is usually shrunk (commonly by a factor of about 2/3) to keep parameters and FLOPs matched, and at matched FLOPs SwiGLU still delivers higher quality.
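A parameter-count sanity check under that 2/3 convention (the width d below is illustrative; real models also round the hidden size to hardware-friendly multiples):

```python
d = 4096                            # model width (illustrative)
gelu_hidden = 4 * d                 # conventional 4d hidden size
swiglu_hidden = int(2 / 3 * 4 * d)  # shrunk so the extra matrix is "free"

gelu_params = 2 * d * gelu_hidden      # up + down projections
swiglu_params = 3 * d * swiglu_hidden  # gate + up + down projections

print(f"{gelu_params:,} vs {swiglu_params:,}")  # both β‰ˆ 134 million
```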

Modern Standard

Most new foundation models (2024+) use SwiGLU or a similar gated variant such as GeGLU. Plain ReLU feed-forward layers have largely disappeared from new architectures; gating is now the baseline in state-of-the-art models.