How gating mechanisms in modern activation functions increase model capacity
SwiGLU (Swish-Gated Linear Units) is the modern standard for activation functions in FFN layers. Unlike simple activations like ReLU or GELU, SwiGLU uses an explicit gating mechanism that allows the model to selectively pass or block information.
In the feedforward network (FFN) after the attention block, the activation function determines how information is transformed. SwiGLU has established itself as the best-performing variant and is used in practically all state-of-the-art models since 2023.
SwiGLU improves model quality by roughly 3-5% at the same parameter count. The gating branch, activated with Swish (x · σ(x)), decides how strongly each feature passes through, while a second, purely linear branch carries the values; multiplying the two element-wise gives a smooth, learned feature selection. This combination significantly increases the model's effective capacity.
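A minimal sketch of such a gated FFN block in PyTorch (the class and parameter names, e.g. SwiGLUFFN and d_hidden, are illustrative and not taken from any specific codebase); it follows the three-matrix layout popularized by LLaMA-style models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feedforward block with a SwiGLU activation (sketch)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # gate projection (W1)
        self.w_value = nn.Linear(d_model, d_hidden, bias=False)  # value projection (W2)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)   # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W1) * (x W2): the Swish-activated branch gates the linear branch element-wise
        gate = F.silu(self.w_gate(x))   # Swish / SiLU: x * sigmoid(x)
        value = self.w_value(x)
        return self.w_down(gate * value)

ffn = SwiGLUFFN(d_model=512, d_hidden=1365)   # ~2/3 * 4 * d_model, LLaMA-style sizing
out = ffn(torch.randn(2, 16, 512))            # (batch, seq_len, d_model) -> same shape
```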
GLU extends linear transformations with gating: GLU(x) = (xW + b) ⊗ σ(xV + c). The sigmoid gate σ(xV + c) models which features are relevant.
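The formula translates directly into a few tensor operations; the shapes below are a toy example:

```python
import torch

d_in, d_out = 8, 16
x = torch.randn(4, d_in)                             # batch of 4 toy inputs
W, b = torch.randn(d_in, d_out), torch.zeros(d_out)
V, c = torch.randn(d_in, d_out), torch.zeros(d_out)

# GLU(x) = (xW + b) * sigmoid(xV + c): the sigmoid branch scales each feature into (0, 1)
glu_out = (x @ W + b) * torch.sigmoid(x @ V + c)
print(glu_out.shape)  # torch.Size([4, 16])
```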
GELU: smooth, but no explicit selection. SwiGLU: adds a gating path → the model can explicitly decide which features to pass through.
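For contrast, a sketch of the classic two-matrix GELU FFN that SwiGLU replaces (layer names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GELUFFN(nn.Module):
    """Classic Transformer FFN: up-project, GELU, down-project; no gate."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_up = nn.Linear(d_model, d_hidden)
        self.w_down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every feature goes through the same fixed nonlinearity;
        # there is no second, input-dependent path deciding what passes.
        return self.w_down(F.gelu(self.w_up(x)))
```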
SwiGLU in Transformers: +3-5% performance on same model size. PaLM, LLaMA, Claude use SwiGLU. Gating increases effective capacity.
Two separate linear projections: one is passed through Swish and acts as the gate, the other stays linear and carries the values. The combination Swish(xW₁) ⊗ (xW₂) allows nuanced feature selection.
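The selection effect is easy to see by inspecting the gate values on made-up numbers:

```python
import torch
import torch.nn.functional as F

pre_gate = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0])  # x @ W1 for one token (made-up values)
value    = torch.tensor([ 2.0,  2.0, 2.0, 2.0, 2.0])  # x @ W2 for the same token

gate = F.silu(pre_gate)  # Swish/SiLU: strongly negative inputs -> ~0, large positive -> ~identity
print(gate)              # tensor([-0.0719, -0.2689,  0.0000,  0.7311,  3.9281])
print(gate * value)      # features with a negative pre-gate are almost entirely blocked
```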
SwiGLU doubles the input-side projections (gate and value instead of a single up-projection), so the FFN has three weight matrices instead of two. The hidden dimension is therefore usually scaled down to ~2/3 so that parameters and FLOPs match a GELU FFN, and output quality is still higher at the same FLOPs.
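A quick parameter-count check under the common 2/3 hidden-dimension scaling (the sizing rule is an assumption based on LLaMA-style configurations):

```python
d_model = 4096

# GELU FFN: two matrices (up + down), hidden dim 4 * d_model
d_ff_gelu = 4 * d_model
params_gelu = 2 * d_model * d_ff_gelu

# SwiGLU FFN: three matrices (gate + value + down), hidden dim scaled to 2/3 * 4 * d_model
d_ff_swiglu = int(2 / 3 * 4 * d_model)
params_swiglu = 3 * d_model * d_ff_swiglu

print(f"{params_gelu:,} vs {params_swiglu:,}")  # 134,217,728 vs 134,209,536: nearly identical budget
```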
New foundation models (2024+) almost all use SwiGLU or similar gated variants; plain ReLU is effectively obsolete. Gating is now the baseline in SOTA architectures.