From ReLU to GELU to SwiGLU: How activation functions give feedforward networks expressivity
Activation functions introduce non-linearity into neural networks. Without them, even a deep network would collapse into a single linear transformation, incapable of learning complex patterns.
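To see the collapse concretely, here is a minimal sketch (layer sizes chosen arbitrarily for illustration): two stacked linear layers with no activation in between are exactly equivalent to one linear layer whose weight is the product of the two.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between (biases omitted for clarity)
layer1 = nn.Linear(8, 16, bias=False)
layer2 = nn.Linear(16, 4, bias=False)

x = torch.randn(3, 8)  # a small batch of inputs

# Composing the two layers...
deep_output = layer2(layer1(x))

# ...equals a single linear map with the product of the two weight matrices
combined_weight = layer2.weight @ layer1.weight  # shape (4, 8)
single_output = x @ combined_weight.T

print(torch.allclose(deep_output, single_output, atol=1e-6))  # True
```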
Activation functions sit inside the feedforward network (FFN) that follows each attention layer. The FFN processes each token independently and is where much of the model's world knowledge is stored.
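A minimal sketch of such a position-wise FFN block with a GELU activation (the dimensions and 4x expansion factor are common conventions, not values from the text):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: the same weights are applied to every token independently."""
    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)    # expand
        self.act = nn.GELU()                      # non-linearity
        self.down = nn.Linear(d_hidden, d_model)  # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return self.down(self.act(self.up(x)))

ffn = FeedForward()
tokens = torch.randn(2, 10, 512)  # 2 sequences of 10 tokens each
out = ffn(tokens)                 # shape (2, 10, 512)
```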
SwiGLU (used in Llama and reportedly in GPT-4) shows ~1-2% better performance than GELU at the same compute. Its gating mechanism dynamically decides which information to pass through. The FFN contains ~⅔ of all model parameters.
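A minimal sketch of a SwiGLU feedforward block in the style Llama uses; the hidden size is scaled to roughly 2/3 of the usual 4x expansion so the parameter count stays comparable, and the layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 1365):  # ~2/3 of 2048
        super().__init__()
        # Three projections instead of two: the extra one produces the gate
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The Swish/SiLU-activated gate decides, per hidden feature,
        # how much of the "up" projection is allowed through
        gate = F.silu(self.gate_proj(x))
        return self.down_proj(gate * self.up_proj(x))

ffn = SwiGLUFeedForward()
out = ffn(torch.randn(2, 10, 512))  # shape (2, 10, 512)
```

The element-wise product `gate * up_proj(x)` is the gating step: where the gate is near zero, that feature is suppressed; where it is large, the feature passes through, which is what lets the network route information dynamically.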