Research Perspective: How Feedforward Networks Store Factual Knowledge While Attention Layers Route Information
Feedforward Networks as Memory – a fascinating perspective: The FFN layers function like a massive key-value store for world knowledge, while Attention determines which knowledge is retrieved.
After Multi-Head Attention (Step 5), the FFN processes each token position independently. The key-value view sketched below illustrates how factual knowledge is retrieved in the process.
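A minimal PyTorch sketch of this view, assuming a standard two-layer FFN with GELU; the dimensions (d_model = 512, d_ff = 2048) are illustrative. The rows of the up-projection weight act as "keys" and the columns of the down-projection weight as "values":

```python
# Minimal sketch (PyTorch) of a position-wise FFN read as a key-value memory.
# Dimensions and the GELU activation are illustrative choices, not from a specific model.
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)   # rows of w_in.weight act like "keys"
        self.w_out = nn.Linear(d_ff, d_model)  # columns of w_out.weight act like "values"
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); each position is processed independently
        match_scores = self.act(self.w_in(x))  # how strongly each "key" matches this token
        return self.w_out(match_scores)        # weighted sum of the corresponding "values"

x = torch.randn(2, 8, 512)                     # 2 sequences, 8 token positions
print(PositionWiseFFN()(x).shape)              # torch.Size([2, 8, 512])
```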
Research on "knowledge editing" (e.g., updating the stored answer to "Who is the current president?") shows that often only FFN weights need to be modified: the FFN stores the facts, while Attention understands context and routes information.
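As a toy illustration of that idea (in the spirit of ROME-style editing, but not a real editing procedure), one could rewrite a single "value" vector in the FFN down-projection; the neuron index and target vector below are hypothetical placeholders:

```python
# Toy sketch of FFN-only knowledge editing: rewrite the "value" vector of one neuron
# in the down-projection. The neuron index and target vector are placeholders.
import torch

d_model, d_ff = 512, 2048
w_out = torch.randn(d_model, d_ff)    # down-projection: column j is the value vector of neuron j

neuron_idx = 1234                     # hypothetical neuron associated with the outdated fact
new_value = torch.randn(d_model)      # hypothetical representation of the corrected fact

with torch.no_grad():
    w_out[:, neuron_idx] = new_value  # only this value vector changes; attention stays untouched
```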
According to recent research, Feedforward Networks (FFNs) play a memory-like role in LLMs: they store factual knowledge and associations, while the Attention layers determine which information is relevant. This explains why FFNs hold the majority of the parameters (roughly 2/3 of each transformer block) and why Mixture-of-Experts (MoE) models use specialized experts for different knowledge domains.
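A quick back-of-the-envelope check of the ~2/3 figure for one transformer block, counting only the attention and FFN weight matrices and assuming the common d_ff = 4 × d_model:

```python
# Back-of-the-envelope parameter count for one transformer block
# (weight matrices only, no biases, embeddings or layer norms).
d_model = 4096
d_ff = 4 * d_model

attn_params = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff       # W_in (d_model x d_ff) + W_out (d_ff x d_model)

print(ffn_params / (attn_params + ffn_params))   # 8/12 = 0.666..., roughly 2/3
```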
| Observation | Memory Hypothesis Interpretation | Implications |
|---|---|---|
| 2/3 of parameters in FFN | FFNs store most of the knowledge | Knowledge compression per parameter is important |
| FFN is position-wise | Each position can independently access its memory | Parallelization possible |
| MoE has specialized experts | Different experts store different domains | Router must select correct domain |
| Neuron as concept | Individual FFN neurons encode concepts | Interpretability possible |
| Adapter modules work | A small set of extra parameters can "inject" knowledge | Efficient fine-tuning possible (LoRA) |
Interpretability research suggests that individual FFN neurons encode concepts. Example: one neuron activates for all mentions of "Paris", another for country names.
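A hypothetical probing sketch for such a "Paris neuron": compare each neuron's mean activation on positions that mention the concept with its mean activation elsewhere. The activations and position labels below are random stand-ins for real model outputs:

```python
# Hypothetical probing sketch: is some FFN neuron selective for "Paris" tokens?
# Activations and position labels are random stand-ins for real model outputs.
import torch

hidden = torch.relu(torch.randn(100, 2048))   # (positions, d_ff) FFN activations
is_paris = torch.zeros(100, dtype=torch.bool)
is_paris[[3, 17, 42]] = True                  # positions where a "Paris" token occurs (made up)

selectivity = hidden[is_paris].mean(0) - hidden[~is_paris].mean(0)
print("most 'Paris'-selective neuron:", selectivity.argmax().item())
```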
MoE routers need enough expert "capacity", i.e. a limit on how many tokens each expert may process per batch. Too little capacity means tokens get dropped (information loss); too much wastes compute.
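A minimal sketch of a top-1 router with such a capacity limit; the capacity formula and the drop-overflow behavior follow common MoE setups, but details vary between implementations:

```python
# Minimal sketch of a top-1 MoE router with a capacity limit.
# The capacity formula and the "drop overflow tokens" behavior follow common MoE setups.
import torch

num_tokens, d_model, num_experts = 16, 512, 4
capacity_factor = 1.25
capacity = int(capacity_factor * num_tokens / num_experts)    # max tokens per expert

x = torch.randn(num_tokens, d_model)
router_logits = x @ torch.randn(d_model, num_experts)         # a learned projection in a real model
expert_choice = router_logits.softmax(dim=-1).argmax(dim=-1)  # top-1 expert per token

processed = 0
for e in range(num_experts):
    assigned = (expert_choice == e).sum().item()
    processed += min(assigned, capacity)                      # overflow tokens are dropped
print(f"capacity per expert: {capacity}, tokens processed: {processed}/{num_tokens}")
```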
During pretraining, the FFN layers accumulate knowledge. During fine-tuning (e.g., RLHF), it is primarily the routing (Attention) that is adjusted, while the FFN weights remain relatively stable.
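One way to make this split explicit in a fine-tuning run is to freeze the FFN weights and train only the rest; this is a sketch, not standard RLHF practice, and the parameter-name filter ("ffn"/"mlp") is an assumption that depends on the model implementation:

```python
# Sketch: freeze the FFN ("memory") and fine-tune only the remaining ("routing") parameters.
# The name filter ('ffn' / 'mlp') is an assumption that depends on the model implementation.
import torch.nn as nn

def freeze_ffn(model: nn.Module) -> None:
    for name, param in model.named_parameters():
        is_ffn = "ffn" in name or "mlp" in name
        param.requires_grad = not is_ffn   # FFN frozen; attention and norms keep learning
```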
As the number of FFN parameters grows, so does the "capacity" for knowledge. This is why models that must store a lot of knowledge use very wide FFNs (d_ff = 4 × d_model or more).
LoRA fits this picture: it only relearns the "routing" via small adapter weights (typically low-rank updates on the Attention projections), not the base knowledge stored in the FFN. That is why roughly 0.1% additional parameters are enough!
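A minimal LoRA-style layer as a sketch of that argument, assuming rank r = 8 on a single 4096 × 4096 projection (both values are illustrative):

```python
# Minimal LoRA-style layer: the frozen base weight keeps the knowledge,
# only the low-rank update (A, B) is trained. Rank r = 8 is an illustrative choice.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False            # base knowledge stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero init: starts as a no-op update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")     # about 0.39% for this single layer
```

For this one projection the trainable share comes out around 0.39%; with a small rank applied to only a few matrices of a full model, the overall fraction can drop toward the ~0.1% quoted above.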
FFN memory also enables zero-shot behavior: the model combines concepts it has already stored in new ways, without changing the memory itself.