The Hypothesis

According to recent research, Feedforward Networks (FFNs) play a memory-like role in LLMs: they store factual knowledge and associations, while the Attention layers determine which information is relevant and how it is combined. This would explain why FFNs hold most of the parameters (~2/3 of a standard Transformer block) and why MoE models develop experts that specialize in different knowledge domains.

Architecture Roles in the Transformer

Attention Layer (Router)

Function: Calculates which information is relevant and how it should be combined
Operation: Softmax(Q·Kᵀ/√dₖ) → weighted average of the values V (see the sketch below)
Parameters: ~1/3 of Transformer parameters
Metaphor: "Routing Network" - determines which neurons should be active

Feedforward Layer (Memory)

Function: Stores factual knowledge, vocabulary, concept associations
Operation: Position-wise: x → ReLU(x·W₁ + b₁)·W₂ + b₂ (see the sketch below)
Parameters: ~2/3 of Transformer parameters
Metaphor: "Long-term Memory" - stores learned associations and facts

Practical Example: "Paris is the capital of France"

Input: "Paris" (during training on "Paris is the capital of France")
Attention: routes the token to relevant contexts (countries, geography, capitals)
FFN: learns the association Paris_embedding → {country: France, type: capital}
Input: "What is the capital of France?" (at inference)
Attention: routes the "France" token to the stored "Paris" knowledge
FFN: uses the learned association to generate "Paris"
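
To make the memory reading concrete, here is a deliberately hand-built toy: one FFN hidden unit whose input weights act as a key (a "Paris" pattern) and whose output weights act as the stored value (a "France" pattern). The embeddings and weights are invented for illustration; real models learn distributed versions of this.

```python
import numpy as np

# Toy key-value memory view of a single FFN slot (illustrative only; real models learn these).
d_model = 4
paris  = np.array([1.0, 0.0, 0.0, 0.0])    # hypothetical embedding of "Paris"
france = np.array([0.0, 1.0, 0.0, 0.0])    # hypothetical embedding of "France"

# One FFN "memory slot": key ≈ the Paris pattern, value ≈ the France pattern.
W1 = paris.reshape(d_model, 1)              # key matrix  (d_model × d_ff), here d_ff = 1
W2 = france.reshape(1, d_model)             # value matrix (d_ff × d_model)

def ffn(x):
    activation = np.maximum(0.0, x @ W1)    # how strongly the input matches the stored key
    return activation @ W2                  # write the associated value into the output

print(ffn(paris))    # ≈ France direction: the memory slot fires
print(ffn(france))   # ≈ zero vector: no matching key, nothing is recalled
```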

Evidence from Research

| Observation | Memory-Hypothesis Interpretation | Implications |
|---|---|---|
| ~2/3 of parameters are in the FFN | FFNs store most of the knowledge | Knowledge compression per parameter matters |
| FFN is position-wise | Each position can access its memory independently | Parallelization possible |
| MoE has specialized experts | Different experts store different domains | Router must select the correct domain |
| Neurons act as concepts | Individual FFN neurons encode concepts | Interpretability possible |
| Adapter modules work | A small number of parameters can "inject" knowledge | Efficient fine-tuning possible (LoRA) |

Knowledge Neurons

Research suggests that individual neurons in the FFN encode specific concepts. Example: one neuron activates for all surface forms of "Paris", another for country names in general.
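
A hedged sketch of how one might probe for such a neuron: run several surface forms of one concept through the FFN's first projection and look for the hidden unit that fires for all of them. Embeddings and weights here are random stand-ins, not real model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1 = rng.normal(size=(d_model, d_ff))               # first FFN projection (random stand-in)

# Hypothetical embeddings for surface forms of one concept ("Paris", "paris", "PARIS", ...):
# a shared base vector plus small per-variant noise.
concept_variants = rng.normal(size=(1, d_model)) + 0.05 * rng.normal(size=(4, d_model))

hidden = np.maximum(0.0, concept_variants @ W1)     # FFN hidden activations per variant
mean_act = hidden.mean(axis=0)                      # average activation of each hidden unit
candidate = int(mean_act.argmax())                  # the unit that fires most consistently
print(f"candidate 'knowledge neuron': unit {candidate}, mean activation {mean_act[candidate]:.2f}")
```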

Capacity Factor

In MoE models, each expert needs enough capacity (the number of tokens it may accept per batch) so the router can distribute the load. Too little capacity means tokens are dropped (information loss); too much wastes compute and memory.
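
A small sketch of the usual capacity calculation, following the common convention of Switch-Transformer-style MoE layers; `expert_capacity` is an illustrative helper, not a library function:

```python
import math

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Common MoE convention: how many tokens each expert may receive per batch."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

num_tokens, num_experts = 4096, 8
for cf in (0.5, 1.0, 1.25, 2.0):
    cap = expert_capacity(num_tokens, num_experts, cf)
    # With a perfectly balanced router each expert receives num_tokens / num_experts = 512 tokens.
    dropped = max(0, num_tokens // num_experts - cap)
    print(f"capacity_factor={cf:4}: capacity={cap:4} tokens/expert, dropped if balanced={dropped}")
```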

Training Dynamics

During Pretraining, FFN stores knowledge. During Fine-Tuning (RLHF), primarily the routing (Attention) is relearned. FFN remains relatively stable.
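
If this picture holds, one natural fine-tuning recipe is to freeze the FFN ("memory") and train only the attention ("routing"). A sketch using PyTorch's `nn.TransformerEncoderLayer`, where `linear1`/`linear2` form the position-wise FFN; the freezing rule is an assumption of this hypothesis, not a standard recipe:

```python
import torch.nn as nn

# One standard encoder block: self_attn is the attention sub-layer,
# linear1/linear2 form the position-wise FFN.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

for name, param in layer.named_parameters():
    # Freeze the FFN ("memory"); keep attention and the layer norms ("routing") trainable.
    if name.startswith(("linear1", "linear2")):
        param.requires_grad = False

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total     = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable}/{total} ({trainable / total:.0%})")
```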

Scaling Laws

As the number of FFN parameters grows, so does the capacity for stored knowledge. This is why models that must hold very large amounts of knowledge need very large FFNs (dff = 4×dmodel or more).
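
A back-of-the-envelope calculation of how the FFN parameter count grows with dff; `ffn_params` is an illustrative helper (biases included, GPT-style d_model assumed):

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one position-wise FFN: W₁ (d_model×d_ff), W₂ (d_ff×d_model), plus biases."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

d_model = 4096
for mult in (4, 8, 16):                       # d_ff = 4×, 8×, 16× d_model
    d_ff = mult * d_model
    print(f"d_ff = {mult}×d_model: {ffn_params(d_model, d_ff) / 1e6:.1f}M FFN parameters per layer")
```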

LoRA & Adapter

LoRA works because it only relearns the "routing" (Attention/Adapter weights), not the base knowledge (FFN). Therefore it only needs ~0.1% new parameters!
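
A minimal sketch of the LoRA idea under this view: the pretrained weight W stays frozen and only a low-rank update B·A is learned. The sizes, rank, and scaling factor `alpha` are typical choices for illustration, not fixed by the method:

```python
import numpy as np

d_model, rank = 4096, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_model, d_model))          # frozen pretrained projection (e.g. an attention weight)
A = rng.normal(size=(rank, d_model)) * 0.01      # trainable low-rank factor
B = np.zeros((d_model, rank))                    # initialized to zero so the update starts as a no-op

def lora_forward(x, alpha: float = 16.0):
    """y = x·Wᵀ + (alpha/rank) · x·Aᵀ·Bᵀ — only A and B are trained."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

new_params = A.size + B.size
print(f"new parameters: {new_params:,} ({new_params / W.size:.2%} of the frozen matrix)")
```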

Zero-Shot Generalization

FFN Memory enables Zero-Shot Learning: The model combines known concepts in new ways, without changing the memory.

Mathematical Perspective
Attention as Router:
attention_scores = Softmax(Q·Kᵀ/√dₖ)
output = attention_scores · V

FFN as Memory:
memory(x) = ReLU(x·W₁ + b₁)·W₂ + b₂

Together (residual connections, applied sequentially):
h = x + Attention(x)
y = h + FFN(h)
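
Putting the pieces together, a single-head transformer block in NumPy (layer norm omitted, random toy weights): attention routing followed by the FFN memory read, each wrapped in a residual connection. This is a sketch of the structure above, not a faithful production block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """One simplified single-head block: route with attention, then read the FFN memory,
    each behind a residual connection. LayerNorm is omitted to keep the sketch short."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # routing weights
    h = x + weights @ V @ Wo                              # residual 1: h = x + Attention(x)
    return h + np.maximum(0.0, h @ W1 + b1) @ W2 + b2     # residual 2: y = h + FFN(h)

d_model, d_ff, seq = 8, 32, 5
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
W1, W2 = rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

y = transformer_block(rng.normal(size=(seq, d_model)), Wq, Wk, Wv, Wo, W1, b1, W2, b2)
print(y.shape)   # (5, 8)
```
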
Implications for Model Design

For Dense Models

dff Size: The larger dff, the more knowledge can be stored. Standard: 4×dmodel (GPT, BERT)
Parameter Budget: Under a fixed parameter budget, a larger FFN stores more knowledge but leaves fewer parameters for attention heads (see the budget sketch below)
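
A purely illustrative budget exercise with a hypothetical per-layer parameter budget and a made-up `heads_under_budget` helper: once the FFN takes its share, fewer attention heads fit.

```python
def heads_under_budget(budget: int, d_model: int, d_ff: int, head_dim: int = 64) -> int:
    """How many attention heads fit once the FFN has taken its share of a fixed per-layer budget?
    Each head needs Q/K/V/output projections of size d_model × head_dim → 4·d_model·head_dim params."""
    ffn = 2 * d_model * d_ff                      # up- and down-projection weights
    remaining = budget - ffn
    return max(0, remaining // (4 * d_model * head_dim))

d_model, budget = 4096, 300_000_000               # hypothetical per-layer parameter budget
for mult in (2, 4, 8):
    d_ff = mult * d_model
    print(f"d_ff = {mult}×d_model → room for {heads_under_budget(budget, d_model, d_ff)} attention heads")
```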

For MoE Models

Expert Specialization: Different experts store different knowledge domains
Router Training: The router must learn which expert is relevant for which token (see the sketch below)
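
A minimal sketch of such a router: a linear layer scores the experts for each token and the top-k experts are kept with softmax mixing weights (random toy inputs; `route` is an illustrative helper, not a specific library API):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route(tokens, W_router, k=2):
    """Score every expert per token and keep the top-k (a common MoE routing scheme)."""
    logits = tokens @ W_router                          # (num_tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]          # indices of the k best experts per token
    weights = softmax(np.take_along_axis(logits, topk, axis=-1))   # mixing weights for the chosen experts
    return topk, weights

d_model, num_experts, num_tokens = 16, 8, 4
rng = np.random.default_rng(0)
experts, weights = route(rng.normal(size=(num_tokens, d_model)),
                         rng.normal(size=(d_model, num_experts)))
for t in range(num_tokens):
    print(f"token {t}: experts {experts[t].tolist()}, weights {weights[t].round(2).tolist()}")
```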