The Hypothesis

According to recent research, Feedforward Networks (FFNs) play a memory-like role in LLMs: they store factual knowledge and associations, while the Attention layers determine which information is relevant and how it is combined. This would explain why FFNs hold most of the parameters (~2/3 of a standard Transformer block) and why MoE models develop experts that specialize in different knowledge domains.

Architecture Roles in the Transformer

Attention Layer (Router)

Function: Calculates which information is relevant and how it should be combined
Operation: Softmax(Q·Kᵀ/√dₖ) → weighted average of the values V (see the sketch below)
Parameters: ~1/3 of Transformer parameters
Metaphor: "Routing Network" - determines which neurons should be active

Feedforward Layer (Memory)

Function: Stores factual knowledge, vocabulary, concept associations
Operation: Position-wise: x → ReLU(x·W₁ + b₁)·W₂ + b₂ (see the sketch below)
Parameters: ~2/3 of Transformer parameters
Metaphor: "Long-term Memory" - stores learned associations and facts

Practical Example: "Paris is the capital of France"

Input: "Paris" (during training on "Paris is the capital of France")
Attention: routes the token to relevant contexts (countries, geography, capitals)
FFN: learns the association Paris_embedding → {country: France, type: capital}
Input: "What is the capital of France?" (at inference)
Attention: routes the "France" token to the stored "Paris" knowledge
FFN: uses the learned association to generate "Paris"
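
To make the memory reading concrete, here is a deliberately hand-built toy: one FFN hidden unit whose input weights act as a key (a "Paris" pattern) and whose output weights act as the stored value (a "France" pattern). The embeddings and weights are invented for illustration; real models learn distributed versions of this.

```python
import numpy as np

# Toy key-value memory view of a single FFN slot (illustrative only; real models learn these).
d_model = 4
paris  = np.array([1.0, 0.0, 0.0, 0.0])    # hypothetical embedding of "Paris"
france = np.array([0.0, 1.0, 0.0, 0.0])    # hypothetical embedding of "France"

# One FFN "memory slot": key ≈ the Paris pattern, value ≈ the France pattern.
W1 = paris.reshape(d_model, 1)              # key matrix  (d_model × d_ff), here d_ff = 1
W2 = france.reshape(1, d_model)             # value matrix (d_ff × d_model)

def ffn(x):
    activation = np.maximum(0.0, x @ W1)    # how strongly the input matches the stored key
    return activation @ W2                  # write the associated value into the output

print(ffn(paris))    # ≈ France direction: the memory slot fires
print(ffn(france))   # ≈ zero vector: no matching key, nothing is recalled
```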

Evidence from Research

| Observation | Memory-Hypothesis Interpretation | Implications |
|---|---|---|
| ~2/3 of parameters are in the FFN | FFNs store most of the knowledge | Knowledge compression per parameter matters |
| FFN is position-wise | Each position can access its memory independently | Parallelization possible |
| MoE has specialized experts | Different experts store different domains | Router must select the correct domain |
| Neurons act as concepts | Individual FFN neurons encode concepts | Interpretability possible |
| Adapter modules work | A small number of parameters can "inject" knowledge | Efficient fine-tuning possible (LoRA) |

Knowledge Neurons

Research suggests that individual neurons in the FFN encode specific concepts. Example: one neuron activates for all surface forms of "Paris", another for country names in general.
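
A hedged sketch of how one might probe for such a neuron: run several surface forms of one concept through the FFN's first projection and look for the hidden unit that fires for all of them. Embeddings and weights here are random stand-ins, not real model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1 = rng.normal(size=(d_model, d_ff))               # first FFN projection (random stand-in)

# Hypothetical embeddings for surface forms of one concept ("Paris", "paris", "PARIS", ...):
# a shared base vector plus small per-variant noise.
concept_variants = rng.normal(size=(1, d_model)) + 0.05 * rng.normal(size=(4, d_model))

hidden = np.maximum(0.0, concept_variants @ W1)     # FFN hidden activations per variant
mean_act = hidden.mean(axis=0)                      # average activation of each hidden unit
candidate = int(mean_act.argmax())                  # the unit that fires most consistently
print(f"candidate 'knowledge neuron': unit {candidate}, mean activation {mean_act[candidate]:.2f}")
```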

Capacity Factor

In MoE models, each expert needs enough capacity (the number of tokens it may accept per batch) so the router can distribute the load. Too little capacity means tokens are dropped (information loss); too much wastes compute and memory.
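
A small sketch of the usual capacity calculation, following the common convention of Switch-Transformer-style MoE layers; `expert_capacity` is an illustrative helper, not a library function:

```python
import math

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Common MoE convention: how many tokens each expert may receive per batch."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

num_tokens, num_experts = 4096, 8
for cf in (0.5, 1.0, 1.25, 2.0):
    cap = expert_capacity(num_tokens, num_experts, cf)
    # With a perfectly balanced router each expert receives num_tokens / num_experts = 512 tokens.
    dropped = max(0, num_tokens // num_experts - cap)
    print(f"capacity_factor={cf:4}: capacity={cap:4} tokens/expert, dropped if balanced={dropped}")
```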

Training Dynamics

During Pretraining, FFN stores knowledge. During Fine-Tuning (RLHF), primarily the routing (Attention) is relearned. FFN remains relatively stable.
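
If this picture holds, one natural fine-tuning recipe is to freeze the FFN ("memory") and train only the attention ("routing"). A sketch using PyTorch's `nn.TransformerEncoderLayer`, where `linear1`/`linear2` form the position-wise FFN; the freezing rule is an assumption of this hypothesis, not a standard recipe:

```python
import torch.nn as nn

# One standard encoder block: self_attn is the attention sub-layer,
# linear1/linear2 form the position-wise FFN.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

for name, param in layer.named_parameters():
    # Freeze the FFN ("memory"); keep attention and the layer norms ("routing") trainable.
    if name.startswith(("linear1", "linear2")):
        param.requires_grad = False

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total     = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable}/{total} ({trainable / total:.0%})")
```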

Scaling Laws

As the number of FFN parameters grows, so does the capacity for stored knowledge. This is why models that must hold very large amounts of knowledge need very large FFNs (dff = 4×dmodel or more).
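
A back-of-the-envelope calculation of how the FFN parameter count grows with dff; `ffn_params` is an illustrative helper (biases included, GPT-style d_model assumed):

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one position-wise FFN: W₁ (d_model×d_ff), W₂ (d_ff×d_model), plus biases."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

d_model = 4096
for mult in (4, 8, 16):                       # d_ff = 4×, 8×, 16× d_model
    d_ff = mult * d_model
    print(f"d_ff = {mult}×d_model: {ffn_params(d_model, d_ff) / 1e6:.1f}M FFN parameters per layer")
```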

LoRA & Adapter

LoRA works because it only relearns the "routing" (Attention/Adapter weights), not the base knowledge (FFN). Therefore it only needs ~0.1% new parameters!
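
A minimal sketch of the LoRA idea under this view: the pretrained weight W stays frozen and only a low-rank update B·A is learned. The sizes, rank, and scaling factor `alpha` are typical choices for illustration, not fixed by the method:

```python
import numpy as np

d_model, rank = 4096, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_model, d_model))          # frozen pretrained projection (e.g. an attention weight)
A = rng.normal(size=(rank, d_model)) * 0.01      # trainable low-rank factor
B = np.zeros((d_model, rank))                    # initialized to zero so the update starts as a no-op

def lora_forward(x, alpha: float = 16.0):
    """y = x·Wᵀ + (alpha/rank) · x·Aᵀ·Bᵀ — only A and B are trained."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

new_params = A.size + B.size
print(f"new parameters: {new_params:,} ({new_params / W.size:.2%} of the frozen matrix)")
```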

Zero-Shot Generalization

FFN Memory enables Zero-Shot Learning: The model combines known concepts in new ways, without changing the memory.

Mathematical Perspective
Attention as Router:
attention_scores = Softmax(Q·Kᵀ/√dₖ)
output = attention_scores · V

FFN as Memory:
memory(x) = ReLU(x·W₁ + b₁)·W₂ + b₂

Together (residual connections, applied sequentially):
h = x + Attention(x)
y = h + FFN(h)
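
Putting the pieces together, a single-head transformer block in NumPy (layer norm omitted, random toy weights): attention routing followed by the FFN memory read, each wrapped in a residual connection. This is a sketch of the structure above, not a faithful production block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, b1, W2, b2):
    """One simplified single-head block: route with attention, then read the FFN memory,
    each behind a residual connection. LayerNorm is omitted to keep the sketch short."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # routing weights
    h = x + weights @ V @ Wo                              # residual 1: h = x + Attention(x)
    return h + np.maximum(0.0, h @ W1 + b1) @ W2 + b2     # residual 2: y = h + FFN(h)

d_model, d_ff, seq = 8, 32, 5
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
W1, W2 = rng.normal(size=(d_model, d_ff)) * 0.1, rng.normal(size=(d_ff, d_model)) * 0.1
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

y = transformer_block(rng.normal(size=(seq, d_model)), Wq, Wk, Wv, Wo, W1, b1, W2, b2)
print(y.shape)   # (5, 8)
```
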
Implications for Model Design

For Dense Models

dff Size: The larger dff, the more knowledge can be stored. Standard: 4×dmodel (GPT, BERT)
Parameter Budget: Under a fixed parameter budget, a larger FFN stores more knowledge but leaves fewer parameters for attention heads (see the budget sketch below)
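
A purely illustrative budget exercise with a hypothetical per-layer parameter budget and a made-up `heads_under_budget` helper: once the FFN takes its share, fewer attention heads fit.

```python
def heads_under_budget(budget: int, d_model: int, d_ff: int, head_dim: int = 64) -> int:
    """How many attention heads fit once the FFN has taken its share of a fixed per-layer budget?
    Each head needs Q/K/V/output projections of size d_model × head_dim → 4·d_model·head_dim params."""
    ffn = 2 * d_model * d_ff                      # up- and down-projection weights
    remaining = budget - ffn
    return max(0, remaining // (4 * d_model * head_dim))

d_model, budget = 4096, 300_000_000               # hypothetical per-layer parameter budget
for mult in (2, 4, 8):
    d_ff = mult * d_model
    print(f"d_ff = {mult}×d_model → room for {heads_under_budget(budget, d_model, d_ff)} attention heads")
```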

For MoE Models

Expert Specialization: Different experts store different knowledge domains
Router Training: The router must learn which expert is relevant for which token (see the sketch below)
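
A minimal sketch of such a router: a linear layer scores the experts for each token and the top-k experts are kept with softmax mixing weights (random toy inputs; `route` is an illustrative helper, not a specific library API):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route(tokens, W_router, k=2):
    """Score every expert per token and keep the top-k (a common MoE routing scheme)."""
    logits = tokens @ W_router                          # (num_tokens, num_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]          # indices of the k best experts per token
    weights = softmax(np.take_along_axis(logits, topk, axis=-1))   # mixing weights for the chosen experts
    return topk, weights

d_model, num_experts, num_tokens = 16, 8, 4
rng = np.random.default_rng(0)
experts, weights = route(rng.normal(size=(num_tokens, d_model)),
                         rng.normal(size=(d_model, num_experts)))
for t in range(num_tokens):
    print(f"token {t}: experts {experts[t].tolist()}, weights {weights[t].round(2).tolist()}")
```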