Decoder-Only Transformer Block (Pre-LN)

[Diagram] Data flow through one block; every activation keeps the shape [B, N, d_model]:

Input embeddings [B, N, d_model]
→ RMSNorm (Pre-LN)
→ Multi-Head Attention (causal)
→ Add (residual): x + Attention(x)
→ RMSNorm (Pre-LN)
→ SwiGLU FFN: [B, N, d_model] → [B, N, d_ff] → [B, N, d_model]
→ Add (residual): x + FFN(x)
→ Block output [B, N, d_model] → next layer
Step 1 of 8

The Input: Token Embeddings

Each Transformer block receives a sequence of token representations as input. These consist of the sum of Token Embeddings (what is the word?) and Position Encodings (where is it?).

B = batch size    N = sequence length    d_model = model dimension (e.g. 4,096)
Typical Values
Llama 3 70B: d_model = 8,192, N = up to 128K tokens
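
To make the shapes concrete, here is a minimal PyTorch sketch of the block input, assuming learned token embeddings plus learned absolute position embeddings (all sizes are illustrative; Llama-style models instead inject position information via RoPE inside the attention layer):

```python
import torch
import torch.nn as nn

B, N, d_model, vocab_size = 4, 128, 4096, 32_000   # illustrative values

tok_emb = nn.Embedding(vocab_size, d_model)   # "what is the word?"
pos_emb = nn.Embedding(N, d_model)            # "where is it?" (assumed learned absolute positions)

token_ids = torch.randint(0, vocab_size, (B, N))        # [B, N]
positions = torch.arange(N).unsqueeze(0).expand(B, N)   # [B, N]

x = tok_emb(token_ids) + pos_emb(positions)             # [B, N, d_model]
print(x.shape)                                          # torch.Size([4, 128, 4096])
```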
Step 2 of 8

RMSNorm: Stabilization

Before the attention layer, the input is normalized through RMSNorm (Root Mean Square Normalization). This is a simplified version of LayerNorm without mean shifting.

RMSNorm(x) = x / √(mean(x²) + ε) · γ

Modern models use Pre-Layer Normalization: the norm comes before each transformation, not after. This enables more stable training and greatly reduces the need for a learning-rate warmup phase.
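
A minimal sketch of the formula above as a PyTorch module (the class name RMSNorm and the eps value are my own choices for illustration):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learned gain γ

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # root mean square over the feature dimension – no mean subtraction, unlike LayerNorm
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.gamma

y = RMSNorm(4096)(torch.randn(4, 128, 4096))   # [B, N, d_model] → [B, N, d_model]
```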

Step 3 of 8

Multi-Head Attention

The heart of the block: Multi-Head Self-Attention allows each token to gather information from all previous tokens (including itself).

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Multiple attention heads (e.g., h=64) work in parallel, each with its own Q/K/V projections. Causal Masking ensures that tokens can only attend to past positions.

GQA in Llama 2 70B
64 Query heads, but only 8 KV heads → 8× less KV-Cache memory
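
The following sketch shows plain causal multi-head self-attention (standard MHA, not GQA) to keep it readable; module and parameter names are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        # project and split into heads: [B, n_heads, N, d_head]
        q = self.wq(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        # scores = QKᵀ / √d_k, shape [B, n_heads, N, N]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # causal mask: position i may only attend to positions ≤ i
        mask = torch.triu(torch.ones(N, N, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        out = F.softmax(scores, dim=-1) @ v              # [B, n_heads, N, d_head]
        out = out.transpose(1, 2).reshape(B, N, -1)      # merge heads back to d_model
        return self.wo(out)                              # [B, N, d_model]
```

GQA would differ only in the K/V projections: fewer KV heads, each shared by a group of query heads.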
Step 4 of 8

Residual Connection #1

The Residual Connection (Skip Connection) adds the original input to the attention output. This "highway" connection is crucial for training very deep networks.

output = x + Attention(Norm(x))

Without residuals, gradients in a 96-layer model would practically vanish. Skip connections enable direct gradient flow from output to input.
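
As a sketch, the whole sublayer is just one line – the skip path carries x through unchanged, so gradients can flow around the attention transform (norm and attn stand for the RMSNorm and CausalSelfAttention modules sketched above):

```python
def attention_sublayer(x, norm, attn):
    # Pre-LN residual: output = x + Attention(Norm(x))
    # the identity term "x" gives gradients a direct path to earlier layers
    return x + attn(norm(x))
```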

Step 5 of 8

Second Normalization

Before the feedforward network, another RMSNorm layer follows. The pattern repeats: Norm → Transform → Add.

This consistent structure makes the Transformer modular and scalable – more layers simply means more identical blocks.

Step 6 of 8

SwiGLU Feedforward

The Feedforward Network (FFN) processes each position independently. Modern models use SwiGLU – a gating variant with three weight matrices instead of two.

SwiGLU(x) = (Swish(x W_gate) ⊗ x V) W_down

The hidden dimension d_ff is typically about 2.67× d_model (8/3×, rather than the 4× of the original Transformer FFN), so that the three-matrix SwiGLU keeps roughly the same parameter count as the original two-matrix design.

Why SwiGLU?
The gating mechanism (⊗) enables selective activation – the network can "turn off" parts of the Hidden State.
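
A minimal sketch of the SwiGLU FFN with its three weight matrices, naming them W_gate / V / W_down as in the formula (the rounding of d_ff is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # W_gate
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)   # V in the formula
        self.w_down = nn.Linear(d_ff, d_model, bias=False)   # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x·W_gate) ⊗ (x·V), then project back down to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d_model = 4096
d_ff = int(8 / 3 * d_model)    # ≈ 2.67 × d_model (real models round to a hardware-friendly multiple)
y = SwiGLUFFN(d_model, d_ff)(torch.randn(2, 16, d_model))   # [B, N, d_model]
```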
Step 7 of 8

Residual Connection #2

A residual connection also wraps around the FFN. The final block output combines all information streams:

h = x + Attention(Norm₁(x))
y = h + FFN(Norm₂(h))

Through additive residuals, information can be both passed through unchanged and transformed – the model learns deltas to the input.
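
Putting the pieces together, a Pre-LN block sketch that reuses the RMSNorm, CausalSelfAttention, and SwiGLUFFN modules from the earlier steps (names come from those sketches, not from any particular library):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn  = CausalSelfAttention(d_model, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn   = SwiGLUFFN(d_model, d_ff)

    def forward(self, x):
        h = x + self.attn(self.norm1(x))   # residual #1 around attention
        y = h + self.ffn(self.norm2(h))    # residual #2 around the FFN
        return y                           # [B, N, d_model], same shape as x
```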

Step 8 of 8

Block Output

The output has exactly the same dimensions as the input: [B, N, d_model]. It becomes the input for the next block – or, in the last layer, the final projection onto the vocabulary.

Model          Layers   d_model   Heads
GPT-3 175B     96       12,288    96
Llama 3 70B    80       8,192     64
Mistral 7B     32       4,096     32
Overall Structure
Input → [Block × L layers] → Final Norm → Linear → Softmax → Token Probabilities
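
As a final sketch, the overall structure above expressed with the modules from the previous steps (learned absolute position embeddings are again an assumption; all sizes and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, d_ff, n_layers, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):                        # [B, N]
        N = token_ids.shape[1]
        pos = torch.arange(N, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)  # [B, N, d_model]
        for block in self.blocks:                        # [Block × L layers]
            x = block(x)
        x = self.final_norm(x)                           # Final Norm
        logits = self.lm_head(x)                         # Linear: [B, N, vocab_size]
        return F.softmax(logits, dim=-1)                 # token probabilities
```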