Decoder-Only Transformer Block (Pre-LN)

[Diagram] Data flow through one block; every activation keeps the shape [B, N, d_model]:

Input embeddings [B, N, d_model]
→ RMSNorm (Pre-LN)
→ Multi-Head Attention (causal)
→ Add (residual): x + Attention(x)
→ RMSNorm (Pre-LN)
→ SwiGLU FFN: [B, N, d_model] → [B, N, d_ff] → [B, N, d_model]
→ Add (residual): x + FFN(x)
→ Block output [B, N, d_model] → next layer
Step 1 of 8

The Input: Token Embeddings

Each Transformer block receives a sequence of token representations as input. These consist of the sum of Token Embeddings (what is the word?) and Position Encodings (where is it?).

B = batch size    N = sequence length    d_model = model dimension (e.g. 4,096)
Typical Values
Llama 3 70B: d_model = 8,192, N = up to 128K tokens
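
To make the shapes concrete, here is a minimal PyTorch sketch of the block input, assuming learned token embeddings plus learned absolute position embeddings (all sizes are illustrative; Llama-style models instead inject position information via RoPE inside the attention layer):

```python
import torch
import torch.nn as nn

B, N, d_model, vocab_size = 4, 128, 4096, 32_000   # illustrative values

tok_emb = nn.Embedding(vocab_size, d_model)   # "what is the word?"
pos_emb = nn.Embedding(N, d_model)            # "where is it?" (assumed learned absolute positions)

token_ids = torch.randint(0, vocab_size, (B, N))        # [B, N]
positions = torch.arange(N).unsqueeze(0).expand(B, N)   # [B, N]

x = tok_emb(token_ids) + pos_emb(positions)             # [B, N, d_model]
print(x.shape)                                          # torch.Size([4, 128, 4096])
```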
Step 2 of 8

RMSNorm: Stabilization

Before the attention layer, the input is normalized through RMSNorm (Root Mean Square Normalization). This is a simplified version of LayerNorm without mean shifting.

RMSNorm(x) = x / √(mean(x²) + ε) · γ

Modern models use Pre-Layer Normalization: the norm comes before each transformation, not after. This enables more stable training and greatly reduces the need for a learning-rate warmup phase.
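
A minimal sketch of the formula above as a PyTorch module (the class name RMSNorm and the eps value are my own choices for illustration):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learned gain γ

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # root mean square over the feature dimension – no mean subtraction, unlike LayerNorm
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.gamma

y = RMSNorm(4096)(torch.randn(4, 128, 4096))   # [B, N, d_model] → [B, N, d_model]
```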

Step 3 of 8

Multi-Head Attention

The heart of the block: Multi-Head Self-Attention allows each token to gather information from all previous tokens (including itself).

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Multiple attention heads (e.g., h=64) work in parallel, each with its own Q/K/V projections. Causal Masking ensures that tokens can only attend to past positions.

GQA in Llama 2 70B
64 Query heads, but only 8 KV heads → 8× less KV-Cache memory
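
The following sketch shows plain causal multi-head self-attention (standard MHA, not GQA) to keep it readable; module and parameter names are illustrative:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        # project and split into heads: [B, n_heads, N, d_head]
        q = self.wq(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        # scores = QKᵀ / √d_k, shape [B, n_heads, N, N]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # causal mask: position i may only attend to positions ≤ i
        mask = torch.triu(torch.ones(N, N, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        out = F.softmax(scores, dim=-1) @ v              # [B, n_heads, N, d_head]
        out = out.transpose(1, 2).reshape(B, N, -1)      # merge heads back to d_model
        return self.wo(out)                              # [B, N, d_model]
```

GQA would differ only in the K/V projections: fewer KV heads, each shared by a group of query heads.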
Step 4 of 8

Residual Connection #1

The Residual Connection (Skip Connection) adds the original input to the attention output. This "highway" connection is crucial for training very deep networks.

output = x + Attention(Norm(x))

Without residuals, gradients in a 96-layer model would practically vanish. Skip connections enable direct gradient flow from output to input.
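
As a sketch, the whole sublayer is just one line – the skip path carries x through unchanged, so gradients can flow around the attention transform (norm and attn stand for the RMSNorm and CausalSelfAttention modules sketched above):

```python
def attention_sublayer(x, norm, attn):
    # Pre-LN residual: output = x + Attention(Norm(x))
    # the identity term "x" gives gradients a direct path to earlier layers
    return x + attn(norm(x))
```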

Step 5 of 8

Second Normalization

Before the feedforward network, another RMSNorm layer follows. The pattern repeats: Norm → Transform → Add.

This consistent structure makes the Transformer modular and scalable – more layers simply means more identical blocks.

Step 6 of 8

SwiGLU Feedforward

The Feedforward Network (FFN) processes each position independently. Modern models use SwiGLU – a gating variant with three weight matrices instead of two.

SwiGLU(x) = (Swish(x W_gate) ⊗ x V) W_down

The hidden dimension d_ff is typically about 2.67× d_model (8/3×, rather than the 4× of the original Transformer FFN), so that the three-matrix SwiGLU keeps roughly the same parameter count as the original two-matrix design.

Why SwiGLU?
The gating mechanism (⊗) enables selective activation – the network can "turn off" parts of the Hidden State.
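
A minimal sketch of the SwiGLU FFN with its three weight matrices, naming them W_gate / V / W_down as in the formula (the rounding of d_ff is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # W_gate
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)   # V in the formula
        self.w_down = nn.Linear(d_ff, d_model, bias=False)   # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x·W_gate) ⊗ (x·V), then project back down to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d_model = 4096
d_ff = int(8 / 3 * d_model)    # ≈ 2.67 × d_model (real models round to a hardware-friendly multiple)
y = SwiGLUFFN(d_model, d_ff)(torch.randn(2, 16, d_model))   # [B, N, d_model]
```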
Step 7 of 8

Residual Connection #2

A residual connection also wraps around the FFN. The final block output combines all information streams:

h = x + Attention(Norm₁(x))
y = h + FFN(Norm₂(h))

Through additive residuals, information can be both passed through unchanged and transformed – the model learns deltas to the input.
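
Putting the pieces together, a Pre-LN block sketch that reuses the RMSNorm, CausalSelfAttention, and SwiGLUFFN modules from the earlier steps (names come from those sketches, not from any particular library):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn  = CausalSelfAttention(d_model, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn   = SwiGLUFFN(d_model, d_ff)

    def forward(self, x):
        h = x + self.attn(self.norm1(x))   # residual #1 around attention
        y = h + self.ffn(self.norm2(h))    # residual #2 around the FFN
        return y                           # [B, N, d_model], same shape as x
```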

Step 8 of 8

Block Output

The output has exactly the same dimensions as the input: [B, N, d_model]. It becomes the input for the next block – or, in the last layer, the final projection onto the vocabulary.

Model          Layers   d_model   Heads
GPT-3 175B     96       12,288    96
Llama 3 70B    80       8,192     64
Mistral 7B     32       4,096     32
Overall Structure
Input → [Block × L layers] → Final Norm → Linear → Softmax → Token Probabilities
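
As a final sketch, the overall structure above expressed with the modules from the previous steps (learned absolute position embeddings are again an assumption; all sizes and names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, d_ff, n_layers, max_seq_len):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids):                        # [B, N]
        N = token_ids.shape[1]
        pos = torch.arange(N, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)  # [B, N, d_model]
        for block in self.blocks:                        # [Block × L layers]
            x = block(x)
        x = self.final_norm(x)                           # Final Norm
        logits = self.lm_head(x)                         # Linear: [B, N, vocab_size]
        return F.softmax(logits, dim=-1)                 # token probabilities
```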