Scroll through the individual components of a decoder-only Transformer block. Each layer transforms the input, from the token embeddings to the final representation.
The Transformer block is the fundamental building block of modern LLMs. It combines all previous concepts: attention for token interactions, the feedforward network for knowledge storage, and residual connections with normalization for stable training.
This is the synthesis of all previous concepts: Tokenization (1), Embeddings (2), Position Encoding (3), Self-Attention (4), Multi-Head (5), FFN (6), and Residual/LayerNorm (7) are combined here into a functioning block. With this knowledge you can fully understand modern LLM architectures.
LLMs like GPT-4 or Llama 3 stack 32 to 128 identical Transformer blocks. Each block deepens understanding: early blocks capture local patterns, later blocks abstract to complex concepts. The Pre-Layer-Norm architecture (normalization before each sublayer) has become standard because it yields more stable gradients than the original Post-Layer-Norm.
Each Transformer block receives a sequence of token representations as input. Each representation is the sum of a token embedding (what is the word?) and a position encoding (where is it?).
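A rough sketch of how this input could be built in PyTorch. The dimensions are toy values, and the learned absolute position embeddings are an assumption for simplicity; many current models (e.g., Llama) instead apply rotary embeddings inside the attention layer.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not taken from any specific model
vocab_size, max_len, d_model = 32_000, 2_048, 512

tok_emb = nn.Embedding(vocab_size, d_model)   # what is the word?
pos_emb = nn.Embedding(max_len, d_model)      # where is it? (learned positions, for illustration)

token_ids = torch.randint(0, vocab_size, (1, 10))          # [B=1, N=10]
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [1, N]

x = tok_emb(token_ids) + pos_emb(positions)   # [B, N, d_model] -- input to the first block
print(x.shape)                                # torch.Size([1, 10, 512])
```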
Before the attention layer, the input is normalized with RMSNorm (Root Mean Square Normalization), a simplified variant of LayerNorm that skips the mean subtraction.
Modern models use Pre-Layer Normalization: the norm comes before each transformation, not after. This enables more stable training, often without a learning-rate warmup phase.
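A minimal RMSNorm sketch in PyTorch. The epsilon value and the learned per-feature gain follow common conventions, not the exact configuration of any particular model.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale by the root-mean-square of the features, no mean subtraction."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, d_model]; normalize over the feature dimension only
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

norm = RMSNorm(512)
print(norm(torch.randn(1, 10, 512)).shape)  # torch.Size([1, 10, 512])
```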
The heart of the block: multi-head self-attention lets each token gather information from all preceding tokens (and itself).
Multiple attention heads (e.g., h = 64) work in parallel, each with its own Q/K/V projections. Causal masking ensures that each token can only attend to its own and earlier positions.
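A compact sketch of causal multi-head self-attention in PyTorch. The fused Q/K/V projection and the explicit triangular mask are illustrative choices; production implementations typically rely on optimized kernels instead.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention with a causal mask (sketch)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q/K/V projection
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to [B, n_heads, N, d_head] so every head attends independently
        q, k, v = (t.view(B, N, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)        # [B, h, N, N]
        causal = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))              # block future positions
        attn = F.softmax(scores, dim=-1)

        y = attn @ v                                                     # [B, h, N, d_head]
        y = y.transpose(1, 2).reshape(B, N, d_model)                     # concatenate the heads
        return self.out(y)
```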
The Residual Connection (Skip Connection) adds the original input to the attention output. This "highway" connection is crucial for training very deep networks.
Without residuals, gradients in a 96-layer model would practically vanish. Skip connections enable direct gradient flow from output to input.
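The whole sublayer pattern fits in a few lines; `norm` and `sublayer` here stand for the normalization and attention/FFN modules sketched in this section.

```python
# Pre-LN sublayer pattern: the input x bypasses the transformation untouched,
# so gradients flow straight through the addition back to earlier layers.
def apply_sublayer(x, norm, sublayer):
    return x + sublayer(norm(x))   # Norm -> Transform -> Add
```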
Before the feedforward network, another RMSNorm layer follows. The pattern repeats: Norm → Transform → Add.
This consistent structure makes the Transformer modular and scalable: more layers simply means more identical blocks.
The Feedforward Network (FFN) processes each position independently. Modern models use SwiGLU, a gated variant with three weight matrices instead of two.
The hidden dimension d_ff is typically about 2.67× d_model (8/3 instead of the original 4×), so the three-matrix SwiGLU ends up with roughly the same parameter count as the classic two-matrix FFN.
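A SwiGLU feedforward sketch in PyTorch. The 8/3 scaling and the bias-free linear layers follow a common Llama-style convention and are assumptions here, not a specific model's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Position-wise FFN with SwiGLU gating: three weight matrices instead of two."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w_up   = nn.Linear(d_model, d_ff, bias=False)  # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu(gate) * up: the gate decides how much of each hidden feature passes through
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d_model = 4096
d_ff = int(8 / 3 * d_model)   # ~2.67x d_model, keeps parameters close to a 4x two-matrix FFN
ffn = SwiGLUFFN(d_model, d_ff)
print(ffn(torch.randn(1, 10, d_model)).shape)  # torch.Size([1, 10, 4096])
```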
A residual connection also wraps around the FFN. The final block output combines all information streams: output = h + FFN(RMSNorm(h)), where h = x + Attention(RMSNorm(x)).
Through additive residuals, information can be both passed through unchanged and transformed – the model learns deltas to the input.
The output has exactly the same dimensions as the input: [B, N, d_model]. It becomes the input for the next block, or, after the last block, the input to the final projection onto the vocabulary.
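Putting it all together, reusing the RMSNorm, CausalSelfAttention, and SwiGLUFFN sketches from above. The dimensions are toy values for illustration, not a real model configuration.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-LN decoder block: x -> + Attention(Norm(x)) -> + FFN(Norm(.)) (sketch)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)                       # sketched above
        self.attn  = CausalSelfAttention(d_model, n_heads)  # sketched above
        self.norm2 = RMSNorm(d_model)
        self.ffn   = SwiGLUFFN(d_model, d_ff)               # sketched above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))   # attention sublayer with residual
        x = x + self.ffn(self.norm2(x))    # FFN sublayer with residual
        return x                           # same shape as the input: [B, N, d_model]

# Stacking identical blocks is all the backbone of an LLM does (toy sizes):
blocks = nn.ModuleList([TransformerBlock(512, 8, 1365) for _ in range(4)])
x = torch.randn(2, 16, 512)                # [B, N, d_model]
for block in blocks:
    x = block(x)
print(x.shape)                             # torch.Size([2, 16, 512])
```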
| Model | Layers | d_model | Heads |
|---|---|---|---|
| GPT-3 175B | 96 | 12,288 | 96 |
| Llama 3 70B | 80 | 8,192 | 64 |
| Mistral 7B | 32 | 4,096 | 32 |