[Figure: running example with a sequence of 6 tokens]
Step 1: Full Attention Matrix (without mask)
First, the attention matrix QKᵀ is computed. Without a mask, each token could attend to all others, including future positions.
Causal Masking in 3 Steps:

1. Compute attention scores: S = QKᵀ / √dₖ
2. Add mask: S_masked = S + M (where M[i,j] = -∞ for j > i, and 0 otherwise)
3. Softmax: A = softmax(S_masked)

Result: A[i,j] = 0 for all j > i (future positions)
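
The three steps can be sketched minimally in NumPy; the 6-token sequence and the head dimension dₖ = 8 are illustrative choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k = 6, 8                      # 6 tokens, illustrative head dimension
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

# Step 1: scaled attention scores S = QK^T / sqrt(d_k)
S = Q @ K.T / np.sqrt(d_k)

# Step 2: additive mask M with -inf above the diagonal (j > i), 0 elsewhere
M = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, 0.0)
S_masked = S + M

# Step 3: row-wise softmax; exp(-inf) = 0, so future positions get weight 0
S_shift = S_masked - S_masked.max(axis=-1, keepdims=True)
A = np.exp(S_shift) / np.exp(S_shift).sum(axis=-1, keepdims=True)

assert np.allclose(A[np.triu_indices(T, k=1)], 0.0)   # A[i, j] = 0 for j > i
assert np.allclose(A.sum(axis=-1), 1.0)               # each row sums to 1
```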
Why Causal Mask?
In autoregressive generation, the prediction of token t may only depend on tokens 0..t-1; equivalently, position i may attend only to positions j ≤ i. This prevents "information leakage" from the future during training.
Implementation
The mask is an upper triangular matrix with -∞ above the diagonal (j > i) and 0 on and below it. After it is added to the scores, exp(-∞) = 0, so the softmax gives these positions exactly 0, not just a very small value.
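
A common way to build and apply such a mask looks roughly like the following PyTorch sketch; it is one possible implementation under illustrative shapes and names, not the code of any particular model:

```python
import math
import torch

T, d_k = 6, 8
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)

scores = Q @ K.T / math.sqrt(d_k)

# Boolean mask: True above the diagonal (future positions j > i)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Fill future positions with -inf, then softmax; exp(-inf) = 0 exactly
scores = scores.masked_fill(causal_mask, float("-inf"))
A = torch.softmax(scores, dim=-1)

print(torch.allclose(A.triu(diagonal=1), torch.zeros(T, T)))  # True
```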
Decoder vs. Encoder
Decoder-only models (GPT, Llama) always use causal masking. Encoder-only models (BERT) use bidirectional attention without a mask. Encoder-decoder models (T5) combine both: a bidirectional encoder and a causal decoder.
Training vs. Inference
The mask is identical in training and inference. During training, all tokens are processed in parallel but masked; during inference, tokens are generated one at a time over a growing context.
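
The following NumPy sketch illustrates that equivalence: the output row for position t from the parallel, masked computation matches an incremental computation that only ever sees keys and values up to position t. The values V and the 6-token setup are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k = 6, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Training-style: all positions in parallel, future positions masked out
scores = Q @ K.T / np.sqrt(d_k)
scores[np.triu_indices(T, k=1)] = -np.inf
out_parallel = softmax(scores) @ V

# Inference-style: token by token, attending only to what exists so far
for t in range(T):
    s_t = Q[t] @ K[: t + 1].T / np.sqrt(d_k)   # no mask needed
    out_t = softmax(s_t) @ V[: t + 1]
    assert np.allclose(out_t, out_parallel[t])
```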