[Figure: running example with a sequence of 6 tokens]
Step 1: Full Attention Matrix (without mask)
First, the attention matrix QKᵀ is computed. Without a mask, each token could attend to all others, including future positions.
Causal Masking in 3 Steps:

1. Compute attention scores: S = QKᵀ / √dₖ
2. Add mask: S_masked = S + M (where M[i,j] = -∞ for j > i, and 0 otherwise)
3. Softmax: A = softmax(S_masked)

Result: A[i,j] = 0 for all j > i (future positions)
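
The three steps can be sketched minimally in NumPy; the 6-token sequence and the head dimension dₖ = 8 are illustrative choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k = 6, 8                      # 6 tokens, illustrative head dimension
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))

# Step 1: scaled attention scores S = QK^T / sqrt(d_k)
S = Q @ K.T / np.sqrt(d_k)

# Step 2: additive mask M with -inf above the diagonal (j > i), 0 elsewhere
M = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, 0.0)
S_masked = S + M

# Step 3: row-wise softmax; exp(-inf) = 0, so future positions get weight 0
S_shift = S_masked - S_masked.max(axis=-1, keepdims=True)
A = np.exp(S_shift) / np.exp(S_shift).sum(axis=-1, keepdims=True)

assert np.allclose(A[np.triu_indices(T, k=1)], 0.0)   # A[i, j] = 0 for j > i
assert np.allclose(A.sum(axis=-1), 1.0)               # each row sums to 1
```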
Why Causal Mask?
In autoregressive generation, the prediction of token t may only depend on tokens 0..t-1; equivalently, position i may attend only to positions j ≤ i. This prevents "information leakage" from the future during training.
Implementation
The mask is an upper triangular matrix with -∞ above the diagonal (j > i) and 0 on and below it. After it is added to the scores, exp(-∞) = 0, so the softmax gives these positions exactly 0, not just a very small value.
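
A common way to build and apply such a mask looks roughly like the following PyTorch sketch; it is one possible implementation under illustrative shapes and names, not the code of any particular model:

```python
import math
import torch

T, d_k = 6, 8
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)

scores = Q @ K.T / math.sqrt(d_k)

# Boolean mask: True above the diagonal (future positions j > i)
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Fill future positions with -inf, then softmax; exp(-inf) = 0 exactly
scores = scores.masked_fill(causal_mask, float("-inf"))
A = torch.softmax(scores, dim=-1)

print(torch.allclose(A.triu(diagonal=1), torch.zeros(T, T)))  # True
```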
Decoder vs. Encoder
Decoder-only models (GPT, Llama) always use causal masking. Encoder-only models (BERT) use bidirectional attention without a mask. Encoder-decoder models (T5) combine both: a bidirectional encoder and a causal decoder.
Training vs. Inference
The mask is identical in training and inference. During training, all tokens are processed in parallel but masked; during inference, tokens are generated one at a time over a growing context.
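
The following NumPy sketch illustrates that equivalence: the output row for position t from the parallel, masked computation matches an incremental computation that only ever sees keys and values up to position t. The values V and the 6-token setup are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k = 6, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Training-style: all positions in parallel, future positions masked out
scores = Q @ K.T / np.sqrt(d_k)
scores[np.triu_indices(T, k=1)] = -np.inf
out_parallel = softmax(scores) @ V

# Inference-style: token by token, attending only to what exists so far
for t in range(T):
    s_t = Q[t] @ K[: t + 1].T / np.sqrt(d_k)   # no mask needed
    out_t = softmax(s_t) @ V[: t + 1]
    assert np.allclose(out_t, out_parallel[t])
```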