Autoregressive generation: How the causal mask ensures tokens can only attend to previous positions, not future ones.
Causal masking is the foundation of autoregressive language models. By masking out future tokens, it guarantees that each position attends only to the tokens before it, mirroring generation, where the model can only see what it has already produced.
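A minimal sketch of what that mask looks like, assuming PyTorch (the variable names here are illustrative, not from the text): a lower-triangular matrix in which entry (i, j) is true exactly when position i is allowed to attend to position j.

```python
import torch

seq_len = 5
# Lower-triangular boolean matrix: token i may attend to tokens 0..i, never beyond.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```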
The technical fundamentals below serve as a reference.
Causal masking enables efficient training with teacher forcing and prevents information leakage from the future: the ground-truth sequence is fed in all at once, and the mask guarantees that the prediction at each position depends only on earlier tokens. Without the mask, every position could simply read its target from the future context, and this parallel training scheme would break down.
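To make the teacher-forcing point concrete, here is a hedged sketch of masked scaled dot-product attention, again assuming PyTorch; the function name `causal_attention` and the shapes are illustrative, not taken from the text. Setting future positions' scores to negative infinity before the softmax means every position's output, and hence every next-token prediction, is computed in one parallel forward pass while depending only on earlier tokens.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
    seq_len = q.size(-2)
    mask = torch.tril(torch.ones(seq_len, seq_len, device=q.device)).bool()
    scores = scores.masked_fill(~mask, float("-inf"))     # block future positions
    weights = F.softmax(scores, dim=-1)                    # each row sums to 1 over the past
    return weights @ v

# During training, the whole target sequence is fed at once (teacher forcing);
# the mask keeps position i from ever seeing tokens at positions > i.
q = k = v = torch.randn(2, 6, 8)
out = causal_attention(q, k, v)
print(out.shape)  # torch.Size([2, 6, 8])
```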