Step 1: Process Context
The Transformer processes all previous tokens: "The", "cat", "sat", "on", "the".
Through self-attention, every token can attend to the other tokens in the context (in a causal decoder, to all preceding positions) and capture dependencies between them.
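To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with a causal mask. The embedding size, the random weight matrices, and the stand-in embeddings are purely illustrative, not parameters of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the"]
d_model = 8                                      # toy embedding size (illustrative)

x = rng.normal(size=(len(tokens), d_model))      # stand-in embeddings, one row per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v              # queries, keys, values
scores = q @ k.T / np.sqrt(d_model)              # scaled dot-product similarities

mask = np.triu(np.ones_like(scores), k=1)        # causal mask: block attention to future tokens
scores = np.where(mask == 1, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
context = weights @ v                            # context-aware representation per token

print(weights.round(2))                          # row i: how token i attends to tokens 0..i
```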
Step 2: Final Hidden State
After passing through all Transformer blocks, the hidden state at the last token position emerges as a single vector that condenses the information from the entire context.
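A tiny sketch of this step, with a random matrix standing in for the output of the last Transformer block:

```python
import numpy as np

seq_len, d_model = 5, 8                          # toy sizes (illustrative)
hidden_states = np.random.default_rng(1).normal(size=(seq_len, d_model))

# For next-token prediction, only the hidden state at the last position is used:
# it is the vector that has attended over the entire context "The cat sat on the".
h_final = hidden_states[-1]
print(h_final.shape)                             # (8,)
```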
Step 3: Projection to Vocabulary
A final linear layer (the output projection, often called the LM head) projects this vector to the size of the vocabulary, typically 50K to 128K tokens.
This produces a logit, an unnormalized score, for each possible token.
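A sketch of the projection, using a toy 10-token vocabulary and random weights in place of a trained output matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 10                      # toy sizes; real vocabularies have ~50K-128K entries

h_final = rng.normal(size=d_model)               # stand-in for the final hidden state
W_out = rng.normal(size=(d_model, vocab_size))   # output projection ("LM head" / unembedding matrix)
b_out = np.zeros(vocab_size)

logits = h_final @ W_out + b_out                 # one unnormalized score (logit) per vocabulary token
print(logits.shape)                              # (10,)
```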
Step 4: Softmax Normalization
The softmax function converts the logits into probabilities between 0 and 1 that sum to 1: softmax(z_i) = exp(z_i) / Σ_j exp(z_j).
Under greedy decoding, the token with the highest probability is selected; in the example, this is "mat". (In practice, sampling strategies such as temperature or top-p can also pick lower-probability tokens.)
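A small sketch of this last step, with a made-up five-word vocabulary and hand-picked logits chosen so that "mat" wins, to match the example; in a real model the logits come from the projection shown above:

```python
import numpy as np

vocab = ["mat", "dog", "roof", "floor", "table"]       # toy vocabulary (illustrative)
logits = np.array([4.1, 1.2, 2.3, 0.5, 1.8])           # made-up logits for the example

# Softmax: exponentiate (shifted by the max for numerical stability), then normalize.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(dict(zip(vocab, probs.round(3))))                # probabilities between 0 and 1, summing to 1
print("greedy choice:", vocab[int(np.argmax(probs))])  # -> "mat"
```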