The Basic Principle: Predicting the Next Token

The heart of LLM pretraining is a surprisingly simple task: Given all previous tokens, predict the next one. This "self-supervised" task is why models can be trained on trillions of tokens with minimal labeling effort.

Mathematically, this is formulated as the autoregressive language-modeling objective:

\mathcal{L}(\theta) = -\sum_{t} \log P(x_t \mid x_{<t}; \theta)

where x_t is the token at position t, x_{<t} denotes all previous tokens, and θ are the model parameters.

The model learns through backpropagation to adjust the probability distribution P so that it assigns as much probability as possible to the ground-truth next token.
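In code, this objective is simply a cross-entropy loss over shifted token sequences. The following is a minimal sketch assuming PyTorch and a generic `model` that maps token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size); the name and interface are illustrative, not a specific library API:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Average -log P(x_t | x_<t) over all positions (teacher forcing)."""
    inputs = tokens[:, :-1]    # context x_<t for every position
    targets = tokens[:, 1:]    # ground-truth next tokens x_t
    logits = model(inputs)     # (batch, seq_len-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one row per position
        targets.reshape(-1),                  # one target id per position
    )
```

Every position in every sequence contributes one term of the sum, which is why a single forward pass trains the model on many prediction problems at once.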

Fig. 1 | Next-Token-Prediction Animation: A Transformer processes "The cat sat on the" and generates a probability distribution over the vocabulary. The most likely token is "mat" with the highest probability.

How Prediction Works

Step 1: Process Context
The Transformer processes all previous tokens: "The", "cat", "sat", "on", "the". Through causal self-attention, every token can attend to the tokens before it and capture dependencies across the whole context.
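For illustration, here is a minimal sketch of causal (masked) self-attention for a single head in PyTorch; it omits multi-head projections, batching, and all optimizations:

```python
import math
import torch

def causal_self_attention(q, k, v):
    """q, k, v: (seq_len, d) tensors for one attention head.
    Each position may attend only to itself and to earlier positions."""
    seq_len, d = q.shape
    scores = q @ k.T / math.sqrt(d)                      # (seq_len, seq_len) similarities
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # block attention to future tokens
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                                   # weighted mix of value vectors
```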

Step 2: Final Hidden State
After all Transformer blocks, the final hidden state at the last position emerges – a vector that compresses all the context information needed to predict the next token.

Step 3: Projection to Vocabulary
A final linear layer (the LM head) projects this vector to the vocabulary size (typically 50K to 128K tokens). This produces a logit – an unnormalized score – for each possible token.

Step 4: Softmax Normalization
The softmax function converts logits to probabilities (0-1, sum = 1):

P(\mathrm{token}_i) = \frac{\exp(\mathrm{logit}_i)}{\sum_j \exp(\mathrm{logit}_j)}

The token with the highest probability is typically selected – in the example "mat".
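Steps 2–4 in a compact sketch (PyTorch with toy dimensions; `h` and `W_vocab` are illustrative names, not a specific model's API):

```python
import torch

d_model, vocab_size = 8, 100                 # toy sizes; real models are far larger
h = torch.randn(d_model)                     # step 2: final hidden state
W_vocab = torch.randn(vocab_size, d_model)   # step 3: LM head / unembedding matrix

logits = W_vocab @ h                         # one unnormalized score per vocabulary token
probs = torch.softmax(logits, dim=-1)        # step 4: probabilities in [0, 1], sum = 1
next_token_id = torch.argmax(probs)          # greedy choice: the most likely token
```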

Fig. 2 | Distribution of probabilities over Top-20 tokens. "mat" leads with 28%, followed by "ground" (15%), "sofa" (12%). The long tail shows many tokens with low probability.

Teacher Forcing and the Exposure Bias Problem

During training, the model always sees the ground-truth tokens as context, never its own predictions. This is called Teacher Forcing; it simplifies training considerably and allows all positions of a sequence to be trained in parallel.

The Exposure Bias Problem: During inference, however, the model must work with its own predictions – potentially including erroneous tokens. This discrepancy between training and inference can lead to error accumulation.

Some methods to mitigate this:

  • Scheduled Sampling: during training, gradually replace ground-truth tokens in the context with the model's own predictions (see the sketch below Fig. 3)
  • Autoregressive Fine-Tuning: additional fine-tuning rounds after pretraining
  • Decoding Strategies: Beam Search, Top-k, or Top-p sampling to find better sequences at inference time

Fig. 3 | Comparison: Teacher Forcing (green arrows = ground-truth) vs. Autoregressive Inference (red arrows = model predictions). During training, the context is correct; during inference, errors can accumulate.
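As a rough, one-pass sketch of the Scheduled Sampling idea (assuming PyTorch; `model` and `sampling_prob` are placeholders, and real implementations vary):

```python
import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, tokens, sampling_prob):
    """One training step where, with probability `sampling_prob`, a context
    token is replaced by the model's own greedy prediction for that position.
    `sampling_prob` is typically increased over the course of training."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)   # preds[:, i] guesses inputs[:, i+1]

    mixed = inputs.clone()
    use_pred = torch.rand(inputs[:, 1:].shape) < sampling_prob
    mixed[:, 1:] = torch.where(use_pred, preds[:, :-1], inputs[:, 1:])

    logits = model(mixed)                      # second pass on the mixed context
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```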

Key Insights

🎯 Scalability

Self-Supervised Learning on unlabeled data enables training on trillions of tokens – frontier models such as o3 are trained at this scale.

📊 Simple Metric

Perplexity (the exponential of the Cross-Entropy Loss) is an intuitive metric: on average, how surprised is the model by each token of the test text?
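A small numeric sketch of the relationship (the per-token losses below are made-up illustrative values):

```python
import math

# Per-token negative log-likelihoods (natural log), e.g. from a validation set.
token_nlls = [2.1, 0.4, 1.3, 0.9]                   # illustrative values only

cross_entropy = sum(token_nlls) / len(token_nlls)   # mean loss per token: 1.175
perplexity = math.exp(cross_entropy)                # exp(1.175) ≈ 3.24

print(f"cross-entropy: {cross_entropy:.3f}  perplexity: {perplexity:.2f}")
```

A perplexity of about 3 means the model is, on average, as uncertain as if it were choosing uniformly among roughly three tokens.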

🔄 Autoregressive Generation

During generation, prediction → token → context → prediction is repeated in a loop until a stop token is reached.
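A minimal greedy version of this loop might look as follows (sketch only; `model` is again a placeholder mapping (1, seq_len) token ids to (1, seq_len, vocab_size) logits, and there is no KV caching):

```python
import torch

def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Greedy autoregressive generation: predict -> append -> repeat."""
    ids = prompt_ids.clone()                              # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                               # forward pass over the full context
        next_id = logits[0, -1].argmax()                  # prediction at the last position
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1) # the token becomes new context
        if next_id.item() == eos_id:                      # stop token reached
            break
    return ids
```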

⚡ Quality with Size

Emergent capabilities occur: certain abilities that small models do not show at all appear in large models without being explicitly trained for.

🎭 Ambiguity of Language

Many continuations are plausible. A model that assigns 28% to "mat" should also keep probability mass on reasonable alternatives.

🧠 Decoding Strategies

Greedy (best token), Sampling, Beam Search – different approaches for selecting tokens during generation.
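For example, temperature plus Top-k sampling can be sketched as follows (one of many possible decoding strategies; the parameter values are illustrative):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Temperature + Top-k sampling over a 1-D logits vector."""
    logits = logits / temperature                     # <1 sharpens, >1 flattens the distribution
    top_logits, top_ids = torch.topk(logits, top_k)   # keep only the k highest-scoring tokens
    probs = torch.softmax(top_logits, dim=-1)         # renormalize over the survivors
    choice = torch.multinomial(probs, num_samples=1)  # draw one token at random
    return top_ids[choice]
```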