The Basic Principle: Predicting the Next Token

The heart of LLM pretraining is a surprisingly simple task: Given all previous tokens, predict the next one. This "self-supervised" task is why models can be trained on trillions of tokens with minimal labeling effort.

Mathematically, this is formulated as the autoregressive language-modeling objective:

\mathcal{L}(\theta) = -\sum_{t} \log P(x_t \mid x_{<t}; \theta)

where x_t is the token at position t, x_{<t} denotes all previous tokens, and θ are the model parameters.

The model learns through backpropagation to adjust the probability distribution P so that it assigns as much probability as possible to the ground-truth next token.
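In code, this objective is simply a cross-entropy loss over shifted token sequences. The following is a minimal sketch assuming PyTorch and a generic `model` that maps token ids of shape (batch, seq_len) to logits of shape (batch, seq_len, vocab_size); the name and interface are illustrative, not a specific library API:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """Average -log P(x_t | x_<t) over all positions (teacher forcing)."""
    inputs = tokens[:, :-1]    # context x_<t for every position
    targets = tokens[:, 1:]    # ground-truth next tokens x_t
    logits = model(inputs)     # (batch, seq_len-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # one row per position
        targets.reshape(-1),                  # one target id per position
    )
```

Every position in every sequence contributes one term of the sum, which is why a single forward pass trains the model on many prediction problems at once.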

Fig. 1 | Next-Token-Prediction Animation: A Transformer processes "The cat sat on the" and generates a probability distribution over the vocabulary. The most likely token is "mat" with the highest probability.

How Prediction Works

Step 1: Process Context
The Transformer processes all previous tokens: "The", "cat", "sat", "on", "the". Through causal self-attention, every token can attend to the tokens before it and capture dependencies across the whole context.
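For illustration, here is a minimal sketch of causal (masked) self-attention for a single head in PyTorch; it omits multi-head projections, batching, and all optimizations:

```python
import math
import torch

def causal_self_attention(q, k, v):
    """q, k, v: (seq_len, d) tensors for one attention head.
    Each position may attend only to itself and to earlier positions."""
    seq_len, d = q.shape
    scores = q @ k.T / math.sqrt(d)                      # (seq_len, seq_len) similarities
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))   # block attention to future tokens
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                                   # weighted mix of value vectors
```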

Step 2: Final Hidden State
After all Transformer blocks, the final hidden state at the last position emerges – a vector that compresses all the context information needed to predict the next token.

Step 3: Projection to Vocabulary
A final linear layer (the LM head) projects this vector to the vocabulary size (typically 50K to 128K tokens). This produces a logit – an unnormalized score – for each possible token.

Step 4: Softmax Normalization
The softmax function converts logits to probabilities (0-1, sum = 1):

P(\mathrm{token}_i) = \frac{\exp(\mathrm{logit}_i)}{\sum_j \exp(\mathrm{logit}_j)}

The token with the highest probability is typically selected – in the example "mat".
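Steps 2–4 in a compact sketch (PyTorch with toy dimensions; `h` and `W_vocab` are illustrative names, not a specific model's API):

```python
import torch

d_model, vocab_size = 8, 100                 # toy sizes; real models are far larger
h = torch.randn(d_model)                     # step 2: final hidden state
W_vocab = torch.randn(vocab_size, d_model)   # step 3: LM head / unembedding matrix

logits = W_vocab @ h                         # one unnormalized score per vocabulary token
probs = torch.softmax(logits, dim=-1)        # step 4: probabilities in [0, 1], sum = 1
next_token_id = torch.argmax(probs)          # greedy choice: the most likely token
```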

Fig. 2 | Distribution of probabilities over Top-20 tokens. "mat" leads with 28%, followed by "ground" (15%), "sofa" (12%). The long tail shows many tokens with low probability.

Teacher Forcing and the Exposure Bias Problem

During training, the model always sees the ground-truth tokens as context, never its own predictions. This is called Teacher Forcing; it simplifies training considerably and allows all positions of a sequence to be trained in parallel.

The Exposure Bias Problem: During inference, however, the model must work with its own predictions – potentially including erroneous tokens. This discrepancy between training and inference can lead to error accumulation.

Some methods to mitigate this:

  • Scheduled Sampling: during training, gradually replace ground-truth tokens in the context with the model's own predictions (see the sketch below Fig. 3)
  • Autoregressive Fine-Tuning: additional fine-tuning rounds after pretraining
  • Decoding Strategies: Beam Search, Top-k, or Top-p sampling to find better sequences at inference time

Fig. 3 | Comparison: Teacher Forcing (green arrows = ground-truth) vs. Autoregressive Inference (red arrows = model predictions). During training, the context is correct; during inference, errors can accumulate.
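As a rough, one-pass sketch of the Scheduled Sampling idea (assuming PyTorch; `model` and `sampling_prob` are placeholders, and real implementations vary):

```python
import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, tokens, sampling_prob):
    """One training step where, with probability `sampling_prob`, a context
    token is replaced by the model's own greedy prediction for that position.
    `sampling_prob` is typically increased over the course of training."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]

    with torch.no_grad():
        preds = model(inputs).argmax(dim=-1)   # preds[:, i] guesses inputs[:, i+1]

    mixed = inputs.clone()
    use_pred = torch.rand(inputs[:, 1:].shape) < sampling_prob
    mixed[:, 1:] = torch.where(use_pred, preds[:, :-1], inputs[:, 1:])

    logits = model(mixed)                      # second pass on the mixed context
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```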

Key Insights

🎯 Scalability

Self-Supervised Learning on unlabeled data enables training on trillions of tokens – frontier models such as o3 are trained at this scale.

📊 Simple Metric

Perplexity (the exponential of the Cross-Entropy Loss) is an intuitive metric: on average, how surprised is the model by each token of the test text?
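A small numeric sketch of the relationship (the per-token losses below are made-up illustrative values):

```python
import math

# Per-token negative log-likelihoods (natural log), e.g. from a validation set.
token_nlls = [2.1, 0.4, 1.3, 0.9]                   # illustrative values only

cross_entropy = sum(token_nlls) / len(token_nlls)   # mean loss per token: 1.175
perplexity = math.exp(cross_entropy)                # exp(1.175) ≈ 3.24

print(f"cross-entropy: {cross_entropy:.3f}  perplexity: {perplexity:.2f}")
```

A perplexity of about 3 means the model is, on average, as uncertain as if it were choosing uniformly among roughly three tokens.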

🔄 Autoregressive Generation

During generation, prediction → token → context → prediction is repeated in a loop until a stop token is reached.
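A minimal greedy version of this loop might look as follows (sketch only; `model` is again a placeholder mapping (1, seq_len) token ids to (1, seq_len, vocab_size) logits, and there is no KV caching):

```python
import torch

def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Greedy autoregressive generation: predict -> append -> repeat."""
    ids = prompt_ids.clone()                              # (1, prompt_len)
    for _ in range(max_new_tokens):
        logits = model(ids)                               # forward pass over the full context
        next_id = logits[0, -1].argmax()                  # prediction at the last position
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1) # the token becomes new context
        if next_id.item() == eos_id:                      # stop token reached
            break
    return ids
```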

⚡ Quality with Size

Emergent capabilities occur: certain abilities that small models do not show at all appear in large models without being explicitly trained for.

🎭 Ambiguity of Language

Many continuations are plausible. A model that assigns 28% to "mat" should also keep probability mass on reasonable alternatives.

🧠 Decoding Strategies

Greedy (best token), Sampling, Beam Search – different approaches for selecting tokens during generation.
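For example, temperature plus Top-k sampling can be sketched as follows (one of many possible decoding strategies; the parameter values are illustrative):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Temperature + Top-k sampling over a 1-D logits vector."""
    logits = logits / temperature                     # <1 sharpens, >1 flattens the distribution
    top_logits, top_ids = torch.topk(logits, top_k)   # keep only the k highest-scoring tokens
    probs = torch.softmax(top_logits, dim=-1)         # renormalize over the survivors
    choice = torch.multinomial(probs, num_samples=1)  # draw one token at random
    return top_ids[choice]
```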