Watch step by step how the BPE algorithm breaks down text into tokens – the foundation of every LLM.
Tokenization is the entry point to every LLM. The token IDs produced here are converted into continuous vectors in the next step (1.2 Embeddings), which the model can then process numerically.
The choice of tokenizer directly affects model performance: a larger vocabulary allows more compact text representations (shorter sequences) but requires more parameters in the embedding matrix. Modern models such as Llama 3 (128K-token vocabulary) and GPT-4 (~100K tokens) carefully balance this trade-off.
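As a quick illustration of what "shorter sequences" means in practice, GPT-4's cl100k_base encoding can be inspected with the tiktoken library (a minimal sketch, assuming the package is installed; the sample text is arbitrary):

```python
import tiktoken  # pip install tiktoken

# GPT-4's cl100k_base encoding has a vocabulary of roughly 100K tokens
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # vocabulary size, ~100K

text = "Tokenization is the entry point to every LLM."
ids = enc.encode(text)
print(ids)                                    # a short list of integer token IDs
print(len(text), "characters ->", len(ids), "tokens")
print(enc.decode(ids) == text)                # True: encoding is lossless
```

A larger vocabulary packs more characters into each token, so the same text yields fewer IDs, at the cost of a bigger embedding matrix.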
Byte Pair Encoding starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. In the example aaabdaaabac, aa is first merged into Z, then ab into Y, and so on, until the desired vocabulary size is reached.
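The merge loop can be sketched in a few lines of Python (the helpers bpe_merge and most_frequent_pair are illustrative, not from an existing library). Real tokenizers count pairs over a whole corpus and break frequency ties deterministically; this toy version breaks ties by first occurrence, so its merge order can differ from the aa → Z, ab → Y walkthrough above.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent symbol pair, or None if no pair exists."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def bpe_merge(text, num_merges):
    """Run up to `num_merges` BPE merge steps, starting from single characters."""
    symbols = list(text)              # e.g. ['a', 'a', 'a', 'b', 'd', ...]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        merges.append(pair)
        new_symbol = "".join(pair)    # the merged pair becomes one new token
        merged, i = [], 0
        while i < len(symbols):
            # Replace every occurrence of the pair with the new symbol
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

symbols, merges = bpe_merge("aaabdaaabac", num_merges=3)
print(merges)   # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')] with this tie-breaking
print(symbols)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Each learned merge rule becomes an entry in the vocabulary; applying the same rules in the same order to new text reproduces the tokenization.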