Modern LLMs use Byte Pair Encoding (BPE) vocabularies of roughly 50,000–128,000 tokens. Each token can be a complete word, a subword, or a single character. Common words map to single tokens, while rare words are split into multiple sub-tokens.
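The split behavior falls out of how BPE is trained: starting from characters, the most frequent adjacent symbol pairs are merged repeatedly, so frequent words accumulate enough merges to become one token while rare words stay fragmented. A minimal sketch (toy corpus and merge count are illustrative, not any production tokenizer):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words, most frequent pairs first."""
    vocab = Counter(corpus)
    splits = {w: list(w) for w in vocab}  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, freq in vocab.items():
            syms = splits[w]
            for i in range(len(syms) - 1):
                pair_counts[(syms[i], syms[i + 1])] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the new merge to every word's current split.
        for w in splits:
            splits[w] = _apply_merge(splits[w], best)
    return merges

def _apply_merge(syms, pair):
    out, i = [], 0
    while i < len(syms):
        if i < len(syms) - 1 and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1])
            i += 2
        else:
            out.append(syms[i])
            i += 1
    return out

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    syms = list(word)
    for pair in merges:
        syms = _apply_merge(syms, pair)
    return syms

corpus = ["the"] * 10 + ["that"] * 5 + ["this"] * 5
merges = learn_bpe(corpus, 5)
print(encode("the", merges))    # frequent word: one token
print(encode("zebra", merges))  # unseen word: falls back to characters
```

With enough merges, "the" collapses to a single token, while a word never seen in training stays split into its characters, mirroring the word/subword/character breakdown above.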

In a typical vocabulary of ~100,000 tokens, roughly 50% are whole-word tokens, about 40% are subword tokens, and about 10% are character-level or special tokens.
