Watch step by step how the BPE algorithm breaks down text into tokens – the foundation of every LLM.
Tokenization is the entry point to every LLM. The token IDs produced here are converted into continuous vectors in the next step (1.2 Embeddings), which the model can then process numerically.
The choice of tokenizer directly affects model performance: a larger vocabulary allows more compact text representations (shorter sequences) but requires more parameters in the embedding matrix. Modern models such as Llama 3 (128K-token vocabulary) and GPT-4 (~100K tokens) carefully balance this trade-off.
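As a quick illustration of what "shorter sequences" means in practice, GPT-4's cl100k_base encoding can be inspected with the tiktoken library (a minimal sketch, assuming the package is installed; the sample text is arbitrary):

```python
import tiktoken  # pip install tiktoken

# GPT-4's cl100k_base encoding has a vocabulary of roughly 100K tokens
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # vocabulary size, ~100K

text = "Tokenization is the entry point to every LLM."
ids = enc.encode(text)
print(ids)                                    # a short list of integer token IDs
print(len(text), "characters ->", len(ids), "tokens")
print(enc.decode(ids) == text)                # True: encoding is lossless
```

A larger vocabulary packs more characters into each token, so the same text yields fewer IDs, at the cost of a bigger embedding matrix.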
Byte Pair Encoding starts with individual characters and iteratively merges the most frequent adjacent pair into a new token. In the example aaabdaaabac, aa is first merged into Z, then ab into Y, and so on, until the desired vocabulary size is reached.
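The merge loop can be sketched in a few lines of Python (the helpers bpe_merge and most_frequent_pair are illustrative, not from an existing library). Real tokenizers count pairs over a whole corpus and break frequency ties deterministically; this toy version breaks ties by first occurrence, so its merge order can differ from the aa → Z, ab → Y walkthrough above.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent symbol pair, or None if no pair exists."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def bpe_merge(text, num_merges):
    """Run up to `num_merges` BPE merge steps, starting from single characters."""
    symbols = list(text)              # e.g. ['a', 'a', 'a', 'b', 'd', ...]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        merges.append(pair)
        new_symbol = "".join(pair)    # the merged pair becomes one new token
        merged, i = [], 0
        while i < len(symbols):
            # Replace every occurrence of the pair with the new symbol
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

symbols, merges = bpe_merge("aaabdaaabac", num_merges=3)
print(merges)   # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')] with this tie-breaking
print(symbols)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Each learned merge rule becomes an entry in the vocabulary; applying the same rules in the same order to new text reproduces the tokenization.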