Modern LLMs use Byte Pair Encoding (BPE) vocabularies of roughly 50,000–128,000 tokens. Each token can be a complete word, a subword, or a single character. Common words map to single tokens, while rare words are split into multiple sub-tokens.
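The split behavior falls out of how BPE is trained: starting from characters, the most frequent adjacent symbol pairs are merged repeatedly, so frequent words accumulate enough merges to become one token while rare words stay fragmented. A minimal sketch (toy corpus and merge count are illustrative, not any production tokenizer):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words, most frequent pairs first."""
    vocab = Counter(corpus)
    splits = {w: list(w) for w in vocab}  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for w, freq in vocab.items():
            syms = splits[w]
            for i in range(len(syms) - 1):
                pair_counts[(syms[i], syms[i + 1])] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the new merge to every word's current split.
        for w in splits:
            splits[w] = _apply_merge(splits[w], best)
    return merges

def _apply_merge(syms, pair):
    out, i = [], 0
    while i < len(syms):
        if i < len(syms) - 1 and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1])
            i += 2
        else:
            out.append(syms[i])
            i += 1
    return out

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    syms = list(word)
    for pair in merges:
        syms = _apply_merge(syms, pair)
    return syms

corpus = ["the"] * 10 + ["that"] * 5 + ["this"] * 5
merges = learn_bpe(corpus, 5)
print(encode("the", merges))    # frequent word: one token
print(encode("zebra", merges))  # unseen word: falls back to characters
```

With enough merges, "the" collapses to a single token, while a word never seen in training stays split into its characters, mirroring the word/subword/character breakdown above.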

In a typical vocabulary of ~100,000 tokens, roughly 50% are whole-word tokens, about 40% are subword tokens, and about 10% are character-level or special tokens.
