Step 1: Identify Token ID
Select a token from the vocabulary. Each token has a unique ID between 0 and V-1, where V is the vocabulary size.
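A minimal sketch of this mapping, using a toy made-up vocabulary (the tokens and IDs here are illustrative only, not from a real tokenizer):

vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}   # toy vocabulary, V = 4
V = len(vocab)

token = "cat"
token_id = vocab.get(token, vocab["<unk>"])   # every ID lies in [0, V-1]
print(token_id)   # 1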
Embedding Lookup:
E ∈ ℝ^(V×d), the embedding matrix, where:
V = Vocabulary Size (e.g., 50,000)
d = Embedding Dimension (e.g., 512)

Lookup Operation:
embedding = E[token_id, :]

This is a simple row selection – no matrix multiplication needed!
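A minimal NumPy sketch of the lookup, with made-up sizes (a real model would use its learned matrix, not random values):

import numpy as np

V, d = 50_000, 512                             # vocabulary size, embedding dimension
E = np.random.randn(V, d).astype(np.float32)   # stands in for the learned embedding matrix

token_id = 42                                  # illustrative token ID
embedding = E[token_id, :]                     # row selection, no matrix multiplication
print(embedding.shape)                         # (512,)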
Why Lookup?
The embedding matrix is a trainable lookup table. Each row corresponds to a token and contains its learned vector. The lookup is an O(1) operation.
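The row selection is mathematically the same as multiplying a one-hot vector by E, which is why no matrix multiplication is needed; a small sketch with illustrative sizes:

import numpy as np

V, d = 6, 4
E = np.arange(V * d, dtype=np.float32).reshape(V, d)

token_id = 3
one_hot = np.zeros(V, dtype=np.float32)
one_hot[token_id] = 1.0

# Same result, but the direct lookup skips the O(V·d) multiply.
assert np.allclose(one_hot @ E, E[token_id, :])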
Dimensions in Practice
Original Transformer: 512
BERT: 768
GPT-3: 12,288
Llama 2 7B: 4,096
Llama 3 70B: 8,192
Larger dimensions mean more capacity, but also more parameters.
Parameter Count
The embedding matrix has V × d parameters. For GPT-4-scale settings (~100K vocabulary, an estimated 12K embedding dimension), that is roughly 1.2 billion parameters for the embeddings alone!
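A quick check of that arithmetic (the GPT-4 figures are estimates, as noted above):

def embedding_params(vocab_size, dim):
    return vocab_size * dim

print(embedding_params(50_000, 512))       # 25,600,000 for the smaller example above
print(embedding_params(100_000, 12_000))   # 1,200,000,000 ≈ 1.2 billion (GPT-4 estimate)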
Training
Embedding vectors are learned during pretraining through backpropagation. Semantically similar tokens develop similar vectors (see embedding-space-2d.html).
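A minimal PyTorch sketch (toy sizes, not a real training loop) of how backpropagation reaches the embedding matrix: only the rows whose tokens appear in a batch receive gradient.

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)   # toy V = 10, d = 4
token_ids = torch.tensor([2, 5])                          # tokens seen in this batch

loss = emb(token_ids).sum()   # stand-in for a real language-modeling loss
loss.backward()

print(emb.weight.grad[2])     # nonzero: row 2 was looked up
print(emb.weight.grad[0])     # all zeros: row 0 was not used this step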