Words as points in semantic space – similar meanings are close together. This t-SNE projection reduces ~8,000 dimensions to 2D.
After tokenization (Step 1), we have discrete token IDs. Embeddings transform these IDs into continuous vectors that downstream layers can compute with. These vectors then flow through Positional Encoding (Step 3) into the Attention calculation.
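To make the lookup concrete, here is a minimal sketch in NumPy. The vocabulary size, dimension, and token IDs are illustrative assumptions, not values from any particular model; in a real LLM the embedding matrix is learned during training rather than randomly initialized like this.

```python
import numpy as np

# Illustrative sizes, not those of any specific model
vocab_size, d_model = 50_000, 512

# The embedding matrix: one row vector per token ID
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.02, size=(vocab_size, d_model))

# Token IDs coming out of Step 1 (tokenization) -- values are made up
token_ids = [314, 1020, 42]

# Embedding is simply a row lookup: each discrete ID becomes a dense vector
token_vectors = embedding_matrix[token_ids]   # shape: (3, d_model)
print(token_vectors.shape)                    # (3, 512)
```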
The embedding dimension (d_model) defines the model's capacity to represent meaning. Larger dimensions (GPT-3: 12,288, Llama 3 70B: 8,192) enable finer semantic distinctions but require more compute. The embedding matrix alone can account for roughly a quarter of all model parameters, especially in smaller models.
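A rough back-of-the-envelope calculation shows the scale: the embedding matrix has vocab_size × d_model entries. The vocabulary size below is an assumption (about 128K tokens, in the range of modern tokenizers), not a published figure.

```python
# Parameter count of the embedding matrix: vocab_size * d_model
vocab_size = 128_000   # assumed vocabulary size
d_model = 8_192        # embedding dimension in the Llama 3 70B range
embedding_params = vocab_size * d_model
print(f"{embedding_params:,}")   # 1,048,576,000 -> roughly 1B parameters
```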
Every word in an LLM is represented by a high-dimensional vector (e.g., d = 8,192 for Llama 3 70B). These vectors capture semantic relationships: words with similar meanings have similar vectors and lie close together in space. The t-SNE projection makes this structure visible in 2D – notice the clear clusters for animals, countries, verbs, and adjectives.
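"Close together" is usually measured with cosine similarity. The sketch below uses tiny 4-dimensional toy vectors invented purely for illustration (real embeddings have thousands of dimensions), but it shows the pattern the projection reveals: words from the same semantic cluster score higher than words from different clusters.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 = similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for illustration only
cat   = np.array([0.8, 0.1, 0.6, 0.2])
dog   = np.array([0.7, 0.2, 0.5, 0.3])
paris = np.array([0.1, 0.9, 0.2, 0.7])

print(cosine_similarity(cat, dog))    # high: both in the "animals" cluster
print(cosine_similarity(cat, paris))  # lower: different semantic cluster
```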