CHAPTER 8.1b · TRAINING

Data Composition

How LLMs are trained: Composition of web, books, code, and academic sources

Data Composition is the underrated factor in LLM quality: The mix of web, books, code, and academic sources determines what a model learns – and what it doesn't. "Garbage in, garbage out" applies even for trillions of tokens.

📖 Learning Context ▼

Know typical data sources for LLM training
Understand the importance of data quality vs. quantity
Recognize common data biases

Step 2/5 Chapter 8: Tools & Glossary

After Scaling Laws (1/5), we examine Data & Training (2/5) – what LLMs learn from.

Data quality trumps model size: Llama 2 outperforms GPT-3 partly due to more careful data curation. Understanding the data explains strengths and weaknesses.

Web Crawls: ~80% of data, but noisy
Books & Papers: High quality, but limited
Code: Surprisingly improves reasoning

Training Data Composition

The balance between different data sources determines model behavior

CommonCrawl (Web)

~60%

Books & Articles

~20%

Programming Code

~12%

Academic Sources

~5%

Other

~3%

Comparison: Data Mix of Different Models

GPT-4

Web: ~50%

Books: ~20%

Code: ~15%

Academic: ~10%

Size: ~1.76T Token

Llama 3

Web: ~60%

Books: ~15%

Code: ~15%

Academic: ~5%

Size: ~15T Token

Claude (Anthropic)

Web: ~55%

Books: ~25%

Code: ~12%

Academic: ~8%

Size: ~4T Token

Mistral 7B

Web: ~70%

Books: ~10%

Code: ~12%

Academic: ~8%

Size: ~600B Token

Key Insights

🔑 Key Insights

Web Dominates

CommonCrawl makes up 50-70% of the data. Largest available source, but variable quality.

Books for Quality

High-quality, long-range dependencies. Google Books, Project Gutenberg, academic sources.

Code for Capabilities

GitHub, GitLab, Stack Overflow. Contributes to reasoning and tool use.

Academic Rigor

arXiv, papers, dissertations. Small volume, but high conceptual density.

Deduplication

Removes duplicates, improves generalization. Complex algorithms (BloomFilter, exact matching).

Token vs. File

Large files ≠ more tokens. Tokenization varies by language and domain.

Data Quality & Cleaning

Selection & Filtering

Language Detection: Target language only

Quality Scoring: Remove low-quality

Perplexity Filtering: LM-based quality check

Deduplication

Exact Match: Identical sequences

N-gram Filter: Similar blocks

Dataset Level: Duplicates between sources

Concerns & Mitigations

Bias: Stratified Sampling

PII Removal: Privacy masking

Data Composition

Learning Objectives

Context: Where are we?

Why It Matters

Key Takeaways

Training Data Composition

Comparison: Data Mix of Different Models

GPT-4

Llama 3

Claude (Anthropic)

Mistral 7B

🔑 Key Insights

Web Dominates

Books for Quality

Code for Capabilities

Academic Rigor

Deduplication

Token vs. File

Data Quality & Cleaning

Selection & Filtering

Deduplication

Concerns & Mitigations