Training Data Composition

The balance between different data sources determines model behavior

CommonCrawl (Web): ~60%
Books & Articles: ~20%
Programming Code: ~12%
Academic Sources: ~5%
Other: ~3%
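
As an illustration, the mix above can be read as sampling weights: for a fixed token budget, each source contributes roughly its share of tokens. A minimal sketch of this idea (the token budget and source keys are assumptions for the example, not values from any published recipe):

```python
import random

# Approximate mix from the list above, used as sampling weights.
DATA_MIX = {
    "commoncrawl_web": 0.60,
    "books_articles":  0.20,
    "code":            0.12,
    "academic":        0.05,
    "other":           0.03,
}

# Hypothetical token budget for a single training run.
TOKEN_BUDGET = 1_000_000_000

# Expected number of tokens drawn from each source under this mix.
for source, weight in DATA_MIX.items():
    print(f"{source:16s} ~{int(weight * TOKEN_BUDGET):>13,} tokens")

# During training, the source of each sampled document follows the same weights.
def sample_source(rng: random.Random) -> str:
    sources, weights = zip(*DATA_MIX.items())
    return rng.choices(sources, weights=weights, k=1)[0]

print(sample_source(random.Random(0)))
```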

Comparison: Data Mix of Different Models

GPT-4

Web: ~50%
Books: ~20%
Code: ~15%
Academic: ~10%
Size: ~1.76T tokens

Llama 3

Web: ~60%
Books: ~15%
Code: ~15%
Academic: ~5%
Size: ~15T tokens

Claude (Anthropic)

Web: ~55%
Books: ~25%
Code: ~12%
Academic: ~8%
Size: ~4T tokens

Mistral 7B

Web: ~70%
Books: ~10%
Code: ~12%
Academic: ~8%
Size: ~600B tokens

🔑 Key Insights

Web Dominates

CommonCrawl makes up 50-70% of the data. Largest available source, but variable quality.

Books for Quality

High-quality text with long-range dependencies. Google Books, Project Gutenberg, academic sources.

Code for Capabilities

GitHub, GitLab, Stack Overflow. Code data contributes to reasoning and tool use.

Academic Rigor

arXiv, papers, dissertations. Small volume, but high conceptual density.

Deduplication

Removing duplicates improves generalization. Implemented with exact matching and probabilistic structures such as Bloom filters (see the sketch below).
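
A minimal sketch of exact-match deduplication via content hashes; at scale, the in-memory set could be replaced by a Bloom filter to bound memory (all names here are illustrative):

```python
import hashlib
from typing import Iterable, Iterator

def dedup_exact(documents: Iterable[str]) -> Iterator[str]:
    """Yield each document only the first time its normalized text is seen."""
    seen: set[bytes] = set()  # a Bloom filter could stand in for this set at scale
    for doc in documents:
        # Light normalization so trivial whitespace differences do not defeat the match.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield doc

docs = ["Hello  world", "Hello world", "Something else"]
print(list(dedup_exact(docs)))  # the whitespace variant is dropped
```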

Token vs. File

Large files ≠ more tokens. Tokenization varies by language and domain.
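
To see why file size is a poor proxy for token count, the same tokenizer can be run over text from different languages and domains. A small sketch using the tiktoken library (any BPE tokenizer shows the same effect; the sample strings are arbitrary):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english prose": "The quick brown fox jumps over the lazy dog.",
    "german prose":  "Der schnelle braune Fuchs springt über den faulen Hund.",
    "python code":   "def add(a, b):\n    return a + b",
}

for name, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    print(f"{name:14s} {n_bytes:3d} bytes -> {n_tokens:3d} tokens "
          f"({n_bytes / n_tokens:.1f} bytes/token)")
```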

Data Quality & Cleaning

Selection & Filtering

Language Detection: Keep only documents in the target language(s)
Quality Scoring: Remove low-quality documents (boilerplate, spam, gibberish)
Perplexity Filtering: LM-based quality check (see the sketch after this list)
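
A minimal sketch of how these three filters could be chained. The thresholds are assumptions, the language detector is reduced to a trivial heuristic, and the perplexity scorer is a placeholder for a real language model:

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str

def detect_language(text: str) -> str:
    """Placeholder: real pipelines use fastText- or CLD3-style language ID models."""
    return "en" if " the " in f" {text.lower()} " else "unknown"

def quality_score(text: str) -> float:
    """Crude heuristic: penalize very short documents and high symbol ratios."""
    if len(text) < 200:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)

def perplexity(text: str) -> float:
    """Placeholder for an LM-based score (e.g., a small n-gram model)."""
    return 80.0  # assumed value so the sketch runs end to end

def keep(doc: Document,
         target_lang: str = "en",
         min_quality: float = 0.8,
         max_perplexity: float = 1000.0) -> bool:
    return (detect_language(doc.text) == target_lang
            and quality_score(doc.text) >= min_quality
            and perplexity(doc.text) <= max_perplexity)

corpus = [Document("The history of the printing press shaped modern publishing. " * 10),
          Document("$$$ !!! buy now !!! $$$")]
print([keep(d) for d in corpus])  # [True, False]
```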

Deduplication

Exact Match: Identical documents or sequences
N-gram Filter: Near-duplicate blocks with high n-gram overlap (sketched below)
Dataset Level: Duplicates across different sources
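
The n-gram filter can be sketched as a Jaccard-similarity check over word shingles. Production pipelines usually approximate this with MinHash/LSH so it scales to billions of documents; the threshold below is an assumption:

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """All word-level n-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    return jaccard(ngrams(doc_a), ngrams(doc_b)) >= threshold

a = "Training corpora for large language models mix web text books code and academic papers together"
b = "Training corpora for large language models mix web text books code and academic papers carefully"
print(near_duplicate(a, b))  # True: the two sentences share 10 of their 12 distinct 5-grams
```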

Concerns & Mitigations

Bias: Stratified sampling across sources
Copyright: Protected content can be taken into account and excluded where feasible
PII Removal: Privacy masking of personal data (see the sketch below)
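
PII removal is typically a pattern-based masking pass over the corpus before training. A minimal sketch that replaces e-mail addresses and phone-number-like strings; the patterns are deliberately simplified, and production pipelines combine larger rule sets with NER models:

```python
import re

# Simplified patterns; real pipelines use far stricter and broader rules.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace matched PII spans with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text

sample = "Contact Jane at jane.doe@example.com or +1 415 555 0132 for details."
print(mask_pii(sample))
# -> Contact Jane at <EMAIL> or <PHONE> for details.
```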