How LLMs are trained: Composition of web, books, code, and academic sources
Data composition is an underrated factor in LLM quality: the mix of web, books, code, and academic sources determines what a model learns, and what it doesn't. "Garbage in, garbage out" still applies at trillions of tokens.
After Scaling Laws (1/5), we examine Data & Training (2/5) – what LLMs learn from.
Data quality trumps model size: Llama 2 outperforms GPT-3 partly thanks to more careful data curation. Understanding the training data explains a model's strengths and weaknesses.
The balance between different data sources determines model behavior (a sampling sketch follows the source overview below).
Web (CommonCrawl): makes up 50-70% of the data. The largest available source, but of variable quality.
Books: high quality, long-range dependencies. Google Books, Project Gutenberg, academic sources.
Code: GitHub, GitLab, Stack Overflow. Contributes to reasoning and tool use.
Academic texts: arXiv, papers, dissertations. Small in volume, but high conceptual density.
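As a minimal sketch of how such a mixture can be realized, the Python snippet below samples the source of each training document from fixed weights. The weights and source names are hypothetical, chosen only to roughly match the proportions above; they are not any published model's actual recipe.

```python
import random

# Hypothetical mixture weights, loosely matching the proportions discussed
# above; not any published model's actual recipe.
MIXTURE_WEIGHTS = {
    "web_commoncrawl": 0.60,  # largest source, variable quality
    "books": 0.15,            # long-range dependencies
    "code": 0.15,             # GitHub, GitLab, Stack Overflow
    "academic": 0.10,         # arXiv, papers, dissertations
}

def sample_source(weights, rng):
    """Pick the source of the next training document, proportional to its weight."""
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_source(MIXTURE_WEIGHTS, rng) for _ in range(10_000)]
    for source in MIXTURE_WEIGHTS:
        share = draws.count(source) / len(draws)
        print(f"{source:>16}: {share:.1%}")
```

In practice the weights are tuned per source, which is exactly why the mixture, not just the raw volume, shapes what the model learns.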
Deduplication: removes duplicates and improves generalization. Relies on algorithms such as exact hash matching and Bloom filters.
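A minimal sketch of exact-match deduplication via content hashing; the toy documents and the light normalization step are illustrative assumptions. For corpora too large to hold every hash in memory, a Bloom filter can stand in for the set, at the cost of rare false positives.

```python
import hashlib

def dedup_exact(documents):
    """Exact-match deduplication: keep the first occurrence of each document,
    identified by a hash of its lightly normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

# Toy example: the second entry differs only in casing and collapses into the first.
docs = ["Hello world.", "hello world.", "A completely different page."]
print(dedup_exact(docs))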
Token counting: large files ≠ more tokens. Tokenization varies by language and domain.
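To make this concrete, the sketch below counts tokens for a few short samples. It assumes the tiktoken package and its cl100k_base encoding purely as an example tokenizer; the absolute numbers differ per tokenizer, but the characters-per-token ratio varies noticeably across languages and domains.

```python
# Assumes the tiktoken package (pip install tiktoken); any BPE tokenizer shows
# the same effect, only with different absolute numbers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English prose": "The model learns statistical patterns from large text corpora.",
    "German prose": "Das Modell lernt statistische Muster aus großen Textkorpora.",
    "Python code": "def add(a, b):\n    return a + b",
}

for label, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{label:>13}: {len(text):3d} chars -> {n_tokens:2d} tokens "
          f"({len(text) / n_tokens:.1f} chars per token)")
```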