How model performance surged from 2017 to 2025 – from the Transformer to o3 across the major benchmarks
This benchmark timeline documents the rapid progress of LLMs: between 2020, when MMLU was introduced, and 2025, scores rose from near-chance levels (under 30% for early models, against a 25% random-guessing baseline) to over 90%. The timeline traces the milestones and explains why benchmarks must constantly be replaced with harder ones.
Scaling & Complexity (1/2) documents this measured progress; part (2/2) then turns to emergent capabilities.
Benchmarks are the yardstick of LLM progress. But once models score above 90% on a test, it no longer discriminates between them, and new, harder tests are needed. This saturation dynamic shapes the research agenda; the sketch after the table illustrates it.
| Model | Release | Parameters | MMLU | ARC | Math | Notable Feature |
|---|---|---|---|---|---|---|
| Transformer | 2017 | - | - | - | - | Architecture foundation |
| BERT | 2018 | 340M | 77.3% | 64.6% | - | Encoder-Only |
| GPT-3 175B | 2020 | 175B | 54.9% | 51.4% | 2% | In-Context Learning |
| LLaMA 2 70B | 2023 | 70B | 63.9% | 68.2% | 28.7% | Open-Source |
| GPT-4 | 2023 | ~1.8T | 86.4% | 92.3% | 49.9% | MoE, Multimodal |
| Claude 3.5 | 2024 | ~175B | 88.3% | 94.2% | 58% | Constitutional AI |
| Llama 3.1 405B | 2024 | 405B | 85.9% | 92.3% | 53.3% | Dense, Open |
| o3 | Apr 2025 | undisclosed | 92.3% | 96.1% | 96.4% | Test-Time Compute |
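
To make the saturation dynamic concrete, here is a minimal Python sketch that replays the table and reports when each benchmark first crosses a 90% threshold. The 90% cutoff and the `first_saturation` helper are illustrative assumptions, not a formal definition; the scores are simply the rounded values from the rows above (the Transformer and BERT rows are omitted for brevity).

```python
# Minimal sketch of the benchmark-saturation dynamic described above.
# The 90% threshold is an illustrative assumption, not a formal definition.
SATURATION = 0.90

# (model, year, {benchmark: score}) -- rounded values from the table above
RESULTS = [
    ("GPT-3 175B",     2020, {"MMLU": 0.549, "ARC": 0.514, "Math": 0.020}),
    ("LLaMA 2 70B",    2023, {"MMLU": 0.639, "ARC": 0.682, "Math": 0.287}),
    ("GPT-4",          2023, {"MMLU": 0.864, "ARC": 0.923, "Math": 0.499}),
    ("Claude 3.5",     2024, {"MMLU": 0.883, "ARC": 0.942, "Math": 0.580}),
    ("Llama 3.1 405B", 2024, {"MMLU": 0.859, "ARC": 0.923, "Math": 0.533}),
    ("o3",             2025, {"MMLU": 0.923, "ARC": 0.961, "Math": 0.964}),
]

def first_saturation(benchmark: str) -> str:
    """Return the first model/year whose score crosses the threshold."""
    for model, year, scores in RESULTS:
        if scores.get(benchmark, 0.0) >= SATURATION:
            return f"{benchmark}: saturated by {model} ({year})"
    best = max(r[2].get(benchmark, 0.0) for r in RESULTS)
    return f"{benchmark}: not yet saturated (best score {best:.1%})"

for bench in ("MMLU", "ARC", "Math"):
    print(first_saturation(bench))
# MMLU: saturated by o3 (2025)
# ARC: saturated by GPT-4 (2023)
# Math: saturated by o3 (2025)
```

Run on these numbers, the output mirrors the narrative: ARC-style accuracy crossed 90% with GPT-4 as early as 2023, while MMLU and math scores only crossed it with o3's test-time compute approach in 2025.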