MMLU Score (General Knowledge) Over Time

Fig. 1 | Bubble chart: x = release date, y = MMLU score, bubble size = parameter count; series: OpenAI (GPT), Meta (LLaMA), Anthropic (Claude), DeepSeek. Trend: exponential growth 2017-2023, then a plateau on knowledge benchmarks; reasoning models (o3) begin a new ascent.
| Model | Release | Parameters | MMLU | ARC | MATH | Notable feature |
|---|---|---|---|---|---|---|
| Transformer | 2017 | – | – | – | – | Architecture foundation |
| BERT | 2018 | 340M | 77.3% | 64.6% | – | Encoder-only |
| GPT-3 175B | 2020 | 175B | 54.9% | 51.4% | 2% | In-context learning |
| LLaMA 2 70B | 2023 | 70B | 63.9% | 68.2% | 28.7% | Open source |
| GPT-4 | 2023 | ~1.8T | 86.4% | 92.3% | 49.9% | MoE, multimodal |
| Claude 3.5 | 2024 | ~175B | 88.3% | 94.2% | 58% | Constitutional AI |
| Llama 3.1 405B | 2024 | 405B | 85.9% | 92.3% | 53.3% | Dense, open |
| o3 | 2025 (April) | ? | 92.3% | 96.1% | 96.4% | Test-time compute |
📈
Exponential Growth 2017-2023
MMLU rose from ~55% (GPT-3, 2020) to 86% (GPT-4, 2023) in three years. On a log plot the trend resembles a power law: using the table's own endpoints, roughly 9-10 MMLU points per parameter doubling.
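As a sanity check on that slope, a minimal sketch using the GPT-3 and GPT-4 rows from the table above (GPT-4's ~1.8T parameter count is itself only an estimate):

```python
import math

# MMLU scores and parameter counts from the table (GPT-4 size is an estimate)
gpt3_params, gpt3_mmlu = 175e9, 54.9
gpt4_params, gpt4_mmlu = 1.8e12, 86.4

doublings = math.log2(gpt4_params / gpt3_params)  # parameter doublings between the two models
gain_per_doubling = (gpt4_mmlu - gpt3_mmlu) / doublings

print(f"{doublings:.2f} doublings, {gain_per_doubling:.1f} MMLU points per doubling")
```

Two data points are of course no regression; a proper fit would use every model in the table, but the order of magnitude is what matters for the trend claim.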
⏸️
Knowledge Plateau at 90%
Claude 3.5 scores 88%, o3 scores 92%. MMLU appears to saturate in the 90-95% range; further progress requires new benchmarks or reasoning-focused evaluation.
🧠
Reasoning Models Break Math
GPT-4 scored 49.9% on MATH; o3 reaches 96.4%. The jump comes not from more parameters but from test-time compute (RL-trained reasoning plus verification), the defining trend of 2025.
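The training details of o1/o3 are not public, so the following is only a toy sketch of the general test-time-compute idea: sample many candidate answers and keep one that an external verifier accepts. Both `solve_once` and `verify` are hypothetical stand-ins, not anyone's real API.

```python
import random

def solve_once(problem, rng):
    # Stand-in for one stochastic model sample: here, just a random guess.
    return rng.randint(0, 20)

def verify(problem, answer):
    # Stand-in verifier: checks a candidate against ground truth.
    # Real systems use learned verifiers or exact checkers (e.g. unit tests).
    return answer == problem["answer"]

def best_of_n(problem, n, seed=0):
    """Spend more test-time compute by sampling up to n candidates
    and returning the first one the verifier accepts."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = solve_once(problem, rng)
        if verify(problem, candidate):
            return candidate
    return None  # no verified answer within the compute budget

problem = {"question": "7 + 6", "answer": 13}
print(best_of_n(problem, n=1))    # tiny budget: often fails (None)
print(best_of_n(problem, n=100))  # larger budget: usually succeeds
```

The point of the sketch is the scaling axis itself: accuracy grows with `n` (the sampling budget) rather than with model size, which is what the GPT-4 → o3 jump on MATH illustrates.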
🔓
Open-Source Catching Up
Llama 2 70B (2023) trailed GPT-4 by a wide margin; Llama 3.1 405B (2024) is nearly even on MMLU (85.9% vs 86.4%). Frontier-class quality is now available as open weights, runnable on self-hosted hardware.
💎
Smaller ≠ Worse Anymore
Claude 3.5 (estimated ~175B) scores 88.3% on MMLU, ahead of the 405B-parameter Llama 3.1 at 85.9%. In 2024, training design and data quality beat raw parameter count.
🚀
Next Frontier: Reasoning
o1 and o3 show that test-time compute is the new scaling axis. MMLU may be saturated, but math, code, and reasoning benchmarks are still improving rapidly.