Model Performance Scores

[Figure: performance trend chart with an accompanying table of Model, Score, Release, Parameters, and Type; data not reproduced here]

AIME: Test of Mathematical Reasoning

American Invitational Mathematics Exam: 30 problems per year (two 15-problem exams), each with an integer answer from 0 to 999. Difficulty sits just below olympiad level, as the qualifier for the USA Mathematical Olympiad, and it is widely treated as one of the strongest indicators of genuine mathematical reasoning. Reported scores: GPT-4: 94%, Claude 4.5: 96%, DeepSeek-R1: 98%.
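
Because AIME answers are always integers from 0 to 999, scoring a run reduces to exact-match on the model's final integer. Below is a minimal sketch under that assumption; the function names and the answer-extraction heuristic are ours, not part of any official harness.

```python
import re

def extract_aime_answer(model_output: str) -> int | None:
    """Pull the last integer in the 0-999 range from a model's output.
    AIME answers are always integers in that range, so exact match is the rule."""
    candidates = [int(m) for m in re.findall(r"\d+", model_output) if int(m) <= 999]
    return candidates[-1] if candidates else None

def aime_accuracy(predictions: list[str], gold: list[int]) -> float:
    """Fraction of problems answered exactly right; no partial credit,
    mirroring how the 30 yearly problems (two 15-problem exams) are scored."""
    correct = sum(extract_aime_answer(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy run: two problems, one answered correctly.
print(aime_accuracy(["... so the answer is 204", "I get 17"], [204, 16]))  # 0.5
```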

ELAIPBench: Enterprise-Ready Evaluation

A new evaluation platform for enterprise LLM deployments. Focus: production readiness rather than raw benchmark numbers. It evaluates reliability, safety, and cost-efficiency, which tends to yield more realistic scores than toy benchmarks.
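
The text names three axes but not how ELAIPBench combines them. Purely as an illustration, here is one way a composite production-readiness score could be aggregated; the dataclass, field definitions, and weights are our assumptions, not ELAIPBench's published methodology.

```python
from dataclasses import dataclass

@dataclass
class EnterpriseEvalResult:
    reliability: float      # e.g. fraction of runs with valid, schema-conforming output
    safety: float           # e.g. fraction of red-team prompts handled safely
    cost_efficiency: float  # e.g. quality-per-dollar, normalized to [0, 1]

def composite_score(r: EnterpriseEvalResult,
                    weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Weighted aggregate of the three axes named in the text.
    The weights are illustrative; a real benchmark would publish its own."""
    w_rel, w_safe, w_cost = weights
    return w_rel * r.reliability + w_safe * r.safety + w_cost * r.cost_efficiency

print(composite_score(EnterpriseEvalResult(0.97, 0.99, 0.6)))  # 0.904
```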

ThinkBench: Reasoning Capacity

A 2025 benchmark aimed specifically at test-time reasoning. It measures the quality, speed, and efficiency of chain-of-thought and is designed for GRPO/RL-trained reasoning models such as Claude 4.5, GPT-5.1, and DeepSeek-R1.
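
To make "quality, speed, and efficiency of chain-of-thought" concrete, here is a sketch of how the three could be measured together. The `run_model(problem) -> (answer, n_reasoning_tokens)` interface and the metric names are assumptions for illustration, not ThinkBench's actual API.

```python
import time

def reasoning_efficiency(run_model, problems, answers):
    """Aggregate accuracy, latency, and token efficiency for a reasoning model.
    `run_model(problem)` is assumed to return (final_answer, n_reasoning_tokens)."""
    correct, tokens = 0, 0
    start = time.perf_counter()
    for problem, gold in zip(problems, answers):
        answer, n_tokens = run_model(problem)
        correct += (answer == gold)
        tokens += n_tokens
    elapsed = time.perf_counter() - start
    accuracy = correct / len(problems)
    return {
        "accuracy": accuracy,
        "seconds_per_problem": elapsed / len(problems),
        "accuracy_per_1k_reasoning_tokens": 1000 * accuracy / max(tokens, 1),
    }

# Demo with a stub "model" that always answers 42 using 128 reasoning tokens.
print(reasoning_efficiency(lambda p: (42, 128), ["q1", "q2"], [42, 7]))
```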

JustLogic: Pure Logic Puzzles

Pure logic puzzles that require no domain knowledge, so they test raw reasoning rather than memorized facts. Also used to benchmark consistency and self-correction across repeated runs.
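
One plain way to quantify "consistency across iterations" is majority agreement over repeated runs of the same puzzle. The metric below is an illustration, not JustLogic's published definition.

```python
from collections import Counter

def consistency_rate(samples_per_item: list[list[str]]) -> float:
    """Each inner list holds the answers from k independent runs on one puzzle.
    Consistency = average share of runs agreeing with the majority answer,
    so 1.0 means the model never contradicts itself across iterations."""
    per_item = []
    for samples in samples_per_item:
        top_count = Counter(samples).most_common(1)[0][1]
        per_item.append(top_count / len(samples))
    return sum(per_item) / len(per_item)

# Two puzzles, three runs each: fully consistent vs. one contradiction.
print(consistency_rate([["valid", "valid", "valid"],
                        ["valid", "invalid", "valid"]]))  # ~0.83
```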

Benchmarks Change Over Time

Data leakage is a real problem: benchmark items end up in training data, so models can score well by memorization. Hence the wave of new 2024-2025 benchmarks (ThinkBench, ELAIPBench) built on fresh data, while older benchmarks (ARC, MMLU) have become less informative.
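
A common way to check for leakage is word n-gram overlap between benchmark items and candidate training documents. This is a generic contamination check, not tied to any specific benchmark; the 8-gram window is a conventional but arbitrary choice.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Sliding word n-grams; 8-grams are a common choice for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_overlap(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in a training
    document. High overlap suggests the item may have leaked into training data."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

item = "find the remainder when the sum of the first 100 primes is divided by 7"
doc = item + " solution discussion follows"
print(contamination_overlap(item, doc))  # 1.0 -> likely leaked
```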

Future: Custom Benchmarks

The trend is toward companies building custom benchmarks for their own use cases (e.g., legal reasoning, medical diagnosis). Generic benchmarks become less relevant as evaluation shifts to domain-specific test sets.
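
In practice, a custom benchmark can be as small as a list of domain prompts plus a grading rule. The sketch below is purely illustrative: the legal example, the `grader` checks, and the `ask_model` callable are hypothetical stand-ins for whatever client and test set a company actually uses.

```python
from typing import Callable

# A domain-specific eval: hand-written cases, each with its own grading rule.
LEGAL_CASES = [
    {"prompt": "Under the GDPR, what is the maximum fine for the most serious infringements?",
     "grader": lambda out: "4%" in out or "20 million" in out},
]

def run_custom_benchmark(ask_model: Callable[[str], str], cases) -> float:
    """Accuracy of a model on a hand-built, domain-specific test set."""
    passed = sum(case["grader"](ask_model(case["prompt"])) for case in cases)
    return passed / len(cases)

# Demo with a stub model that returns a correct answer.
print(run_custom_benchmark(
    lambda p: "Up to 20 million euros or 4% of global annual turnover, whichever is higher.",
    LEGAL_CASES))  # 1.0
```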