AIME 2025
AIME 2025 is a competition-math benchmark that cleanly separates reasoning-tier models from non-reasoning models. Solutions require multi-step deduction: the top reasoning models with chain-of-thought scaffolding clear 90%, while raw-text non-reasoning models cluster around 50–60%. That gap makes the score a useful "is this a thinking model" signal.
Category: math · Source: artificialanalysis.ai ↗
| # | Model (vendor) | Score | Setting | Measured | Source |
|---|---|---|---|---|---|
| 01 | | 96.1% | CoT Deep Think | May 20, 2026 | link ↗ |
| 02 | OpenAI | 94.2% | CoT max-thinking | Apr 23, 2026 | link ↗ |
| 03 | Anthropic | 93.5% | CoT max-thinking | Apr 16, 2026 | link ↗ |
| 04 | | 92.8% | CoT Deep Think | Apr 22, 2026 | link ↗ |
| 05 | OpenAI | 91.0% | CoT max-thinking | Aug 7, 2025 | link ↗ |
| 06 | Alibaba | 90.4% | CoT extended | May 20, 2026 | link ↗ |
| 07 | OpenAI | 88.9% | CoT high-effort | Apr 16, 2025 | link ↗ |
| 08 | DeepSeek | 88.6% | CoT extended | Apr 22, 2026 | link ↗ |
| 09 | Anthropic | 87.4% | CoT extended | Feb 17, 2026 | link ↗ |
| 10 | xAI | 86.1% | CoT | Jul 9, 2025 | link ↗ |
| 11 | DeepSeek | 79.8% | CoT | Jan 20, 2025 | link ↗ |
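
As a rough illustration of the "is this a thinking model" signal, the 90% and 50–60% figures from the description can be turned into a simple threshold check. This is a minimal sketch, not the leaderboard's own method: the names `REASONING_FLOOR`, `NON_REASONING_BAND`, and `tier_signal` are hypothetical, and treating those figures as hard cutoffs is an assumption. The scores are copied from the table above, keyed by rank.

```python
from collections import Counter

# Scores (percent) copied from the table above, keyed by rank.
scores = {
    1: 96.1, 2: 94.2, 3: 93.5, 4: 92.8, 5: 91.0, 6: 90.4,
    7: 88.9, 8: 88.6, 9: 87.4, 10: 86.1, 11: 79.8,
}

# Assumed cutoffs, taken from the description's "clear 90%" and "50-60%".
REASONING_FLOOR = 90.0
NON_REASONING_BAND = (50.0, 60.0)


def tier_signal(score: float) -> str:
    """Map an AIME 2025 score to a coarse tier label (hypothetical helper)."""
    if score >= REASONING_FLOOR:
        return "reasoning-tier"
    low, high = NON_REASONING_BAND
    if low <= score <= high:
        return "non-reasoning"
    # Anything else falls between the two figures quoted in the description.
    return "ambiguous"


counts = Counter(tier_signal(s) for s in scores.values())
for label in ("reasoning-tier", "non-reasoning", "ambiguous"):
    print(f"{label}: {counts[label]}")
# reasoning-tier: 6
# non-reasoning: 0
# ambiguous: 5
```

Five of the eleven listed chain-of-thought entries land between the two figures, so the score works better as a coarse signal than as a strict classifier.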
Other model leaderboards
- gpt.buzz Composite Model Index — cross-benchmark ranking
- MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
- GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
- HumanEval — OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
- Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
- LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.
- MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).