AIME 2025

AIME 2025 is a competition-math benchmark that cleanly separates reasoning models from non-reasoning ones. Solutions require multi-step deduction: the strongest reasoning models with chain-of-thought scaffolding clear 90%, while non-reasoning models that answer directly cluster around 50–60%. That gap makes the score a useful "is this a thinking model" signal.

Category: math · Source: artificialanalysis.ai ↗

#	Model	Vendor	Score	Setting	Measured	Source
01	Gemini 3.5	Google	96.1%	CoT Deep Think	May 20, 2026	link ↗
02	GPT-5.5	OpenAI	94.2%	CoT max-thinking	Apr 23, 2026	link ↗
03	Claude 4.7 Opus	Anthropic	93.5%	CoT max-thinking	Apr 16, 2026	link ↗
04	Gemini 3 Pro	Google	92.8%	CoT Deep Think	Apr 22, 2026	link ↗
05	GPT-5	OpenAI	91.0%	CoT max-thinking	Aug 7, 2025	link ↗
06	Qwen3.7-Max	Alibaba	90.4%	CoT extended	May 20, 2026	link ↗
07	o3	OpenAI	88.9%	CoT high-effort	Apr 16, 2025	link ↗
08	DeepSeek-V4-Pro	DeepSeek	88.6%	CoT extended	Apr 22, 2026	link ↗
09	Claude 4.6 Sonnet	Anthropic	87.4%	CoT extended	Feb 17, 2026	link ↗
10	Grok 4	xAI	86.1%	CoT	Jul 9, 2025	link ↗
11	DeepSeek-R1	DeepSeek	79.8%	CoT	Jan 20, 2025	link ↗
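
The Score column is a percentage of AIME problems answered correctly. AIME answers are integers from 0 to 999, so grading can reduce to an exact match on the final number. The following is a minimal sketch of that idea, not the harness used for this leaderboard: the function names, the "take the last 1-3 digit integer" extraction rule, the averaging over repeated runs, and the sample data are all hypothetical assumptions for illustration.

```python
import re
from typing import List, Optional


def extract_answer(response: str) -> Optional[int]:
    """Return the last standalone 1-3 digit integer in the response, or None.

    Assumption: the model states its final answer last. Real harnesses
    often require an explicit marker instead of this heuristic.
    """
    matches = re.findall(r"\b\d{1,3}\b", response)
    return int(matches[-1]) if matches else None


def score_run(responses: List[str], answers: List[int]) -> float:
    """Percentage of problems whose extracted answer exactly matches the key."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    correct = sum(extract_answer(r) == a for r, a in zip(responses, answers))
    return 100.0 * correct / len(answers)


def mean_score(runs: List[List[str]], answers: List[int]) -> float:
    """Average the per-run percentage over several independent runs."""
    return sum(score_run(run, answers) for run in runs) / len(runs)


# Made-up answer key and responses, not real AIME 2025 data.
answer_key = [70, 588, 16]
runs = [
    ["so the answer is 70", "Final answer: 588", "I get 17."],
    ["70", "The result is 588.", "Answer: 016"],
]
result = mean_score(runs, answer_key)
print(f"{result:.1f}%")
```

Exact match on an integer keeps grading unambiguous, which is one reason competition-math sets like this are convenient to score automatically; the fragile part is answer extraction, so a stricter output format would be a sensible substitute for the regex heuristic.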

Other model leaderboards

gpt.buzz Composite Model Index: cross-benchmark ranking.
MMLU-Pro: Multitask language understanding across 14 disciplines. The harder, more contamination-resistant successor to MMLU, using 10-option multiple choice with stronger distractors.
GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
HumanEval: OpenAI's 164 Python programming problems with hidden unit tests. The classic code-generation pass@1 benchmark.
Aider Polyglot: Multi-language code-editing benchmark. Models solve real Exercism problems in Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
LiveCodeBench: Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
MMMU: Massive Multi-discipline Multimodal Understanding. 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).