gpt.buzz

AIME 2025

AIME 2025 is a competition-math benchmark that cleanly separates reasoning models from non-reasoning models. Its problems require multi-step deduction: the strongest reasoning models with chain-of-thought scaffolding clear 90%, while non-reasoning models answering directly cluster around 50–60%. That gap makes the score a useful "is this a thinking model" signal.
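
As a rough illustration of how a score of this kind might be computed: AIME answers are integers from 0 to 999, so grading reduces to exact match against an answer key. The sketch below assumes the reported percentage is the mean accuracy over several sampled runs; the page does not state the grading protocol, and the function names and the toy answer key are hypothetical.

```python
from statistics import mean


def aime_accuracy(predicted, reference):
    """Exact-match accuracy for one run: fraction of problems answered correctly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


def mean_accuracy(runs, reference):
    """Average per-run accuracy over several sampled runs (an assumed protocol)."""
    return mean(aime_accuracy(run, reference) for run in runs)


# Made-up 4-problem answer key and two made-up runs, for illustration only.
reference = [12, 345, 678, 9]
runs = [
    [12, 345, 678, 9],   # 4 of 4 correct
    [12, 345, 678, 10],  # 3 of 4 correct
]

print(f"{mean_accuracy(runs, reference):.1%}")  # 87.5%
```

Averaging over runs is one way a 30-problem benchmark could yield percentages that are not multiples of 1/30.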

Category: math · Source: artificialanalysis.ai

| # | Model | Provider | Score | Setting | Measured |
|---|---|---|---|---|---|
| 01 | Gemini 3.5 | Google | 96.1% | CoT Deep Think | May 20, 2026 |
| 02 | GPT-5.5 | OpenAI | 94.2% | CoT max-thinking | Apr 23, 2026 |
| 03 | Claude 4.7 Opus | Anthropic | 93.5% | CoT max-thinking | Apr 16, 2026 |
| 04 | Gemini 3 Pro | Google | 92.8% | CoT Deep Think | Apr 22, 2026 |
| 05 | GPT-5 | OpenAI | 91.0% | CoT max-thinking | Aug 7, 2025 |
| 06 | Qwen3.7-Max | Alibaba | 90.4% | CoT extended | May 20, 2026 |
| 07 | o3 | OpenAI | 88.9% | CoT high-effort | Apr 16, 2025 |
| 08 | DeepSeek-V4-Pro | DeepSeek | 88.6% | CoT extended | Apr 22, 2026 |
| 09 | Claude 4.6 Sonnet | Anthropic | 87.4% | CoT extended | Feb 17, 2026 |
| 10 | Grok 4 | xAI | 86.1% | CoT | Jul 9, 2025 |
| 11 | DeepSeek-R1 | DeepSeek | 79.8% | CoT | Jan 20, 2025 |

Other model leaderboards

  • gpt.buzz Composite Model Index: cross-benchmark ranking.
  • MMLU-Pro: Multitask language understanding across 14 disciplines; the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
  • GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
  • HumanEval: OpenAI's 164 Python programming problems with hidden unit tests; the classic code-generation pass@1 benchmark.
  • Aider Polyglot: Multi-language code edit benchmark. Models solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
  • LiveCodeBench: Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
  • MMMU: Massive Multi-discipline Multimodal Understanding; 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).