MMLU-Pro
MMLU-Pro extends the original MMLU to 12,000+ questions across 14 disciplines with 10 answer options per question (up from 4) and stronger distractors that resist guessing. It's the de facto "general knowledge + reasoning" model benchmark for 2025–2026: GPT-5.5 and Gemini 3 Pro both sit above 86%, while open-weight DeepSeek-V4-Pro reached 84% at a fraction of the training compute.
Category: general knowledge · Source: huggingface.co ↗
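To make the "resists guessing" point concrete, here is a minimal sketch of the random-guess baseline. It assumes every question has exactly one correct answer and the full option count described above (4 for MMLU, 10 for MMLU-Pro); `chance_accuracy` is an illustrative helper, not part of any benchmark tooling.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on a
    single-answer multiple-choice question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Option counts taken from the description above (assumed uniform).
mmlu_chance = chance_accuracy(4)       # original MMLU: 4 options
mmlu_pro_chance = chance_accuracy(10)  # MMLU-Pro: 10 options

print(f"MMLU chance baseline:     {mmlu_chance:.0%}")      # 25%
print(f"MMLU-Pro chance baseline: {mmlu_pro_chance:.0%}")  # 10%
```

Under these assumptions the guessing floor drops from 25% to 10%, so a given score on MMLU-Pro reflects less luck than the same score on the original MMLU.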
| # | Model | Score | Setting | Measured | Source |
|---|---|---|---|---|---|
| 01 | | 88.9% | 5-shot CoT, Deep Think | May 20, 2026 | link ↗ |
| 02 | Anthropic | 88.2% | 5-shot CoT | Apr 16, 2026 | link ↗ |
| 03 | OpenAI | 87.6% | 5-shot CoT | Apr 23, 2026 | link ↗ |
| 04 | | 87.1% | 5-shot CoT | Apr 22, 2026 | link ↗ |
| 05 | Anthropic | 85.4% | 5-shot CoT | Feb 17, 2026 | link ↗ |
| 06 | OpenAI | 85.1% | 5-shot CoT | Aug 7, 2025 | link ↗ |
| 07 | DeepSeek | 84.2% | 5-shot CoT | Apr 22, 2026 | link ↗ |
| 08 | | 84.1% | 5-shot CoT | Jun 17, 2025 | link ↗ |
| 09 | Alibaba | 83.7% | 5-shot CoT | May 20, 2026 | link ↗ |
| 10 | xAI | 82.8% | 5-shot CoT | Jul 9, 2025 | link ↗ |
| 11 | Meta | 80.5% | 5-shot CoT | Apr 5, 2025 | link ↗ |
| 12 | Alibaba | 78.9% | 5-shot CoT | Apr 22, 2026 | link ↗ |
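The "5-shot CoT" setting in the table means the model reasons in free text before committing to a letter, so a score depends on pulling the final choice out of each response. The sketch below shows one way that could work. The "answer is (X)" phrasing, the regex, and the choice to count unparseable responses as wrong are all assumptions for illustration, as are the sample responses; the vendors' actual evaluation harnesses may differ.

```python
import re
from typing import Optional, Sequence

# Assumed convention: the response ends its reasoning with
# "the answer is (X)" or "the answer is X", where X is one of A-J.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.
    Responses with no parseable answer count as incorrect (an assumption)."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Hypothetical model outputs and gold labels, for illustration only.
responses = [
    "Let's think step by step. Only option C balances. The answer is (C).",
    "Summing the forces gives 12 N, so the answer is H",
    "I am not sure which option is right.",
]
gold = ["C", "H", "B"]

print(f"{accuracy(responses, gold):.1%}")  # 66.7%
```

Because extraction rules vary between harnesses, small score differences between rows may partly reflect parsing choices and not only model quality.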
Other model leaderboards
- gpt.buzz Composite Model Index — cross-benchmark ranking
- GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
- HumanEval — OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
- Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
- AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
- LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.
- MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).