MMLU-Pro

MMLU-Pro extends the original MMLU to more than 12,000 questions across 14 disciplines, with 10 answer options per question (up from 4) and stronger distractors that resist guessing. It's the de facto "general knowledge + reasoning" model benchmark for 2025–2026: GPT-5.5 and Gemini 3 Pro both sit above 86%, while open-weight DeepSeek-V4-Pro reached 84% at a fraction of the training compute.
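The effect of moving from 4 to 10 options on guessing can be worked out directly. The sketch below assumes uniform random guessing over the options, which is a simplification; the function name `chance_accuracy` is illustrative and not part of any benchmark tooling.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy when guessing uniformly at random (an assumption)."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


mmlu = chance_accuracy(4)       # original MMLU: 4 options per question
mmlu_pro = chance_accuracy(10)  # MMLU-Pro: 10 options per question

print(f"MMLU chance baseline:     {float(mmlu):.0%}")      # 25%
print(f"MMLU-Pro chance baseline: {float(mmlu_pro):.0%}")  # 10%
```

Under that assumption the guessing floor drops from 25% to 10%, so a given score on MMLU-Pro sits further above chance than the same score on MMLU.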

Category: general knowledge · Source: huggingface.co ↗

#	Model	Developer	Score	Setting	Measured	Source
01	Gemini 3.5	Google	88.9%	5-shot CoT, Deep Think	May 20, 2026	link ↗
02	Claude 4.7 Opus	Anthropic	88.2%	5-shot CoT	Apr 16, 2026	link ↗
03	GPT-5.5	OpenAI	87.6%	5-shot CoT	Apr 23, 2026	link ↗
04	Gemini 3 Pro	Google	87.1%	5-shot CoT	Apr 22, 2026	link ↗
05	Claude 4.6 Sonnet	Anthropic	85.4%	5-shot CoT	Feb 17, 2026	link ↗
06	GPT-5	OpenAI	85.1%	5-shot CoT	Aug 7, 2025	link ↗
07	DeepSeek-V4-Pro	DeepSeek	84.2%	5-shot CoT	Apr 22, 2026	link ↗
08	Gemini 2.5 Pro	Google	84.1%	5-shot CoT	Jun 17, 2025	link ↗
09	Qwen3.7-Max	Alibaba	83.7%	5-shot CoT	May 20, 2026	link ↗
10	Grok 4	xAI	82.8%	5-shot CoT	Jul 9, 2025	link ↗
11	Llama 4 Maverick	Meta	80.5%	5-shot CoT	Apr 5, 2025	link ↗
12	Qwen3.6-27B	Alibaba	78.9%	5-shot CoT	Apr 22, 2026	link ↗
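The Score column is an accuracy over chain-of-thought (CoT) responses. Here is a rough sketch of how such a score could be computed. The "answer is (X)" phrasing, the regular expression, and the helper names `extract_choice` and `accuracy` are assumptions for illustration, not the harness used for the numbers above.

```python
import re
from typing import Optional, Sequence

# Assumed convention: the response ends its reasoning with "answer is (X)",
# where X is one of the ten option letters A-J.
ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?")


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter stated in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter matches the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Hypothetical responses, invented for this example only.
responses = [
    "Step 1 ... Step 2 ... Therefore the answer is (C).",
    "Comparing the options, the answer is J",
    "I am not sure which option is correct.",
]
gold = ["C", "J", "A"]

print(f"{accuracy(responses, gold):.1%}")  # 66.7%
```

In this sketch a response with no extractable letter counts as wrong, and the last stated answer wins if the reasoning mentions several. Both are design choices that a real evaluation harness may handle differently.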

Other model leaderboards

gpt.buzz Composite Model Index — cross-benchmark ranking
GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
HumanEval — OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.
MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).