MMMU

MMMU tests multimodal reasoning at college-exam level: image-plus-text questions across 30 subjects, from art history to medical diagnostics. Gemini 3.5 leads at 86.0%, ahead of Gemini 3 Pro at 84.3%, with Claude 4.7 Opus and GPT-5.5 in the high 70s. It is the cleanest signal for whether a model is genuinely multimodal or just text with vision grafted on.

Category: multimodal · Source: mmmu-benchmark.github.io

#	Model	Vendor	Score	Setting	Measured
01	Gemini 3.5	Google	86.0%	val	May 20, 2026
02	Gemini 3 Pro	Google	84.3%	val	Apr 22, 2026
03	Gemini Omni	Google	82.5%	val	May 20, 2026
04	Claude 4.7 Opus	Anthropic	79.0%	val	Apr 16, 2026
05	GPT-5.5	OpenAI	78.4%	val	Apr 23, 2026
06	GPT-5	OpenAI	74.1%	val	Aug 7, 2025
07	Gemini 2.5 Pro	Google	72.4%	val	Jun 17, 2025
08	Qwen3.6-27B	Alibaba	70.3%	val	Apr 22, 2026
09	Llama 4 Maverick	Meta	68.5%	val	Apr 5, 2025
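
For readers wondering what a percentage in the Score column represents, here is a minimal sketch. It assumes the score is plain accuracy, meaning correct answers divided by total questions on the val split. The function name `accuracy_percent` and the toy answers are hypothetical illustrations, not the official MMMU evaluation code or real benchmark items.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the share of exact-match answers as a percentage (one decimal)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return round(100.0 * correct / len(answers), 1)


# Hypothetical toy data: five multiple-choice questions, not real MMMU items.
gold = ["A", "C", "B", "D", "A"]
model_output = ["A", "C", "D", "D", "a"]

score = accuracy_percent(model_output, gold)
print(f"{score}%")
```

Real evaluations also have to extract the chosen option from free-form model output and handle open-ended questions, which this sketch leaves out.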

Other model leaderboards

gpt.buzz Composite Model Index — cross-benchmark ranking
MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. Diamond is the hardest subset of GPQA; experts who have or are pursuing PhDs in the field score about 65% on GPQA overall.
HumanEval — OpenAI's 164 hand-written Python programming problems, each checked by unit tests. The classic code-generation pass@1 benchmark.
Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.