gpt.buzz

MMMU

MMMU tests multimodal reasoning at college-exam level: image + text questions across 30 subjects, from art history to medical diagnostics. Gemini 3.5 leads at 86.0%, ahead of Gemini 3 Pro at 84.3%, with Claude 4.7 Opus and GPT-5.5 in the high 70s. It is the cleanest signal for whether a model is genuinely multimodal or just text with vision grafted on.

Category: multimodal · Source: mmmu-benchmark.github.io

| #  | Model            | Developer | Score | Setting | Measured     |
|----|------------------|-----------|-------|---------|--------------|
| 01 | Gemini 3.5       | Google    | 86.0% | val     | May 20, 2026 |
| 02 | Gemini 3 Pro     | Google    | 84.3% | val     | Apr 22, 2026 |
| 03 | Gemini Omni      | Google    | 82.5% | val     | May 20, 2026 |
| 04 | Claude 4.7 Opus  | Anthropic | 79.0% | val     | Apr 16, 2026 |
| 05 | GPT-5.5          | OpenAI    | 78.4% | val     | Apr 23, 2026 |
| 06 | GPT-5            | OpenAI    | 74.1% | val     | Aug 7, 2025  |
| 07 | Gemini 2.5 Pro   | Google    | 72.4% | val     | Jun 17, 2025 |
| 08 | Qwen3.6-27B      | Alibaba   | 70.3% | val     | Apr 22, 2026 |
| 09 | Llama 4 Maverick | Meta      | 68.5% | val     | Apr 5, 2025  |

Other model leaderboards

  • gpt.buzz Composite Model Index: cross-benchmark ranking.
  • MMLU-Pro: Multitask language understanding across 14 disciplines. The harder, contamination-resistant successor to MMLU, using 10-option multiple choice with stronger distractors.
  • GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
  • HumanEval: OpenAI's 164 Python programming problems checked by unit tests. The classic code-generation pass@1 benchmark.
  • Aider Polyglot: Multi-language code edit benchmark. Models solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
  • AIME 2025: American Invitational Mathematics Examination. Competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning models in the 50s.
  • LiveCodeBench: Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
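
For readers unfamiliar with the pass@1 metric mentioned under HumanEval, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The function names and the `(n, c)` input layout are illustrative assumptions, not anything gpt.buzz or the benchmark harness is known to use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds hypothetical (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 and a single sample per problem, this reduces to the plain fraction of problems solved, which is how a pass@1 score is usually read.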