gpt.buzz

MMLU-Pro

MMLU-Pro extends the original MMLU to 12,000+ questions across 14 disciplines, with 10 answer options per question (up from 4) and stronger distractors that resist guessing. It's the de facto "general knowledge + reasoning" model benchmark for 2025–2026: GPT-5.5 and Gemini 3 Pro both sit above 86%, while open-weight DeepSeek-V4-Pro reached 84% at a fraction of the training compute.
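To make the "resists guessing" point concrete, here is a minimal sketch of the random-guess floor for 4 versus 10 options. It assumes every question has exactly the stated number of options and a single correct answer. The `chance_corrected` rescaling is an illustrative helper introduced here, not a metric this leaderboard reports.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single-answer question."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


def chance_corrected(score: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0.0 and a perfect score to 1.0.

    Hypothetical helper for illustration only; assumes num_options >= 2.
    """
    chance = float(chance_accuracy(num_options))
    return (score - chance) / (1.0 - chance)


if __name__ == "__main__":
    # Original MMLU: 4 options, so guessing alone yields 1/4.
    print("MMLU chance floor:    ", chance_accuracy(4))
    # MMLU-Pro: 10 options, so guessing alone yields 1/10.
    print("MMLU-Pro chance floor:", chance_accuracy(10))
    # A reported score of 84.2% on a 10-option test, with the guess floor removed.
    print("Chance-corrected 84.2%:", round(chance_corrected(0.842, 10), 3))
```

Moving from 4 to 10 options lowers the guessing floor from 25% to 10%, so a given raw score on MMLU-Pro reflects less luck than the same score on the original MMLU.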

Category: general knowledge · Source: huggingface.co

| #  | Model                        | Score | Setting                | Measured     | Source |
|----|------------------------------|-------|------------------------|--------------|--------|
| 01 | Gemini 3.5 (Google)          | 88.9% | 5-shot CoT, Deep Think | May 20, 2026 | link ↗ |
| 02 | Claude 4.7 Opus (Anthropic)  | 88.2% | 5-shot CoT             | Apr 16, 2026 | link ↗ |
| 03 | GPT-5.5 (OpenAI)             | 87.6% | 5-shot CoT             | Apr 23, 2026 | link ↗ |
| 04 | Gemini 3 Pro (Google)        | 87.1% | 5-shot CoT             | Apr 22, 2026 | link ↗ |
| 05 | Claude 4.6 Sonnet (Anthropic)| 85.4% | 5-shot CoT             | Feb 17, 2026 | link ↗ |
| 06 | GPT-5 (OpenAI)               | 85.1% | 5-shot CoT             | Aug 7, 2025  | link ↗ |
| 07 | DeepSeek-V4-Pro (DeepSeek)   | 84.2% | 5-shot CoT             | Apr 22, 2026 | link ↗ |
| 08 | Gemini 2.5 Pro (Google)      | 84.1% | 5-shot CoT             | Jun 17, 2025 | link ↗ |
| 09 | Qwen3.7-Max (Alibaba)        | 83.7% | 5-shot CoT             | May 20, 2026 | link ↗ |
| 10 | Grok 4 (xAI)                 | 82.8% | 5-shot CoT             | Jul 9, 2025  | link ↗ |
| 11 | Llama 4 Maverick (Meta)      | 80.5% | 5-shot CoT             | Apr 5, 2025  | link ↗ |
| 12 | Qwen3.6-27B (Alibaba)        | 78.9% | 5-shot CoT             | Apr 22, 2026 | link ↗ |

Other model leaderboards

  • gpt.buzz Composite Model Index: cross-benchmark ranking.
  • GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
  • HumanEval: OpenAI's 164 Python programming problems with hidden unit tests. The classic code-generation pass@1 benchmark.
  • Aider Polyglot: multi-language code edit benchmark. Models solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
  • AIME 2025: American Invitational Mathematics Examination. Competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning models in the 50s.
  • LiveCodeBench: contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
  • MMMU: Massive Multi-discipline Multimodal Understanding. 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).