gpt.buzz Model Index

Composite capability ranking across 7 benchmarks — MMLU-Pro, GPQA Diamond, AIME 2025, Aider Polyglot, LiveCodeBench, HumanEval, and MMMU. Weighted to favor benchmarks that resist contamination (Aider, LiveCodeBench) and broad reasoning coverage (GPQA, MMLU-Pro).

Models with no score on any tracked benchmark are omitted; a dash in the table marks a benchmark with no reported score. See the methodology page for the exact weights.

#	Model	Vendor	Composite	MMLU-Pro	GPQA-D	HumanEval	Aider	AIME-25	LiveCB	MMMU
01	Claude 4.7 Opus	Anthropic	98.5	88.2	90.1	97.4	91.2	93.5	80.2	79.0
02	GPT-5.5	OpenAI	97.8	87.6	89.4	96.8	89.7	94.2	78.9	78.4
03	Gemini 3 Pro	Google	96.5	87.1	88.6	96.5	84.8	92.8	76.1	84.3
04	GPT-5	OpenAI	94.3	85.1	87.3	95.2	85.4	91.0	74.5	74.1
05	Claude 4.6 Sonnet	Anthropic	86.2	85.4	85.8	95.9	83.6	87.4	73.8	—
06	DeepSeek-V4-Pro	DeepSeek	84.4	84.2	82.4	95.1	80.1	88.6	72.4	—
07	Qwen3.7-Max	Alibaba	84.1	83.7	83.0	93.9	78.4	90.4	71.0	—
08	Grok 4	xAI	74.5	82.8	87.5	—	72.5	86.1	65.3	—
09	Gemini 3.5	Google	63.0	88.9	91.2	—	—	96.1	—	86.0
10	Qwen3.6-27B	Alibaba	44.9	78.9	—	91.5	70.2	—	—	70.3
11	Gemini 2.5 Pro	Google	44.1	84.1	84.0	—	—	—	—	72.4
12	o3	OpenAI	33.1	—	87.7	—	—	88.9	—	—
13	Llama 4 Maverick	Meta	24.5	80.5	—	—	—	—	—	68.5
14	DeepSeek-R1	DeepSeek	12.5	—	—	—	—	79.8	—	—
15	Gemini Omni	Google	7.7	—	—	—	—	—	—	82.5
16	Mistral Large 2	Mistral	7.3	—	—	89.0	—	—	—	—
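
The Composite column above can be reproduced with a simple rule: divide each score by the best score in its column, multiply by a per-benchmark weight, and treat a missing score as zero. The sketch below is a reconstruction, not the published method. The weights in `ASSUMED_WEIGHTS` were back-fitted from the table rows and are an assumption; the methodology page remains the authority for the exact values. The function name `composite` is illustrative.

```python
# ASSUMPTION: weights (in composite points) back-fitted from the table rows.
# They are not taken from the methodology page.
ASSUMED_WEIGHTS = {
    "MMLU-Pro": 20.0,
    "GPQA-D": 20.0,
    "Aider": 17.0,
    "AIME-25": 15.0,
    "LiveCB": 12.0,
    "HumanEval": 8.0,
    "MMMU": 8.0,
}

# Highest score in each benchmark column of the table.
COLUMN_BEST = {
    "MMLU-Pro": 88.9,
    "GPQA-D": 91.2,
    "HumanEval": 97.4,
    "Aider": 91.2,
    "AIME-25": 96.1,
    "LiveCB": 80.2,
    "MMMU": 86.0,
}


def composite(scores):
    """Weighted sum of column-normalized scores; a missing benchmark adds 0."""
    total = 0.0
    for name, weight in ASSUMED_WEIGHTS.items():
        score = scores.get(name)
        if score is not None:
            total += weight * score / COLUMN_BEST[name]
    return round(total, 1)


# Rows copied from the table (benchmarks shown as a dash are left out).
o3 = {"GPQA-D": 87.7, "AIME-25": 88.9}
gemini_3_5 = {"MMLU-Pro": 88.9, "GPQA-D": 91.2, "AIME-25": 96.1, "MMMU": 86.0}
mistral_large_2 = {"HumanEval": 89.0}

print(composite(o3))               # 33.1
print(composite(gemini_3_5))       # 63.0
print(composite(mistral_large_2))  # 7.3
```

Under this reading, a model that tops several columns but lacks coding scores, such as Gemini 3.5, still ranks low because each missing benchmark forfeits its full weight.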

Per-benchmark leaderboards

MMLU-Pro (general knowledge)

Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.

GPQA Diamond (reasoning)

PhD-level science questions in biology, physics, and chemistry, written by domain experts. Diamond is the most rigorously validated subset of GPQA; on the full GPQA set, experts who hold or are pursuing a PhD in the subject score about 65%.

HumanEval (coding)

OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
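
For reference, pass@k is usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass every unit test, and estimate the chance that at least one of k drawn samples passes. The sketch below assumes that estimator; this page does not say how many samples per problem its HumanEval numbers use, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb() returns 0 when k > n - c, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem, k=1):
    """Mean pass@k over (n, c) pairs, one pair per problem, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical results for three problems, 10 samples each.
results = [(10, 10), (10, 3), (10, 0)]
print(benchmark_score(results, k=1))
```

With k = 1 the estimator reduces to c / n, so pass@1 is simply the fraction of samples that pass, averaged over the problems.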

Aider Polyglot (coding)

Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.

AIME 2025 (math)

American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.

LiveCodeBench (coding)

Contamination-resistant code benchmark: each model is scored only on problems published after its training cutoff. Refreshed monthly.
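
The date filter described above can be sketched as follows. This is a minimal illustration, not LiveCodeBench's own tooling: the `Problem` class, its field names, and the sample dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; real benchmark data carries more fields.
    problem_id: str
    published: date


def eligible_problems(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.published > training_cutoff]


# Made-up problems and cutoff, for illustration only.
problems = [
    Problem("p1", date(2024, 11, 3)),
    Problem("p2", date(2025, 2, 14)),
    Problem("p3", date(2025, 6, 1)),
]
cutoff = date(2025, 1, 31)

print([p.problem_id for p in eligible_problems(problems, cutoff)])  # ['p2', 'p3']
```

Because the eligible set depends on each model's cutoff, two models may be scored on different problem subsets, which is worth keeping in mind when comparing their numbers.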

MMMU (multimodal)

Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).