gpt.buzz Model Index
Composite capability ranking across 7 benchmarks: MMLU-Pro, GPQA Diamond, AIME 2025, Aider Polyglot, LiveCodeBench, HumanEval, and MMMU. Weights favor benchmarks that resist contamination (Aider, LiveCodeBench) and benchmarks with broad reasoning coverage (GPQA, MMLU-Pro).
Models with no score on any tracked benchmark are omitted. See the methodology page for the exact weights.
| # | Model | Composite | MMLU-Pro | GPQA-D | HumanEval | Aider | AIME-25 | LiveCB | MMMU |
|---|---|---|---|---|---|---|---|---|---|
| 01 | Anthropic | 98.5 | 88.2 | 90.1 | 97.4 | 91.2 | 93.5 | 80.2 | 79.0 |
| 02 | OpenAI | 97.8 | 87.6 | 89.4 | 96.8 | 89.7 | 94.2 | 78.9 | 78.4 |
| 03 | | 96.5 | 87.1 | 88.6 | 96.5 | 84.8 | 92.8 | 76.1 | 84.3 |
| 04 | OpenAI | 94.3 | 85.1 | 87.3 | 95.2 | 85.4 | 91.0 | 74.5 | 74.1 |
| 05 | Anthropic | 86.2 | 85.4 | 85.8 | 95.9 | 83.6 | 87.4 | 73.8 | — |
| 06 | DeepSeek | 84.4 | 84.2 | 82.4 | 95.1 | 80.1 | 88.6 | 72.4 | — |
| 07 | Alibaba | 84.1 | 83.7 | 83.0 | 93.9 | 78.4 | 90.4 | 71.0 | — |
| 08 | xAI | 74.5 | 82.8 | 87.5 | — | 72.5 | 86.1 | 65.3 | — |
| 09 | | 63.0 | 88.9 | 91.2 | — | — | 96.1 | — | 86.0 |
| 10 | Alibaba | 44.9 | 78.9 | — | 91.5 | 70.2 | — | — | 70.3 |
| 11 | | 44.1 | 84.1 | 84.0 | — | — | — | — | 72.4 |
| 12 | OpenAI | 33.1 | — | 87.7 | — | — | 88.9 | — | — |
| 13 | Meta | 24.5 | 80.5 | — | — | — | — | — | 68.5 |
| 14 | DeepSeek | 12.5 | — | — | — | — | 79.8 | — | — |
| 15 | | 7.7 | — | — | — | — | — | — | 82.5 |
| 16 | Mistral | 7.3 | — | — | 89.0 | — | — | — | — |
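A composite like the one in the table can be sketched as a weighted mean over the seven benchmark columns. This is only an illustration: the weights below are hypothetical placeholders (the real ones are on the methodology page), the function name `composite` is made up for this example, and treating a missing score as zero is an assumption suggested by how sparse rows rank lower. It will not reproduce the published Composite values.

```python
from typing import Mapping, Optional

# Hypothetical weights, for illustration only. They lean toward Aider and
# LiveCodeBench (contamination resistance) and GPQA-D and MMLU-Pro (broad
# reasoning), but the actual values live on the methodology page.
EXAMPLE_WEIGHTS = {
    "MMLU-Pro": 4,
    "GPQA-D": 4,
    "HumanEval": 1,
    "Aider": 4,
    "AIME-25": 2,
    "LiveCB": 4,
    "MMMU": 1,
}


def composite(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
) -> float:
    """Weighted mean over all tracked benchmarks.

    Assumption: a benchmark with no score (absent or None) contributes 0,
    so models with sparse coverage rank lower.
    """
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * (scores.get(name) or 0.0) for name, weight in weights.items()
    )
    return round(weighted_sum / total_weight, 1)
```

Calling `composite({"HumanEval": 89.0})` scores a model that reports only HumanEval; under these placeholder weights the other six benchmarks all count as zero, which is why the result is far below the single reported score.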
Per-benchmark leaderboards
MMLU-Pro: Multitask language understanding across 14 disciplines. The harder, contamination-resistant successor to MMLU, with 10-option multiple choice and stronger distractors.
GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
HumanEval: OpenAI's 164 Python programming problems with hidden unit tests. The classic code-generation pass@1 benchmark.
Aider Polyglot: Multi-language code-editing benchmark. Models solve real Exercism problems in Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
AIME 2025: American Invitational Mathematics Examination. Competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning models score in the 50s.
LiveCodeBench: Contamination-free code benchmark that scores a model only on problems published after its training cutoff. Refreshed monthly.
MMMU: Massive Multi-discipline Multimodal Understanding. 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).
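For readers unfamiliar with the pass@1 metric used for HumanEval, the sketch below shows the unbiased pass@k estimator that was introduced alongside that benchmark: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k samples passes. Whether the scores on this page were produced this way is an assumption; the helper names `pass_at_k` and `benchmark_pass_at_k` are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems, given (n, c) pairs per problem."""
    per_problem = [pass_at_k(n, c, k) for n, c in results]
    if not per_problem:
        raise ValueError("results must not be empty")
    return sum(per_problem) / len(per_problem)
```

With k = 1 the estimate for a problem reduces to c / n, the fraction of samples that pass, so a benchmark-level pass@1 is the average of those fractions over all problems.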