AI Leaderboards
Cross-benchmark model and agent rankings on gpt.buzz, plus the public sources we track and cite.
Capability · models
gpt.buzz Model Index
Composite LLM ranking across MMLU-Pro, GPQA Diamond, AIME 2025, Aider Polyglot, LiveCodeBench, HumanEval, and MMMU.
View ranking →
Capability · agents
gpt.buzz Agent Index
Composite ranking across SWE-bench Verified, SWE-bench Pro, Terminal-Bench, GAIA, and WebArena.
View ranking →
Adoption · real-world usage
OpenRouter Token Usage
Agents ranked by daily and cumulative token volume, an adoption signal that is hard to game.
View ranking →
Model benchmarks
Standardized capability tests — knowledge, reasoning, math, code, multimodal. One leaderboard per benchmark.
MMLU-Pro
general knowledge · Multitask language understanding across 14 disciplines. The harder, contamination-resistant successor to MMLU: 10-option multiple choice with stronger distractors.
GPQA Diamond
reasoning · PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
HumanEval
coding · OpenAI's 164 Python programming problems with hidden unit tests: the classic code-generation pass@1 benchmark.
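As a rough illustration of what a pass@1 number means: the unbiased pass@k estimator commonly used with HumanEval samples n completions per problem, counts the c that pass the unit tests, and estimates the chance that at least one of k samples passes. The sketch below assumes that setup. The function names are illustrative and this is not gpt.buzz's scoring code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: completions sampled, c: completions that passed the tests,
    k: the sample budget being scored.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved: `benchmark_pass_at_k([(1, 1), (1, 0)], k=1)` evaluates to 0.5.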
Aider Polyglot
coding · Multi-language code-editing benchmark: solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
AIME 2025
math · American Invitational Mathematics Examination: competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s, non-reasoning models in the 50s.
LiveCodeBench
coding · Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
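The date-window idea can be sketched in a few lines: keep only the problems whose publication date falls strictly after the model's training cutoff, then score on that subset. The `Problem` type, its fields, and the function name are hypothetical, chosen for illustration; they are not LiveCodeBench's actual data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; real benchmark entries carry more fields.
    problem_id: str
    published: date


def problems_after_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Return only problems published strictly after the training cutoff."""
    return [p for p in problems if p.published > training_cutoff]
```

A problem published on the cutoff date itself is excluded here, which is the conservative choice when the cutoff is only known to the day.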
MMMU
multimodal · Massive Multi-discipline Multimodal Understanding: 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).
Agent benchmarks
Coding-agent and browser-agent capability tests — measured on deployed scaffolded agents, not raw models.
SWE-bench Verified
coding · Real GitHub-issue bug fixes drawn from popular open-source Python repos, restricted to a human-validated subset of tasks.
SWE-bench Pro
coding · Harder, contamination-resistant coding benchmark; average score is around 25%.
Terminal-Bench 2.0
coding · Long-horizon CLI workflows (shells, package managers, log parsing, debugging): every command an agent has to run.
GAIA
general · General AI assistant benchmark: web browsing, file parsing, multimodal reasoning, and tool use across more than 450 tasks with unambiguous answers.
WebArena
browser · Browser agents on realistic web tasks: e-commerce, social, CMS, code collaboration.
External sources
Public leaderboards we cite and pull data from.
Curious how we score? See the methodology page.
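The methodology page is the authority on how the indexes are actually computed. Purely as an illustration of what a composite ranking can look like, the sketch below min-max normalizes each benchmark across entrants and averages the normalized values with equal weight. The normalization, the equal weighting, the handling of missing scores, and the function names are all assumptions made for this example, not gpt.buzz's method.

```python
from statistics import mean


def composite_index(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Map entrant -> mean of min-max normalized benchmark scores.

    scores maps entrant name -> {benchmark name: raw score}.
    Entrants are averaged only over the benchmarks they report.
    """
    benchmarks = {bench for per_entrant in scores.values() for bench in per_entrant}
    normalized: dict[str, list[float]] = {name: [] for name in scores}
    for bench in benchmarks:
        values = {name: s[bench] for name, s in scores.items() if bench in s}
        lo, hi = min(values.values()), max(values.values())
        for name, value in values.items():
            # If every entrant ties on a benchmark, give each the midpoint.
            normalized[name].append(0.5 if hi == lo else (value - lo) / (hi - lo))
    return {name: mean(vals) for name, vals in normalized.items() if vals}


def rank(scores: dict[str, dict[str, float]]) -> list[str]:
    """Entrant names ordered from highest to lowest composite index."""
    index = composite_index(scores)
    return sorted(index, key=index.get, reverse=True)
```

Averaging only over reported benchmarks keeps entrants with partial coverage in the ranking, at the cost of making their index less comparable. A stricter variant would drop any entrant that lacks a score on some benchmark.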