AI Leaderboards

Cross-benchmark model and agent rankings on gpt.buzz, plus the public sources we track and cite.

Capability · models

gpt.buzz Model Index

Composite LLM ranking across MMLU-Pro, GPQA Diamond, AIME 2025, Aider Polyglot, LiveCodeBench, HumanEval, and MMMU.

View ranking →

Capability · agents

gpt.buzz Agent Index

Composite ranking across SWE-bench Verified, SWE-bench Pro, Terminal-Bench, GAIA, and WebArena.

View ranking →

Adoption · real-world usage

OpenRouter Token Usage

Agents ranked by daily and cumulative token volume, an adoption signal that is hard to game.

View ranking →

Model benchmarks

Standardized capability tests — knowledge, reasoning, math, code, multimodal. One leaderboard per benchmark.

MMLU-Pro

general knowledge

Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.

GPQA Diamond

reasoning

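PhD-level science questions in biology, physics, and chemistry, written by domain experts. Diamond is the highest-quality subset of GPQA: 198 questions that expert validators answered correctly and most non-experts got wrong. In the GPQA paper, experts with or pursuing PhDs in the subject scored about 65% on the full question set.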

HumanEval

coding

OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.

Aider Polyglot

coding

Multi-language code edit benchmark: solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.

AIME 2025

math

American Invitational Mathematics Examination: competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90% range; non-reasoning models score in the 50% range.

LiveCodeBench

coding

Contamination-free code benchmark: only problems published after a model's training cutoff are scored. Refreshed monthly.

MMMU

multimodal

Massive Multi-discipline Multimodal Understanding: 11.5k college-level multimodal questions across 30 subjects in six disciplines (art and design, business, science, health and medicine, humanities and social science, tech and engineering).
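
The pass@1 metric named in the HumanEval entry above is a special case of pass@k. As a minimal sketch, assuming each problem is sampled `n` times and `c` of those samples pass every unit test, the commonly used unbiased estimator is 1 - C(n-c, k) / C(n, k). The function names below are illustrative and are not taken from gpt.buzz or from any benchmark harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples that passed all tests, k: sample budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # Probability that a random size-k draw contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `k=1` the estimate reduces to `c / n`, the fraction of samples that pass, so a pass@1 score is the average first-try success rate across the problem set.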

Agent benchmarks

Coding-agent and browser-agent capability tests — measured on deployed scaffolded agents, not raw models.

SWE-bench Verified

coding

A human-validated subset of 500 SWE-bench tasks: real GitHub issues from popular open-source Python repos, each confirmed by human annotators to be well specified and solvable.

SWE-bench Pro

coding

Harder, contamination-resistant coding benchmark — average score is around 25%.

Terminal-Bench 2.0

coding

Long-horizon CLI workflows: shells, package managers, log parsing, debugging — every command an agent has to run.

GAIA

general

General AI assistant benchmark: web browsing, file parsing, multimodal reasoning, and tool use across 466 unambiguous tasks.

WebArena

browser

Browser agents on realistic web tasks — e-commerce, social, CMS, code collaboration.

External sources

Public leaderboards we cite and pull data from.

Open LLM Leaderboard

external ↗

Hugging Face's ranking of open-weight LLMs across MMLU-Pro, GPQA, MATH, IFEval, BBH, and MuSR.

LMSYS Chatbot Arena

external ↗

Crowdsourced human preference rankings — Elo-style ratings from millions of head-to-head comparisons.
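
To show what "Elo-style" means for head-to-head votes, here is a minimal sketch of the classic online Elo update. The K-factor of 32 and the 400-point scale are conventional defaults assumed for illustration; they are not the Arena's published settings, and the Arena's actual pipeline may use a different statistical fit.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k (assumed 32 here) controls how far one vote moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's score and expectation are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two models that both start at 1000 move to 1016 and 984 after a single vote for the first one. An upset win over a higher-rated model moves the ratings further than an expected win does.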

Curious how we score? See the methodology page.