
gpt.buzz Agent Index

AI Agent Leaderboard

Cross-benchmark ranking of every tracked AI agent. Each agent's best score on each benchmark is normalized against the field leader, then averaged using editorial weights below. Built to resist the benchmark gaming exposed in the April 2026 Berkeley/RDI report.

How the gpt.buzz Index is computed

Agents missing scores on some benchmarks are scored only on what they have: a missing benchmark contributes zero to the weighted average rather than redistributing its weight, which is why agents with a single reported benchmark sit low in the ranking even when they lead that benchmark. Full methodology at /methodology.
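
For concreteness, here is a minimal Python sketch of that computation. The per-benchmark weights are not printed on this page, so the values below are inferred from the leaderboard table (they reproduce every listed index to one decimal place); treat them, and the helper name agent_index, as illustrative assumptions rather than the official /methodology definition.

```python
# Minimal sketch of the gpt.buzz Index. NOTE: the weights are an assumption
# inferred from the published table, not taken from /methodology; they
# reproduce every listed index to one decimal place.
WEIGHTS = {"SWE-V": 0.30, "SWE-P": 0.25, "TB-2": 0.15, "GAIA": 0.20, "WebA": 0.10}


def agent_index(scores: dict[str, float], leaders: dict[str, float]) -> float:
    """Weighted sum of leader-normalized scores; a missing benchmark adds zero."""
    return 100 * sum(WEIGHTS[b] * scores[b] / leaders[b] for b in scores)


# Field leaders, read off the table below.
leaders = {"SWE-V": 85.0, "SWE-P": 56.8, "TB-2": 77.3, "GAIA": 74.6, "WebA": 65.8}

# Claude Code reports all five benchmarks.
claude = {"SWE-V": 80.9, "SWE-P": 55.4, "TB-2": 65.4, "GAIA": 74.6, "WebA": 64.5}
print(round(agent_index(claude, leaders), 1))  # 95.4

# Operator reports only WebArena, so its index is capped at WebA's weight.
print(round(agent_index({"WebA": 65.8}, leaders), 1))  # 10.0
```

Note how the zero-for-missing rule shapes the ranking: Operator leads WebArena outright, but under the inferred 10% WebArena weight its index tops out at 10.0.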

#    Agent          Developer           Index   SWE-V   SWE-P   TB-2   GAIA   WebA
01   Claude Code    Anthropic            95.4    80.9    55.4   65.4   74.6   64.5
02   Codex          OpenAI               70.0    85.0    56.8   77.3      –      –
03   OpenCode       OpenCode             21.9    62.0       –      –      –      –
04   Cline          Cline                20.5    58.0       –      –      –      –
05   Replit Agent   Replit               18.4    52.0       –      –      –      –
06   Hermes Agent   Nous Research        15.0       –       –      –   56.0      –
07   Manus          Monica               13.7       –       –      –   51.0      –
08   OpenClaw       Erik Steinberger     12.9       –       –      –   48.0      –
09   Devin          Cognition            12.4    35.0       –      –      –      –
10   Operator       OpenAI               10.0       –       –      –      –   65.8
11   Computer Use   Anthropic             9.8       –       –      –      –   64.5

(– = no reported score on that benchmark.)

Tracked agents without scores yet (6)

These agents are tracked in our catalog but haven't reported scores on any of the 5 tracked benchmarks. Most likely they fall outside the coding / general-assistant / browser categories these benchmarks measure.

Per-benchmark leaderboards

  • SWE-bench Verified: human-validated GitHub-issue bug fixes drawn from popular open-source Python repos.
  • SWE-bench Pro: a harder, contamination-resistant coding benchmark; scores across the field average around 25%.
  • Terminal-Bench 2.0: long-horizon CLI workflows (shells, package managers, log parsing, debugging) in which the agent must issue every command itself.
  • GAIA: general AI-assistant tasks combining web browsing, file parsing, multi-modal reasoning, and tool use across 450 unambiguous tasks.
  • WebArena: browser agents on realistic web tasks spanning e-commerce, social, CMS, and code-collaboration sites.

Index last recomputed May 12, 2026; it refreshes every 60 seconds.