
gpt.buzz Agent Index

AI Agent Leaderboard

Cross-benchmark ranking of every tracked AI agent. Each agent's best score on each benchmark is normalized against the field leader, then averaged using editorial weights below. Built to resist the benchmark gaming exposed in the April 2026 Berkeley/RDI report.

How the gpt.buzz Index is computed

Agents missing scores on some benchmarks are scored only on what they have: a missing benchmark contributes zero to the weighted average rather than redistributing its weight, which is why agents with a single reported benchmark sit low in the ranking even when they lead that benchmark. Full methodology at /methodology.
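
For concreteness, here is a minimal Python sketch of that computation. The per-benchmark weights are not printed on this page, so the values below are inferred from the leaderboard table (they reproduce every listed index to one decimal place); treat them, and the helper name agent_index, as illustrative assumptions rather than the official /methodology definition.

```python
# Minimal sketch of the gpt.buzz Index. NOTE: the weights are an assumption
# inferred from the published table, not taken from /methodology; they
# reproduce every listed index to one decimal place.
WEIGHTS = {"SWE-V": 0.30, "SWE-P": 0.25, "TB-2": 0.15, "GAIA": 0.20, "WebA": 0.10}


def agent_index(scores: dict[str, float], leaders: dict[str, float]) -> float:
    """Weighted sum of leader-normalized scores; a missing benchmark adds zero."""
    return 100 * sum(WEIGHTS[b] * scores[b] / leaders[b] for b in scores)


# Field leaders, read off the table below.
leaders = {"SWE-V": 85.0, "SWE-P": 56.8, "TB-2": 77.3, "GAIA": 74.6, "WebA": 65.8}

# Claude Code reports all five benchmarks.
claude = {"SWE-V": 80.9, "SWE-P": 55.4, "TB-2": 65.4, "GAIA": 74.6, "WebA": 64.5}
print(round(agent_index(claude, leaders), 1))  # 95.4

# Operator reports only WebArena, so its index is capped at WebA's weight.
print(round(agent_index({"WebA": 65.8}, leaders), 1))  # 10.0
```

Note how the zero-for-missing rule shapes the ranking: Operator leads WebArena outright, but under the inferred 10% WebArena weight its index tops out at 10.0.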

#    Agent          Developer           Index   SWE-V   SWE-P   TB-2   GAIA   WebA
01   Claude Code    Anthropic            95.4    80.9    55.4   65.4   74.6   64.5
02   Codex          OpenAI               70.0    85.0    56.8   77.3      –      –
03   OpenCode       OpenCode             21.9    62.0       –      –      –      –
04   Cline          Cline                20.5    58.0       –      –      –      –
05   Replit Agent   Replit               18.4    52.0       –      –      –      –
06   Hermes Agent   Nous Research        15.0       –       –      –   56.0      –
07   Manus          Monica               13.7       –       –      –   51.0      –
08   OpenClaw       Erik Steinberger     12.9       –       –      –   48.0      –
09   Devin          Cognition            12.4    35.0       –      –      –      –
10   Operator       OpenAI               10.0       –       –      –      –   65.8
11   Computer Use   Anthropic             9.8       –       –      –      –   64.5

(– = no reported score on that benchmark.)

Tracked agents without scores yet (6)

These agents are tracked in our catalog but haven't reported scores on any of the 5 tracked benchmarks. Most likely they fall outside the coding / general-assistant / browser categories these benchmarks measure.

Per-benchmark leaderboards

  • SWE-bench Verified: human-validated GitHub-issue bug fixes drawn from popular open-source Python repos.
  • SWE-bench Pro: a harder, contamination-resistant coding benchmark; scores across the field average around 25%.
  • Terminal-Bench 2.0: long-horizon CLI workflows (shells, package managers, log parsing, debugging) in which the agent must issue every command itself.
  • GAIA: general AI-assistant tasks combining web browsing, file parsing, multi-modal reasoning, and tool use across 450 unambiguous tasks.
  • WebArena: browser agents on realistic web tasks spanning e-commerce, social, CMS, and code-collaboration sites.

Index last recomputed May 12, 2026; it refreshes every 60 seconds.