gpt.buzz

Leaderboards

SWE-bench Pro

SWE-bench Pro by Scale Labs is designed to be much harder than the original benchmark: contamination-controlled, with tasks that test multi-file refactoring on private-style codebases. The average frontier model scores ~25%; the current leader, Codex, sits at 56.8%. It is considered the more honest signal following the April 2026 reward-hacking scandals.

Category: coding · Source: labs.scale.com

#    Agent                     Score   Underlying model   Measured      Source
01   Codex (OpenAI)            56.8%   gpt-5-3-codex      Apr 22, 2026  link ↗
02   Claude Code (Anthropic)   55.4%   claude-opus-4-6    Apr 22, 2026  link ↗

Other leaderboards

  • gpt.buzz Composite Agent Index — cross-benchmark ranking
  • SWE-bench Verified — real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
  • Terminal-Bench 2.0 — long-horizon CLI workflows: shells, package managers, log parsing, debugging; every command an agent has to run.
  • GAIA — general AI assistant tasks: web browsing, file parsing, multi-modal reasoning, and tool use across 450 unambiguous tasks.
  • WebArena — browser agents on realistic web tasks: e-commerce, social, CMS, and code collaboration.