SWE-bench Pro
SWE-bench Pro by Scale Labs is designed to be much harder than the original: it is contamination-controlled, with tasks that test multi-file refactoring on private-style codebases. The average frontier model scores ~25%, while the current leader, Codex, sits at 56.8%. It is considered the more honest signal after the April 2026 reward-hacking scandals.
Category: coding · Source: labs.scale.com ↗
| # | Agent | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|
| 01 | Codex (OpenAI) | 56.8% | gpt-5-3-codex | Apr 22, 2026 | link ↗ |
| 02 | Claude Code (Anthropic) | 55.4% | claude-opus-4-6 | Apr 22, 2026 | link ↗ |
Other leaderboards
- gpt.buzz Composite Agent Index — cross-benchmark ranking
- SWE-bench Verified — Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
- Terminal-Bench 2.0 — Long-horizon CLI workflows covering shells, package managers, log parsing, and debugging: every command the agent has to run itself.
- GAIA — General AI assistant: web browsing, file parsing, multi-modal reasoning, tool use across 450 unambiguous tasks.
- WebArena — Browser agents on realistic web tasks — e-commerce, social, CMS, code collaboration.