SWE-bench Pro
SWE-bench Pro by Scale Labs is designed to be much harder than the original: it is contamination-controlled, with tasks that test multi-file refactoring on private-style codebases. The average frontier model scores ~25%, while the current leader, Codex, sits at 56.8%. It is considered the more honest signal after the April 2026 reward-hacking scandals.
Category: coding · Source: labs.scale.com ↗
| # | Agent | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|
| 01 | Codex (OpenAI) | 56.8% | gpt-5-3-codex | Apr 22, 2026 | link ↗ |
| 02 | Claude Code (Anthropic) | 55.4% | claude-opus-4-6 | Apr 22, 2026 | link ↗ |
Other leaderboards
- gpt.buzz Composite Agent Index — cross-benchmark ranking
- SWE-bench Verified — Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
- Terminal-Bench 2.0 — Long-horizon CLI workflows covering shells, package managers, log parsing, and debugging: every command the agent has to run itself.
- GAIA — General AI assistant: web browsing, file parsing, multi-modal reasoning, tool use across 450 unambiguous tasks.
- WebArena — Browser agents on realistic web tasks — e-commerce, social, CMS, code collaboration.