SWE-bench Verified
SWE-bench Verified is the human-validated subset of the original SWE-bench: every problem has been manually checked to confirm the task is solvable from the provided repository context. It is the hardest and most closely watched coding-agent benchmark in the industry. Claude Mythos Preview leads at 93.9% as of April 2026, though analysts caution that data contamination and reward hacking may inflate top scores.
Category: coding · Source: www.swebench.com ↗
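The benchmark tasks themselves are straightforward to inspect. A minimal sketch, assuming the subset is published on Hugging Face as `princeton-nlp/SWE-bench_Verified` with the usual SWE-bench fields (`instance_id`, `repo`, `base_commit`, `problem_statement`); field names may differ in the release you use:

```python
# Hedged sketch: load and inspect SWE-bench Verified tasks.
# Assumes the dataset id "princeton-nlp/SWE-bench_Verified" and a "test"
# split; adjust to the actual release if these differ.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds), "human-validated instances")

task = ds[0]
print(task["instance_id"])               # unique task identifier
print(task["repo"], task["base_commit"])  # repository state the agent starts from
print(task["problem_statement"][:300])    # the issue text provided to the agent
```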
| # | Agent | Vendor | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|---|
| 01 | Codex | OpenAI | 85.0% | gpt-5-3-codex | Apr 20, 2026 | link ↗ |
| 02 | Claude Code | Anthropic | 80.9% | claude-opus-4-5 | Apr 15, 2026 | link ↗ |
| 03 | OpenCode | OpenCode | 62.0% | claude-opus-4-6 | Apr 12, 2026 | link ↗ |
| 04 | Cline | Cline | 58.0% | claude-opus-4-6 | Apr 10, 2026 | link ↗ |
| 05 | Replit Agent | Replit | 52.0% | claude-opus-4-7 | Mar 11, 2026 | link ↗ |
| 06 | Devin | Cognition | 35.0% | multi-llm | Feb 15, 2026 | link ↗ |
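The scores above are simply the percentage of instances an agent resolves, i.e. instances where the agent's patch makes the failing tests pass without breaking the previously passing ones. A minimal sketch of that arithmetic, assuming a hypothetical per-instance report file mapping `instance_id` to a resolved flag (the official harness emits its own report format):

```python
# Hedged sketch: turn a per-instance resolved/unresolved report into a
# leaderboard-style percentage. The JSON shape {"<instance_id>": true, ...}
# is hypothetical, not the official harness output.
import json

def resolved_rate(report_path: str) -> float:
    """Percentage of instances marked resolved in the report."""
    with open(report_path) as f:
        results = json.load(f)
    if not results:
        return 0.0
    return 100.0 * sum(1 for ok in results.values() if ok) / len(results)

# Example: a 500-instance report with 425 resolved prints "85.0% resolved".
# print(f"{resolved_rate('report.json'):.1f}% resolved")
```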
Other leaderboards
- gpt.buzz Composite Agent Index — cross-benchmark ranking
- SWE-bench Pro — a harder, contamination-resistant coding benchmark; the average score is around 25%.
- Terminal-Bench 2.0 — long-horizon CLI workflows (shells, package managers, log parsing, debugging), where every action the agent takes is a command it has to run itself.
- GAIA — General AI assistant: web browsing, file parsing, multi-modal reasoning, tool use across 450 unambiguous tasks.
- WebArena — Browser agents on realistic web tasks — e-commerce, social, CMS, code collaboration.