

SWE-bench Verified

SWE-bench Verified is the human-validated subset of the original SWE-bench: every problem has been manually checked to confirm that the task is solvable with the provided context. It is the hardest and most closely watched coding-agent benchmark in the industry. Claude Mythos Preview leads at 93.9% as of April 2026, though analysts caution that contamination and reward hacking may inflate top scores.

Category: coding · Source: www.swebench.com

| #  | Agent        | Developer | Score | Underlying model | Measured     | Source |
|----|--------------|-----------|-------|------------------|--------------|--------|
| 01 | Codex        | OpenAI    | 85.0% | gpt-5-3-codex    | Apr 20, 2026 | link ↗ |
| 02 | Claude Code  | Anthropic | 80.9% | claude-opus-4-5  | Apr 15, 2026 | link ↗ |
| 03 | OpenCode     | OpenCode  | 62.0% | claude-opus-4-6  | Apr 12, 2026 | link ↗ |
| 04 | Cline        | Cline     | 58.0% | claude-opus-4-6  | Apr 10, 2026 | link ↗ |
| 05 | Replit Agent | Replit    | 52.0% | claude-opus-4-7  | Mar 11, 2026 | link ↗ |
| 06 | Devin        | Cognition | 35.0% | multi-llm        | Feb 15, 2026 | link ↗ |
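
The Score column is the percentage of the 500 Verified task instances an agent resolves, i.e. its patch makes the instance's failing tests pass without breaking the existing ones. Below is a minimal sketch of that calculation, assuming the Hugging Face dataset princeton-nlp/SWE-bench_Verified and a hypothetical results.json that maps each instance_id to whether the agent resolved it; it is not the official evaluation harness, which actually runs the tests in isolated per-instance environments.

```python
# Minimal sketch (not the official harness): compute a SWE-bench Verified
# score as the percentage of resolved instances.
#
# Assumptions (not stated on this page):
#   - the Hugging Face dataset "princeton-nlp/SWE-bench_Verified" with a
#     "test" split and an "instance_id" field
#   - a hypothetical results.json mapping instance_id -> bool, where True
#     means the agent's patch passed that instance's tests
import json

from datasets import load_dataset


def verified_score(results_path: str) -> float:
    """Percent of Verified instances resolved, over the full 500-task split."""
    instances = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    with open(results_path) as f:
        resolved = json.load(f)  # e.g. {"django__django-11099": true, ...}
    # Unattempted or unresolved instances count as failures.
    solved = sum(bool(resolved.get(row["instance_id"])) for row in instances)
    return 100.0 * solved / len(instances)


if __name__ == "__main__":
    print(f"Score: {verified_score('results.json'):.1f}%")
```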

Other leaderboards

  • gpt.buzz Composite Agent Index — cross-benchmark ranking
  • SWE-bench Pro — harder, contamination-resistant coding benchmark; the average score is around 25%.
  • Terminal-Bench 2.0 — long-horizon CLI workflows: shells, package managers, log parsing, debugging; every command an agent has to run.
  • GAIA — general AI assistant tasks: web browsing, file parsing, multi-modal reasoning, and tool use across 450 unambiguous tasks.
  • WebArena — browser agents on realistic web tasks: e-commerce, social, CMS, and code collaboration.