

SWE-bench Verified

SWE-bench Verified is the human-validated subset of the original SWE-bench: every problem has been manually checked to confirm that the task is solvable with the provided context. It is the hardest and most closely watched coding-agent benchmark in the industry. Claude Mythos Preview leads at 93.9% as of April 2026, though analysts caution that contamination and reward hacking may inflate top scores.

Category: coding · Source: www.swebench.com

| #  | Agent        | Developer | Score | Underlying model | Measured     | Source |
|----|--------------|-----------|-------|------------------|--------------|--------|
| 01 | Codex        | OpenAI    | 85.0% | gpt-5-3-codex    | Apr 20, 2026 | link ↗ |
| 02 | Claude Code  | Anthropic | 80.9% | claude-opus-4-5  | Apr 15, 2026 | link ↗ |
| 03 | OpenCode     | OpenCode  | 62.0% | claude-opus-4-6  | Apr 12, 2026 | link ↗ |
| 04 | Cline        | Cline     | 58.0% | claude-opus-4-6  | Apr 10, 2026 | link ↗ |
| 05 | Replit Agent | Replit    | 52.0% | claude-opus-4-7  | Mar 11, 2026 | link ↗ |
| 06 | Devin        | Cognition | 35.0% | multi-llm        | Feb 15, 2026 | link ↗ |
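
The Score column is the percentage of the 500 Verified task instances an agent resolves, i.e. its patch makes the instance's failing tests pass without breaking the existing ones. Below is a minimal sketch of that calculation, assuming the Hugging Face dataset princeton-nlp/SWE-bench_Verified and a hypothetical results.json that maps each instance_id to whether the agent resolved it; it is not the official evaluation harness, which actually runs the tests in isolated per-instance environments.

```python
# Minimal sketch (not the official harness): compute a SWE-bench Verified
# score as the percentage of resolved instances.
#
# Assumptions (not stated on this page):
#   - the Hugging Face dataset "princeton-nlp/SWE-bench_Verified" with a
#     "test" split and an "instance_id" field
#   - a hypothetical results.json mapping instance_id -> bool, where True
#     means the agent's patch passed that instance's tests
import json

from datasets import load_dataset


def verified_score(results_path: str) -> float:
    """Percent of Verified instances resolved, over the full 500-task split."""
    instances = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    with open(results_path) as f:
        resolved = json.load(f)  # e.g. {"django__django-11099": true, ...}
    # Unattempted or unresolved instances count as failures.
    solved = sum(bool(resolved.get(row["instance_id"])) for row in instances)
    return 100.0 * solved / len(instances)


if __name__ == "__main__":
    print(f"Score: {verified_score('results.json'):.1f}%")
```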

Other leaderboards

  • gpt.buzz Composite Agent Index — cross-benchmark ranking
  • SWE-bench Pro — harder, contamination-resistant coding benchmark; the average score is around 25%.
  • Terminal-Bench 2.0 — long-horizon CLI workflows: shells, package managers, log parsing, debugging; every command an agent has to run.
  • GAIA — general AI assistant tasks: web browsing, file parsing, multi-modal reasoning, and tool use across 450 unambiguous tasks.
  • WebArena — browser agents on realistic web tasks: e-commerce, social, CMS, and code collaboration.