gpt.buzz


Terminal-Bench 2.0

Terminal-Bench 2.0 tests an agent's ability to drive a terminal: running shell commands, parsing output, recovering from errors, and completing multi-step operations. GPT-5.3-Codex leads at 77.3%, with Claude Opus around 65.4%. Scores here correlate more closely with day-to-day coding-agent UX than pure SWE-bench numbers do.
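The loop this benchmark exercises can be sketched as follows. This is a hypothetical agent skeleton, not the Terminal-Bench harness: the `run_step`/`drive` names and the retry policy are illustrative assumptions.

```python
import subprocess

def run_step(cmd: str, timeout: float = 30.0) -> tuple[int, str]:
    """Run one shell command and capture its output, as a terminal agent would."""
    proc = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr

def drive(plan: list[str], max_retries: int = 2) -> bool:
    """Execute a multi-step plan, retrying failed steps.

    Error recovery (here, a blind retry; a real agent would revise the
    command based on the parsed output) is part of what the benchmark measures.
    """
    for cmd in plan:
        for _attempt in range(max_retries + 1):
            code, output = run_step(cmd)
            if code == 0:
                break  # step succeeded; a real agent would also parse `output`
        else:
            return False  # step kept failing; task unsolved
    return True
```

A real scored run replaces the fixed `plan` with model-generated commands conditioned on each step's output; the skeleton only shows the run-parse-retry shape.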

Category: coding · Source: awesomeagents.ai

#  | Agent                    | Score | Underlying model | Measured     | Source
01 | Codex (OpenAI)           | 77.3% | gpt-5-3-codex    | Apr 22, 2026 | link ↗
02 | Claude Code (Anthropic)  | 65.4% | claude-opus-4-6  | Apr 22, 2026 | link ↗

Other leaderboards

  • gpt.buzz Composite Agent Index — cross-benchmark ranking
  • SWE-bench Verified — real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
  • SWE-bench Pro — harder, contamination-resistant coding benchmark; average score is around 25%.
  • GAIA — general AI assistant: web browsing, file parsing, multi-modal reasoning, tool use across 450 unambiguous tasks.
  • WebArena — browser agents on realistic web tasks: e-commerce, social, CMS, code collaboration.