Terminal-Bench 2.0
Terminal-Bench 2.0 tests an agent's ability to drive a terminal: running shell commands, parsing their output, recovering from errors, and completing multi-step operations. GPT-5.3-Codex leads at 77.3%, with Claude Opus at 65.4%. Scores here correlate more closely with day-to-day coding-agent UX than pure SWE-bench scores do.
Category: coding · Source: awesomeagents.ai
| # | Agent | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|
| 01 | Codex (OpenAI) | 77.3% | gpt-5-3-codex | Apr 22, 2026 | link |
| 02 | Claude Code (Anthropic) | 65.4% | claude-opus-4-6 | Apr 22, 2026 | link |
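To make concrete what "driving a terminal" means here, below is a minimal sketch of the kind of loop such a task exercises: run a shell command, inspect its output, and retry on failure before moving to the next step. This is illustrative only, not the Terminal-Bench harness; the helper names (`run`, `run_with_recovery`) and the sample task commands are hypothetical.

```python
import subprocess

def run(cmd: str, timeout: int = 60) -> tuple[int, str]:
    """Run a shell command; return its exit code and combined stdout/stderr."""
    proc = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr

def run_with_recovery(steps: list[str], max_retries: int = 2) -> bool:
    """Execute a multi-step task, retrying each failed step (error recovery)."""
    for cmd in steps:
        for _attempt in range(max_retries + 1):
            code, output = run(cmd)
            if code == 0:
                break  # step succeeded; a real agent would now parse `output`
            # A real agent would inspect `output` here and revise the command.
        else:
            return False  # step kept failing; task unsolved
    return True

# Hypothetical multi-step task in the Terminal-Bench style:
ok = run_with_recovery([
    "git clone https://github.com/example/repo.git /tmp/repo",
    "cd /tmp/repo && make test",
])
print("task solved" if ok else "task failed")
```

A real harness runs tasks in isolated containers and scores only the final state; the retry loop above stands in for the agent's own error-recovery reasoning.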
Other leaderboards
- gpt.buzz Composite Agent Index: a cross-benchmark composite ranking.
- SWE-bench Verified: real GitHub-issue bug fixes, drawn from popular open-source Python repos, with human-verified solutions.
- SWE-bench Pro: a harder, contamination-resistant coding benchmark; the average score is around 25%.
- GAIA: general AI assistance across 450 unambiguous tasks covering web browsing, file parsing, multi-modal reasoning, and tool use.
- WebArena: browser agents on realistic web tasks spanning e-commerce, social media, CMS, and code collaboration.