Terminal-Bench 2.0
Terminal-Bench 2.0 tests an agent's ability to drive a terminal: running shell commands, parsing their output, recovering from errors, and completing multi-step operations. GPT-5.3-Codex leads at 77.3%, with Claude Opus at 65.4%. Scores here correlate more closely with day-to-day coding-agent UX than pure SWE-bench scores do.
Category: coding · Source: awesomeagents.ai
| # | Agent | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|
| 01 | Codex (OpenAI) | 77.3% | gpt-5-3-codex | Apr 22, 2026 | link |
| 02 | Claude Code (Anthropic) | 65.4% | claude-opus-4-6 | Apr 22, 2026 | link |
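To make concrete what "driving a terminal" means here, below is a minimal sketch of the kind of loop such a task exercises: run a shell command, inspect its output, and retry on failure before moving to the next step. This is illustrative only, not the Terminal-Bench harness; the helper names (`run`, `run_with_recovery`) and the sample task commands are hypothetical.

```python
import subprocess

def run(cmd: str, timeout: int = 60) -> tuple[int, str]:
    """Run a shell command; return its exit code and combined stdout/stderr."""
    proc = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stdout + proc.stderr

def run_with_recovery(steps: list[str], max_retries: int = 2) -> bool:
    """Execute a multi-step task, retrying each failed step (error recovery)."""
    for cmd in steps:
        for _attempt in range(max_retries + 1):
            code, output = run(cmd)
            if code == 0:
                break  # step succeeded; a real agent would now parse `output`
            # A real agent would inspect `output` here and revise the command.
        else:
            return False  # step kept failing; task unsolved
    return True

# Hypothetical multi-step task in the Terminal-Bench style:
ok = run_with_recovery([
    "git clone https://github.com/example/repo.git /tmp/repo",
    "cd /tmp/repo && make test",
])
print("task solved" if ok else "task failed")
```

A real harness runs tasks in isolated containers and scores only the final state; the retry loop above stands in for the agent's own error-recovery reasoning.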
Other leaderboards
- gpt.buzz Composite Agent Index: a cross-benchmark composite ranking.
- SWE-bench Verified: real GitHub-issue bug fixes, drawn from popular open-source Python repos, with human-verified solutions.
- SWE-bench Pro: a harder, contamination-resistant coding benchmark; the average score is around 25%.
- GAIA: general AI assistance across 450 unambiguous tasks covering web browsing, file parsing, multi-modal reasoning, and tool use.
- WebArena: browser agents on realistic web tasks spanning e-commerce, social media, CMS, and code collaboration.