WebArena
WebArena is a self-hosted environment of 812 templated browser tasks spanning e-commerce, social, content management, and collaboration domains. It tests an agent's ability to navigate realistic websites and complete multi-step transactions. Claude Mythos Preview leads at 68.7% in 2026.
Category: browser · Source: webarena.dev ↗
| # | Agent | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|
| 01 | Operator (OpenAI) | 65.8% | gpt-5-4-pro | Apr 25, 2026 | link ↗ |
| 02 | Claude Code (Anthropic) | 64.5% | claude-opus-4-6 | Apr 22, 2026 | link ↗ |
| 03 | Computer Use (Anthropic) | 64.5% | claude-opus-4-6 | Apr 22, 2026 | link ↗ |
Other leaderboards
- gpt.buzz Composite Agent Index — cross-benchmark ranking
- SWE-bench Verified — Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
- SWE-bench Pro — Harder, contamination-resistant coding benchmark — average score is around 25%.
- Terminal-Bench 2.0 — Long-horizon CLI workflows: shells, package managers, log parsing, debugging — every command an agent has to run.
- GAIA — General AI assistant: web browsing, file parsing, multi-modal reasoning, tool use across 450 unambiguous tasks.