WebArena

WebArena is a self-hosted environment of 812 templated browser tasks across e-commerce, social, content management, and collaboration domains. It tests an agent's ability to navigate realistic websites and complete multi-step transactions. Claude Mythos Preview leads at 68.7% in 2026.

Category: browser · Source: webarena.dev ↗

#	Agent	Score	Underlying model	Measured	Source
01	Operator (OpenAI)	65.8%	gpt-5-4-pro	Apr 25, 2026	link ↗
02	Claude Code (Anthropic)	64.5%	claude-opus-4-6	Apr 22, 2026	link ↗
03	Computer Use (Anthropic)	64.5%	claude-opus-4-6	Apr 22, 2026	link ↗
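
To put the percentages in the table in concrete terms, the short Python sketch below converts a score into an approximate number of completed tasks. It assumes each score is a plain success rate over the full 812-task set, which the page does not state explicitly. The function name `approx_tasks_solved` is hypothetical and used only for illustration.

```python
TOTAL_TASKS = 812  # task count given in the benchmark description


def approx_tasks_solved(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a success-rate percentage into a rounded task count.

    Assumption: the score is (tasks solved / total tasks) * 100 over the
    full task set. A leaderboard entry measured on a subset would not
    map onto this count.
    """
    return round(score_percent / 100 * total)


# Scores copied from the table above.
scores = {"Operator": 65.8, "Claude Code": 64.5, "Computer Use": 64.5}

for agent, pct in scores.items():
    print(f"{agent}: roughly {approx_tasks_solved(pct)} of {TOTAL_TASKS} tasks")
```

Under that assumption, the 1.3-point gap between first and second place corresponds to about ten tasks, so small differences between entries should be read with care.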

Other leaderboards

gpt.buzz Composite Agent Index — cross-benchmark ranking
SWE-bench Verified — Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
SWE-bench Pro — Harder, contamination-resistant coding benchmark — average score is around 25%.
Terminal-Bench 2.0 — Long-horizon CLI workflows: shells, package managers, log parsing, debugging — every command an agent has to run.
GAIA — General AI assistant: web browsing, file parsing, multi-modal reasoning, tool use across 450 unambiguous tasks.