gpt.buzz

Leaderboards

WebArena

WebArena is a self-hosted environment of 812 templated browser tasks spanning e-commerce, social, content management, and collaboration domains. It tests an agent's ability to navigate realistic websites and complete multi-step transactions. Claude Mythos Preview leads at 68.7% in 2026.

Category: browser · Source: webarena.dev

#   Agent                     Score   Underlying model   Measured       Source
01  Operator (OpenAI)         65.8%   gpt-5-4-pro        Apr 25, 2026   link ↗
02  Claude Code (Anthropic)   64.5%   claude-opus-4-6    Apr 22, 2026   link ↗
03  Computer Use (Anthropic)  64.5%   claude-opus-4-6    Apr 22, 2026   link ↗
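As a rough reading aid, the percentages in the table can be translated into approximate task counts. The sketch below assumes each score is the plain fraction of the 812 tasks completed successfully; the page does not state how scores are computed, and the function name `approx_tasks_solved` is hypothetical.

```python
TOTAL_TASKS = 812  # task count stated in the WebArena description


def approx_tasks_solved(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks solved for a given success rate.

    Assumption: the score is simply solved / total, expressed as a percentage.
    """
    return round(total * score_percent / 100)


if __name__ == "__main__":
    # Scores copied from the leaderboard rows.
    for agent, score in [("Operator", 65.8), ("Claude Code", 64.5), ("Computer Use", 64.5)]:
        print(f"{agent}: ~{approx_tasks_solved(score)} of {TOTAL_TASKS} tasks")
```

Under that assumption, the 1.3-point gap between first and second place corresponds to roughly ten tasks.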

Other leaderboards

  • gpt.buzz Composite Agent Index: cross-benchmark ranking.
  • SWE-bench Verified: Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
  • SWE-bench Pro: Harder, contamination-resistant coding benchmark; average score is around 25%.
  • Terminal-Bench 2.0: Long-horizon CLI workflows (shells, package managers, log parsing, debugging), covering every command an agent has to run.
  • GAIA: General AI assistant benchmark covering web browsing, file parsing, multi-modal reasoning, and tool use across 450 unambiguous tasks.