gpt.buzz

GAIA

GAIA (General AI Assistants) measures broad agent capability across reasoning, multimodal understanding, web browsing, and tool use. The 2026 leaderboard hosted at Princeton's HAL is led by Claude Code (claude-sonnet-4-5) at 74.6%, with Claude models powering the top two entries shown.

Category: general · Source: hal.cs.princeton.edu

| #  | Agent        | Org              | Score | Underlying model  | Measured     | Source |
|----|--------------|------------------|-------|-------------------|--------------|--------|
| 01 | Claude Code  | Anthropic        | 74.6% | claude-sonnet-4-5 | Apr 18, 2026 | link ↗ |
| 02 | Hermes Agent | Nous Research    | 56.0% | claude-opus-4-7   | May 8, 2026  | link ↗ |
| 03 | Manus        | Monica           | 51.0% | multi-llm         | Mar 20, 2026 | link ↗ |
| 04 | OpenClaw     | Erik Steinberger | 48.0% | mixed             | Apr 15, 2026 | link ↗ |

Other leaderboards

  • gpt.buzz Composite Agent Index — cross-benchmark ranking
  • SWE-bench Verified — real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
  • SWE-bench Pro — harder, contamination-resistant coding benchmark; the average score is around 25%.
  • Terminal-Bench 2.0 — long-horizon CLI workflows: shells, package managers, log parsing, debugging; every command an agent has to run.
  • WebArena — browser agents on realistic web tasks: e-commerce, social, CMS, code collaboration.