GAIA
GAIA (General AI Assistants) measures broad agent capability across reasoning, multimodal understanding, web browsing, and tool use. The 2026 leaderboard, hosted at Princeton's HAL, is led by Claude Code (claude-sonnet-4-5) at 74.6%, with Claude-based agents holding the top two positions.
Category: general · Source: hal.cs.princeton.edu ↗
| # | Agent | Org | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|---|
| 1 | Claude Code | Anthropic | 74.6% | claude-sonnet-4-5 | Apr 18, 2026 | link ↗ |
| 2 | Hermes Agent | Nous Research | 56.0% | claude-opus-4-7 | May 8, 2026 | link ↗ |
| 3 | Manus | Monica | 51.0% | multi-llm | Mar 20, 2026 | link ↗ |
| 4 | OpenClaw | Erik Steinberger | 48.0% | mixed | Apr 15, 2026 | link ↗ |
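For quick filtering or comparison, the table rows can be kept as structured records. A minimal sketch in Python; the `Entry` class and `top` helper are illustrative only, not part of any HAL tooling:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    rank: int
    agent: str
    org: str
    score: float  # GAIA accuracy, in percent
    model: str

# Rows transcribed from the leaderboard table above.
ENTRIES = [
    Entry(1, "Claude Code", "Anthropic", 74.6, "claude-sonnet-4-5"),
    Entry(2, "Hermes Agent", "Nous Research", 56.0, "claude-opus-4-7"),
    Entry(3, "Manus", "Monica", 51.0, "multi-llm"),
    Entry(4, "OpenClaw", "Erik Steinberger", 48.0, "mixed"),
]

def top(entries, n=3):
    """Return the n highest-scoring entries, best first."""
    return sorted(entries, key=lambda e: e.score, reverse=True)[:n]
```

Keeping score as a plain float (rather than the formatted "74.6%" string) makes sorting and threshold filters trivial.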
Other leaderboards
- gpt.buzz Composite Agent Index — cross-benchmark ranking
- SWE-bench Verified — Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
- SWE-bench Pro — Harder, contamination-resistant coding benchmark — average score is around 25%.
- Terminal-Bench 2.0 — Long-horizon CLI workflows: shells, package managers, log parsing, debugging — every command an agent has to run.
- WebArena — Browser agents on realistic web tasks — e-commerce, social, CMS, code collaboration.