GAIA
GAIA (General AI Assistants) measures broad agent capability across reasoning, multimodal understanding, web browsing, and tool use. The 2026 leaderboard, hosted at Princeton's HAL, is led by Claude Code (claude-sonnet-4-5) at 74.6%, with Claude-based agents holding the top two positions.
Category: general · Source: hal.cs.princeton.edu ↗
| # | Agent | Org | Score | Underlying model | Measured | Source |
|---|---|---|---|---|---|---|
| 1 | Claude Code | Anthropic | 74.6% | claude-sonnet-4-5 | Apr 18, 2026 | link ↗ |
| 2 | Hermes Agent | Nous Research | 56.0% | claude-opus-4-7 | May 8, 2026 | link ↗ |
| 3 | Manus | Monica | 51.0% | multi-llm | Mar 20, 2026 | link ↗ |
| 4 | OpenClaw | Erik Steinberger | 48.0% | mixed | Apr 15, 2026 | link ↗ |
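For quick filtering or comparison, the table rows can be kept as structured records. A minimal sketch in Python; the `Entry` class and `top` helper are illustrative only, not part of any HAL tooling:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    rank: int
    agent: str
    org: str
    score: float  # GAIA accuracy, in percent
    model: str

# Rows transcribed from the leaderboard table above.
ENTRIES = [
    Entry(1, "Claude Code", "Anthropic", 74.6, "claude-sonnet-4-5"),
    Entry(2, "Hermes Agent", "Nous Research", 56.0, "claude-opus-4-7"),
    Entry(3, "Manus", "Monica", 51.0, "multi-llm"),
    Entry(4, "OpenClaw", "Erik Steinberger", 48.0, "mixed"),
]

def top(entries, n=3):
    """Return the n highest-scoring entries, best first."""
    return sorted(entries, key=lambda e: e.score, reverse=True)[:n]
```

Keeping score as a plain float (rather than the formatted "74.6%" string) makes sorting and threshold filters trivial.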
Other leaderboards
- gpt.buzz Composite Agent Index — cross-benchmark ranking
- SWE-bench Verified — Real GitHub-issue bug fixes that a human can verify, drawn from popular open-source Python repos.
- SWE-bench Pro — Harder, contamination-resistant coding benchmark — average score is around 25%.
- Terminal-Bench 2.0 — Long-horizon CLI workflows: shells, package managers, log parsing, debugging — every command an agent has to run.
- WebArena — Browser agents on realistic web tasks — e-commerce, social, CMS, code collaboration.