Aider Polyglot
The Aider Polyglot benchmark measures how well a model can edit existing code across 6 languages on Exercism problems with hidden test suites. It tracks day-to-day coding-agent UX far more closely than HumanEval does. As of May 2026, Claude Opus 4.7 and GPT-5.5 trade the top spot at around 90%.
Category: coding · Source: aider.chat ↗
| # | Model | Score | Setting | Measured | Source |
|---|---|---|---|---|---|
| 01 | Anthropic | 91.2% | edit + test | Apr 25, 2026 | link ↗ |
| 02 | OpenAI | 89.7% | edit + test | Apr 25, 2026 | link ↗ |
| 03 | OpenAI | 85.4% | edit + test | Sep 1, 2025 | link ↗ |
| 04 | | 84.8% | edit + test | Apr 25, 2026 | link ↗ |
| 05 | Anthropic | 83.6% | edit + test | Feb 25, 2026 | link ↗ |
| 06 | DeepSeek | 80.1% | edit + test | Apr 25, 2026 | link ↗ |
| 07 | Alibaba | 78.4% | edit + test | May 20, 2026 | link ↗ |
| 08 | xAI | 72.5% | edit + test | Jul 15, 2025 | link ↗ |
| 09 | Alibaba | 70.2% | edit + test | Apr 25, 2026 | link ↗ |
Other model leaderboards
- gpt.buzz Composite Model Index — cross-benchmark ranking
- MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
- GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
- HumanEval — OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
- AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
- LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.
- MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).