Aider Polyglot

The Aider Polyglot benchmark measures how well a model can edit existing code across six languages (C++, Go, Java, JavaScript, Python, and Rust), using Exercism problems with hidden test suites. Its results correlate far more closely with day-to-day coding-agent UX than HumanEval's do. As of May 2026, Claude Opus 4.7 and GPT-5.5 trade the top spot at around 90%.

Category: coding · Source: aider.chat

#	Model	Vendor	Score	Setting	Measured	Source
01	Claude 4.7 Opus	Anthropic	91.2%	edit + test	Apr 25, 2026	link
02	GPT-5.5	OpenAI	89.7%	edit + test	Apr 25, 2026	link
03	GPT-5	OpenAI	85.4%	edit + test	Sep 1, 2025	link
04	Gemini 3 Pro	Google	84.8%	edit + test	Apr 25, 2026	link
05	Claude 4.6 Sonnet	Anthropic	83.6%	edit + test	Feb 25, 2026	link
06	DeepSeek-V4-Pro	DeepSeek	80.1%	edit + test	Apr 25, 2026	link
07	Qwen3.7-Max	Alibaba	78.4%	edit + test	May 20, 2026	link
08	Grok 4	xAI	72.5%	edit + test	Jul 15, 2025	link
09	Qwen3.6-27B	Alibaba	70.2%	edit + test	Apr 25, 2026	link
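
The Score column reads as a pass rate: the share of exercises whose hidden tests pass after the model's edit. The sketch below shows that arithmetic under stated assumptions. The `ExerciseResult` record, its field names, and the sample data are hypothetical illustrations, not the benchmark harness's real schema or any measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExerciseResult:
    """One exercise outcome (hypothetical record, not the harness's schema)."""
    language: str
    name: str
    tests_passed: bool  # True if the hidden test suite passed after the edit


def pass_rate(results: list[ExerciseResult]) -> float:
    """Return the percentage of exercises whose hidden tests passed."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for r in results if r.tests_passed)
    return 100.0 * passed / len(results)


def pass_rate_by_language(results: list[ExerciseResult]) -> dict[str, float]:
    """Group results by language and score each group separately."""
    grouped: dict[str, list[ExerciseResult]] = {}
    for r in results:
        grouped.setdefault(r.language, []).append(r)
    return {lang: pass_rate(rs) for lang, rs in grouped.items()}


# Made-up sample data, for illustration only.
sample = [
    ExerciseResult("python", "two-fer", True),
    ExerciseResult("python", "bowling", True),
    ExerciseResult("rust", "two-fer", True),
    ExerciseResult("rust", "bowling", False),
]

overall = pass_rate(sample)
per_language = pass_rate_by_language(sample)
```

With this sample, three of four exercises pass, so `overall` is 75.0, while `per_language` separates a clean Python run from a half-passing Rust run. A per-language breakdown like this can show whether a single headline score hides a weak language.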

Other model leaderboards

gpt.buzz Composite Model Index — cross-benchmark ranking
MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
HumanEval — OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.
MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).