gpt.buzz

Aider Polyglot

The Aider Polyglot benchmark measures how well a model can edit existing code across 6 languages on Exercism problems with hidden test suites. It correlates far more closely with day-to-day coding-agent UX than HumanEval does. Claude Opus 4.7 and GPT-5.5 trade the top spot at around 90% as of May 2026.

Category: coding · Source: aider.chat
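
As a rough illustration of how a percentage score of this kind can be derived, the sketch below assumes each exercise yields a single pass/fail outcome from its hidden test suite and reports the share of exercises that passed. The `ExerciseResult` record and both helper functions are hypothetical names chosen for this example; they are not taken from the aider benchmark harness, whose exact scoring rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExerciseResult:
    """Outcome of one exercise (hypothetical record, not aider's own format)."""
    language: str
    name: str
    tests_passed: bool  # True if the hidden test suite passed after the edit


def pass_rate(results: list[ExerciseResult]) -> float:
    """Percentage of exercises whose hidden tests passed, rounded to 0.1."""
    if not results:
        raise ValueError("no results to score")
    solved = sum(r.tests_passed for r in results)
    return round(100.0 * solved / len(results), 1)


def pass_rate_by_language(results: list[ExerciseResult]) -> dict[str, float]:
    """Same score, broken down per language."""
    grouped: dict[str, list[ExerciseResult]] = {}
    for r in results:
        grouped.setdefault(r.language, []).append(r)
    return {lang: pass_rate(rs) for lang, rs in grouped.items()}


if __name__ == "__main__":
    # Made-up outcomes purely for demonstration; not real benchmark data.
    demo = [
        ExerciseResult("python", "exercise-a", True),
        ExerciseResult("python", "exercise-b", False),
        ExerciseResult("rust", "exercise-c", True),
        ExerciseResult("rust", "exercise-d", True),
    ]
    print(pass_rate(demo))
    print(pass_rate_by_language(demo))
```

A per-language breakdown like this is one way to see whether an overall score hides a weak language, which a single headline percentage cannot show.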

| #  | Model             | Provider  | Score | Setting     | Measured     |
|----|-------------------|-----------|-------|-------------|--------------|
| 01 | Claude 4.7 Opus   | Anthropic | 91.2% | edit + test | Apr 25, 2026 |
| 02 | GPT-5.5           | OpenAI    | 89.7% | edit + test | Apr 25, 2026 |
| 03 | GPT-5             | OpenAI    | 85.4% | edit + test | Sep 1, 2025  |
| 04 | Gemini 3 Pro      | Google    | 84.8% | edit + test | Apr 25, 2026 |
| 05 | Claude 4.6 Sonnet | Anthropic | 83.6% | edit + test | Feb 25, 2026 |
| 06 | DeepSeek-V4-Pro   | DeepSeek  | 80.1% | edit + test | Apr 25, 2026 |
| 07 | Qwen3.7-Max       | Alibaba   | 78.4% | edit + test | May 20, 2026 |
| 08 | Grok 4            | xAI       | 72.5% | edit + test | Jul 15, 2025 |
| 09 | Qwen3.6-27B       | Alibaba   | 70.2% | edit + test | Apr 25, 2026 |

Other model leaderboards

  • gpt.buzz Composite Model Index: cross-benchmark ranking.
  • MMLU-Pro: Multitask language understanding across 14 disciplines. The harder, contamination-resistant successor to MMLU, with 10-option multiple choice and stronger distractors.
  • GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
  • HumanEval: OpenAI's 164 Python programming problems with hidden unit tests. The classic code-generation pass@1 benchmark.
  • AIME 2025: American Invitational Mathematics Examination. Competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning models in the 50s.
  • LiveCodeBench: Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
  • MMMU: Massive Multi-discipline Multimodal Understanding. 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).