HumanEval

HumanEval remains the most-cited code-generation benchmark even though frontier models have saturated it: every flagship scores above 95%. It stays in the index as a baseline filter (any serious model has to clear roughly 90%) and because vendor model cards still report it. For real-world signal, weight Aider Polyglot more heavily.

Category: coding · Source: github.com ↗

#	Model	Vendor	Score	Setting	Measured	Source
01	Claude 4.7 Opus	Anthropic	97.4%	pass@1	Apr 16, 2026	link ↗
02	GPT-5.5	OpenAI	96.8%	pass@1	Apr 23, 2026	link ↗
03	Gemini 3 Pro	Google	96.5%	pass@1	Apr 22, 2026	link ↗
04	Claude 4.6 Sonnet	Anthropic	95.9%	pass@1	Feb 17, 2026	link ↗
05	GPT-5	OpenAI	95.2%	pass@1	Aug 7, 2025	link ↗
06	DeepSeek-V4-Pro	DeepSeek	95.1%	pass@1	Apr 22, 2026	link ↗
07	Qwen3.7-Max	Alibaba	93.9%	pass@1	May 20, 2026	link ↗
08	Qwen3.6-27B	Alibaba	91.5%	pass@1	Apr 22, 2026	link ↗
09	Mistral Large 2	Mistral	89.0%	pass@1	Jul 24, 2024	link ↗
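
Every row above uses the pass@1 setting. As a rough illustration of what that number means, the sketch below computes the unbiased pass@k estimator introduced alongside HumanEval: for each problem, draw n samples, count the c that pass the unit tests, and estimate 1 - C(n-c, k) / C(n, k); the benchmark score is the mean over all 164 problems. The function names and the sample counts are illustrative assumptions. The table does not say how many samples or which decoding settings each vendor used, so this is not a reconstruction of any listed score.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples
    # contains no passing sample, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes (not real leaderboard data):
# 10 samples each, with 10, 5, and 0 passing.
example = [(10, 10), (10, 5), (10, 0)]
print(f"pass@1 = {benchmark_score(example, k=1):.3f}")
```

With k = 1 the estimator reduces to c / n per problem, so a single-sample run simply reports the fraction of problems solved on the first attempt.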

Other model leaderboards

gpt.buzz Composite Model Index — cross-benchmark ranking
MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
LiveCodeBench — Contamination-free code benchmark — only problems published AFTER a model's training cutoff. Refreshed monthly.
MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).