LiveCodeBench

LiveCodeBench (LCB) evaluates each model only on problems published after that model's training cutoff, which addresses the contamination concern that plagues HumanEval and MBPP. Think of it as the "honest" code benchmark: top scores run 10–20 points lower than on HumanEval, largely because the problems cannot have been memorized.

Category: coding · Source: livecodebench.github.io ↗

#	Model	Developer	Score	Setting	Measured	Source
01	Claude 4.7 Opus	Anthropic	80.2%	pass@1, post-cutoff	Apr 30, 2026	link ↗
02	GPT-5.5	OpenAI	78.9%	pass@1, post-cutoff	Apr 30, 2026	link ↗
03	Gemini 3 Pro	Google	76.1%	pass@1, post-cutoff	Apr 30, 2026	link ↗
04	GPT-5	OpenAI	74.5%	pass@1, post-cutoff	Sep 15, 2025	link ↗
05	Claude 4.6 Sonnet	Anthropic	73.8%	pass@1, post-cutoff	Feb 28, 2026	link ↗
06	DeepSeek-V4-Pro	DeepSeek	72.4%	pass@1, post-cutoff	Apr 30, 2026	link ↗
07	Qwen3.7-Max	Alibaba	71.0%	pass@1, post-cutoff	May 20, 2026	link ↗
08	Grok 4	xAI	65.3%	pass@1, post-cutoff	Jul 15, 2025	link ↗
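
The "pass@1, post-cutoff" setting in the table can be sketched roughly as follows. This is a minimal illustration, not LiveCodeBench's actual harness: the Problem record, its field names, the function name, and the sample data are all hypothetical, and it assumes one generated solution per problem and a known training-cutoff date for the model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    """Hypothetical record; field names are illustrative, not LCB's schema."""
    problem_id: str
    released: date  # date the problem was first published
    passed: bool    # did the model's single sample pass every test?


def post_cutoff_pass_at_1(problems: list[Problem], cutoff: date) -> float:
    """Return pass@1, as a percentage, over problems released after `cutoff`."""
    # Keep only problems the model could not have seen during training.
    eligible = [p for p in problems if p.released > cutoff]
    if not eligible:
        raise ValueError("no problems released after the cutoff")
    # With one sample per problem, pass@1 is the fraction solved.
    solved = sum(p.passed for p in eligible)
    return 100.0 * solved / len(eligible)


# Made-up data for illustration only.
problems = [
    Problem("a", date(2025, 1, 10), True),   # released before cutoff: excluded
    Problem("b", date(2025, 6, 1), True),
    Problem("c", date(2025, 7, 15), False),
    Problem("d", date(2025, 8, 2), True),
    Problem("e", date(2025, 9, 9), True),
]
score = post_cutoff_pass_at_1(problems, cutoff=date(2025, 3, 31))
print(f"{score:.1f}%")  # prints "75.0%"
```

Because the eligible problem set depends on each model's cutoff, two models in the table may have been scored on different problem windows, which is worth keeping in mind when comparing rows with different "Measured" dates.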

Other model leaderboards

gpt.buzz Composite Model Index — cross-benchmark ranking
MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
HumanEval — OpenAI's 164 hand-written Python programming problems, each checked against unit tests. The classic code-generation pass@1 benchmark.
Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).