LiveCodeBench
LiveCodeBench (LCB) evaluates each model only on problems published after its training cutoff, which largely removes the contamination concern that plagues HumanEval and MBPP. It is the "honest" code benchmark: top scores are 10–20 points lower than on HumanEval because none of the problems can have been memorized.
Category: coding · Source: livecodebench.github.io ↗
| # | Model | Score | Setting | Measured | Source |
|---|---|---|---|---|---|
| 01 | Anthropic | 80.2% | pass@1, post-cutoff | Apr 30, 2026 | link ↗ |
| 02 | OpenAI | 78.9% | pass@1, post-cutoff | Apr 30, 2026 | link ↗ |
| 03 | | 76.1% | pass@1, post-cutoff | Apr 30, 2026 | link ↗ |
| 04 | OpenAI | 74.5% | pass@1, post-cutoff | Sep 15, 2025 | link ↗ |
| 05 | Anthropic | 73.8% | pass@1, post-cutoff | Feb 28, 2026 | link ↗ |
| 06 | DeepSeek | 72.4% | pass@1, post-cutoff | Apr 30, 2026 | link ↗ |
| 07 | Alibaba | 71.0% | pass@1, post-cutoff | May 20, 2026 | link ↗ |
| 08 | xAI | 65.3% | pass@1, post-cutoff | Jul 15, 2025 | link ↗ |
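The "pass@1, post-cutoff" setting in the table can be illustrated with a minimal sketch. It assumes each problem carries a publication date and a single pass/fail result for the model's one sampled solution. The `Problem` class, its field names, the dates, and the results are all hypothetical and are not taken from LiveCodeBench's own harness or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    # Hypothetical record; field names are illustrative only.
    problem_id: str
    published: date  # date the problem was first released
    passed: bool     # did the model's single sample pass every test?


def post_cutoff_pass_at_1(problems: list[Problem], training_cutoff: date) -> float:
    """Fraction of post-cutoff problems solved with one sample per problem."""
    # Keep only problems the model could not have seen during training.
    eligible = [p for p in problems if p.published > training_cutoff]
    if not eligible:
        raise ValueError("no problems published after the training cutoff")
    return sum(p.passed for p in eligible) / len(eligible)


# Made-up data for illustration.
problems = [
    Problem("a", date(2024, 1, 10), True),   # before the cutoff: excluded
    Problem("b", date(2024, 7, 1), True),
    Problem("c", date(2024, 8, 15), False),
    Problem("d", date(2024, 9, 30), True),
    Problem("e", date(2024, 10, 5), False),
]
score = post_cutoff_pass_at_1(problems, training_cutoff=date(2024, 6, 1))
print(f"{score:.1%}")  # 2 of the 4 eligible problems pass -> 50.0%
```

Because the eligible set depends on each model's cutoff, two models in the same table may be scored on different problem subsets, which is worth keeping in mind when comparing rows.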
Other model leaderboards
- gpt.buzz Composite Model Index — cross-benchmark ranking
- MMLU-Pro — Multitask language understanding across 14 disciplines — the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
- GPQA Diamond — PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA — humans with PhDs in the subject score ~65%.
- HumanEval — OpenAI's 164 Python programming problems with hidden unit tests — the classic code-generation pass@1 benchmark.
- Aider Polyglot — Multi-language code edit benchmark — solve real Exercism problems across Python, JavaScript, Rust, Go, Java, C++ via edit-and-test loops.
- AIME 2025 — American Invitational Mathematics Examination — competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning in the 50s.
- MMMU — Massive Multi-discipline Multimodal Understanding — 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).