gpt.buzz

HumanEval

HumanEval remains the most-cited code-generation benchmark even though frontier models have saturated it (above 95% for the top six entries below). It stays in the index as a baseline filter, since any serious model has to clear ~90%, and because vendor model cards still report it. For real-world signal, weight Aider Polyglot more heavily.

Category: coding · Source: github.com

| # | Model | Score | Setting | Measured | Source |
|---|---|---|---|---|---|
| 01 | Claude 4.7 Opus (Anthropic) | 97.4% | pass@1 | Apr 16, 2026 | link ↗ |
| 02 | GPT-5.5 (OpenAI) | 96.8% | pass@1 | Apr 23, 2026 | link ↗ |
| 03 | Gemini 3 Pro (Google) | 96.5% | pass@1 | Apr 22, 2026 | link ↗ |
| 04 | Claude 4.6 Sonnet (Anthropic) | 95.9% | pass@1 | Feb 17, 2026 | link ↗ |
| 05 | GPT-5 (OpenAI) | 95.2% | pass@1 | Aug 7, 2025 | link ↗ |
| 06 | DeepSeek-V4-Pro (DeepSeek) | 95.1% | pass@1 | Apr 22, 2026 | link ↗ |
| 07 | Qwen3.7-Max (Alibaba) | 93.9% | pass@1 | May 20, 2026 | link ↗ |
| 08 | Qwen3.6-27B (Alibaba) | 91.5% | pass@1 | Apr 22, 2026 | link ↗ |
| 09 | Mistral Large 2 (Mistral) | 89.0% | pass@1 | Jul 24, 2024 | link ↗ |
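
The Setting column reports pass@1. As a rough sketch of how such a figure is commonly computed, the snippet below uses the unbiased pass@k estimator from the original HumanEval paper, 1 - C(n-c, k) / C(n, k), averaged over problems. The function names and the sample counts are illustrative assumptions; the vendor results listed here may have been produced with different sampling settings (for example, a single greedy sample per problem).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: evaluation budget (k=1 gives pass@1)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run: 4 problems, one sample each, 3 of them solved.
    single_sample_run = [(1, 1), (1, 0), (1, 1), (1, 1)]
    print(benchmark_pass_at_k(single_sample_run, k=1))  # 0.75
```

With one sample per problem, pass@1 reduces to the plain fraction of problems solved, which is why a score near 97% on HumanEval's 164 problems corresponds to only a handful of failures.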

Other model leaderboards

  • gpt.buzz Composite Model Index: cross-benchmark ranking.
  • MMLU-Pro: Multitask language understanding across 14 disciplines; the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
  • GPQA Diamond: PhD-level science questions in biology, physics, and chemistry, written by domain experts. The hardest subset of GPQA; humans with PhDs in the subject score ~65%.
  • Aider Polyglot: Multi-language code edit benchmark. Solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
  • AIME 2025: American Invitational Mathematics Examination. Competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning models in the 50s.
  • LiveCodeBench: Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
  • MMMU: Massive Multi-discipline Multimodal Understanding. 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).