gpt.buzz

GPQA Diamond

GPQA Diamond is the "graduate-level, Google-proof" science benchmark: 198 questions written by domain PhDs to resist web search. Top frontier models now score in the 85–91% range, comfortably above the ~65% human-expert baseline. It is the benchmark that gave the first hard evidence of superhuman performance on specialized scientific reasoning.

Category: reasoning · Source: github.com

| #  | Model             | Developer | Score | Setting                | Measured     |
|----|-------------------|-----------|-------|------------------------|--------------|
| 01 | Gemini 3.5        | Google    | 91.2% | 0-shot CoT, Deep Think | May 20, 2026 |
| 02 | Claude 4.7 Opus   | Anthropic | 90.1% | 0-shot CoT             | Apr 16, 2026 |
| 03 | GPT-5.5           | OpenAI    | 89.4% | 0-shot CoT             | Apr 23, 2026 |
| 04 | Gemini 3 Pro      | Google    | 88.6% | 0-shot CoT             | Apr 22, 2026 |
| 05 | o3                | OpenAI    | 87.7% | 0-shot CoT             | Apr 16, 2025 |
| 06 | Grok 4            | xAI       | 87.5% | 0-shot CoT             | Jul 9, 2025  |
| 07 | GPT-5             | OpenAI    | 87.3% | 0-shot CoT             | Aug 7, 2025  |
| 08 | Claude 4.6 Sonnet | Anthropic | 85.8% | 0-shot CoT             | Feb 17, 2026 |
| 09 | Gemini 2.5 Pro    | Google    | 84.0% | 0-shot CoT             | Jun 17, 2025 |
| 10 | Qwen3.7-Max       | Alibaba   | 83.0% | 0-shot CoT             | May 20, 2026 |
| 11 | DeepSeek-V4-Pro   | DeepSeek  | 82.4% | 0-shot CoT             | Apr 22, 2026 |

Other model leaderboards

  • gpt.buzz Composite Model Index: cross-benchmark ranking.
  • MMLU-Pro: Multitask language understanding across 14 disciplines; the harder, contamination-resistant successor to MMLU. 10-option multiple choice with stronger distractors.
  • HumanEval: OpenAI's 164 Python programming problems with hidden unit tests; the classic code-generation pass@1 benchmark.
  • Aider Polyglot: Multi-language code-editing benchmark. Solve real Exercism problems across Python, JavaScript, Rust, Go, Java, and C++ via edit-and-test loops.
  • AIME 2025: American Invitational Mathematics Examination. Competition math problems requiring multi-step reasoning. Reasoning-tier models score in the 90s; non-reasoning models in the 50s.
  • LiveCodeBench: Contamination-free code benchmark that uses only problems published after a model's training cutoff. Refreshed monthly.
  • MMMU: Massive Multi-discipline Multimodal Understanding. 11.5k college-level multimodal questions across 30 subjects (art, business, science, medicine, etc.).