gpt.buzz

gpt.buzz Model Index

Composite capability ranking across seven benchmarks: MMLU-Pro, GPQA Diamond, AIME 2025, Aider Polyglot, LiveCodeBench, HumanEval, and MMMU. The weighting favors benchmarks that resist contamination (Aider, LiveCodeBench) and benchmarks with broad reasoning coverage (GPQA, MMLU-Pro).

Models with no score on any tracked benchmark are omitted. See the methodology page for the exact weights.
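As a rough illustration of how a weighted composite of this kind could be computed, here is a minimal Python sketch. Everything in it is an assumption: the `WEIGHTS` values are made up (only their ordering follows the stated priorities), the `composite` function name is hypothetical, and treating a missing benchmark as zero is a guess rather than the documented rule. The methodology page remains the authority on the real formula.

```python
from typing import Mapping, Optional

# Hypothetical weights, for illustration only. The real values live on the
# methodology page. Aider and LiveCodeBench get the largest share, GPQA and
# MMLU-Pro the next largest, mirroring the priorities stated above.
WEIGHTS = {
    "MMLU-Pro": 3,
    "GPQA-D": 3,
    "HumanEval": 2,
    "Aider": 4,
    "AIME-25": 2,
    "LiveCB": 4,
    "MMMU": 2,
}


def composite(scores: Mapping[str, Optional[float]]) -> float:
    """Weighted mean of benchmark scores on a 0-100 scale.

    Assumption: a benchmark with no score (absent or None) contributes 0,
    so sparsely tested models are pulled toward the bottom.
    """
    total_weight = sum(WEIGHTS.values())
    weighted = sum(w * (scores.get(name) or 0.0) for name, w in WEIGHTS.items())
    return round(weighted / total_weight, 1)


# Scores taken from the first row of the table (Claude 4.7 Opus).
top_row = {
    "MMLU-Pro": 88.2,
    "GPQA-D": 90.1,
    "HumanEval": 97.4,
    "Aider": 91.2,
    "AIME-25": 93.5,
    "LiveCB": 80.2,
    "MMMU": 79.0,
}

if __name__ == "__main__":
    print(composite(top_row))
```

With these made-up weights the first row comes out at roughly 88, not the published 98.5, so the real index evidently uses different weights, a normalization step, or both. The sketch only shows the general shape of a weighted composite.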

Benchmark columns, in order: MMLU-Pro, GPQA-D, HumanEval, Aider, AIME-25, LiveCB, MMMU. Scores appear in that order, skipping any benchmark a model has no score for.

#   Model              Developer  Composite  Benchmark scores
01  Claude 4.7 Opus    Anthropic  98.5       88.2  90.1  97.4  91.2  93.5  80.2  79.0
02  GPT-5.5            OpenAI     97.8       87.6  89.4  96.8  89.7  94.2  78.9  78.4
03  Gemini 3 Pro       Google     96.5       87.1  88.6  96.5  84.8  92.8  76.1  84.3
04  GPT-5              OpenAI     94.3       85.1  87.3  95.2  85.4  91.0  74.5  74.1
05  Claude 4.6 Sonnet  Anthropic  86.2       85.4  85.8  95.9  83.6  87.4  73.8
06  DeepSeek-V4-Pro    DeepSeek   84.4       84.2  82.4  95.1  80.1  88.6  72.4
07  Qwen3.7-Max        Alibaba    84.1       83.7  83.0  93.9  78.4  90.4  71.0
08  Grok 4             xAI        74.5       82.8  87.5  72.5  86.1  65.3
09  Gemini 3.5         Google     63.0       88.9  91.2  96.1  86.0
10  Qwen3.6-27B        Alibaba    44.9       78.9  91.5  70.2  70.3
11  Gemini 2.5 Pro     Google     44.1       84.1  84.0  72.4
12  o3                 OpenAI     33.1       87.7  88.9
13  Llama 4 Maverick   Meta       24.5       80.5  68.5
14  DeepSeek-R1        DeepSeek   12.5       79.8
15  Gemini Omni        Google     7.7        82.5
16  Mistral Large 2    Mistral    7.3        89.0

Per-benchmark leaderboards