Robo2u

Benchmark the Benchmarks

How each benchmark performs as a measurement instrument. Computed across 56 ranked models. All correlations are Spearman ρ, reported with pair count (n). Consensus is computed leave-one-out: the correlated rank excludes the benchmark being evaluated, so a benchmark can't boost its own score.
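The leave-one-out idea can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual pipeline: the consensus here is assumed to be the mean percentile rank across all other benchmarks, and the benchmark names, model names, and scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one side has no rank variation
    return cov / (var_x * var_y) ** 0.5


def loo_consensus_rho(scores, target):
    """Correlate `target` with a consensus that excludes `target` itself.

    scores: {benchmark: {model: score}}. Returns (rho, n_pairs).
    Assumption: consensus = mean percentile rank over the peer benchmarks.
    """
    peers = [b for b in scores if b != target]
    pct = {}
    for b in peers:
        models = list(scores[b])
        ranks = average_ranks([scores[b][m] for m in models])
        denom = max(len(models) - 1, 1)
        pct[b] = {m: (r - 1) / denom for m, r in zip(models, ranks)}
    xs, ys = [], []
    for model, score in scores[target].items():
        peer_pcts = [pct[b][model] for b in peers if model in pct[b]]
        if peer_pcts:  # keep only models scored by at least one peer
            xs.append(score)
            ys.append(mean(peer_pcts))
    if len(xs) < 2:
        return None, len(xs)
    return spearman(xs, ys), len(xs)


# Hypothetical toy data: three benchmarks, four models.
toy_scores = {
    "bench_a": {"m1": 90, "m2": 80, "m3": 70, "m4": 60},
    "bench_b": {"m1": 0.9, "m2": 0.7, "m3": 0.8, "m4": 0.5},
    "bench_c": {"m1": 55, "m2": 50, "m3": 40},
}
rho, n = loo_consensus_rho(toy_scores, "bench_a")
print(f"rho={rho:.2f} n={n}")
```

Reporting n alongside ρ matters because each pair requires a model scored on both sides; sparse coverage shrinks n quickly.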

Benchmark | Coverage | ρ w/ Arena | ρ w/ Peers (LOO) | Unique Signal | Top-5 Spread | Dynamic Range
Arena (arena_elo) | 4/56 (7%) | n/a | 0.80 (n=4, low) | 0.20 (max peer ρ = 0.80) | n/a | 3%
MMLU-Pro (mmlu_pro) | 30/56 (54%) | 0.60 (n=4, low) | 0.77 (n=30) | 0.18 (max peer ρ = 0.82) | 4.9% (saturating) | 28%
SWE-Bench (swe_bench) | 22/56 (39%) | 0.40 (n=4, low) | 0.71 (n=20) | 0.16 (max peer ρ = 0.84) | 2.2% (saturating) | 59%
GPQA (gpqa) | 35/56 (63%) | 0.80 (n=4, low) | 0.91 (n=33) | 0.16 (max peer ρ = 0.84) | 3.0% (saturating) | 54%

Color key: ≥0.80, 0.50–0.80, 0.30–0.50, <0.30.

A low-n correlation (n<10) is rendered in italics and is unreliable; treat it as directional only. Unique signal flags redundancy: a benchmark scoring <0.15 is mostly measuring what another benchmark already captures. Top-5 spread below 5% means the frontier is bunched: the benchmark has stopped discriminating at the top.
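The three warning rules above can be expressed as a small check. In this sketch, the definition of unique signal as 1 − max peer ρ is an assumption inferred from the table (for example, 0.20 appears next to max peer ρ = 0.80); the function names are illustrative.

```python
REDUNDANT_BELOW = 0.15    # unique-signal threshold quoted in the text
BUNCHED_BELOW_PCT = 5.0   # top-5 spread threshold quoted in the text
LOW_N_BELOW = 10          # pair-count threshold quoted in the text


def unique_signal(max_peer_rho):
    """Assumed definition: 1 - max peer rho, rounded as in the table."""
    return round(1.0 - max_peer_rho, 2)


def warning_flags(max_peer_rho, top5_spread_pct, n_pairs):
    """Return the warnings that apply to one benchmark row."""
    flags = []
    if unique_signal(max_peer_rho) < REDUNDANT_BELOW:
        flags.append("redundant")
    if top5_spread_pct is not None and top5_spread_pct < BUNCHED_BELOW_PCT:
        flags.append("frontier bunched")
    if n_pairs < LOW_N_BELOW:
        flags.append("low-n correlation")
    return flags


# MMLU-Pro row: max peer rho 0.82, top-5 spread 4.9%, Arena correlation n=4.
print(unique_signal(0.82))
print(warning_flags(0.82, 4.9, 4))
```

Under this reading, none of the four benchmarks in the table falls below the 0.15 redundancy threshold, but all three with a reported top-5 spread are flagged as bunched.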

Benchmarks for Datasets

Meta-benchmarks of training data and eval datasets, distinct from model leaderboards. Pretraining-data benchmarks hold model & budget fixed and vary the corpus; eval-quality tools audit contamination and coverage.

Benchmark (maintainer) | Category | What it measures | Public
DCLM (DataComp / MLCommons) | Pretraining data | Downstream task scores at fixed model+compute; crowns the best training-data recipe | Yes
FineWeb / FineWeb-Edu (Hugging Face) | Pretraining data | Ablation-based quality signals across CommonCrawl-derived corpora | Yes
Dolma (Allen AI) | Pretraining data | Open pretraining corpus + provenance / reproducibility study | Yes
RedPajama-v2 (Together AI) | Pretraining data | 30T-token corpus with per-document quality scores for ablation | Yes
Nemotron-CC (NVIDIA) | Pretraining data | 6.3T-token high-quality CommonCrawl with quality classifiers | Yes
The Pile (EleutherAI) | Pretraining data | Diverse 800GB corpus; baseline against which later corpora compare | Yes
DataPerf (MLCommons) | Data curation | Benchmarks for data selection, cleaning, and slicing across tasks | Yes
MTEB (Hugging Face + Contextual) | Embedding eval | 56+ embedding datasets; quasi-canonical dataset suite for retrieval models | Yes
BIG-bench (Google + community) | Eval federation | 200+ tasks; 'benchmark of benchmarks' for capability breadth | Yes
HELM (Stanford CRFM) | Eval framework | Holistic eval across ~30 datasets with per-scenario breakdowns | Yes
SWE-bench Verified (Princeton + OpenAI) | Coding eval | Curated subset of SWE-bench where maintainer labels verify correctness | Yes
HarmBench (CAIS + academia) | Safety eval | Standardised set for refusal / red-team evaluation | Yes
JailbreakBench (academic consortium) | Safety eval | Reproducible jailbreak evaluation dataset + leaderboard | Yes
Contamination audits (various) | Eval integrity | Detect leakage of eval items into training data (Oren et al. etc.) | Mixed

Live Saturation (computed from 512 models)

For each benchmark, the top score and top-10 mean. Benchmarks >95% are dead: frontier models can't be distinguished.

Benchmark | Status | N models | Top | Top-10 mean | Median | Spread (top − median)
MMLU-Pro | Near ceiling | 345 | 89.8% | 88.5% | 75.2% | 14.6%
GPQA Diamond | Near ceiling | 481 | 94.1% | 92.2% | 65.9% | 28.2%
HLE | Active | 477 | 44.7% | 40.7% | 6.1% | 38.6%
LiveCodeBench | Near ceiling | 343 | 91.7% | 88.8% | 42.4% | 49.3%
SciCode | Active | 475 | 58.9% | 55.3% | 30.8% | 28.1%
MATH-500 | Saturated | 201 | 99.4% | 99.0% | 83.5% | 15.9%
AIME | Saturated | 194 | 95.7% | 92.2% | 23.0% | 72.7%
AIME-25 | Saturated | 269 | 99.0% | 96.5% | 53.3% | 45.7%
IFBench | Maturing | 409 | 82.9% | 80.0% | 43.1% | 39.8%
τ²-Bench | Saturated | 401 | 98.8% | 97.8% | 40.4% | 58.4%
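The summary columns are simple order statistics. The sketch below computes them for one benchmark. It assumes the ">95%" rule applies to the top score, which matches every row labelled Saturated in the table; the thresholds behind the other status labels are not stated, so they are not modelled. The scores are hypothetical.

```python
from statistics import mean, median


def saturation_summary(scores, dead_above=95.0):
    """Summarise one benchmark's score distribution (scores in percent)."""
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    mid = median(ranked)
    return {
        "n_models": len(ranked),
        "top": top,
        "top10_mean": mean(ranked[:10]),  # uses all scores if fewer than 10
        "median": mid,
        "spread": top - mid,              # the "top - median" column
        "dead": top > dead_above,         # assumed: rule applies to top score
    }


# Hypothetical scores for five models on one benchmark.
summary = saturation_summary([50.0, 96.0, 80.0, 70.0, 90.0])
print(summary)
```

A large spread with a saturated top (as in the AIME row) means the benchmark still separates weaker models even though it no longer separates the frontier.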

Cross-Benchmark Correlation (Spearman ρ)

Two benchmarks with ρ ≥ 0.85 are essentially redundant: they rank models the same way. Low ρ indicates the benchmarks measure different capabilities. Cells colored: strong (≥0.85) · high (≥0.7) · moderate · weak

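vs | MMLU-Pro | GPQA Diamond | HLE | LiveCodeBench | SciCode | MATH-500 | AIME | AIME-25 | IFBench | τ²-Bench
MMLU-Pro | 1.00 | 0.97 | 0.69 | 0.89 | 0.91 | 0.89 | 0.86 | 0.80 | 0.78 | 0.71
GPQA Diamond | 0.97 | 1.00 | 0.84 | 0.93 | 0.89 | 0.89 | 0.87 | 0.87 | 0.82 | 0.76
HLE | 0.69 | 0.84 | 1.00 | 0.71 | 0.72 | 0.56 | 0.58 | 0.81 | 0.75 | 0.68
LiveCodeBench | 0.89 | 0.93 | 0.71 | 1.00 | 0.83 | 0.91 | 0.90 | 0.90 | 0.80 | 0.66
SciCode | 0.91 | 0.89 | 0.72 | 0.83 | 1.00 | 0.74 | 0.71 | 0.73 | 0.77 | 0.71
MATH-500 | 0.89 | 0.89 | 0.56 | 0.91 | 0.74 | 1.00 | 0.96 | 0.91 | 0.65 | 0.57
AIME | 0.86 | 0.87 | 0.58 | 0.90 | 0.71 | 0.96 | 1.00 | 0.90 | 0.61 | 0.55
AIME-25 | 0.80 | 0.87 | 0.81 | 0.90 | 0.73 | 0.91 | 0.90 | 1.00 | 0.75 | 0.62
IFBench | 0.78 | 0.82 | 0.75 | 0.80 | 0.77 | 0.65 | 0.61 | 0.75 | 1.00 | 0.74
τ²-Bench | 0.71 | 0.76 | 0.68 | 0.66 | 0.71 | 0.57 | 0.55 | 0.62 | 0.74 | 1.00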
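Applying the ρ ≥ 0.85 redundancy rule to the matrix is a simple filter over the upper triangle. The sketch below uses a four-benchmark subset of the values above; the function name is illustrative.

```python
# Upper-triangle entries copied from the matrix above (subset of four benchmarks).
RHO = {
    ("MMLU-Pro", "GPQA Diamond"): 0.97,
    ("MMLU-Pro", "HLE"): 0.69,
    ("MMLU-Pro", "IFBench"): 0.78,
    ("GPQA Diamond", "HLE"): 0.84,
    ("GPQA Diamond", "IFBench"): 0.82,
    ("HLE", "IFBench"): 0.75,
}


def redundant_pairs(rho, threshold=0.85):
    """Return benchmark pairs whose Spearman rho meets the redundancy threshold."""
    return [pair for pair, value in rho.items() if value >= threshold]


print(redundant_pairs(RHO))
# -> [('MMLU-Pro', 'GPQA Diamond')]
```

In this subset only MMLU-Pro and GPQA Diamond cross the threshold; GPQA Diamond and HLE at 0.84 fall just short, which shows how sensitive the label is to the exact cutoff.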

Per-Provider Mean Score (does this benchmark favor any creator?)

Mean score per benchmark for each model creator's portfolio. Big differences across columns within a row → that benchmark may have a provider bias (or that provider's models concentrate at one capability tier).

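Benchmark | OpenAI | anthropic | google | deepseek | meta | alibaba | mistral | xAI
MMLU-Pro | 78.4 | 75.5 | 69.1 | 75.7 | 52.0 | 71.3 | 63.4 | 78.7
GPQA Diamond | 73.4 | 67.5 | 59.6 | 67.7 | 40.5 | 62.7 | 50.0 | 75.1
HLE | 15.9 | 12.4 | 10.2 | 12.9 | 7.0 | 9.2 | 5.0 | 16.3
LiveCodeBench | 64.2 | 45.8 | 41.4 | 52.1 | 17.0 | 44.0 | 31.0 | 55.6
SciCode | 40.8 | 38.5 | 28.7 | 34.7 | 18.8 | 26.6 | 25.2 | 38.1
MATH-500 | 87.7 | 70.8 | 82.1 | 88.2 | 52.3 | 86.3 | 64.9 | 87.3
AIME | 50.2 | 24.1 | 35.6 | 55.3 | 11.1 | 42.4 | 15.0 | 48.8
AIME-25 | 66.7 | 60.1 | 47.6 | 62.4 | 6.4 | 51.9 | 29.5 | 66.7
IFBench | 60.0 | 48.2 | 46.4 | 44.8 | 37.4 | 43.7 | 34.5 | 54.4
τ²-Bench | 59.6 | 66.5 | 34.1 | 51.9 | 19.8 | 52.3 | 30.4 | 76.4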
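One rough way to quantify "big differences across columns within a row" is the gap between the best and worst provider mean. The sketch below does this for two rows of the table. The range is an assumed summary statistic, not one the page reports, and it cannot by itself separate provider bias from differences in capability tier.

```python
# Two rows copied from the per-provider table above.
PROVIDER_MEANS = {
    "MMLU-Pro": {"OpenAI": 78.4, "anthropic": 75.5, "google": 69.1, "deepseek": 75.7,
                 "meta": 52.0, "alibaba": 71.3, "mistral": 63.4, "xAI": 78.7},
    "AIME": {"OpenAI": 50.2, "anthropic": 24.1, "google": 35.6, "deepseek": 55.3,
             "meta": 11.1, "alibaba": 42.4, "mistral": 15.0, "xAI": 48.8},
}


def provider_range(means):
    """Return (best provider, worst provider, gap in points) for one benchmark."""
    best = max(means, key=means.get)
    worst = min(means, key=means.get)
    return best, worst, round(means[best] - means[worst], 1)


for benchmark, means in PROVIDER_MEANS.items():
    print(benchmark, provider_range(means))
```

Here AIME shows a much wider gap (44.2 points) than MMLU-Pro (26.7 points), so it is the row where provider-level differences are most pronounced.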

Why some benchmarks are more legit than others

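A high score on a benchmark only tells you something if the benchmark itself is trustworthy. Three things move a benchmark up or down the legitimacy scale: whether the test set is private or rotating, whether answers are verifiable, and how close the benchmark is to saturation.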

A (Strong): Private/rotating, verifiable, non-saturated. Use to compare frontier capability.
B (Moderate): Verifiable but public, OR private but subjective. Useful with caution.
C (Weak): Public + gameable format OR approaching saturation. Marginal info.
D (Dead): Saturated + heavily memorized. Don't use to rank current models.
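The grade definitions can be read as a decision rule. The sketch below is one possible encoding: the trait names and the order in which the rules are checked (D, then C, then A, with B as the fallback) are assumptions, and a hand-assigned grade may weigh these factors differently.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkTraits:
    private_or_rotating: bool
    verifiable: bool
    saturation: str              # "none", "approaching", or "saturated"
    gameable_format: bool = False
    heavily_memorized: bool = False


def legitimacy_grade(t):
    """Map traits to a grade; rule precedence is an assumption of this sketch."""
    if t.saturation == "saturated" and t.heavily_memorized:
        return "D"  # saturated + heavily memorized
    if t.saturation != "none" or (not t.private_or_rotating and t.gameable_format):
        return "C"  # approaching saturation, or public + gameable format
    if t.private_or_rotating and t.verifiable:
        return "A"  # private/rotating, verifiable, non-saturated
    return "B"      # verifiable but public, or private but subjective


print(legitimacy_grade(BenchmarkTraits(True, True, "none")))
print(legitimacy_grade(BenchmarkTraits(False, True, "saturated", heavily_memorized=True)))
```

Checking D and C first means saturation overrides an otherwise strong design, which follows the text's point that a saturated benchmark carries little ranking information however well it is built.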

Benchmark Saturation Tracker

Data as of May 11, 2026
Hendrycks et al. | Grade D | Open | Competition math | 2021/3 | Saturated | Leader: GPT 5.2 | Succeeded by: AIME, FrontierMath
Top 98.1
12.5K problems from AMC/AIME/Putnam. Saturated by reasoning models in 2024.

OpenAI | Grade D | Code | Code (functional) | 2021/7 | Saturated | Leader: GPT 5.4 | Succeeded by: SWE-Bench
Top 98.0
164 hand-written Python problems. Effectively solved; superseded by SWE-Bench.

OpenAI | Grade D | Open | Grade-school math | 2021/10 | Saturated | Leader: GPT 5.4 | Succeeded by: MATH
Top 97.0
8.5K word problems. Frontier all >95%.

Google | Grade D | Mixed | Diverse reasoning | 2022/10 | Saturated | Leader: Claude Opus 4.6
Top 94.5
23 challenging tasks from BIG-Bench. Effectively solved.

Hendrycks et al. | Grade D | MCQ | General knowledge | 2020/9 | Saturated | Leader: Claude Opus 4.6 | Succeeded by: MMLU-Pro
Top 92.0
57 subjects, multiple choice. Top frontier models all >90%. Successor MMLU-Pro raised difficulty.

Succinct Labs | Grade C | Det | AI image detection | 2026/2 | New | Leader: Gemini 3 Pro Image (clean)
Top 100.0
100% clean → 5-7% under simple perturbation. Headline: detectors not adversarially robust.

MAA | Grade C | Open | Olympiad math | 2024/2 | Saturated | Leader: GPT 5.2 | Succeeded by: FrontierMath, IMO problems
Top 96.1
30 problems, used as standard reasoning eval since o1. Now saturated.

Rein et al. | Grade C | MCQ | Graduate science | 2023/11 | Saturated | Leader: Claude Opus 4.6
Top 89.0
PhD-level bio/chem/physics. Saturated by early 2025 per LessWrong (Apr 2026).

TIGER-Lab | Grade C | MCQ | General knowledge (harder) | 2024/6 | Saturating | Leader: Claude Opus 4.6
Top 88.0
Removed easy/contaminated questions; added complex reasoning. Frontier converging.

ARC Prize / Chollet | Grade C | Open | Abstract reasoning | 2019/11 | Saturated | Leader: o3 (high compute) | Succeeded by: ARC-AGI-2
Top 87.5
Took 5 years and o3's high-compute mode. Successor ARC-AGI-2 launched 2025.

AllenAI | Grade B | Pref | Real user queries | 2024/4 | Active | Leader: GPT 5.4
Top 1285
ELO from real WildChat queries. Resists saturation by updating prompt distribution.

LlamaIndex | Grade B | Open | Document parsing | 2026/4 | New | Leader: LlamaParse Agentic
Top 84.9
~2K enterprise pages, 167K test rules across 5 dimensions.

Anthropic / Stanford | Grade B | Agent | Terminal agent | 2025/12 | Saturating | Leader: GPT 5.5
Top 82.7
Multi-step CLI tasks. v1 saturated within 6 months; v2 already at 80%+.

OpenAI / Princeton | Grade B | Code | Real-world SWE | 2024/8 | Saturating | Leader: Devin 3 | Succeeded by: SWE-Bench Pro
Top 82.1
500 human-verified GitHub issues. Frontier ≥75% across labs.

ARC Prize / Chollet | Grade B | Open | Abstract reasoning | 2025/3 | Saturating | Leader: GPT 5.4 | Succeeded by: ARC-AGI-3
Top 78.8
Faster than v1 to climb. v3 announced for late 2026.

Princeton | Grade B | Code | Real-world SWE (harder) | 2025/8 | Saturating | Leader: GLM 5.1
Top 58.4
Multi-file edits, harder issues. Latent Space (Apr 2026): 'soon to be saturated'.

Together AI | Grade B | Open | Open math problem-solving (multi-agent) | 2026/3 | Active (refreshing) | Leader: alpha_omega_agents (Kissing Number D11: 604)
Living benchmark: new problems added; resists saturation by design. 11 new SOTAs in first 5 weeks.

OpenAI | Grade A | Agent | Economic agentic work | 2025/9 | Saturating | Leader: GPT 5.5
Top 84.9
44 occupations from top-9 GDP-contributing US industries. New framing: % expert-equivalent on real jobs.

METR | Grade A | Agent | Autonomous task duration | 2025/3 | Saturating | Leader: Claude Opus 4.6
Top 80.0
Claude Opus 4.6: succeeds on >80% of suite; 50% time horizon = 12h, 95% UCB = 60h. Frontier benchmark for agent autonomy.

Abacus AI | Grade A | Mixed | Contamination-free | 2024/6 | Active (refreshing) | Leader: GPT 5.4
Top 79.2
Monthly question refresh prevents saturation; always-current snapshot.

Epoch AI | Grade A | Open | Research-level math | 2024/11 | Active | Leader: GPT 5.4 Pro
Top 50.0
Hundreds of original problems graded by Tao/Gowers et al. Designed to resist saturation. May 2026: DeepMind multi-agent 'AI co-mathematician' hit 48% on Tier 4 (hardest tier); evaluators noted one proof could plausibly form a PhD thesis chapter.

CAIS + Scale | Grade A | Mixed | Frontier knowledge | 2025/1 | Active | Leader: GPT 5.5
Top 31.0
3,000 questions from 1,000 expert authors. Designed as 'the last benchmark we'll need.'

Aisa Group / Thoughtful Lab | Grade A | Agent | Agentic post-training (CLI agents fine-tune base LLMs) | 2026/4 | Active | Leader: Opus 4.6 (Claude Code)
Top 23.2
Frontier CLI harnesses (Claude Code, Codex CLI, Gemini CLI, OpenCode) get a base LLM (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) and 10h on one H100 to post-train it for AIME/GPQA/GSM8K/HumanEval/Arena Hard/BFCL/HealthBench. Best agent 23.2 vs official instruct 51.1 vs base zero-shot 7.5. Anti-cheat: agent-as-judge detects test-set leakage, harness tampering, model substitution.

ARC Prize / Chollet | Grade A | Agent | Abstract reasoning (interactive) | 2026/9 | Announced
Interactive games / multi-step exploration; not yet released.