Benchmark Saturation Tracker

Data as of May 11, 2026

#	Benchmark	Creator	Grade	Format	Domain	Date	Status	Top model	Successor	Rank	Top	Notes
1	MATH	Hendrycks et al.	D	Open	Competition math	2021/3	Saturated	GPT 5.2	AIME, FrontierMath	3	98.1	12.5K problems from AMC/AIME/Putnam. Saturated by reasoning models in 2024.
2	HumanEval	OpenAI	D	Code	Code (functional)	2021/7	Saturated	GPT 5.4	SWE-Bench	4	98.0	164 hand-written Python problems. Effectively solved; superseded by SWE-Bench.
3	GSM8K	OpenAI	D	Open	Grade-school math	2021/10	Saturated	GPT 5.4	MATH	5	97.0	8.5K word problems. Frontier all >95%.
4	BIG-Bench Hard	Google	D	Mixed	Diverse reasoning	2022/10	Saturated	Claude Opus 4.6	-	7	94.5	23 challenging tasks from BIG-Bench. Effectively solved.
5	MMLU	Hendrycks et al.	D	MCQ	General knowledge	2020/9	Saturated	Claude Opus 4.6	MMLU-Pro	8	92.0	57 subjects, multiple choice. Top frontier models all >90%. Successor MMLU-Pro raised difficulty.
6	AdversIm	Succinct Labs	C	Det	AI image detection	2026/2	New	Gemini 3 Pro Image (clean)	-	2	100.0	100% clean → 5-7% under simple perturbation. Headline: detectors not adversarially robust.
7	AIME 2024	MAA	C	Open	Olympiad math	2024/2	Saturated	GPT 5.2	FrontierMath, IMO problems	6	96.1	30 problems, used as standard reasoning eval since o1. Now saturated.
8	GPQA Diamond	Rein et al.	C	MCQ	Graduate science	2023/11	Saturated	Claude Opus 4.6	-	9	89.0	PhD-level bio/chem/physics. Saturated by early 2025 per LessWrong (Apr 2026).
9	MMLU-Pro	TIGER-Lab	C	MCQ	General knowledge (harder)	2024/6	Saturating	Claude Opus 4.6	-	10	88.0	Removed easy/contaminated questions; added complex reasoning. Frontier converging.
10	ARC-AGI-1	ARC Prize / Chollet	C	Open	Abstract reasoning	2019/11	Saturated	o3 (high compute)	ARC-AGI-2	11	87.5	Took 5 years and o3's high-compute mode. Successor ARC-AGI-2 launched 2025.
11	WildBench	AllenAI	B	Pref	Real user queries	2024/4	Active	GPT 5.4	-	1	1285	ELO from real WildChat queries. Resists saturation by updating prompt distribution.
12	ParseBench	LlamaIndex	B	Open	Document parsing	2026/4	New	LlamaParse Agentic	-	13	84.9	~2K enterprise pages, 167K test rules across 5 dimensions.
13	Terminal-Bench 2.0	Anthropic / Stanford	B	Agent	Terminal agent	2025/12	Saturating	GPT 5.5	-	14	82.7	Multi-step CLI tasks. v1 saturated within 6 months; v2 already at 80%+.
14	SWE-Bench Verified	OpenAI / Princeton	B	Code	Real-world SWE	2024/8	Saturating	Devin 3	SWE-Bench Pro	15	82.1	500 human-verified GitHub issues. Frontier ≥75% across labs.
15	ARC-AGI-2	ARC Prize / Chollet	B	Open	Abstract reasoning	2025/3	Saturating	GPT 5.4	ARC-AGI-3	18	78.8	Faster than v1 to climb. v3 announced for late 2026.
16	SWE-Bench Pro	Princeton	B	Code	Real-world SWE (harder)	2025/8	Saturating	GLM 5.1	-	19	58.4	Multi-file edits, harder issues. Latent Space (Apr 2026): 'soon to be saturated'.
17	EinsteinArena	Together AI	B	Open	Open math problem-solving (multi-agent)	2026/3	Active (refreshing)	alpha_omega_agents (Kissing Number D11: 604)	-	-	-	Living benchmark: new problems added; resists saturation by design. 11 new SOTAs in first 5 weeks.
18	GDPval	OpenAI	A	Agent	Economic agentic work	2025/9	Saturating	GPT 5.5	-	12	84.9	44 occupations from top-9 GDP-contributing US industries. New framing: % expert-equivalent on real jobs.
19	METR Time Horizon	METR	A	Agent	Autonomous task duration	2025/3	Saturating	Claude Opus 4.6	-	16	80.0	Claude Opus 4.6: succeeds on >80% of suite; 50% time horizon = 12h, 95% UCB = 60h. Frontier benchmark for agent autonomy.
20	LiveBench	Abacus AI	A	Mixed	Contamination-free	2024/6	Active (refreshing)	GPT 5.4	-	17	79.2	Monthly question refresh prevents saturation; always-current snapshot.
21	FrontierMath	Epoch AI	A	Open	Research-level math	2024/11	Active	GPT 5.4 Pro	-	20	50.0	Hundreds of original problems graded by Tao/Gowers et al. Designed to resist saturation. May 2026: DeepMind multi-agent 'AI co-mathematician' hit 48% on Tier 4 (hardest tier); evaluators noted one proof could plausibly form a PhD thesis chapter.
22	Humanity's Last Exam (HLE)	CAIS + Scale	A	Mixed	Frontier knowledge	2025/1	Active	GPT 5.5	-	21	31.0	3,000 questions from 1,000 expert authors. Designed as 'the last benchmark we'll need.'
23	PostTrainBench	Aisa Group / Thoughtful Lab	A	Agent	Agentic post-training (CLI agents fine-tune base LLMs)	2026/4	Active	Opus 4.6 (Claude Code)	-	22	23.2	Frontier CLI harnesses (Claude Code, Codex CLI, Gemini CLI, OpenCode) get a base LLM (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) and 10h on one H100 to post-train it for AIME/GPQA/GSM8K/HumanEval/Arena Hard/BFCL/HealthBench. Best agent 23.2 vs official instruct 51.1 vs base zero-shot 7.5. Anti-cheat: agent-as-judge detects test-set leakage, harness tampering, model substitution.
24	ARC-AGI-3	ARC Prize / Chollet	A	Agent	Abstract reasoning (interactive)	2026/9	Announced	-	-	-	-	Interactive games / multi-step exploration; not yet released.

Benchmark	Coverage	ρ w/ Arena	ρ w/ Peers (LOO)	Unique Signal	Top-5 Spread	Dynamic Range
Arena (arena_elo)	4/56 (7%)	-	0.80 (n=4, low)	0.20 (max peer ρ = 0.80)	-	3%
MMLU-Pro (mmlu_pro)	30/56 (54%)	0.60 (n=4, low)	0.77 (n=30)	0.18 (max peer ρ = 0.82)	4.9% (saturating)	28%
SWE-Bench (swe_bench)	22/56 (39%)	0.40 (n=4, low)	0.71 (n=20)	0.16 (max peer ρ = 0.84)	2.2% (saturating)	59%
GPQA (gpqa)	35/56 (63%)	0.80 (n=4, low)	0.91 (n=33)	0.16 (max peer ρ = 0.84)	3.0% (saturating)	54%
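
In every row above, the Unique Signal value equals one minus the listed max peer ρ (for example, 0.18 = 1 − 0.82 for MMLU-Pro). The Python sketch below reproduces the column under that reading. It is an assumption inferred from the numbers, not a documented formula, and `unique_signal` is a hypothetical helper name.

```python
def unique_signal(max_peer_rho: float) -> float:
    """Assumed definition: the part of a benchmark's ranking that its
    most similar peer does not reproduce, i.e. 1 - max peer rho."""
    return round(1.0 - max_peer_rho, 2)


# Max peer rho values copied from the table above.
MAX_PEER_RHO = {"Arena": 0.80, "MMLU-Pro": 0.82, "SWE-Bench": 0.84, "GPQA": 0.84}

for name, rho in MAX_PEER_RHO.items():
    print(f"{name}: unique signal = {unique_signal(rho):.2f}")
```

Under this reading, a benchmark whose ranking is almost fully reproduced by some peer (max peer ρ near 1) contributes little signal of its own.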

Benchmark	Creator	Category	What it measures	Public
DCLM	DataComp / MLCommons	Pretraining data	Downstream task scores at fixed model+compute; crowns the best training-data recipe	Yes
FineWeb / FineWeb-Edu	Hugging Face	Pretraining data	Ablation-based quality signals across CommonCrawl-derived corpora	Yes
Dolma	Allen AI	Pretraining data	Open pretraining corpus + provenance / reproducibility study	Yes
RedPajama-v2	Together AI	Pretraining data	30T-token corpus with per-document quality scores for ablation	Yes
Nemotron-CC	NVIDIA	Pretraining data	6.3T-token high-quality CommonCrawl with quality classifiers	Yes
The Pile	EleutherAI	Pretraining data	Diverse 800GB corpus; baseline against which later corpora compare	Yes
DataPerf	MLCommons	Data curation	Benchmarks for data selection, cleaning, and slicing across tasks	Yes
MTEB	Hugging Face + Contextual	Embedding eval	56+ embedding datasets; quasi-canonical dataset suite for retrieval models	Yes
BIG-bench	Google + community	Eval federation	200+ tasks; 'benchmark of benchmarks' for capability breadth	Yes
HELM	Stanford CRFM	Eval framework	Holistic eval across ~30 datasets with per-scenario breakdowns	Yes
SWE-bench Verified	Princeton + OpenAI	Coding eval	Curated subset of SWE-bench where maintainer labels verify correctness	Yes
HarmBench	CAIS + academia	Safety eval	Standardised set for refusal / red-team evaluation	Yes
JailbreakBench	academic consortium	Safety eval	Reproducible jailbreak evaluation dataset + leaderboard	Yes
Contamination audits	various	Eval integrity	Detect leakage of eval items into training data (Oren et al. etc.)	Mixed

Benchmark	Status	N models	Top	Top-10 mean	Median	Spread (top − median)
MMLU-Pro	Near ceiling	345	89.8%	88.5%	75.2%	14.6%
GPQA Diamond	Near ceiling	481	94.1%	92.2%	65.9%	28.2%
HLE	Active	477	44.7%	40.7%	6.1%	38.6%
LiveCodeBench	Near ceiling	343	91.7%	88.8%	42.4%	49.3%
SciCode	Active	475	58.9%	55.3%	30.8%	28.1%
MATH-500	Saturated	201	99.4%	99.0%	83.5%	15.9%
AIME	Saturated	194	95.7%	92.2%	23.0%	72.7%
AIME-25	Saturated	269	99.0%	96.5%	53.3%	45.7%
IFBench	Maturing	409	82.9%	80.0%	43.1%	39.8%
τ²-Bench	Saturated	401	98.8%	97.8%	40.4%	58.4%
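
The Spread column is the top score minus the median, which the rows above confirm (89.8 − 75.2 = 14.6 for MMLU-Pro). The Python sketch below rechecks a few rows and shows how the same summary could be computed from raw per-model scores. The names `spread` and `summarize` and the sample score list are hypothetical, and the tracker's thresholds for the Status labels are not given in the source, so they are not modeled.

```python
from statistics import mean, median

# (top, median, reported spread) copied from rows of the table above.
ROWS = {
    "MMLU-Pro": (89.8, 75.2, 14.6),
    "GPQA Diamond": (94.1, 65.9, 28.2),
    "HLE": (44.7, 6.1, 38.6),
    "AIME": (95.7, 23.0, 72.7),
}


def spread(top: float, med: float) -> float:
    """Top score minus median, rounded to one decimal like the table."""
    return round(top - med, 1)


def summarize(scores: list[float]) -> dict[str, float]:
    """Top, top-10 mean, median and spread for one benchmark's scores."""
    ranked = sorted(scores, reverse=True)
    top, med = ranked[0], median(ranked)
    return {
        "top": top,
        "top10_mean": round(mean(ranked[:10]), 1),
        "median": med,
        "spread": spread(top, med),
    }


# The reported spreads match top - median for the rows copied above.
for name, (top, med, reported) in ROWS.items():
    assert spread(top, med) == reported, name

# Hypothetical per-model scores, for illustration only (not tracker data).
sample = [99.4, 99.1, 98.7, 95.0, 90.2, 83.5, 80.0, 71.3, 60.5, 42.0, 30.1]
print(summarize(sample))
```

A small spread combined with a high top score is what the table labels as near ceiling or saturated; a large spread means the benchmark still separates typical models from the leaders.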

vs	MMLU-Pro	GPQA Diamond	HLE	LiveCodeBench	SciCode	MATH-500	AIME	AIME-25	IFBench	τ²-Bench
MMLU-Pro	1.00	0.97	0.69	0.89	0.91	0.89	0.86	0.80	0.78	0.71
GPQA Diamond	0.97	1.00	0.84	0.93	0.89	0.89	0.87	0.87	0.82	0.76
HLE	0.69	0.84	1.00	0.71	0.72	0.56	0.58	0.81	0.75	0.68
LiveCodeBench	0.89	0.93	0.71	1.00	0.83	0.91	0.90	0.90	0.80	0.66
SciCode	0.91	0.89	0.72	0.83	1.00	0.74	0.71	0.73	0.77	0.71
MATH-500	0.89	0.89	0.56	0.91	0.74	1.00	0.96	0.91	0.65	0.57
AIME	0.86	0.87	0.58	0.90	0.71	0.96	1.00	0.90	0.61	0.55
AIME-25	0.80	0.87	0.81	0.90	0.73	0.91	0.90	1.00	0.75	0.62
IFBench	0.78	0.82	0.75	0.80	0.77	0.65	0.61	0.75	1.00	0.74
τ²-Bench	0.71	0.76	0.68	0.66	0.71	0.57	0.55	0.62	0.74	1.00
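
Spearman ρ is the Pearson correlation of the rank vectors, so it measures whether two benchmarks order models the same way regardless of score scale. Below is a minimal, dependency-free Python sketch of that computation. The inputs are the eight per-provider mean scores reported elsewhere on this page, used only as a small worked example; the matrix above is presumably computed over individual models, so these results will not match its entries.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Pearson correlation of the two rank vectors (undefined if one input is constant)."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Per-provider means, in the provider order used by the per-provider table.
mmlu_pro = [78.4, 75.5, 69.1, 75.7, 52.0, 71.3, 63.4, 78.7]
gpqa = [73.4, 67.5, 59.6, 67.7, 40.5, 62.7, 50.0, 75.1]
hle = [15.9, 12.4, 10.2, 12.9, 7.0, 9.2, 5.0, 16.3]
math500 = [87.7, 70.8, 82.1, 88.2, 52.3, 86.3, 64.9, 87.3]

print(f"MMLU-Pro vs GPQA Diamond: {spearman_rho(mmlu_pro, gpqa):.2f}")
print(f"HLE vs MATH-500:          {spearman_rho(hle, math500):.2f}")
```

For real analyses, `scipy.stats.spearmanr` computes the same statistic and also returns a p-value.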

Benchmark	OpenAI	Anthropic	Google	DeepSeek	Meta	Alibaba	Mistral	xAI
MMLU-Pro	78.4	75.5	69.1	75.7	52.0	71.3	63.4	78.7
GPQA Diamond	73.4	67.5	59.6	67.7	40.5	62.7	50.0	75.1
HLE	15.9	12.4	10.2	12.9	7.0	9.2	5.0	16.3
LiveCodeBench	64.2	45.8	41.4	52.1	17.0	44.0	31.0	55.6
SciCode	40.8	38.5	28.7	34.7	18.8	26.6	25.2	38.1
MATH-500	87.7	70.8	82.1	88.2	52.3	86.3	64.9	87.3
AIME	50.2	24.1	35.6	55.3	11.1	42.4	15.0	48.8
AIME-25	66.7	60.1	47.6	62.4	6.4	51.9	29.5	66.7
IFBench	60.0	48.2	46.4	44.8	37.4	43.7	34.5	54.4
τ²-Bench	59.6	66.5	34.1	51.9	19.8	52.3	30.4	76.4
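
One rough way to read this table is to ask, for each benchmark, which provider has the highest mean and how far ahead of the runner-up it sits. The Python sketch below does that for three rows copied from the table. `leader_margin` is a hypothetical helper, and the margin is only a descriptive statistic chosen for illustration, not the tracker's own test for favoritism.

```python
PROVIDERS = ["OpenAI", "Anthropic", "Google", "DeepSeek", "Meta", "Alibaba", "Mistral", "xAI"]

# Per-provider mean scores copied from three rows of the table above.
MEANS = {
    "LiveCodeBench": [64.2, 45.8, 41.4, 52.1, 17.0, 44.0, 31.0, 55.6],
    "AIME": [50.2, 24.1, 35.6, 55.3, 11.1, 42.4, 15.0, 48.8],
    "τ²-Bench": [59.6, 66.5, 34.1, 51.9, 19.8, 52.3, 30.4, 76.4],
}


def leader_margin(scores: list[float]) -> tuple[str, float]:
    """Return the provider with the highest mean and its lead over the runner-up."""
    ranked = sorted(zip(scores, PROVIDERS), reverse=True)
    (best, name), (second, _) = ranked[0], ranked[1]
    return name, round(best - second, 1)


for bench, scores in MEANS.items():
    name, margin = leader_margin(scores)
    print(f"{bench}: {name} leads by {margin} points")
```

A large lead in mean score can also reflect how many older or smaller models a provider has in the pool, so the margin alone does not show that a benchmark favors its leader.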

Benchmark the Benchmarks

Benchmarks for Datasets

Live Saturation (computed from 512 models)

Cross-Benchmark Correlation (Spearman ρ)

Per-Provider Mean Score (does this benchmark favor any creator?)

Why some benchmarks are more legit than others
