Benchmark Saturation Tracker
Data as of May 11, 2026

| # | Tier | Format | Domain | Released | Status | Top model | Successor | | Top score | Notes | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | D | Open | Competition math | 2021/3 | Saturated | GPT 5.2 | AIME, FrontierMath | | 98.1 (rank 3) | 12.5K problems from AMC/AIME/Putnam. Saturated by reasoning models in 2024. | |
| 2. | D | Code | Code (functional) | 2021/7 | Saturated | GPT 5.4 | SWE-Bench | | 98.0 (rank 4) | 164 hand-written Python problems. Effectively solved; superseded by SWE-Bench. | |
| 3. | D | Open | Grade-school math | 2021/10 | Saturated | GPT 5.4 | MATH | | 97.0 (rank 5) | 8.5K word problems. Frontier all >95%. | |
| 4. | D | Mixed | Diverse reasoning | 2022/10 | Saturated | Claude Opus 4.6 | - | | 94.5 (rank 7) | 23 challenging tasks from BIG-Bench. Effectively solved. | |
| 5. | D | MCQ | General knowledge | 2020/9 | Saturated | Claude Opus 4.6 | MMLU-Pro | | 92.0 (rank 8) | 57 subjects, multiple choice. Top frontier models all >90%. Successor MMLU-Pro raised difficulty. | |
| 6. | C | Det | AI image detection | 2026/2 | New | Gemini 3 Pro Image (clean) | - | | 100.0 (rank 2) | 100% clean, dropping to 5-7% under simple perturbation. Headline: detectors not adversarially robust. | |
| 7. | C | Open | Olympiad math | 2024/2 | Saturated | GPT 5.2 | FrontierMath, IMO problems | | 96.1 (rank 6) | 30 problems, used as standard reasoning eval since o1. Now saturated. | |
| 8. | C | MCQ | Graduate science | 2023/11 | Saturated | Claude Opus 4.6 | - | | 89.0 (rank 9) | PhD-level bio/chem/physics. Saturated by early 2025 per LessWrong (Apr 2026). | |
| 9. | C | MCQ | General knowledge (harder) | 2024/6 | Saturating | Claude Opus 4.6 | - | | 88.0 (rank 10) | Removed easy/contaminated questions; added complex reasoning. Frontier converging. | |
| 10. | C | Open | Abstract reasoning | 2019/11 | Saturated | o3 (high compute) | ARC-AGI-2 | | 87.5 (rank 11) | Took 5 years and o3's high-compute mode. Successor ARC-AGI-2 launched 2025. | |
| 11. | B | Pref | Real user queries | 2024/4 | Active | GPT 5.4 | - | | 1285 (rank 1) | Elo from real WildChat queries. Resists saturation by updating prompt distribution. | |
| 12. | B | Open | Document parsing | 2026/4 | New | LlamaParse Agentic | - | | 84.9 (rank 13) | ~2K enterprise pages, 167K test rules across 5 dimensions. | |
| 13. | B | Agent | Terminal agent | 2025/12 | Saturating | GPT 5.5 | - | | 82.7 (rank 14) | Multi-step CLI tasks. v1 saturated within 6 months; v2 already at 80%+. | |
| 14. | B | Code | Real-world SWE | 2024/8 | Saturating | Devin 3 | SWE-Bench Pro | | 82.1 (rank 15) | 500 human-verified GitHub issues. Frontier ≥75% across labs. | |
| 15. | B | Open | Abstract reasoning | 2025/3 | Saturating | GPT 5.4 | ARC-AGI-3 | | 78.8 (rank 18) | Faster than v1 to climb. v3 announced for late 2026. | |
| 16. | B | Code | Real-world SWE (harder) | 2025/8 | Saturating | GLM 5.1 | - | | 58.4 (rank 19) | Multi-file edits, harder issues. Latent Space (Apr 2026): 'soon to be saturated'. | |
| 17. | B | Open | Open math problem-solving (multi-agent) | 2026/3 | Active (refreshing) | alpha_omega_agents (Kissing Number D11: 604) | - | | - | Living benchmark: new problems added; resists saturation by design. 11 new SOTAs in first 5 weeks. | |
| 18. | A | Agent | Economic agentic work | 2025/9 | Saturating | GPT 5.5 | - | | 84.9 (rank 12) | 44 occupations from top-9 GDP-contributing US industries. New framing: % expert-equivalent on real jobs. | |
| 19. | A | Agent | Autonomous task duration | 2025/3 | Saturating | Claude Opus 4.6 | - | | 80.0 (rank 16) | Claude Opus 4.6: succeeds on >80% of suite; 50% time horizon = 12h, 95% UCB = 60h. Frontier benchmark for agent autonomy. | |
| 20. | A | Mixed | Contamination-free | 2024/6 | Active (refreshing) | GPT 5.4 | - | | 79.2 (rank 17) | Monthly question refresh prevents saturation; always-current snapshot. | |
| 21. | A | Open | Research-level math | 2024/11 | Active | GPT 5.4 Pro | - | | 50.0 (rank 20) | Hundreds of original problems graded by Tao/Gowers et al. Designed to resist saturation. May 2026: DeepMind multi-agent 'AI co-mathematician' hit 48% on Tier 4 (hardest tier); evaluators noted one proof could plausibly form a PhD thesis chapter. | |
| 22. | A | Mixed | Frontier knowledge | 2025/1 | Active | GPT 5.5 | - | | 31.0 (rank 21) | 3,000 questions from 1,000 expert authors. Designed as 'the last benchmark we'll need.' | |
| 23. | A | Agent | Agentic post-training (CLI agents fine-tune base LLMs) | 2026/4 | Active | Opus 4.6 (Claude Code) | - | | 23.2 (rank 22) | Frontier CLI harnesses (Claude Code, Codex CLI, Gemini CLI, OpenCode) get a base LLM (Qwen3-1.7B/4B, SmolLM3-3B, Gemma-3-4B) and 10h on one H100 to post-train it for AIME/GPQA/GSM8K/HumanEval/Arena Hard/BFCL/HealthBench. Best agent 23.2 vs official instruct 51.1 vs base zero-shot 7.5. Anti-cheat: agent-as-judge detects test-set leakage, harness tampering, model substitution. | |
| 24. | A | Agent | Abstract reasoning (interactive) | 2026/9 | Announced | - | - | | - | Interactive games / multi-step exploration; not yet released. | |
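
One rough way to read the score column is as remaining headroom. The Python sketch below takes a handful of rows from the table and sorts them by how far the top score sits from the ceiling. It assumes each score is a percentage with a ceiling of 100, which excludes the Elo-scored row; the `Entry` class and the `headroom` and `by_headroom` helpers are hypothetical names used only for illustration, not part of the tracker.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    domain: str       # value from the domain column of the table
    status: str       # value from the status column of the table
    top_score: float  # top score, assumed here to be a percentage (0-100)


# A few rows copied from the table above (percentage-scored rows only).
ENTRIES = [
    Entry("Competition math", "Saturated", 98.1),
    Entry("Graduate science", "Saturated", 89.0),
    Entry("Real-world SWE (harder)", "Saturating", 58.4),
    Entry("Research-level math", "Active", 50.0),
    Entry("Frontier knowledge", "Active", 31.0),
]


def headroom(entry: Entry) -> float:
    """Points left before the assumed 100% ceiling, rounded to one decimal."""
    return round(100.0 - entry.top_score, 1)


def by_headroom(entries: list[Entry]) -> list[Entry]:
    """Return entries ordered from most to least remaining headroom."""
    return sorted(entries, key=headroom, reverse=True)


if __name__ == "__main__":
    for e in by_headroom(ENTRIES):
        print(f"{e.domain:28s} {e.status:11s} headroom={headroom(e)}")
```

Headroom is only a crude proxy: it ignores label noise and contamination, so a benchmark may be effectively saturated well below 100%.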