Reasoning / Math Models

Data as of June 18, 2026

#	Model	Provider	Released	Params	Context	Price			Modalities	Open weights	FrMath	AIME	MATH L5	Avg rank	Notes
1.	GPT 5.2	OpenAI	2025/12	?	128K	$2/$14	73	130.34	📝👁️🔊	○	40.7 (#2)	96.1 (#2)	98.1 (#1)	1.7	-
2.	Kimi K2.6	Moonshot	2026/4	1T (32B active, MoE)	256K	-	-	-	📝👁️	●	-	96.4 (#1)	-	2.3	-
3.	o4-mini	OpenAI	2025/4	?	128K	-	161	21.94	📝	○	-	-	97.8 (#2)	3	-
4.	GPT 5.4 Pro	OpenAI	2026/3	?	128K	-	-	-	📝👁️🔊	○	50.0 (#1)	-	-	3.3	-
5.	o3	OpenAI	2025/4	?	200K	-	118	5.38	📝	○	-	-	97.8 (#3)	3.3	-
6.	GLM 4.6	Z.ai	2025/7	754B MoE	200K	-	-	-	📝	●	-	95.3 (#4)	-	3.3	Open-weight MoE; basis for Big Pickle; strong reasoning + tool use
7.	ZAYA1-74B-Preview	Zyphra	2026/5	74B (4B active, MoE)	?	-	-	-	📝	●	-	-	-	3.7	Pre-RL base checkpoint; trained on AMD hardware. Marks Zyphra's move beyond small-MoE experimentation. Companion ZAYA1-VL-8B (700M active) released alongside.
8.	Nemotron 3 Super	NVIDIA	2026/4	120B (12B active, MoE)	1M	-	-	-	📝	●	-	-	-	3.7	Hybrid Mamba-Attention MoE w/ LatentMoE; trained on 25T tokens; 2.2x throughput vs GPT-OSS-120B, 7.5x vs Qwen3.5-122B; native MTP speculative decoding
9.	Hy3 Preview	Tencent Hunyuan	2026/4	295B (21B active, MoE)	256K	-	-	-	📝	●	-	-	-	3.7	Open-weight Hunyuan 3 preview; 192 experts, top-8 routing; strong STEM/reasoning release
10.	Claude Sonnet 4.5	Anthropic	2025/9	?	200K	$3/$15	-	-	📝👁️	○	-	-	97.7 (#4)	3.7	-
11.	Gemini 3.1 Pro	Google	2026/2	?	1M	$2/$12	109	29.71	📝👁️🔊	○	36.9 (#4)	95.6 (#3)	-	3.7	-
12.	Big Pickle	OpenCode Zen	2025/10	754B MoE	200K	free/free	-	-	📝	○	-	-	-	3.7	Stealth model (GLM 4.6); free on OpenCode Zen; reasoning + tool calling; text-only
13.	GLM 5.1	Z.ai	2026/4	754B MoE	200K	$0.95/$3	74	1.64	📝👁️	●	-	95.3 (#5)	-	3.7	-
14.	Claude Opus 4.8	Anthropic	2026/5	?	200K	$15/$75	-	-	📝👁️	○	-	-	-	3.7	Incremental over 4.7; high CoT faithfulness; no steganographic reasoning found in white-box SAE analysis
15.	Claude Opus 4.7	Anthropic	2026/4	?	200K	$15/$75	-	-	📝👁️	○	-	-	-	3.7	-
16.	GPT-5.5	OpenAI	2026/4	?	400K	-	-	-	📝👁️🔊	○	-	-	-	3.7	-
17.	DeepSeek V4	DeepSeek	2026/4	1.6T (49B active, MoE)	1M	$2/$3	-	-	📝	●	-	-	-	3.7	NIST/CAISI eval (May 2026): most capable PRC model tested, ~8mo behind US frontier; IRT-Elo 800 vs GPT-5.5 1260, Opus 4.6 999. CAISI scores: GPQA-Diamond 90%, OTIS-AIME-2025 97%, SWE-Bench Verified 74%, ARC-AGI-2 46%. Pricing is DeepSeek V4 Pro (developer-reported).
18.	Kimi Linear	Moonshot	2026/4	48B (3B active, MoE)	1M	-	-	-	📝	●	-	-	-	3.7	Novel architecture: 3:1 KDA-to-MLA ratio (Kimi Delta Attention + Multi-head Latent Attention). 75% smaller KV cache, 6x decoding throughput at 1M context. Paired with FlashKDA CUTLASS kernels. Efficiency-focused, not flagship-capability.
19.	Claude Sonnet 4.6	Anthropic	2026/2	?	200K	-	-	-	📝👁️	○	-	-	-	3.7	-
20.	Claude Haiku 4.5	Anthropic	2025/10	?	200K	-	-	-	📝👁️	○	-	-	-	3.7	-
21.	Ring-2.6-1T	InclusionAI	2026/5	1T (63B active, MoE)	262K	-	-	-	📝	●	-	-	-	3.7	Trillion-parameter thinking model w/ adaptive reasoning; tuned for coding agents, tool use, long-horizon tasks; high/xhigh reasoning modes
22.	Grok 4.20	xAI	2026/4	-	-	-	-	-	📝	○	-	-	-	3.7	xAI reasoning/agent flagship. Scores pending public benchmark publication.
23.	MiniMax M2.7	MiniMax	2026/3	230B (10B active, MoE)	200K	-	-	-	📝	●	-	-	-	3.7	MiniMax reasoning model (text-only). Open weights, non-commercial license. Scores: AA Intelligence Index 50.
24.	GPT-5.1	OpenAI	2025/12	-	-	-	-	-	📝	○	-	-	-	3.7	Iterative GPT-5 release. Scores pending.
25.	Gemini 3 Flash	Google	2026/3	-	-	-	-	-	📝👁️	○	-	-	-	3.7	Fast tier of Gemini 3 family. Scores pending.
26.	Gemini 2.5 Pro	Google	2025/3	-	-	-	-	-	📝👁️	○	-	-	-	3.7	Previous-gen Gemini flagship. Largely superseded by 3.1 Pro.
27.	Kimi K2.5 Thinking	Moonshot	2026/1	-	-	-	-	-	📝	●	-	-	-	3.7	Predecessor to K2.6. Scores pending.
28.	DeepSeek V3.2	DeepSeek	2026/2	-	-	-	-	-	📝	●	-	-	-	3.7	DeepSeek V3.2-Exp release. Scores pending.
29.	MiniMax M3	MiniMax	2026/6	?	1M	$0.60/$2	-	-	📝👁️	●	-	-	-	3.7	Open-weight frontier coding + 1M context + native multimodality. SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0, BrowseComp 83.5. MSA: ~9x faster prefill / 15x decode at 1M.
30.	Gemini 3.5 Flash	Google	2026/5	?	1M	$2/$9	-	-	📝👁️🔊	○	-	-	-	3.7	~4x faster than prior frontier; beats Gemini 3.1 Pro on coding/agentic (Terminal-Bench 2.1 76.2, MCP Atlas 83.6, CharXiv 84.2). Dynamic thinking on by default.
31.	Qwen3.7 Plus	Alibaba	2026/6	?	1M	$0.40/$2	-	-	📝👁️🎥	○	-	-	-	3.7	AA Intelligence Index 53.3 (coding 46.5). Multimodal (text/image/video in). ScreenSpot Pro 79.0 GUI grounding.
32.	Nemotron 3 Ultra 550B	NVIDIA	2026/6	550B (55B active, MoE)	1M	-	146.3	-	📝	●	-	-	-	3.7	AA Intelligence Index 48; AA-Omniscience 78.7 (top non-hallucination in set). ~20T train tokens, 11 langs + 43 prog langs.
33.	SEA-LION v4 27B	AI Singapore	2025/11	27B (Gemma 3)	128K	-	-	-	📝👁️	●	-	-	-	3.7	Southeast Asian multilingual (11 langs incl Malay); Gemma 3 27B base. #4 on SEA-HELM, #1 Tamil/Filipino; runs on a 32GB laptop.
34.	MaLLaM 5B	Mesolitica	2024/1	5B	20K	-	-	-	📝	●	-	-	-	3.7	Malaysia LLM (Mistral-based); Malay slang/colloquialisms + 16 regional MY languages, ~90B Malay tokens.
35.	SeaLLMs v3 7B	Alibaba DAMO	2024/7	7B (Qwen2)	32K	-	-	-	📝	●	-	-	-	3.7	Southeast Asian multilingual (Malay/Indonesian/Thai/Vietnamese/…); Qwen2-based, SOTA for its size on SEA tasks.
36.	Qwen3.7 Max	Alibaba	2026/5	?	1M	$3/$8	-	-	📝👁️	○	-	-	-	3.7	Qwen3.7 flagship; agent-centric (long-horizon ~35h). AA Intelligence Index 56.6 (#5, top Chinese at launch); Terminal-Bench 2.0 69.7.
37.	Qwen 3.6 Plus	Alibaba	2026/4	?	128K	-	-	-	📝👁️	○	-	95.3 (#6)	-	4	-
38.	Qwen3-Max	Alibaba	2025/9	?	128K	-	-	-	📝👁️	○	-	-	97.1 (#5)	4	-
39.	GPT 5 mini	OpenAI	2025/8	?	128K	-	-	-	📝👁️🔊	○	-	-	96.8 (#6)	4.3	-
40.	MAI-Thinking-1	Microsoft AI	2026/6	35B active (SMoE)	256K	-	-	-	📝	○	-	94.5 (#7)	-	4.3	Microsoft's first in-house reasoning model (Build 2026); no distillation from OpenAI/Anthropic. AIME 2026 94.5; SWE-Bench Pro 53 (~Opus 4.6). Foundry private preview.
41.	DeepSeek-R1	DeepSeek	2025/1	685B	128K	$0.55/$2	60	0.84	📝	●	-	-	96.6 (#7)	4.7	-
42.	Claude Opus 4.6	Anthropic	2026/2	?	200K	$5/$25	40	1.78	📝👁️	○	40.7 (#3)	94.4 (#8)	-	5	-
43.	o3-mini	OpenAI	2025/1	?	128K	-	160	7.12	📝	○	-	-	96.5 (#8)	5	-
44.	Seed 2.0 Pro	ByteDance	-	?	128K	-	-	-	📝	○	-	94.2 (#9)	-	5	-
45.	Gemma 4 31B	Google	-	31B	128K	-	-	-	📝	●	-	89.2 (#10)	-	5.3	-
