Robo2u

Reasoning / Math Models

Data as of June 18, 2026
#1.7
OpenAI · ? · 128K · $2/$1473130.34 · 📝👁️🔊
FrMath 40.7 · AIME 96.1 · MATH L5 98.1
2.Kimi K2.6●
#2.3
Moonshot · 1T (32B active, MoE) · 256K · 📝👁️
AIME 96.4
OpenAI · ? · 128K · 16121.94 · 📝
MATH L5 97.8
OpenAI · ? · 128K · 📝👁️🔊
FrMath 50.0
5.o3
#3.3
OpenAI · ? · 200K · 1185.38 · 📝
MATH L5 97.8
6.GLM 4.6●
#3.3
Z.ai · 754B MoE · 200K · 📝
AIME 95.3
Open-weight MoE; basis for Big Pickle; strong reasoning + tool use
#3.7
Zyphra · 74B (4B active, MoE) · ? · 📝
Pre-RL base checkpoint; trained on AMD hardware. Marks Zyphra's move beyond small-MoE experimentation. Companion ZAYA1-VL-8B (700M active) released alongside.
#3.7
NVIDIA · 120B (12B active, MoE) · 1M · 📝
Hybrid Mamba-Attention MoE w/ LatentMoE; trained on 25T tokens; 2.2x throughput vs GPT-OSS-120B, 7.5x vs Qwen3.5-122B; native MTP speculative decoding
9.Hy3 Preview●
#3.7
Tencent Hunyuan · 295B (21B active, MoE) · 256K · 📝
Open-weight Hunyuan 3 preview; 192 experts, top-8 routing; strong STEM/reasoning release
Anthropic · ? · 200K · $3/$15 · 📝👁️
MATH L5 97.7
Google · ? · 1M · $2/$1210929.71 · 📝👁️🔊
FrMath 36.9 · AIME 95.6
#3.7
OpenCode Zen · 754B MoE · 200K · free/free · 📝
Stealth model (GLM 4.6); free on OpenCode Zen; reasoning + tool calling; text-only
13.GLM 5.1●
#3.7
Z.ai · 754B MoE · 200K · $0.95/$3741.64 · 📝👁️
AIME 95.3
Anthropic · ? · 200K · $15/$75 · 📝👁️
Incremental over 4.7; high CoT faithfulness; no steganographic reasoning found in white-box SAE analysis
Anthropic · ? · 200K · $15/$75 · 📝👁️
#3.7
OpenAI · ? · 400K · 📝👁️🔊
17.DeepSeek V4●
#3.7
DeepSeek · 1.6T (49B active, MoE) · 1M · $2/$3 · 📝
NIST/CAISI eval (May 2026): most capable PRC model tested, ~8mo behind US frontier; IRT-Elo 800 vs GPT-5.5 1260, Opus 4.6 999. CAISI scores: GPQA-Diamond 90%, OTIS-AIME-2025 97%, SWE-Bench Verified 74%, ARC-AGI-2 46%. Pricing is DeepSeek V4 Pro (developer-reported).
18.Kimi Linear●
#3.7
Moonshot · 48B (3B active, MoE) · 1M · 📝
Novel architecture: 3:1 KDA-to-MLA ratio (Kimi Delta Attention + Multi-head Latent Attention). 75% smaller KV cache, 6x decoding throughput at 1M context. Paired with FlashKDA CUTLASS kernels. Efficiency-focused, not flagship-capability.
Anthropic · ? · 200K · 📝👁️
Anthropic · ? · 200K · 📝👁️
21.Ring-2.6-1T●
#3.7
InclusionAI · 1T (63B active, MoE) · 262K · 📝
Trillion-parameter thinking model w/ adaptive reasoning; tuned for coding agents, tool use, long-horizon tasks; high/xhigh reasoning modes
#3.7
xAI · 📝
xAI reasoning/agent flagship. Scores pending public benchmark publication.
23.MiniMax M2.7●
#3.7
MiniMax · 230B (10B active, MoE) · 200K · 📝
MiniMax reasoning model (text-only). Open weights, non-commercial license. Scores: AA Intelligence Index 50.
#3.7
OpenAI · 📝
Iterative GPT-5 release. Scores pending.
Google · 📝👁️
Fast tier of Gemini 3 family. Scores pending.
Google · 📝👁️
Previous-gen Gemini flagship. Largely superseded by 3.1 Pro.
#3.7
Moonshot · 📝
Predecessor to K2.6. Scores pending.
28.DeepSeek V3.2●
#3.7
DeepSeek · 📝
DeepSeek V3.2-Exp release. Scores pending.
29.MiniMax M3●
#3.7
MiniMax · ? · 1M · $0.60/$2 · 📝👁️
Open-weight frontier coding + 1M context + native multimodality. SWE-Bench Pro 59.0, Terminal-Bench 2.1 66.0, BrowseComp 83.5. MSA: ~9x faster prefill / 15x decode at 1M.
Google · ? · 1M · $2/$9 · 📝👁️🔊
~4x faster than prior frontier; beats Gemini 3.1 Pro on coding/agentic (Terminal-Bench 2.1 76.2, MCP Atlas 83.6, CharXiv 84.2). Dynamic thinking on by default.
Alibaba · ? · 1M · $0.40/$2 · 📝👁️🎥
AA Intelligence Index 53.3 (coding 46.5). Multimodal (text/image/video in). ScreenSpot Pro 79.0 GUI grounding.
NVIDIA · 550B (55B active, MoE) · 1M · 146.3 · 📝
AA Intelligence Index 48; AA-Omniscience 78.7 (top non-hallucination in set). ~20T train tokens, 11 langs + 43 prog langs.
#3.7
AI Singapore · 27B (Gemma 3) · 128K · 📝👁️
Southeast Asian multilingual (11 langs incl Malay); Gemma 3 27B base. #4 on SEA-HELM, #1 Tamil/Filipino; runs on a 32GB laptop.
34.MaLLaM 5B●
#3.7
Mesolitica · 5B · 20K · 📝
Malaysia LLM (Mistral-based); Malay slang/colloquialisms + 16 regional MY languages, ~90B Malay tokens.
35.SeaLLMs v3 7B●
#3.7
Alibaba DAMO · 7B (Qwen2) · 32K · 📝
Southeast Asian multilingual (Malay/Indonesian/Thai/Vietnamese/…); Qwen2-based, SOTA for its size on SEA tasks.
Alibaba · ? · 1M · $3/$8 · 📝👁️
Qwen3.7 flagship; agent-centric (long-horizon ~35h). AA Intelligence Index 56.6 (#5, top Chinese at launch); Terminal-Bench 2.0 69.7.
Alibaba · ? · 128K · 📝👁️
AIME 95.3
Alibaba · ? · 128K · 📝👁️
MATH L5 97.1
#4.3
OpenAI · ? · 128K · 📝👁️🔊
MATH L5 96.8
Microsoft AI · 35B active (SMoE) · 256K · 📝
AIME 94.5
Microsoft's first in-house reasoning model (Build 2026); no distillation from OpenAI/Anthropic. AIME 2026 94.5; SWE-Bench Pro 53 (~Opus 4.6). Foundry private preview.
41.DeepSeek-R1●
#4.7
DeepSeek · 685B · 128K · $0.55/$2600.84 · 📝
MATH L5 96.6
Anthropic · ? · 200K · $5/$25401.78 · 📝👁️
FrMath 40.7 · AIME 94.4
#5
OpenAI · ? · 128K · 1607.12 · 📝
MATH L5 96.5
ByteDance · ? · 128K · 📝
AIME 94.2
45.Gemma 4 31B●
#5.3
Google · 31B · 128K · 📝
AIME 89.2