Robo2u

Vision (Image Understanding)

Data as of June 12, 2026
Anthropic | ? | 200K | $5/$25
Arena 1295 | MathVis+Py 84.6 | V*+Py 86.4
Google | ? | 1M | $2/$12
Arena 1287
OpenAI | ? | 128K | $2/$14
Arena 1278
Anthropic | ? | 200K | $3/$15
Arena 1271
Google | ? | 1M | $0.50/$3
Arena 1268
ByteDance | ? | 128K
Arena 1257
Z.ai | 744B (40B active, MoE) | 200K | $1/$4
OpenAI | ? | 128K | $2/$14
MathVis+Py 96.1 | V*+Py 98.4
OpenAI | ? | 128K | $2/$14
Arena 1249
Mistral | 128B | 256K | $2/$8
Anthropic | ? | 200K | $15/$75
Anthropic | ? | 200K | $15/$75
Anthropic | ? | 200K | $1/$5
#7
OpenAI | ? | 400K
Google | ? | 1M | $2/$12
MathVis+Py 95.7 | V*+Py 96.9
16. Kimi K2.6 ●
#7
Moonshot | 1T | 128K
MMMU-Pro 79.4 | MathVis+Py 93.2 | V*+Py 96.9
InclusionAI | ? | ?
18. GLM 5.1 ●
#7
Z.ai
DeepSeek
20. Qwen 3.6 Plus ●
#7
Alibaba
21. DeepSeek V3.2 ●
#7
DeepSeek
22. GLM 4.6 ●
#7
Z.ai
23. MiniMax M3 ●
#7
MiniMax | ? | 1M | $0.60/$2
Google | ? | 1M | $2/$9
Alibaba | ? | 1M | $0.40/$2
Alibaba | ? | 1M | $3/$8
Google | ? | 1M | $1/$10
Arena 1246
Moonshot | 1T | 128K
Arena 1245 | MathVis+Py 85.0 | V*+Py 86.9
xAI | ? | 128K
Arena 1243
30. Qwen 3.5 397B ●
#11
Alibaba | 397B | 128K
Arena 1240
#12
OpenAI | ? | 128K | $2/$14
Arena 1225
#13
Alibaba | 235B | 128K
Arena 1215
#14
Meta | 400B | 1M
Arena 1147
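
To put the Arena gaps above in perspective, here is a minimal sketch that converts a score difference into an expected head-to-head win rate. It assumes Arena scores sit on the conventional Elo scale (base 10, 400-point divisor); this page does not state the scale, so treat the output as a rough guide only. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected probability that A is preferred over B under the Elo logistic model.

    Assumption: Arena scores use the standard Elo scale (base 10, 400-point divisor).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Arena scores taken from the Vision listing above.
    pairs = {
        "1295 vs 1287": (1295, 1287),  # top two entries
        "1295 vs 1147": (1295, 1147),  # top entry vs last listed entry
    }
    for label, (a, b) in pairs.items():
        print(f"{label}: {expected_win_rate(a, b):.1%} expected win rate for the higher score")
```

Under that assumption, an 8-point gap is close to a coin flip, while a gap of roughly 150 points corresponds to about a 70% preference rate.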

Image Generation

Data as of June 9, 2026
OpenAI | ?
Arena 1385
Successor to GPT-Image-1.5; supports detailed instruction following, accurate placement, available on Vercel AI Gateway
Reve | 4K
Arena 1273
#2 on text-to-image arena. Pitched as 'best 4K image model'; introduces layout-based generation/editing: precise control over where every object and text region lands
Google | 2048×2048 | $0.04
Arena 1269
Microsoft AI | 2048×2048
Arena 1253
#4 on text-to-image arena (#3 per Microsoft), #2 on image-to-image; successor to MAI Image 2
OpenAI | 2048×2048 | $0.08
Arena 1241
xAI
Arena 1234
Quality variant added to LMArena May 6 2026
Google | 4K | $0.24
Arena 1232
8. Ideogram 4.0 ●
#8
Ideogram | 2048×2048
Arena 1204
#9 overall on text-to-image arena, #1 open-weight model. Open weights at launch; trained with bounding boxes tied to region descriptions for layout control; excels at text rendering and commercial design
Luma AI
Arena 1192
10. MAI Image 2
#10
Microsoft AI | 2048×2048
Arena 1183
#11
Luma AI
Arena 1175
xAI | 2048×2048 | free
Arena 1172
#13
Alibaba
Arena 1168
14. Reve v1.5
#14
Reve | 2048×2048
Arena 1164
Black Forest Labs | 2048×2048 | $0.08
Arena 1164
xAI
Arena 1160
Sourceful | 4K
Agentic text-to-image with autonomous self-correction; #1 Text-to-Image & Image Editing on Artificial Analysis (Feb 2026). $0.15 per 1K/2K image, $0.33 per 4K.
Black Forest Labs
Arena 1156
Black Forest Labs
Arena 1156
Google | 1024×1024 | $0.03
Arena 1152
Codename 'Nano Banana' on LMArena leaderboard; conversational image editing, character consistency
#20
Tencent | 2048×2048 | free
Arena 1151
China API
22. Flux 2 Dev ●
#21
Black Forest Labs
Arena 1149
Google | 2048×2048 | $0.06
Arena 1148
ByteDance | 2048×2048 | $0.03
Arena 1142
China API
ByteDance | 2048×2048
Arena 1141
26. Wan 2.6 T2I ●
#25
Alibaba | 1536×1536 | free
Arena 1134
Google
Arena 1130
#27
Alibaba
Arena 1130
OpenAI | 2048×2048 | $0.04
Arena 1115
30. Recraft v4
#29
Recraft | 2048×2048 | $0.04
Arena 1099
31. Ideogram v3
#30
Ideogram | 2048×2048 | $0.08
Arena 1049
#31
OpenAI | 1792×1024 | $0.08
Arena 968.0
33. SD 3.5 Large ●
#32
Stability AI | 1024×1024 | free
Arena 938.0

Video Generation

Data as of June 9, 2026
ByteDance | 15s | 1080p | $0.10
Arena 1450
China API; via Dreamina
Google | 8s | 4K | $0.75
Arena 1371
Vertex AI only
OpenAI | 25s | 1080p | $0.50
Arena 1364
SHUTTING DOWN Apr 26
xAI | 720p
Arena 1361
5. Wan 2.6 ●
#5
Alibaba | 10s | 720p | free
Arena 1349
Open weights
#6
Google | 8s | 1080p | $0.50
Arena 1341
#7
OpenAI | 25s | 1080p | $0.25
Arena 1340
SHUTTING DOWN Apr 26
ByteDance | 15s | 720p | $0.05
Arena 1259
China API
Runway | 10s | 1080p | $0.17
Arena 1242
Kuaishou | 10s | 4K | $0.02
Leads text-to-video arena; native 4K, Multi-Shot Storyboard, dual audio inputs. Extend to 3 min on paid tiers.
Alibaba | 15s | 1080p | $0.22
Tops Artificial Analysis Video Arena (T2V & I2V). API/demo only as of Jun 2026; weights NOT released (HF/GitHub 'coming soon'); 'open-source' framing is marketing. ~15B params, single-stream transformer, 8-step CFG-free (10s 1080p in ~32s), synced audio. Access via Alibaba Cloud Bailian / fal.
12. PixVerse v5.6
#10
PixVerse | 1080p
Arena 1238
Kuaishou | 15s | 1080p | $0.04
Arena 1218
Cheapest quality option
14. Ray 3
#12
Luma AI | 1080p
Arena 1207
15. Hailuo 2.3
#13
MiniMax | 6s | 1080p
Arena 1197
China API
16. LTX-2.3 ●
#14
Lightricks | 20s | 1080p | free
Arena 1185
Open weights; synced audio+video; native 9:16 portrait; fp8 + distilled variants
17. LTX-2 ●
#15
Lightricks | 20s | 4K | free
Arena 1175
Open weights; first open audiovisual model; 4K @ 50fps
#16
Tencent | 5s | 720p | free
Arena 1170
Open weights; China API
#17
Google | 8s | 4K | $0.30
Arena 1164
20. Pika 2.2
#18
Pika | 10s | 1080p
Arena 1009

Visual Factual Knowledge (WorldVQA)

Data as of April 26, 2026
Google | Frontier VLM
Overall % 47.5
Top overall F-score on WorldVQA
2. Kimi K2.5 ●
#2
Moonshot | Frontier VLM
Overall % 46.8
Author's own model; #2, trailing the leader by 0.7 points
Anthropic | Frontier VLM
Overall % 37.5
Strong on Head categories, weaker on Tail
ByteDance | Frontier VLM
Overall % 35.2
ByteDance Seed visual frontier

Document Parsing (ParseBench)

Data as of April 25, 2026
LlamaIndex | Agentic parser
Overall 84.9 | Tables 90.7 | Charts 78.1 | Faithful 89.7 | Format 85.2 | Ground 80.6
Highest overall; agentic loop with retries
Google | VLM (high-reasoning)
Overall 75.0 | Tables 91.5 | Charts 64.8 | Faithful 90.9 | Format 68.3 | Ground 59.8
Tables SOTA; weak on grounding
Reducto | Agentic parser
Overall 73.0 | Tables 80.4 | Charts 73.4 | Faithful 86.4 | Format 57.6 | Ground 67.1
Specialist parsing service; agentic mode
LlamaIndex | Cost-tier parser
Overall 71.9 | Tables 73.2 | Charts 66.7 | Faithful 88.0 | Format 73.0 | Ground 58.6
<$0.004/page; budget-tier of LlamaParse
Google | VLM (light-reasoning)
Overall 71.0 | Tables 89.8 | Charts 64.8 | Faithful 86.2 | Format 58.4 | Ground 56.0
Tradeoff vs Thinking High: cheaper, formatting drops
Adobe Research | Specialized OCR
Overall 70.1 | Tables 89.2 | Charts 65.1 | Faithful 83.7 | Format 61.4 | Ground 51.2
Tables strong; grounding lags
Google | VLM (frontier)
Overall 69.1 | Tables 91.0 | Charts 41.1 | Faithful 90.2 | Format 52.4 | Ground 71.0
Highest faithfulness; charts surprisingly weak
Reducto | Specialist parser
Overall 67.8 | Tables 70.3 | Charts 57.0 | Faithful 86.4 | Format 56.8 | Ground 68.7
Non-agentic mode; tied with Extend
Extend | Document AI
Overall 67.8 | Tables 85.9 | Charts 40.4 | Faithful 85.0 | Format 59.5 | Ground 68.3
YC-backed structured-data extractor
OpenAI | VLM (reasoning)
Overall 67.8 | Tables 90.0 | Charts 65.5 | Faithful 86.8 | Format 60.1 | Ground 36.3
Tables strong; grounding worst in top 10
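
The Overall figures above look like the unweighted mean of the five dimension scores (Tables, Charts, Faithful, Format, Ground). That is an observation from the numbers on this page, not a documented scoring rule, so the sketch below simply checks it. Row labels combine the provider and category shown above because model names are not given, and the 0.1 tolerance is an assumption to absorb rounding of the published values.

```python
# (reported overall, [Tables, Charts, Faithful, Format, Ground]) copied from the listing above.
ROWS = {
    "LlamaIndex / Agentic parser":    (84.9, [90.7, 78.1, 89.7, 85.2, 80.6]),
    "Google / VLM (high-reasoning)":  (75.0, [91.5, 64.8, 90.9, 68.3, 59.8]),
    "Reducto / Agentic parser":       (73.0, [80.4, 73.4, 86.4, 57.6, 67.1]),
    "LlamaIndex / Cost-tier parser":  (71.9, [73.2, 66.7, 88.0, 73.0, 58.6]),
    "Google / VLM (light-reasoning)": (71.0, [89.8, 64.8, 86.2, 58.4, 56.0]),
    "Adobe Research / Specialized OCR": (70.1, [89.2, 65.1, 83.7, 61.4, 51.2]),
    "Google / VLM (frontier)":        (69.1, [91.0, 41.1, 90.2, 52.4, 71.0]),
    "Reducto / Specialist parser":    (67.8, [70.3, 57.0, 86.4, 56.8, 68.7]),
    "Extend / Document AI":           (67.8, [85.9, 40.4, 85.0, 59.5, 68.3]),
    "OpenAI / VLM (reasoning)":       (67.8, [90.0, 65.5, 86.8, 60.1, 36.3]),
}

TOLERANCE = 0.1  # assumed allowance for rounding in the published scores


def mean_score(dimensions: list[float]) -> float:
    """Unweighted mean of the dimension scores."""
    return sum(dimensions) / len(dimensions)


def check_rows(rows: dict[str, tuple[float, list[float]]]) -> dict[str, float]:
    """Return the absolute difference between reported and recomputed overall per row."""
    return {name: abs(overall - mean_score(dims)) for name, (overall, dims) in rows.items()}


if __name__ == "__main__":
    for name, diff in check_rows(ROWS).items():
        status = "ok" if diff <= TOLERANCE else "MISMATCH"
        print(f"{name:35s} diff={diff:.2f} {status}")
```

If the simple-mean reading holds, a single weak dimension costs a fifth of its shortfall in the overall score, which is why a grounding score of 36.3 still leaves the last row level with entries that are more balanced.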

AI Image Detection (AdversIm)

Data as of April 25, 2026
Google | Detector + Generator | -95 | Proprietary (API)
Clean % 100.0 | Perturbed % 5.0
Top clean-image detection; collapses under blur/noise/JPEG
OpenAI | Detector + Generator | -94 | Proprietary (API)
Clean % 100.0 | Perturbed % 6.0
Tied for clean-image SOTA; equally fragile under perturbation
xAI | Generator (most evasive) | Proprietary (X Premium+)
Perturbed % 7.0
Most evasive generator under perturbation in the benchmark
4. Qwen Image ●
#3
Alibaba | Detector + Generator | -94 | Apache 2.0
Clean % 100.0 | Perturbed % 6.0
Only fully open-weight entry; same fragility as closed peers
ByteDance | Detector + Generator | -93 | Proprietary (API)
Clean % 99.0 | Perturbed % 6.0
ByteDance Seed image gen; detection nearly matches frontier
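
The fragility described above can be expressed as the gap between clean and perturbed detection rates. The sketch below computes that gap in percentage points for the rows that list both figures. The results coincide with the -95, -94, and -93 values attached to those rows, which suggests (as an inference from this page, not a stated definition) that those values are the same clean-minus-perturbed drop. Row labels use the provider names shown above.

```python
# (provider, clean detection %, perturbed detection %) copied from the listing above.
# The xAI row is omitted because it lists only a perturbed figure.
ROWS = [
    ("Google", 100.0, 5.0),
    ("OpenAI", 100.0, 6.0),
    ("Alibaba (Qwen Image)", 100.0, 6.0),
    ("ByteDance", 99.0, 6.0),
]


def robustness_gap(clean: float, perturbed: float) -> float:
    """Drop in detection rate, in percentage points, once images are perturbed."""
    return clean - perturbed


def retained_fraction(clean: float, perturbed: float) -> float:
    """Share of clean-image detection performance that survives perturbation."""
    return perturbed / clean


GAPS = {name: robustness_gap(clean, pert) for name, clean, pert in ROWS}

if __name__ == "__main__":
    for name, clean, pert in ROWS:
        print(
            f"{name:22s} gap={GAPS[name]:.1f} pts, "
            f"retained={retained_fraction(clean, pert):.1%}"
        )
```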

Face-Swap / Avatar / Synthetic Media

Data as of April 24, 2026
1. DeepFaceLab ●
iperov | Face-swap (open source) | unlimited | source-dependent | GPLv3
Canonical OSS face-swap toolkit
2. LivePortrait ●
Kuaishou | Portrait animation | unlimited | 512-1024 | MIT
Driven by reference video motion
3. FaceFusion 3 ●
Henry Ruhs | Face-swap (open source) | unlimited | source-dependent | MIT
GPU + CPU pipelines; active dev
4. Hallo3 ●
Fudan / Baidu | Audio-driven talking-head | unlimited | 1024 | MIT
Voice-driven portrait animation
5. D-ID Creative Reality Studio
D-ID | Avatar / talking-head | 5min | 1080p | Proprietary
Widely used for photo-to-video
6. Pika 2.2
Pika Labs | Text/image-to-video | 10s | 1080p | Proprietary
Pikaffects for stylised motion
7. MiniMax Hailuo 02
MiniMax | Text/image-to-video | 10s | 1080p | Proprietary
Lowest-cost frontier video gen
Kuaishou | Text/image-to-video | 10s | 1080p | Proprietary
Strong face/body motion realism
9. Luma Ray 2
Luma AI | Text/image-to-video | 10s | 1080p | Proprietary
Strong physics fidelity
10. HeyGen Avatar V5
HeyGen | Avatar / talking-head | 5min | 4K | Proprietary
Most-used commercial deepfake stack
Runway | Text/image-to-video | 16s | 1080p | Proprietary API
Character and scene consistency focus
Google DeepMind | Text-to-video (with audio) | 60s | 4K | Proprietary (Gemini API, Flow)
SynthID watermark on all outputs
OpenAI | Text-to-video | 60s | 1080p | Proprietary (ChatGPT Pro/Plus)
C2PA provenance watermarking
14. Flux Kontext ●
Black Forest Labs | Image edit / face-swap | n/a | 2K | Open weights
Reference-driven image editing; face-swap capable
ByteDance | Text/image-to-video | 10s | 1080p | Proprietary (Venice, CapCut)
#1 in public video arena (2026-04)