Vision (Image Understanding)
Data as of June 12, 2026

| # | Vendor | Released | Params | Context | Price | Arena (B) | MMMU-Pro (A) | MathVis+Py (B) | V*+Py (B) | Rank |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | 2026/2 | ? | 200K | $5/$25 | 1295 (#1) | - | 84.6 (#5) | 86.4 (#5) | 1 |
| 2 | | 2025/11 | ? | 1M | $2/$12 | 1287 (#2) | - | - | - | 2 |
| 3 | | 2025/12 | ? | 128K | $2/$14 | 1278 (#3) | - | - | - | 3 |
| 4 | | 2026/2 | ? | 200K | $3/$15 | 1271 (#4) | - | - | - | 4 |
| 5 | | 2025/12 | ? | 1M | $0.50/$3 | 1268 (#5) | - | - | - | 5 |
| 6 | | 2026/2 | ? | 128K | - | 1257 (#6) | - | - | - | 6 |
| 7 | Z.ai | 2026/4 | 744B (40B active, MoE) | 200K | $1/$4 | - | - | - | - | 7 |
| 8 | | 2026/3 | ? | 128K | $2/$14 | - | - | 96.1 (#1) | 98.4 (#1) | 7 |
| 9 | | 2025/11 | ? | 128K | $2/$14 | 1249 (#7) | - | - | - | 7 |
| 10 | Mistral | 2026/4 | 128B | 256K | $2/$8 | - | - | - | - | 7 |
| 11 | Anthropic | 2026/5 | ? | 200K | $15/$75 | - | - | - | - | 7 |
| 12 | Anthropic | 2026/4 | ? | 200K | $15/$75 | - | - | - | - | 7 |
| 13 | Anthropic | 2025/10 | ? | 200K | $1/$5 | - | - | - | - | 7 |
| 14 | OpenAI | 2026/4 | ? | 400K | - | - | - | - | - | 7 |
| 15 | | 2026/2 | ? | 1M | $2/$12 | - | - | 95.7 (#2) | 96.9 (#2) | 7 |
| 16 | | 2026/4 | 1T | 128K | - | - | 79.4 (#1) | 93.2 (#3) | 96.9 (#3) | 7 |
| 17 | InclusionAI | 2025/10 | ? | ? | - | - | - | - | - | 7 |
| 18 | Z.ai | 2026/4 | - | - | - | - | - | - | - | 7 |
| 19 | DeepSeek | 2026/4 | - | - | - | - | - | - | - | 7 |
| 20 | Alibaba | 2026/4 | - | - | - | - | - | - | - | 7 |
| 21 | DeepSeek | 2026/2 | - | - | - | - | - | - | - | 7 |
| 22 | Z.ai | 2025/11 | - | - | - | - | - | - | - | 7 |
| 23 | MiniMax | 2026/6 | ? | 1M | $0.60/$2 | - | - | - | - | 7 |
| 24 | Google | 2026/5 | ? | 1M | $2/$9 | - | - | - | - | 7 |
| 25 | Alibaba | 2026/6 | ? | 1M | $0.40/$2 | - | - | - | - | 7 |
| 26 | Alibaba | 2026/5 | ? | 1M | $3/$8 | - | - | - | - | 7 |
| 27 | | 2025/3 | ? | 1M | $1/$10 | 1246 (#8) | - | - | - | 8 |
| 28 | | 2026/1 | 1T | 128K | - | 1245 (#9) | - | 85.0 (#4) | 86.9 (#4) | 9 |
| 29 | | - | ? | 128K | - | 1243 (#10) | - | - | - | 10 |
| 30 | | 2026/2 | 397B | 128K | - | 1240 (#11) | - | - | - | 11 |
| 31 | | 2025/8 | ? | 128K | $2/$14 | 1225 (#12) | - | - | - | 12 |
| 32 | | 2025/9 | 235B | 128K | - | 1215 (#13) | - | - | - | 13 |
| 33 | | 2025/4 | 400B | 1M | - | 1147 (#14) | - | - | - | 14 |
Image Generation
Data as of June 9, 2026

| # | Model | Vendor | Released | Resolution | Price | Arena (B) | Rank | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | | OpenAI | 2026/4 | ? | - | 1385 | 1 | Successor to GPT-Image-1.5; supports detailed instruction following and accurate placement; available on Vercel AI Gateway |
| 2 | | Reve | 2026/6 | 4K | - | 1273 | 2 | #2 on text-to-image arena. Pitched as 'best 4K image model'; introduces layout-based generation/editing with precise control over where every object and text region lands |
| 3 | | | 2026/2 | 2048×2048 | $0.04 | 1269 | 3 | - |
| 4 | | Microsoft AI | 2026/5 | 2048×2048 | - | 1253 | 4 | #4 on text-to-image arena (#3 per Microsoft), #2 on image-to-image; successor to MAI Image 2 |
| 5 | | | 2025/12 | 2048×2048 | $0.08 | 1241 | 5 | - |
| 6 | | | 2026/5 | - | - | 1234 | 6 | Quality variant added to LMArena May 6 2026 |
| 7 | | | 2025/11 | 4K | $0.24 | 1232 | 7 | - |
| 8 | | Ideogram | 2026/6 | 2048×2048 | - | 1204 | 8 | #9 overall on text-to-image arena, #1 open-weight model. Open weights at launch; trained with bounding boxes tied to region descriptions for layout control; excels at text rendering and commercial design |
| 9 | | | 2026/5 | - | - | 1192 | 9 | - |
| 10 | MAI Image 2 | Microsoft AI | - | 2048×2048 | - | 1183 | 10 | - |
| 11 | | | 2026/5 | - | - | 1175 | 11 | - |
| 12 | | | 2025/7 | 2048×2048 | free | 1172 | 12 | - |
| 13 | | | 2026/4 | - | - | 1168 | 13 | - |
| 14 | Reve v1.5 | Reve | - | 2048×2048 | - | 1164 | 14 | - |
| 15 | | | - | 2048×2048 | $0.08 | 1164 | 15 | - |
| 16 | | | - | - | - | 1160 | 16 | - |
| 17 | | Sourceful | 2025/11 | 4K | - | - | 16 | Agentic text-to-image with autonomous self-correction; #1 Text-to-Image & Image Editing on Artificial Analysis (Feb 2026). $0.15 per 1K/2K image, $0.33 per 4K. |
| 18 | | | - | - | - | 1156 | 17 | - |
| 19 | | | - | - | - | 1156 | 18 | - |
| 20 | | Google | 2025/8 | 1024×1024 | $0.03 | 1152 | 19 | Codename 'Nano Banana' on LMArena leaderboard; conversational image editing, character consistency |
| 21 | | | 2025/9 | 2048×2048 | free | 1151 | 20 | China API |
| 22 | | | - | - | - | 1149 | 21 | - |
| 23 | | | - | 2048×2048 | $0.06 | 1148 | 22 | - |
| 24 | | | 2025/12 | 2048×2048 | $0.03 | 1142 | 23 | China API |
| 25 | | | - | 2048×2048 | - | 1141 | 24 | - |
| 26 | | | 2025/12 | 1536×1536 | free | 1134 | 25 | - |
| 27 | | | - | - | - | 1130 | 26 | - |
| 28 | | | - | - | - | 1130 | 27 | - |
| 29 | | | 2025/4 | 2048×2048 | $0.04 | 1115 | 28 | - |
| 30 | Recraft v4 | Recraft | - | 2048×2048 | $0.04 | 1099 | 29 | - |
| 31 | Ideogram v3 | Ideogram | 2025/3 | 2048×2048 | $0.08 | 1049 | 30 | - |
| 32 | | | 2023/11 | 1792×1024 | $0.08 | 968.0 | 31 | - |
| 33 | | | 2024/10 | 1024×1024 | free | 938.0 | 32 | - |
Video Generation
Data as of June 9, 2026

| # | Model | Vendor | Released | Duration | Resolution | Price | Arena (B) | Rank | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | | 2026/2 | 15s | 1080p | $0.10 | 1450 | 1 | China API; via Dreamina |
| 2 | | | 2025/10 | 8s | 4K | $0.75 | 1371 | 2 | Vertex AI only |
| 3 | | | 2025/9 | 25s | 1080p | $0.50 | 1364 | 3 | SHUTTING DOWN Apr 26 |
| 4 | | | - | - | 720p | - | 1361 | 4 | - |
| 5 | | | 2025/12 | 10s | 720p | free | 1349 | 5 | Open weights |
| 6 | | | 2025/5 | 8s | 1080p | $0.50 | 1341 | 6 | - |
| 7 | | | 2025/9 | 25s | 1080p | $0.25 | 1340 | 7 | SHUTTING DOWN Apr 26 |
| 8 | | | 2025/12 | 15s | 720p | $0.05 | 1259 | 8 | China API |
| 9 | | | 2025/3 | 10s | 1080p | $0.17 | 1242 | 9 | - |
| 10 | | Kuaishou | 2026/2 | 10s | 4K | $0.02 | - | 9 | Leads text-to-video arena; native 4K, Multi-Shot Storyboard, dual audio inputs. Extend to 3 min on paid tiers. |
| 11 | | Alibaba | 2026/4 | 15s | 1080p | $0.22 | - | 9 | Tops Artificial Analysis Video Arena (T2V & I2V). API/demo only as of Jun 2026; weights NOT released (HF/GitHub 'coming soon'); 'open-source' framing is marketing. ~15B params, single-stream transformer, 8-step CFG-free (10s 1080p in ~32s), synced audio. Access via Alibaba Cloud Bailian / fal. |
| 12 | PixVerse v5.6 | PixVerse | - | - | 1080p | - | 1238 | 10 | - |
| 13 | | | 2025/12 | 15s | 1080p | $0.04 | 1218 | 11 | Cheapest quality option |
| 14 | Ray 3 | Luma AI | - | - | 1080p | - | 1207 | 12 | - |
| 15 | Hailuo 2.3 | MiniMax | 2025/10 | 6s | 1080p | - | 1197 | 13 | China API |
| 16 | | Lightricks | 2026/5 | 20s | 1080p | free | 1185 | 14 | Open weights; synced audio+video; native 9:16 portrait; fp8 + distilled variants |
| 17 | | | 2026/1 | 20s | 4K | free | 1175 | 15 | Open weights; first open audiovisual model; 4K @ 50fps |
| 18 | | | 2025/11 | 5s | 720p | free | 1170 | 16 | Open weights; China API |
| 19 | | | 2024/12 | 8s | 4K | $0.30 | 1164 | 17 | - |
| 20 | Pika 2.2 | Pika | 2025/2 | 10s | 1080p | - | 1009 | 18 | - |
Visual Factual Knowledge (WorldVQA)
Data as of April 26, 2026

| # | Type | Released | Overall % (B) | Rank | Notes |
|---|---|---|---|---|---|
| 1 | Frontier VLM | 2025/11 | 47.5 | 1 | Top overall F-score on WorldVQA |
| 2 | Frontier VLM | 2026/1 | 46.8 | 2 | Author's own model; #2 by 0.7% |
| 3 | Frontier VLM | 2025/9 | 37.5 | 3 | Strong on Head categories, weaker on Tail |
| 4 | Frontier VLM | 2025/10 | 35.2 | 4 | ByteDance Seed visual frontier |
Document Parsing (ParseBench)
Data as of April 25, 2026

| # | Vendor | Type | Released | Overall (B) | Tables (B) | Charts (B) | Faithful (B) | Format (B) | Ground (B) | Rank | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LlamaIndex | Agentic parser | 2026/4 | 84.9 (#1) | 90.7 (#3) | 78.1 (#1) | 89.7 (#3) | 85.2 (#1) | 80.6 (#1) | 1 | Highest overall; agentic loop with retries |
| 2 | Google | VLM (high-reasoning) | 2025/12 | 75.0 (#2) | 91.5 (#1) | 64.8 (#7) | 90.9 (#1) | 68.3 (#3) | 59.8 (#6) | 2 | Tables SOTA; weak on grounding |
| 3 | Reducto | Agentic parser | 2024/8 | 73.0 (#3) | 80.4 (#8) | 73.4 (#2) | 86.4 (#6) | 57.6 (#8) | 67.1 (#5) | 3 | Specialist parsing service; agentic mode |
| 4 | LlamaIndex | Cost-tier parser | 2026/4 | 71.9 (#4) | 73.2 (#9) | 66.7 (#3) | 88.0 (#4) | 73.0 (#2) | 58.6 (#7) | 4 | <$0.004/page; budget-tier of LlamaParse |
| 5 | Google | VLM (light-reasoning) | 2025/12 | 71.0 (#5) | 89.8 (#5) | 64.8 (#6) | 86.2 (#8) | 58.4 (#7) | 56.0 (#8) | 5 | Tradeoff vs Thinking High: cheaper, formatting drops |
| 6 | Adobe Research | Specialized OCR | 2025/9 | 70.1 (#6) | 89.2 (#6) | 65.1 (#5) | 83.7 (#10) | 61.4 (#4) | 51.2 (#9) | 6 | Tables strong; grounding lags |
| 7 | Google | VLM (frontier) | 2026/2 | 69.1 (#7) | 91.0 (#2) | 41.1 (#9) | 90.2 (#2) | 52.4 (#10) | 71.0 (#2) | 7 | Second-highest faithfulness; charts surprisingly weak |
| 8 | Reducto | Specialist parser | 2024/8 | 67.8 (#8) | 70.3 (#10) | 57.0 (#8) | 86.4 (#7) | 56.8 (#9) | 68.7 (#3) | 8 | Non-agentic mode; tied with Extend |
| 9 | Extend | Document AI | 2024/11 | 67.8 (#9) | 85.9 (#7) | 40.4 (#10) | 85.0 (#9) | 59.5 (#6) | 68.3 (#4) | 9 | YC-backed structured-data extractor |
| 10 | OpenAI | VLM (reasoning) | 2026/4 | 67.8 (#10) | 90.0 (#4) | 65.5 (#4) | 86.8 (#5) | 60.1 (#5) | 36.3 (#10) | 10 | Tables strong; grounding worst in top 10 |
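
The Overall figures above look consistent with an unweighted mean of the five dimension scores (Tables, Charts, Faithful, Format, Ground). The source does not state the aggregation formula, so treat this as an observation from the published numbers, not a documented definition. A minimal Python sketch that checks the assumption against the ten rows:

```python
from statistics import mean

# rank -> (published Overall, [Tables, Charts, Faithful, Format, Ground]),
# copied from the ParseBench figures listed above.
rows = {
    1: (84.9, [90.7, 78.1, 89.7, 85.2, 80.6]),
    2: (75.0, [91.5, 64.8, 90.9, 68.3, 59.8]),
    3: (73.0, [80.4, 73.4, 86.4, 57.6, 67.1]),
    4: (71.9, [73.2, 66.7, 88.0, 73.0, 58.6]),
    5: (71.0, [89.8, 64.8, 86.2, 58.4, 56.0]),
    6: (70.1, [89.2, 65.1, 83.7, 61.4, 51.2]),
    7: (69.1, [91.0, 41.1, 90.2, 52.4, 71.0]),
    8: (67.8, [70.3, 57.0, 86.4, 56.8, 68.7]),
    9: (67.8, [85.9, 40.4, 85.0, 59.5, 68.3]),
    10: (67.8, [90.0, 65.5, 86.8, 60.1, 36.3]),
}


def mean_gap(overall, parts):
    """Absolute gap between the published Overall and the unweighted mean."""
    return abs(overall - mean(parts))


# Assumption under test: Overall == mean of the five dimensions, up to rounding.
gaps = {rank: round(mean_gap(overall, parts), 2) for rank, (overall, parts) in rows.items()}
worst = max(gaps.values())

for rank, gap in gaps.items():
    print(f"rank {rank}: gap {gap}")
print("largest gap:", worst)
```

Every gap stays within 0.1, which is what rounding of the published one-decimal scores would allow. A weighted scheme cannot be ruled out from ten rows alone.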
AI Image Detection (AdversIm)
Data as of April 25, 2026

| # | Vendor | Type | Change (pts) | Released | License | Clean % (C) | Perturbed % (B) | Rank | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Google | Detector + Generator | -95 | 2025/11 | Proprietary (API) | 100.0 (#1) | 5.0 (#5) | 1 | Top clean-image detection; collapses under blur/noise/JPEG |
| 2 | OpenAI | Detector + Generator | -94 | 2025/9 | Proprietary (API) | 100.0 (#2) | 6.0 (#2) | 2 | Tied for clean-image SOTA; equally fragile under perturbation |
| 3 | xAI | Generator (most evasive) | - | 2025/8 | Proprietary (X Premium+) | - | 7.0 (#1) | 2 | Most evasive generator under perturbation in the benchmark |
| 4 | Alibaba | Detector + Generator | -94 | 2025/8 | Apache 2.0 | 100.0 (#3) | 6.0 (#3) | 3 | Only fully open-weight entry; same fragility as closed peers |
| 5 | ByteDance | Detector + Generator | -93 | 2025/10 | Proprietary (API) | 99.0 (#4) | 6.0 (#4) | 4 | ByteDance Seed image gen; detection nearly matches frontier |
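
The negative figure listed for each detector (-95, -94, -94, -93) appears to be the perturbed score minus the clean score, in percentage points. The source does not label that column, so this reading is an assumption. A short Python sketch that recomputes the change from the Clean % and Perturbed % values and compares it with the published figure:

```python
# vendor -> (clean %, perturbed %, published change); None where the source lists no value.
rows = {
    "Google": (100.0, 5.0, -95),
    "OpenAI": (100.0, 6.0, -94),
    "xAI": (None, 7.0, None),
    "Alibaba": (100.0, 6.0, -94),
    "ByteDance": (99.0, 6.0, -93),
}


def change_points(clean, perturbed):
    """Perturbed minus clean, in whole percentage points (None if clean is missing)."""
    if clean is None:
        return None
    return round(perturbed - clean)


computed = {vendor: change_points(clean, perturbed) for vendor, (clean, perturbed, _) in rows.items()}

# Vendors whose recomputed change disagrees with the published figure.
mismatches = [vendor for vendor, (_, _, published) in rows.items() if computed[vendor] != published]

print(computed)
print("mismatches:", mismatches)
```

Under this reading all four detectors match the published figure, and the xAI row has no change value because no clean score is listed for it.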
Face-Swap / Avatar / Synthetic Media
Data as of April 24, 2026

| # | Model | Developer | Released | Type | Duration | Resolution | License | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | DeepFaceLab | iperov | 2024/4 | Face-swap (open source) | unlimited | source-dependent | GPLv3 | Canonical OSS face-swap toolkit |
| 2 | | | 2024/7 | Portrait animation | unlimited | 512-1024 | MIT | Driven by reference video motion |
| 3 | FaceFusion 3 | Henry Ruhs | 2025/3 | Face-swap (open source) | unlimited | source-dependent | MIT | GPU + CPU pipelines; active dev |
| 4 | Hallo3 | Fudan / Baidu | 2025/3 | Audio-driven talking-head | unlimited | 1024 | MIT | Voice-driven portrait animation |
| 5 | D-ID Creative Reality Studio | D-ID | 2025/6 | Avatar / talking-head | 5min | 1080p | Proprietary | Widely used for photo-to-video |
| 6 | Pika 2.2 | Pika Labs | 2025/7 | Text/image-to-video | 10s | 1080p | Proprietary | Pikaffects for stylised motion |
| 7 | MiniMax Hailuo 02 | MiniMax | 2025/8 | Text/image-to-video | 10s | 1080p | Proprietary | Lowest-cost frontier video gen |
| 8 | | | 2025/9 | Text/image-to-video | 10s | 1080p | Proprietary | Strong face/body motion realism |
| 9 | Luma Ray 2 | Luma AI | 2025/9 | Text/image-to-video | 10s | 1080p | Proprietary | Strong physics fidelity |
| 10 | HeyGen Avatar V5 | HeyGen | 2025/10 | Avatar / talking-head | 5min | 4K | Proprietary | Most-used commercial deepfake stack |
| 11 | | Runway | 2025/10 | Text/image-to-video | 16s | 1080p | Proprietary API | Character and scene consistency focus |
| 12 | | Google DeepMind | 2025/11 | Text-to-video (with audio) | 60s | 4K | Proprietary (Gemini API, Flow) | SynthID watermark on all outputs |
| 13 | | | 2025/12 | Text-to-video | 60s | 1080p | Proprietary (ChatGPT Pro/Plus) | C2PA provenance watermarking |
| 14 | | Black Forest Labs | 2026/3 | Image edit / face-swap | n/a | 2K | Open weights | Reference-driven image editing; face-swap capable |
| 15 | | ByteDance | 2026/4 | Text/image-to-video | 10s | 1080p | Proprietary (Venice, CapCut) | #1 in public video arena (2026-04) |