Robo2u

Autonomous Agent Benchmarks

Data as of June 12, 2026
Anthropic
HLE+Tools 53.0 | Browse 83.7 | DSQA 91.3 | Toolathlon 47.2 | SWE-V 80.8 | GAIA 67.2 | WebArena 54.3 | OSWorld-V 72.7 | ClawEval 66.3
Top on real-world agent tasks
2.Ling-2.6-1T●
#3.3
InclusionAI
SWE-V 82.2
Instant instruct model; fast-thinking; SOTA on SWE-bench Verified; real-world agents, tool use, persistent coding agents
3.Devin 3
#3.5
Cognition
SWE-V 82.1
Autonomous SWE agent; $500/mo
Google
HLE+Tools 51.4 | Browse 85.9 | DSQA 81.9 | Toolathlon 48.8 | SWE-V 80.6 | GAIA 66.1 | WebArena 51.8 | OSWorld-V 45.9
Computer Use API
5.Manus 2
#4.3
Manus
GAIA 72.0
Chinese agent platform; GAIA leader
6.Kimi K2.6●
#4.3
Moonshot
HLE+Tools 54.0 | Browse 83.2 | DSQA 92.5 | Toolathlon 50.0 | SWE-V 80.2 | OSWorld-V 73.1
Alibaba
SWE-V 80.4
Agent-centric flagship; long-horizon autonomous execution.
#4.8
H Company
OSWorld-V 78.8
Apache 2.0; 35B MoE (3B active); UI-grounding/computer-use specialist; powers HoloTab Chrome ext. Score is OSWorld-Verified (cleaned subset), not raw OSWorld
Z.ai
SWE-V 77.8
Native multimodal agent model; CogViT vision encoder; leads AndroidWorld, WebVoyager, ZClawBench; design-to-code; OpenClaw/Claude Code compatible
#5.5
OpenAI
HLE+Tools 52.1 | Browse 82.7 | DSQA 78.6 | Toolathlon 54.6 | SWE-V 76.9 | GAIA 65.8 | WebArena 52.1 | OSWorld-V 75.0
Operator successor
#5.5
OpenRouter
Agentic foundation model on OpenRouter; native tool use, long-context tasks, code generation, automated workflows; compatible with Claude Code and OpenClaw. Free preview; provider may log prompts/completions.
12.GLM 4.6●
#5.5
Z.ai
SWE-V 77.8
Open-weight MoE; basis for Big Pickle; agentic tool use specialist
Alibaba
SWE-V 78.8 | GAIA 57.6
Tool use specialist
Anthropic
System card 2026-05-29: 5x fewer dishonest reports and 10x less overconfidence on agentic coding vs 4.7; dropped adversarial-business training (less dishonest, more scam-susceptible)
Anthropic
#5.5
OpenAI
#5.5
DeepSeek
1.6T-A49B MoE; 1M context; runnable on Huawei Ascend chips
Anthropic
#5.5
DeepSeek
Smaller MoE companion to V4 Pro; high-throughput inference
20.Ring-2.6-1T●
#5.5
InclusionAI
Trillion-parameter thinking model; PinchBench 87.6, ClawEval, TAU2-Bench, GAIA2-search; adaptive reasoning for tool-heavy agent workflows
#5.5
xAI
xAI agent-capable flagship. Tool-use scores pending.
22.MiniMax M2.7●
#5.5
MiniMax
MiniMax agentic-capable (text-only). Open weights. Tool-use scores pending.
#5.5
OpenAI
Iterative GPT-5 release with agent improvements. Scores pending.
Google
Fast tier of Gemini 3, agent-capable. Scores pending.
Google
Previous-gen flagship; agent-capable.
26.MiniMax M3●
#5.5
MiniMax
Browse 83.5
Agentic open-weight; BrowseComp 83.5, Terminal-Bench 2.1 66.0, 1M context.
Google
MCP Atlas 83.6; Terminal-Bench 2.1 76.2; beats Gemini 3.1 Pro on agentic suite.
Alibaba
Multimodal agent model; ScreenSpot Pro 79.0 GUI grounding.
Microsoft AI
Agentic coding model for Copilot/VS Code; SWE-Bench Pro 51.
Anthropic
SWE-V 79.6 | GAIA 62.4 | WebArena 50.2 | OSWorld-V 44.7
Best value agent
#5.8
OpenCode Zen
SWE-V 77.8
Stealth model (GLM 4.6); free on OpenCode Zen; reasoning + tool calling; text-only
32.GLM 5.1●
#6
Z.ai
SWE-V 77.8 | GAIA 59.4
Open-weight MIT; 754B MoE; 8hr autonomous coding
#6.3
Mistral
SWE-V 77.6
Merged instruct/reasoning/coding model; powers Mistral Vibe remote agents and Le Chat Work mode; 91.4 on tau3-Telecom
34.Hy3 Preview●
#7
Tencent Hunyuan
SWE-V 74.4
Open-weight Hunyuan 3 preview; reports Terminal-Bench 2.0 54.4 and strong OpenClaw-style agent scores
Moonshot
HLE+Tools 50.2 | Browse 74.9 | DSQA 89.0 | Toolathlon 27.8 | SWE-V 76.8 | GAIA 58.9 | OSWorld-V 63.3
China API; agent-focused
DeepSeek
SWE-V 73.1 | GAIA 55.3
Best open agent model
37.o3
#8.3
OpenAI
SWE-V 69.1 | GAIA 56.4
Pure reasoning; pre-Operator
#9
Meta
SWE-V 68.5 | GAIA 48.7
Best Llama-family agent

ClawEval / PinchBench (Xiaomi OpenClaw harness)

Data as of April 28, 2026
Anthropic | Native (Claude Code)
ClawEval 66.3 | Multimodal 24.8 | Tok/Traj 110000
ClawEval reference point; highest score, highest token cost per trajectory.
Xiaomi MiMo | Open (Claude-Code-compatible scaffold)
ClawEval 63.8 | Tok/Traj 70000
40-60% fewer tokens than Opus 4.6 / Gemini 3.1 Pro / GPT-5.4 at comparable capability.
#2.5
Xiaomi MiMo | Open scaffold
ClawEval 62.3 | Multimodal 23.8
Native multimodal base model; matches Claude Sonnet 4.6 on ClawEval Multimodal.
Anthropic | Native (Claude Code)
Multimodal 23.8
Workhorse reference for ClawEval Multimodal; tied with MiMo-V2.5 base.
Xiaomi MiMo | Open scaffold
ClawEval 61.5
Predecessor to V2.5; Xiaomi reports global leadership on PinchBench (no public number).
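The token-efficiency claim above can be checked against the two Tok/Traj figures this table actually lists. The following is a minimal sketch of that arithmetic; the function name `token_savings` is hypothetical, and the comparison is only between the two rows that report a token count, not against the models named in the claim, for which no Tok/Traj value appears here.

```python
def token_savings(reference_tokens: int, candidate_tokens: int) -> float:
    """Fraction of tokens saved by the candidate relative to the reference."""
    if reference_tokens <= 0:
        raise ValueError("reference_tokens must be positive")
    return 1.0 - candidate_tokens / reference_tokens


# Tok/Traj figures copied from the table above.
REFERENCE_TOK_PER_TRAJ = 110_000  # Anthropic entry, native Claude Code harness
CANDIDATE_TOK_PER_TRAJ = 70_000   # Xiaomi MiMo entry, Claude-Code-compatible scaffold

savings = token_savings(REFERENCE_TOK_PER_TRAJ, CANDIDATE_TOK_PER_TRAJ)
print(f"{savings:.1%}")  # prints 36.4%
```

Against the listed reference row the saving is about 36%, slightly under the quoted 40-60% range; the quoted range refers to other models, so the two numbers are not directly contradictory.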

PostTrainBench: Agentic Post-Training

Data as of April 29, 2026
Anthropic | Claude Code
Avg 23.2 | AIME 5.0 | ArenaH 7.8 | BFCL 75.9 | GPQA 25.5 | GSM8K 41.0 | Health 18.8 | HumanEval 24.7
Leads PostTrainBench; only agent observed using GRPO (RL): Sonnet 4.6 in 33% of tasks, Opus 4.6 in 3%. All other harnesses use SFT only.
Google | OpenCode
Avg 21.6 | AIME 3.9 | ArenaH 7.4 | BFCL 62.8 | GPQA 18.5 | GSM8K 45.5 | Health 14.5 | HumanEval 40.2
Open-source OpenCode scaffold; strongest HumanEval among agents.
OpenAI | Codex CLI
Avg 21.4 | AIME 0.8 | ArenaH 6.6 | BFCL 52.5 | GPQA 23.7 | GSM8K 55.9 | Health 15.8 | HumanEval 30.2
Top GSM8K among agents; near-zero AIME suggests math-RL pipeline weakness.

Coding Agents / Dev Tools

Data as of May 2, 2026
1.Devin 3
#1
Cognition | Autonomous agent | Proprietary (Claude + custom) | $500/mo
SWE-V 82.1
First autonomous SWE; web UI + async tasks
2.Pi (pi-mono)●
#2
Mario Zechner (badlogic) | CLI | Multi-model | Free
SWE-V 68.0
Minimal 4-tool coding agent (read, write, edit, bash); self-modifying; powers OpenClaw core; 'Slow the F*** Down' philosophy at AI Engineer Europe
3.OpenHands●
#3
All Hands AI | Autonomous agent | Multi-model | Free (self-host)
SWE-V 65.8
Formerly OpenDevin; most-starred open agent
4.Aider●
#4
Aider | CLI | Multi (Claude, GPT, DeepSeek) | Free
SWE-V 64.6
Git-native pair programmer; Python
5.Amp
#5
Sourcegraph | Agent (web + IDE) | Claude + code search | $20/mo
Sourcegraph Cody successor; code graph aware
6.Cursor
#6
Anysphere | VSCode fork | Multi (Claude, GPT, Gemini) | $20/mo Pro
Most popular AI editor; $9.5B valuation
7.Windsurf
#7
OpenAI (acq Codeium) | VSCode fork | Multi + OpenAI priority | $15/mo Pro
Acquired by OpenAI (Q2 2025); Cascade agent
8.Cline●
#8
Cline | VSCode extension | Multi-model | API pass-through
Formerly Claude Dev; 1M+ installs
9.Continue●
#9
Continue | VSCode / JetBrains ext | Multi-model | Free (self-host)
Open-source Copilot alt; config as code
10.Zed AI●
#10
Zed Industries | Editor | Multi-model | Free / $20 Pro
Rust-native editor; collaborative
Anthropic | CLI | Claude Sonnet 4.6 / Opus 4.6 | API pass-through
Official Anthropic CLI; sub-agents, skills, hooks
12.Goose●
#12
Block | CLI + Desktop | Multi-model | Free
Block (Square) open-source; MCP-native
#13
OpenAI | CLI | GPT-5.x Codex | API pass-through
Official OpenAI CLI; image + diff review
14.Qwen Code●
#14
Alibaba | CLI | Qwen 3 Coder 480B | Free
Ships with Qwen3-Coder release
15.Tabby●
#15
TabbyML | Self-host Copilot | Open models (StarCoder, etc.) | Free (self-host)
Air-gapped GitHub Copilot alternative
16.OpenCode●
#16
Anomaly / SST | CLI + desktop + IDE | Multi-model | Free / API pass-through
Open-source terminal-native coding agent; LSP-aware; provider-agnostic
#17
Moonshot | Cloud agent (browser) | Kimi K2.6 | Hosted; claims ~88% cheaper than Claude Code
Native OpenClaw on kimi.com; 5,000 community skills, 40GB cloud storage, Claw Groups (up to 300 sub-agents × 4,000 steps), 24/7 scheduled tasks. Anthropic-compatible API at api.moonshot.ai/anthropic, so vanilla Claude Code works with K2.6.
18.OpenClaw●
#18
Peter Steinberger / OpenClaw Foundation | Agent runtime (messaging-first) | Multi-model | Free (MIT)
Originally Clawdbot (2025-11-14); renamed Moltbot, then OpenClaw (~2026-01-30) after Anthropic trademark complaint. Local-first, multi-LLM, per-skill permission declarations. Steinberger joined OpenAI 2026-02; project moved to non-profit foundation. Progenitor of the 'Claw' ecosystem.
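The Moonshot entry (#17) above notes an Anthropic-compatible endpoint that lets an unmodified Claude Code talk to K2.6. A minimal sketch of how that redirection might be wired up follows. It assumes the Claude Code CLI honours the `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN` environment variables, that the endpoint is served over HTTPS (the listing gives no scheme), and that the binary is named `claude`; the helper name `claude_code_env` and the key placeholder are hypothetical.

```python
import os


def claude_code_env(base_url: str, api_key: str) -> dict[str, str]:
    """Return a copy of the current environment pointing Claude Code at another endpoint."""
    env = dict(os.environ)
    # Assumption: Claude Code reads these two variables to pick its API endpoint and credential.
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = api_key
    return env


# The host and path come from the listing; the https:// scheme is an assumption.
env = claude_code_env("https://api.moonshot.ai/anthropic", "YOUR_MOONSHOT_KEY")

# To launch the CLI with this environment (not run here):
#   import subprocess
#   subprocess.run(["claude"], env=env)
```

Building the environment as a copy keeps the parent shell untouched, so switching back to the default endpoint only requires launching without the overrides.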

Agent Frameworks / SDKs

Data as of April 25, 2026
1.LangChain●
#1
LangChain | Py / TS | RAG + Agent | MIT | ✓
Stars 102.0k
Largest ecosystem; LangGraph for stateful agents
2.Fabric●
#2
Daniel Miessler | Go | Prompt patterns / augmentation | MIT | –
Stars 41.6k
Modular crowdsourced prompt-pattern library; CLI-first; works with any LLM via API or local; shines for repeatable text tasks (summarize, extract, label)
3.LlamaIndex●
#3
LlamaIndex | Py / TS | RAG / Data | MIT | ✓
Stars 41.0k
Data-first; strong loaders and indices
4.AutoGen●
#4
Microsoft | Py / .NET | Multi-agent | MIT | ✓
Stars 38.0k
Conversational multi-agent; v0.4 rewrite
5.CrewAI●
#5
CrewAI Inc. | Python | Multi-agent | MIT | ✓
Stars 28.0k
Role-based crews; standalone (no LangChain dep)
#6
Microsoft | C# / Py / Java | Agent | MIT | ✓
Stars 23.5k
Microsoft-native; enterprise .NET focus
7.Agno●
#7
Agno (ex Phidata) | Python | Multi-agent | MPL-2.0 | ✓
Stars 21.5k
Fast multi-modal agents; teams and workflows
8.DSPy●
#8
Stanford NLP | Python | Programming | MIT | –
Stars 20.0k
Compile prompts via optimizers; research-led
9.OpenAI Swarm●
#9
OpenAI | Python | Multi-agent | MIT | –
Stars 19.0k
Deprecated educational reference; superseded by OpenAI Agents SDK
10.Haystack●
#10
deepset | Python | RAG / Pipelines | Apache 2.0 | ✓
Stars 18.0k
Production RAG pipelines; v2 rewrite
#11
Tianyi Zhou (Shanghai AI Lab) | Python | Agent distillation | MIT | –
Stars 17.3k
Chinese 'Digital Life' agent framework that imports a coworker's Lark/DingTalk chat history + files and distills their skills+style into an LLM agent. Viral in China; centerpiece of MIT Tech Review's worker-replacement story (Apr 2026)
12.Vercel AI SDK●
#12
Vercel | TypeScript | UI / Streaming | Apache 2.0 | ✓
Stars 15.5k
React/Svelte/Vue streaming hooks; tool calls
13.Letta●
#13
Letta (ex MemGPT) | Python | Memory / Agent | Apache 2.0 | ✓
Stars 14.0k
Stateful agents with persistent memory
14.LangGraph●
#14
LangChain | Py / TS | Graph Agent | MIT | ✓
Stars 12.5k
Stateful DAG agent runtime; HITL support
15.LangGraph●
#15
LangChain | Py / TS | Stateful Agent Runtime | MIT | ✓
Stars 12.0k
Durable execution, streaming, human-in-the-loop, persistence; deploy via langgraph-cli to LangSmith Deployment
16.Mastra●
#16
Mastra AI | TypeScript | Agent | Apache 2.0 | ✓
Stars 11.0k
TS-first; workflows, evals, voice, RAG
17.Pydantic AI●
#17
Pydantic | Python | Typed Agent | MIT | ✓
Stars 9.5k
Type-safe; from Pydantic team
#18
OpenAI | Py / TS | Agent Loop | MIT | ✓
Stars 8.5k
Official OpenAI; handoffs + guardrails
19.Google ADK●
#19
Google | Py / Java | Agent | Apache 2.0 | ✓
Stars 6.5k
Agent Development Kit; A2A protocol
20.Magentic-One●
#20
Microsoft Research | Python | Multi-agent | MIT | –
Stars 5.8k
Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents
21.AG2●
#21
AG2 Community | Python | Multi-agent | Apache 2.0 | ✓
Stars 4.2k
Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite
#22
Anthropic | TS / Py | Agent | MIT | ✓
Stars 3.2k
Sub-agents, skills, hooks; Claude Code core

Multi-Agent Swarms

Data as of April 25, 2026
1.AutoGen●
#4
Microsoft | Py / .NET | MIT
Stars 38.0k
Conversational multi-agent; v0.4 rewrite
2.CrewAI●
#5
CrewAI Inc. | Python | MIT
Stars 28.0k
Role-based crews; standalone (no LangChain dep)
3.Agno●
#7
Agno (ex Phidata) | Python | MPL-2.0
Stars 21.5k
Fast multi-modal agents; teams and workflows
4.OpenAI Swarm●
#9
OpenAI | Python | MIT
Stars 19.0k
Deprecated educational reference; superseded by OpenAI Agents SDK
5.Letta●
#13
Letta (ex MemGPT) | Python | Apache 2.0
Stars 14.0k
Stateful agents with persistent memory
6.LangGraph●
#14
LangChain | Py / TS | MIT
Stars 12.5k
Stateful DAG agent runtime; HITL support
OpenAI | Py / TS | MIT
Stars 8.5k
Official OpenAI; handoffs + guardrails
8.Google ADK●
#19
Google | Py / Java | Apache 2.0
Stars 6.5k
Agent Development Kit; A2A protocol
9.Magentic-One●
#20
Microsoft Research | Python | MIT
Stars 5.8k
Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents
10.AG2●
#21
AG2 Community | Python | Apache 2.0
Stars 4.2k
Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite
#22
Anthropic | TS / Py | MIT
Stars 3.2k
Sub-agents, skills, hooks; Claude Code core

Safety & Alignment Benchmarks

Data as of May 11, 2026
Anthropic
HarmBench 97.8 | AdvBench 99.2 | WildGuard 94.3
Constitutional AI; strongest refusal
Anthropic
HarmBench 97.1 | AdvBench 98.8 | WildGuard 93.5
RLHF + CAI
OpenAI
HarmBench 95.4 | AdvBench 97.3 | WildGuard 91.2
RLHF; deliberative alignment
Google
HarmBench 94.2 | AdvBench 96.8 | WildGuard 90.8
Gemini safety tuning
#4
Meta
HarmBench 89.5
Open safety classifier; 8B
#4.3
Meta
HarmBench 84.2 | AdvBench 88.7
RLHF'd; open
Moonshot
HarmBench 81.3 | AdvBench 85.6
China-specific content policy
Mistral
HarmBench 79.5 | AdvBench 84.3
Explicit 'minimal guardrails' positioning
#6.3
Alibaba
HarmBench 76.8 | AdvBench 82.1
Open; Chinese content alignment
10.DeepSeek V3.2●
#7
DeepSeek
HarmBench 72.4 | AdvBench 78.9
Open model; weaker alignment

MCP Servers

Data as of April 25, 2026
1.Filesystem●
Anthropic (official) | Storage | stdio | TypeScript | MIT
Reference server: local file read/write/edit
2.GitHub●
Anthropic (official) | Dev tools | stdio / HTTP | TypeScript | MIT
Repos, issues, PRs, search
4.Puppeteer●
Anthropic (official) | Browser | stdio | TypeScript | MIT
Reference browser server
5.Slack●
Anthropic (official) | Comms | stdio | TypeScript | MIT
Read/post messages, channel management
6.Postgres●
Anthropic (official) | Database | stdio | TypeScript | MIT
Read-only SQL over Postgres
7.Fetch●
Anthropic (official) | Web | stdio | Python | MIT
Convert web pages to LLM-friendly markdown
8.Memory●
Anthropic (official) | Memory | stdio | TypeScript | MIT
Knowledge-graph persistent memory
9.Time●
Anthropic (official) | Utility | stdio | Python | MIT
Timezone + conversion utilities
10.Sequential Thinking●
Anthropic (official) | Reasoning | stdio | TypeScript | MIT
Reflective step-by-step thinking tool
11.Playwright●
Microsoft | Browser | stdio | TypeScript | Apache 2.0
Full browser automation via Playwright
12.Firecrawl●
Mendable | Scraping | HTTP / stdio | TypeScript | MIT
Web scraping + crawling pipeline
13.Context7●
Upstash | Docs | HTTP / stdio | TypeScript | MIT
Fetches current library docs on demand
14.Apify●
Apify | Scraping | HTTP | TypeScript | Apache 2.0
5000+ scrapers accessible via MCP
15.Obsidian●
Community | Notes | stdio | TypeScript | MIT
Read/write Obsidian vault markdown
16.Supabase●
Supabase | Database | stdio | TypeScript | MIT
Postgres + Auth + Storage + Realtime
17.Exa●
Exa | Search | HTTP | TypeScript | MIT
Semantic web search with summaries
18.Sentry●
Sentry | Monitoring | HTTP | TypeScript | MIT
Query errors + releases; AI triage hook
19.Stripe●
Stripe | Payments | HTTP | TypeScript | MIT
Query and manipulate Stripe API
20.Linear●
Linear | PM | HTTP | TypeScript | MIT
Issues, projects, cycles; official Linear MCP
21.Figma Dev Mode●
Figma | Design | HTTP | TypeScript | MIT
Design tokens + code gen from Figma
22.Notion●
Notion | Docs | HTTP | TypeScript | MIT
Read/edit Notion pages and databases
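Most servers in the list above use the stdio transport, meaning the client launches the server as a local subprocess and exchanges messages over its standard input and output. The sketch below shows what registering the Filesystem reference server might look like, assuming a client that reads a JSON file with an `mcpServers` map of `command` and `args` entries (a convention used by several desktop clients, not something this listing specifies). The helper name `stdio_server_entry` and the directory path are hypothetical placeholders.

```python
import json


def stdio_server_entry(command: str, args: list[str]) -> dict:
    """Describe a stdio MCP server as the command line the client should spawn."""
    return {"command": command, "args": args}


# Assumption: the client expects a top-level "mcpServers" map keyed by a server label.
config = {
    "mcpServers": {
        "filesystem": stdio_server_entry(
            "npx",
            # npm package of the reference Filesystem server; the path is a placeholder.
            ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"],
        )
    }
}

print(json.dumps(config, indent=2))
```

HTTP-transport servers in the list are reached by URL instead of being spawned, so they would need a different entry shape that depends on the client.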

LLM Observability / Tracing

Data as of April 26, 2026
Humanloop | Prompt mgmt / Eval / Deploy | Proprietary | Starter free / Growth $399/mo+
Prompt versioning + collaborative UI focus
Galileo | Eval / Hallucination / Guardrails | Proprietary | Custom
Enterprise-tier; guardrail + hallucination scoring
3.Helicone●
Helicone | Tracing / Cost / Cache | Apache 2.0 | Free 10k req / $20/mo+
Proxy-based; drop-in for OpenAI/Anthropic
Arize | Tracing / Eval / Observability | ELv2 | Free self-host / enterprise
OpenInference tracing standard; good RAG eval
5.PromptFoo●
PromptFoo | Eval / Red-team | MIT | Free
CLI-first eval; CI-friendly; red-team suites built in
6.Langfuse●
Langfuse | Tracing / Eval / Prompt mgmt | MIT (core) / Commercial | Free self-host / $29/mo+ cloud
Top open-source alternative; OpenTelemetry-first
LangChain | Tracing / Eval / Prompt mgmt | Proprietary | Free 5k traces / $39/mo+
LangChain-native but framework-agnostic; strong prompt playground
8.Traceloop OpenLLMetry●
Traceloop | Tracing (standard) | Apache 2.0 | Free (library)
OpenTelemetry standard for LLM traces; backend-agnostic
Braintrust | Eval / Experiments / Tracing | Proprietary | Free / $249/mo+
Eval-first; strong experiment tracking + regression detection
10.Literal AI
Literal AI | Tracing / Eval / Dataset | Proprietary | Free / $49/mo+
Strong threads-and-datasets model; ex-Chainlit team
11.Weights & Biases Weave●
W&B | Tracing / Eval | Apache 2.0 (SDK) | Free / Enterprise
W&B's LLM observability; integrates with MLOps runs
12.Logfire●
Pydantic | Tracing (OTel) | MIT (SDK) | Free 10M spans / paid
Pydantic team's OTel-native observability; great Pydantic AI integration
LangChain | Agent Builder / Mgmt / Governance | Proprietary | Enterprise (LangSmith add-on)
Renamed from LangSmith Agent Builder; no-code agent builder + RBAC/ABAC, Skills, Sandboxes (private preview), 7.5k+ Arcade.dev tools

Prompt Guides / Prompt Engineering

Data as of May 1, 2026
OpenAI | Prompt management | Prompt objects, versioning, variables, prompt caching, eval-driven iteration | Yes | OpenAI API builders
Covers reusable/versioned prompts and dashboard-to-API workflow
Google | Prompt design | Clear instructions, prompt components, few-shot examples, task decomposition, chaining | No | Gemini API builders
Includes topic-specific guides for media, image, and video prompting
Anthropic | Prompt engineering | Clear instructions, XML structure, examples, roles, long context, tool use, agentic workflows | Yes | Claude API builders
Starts with success criteria and empirical evals before prompt tuning
Anthropic | Model-specific prompting | Clarity, examples, XML tags, context placement, tool-use instructions, chaining | Yes | Claude Opus/Sonnet/Haiku users
Living reference for current Claude models and prompt patterns