Autonomous Agent Benchmarks
Data as of June 12, 2026

| Model | Date | HLE+Tools (A) | Browse (B) | DSQA (B) | Toolathlon (B) | SWE-V (B) | GAIA (B) | WebArena (B) | OSWorld-V (B) | ClawEval (A) | Rank (#) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | 2026/2 | 53.0 | 83.7 | 91.3 | 47.2 | 80.8 | 67.2 | 54.3 | 72.7 | 66.3 | 2.5 | Top on real-world agent tasks |
| 2. | 2026/4 | - | - | - | - | 82.2 | - | - | - | - | 3.3 | Instant instruct model; fast-thinking; SOTA on SWE-bench Verified; real-world agents, tool use, persistent coding agents |
| 3. Devin 3 (Cognition) | 2025/11 | - | - | - | - | 82.1 | - | - | - | - | 3.5 | Autonomous SWE agent; $500/mo |
| 4. | 2026/2 | 51.4 | 85.9 | 81.9 | 48.8 | 80.6 | 66.1 | 51.8 | 45.9 | - | 4 | Computer Use API |
| 5. Manus 2 (Manus) | 2025/12 | - | - | - | - | - | 72.0 | - | - | - | 4.3 | Chinese agent platform; GAIA leader |
| 6. | 2026/4 | 54.0 | 83.2 | 92.5 | 50.0 | 80.2 | - | - | 73.1 | - | 4.3 | - |
| 7. | 2026/5 | - | - | - | - | 80.4 | - | - | - | - | 4.3 | Agent-centric flagship; long-horizon autonomous execution. |
| 8. | 2026/4 | - | - | - | - | - | - | - | 78.8 | - | 4.8 | Apache 2.0; 35B MoE (3B active); UI-grounding/computer-use specialist; powers HoloTab Chrome ext. Score is OSWorld-Verified (cleaned subset), not raw OSWorld |
| 9. | 2026/4 | - | - | - | - | 77.8 | - | - | - | - | 5.3 | Native multimodal agent model; CogViT vision encoder; leads AndroidWorld, WebVoyager, ZClawBench; design-to-code; OpenClaw/Claude Code compatible |
| 10. | 2026/3 | 52.1 | 82.7 | 78.6 | 54.6 | 76.9 | 65.8 | 52.1 | 75.0 | - | 5.5 | Operator successor |
| 11. | 2026/4 | - | - | - | - | - | - | - | - | - | 5.5 | Agentic foundation model on OpenRouter; native tool use, long-context tasks, code generation, automated workflows; compatible with Claude Code and OpenClaw. Free preview; provider may log prompts/completions. |
| 12. | 2025/7 | - | - | - | - | 77.8 | - | - | - | - | 5.5 | Open-weight MoE; basis for Big Pickle; agentic tool use specialist |
| 13. | 2026/4 | - | - | - | - | 78.8 | 57.6 | - | - | - | 5.5 | Tool use specialist |
| 14. | 2026/5 | - | - | - | - | - | - | - | - | - | 5.5 | System card 2026-05-29: 5x fewer dishonest reports and 10x less overconfidence on agentic coding vs 4.7; dropped adversarial-business training (less dishonest, more scam-susceptible) |
| 15. | 2026/4 | - | - | - | - | - | - | - | - | - | 5.5 | - |
| 16. | 2026/4 | - | - | - | - | - | - | - | - | - | 5.5 | - |
| 17. | 2026/4 | - | - | - | - | - | - | - | - | - | 5.5 | 1.6T-A49B MoE; 1M context; runnable on Huawei Ascend chips |
| 18. | 2025/10 | - | - | - | - | - | - | - | - | - | 5.5 | - |
| 19. | 2026/4 | - | - | - | - | - | - | - | - | - | 5.5 | Smaller MoE companion to V4 Pro; high-throughput inference |
| 20. | 2026/5 | - | - | - | - | - | - | - | - | - | 5.5 | Trillion-parameter thinking model; PinchBench 87.6, ClawEval, TAU2-Bench, GAIA2-search; adaptive reasoning for tool-heavy agent workflows |
| 21. | 2026/4 | - | - | - | - | - | - | - | - | - | 5.5 | xAI agent-capable flagship. Tool-use scores pending. |
| 22. | 2026/3 | - | - | - | - | - | - | - | - | - | 5.5 | MiniMax agentic-capable (text-only). Open weights. Tool-use scores pending. |
| 23. | 2025/12 | - | - | - | - | - | - | - | - | - | 5.5 | Iterative GPT-5 release with agent improvements. Scores pending. |
| 24. | 2026/3 | - | - | - | - | - | - | - | - | - | 5.5 | Fast tier of Gemini 3, agent-capable. Scores pending. |
| 25. | 2025/3 | - | - | - | - | - | - | - | - | - | 5.5 | Previous-gen flagship; agent-capable. |
| 26. | 2026/6 | - | 83.5 | - | - | - | - | - | - | - | 5.5 | Agentic open-weight; BrowseComp 83.5, Terminal-Bench 2.1 66.0, 1M context. |
| 27. | 2026/5 | - | - | - | - | - | - | - | - | - | 5.5 | MCP Atlas 83.6; Terminal-Bench 2.1 76.2; beats Gemini 3.1 Pro on agentic suite. |
| 28. | 2026/6 | - | - | - | - | - | - | - | - | - | 5.5 | Multimodal agent model; ScreenSpot Pro 79.0 GUI grounding. |
| 29. | 2026/6 | - | - | - | - | - | - | - | - | - | 5.5 | Agentic coding model for Copilot/VS Code; SWE-Bench Pro 51. |
| 30. | 2026/2 | - | - | - | - | 79.6 | 62.4 | 50.2 | 44.7 | - | 5.8 | Best value agent |
| 31. | 2025/10 | - | - | - | - | 77.8 | - | - | - | - | 5.8 | Stealth model (GLM 4.6); free on OpenCode Zen; reasoning + tool calling; text-only |
| 32. | 2026/4 | - | - | - | - | 77.8 | 59.4 | - | - | - | 6 | Open-weight MIT; 754B MoE; 8hr autonomous coding |
| 33. | 2026/4 | - | - | - | - | 77.6 | - | - | - | - | 6.3 | Merged instruct/reasoning/coding model; powers Mistral Vibe remote agents and Le Chat Work mode; 91.4 on tau3-Telecom |
| 34. | 2026/4 | - | - | - | - | 74.4 | - | - | - | - | 7 | Open-weight Hunyuan 3 preview; reports Terminal-Bench 2.0 54.4 and strong OpenClaw-style agent scores |
| 35. | 2026/1 | 50.2 | 74.9 | 89.0 | 27.8 | 76.8 | 58.9 | - | 63.3 | - | 7.3 | China API; agent-focused |
| 36. | 2025/9 | - | - | - | - | 73.1 | 55.3 | - | - | - | 8.3 | Best open agent model |
| 37. | 2025/4 | - | - | - | - | 69.1 | 56.4 | - | - | - | 8.3 | Pure reasoning; pre-Operator |
| 38. | 2025/4 | - | - | - | - | 68.5 | 48.7 | - | - | - | 9 | Best Llama-family agent |

#2.5
Anthropic
HLE+Tools 53.0 · Browse 83.7 · DSQA 91.3 · Toolathlon 47.2 · SWE-V 80.8 · GAIA 67.2 · WebArena 54.3 · OSWorld-V 72.7 · ClawEval 66.3
Top on real-world agent tasks
#3.3
InclusionAI
SWE-V 82.2
Instant instruct model; fast-thinking; SOTA on SWE-bench Verified; real-world agents, tool use, persistent coding agents
3.
Devin 3
#3.5 · Cognition
SWE-V 82.1
Autonomous SWE agent; $500/mo
#4
Google
HLE+Tools 51.4 · Browse 85.9 · DSQA 81.9 · Toolathlon 48.8 · SWE-V 80.6 · GAIA 66.1 · WebArena 51.8 · OSWorld-V 45.9
Computer Use API
5.
Manus 2
#4.3 · Manus
GAIA 72.0
Chinese agent platform; GAIA leader
#4.8
H Company
OSWorld-V 78.8
Apache 2.0; 35B MoE (3B active); UI-grounding/computer-use specialist; powers HoloTab Chrome ext. Score is OSWorld-Verified (cleaned subset), not raw OSWorld
#5.3
Z.ai
SWE-V 77.8
Native multimodal agent model; CogViT vision encoder; leads AndroidWorld, WebVoyager, ZClawBench; design-to-code; OpenClaw/Claude Code compatible
#5.5
OpenAI
HLE+Tools 52.1 · Browse 82.7 · DSQA 78.6 · Toolathlon 54.6 · SWE-V 76.9 · GAIA 65.8 · WebArena 52.1 · OSWorld-V 75.0
Operator successor
#5.5
OpenRouter
Agentic foundation model on OpenRouter; native tool use, long-context tasks, code generation, automated workflows; compatible with Claude Code and OpenClaw. Free preview; provider may log prompts/completions.
#5.5
Anthropic
System card 2026-05-29: 5x fewer dishonest reports and 10x less overconfidence on agentic coding vs 4.7; dropped adversarial-business training (less dishonest, more scam-susceptible)
#5.5
Anthropic
#5.5
OpenAI
#5.5
Anthropic
#5.5
InclusionAI
Trillion-parameter thinking model; PinchBench 87.6, ClawEval, TAU2-Bench, GAIA2-search; adaptive reasoning for tool-heavy agent workflows
#5.5
MiniMax
MiniMax agentic-capable (text-only). Open weights. Tool-use scores pending.
#5.5
MiniMax
Browse 83.5
Agentic open-weight; BrowseComp 83.5, Terminal-Bench 2.1 66.0, 1M context.
#5.5
Google
MCP Atlas 83.6; Terminal-Bench 2.1 76.2; beats Gemini 3.1 Pro on agentic suite.
#5.8
OpenCode Zen
SWE-V 77.8
Stealth model (GLM 4.6); free on OpenCode Zen; reasoning + tool calling; text-only
#6.3
Mistral
SWE-V 77.6
Merged instruct/reasoning/coding model; powers Mistral Vibe remote agents and Le Chat Work mode; 91.4 on tau3-Telecom
#7
Tencent Hunyuan
SWE-V 74.4
Open-weight Hunyuan 3 preview; reports Terminal-Bench 2.0 54.4 and strong OpenClaw-style agent scores
#7.3
Moonshot
HLE+Tools 50.2 · Browse 74.9 · DSQA 89.0 · Toolathlon 27.8 · SWE-V 76.8 · GAIA 58.9 · OSWorld-V 63.3
China API; agent-focused
ClawEval / PinchBench (Xiaomi OpenClaw harness)
Data as of April 28, 2026

| Model | Scaffold | Date | ClawEval (B) | Multimodal (B) | Tok/Traj | Rank (#) | Notes |
|---|---|---|---|---|---|---|---|
| 1. | Native (Claude Code) | 2026/1 | 66.3 | 24.8 | 110000 | 1 | ClawEval reference point; highest score, highest token cost per trajectory. |
| 2. | Open (Claude-Code-compatible scaffold) | 2026/4 | 63.8 | - | 70000 | 2 | 40-60% fewer tokens than Opus 4.6 / Gemini 3.1 Pro / GPT-5.4 at comparable capability. |
| 3. | Open scaffold | 2026/4 | 62.3 | 23.8 | - | 2.5 | Native multimodal base model; matches Claude Sonnet 4.6 on ClawEval Multimodal. |
| 4. | Native (Claude Code) | 2025/11 | - | 23.8 | - | 2.5 | Workhorse reference for ClawEval Multimodal; tied with MiMo-V2.5 base. |
| 5. | Open scaffold | 2025/12 | 61.5 | - | - | 3 | Predecessor to V2.5; Xiaomi reports global leadership on PinchBench (no public number). |

#1
Anthropic · Native (Claude Code)
ClawEval 66.3 · Multimodal 24.8 · Tok/Traj 110000
ClawEval reference point; highest score, highest token cost per trajectory.
#2
Xiaomi MiMo · Open (Claude-Code-compatible scaffold)
ClawEval 63.8 · Tok/Traj 70000
40-60% fewer tokens than Opus 4.6 / Gemini 3.1 Pro / GPT-5.4 at comparable capability.
#2.5
Xiaomi MiMo · Open scaffold
ClawEval 62.3 · Multimodal 23.8
Native multimodal base model; matches Claude Sonnet 4.6 on ClawEval Multimodal.
#2.5
Anthropic · Native (Claude Code)
Multimodal 23.8
Workhorse reference for ClawEval Multimodal; tied with MiMo-V2.5 base.
#3
Xiaomi MiMo · Open scaffold
ClawEval 61.5
Predecessor to V2.5; Xiaomi reports global leadership on PinchBench (no public number).
PostTrainBench: Agentic Post-Training
Data as of April 29, 2026

| Model | Harness | Date | Avg (A) | AIME (B) | ArenaH (B) | BFCL (B) | GPQA (C) | GSM8K (D) | Health (B) | HumanEval (D) | Rank (#) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | Claude Code | 2026/2 | 23.2 | 5.0 | 7.8 | 75.9 | 25.5 | 41.0 | 18.8 | 24.7 | 1 | Leads PostTrainBench; only agent observed using GRPO (RL): Sonnet 4.6 in 33% of tasks, Opus 4.6 in 3%. All other harnesses use SFT only. |
| 2. | OpenCode | 2026/4 | 21.6 | 3.9 | 7.4 | 62.8 | 18.5 | 45.5 | 14.5 | 40.2 | 2 | Open-source OpenCode scaffold; strongest HumanEval among agents. |
| 3. | Codex CLI | 2026/3 | 21.4 | 0.8 | 6.6 | 52.5 | 23.7 | 55.9 | 15.8 | 30.2 | 3 | Top GSM8K among agents; near-zero AIME suggests math-RL pipeline weakness. |

#1
Anthropic · Claude Code
Avg 23.2 · AIME 5.0 · ArenaH 7.8 · BFCL 75.9 · GPQA 25.5 · GSM8K 41.0 · Health 18.8 · HumanEval 24.7
Leads PostTrainBench; only agent observed using GRPO (RL): Sonnet 4.6 in 33% of tasks, Opus 4.6 in 3%. All other harnesses use SFT only.
#2
Google · OpenCode
Avg 21.6 · AIME 3.9 · ArenaH 7.4 · BFCL 62.8 · GPQA 18.5 · GSM8K 45.5 · Health 14.5 · HumanEval 40.2
Open-source OpenCode scaffold; strongest HumanEval among agents.
#3
OpenAI · Codex CLI
Avg 21.4 · AIME 0.8 · ArenaH 6.6 · BFCL 52.5 · GPQA 23.7 · GSM8K 55.9 · Health 15.8 · HumanEval 30.2
Top GSM8K among agents; near-zero AIME suggests math-RL pipeline weakness.
Coding Agents / Dev Tools
Data as of May 2, 2026

| Model | Date | Type | Models | Pricing | SWE-V (B) | Rank (#) | Notes |
|---|---|---|---|---|---|---|---|
| 1. Devin 3 (Cognition) | 2025/11 | Autonomous agent | Proprietary (Claude + custom) | $500/mo | 82.1 | 1 | First autonomous SWE; web UI + async tasks |
| 2. | 2025/12 | CLI | Multi-model | Free | 68.0 | 2 | Minimal 4-tool coding agent (read, write, edit, bash); self-modifying; powers OpenClaw core; 'Slow the F*** Down' philosophy at AI Engineer Europe |
| 3. OpenHands (All Hands AI) | 2024/3 | Autonomous agent | Multi-model | Free (self-host) | 65.8 | 3 | Formerly OpenDevin; most-starred open agent |
| 4. Aider (Aider) | 2023/5 | CLI | Multi (Claude, GPT, DeepSeek) | Free | 64.6 | 4 | Git-native pair programmer; Python |
| 5. Amp (Sourcegraph) | 2025/5 | Agent (web + IDE) | Claude + code search | $20/mo | - | 5 | Sourcegraph Cody successor; code graph aware |
| 6. Cursor (Anysphere) | 2023/3 | VSCode fork | Multi (Claude, GPT, Gemini) | $20/mo Pro | - | 6 | Most popular AI editor; $9.5B valuation |
| 7. Windsurf (OpenAI, acq Codeium) | 2024/11 | VSCode fork | Multi + OpenAI priority | $15/mo Pro | - | 7 | Acquired by OpenAI (Q2 2025); Cascade agent |
| 8. Cline (Cline) | 2024/7 | VSCode extension | Multi-model | API pass-through | - | 8 | Formerly Claude Dev; 1M+ installs |
| 9. Continue (Continue) | 2023/9 | VSCode / JetBrains ext | Multi-model | Free (self-host) | - | 9 | Open-source Copilot alt; config as code |
| 10. Zed AI (Zed Industries) | 2024/8 | Editor | Multi-model | Free / $20 Pro | - | 10 | Rust-native editor; collaborative |
| 11. | 2025/2 | CLI | Claude Sonnet 4.6 / Opus 4.6 | API pass-through | - | 11 | Official Anthropic CLI; sub-agents, skills, hooks |
| 12. Goose (Block) | 2025/1 | CLI + Desktop | Multi-model | Free | - | 12 | Block (Square) open-source; MCP-native |
| 13. | 2025/4 | CLI | GPT-5.x Codex | API pass-through | - | 13 | Official OpenAI CLI; image + diff review |
| 14. | 2025/7 | CLI | Qwen 3 Coder 480B | Free | - | 14 | Ships with Qwen3-Coder release |
| 15. Tabby (TabbyML) | 2023/9 | Self-host Copilot | Open models (StarCoder, etc.) | Free (self-host) | - | 15 | Air-gapped GitHub Copilot alternative |
| 16. | 2025/4 | CLI + desktop + IDE | Multi-model | Free / API pass-through | - | 16 | Open-source terminal-native coding agent; LSP-aware; provider-agnostic |
| 17. | 2026/2 | Cloud agent (browser) | Kimi K2.6 | Hosted; claims ~88% cheaper than Claude Code | - | 17 | Native OpenClaw on kimi.com; 5,000 community skills, 40GB cloud storage, Claw Groups (up to 300 sub-agents × 4,000 steps), 24/7 scheduled tasks. Anthropic-compatible API at api.moonshot.ai/anthropic so vanilla Claude Code works with K2.6. |
| 18. | 2025/11 | Agent runtime (messaging-first) | Multi-model | Free (MIT) | - | 18 | Originally Clawdbot (2025-11-14); renamed Moltbot, then OpenClaw (~2026-01-30) after Anthropic trademark complaint. Local-first, multi-LLM, per-skill permission declarations. Steinberger joined OpenAI 2026-02; project moved to non-profit foundation. Progenitor of the 'Claw' ecosystem. |

1.
Devin 3
#1 · Cognition · Autonomous agent · Proprietary (Claude + custom) · $500/mo
SWE-V 82.1
First autonomous SWE; web UI + async tasks
#2
Mario Zechner (badlogic) · CLI · Multi-model · Free
SWE-V 68.0
Minimal 4-tool coding agent (read, write, edit, bash); self-modifying; powers OpenClaw core; 'Slow the F*** Down' philosophy at AI Engineer Europe
3.
OpenHands
#3 · All Hands AI · Autonomous agent · Multi-model · Free (self-host)
SWE-V 65.8
Formerly OpenDevin; most-starred open agent
4.
Aider
#4 · Aider · CLI · Multi (Claude, GPT, DeepSeek) · Free
SWE-V 64.6
Git-native pair programmer; Python
5.
Amp
#5 · Sourcegraph · Agent (web + IDE) · Claude + code search · $20/mo
Sourcegraph Cody successor; code graph aware
6.
Cursor
#6 · Anysphere · VSCode fork · Multi (Claude, GPT, Gemini) · $20/mo Pro
Most popular AI editor; $9.5B valuation
7.
Windsurf
#7 · OpenAI (acq Codeium) · VSCode fork · Multi + OpenAI priority · $15/mo Pro
Acquired by OpenAI (Q2 2025); Cascade agent
8.
Cline
#8 · Cline · VSCode extension · Multi-model · API pass-through
Formerly Claude Dev; 1M+ installs
9.
Continue
#9 · Continue · VSCode / JetBrains ext · Multi-model · Free (self-host)
Open-source Copilot alt; config as code
10.
Zed AI
#10 · Zed Industries · Editor · Multi-model · Free / $20 Pro
Rust-native editor; collaborative
#11
Anthropic · CLI · Claude Sonnet 4.6 / Opus 4.6 · API pass-through
Official Anthropic CLI; sub-agents, skills, hooks
12.
Goose
#12 · Block · CLI + Desktop · Multi-model · Free
Block (Square) open-source; MCP-native
15. Tabby
#15 · TabbyML · Self-host Copilot · Open models (StarCoder, etc.) · Free (self-host)
Air-gapped GitHub Copilot alternative
#16
Anomaly / SST · CLI + desktop + IDE · Multi-model · Free / API pass-through
Open-source terminal-native coding agent; LSP-aware; provider-agnostic
#17
Moonshot · Cloud agent (browser) · Kimi K2.6 · Hosted; claims ~88% cheaper than Claude Code
Native OpenClaw on kimi.com; 5,000 community skills, 40GB cloud storage, Claw Groups (up to 300 sub-agents × 4,000 steps), 24/7 scheduled tasks. Anthropic-compatible API at api.moonshot.ai/anthropic so vanilla Claude Code works with K2.6.
#18
Peter Steinberger / OpenClaw Foundation · Agent runtime (messaging-first) · Multi-model · Free (MIT)
Originally Clawdbot (2025-11-14); renamed Moltbot, then OpenClaw (~2026-01-30) after Anthropic trademark complaint. Local-first, multi-LLM, per-skill permission declarations. Steinberger joined OpenAI 2026-02; project moved to non-profit foundation. Progenitor of the 'Claw' ecosystem.
Agent Frameworks / SDKs
Data as of April 25, 2026

| Model | Date | Language | Category | License | Stars | Rank (#) | Notes |
|---|---|---|---|---|---|---|---|
| 1. | 2022/10 | Py / TS | RAG + Agent | MIT | 102.0k | 1 | Largest ecosystem; LangGraph for stateful agents |
| 2. | 2024/1 | Go | Prompt patterns / augmentation | MIT | 41.6k | 2 | Modular crowdsourced prompt-pattern library; CLI-first; works with any LLM via API or local; shines for repeatable text tasks (summarize, extract, label) |
| 3. LlamaIndex (LlamaIndex) | 2022/11 | Py / TS | RAG / Data | MIT | 41.0k | 3 | Data-first; strong loaders and indices |
| 4. | 2023/9 | Py / .NET | Multi-agent | MIT | 38.0k | 4 | Conversational multi-agent; v0.4 rewrite |
| 5. CrewAI (CrewAI Inc.) | 2023/11 | Python | Multi-agent | MIT | 28.0k | 5 | Role-based crews; standalone (no LangChain dep) |
| 6. | 2023/3 | C# / Py / Java | Agent | MIT | 23.5k | 6 | Microsoft-native; enterprise .NET focus |
| 7. Agno (Agno, ex Phidata) | 2023/8 | Python | Multi-agent | MPL-2.0 | 21.5k | 7 | Fast multi-modal agents; teams and workflows |
| 8. DSPy (Stanford NLP) | 2023/1 | Python | Programming | MIT | 20.0k | 8 | Compile prompts via optimizers; research-led |
| 9. | 2024/10 | Python | Multi-agent | MIT | 19.0k | 9 | Deprecated educational reference; superseded by OpenAI Agents SDK |
| 10. Haystack (deepset) | 2020/11 | Python | RAG / Pipelines | Apache 2.0 | 18.0k | 10 | Production RAG pipelines; v2 rewrite |
| 11. | 2026/4 | Python | Agent distillation | MIT | 17.3k | 11 | Chinese 'Digital Life' agent framework: imports a coworker's Lark/DingTalk chat history + files and distills their skills+style into an LLM agent. Viral in China; centerpiece of MIT Tech Review's worker-replacement story (Apr 2026) |
| 12. Vercel AI SDK (Vercel) | 2023/6 | TypeScript | UI / Streaming | Apache 2.0 | 15.5k | 12 | React/Svelte/Vue streaming hooks; tool calls |
| 13. Letta (Letta, ex MemGPT) | 2023/10 | Python | Memory / Agent | Apache 2.0 | 14.0k | 13 | Stateful agents with persistent memory |
| 14. LangGraph (LangChain) | 2024/1 | Py / TS | Graph Agent | MIT | 12.5k | 14 | Stateful DAG agent runtime; HITL support |
| 15. | 2024/1 | Py / TS | Stateful Agent Runtime | MIT | 12.0k | 15 | Durable execution, streaming, human-in-the-loop, persistence; deploy via langgraph-cli to LangSmith Deployment |
| 16. Mastra (Mastra AI) | 2024/10 | TypeScript | Agent | Apache 2.0 | 11.0k | 16 | TS-first; workflows, evals, voice, RAG |
| 17. Pydantic AI (Pydantic) | 2024/12 | Python | Typed Agent | MIT | 9.5k | 17 | Type-safe; from Pydantic team |
| 18. | 2025/3 | Py / TS | Agent Loop | MIT | 8.5k | 18 | Official OpenAI; handoffs + guardrails |
| 19. | 2025/4 | Py / Java | Agent | Apache 2.0 | 6.5k | 19 | Agent Development Kit; A2A protocol |
| 20. | 2024/11 | Python | Multi-agent | MIT | 5.8k | 20 | Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents |
| 21. | 2024/11 | Python | Multi-agent | Apache 2.0 | 4.2k | 21 | Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite |
| 22. | 2025/9 | TS / Py | Agent | MIT | 3.2k | 22 | Sub-agents, skills, hooks; Claude Code core |

#1
LangChain · Py / TS · RAG + Agent · MIT
Stars 102.0k
Largest ecosystem; LangGraph for stateful agents
#2
Daniel Miessler · Go · Prompt patterns / augmentation · MIT
Stars 41.6k
Modular crowdsourced prompt-pattern library; CLI-first; works with any LLM via API or local; shines for repeatable text tasks (summarize, extract, label)
3. LlamaIndex
#3 · LlamaIndex · Py / TS · RAG / Data · MIT
Stars 41.0k
Data-first; strong loaders and indices
#4
Microsoft · Py / .NET · Multi-agent · MIT
Stars 38.0k
Conversational multi-agent; v0.4 rewrite
5. CrewAI
#5 · CrewAI Inc. · Python · Multi-agent · MIT
Stars 28.0k
Role-based crews; standalone (no LangChain dep)
#6
Microsoft · C# / Py / Java · Agent · MIT
Stars 23.5k
Microsoft-native; enterprise .NET focus
7. Agno
#7 · Agno (ex Phidata) · Python · Multi-agent · MPL-2.0
Stars 21.5k
Fast multi-modal agents; teams and workflows
8. DSPy
#8 · Stanford NLP · Python · Programming · MIT
Stars 20.0k
Compile prompts via optimizers; research-led
#9
OpenAI · Python · Multi-agent · MIT
Stars 19.0k
Deprecated educational reference; superseded by OpenAI Agents SDK
10. Haystack
#10 · deepset · Python · RAG / Pipelines · Apache 2.0
Stars 18.0k
Production RAG pipelines; v2 rewrite
#11
Tianyi Zhou (Shanghai AI Lab) · Python · Agent distillation · MIT
Stars 17.3k
Chinese 'Digital Life' agent framework: imports a coworker's Lark/DingTalk chat history + files and distills their skills+style into an LLM agent. Viral in China; centerpiece of MIT Tech Review's worker-replacement story (Apr 2026)
12. Vercel AI SDK
#12 · Vercel · TypeScript · UI / Streaming · Apache 2.0
Stars 15.5k
React/Svelte/Vue streaming hooks; tool calls
13. Letta
#13 · Letta (ex MemGPT) · Python · Memory / Agent · Apache 2.0
Stars 14.0k
Stateful agents with persistent memory
14. LangGraph
#14 · LangChain · Py / TS · Graph Agent · MIT
Stars 12.5k
Stateful DAG agent runtime; HITL support
#15
LangChain · Py / TS · Stateful Agent Runtime · MIT
Stars 12.0k
Durable execution, streaming, human-in-the-loop, persistence; deploy via langgraph-cli to LangSmith Deployment
16. Mastra
#16 · Mastra AI · TypeScript · Agent · Apache 2.0
Stars 11.0k
TS-first; workflows, evals, voice, RAG
17. Pydantic AI
#17 · Pydantic · Python · Typed Agent · MIT
Stars 9.5k
Type-safe; from Pydantic team
#18
OpenAI · Py / TS · Agent Loop · MIT
Stars 8.5k
Official OpenAI; handoffs + guardrails
#20
Microsoft Research · Python · Multi-agent · MIT
Stars 5.8k
Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents
#21
AG2 Community · Python · Multi-agent · Apache 2.0
Stars 4.2k
Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite
#22
Anthropic · TS / Py · Agent · MIT
Stars 3.2k
Sub-agents, skills, hooks; Claude Code core
Multi-Agent Swarms
Data as of April 25, 2026

| Model | Date | Language | License | Stars | Rank (#) | Notes |
|---|---|---|---|---|---|---|
| 1. | 2023/9 | Py / .NET | MIT | 38.0k | 4 | Conversational multi-agent; v0.4 rewrite |
| 2. CrewAI (CrewAI Inc.) | 2023/11 | Python | MIT | 28.0k | 5 | Role-based crews; standalone (no LangChain dep) |
| 3. Agno (Agno, ex Phidata) | 2023/8 | Python | MPL-2.0 | 21.5k | 7 | Fast multi-modal agents; teams and workflows |
| 4. | 2024/10 | Python | MIT | 19.0k | 9 | Deprecated educational reference; superseded by OpenAI Agents SDK |
| 5. Letta (Letta, ex MemGPT) | 2023/10 | Python | Apache 2.0 | 14.0k | 13 | Stateful agents with persistent memory |
| 6. LangGraph (LangChain) | 2024/1 | Py / TS | MIT | 12.5k | 14 | Stateful DAG agent runtime; HITL support |
| 7. | 2025/3 | Py / TS | MIT | 8.5k | 18 | Official OpenAI; handoffs + guardrails |
| 8. | 2025/4 | Py / Java | Apache 2.0 | 6.5k | 19 | Agent Development Kit; A2A protocol |
| 9. | 2024/11 | Python | MIT | 5.8k | 20 | Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents |
| 10. | 2024/11 | Python | Apache 2.0 | 4.2k | 21 | Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite |
| 11. | 2025/9 | TS / Py | MIT | 3.2k | 22 | Sub-agents, skills, hooks; Claude Code core |

2. CrewAI
#5 · CrewAI Inc. · Python · MIT
Stars 28.0k
Role-based crews; standalone (no LangChain dep)
3. Agno
#7 · Agno (ex Phidata) · Python · MPL-2.0
Stars 21.5k
Fast multi-modal agents; teams and workflows
#9
OpenAI · Python · MIT
Stars 19.0k
Deprecated educational reference; superseded by OpenAI Agents SDK
5. Letta
#13 · Letta (ex MemGPT) · Python · Apache 2.0
Stars 14.0k
Stateful agents with persistent memory
6. LangGraph
#14 · LangChain · Py / TS · MIT
Stars 12.5k
Stateful DAG agent runtime; HITL support
#20
Microsoft Research · Python · MIT
Stars 5.8k
Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents
#21
AG2 Community · Python · Apache 2.0
Stars 4.2k
Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite
Safety & Alignment Benchmarks
Data as of May 11, 2026

| Model | Date | HarmBench (B) | AdvBench (B) | WildGuard (B) | Rank (#) | Notes |
|---|---|---|---|---|---|---|
| 1. | 2026/2 | 97.8 | 99.2 | 94.3 | 1 | Constitutional AI; strongest refusal |
| 2. | 2026/2 | 97.1 | 98.8 | 93.5 | 2 | RLHF + CAI |
| 3. | 2026/3 | 95.4 | 97.3 | 91.2 | 3 | RLHF; deliberative alignment |
| 4. | 2026/2 | 94.2 | 96.8 | 90.8 | 4 | Gemini safety tuning |
| 5. | 2024/7 | 89.5 | - | - | 4 | Open safety classifier; 8B |
| 6. | 2025/4 | 84.2 | 88.7 | - | 4.3 | RLHF'd; open |
| 7. | 2026/1 | 81.3 | 85.6 | - | 5 | China-specific content policy |
| 8. | 2025/11 | 79.5 | 84.3 | - | 5.7 | Explicit 'minimal guardrails' positioning |
| 9. | 2026/2 | 76.8 | 82.1 | - | 6.3 | Open; Chinese content alignment |
| 10. | 2025/9 | 72.4 | 78.9 | - | 7 | Open model; weaker alignment |

#1
Anthropic
HarmBench 97.8 · AdvBench 99.2 · WildGuard 94.3
Constitutional AI; strongest refusal
MCP Servers
Data as of April 25, 2026

| Model | Date | Category | Transport | Language | License | Notes |
|---|---|---|---|---|---|---|
| 1. Filesystem (Anthropic, official) | 2024/11 | Storage | stdio | TypeScript | MIT | Reference server: local file read/write/edit |
| 2. GitHub (Anthropic, official) | 2024/11 | Dev tools | stdio / HTTP | TypeScript | MIT | Repos, issues, PRs, search |
| 3. Brave Search (Anthropic, official) | 2024/11 | Search | stdio | TypeScript | MIT | Web search via Brave API |
| 4. Puppeteer (Anthropic, official) | 2024/11 | Browser | stdio | TypeScript | MIT | Reference browser server |
| 5. Slack (Anthropic, official) | 2024/11 | Comms | stdio | TypeScript | MIT | Read/post messages, channel management |
| 6. Postgres (Anthropic, official) | 2024/11 | Database | stdio | TypeScript | MIT | Read-only SQL over Postgres |
| 7. Fetch (Anthropic, official) | 2024/11 | Web | stdio | Python | MIT | Convert web pages to LLM-friendly markdown |
| 8. Memory (Anthropic, official) | 2024/11 | Memory | stdio | TypeScript | MIT | Knowledge-graph persistent memory |
| 9. Time (Anthropic, official) | 2024/11 | Utility | stdio | Python | MIT | Timezone + conversion utilities |
| 10. Sequential Thinking (Anthropic, official) | 2024/12 | Reasoning | stdio | TypeScript | MIT | Reflective step-by-step thinking tool |
| 11. | 2024/12 | Browser | stdio | TypeScript | Apache 2.0 | Full browser automation via Playwright |
| 12. Firecrawl (Mendable) | 2025/1 | Scraping | HTTP / stdio | TypeScript | MIT | Web scraping + crawling pipeline |
| 13. Context7 (Upstash) | 2025/1 | Docs | HTTP / stdio | TypeScript | MIT | Fetches current library docs on demand |
| 14. Apify (Apify) | 2025/2 | Scraping | HTTP | TypeScript | Apache 2.0 | 5000+ scrapers accessible via MCP |
| 15. Obsidian (Community) | 2025/3 | Notes | stdio | TypeScript | MIT | Read/write Obsidian vault markdown |
| 16. Supabase (Supabase) | 2025/3 | Database | stdio | TypeScript | MIT | Postgres + Auth + Storage + Realtime |
| 17. Exa (Exa) | 2025/4 | Search | HTTP | TypeScript | MIT | Semantic web search with summaries |
| 18. Sentry (Sentry) | 2025/4 | Monitoring | HTTP | TypeScript | MIT | Query errors + releases; AI triage hook |
| 19. Stripe (Stripe) | 2025/5 | Payments | HTTP | TypeScript | MIT | Query and manipulate Stripe API |
| 20. Linear (Linear) | 2025/6 | PM | HTTP | TypeScript | MIT | Issues, projects, cycles; official Linear MCP |
| 21. Figma Dev Mode (Figma) | 2025/6 | Design | HTTP | TypeScript | MIT | Design tokens + code gen from Figma |
| 22. Notion (Notion) | 2025/9 | Docs | HTTP | TypeScript | MIT | Read/edit Notion pages and databases |

1. Filesystem
Anthropic (official) · Storage · stdio · TypeScript · MIT
Reference server: local file read/write/edit
2. GitHub
Anthropic (official) · Dev tools · stdio / HTTP · TypeScript · MIT
Repos, issues, PRs, search
3. Brave Search
Anthropic (official) · Search · stdio · TypeScript · MIT
Web search via Brave API
4. Puppeteer
Anthropic (official) · Browser · stdio · TypeScript · MIT
Reference browser server
5. Slack
Anthropic (official) · Comms · stdio · TypeScript · MIT
Read/post messages, channel management
6. Postgres
Anthropic (official) · Database · stdio · TypeScript · MIT
Read-only SQL over Postgres
7. Fetch
Anthropic (official) · Web · stdio · Python · MIT
Convert web pages to LLM-friendly markdown
8. Memory
Anthropic (official) · Memory · stdio · TypeScript · MIT
Knowledge-graph persistent memory
9. Time
Anthropic (official) · Utility · stdio · Python · MIT
Timezone + conversion utilities
10. Sequential Thinking
Anthropic (official) · Reasoning · stdio · TypeScript · MIT
Reflective step-by-step thinking tool
12.
Firecrawl
Mendable · Scraping · HTTP / stdio · TypeScript · MIT
Web scraping + crawling pipeline
13. Context7
Upstash · Docs · HTTP / stdio · TypeScript · MIT
Fetches current library docs on demand
14. Apify
Apify · Scraping · HTTP · TypeScript · Apache 2.0
5000+ scrapers accessible via MCP
15. Obsidian
Community · Notes · stdio · TypeScript · MIT
Read/write Obsidian vault markdown
16. Supabase
Supabase · Database · stdio · TypeScript · MIT
Postgres + Auth + Storage + Realtime
17. Exa
Exa · Search · HTTP · TypeScript · MIT
Semantic web search with summaries
18. Sentry
Sentry · Monitoring · HTTP · TypeScript · MIT
Query errors + releases; AI triage hook
19. Stripe
Stripe · Payments · HTTP · TypeScript · MIT
Query and manipulate Stripe API
20. Linear
Linear · PM · HTTP · TypeScript · MIT
Issues, projects, cycles; official Linear MCP
21. Figma Dev Mode
Figma · Design · HTTP · TypeScript · MIT
Design tokens + code gen from Figma
22. Notion
Notion · Docs · HTTP · TypeScript · MIT
Read/edit Notion pages and databases
LLM Observability / Tracing
Data as of April 26, 2026

| Model | Date | Focus | License | Pricing | Notes |
|---|---|---|---|---|---|
| 1. | 2020/9 | Prompt mgmt / Eval / Deploy | Proprietary | Starter free / Growth $399/mo+ | Prompt versioning + collaborative UI focus |
| 2. | 2022/10 | Eval / Hallucination / Guardrails | Proprietary | Custom | Enterprise-tier; guardrail + hallucination scoring |
| 3. | 2023/2 | Tracing / Cost / Cache | Apache 2.0 | Free 10k req / $20/mo+ | Proxy-based; drop-in for OpenAI/Anthropic |
| 4. | 2023/3 | Tracing / Eval / Observability | ELv2 | Free self-host / enterprise | OpenInference tracing standard; good RAG eval |
| 5. | 2023/4 | Eval / Red-team | MIT | Free | CLI-first eval; CI-friendly; red-team suites built in |
| 6. | 2023/6 | Tracing / Eval / Prompt mgmt | MIT (core) / Commercial | Free self-host / $29/mo+ cloud | Top open-source alternative; OpenTelemetry-first |
| 7. | 2023/7 | Tracing / Eval / Prompt mgmt | Proprietary | Free 5k traces / $39/mo+ | LangChain-native but framework-agnostic; strong prompt playground |
| 8. Traceloop OpenLLMetry (Traceloop) | 2023/9 | Tracing (standard) | Apache 2.0 | Free (library) | OpenTelemetry standard for LLM traces; backend-agnostic |
| 9. | 2023/12 | Eval / Experiments / Tracing | Proprietary | Free / $249/mo+ | Eval-first; strong experiment tracking + regression detection |
| 10. Literal AI (Literal AI) | 2024/1 | Tracing / Eval / Dataset | Proprietary | Free / $49/mo+ | Strong threads-and-datasets model; ex-Chainlit team |
| 11. Weights & Biases Weave (W&B) | 2024/1 | Tracing / Eval | Apache 2.0 (SDK) | Free / Enterprise | W&B's LLM observability; integrates with MLOps runs |
| 12. | 2024/6 | Tracing (OTel) | MIT (SDK) | Free 10M spans / paid | Pydantic team's OTel-native observability; great Pydantic AI integration |
| 13. | 2026/3 | Agent Builder / Mgmt / Governance | Proprietary | Enterprise (LangSmith add-on) | Renamed from LangSmith Agent Builder; no-code agent builder + RBAC/ABAC, Skills, Sandboxes (private preview), 7.5k+ Arcade.dev tools |

Humanloop · Prompt mgmt / Eval / Deploy · Proprietary · Starter free / Growth $399/mo+
Prompt versioning + collaborative UI focus
Galileo · Eval / Hallucination / Guardrails · Proprietary · Custom
Enterprise-tier; guardrail + hallucination scoring
Helicone · Tracing / Cost / Cache · Apache 2.0 · Free 10k req / $20/mo+
Proxy-based; drop-in for OpenAI/Anthropic
Arize · Tracing / Eval / Observability · ELv2 · Free self-host / enterprise
OpenInference tracing standard; good RAG eval
Langfuse · Tracing / Eval / Prompt mgmt · MIT (core) / Commercial · Free self-host / $29/mo+ cloud
Top open-source alternative; OpenTelemetry-first
LangChain · Tracing / Eval / Prompt mgmt · Proprietary · Free 5k traces / $39/mo+
LangChain-native but framework-agnostic; strong prompt playground
8. Traceloop OpenLLMetry
Traceloop · Tracing (standard) · Apache 2.0 · Free (library)
OpenTelemetry standard for LLM traces; backend-agnostic
Braintrust · Eval / Experiments / Tracing · Proprietary · Free / $249/mo+
Eval-first; strong experiment tracking + regression detection
10. Literal AI
Literal AI · Tracing / Eval / Dataset · Proprietary · Free / $49/mo+
Strong threads-and-datasets model; ex-Chainlit team
11. Weights & Biases Weave
W&B · Tracing / Eval · Apache 2.0 (SDK) · Free / Enterprise
W&B's LLM observability; integrates with MLOps runs
Pydantic · Tracing (OTel) · MIT (SDK) · Free 10M spans / paid
Pydantic team's OTel-native observability; great Pydantic AI integration
LangChain · Agent Builder / Mgmt / Governance · Proprietary · Enterprise (LangSmith add-on)
Renamed from LangSmith Agent Builder; no-code agent builder + RBAC/ABAC, Skills, Sandboxes (private preview), 7.5k+ Arcade.dev tools
Prompt Guides / Prompt Engineering
Data as of May 1, 2026

| Model | Date | Topic | Coverage |  | Audience | Notes |
|---|---|---|---|---|---|---|
| 1. | 2026/1 | Prompt management | Prompt objects, versioning, variables, prompt caching, eval-driven iteration | Yes | OpenAI API builders | Covers reusable/versioned prompts and dashboard-to-API workflow |
| 2. | 2026/1 | Prompt design | Clear instructions, prompt components, few-shot examples, task decomposition, chaining | No | Gemini API builders | Includes topic-specific guides for media, image, and video prompting |
| 3. | 2026/2 | Prompt engineering | Clear instructions, XML structure, examples, roles, long context, tool use, agentic workflows | Yes | Claude API builders | Starts with success criteria and empirical evals before prompt tuning |
| 4. | 2026/2 | Model-specific prompting | Clarity, examples, XML tags, context placement, tool-use instructions, chaining | Yes | Claude Opus/Sonnet/Haiku users | Living reference for current Claude models and prompt patterns |

OpenAI · Prompt management · Prompt objects, versioning, variables, prompt caching, eval-driven iteration · Yes · OpenAI API builders
Covers reusable/versioned prompts and dashboard-to-API workflow
Google · Prompt design · Clear instructions, prompt components, few-shot examples, task decomposition, chaining · No · Gemini API builders
Includes topic-specific guides for media, image, and video prompting
Anthropic · Prompt engineering · Clear instructions, XML structure, examples, roles, long context, tool use, agentic workflows · Yes · Claude API builders
Starts with success criteria and empirical evals before prompt tuning
Anthropic · Model-specific prompting · Clarity, examples, XML tags, context placement, tool-use instructions, chaining · Yes · Claude Opus/Sonnet/Haiku users
Living reference for current Claude models and prompt patterns