Autonomous Agent Benchmarks

Data as of June 12, 2026

No.	Model	Developer	Released	Open	HLE+Tools (A)	Browse (B)	DSQA (B)	Toolathlon (B)	SWE-V (B)	GAIA (B)	WebArena (B)	OSWorld-V (B)	ClawEval (A)	Rank	Notes
1	Claude Opus 4.6	Anthropic	2026/2	○	53.0	83.7	91.3	47.2	80.8	67.2	54.3	72.7	66.3	2.5	Top on real-world agent tasks
2	Ling-2.6-1T	InclusionAI	2026/4	●	-	-	-	-	82.2	-	-	-	-	3.3	Instant instruct model; fast-thinking; SOTA on SWE-bench Verified; real-world agents, tool use, persistent coding agents
3	Devin 3	Cognition	2025/11	○	-	-	-	-	82.1	-	-	-	-	3.5	Autonomous SWE agent; $500/mo
4	Gemini 3.1 Pro	Google	2026/2	○	51.4	85.9	81.9	48.8	80.6	66.1	51.8	45.9	-	4	Computer Use API
5	Manus 2	Manus	2025/12	○	-	-	-	-	-	72.0	-	-	-	4.3	Chinese agent platform; GAIA leader
6	Kimi K2.6	Moonshot	2026/4	●	54.0	83.2	92.5	50.0	80.2	-	-	73.1	-	4.3	-
7	Qwen3.7 Max	Alibaba	2026/5	○	-	-	-	-	80.4	-	-	-	-	4.3	Agent-centric flagship; long-horizon autonomous execution
8	Holo3-35B-A3B	H Company	2026/4	●	-	-	-	-	-	-	-	78.8	-	4.8	Apache 2.0; 35B MoE (3B active); UI-grounding/computer-use specialist; powers HoloTab Chrome extension. Score is OSWorld-Verified (cleaned subset), not raw OSWorld
9	GLM 5V Turbo	Z.ai	2026/4	○	-	-	-	-	77.8	-	-	-	-	5.3	Native multimodal agent model; CogViT vision encoder; leads AndroidWorld, WebVoyager, ZClawBench; design-to-code; OpenClaw/Claude Code compatible
10	GPT-5.4	OpenAI	2026/3	○	52.1	82.7	78.6	54.6	76.9	65.8	52.1	75.0	-	5.5	Operator successor
11	Owl Alpha	OpenRouter	2026/4	○	-	-	-	-	-	-	-	-	-	5.5	Agentic foundation model on OpenRouter; native tool use, long-context tasks, code generation, automated workflows; compatible with Claude Code and OpenClaw. Free preview; provider may log prompts/completions
12	GLM 4.6	Z.ai	2025/7	●	-	-	-	-	77.8	-	-	-	-	5.5	Open-weight MoE; basis for Big Pickle; agentic tool-use specialist
13	Qwen 3.6 Plus	Alibaba	2026/4	○	-	-	-	-	78.8	57.6	-	-	-	5.5	Tool-use specialist
14	Claude Opus 4.8	Anthropic	2026/5	○	-	-	-	-	-	-	-	-	-	5.5	System card 2026-05-29: 5x fewer dishonest reports and 10x less overconfidence on agentic coding vs 4.7; dropped adversarial-business training (less dishonest, more scam-susceptible)
15	Claude Opus 4.7	Anthropic	2026/4	○	-	-	-	-	-	-	-	-	-	5.5	-
16	GPT-5.5	OpenAI	2026/4	○	-	-	-	-	-	-	-	-	-	5.5	-
17	DeepSeek V4 Pro	DeepSeek	2026/4	●	-	-	-	-	-	-	-	-	-	5.5	1.6T-A49B MoE; 1M context; runnable on Huawei Ascend chips
18	Claude Haiku 4.5	Anthropic	2025/10	○	-	-	-	-	-	-	-	-	-	5.5	-
19	DeepSeek V4 Flash	DeepSeek	2026/4	●	-	-	-	-	-	-	-	-	-	5.5	Smaller MoE companion to V4 Pro; high-throughput inference
20	Ring-2.6-1T	InclusionAI	2026/5	●	-	-	-	-	-	-	-	-	-	5.5	Trillion-parameter thinking model; PinchBench 87.6, ClawEval, TAU2-Bench, GAIA2-search; adaptive reasoning for tool-heavy agent workflows
21	Grok 4.20	xAI	2026/4	○	-	-	-	-	-	-	-	-	-	5.5	xAI agent-capable flagship. Tool-use scores pending
22	MiniMax M2.7	MiniMax	2026/3	●	-	-	-	-	-	-	-	-	-	5.5	MiniMax agent-capable model (text-only). Open weights. Tool-use scores pending
23	GPT-5.1	OpenAI	2025/12	○	-	-	-	-	-	-	-	-	-	5.5	Iterative GPT-5 release with agent improvements. Scores pending
24	Gemini 3 Flash	Google	2026/3	○	-	-	-	-	-	-	-	-	-	5.5	Fast tier of Gemini 3, agent-capable. Scores pending
25	Gemini 2.5 Pro	Google	2025/3	○	-	-	-	-	-	-	-	-	-	5.5	Previous-gen flagship; agent-capable
26	MiniMax M3	MiniMax	2026/6	●	-	83.5	-	-	-	-	-	-	-	5.5	Agentic open-weight; BrowseComp 83.5, Terminal-Bench 2.1 66.0, 1M context
27	Gemini 3.5 Flash	Google	2026/5	○	-	-	-	-	-	-	-	-	-	5.5	MCP Atlas 83.6; Terminal-Bench 2.1 76.2; beats Gemini 3.1 Pro on agentic suite
28	Qwen3.7 Plus	Alibaba	2026/6	○	-	-	-	-	-	-	-	-	-	5.5	Multimodal agent model; ScreenSpot Pro 79.0 GUI grounding
29	MAI-Code-1-Flash	Microsoft AI	2026/6	○	-	-	-	-	-	-	-	-	-	5.5	Agentic coding model for Copilot/VS Code; SWE-Bench Pro 51
30	Claude Sonnet 4.6	Anthropic	2026/2	○	-	-	-	-	79.6	62.4	50.2	44.7	-	5.8	Best value agent
31	Big Pickle	OpenCode Zen	2025/10	○	-	-	-	-	77.8	-	-	-	-	5.8	Stealth model (GLM 4.6); free on OpenCode Zen; reasoning + tool calling; text-only
32	GLM 5.1	Z.ai	2026/4	●	-	-	-	-	77.8	59.4	-	-	-	6	Open-weight MIT; 754B MoE; 8hr autonomous coding
33	Mistral Medium 3.5	Mistral	2026/4	●	-	-	-	-	77.6	-	-	-	-	6.3	Merged instruct/reasoning/coding model; powers Mistral Vibe remote agents and Le Chat Work mode; 91.4 on tau3-Telecom
34	Hy3 Preview	Tencent Hunyuan	2026/4	●	-	-	-	-	74.4	-	-	-	-	7	Open-weight Hunyuan 3 preview; reports Terminal-Bench 2.0 54.4 and strong OpenClaw-style agent scores
35	Kimi K2.5 Thinking	Moonshot	2026/1	○	50.2	74.9	89.0	27.8	76.8	58.9	-	63.3	-	7.3	China API; agent-focused
36	DeepSeek V3.2 Thinking	DeepSeek	2025/9	●	-	-	-	-	73.1	55.3	-	-	-	8.3	Best open agent model
37	o3	OpenAI	2025/4	○	-	-	-	-	69.1	56.4	-	-	-	8.3	Pure reasoning; pre-Operator
38	Llama 4 Maverick	Meta	2025/4	●	-	-	-	-	68.5	48.7	-	-	-	9	Best Llama-family agent
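
Per-benchmark standings follow directly from the score columns. As a minimal sketch (not the leaderboard's own code), the snippet below recomputes the SWE-V ordering for a handful of rows; the scores are copied from the table, while the function name `rank_by_score` and the tie-breaking rule (input order) are assumptions made for illustration.

```python
# SWE-V scores copied from a subset of rows in the table above.
swe_v = {
    "Ling-2.6-1T": 82.2,
    "Devin 3": 82.1,
    "Claude Opus 4.6": 80.8,
    "Gemini 3.1 Pro": 80.6,
    "Qwen3.7 Max": 80.4,
    "Kimi K2.6": 80.2,
}


def rank_by_score(scores):
    """Return {model: position}, where 1 is the highest score.

    Ties keep their input order because sorted() is stable
    (an illustrative choice, not necessarily the site's rule).
    """
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


ranks = rank_by_score(swe_v)
for name, pos in ranks.items():
    print(f"{pos}. {name} ({swe_v[name]})")
```

The same helper applies to any other score column once rows with a missing score ("-") are filtered out.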

ClawEval / PinchBench (Xiaomi OpenClaw harness)

Data as of April 28, 2026

No.	Model	Developer	Scaffold	Released	Open	ClawEval (B)	Multimodal (B)	Tok/Traj	Rank	Notes
1	Claude Opus 4.6	Anthropic	Native (Claude Code)	2026/1	○	66.3	24.8	110000	1	ClawEval reference point; highest score, highest token cost per trajectory
2	MiMo-V2.5-Pro	Xiaomi MiMo	Open (Claude-Code-compatible scaffold)	2026/4	○	63.8	-	70000	2	40-60% fewer tokens than Opus 4.6 / Gemini 3.1 Pro / GPT-5.4 at comparable capability
3	MiMo-V2.5	Xiaomi MiMo	Open scaffold	2026/4	○	62.3	23.8	-	2.5	Native multimodal base model; matches Claude Sonnet 4.6 on ClawEval Multimodal
4	Claude Sonnet 4.6	Anthropic	Native (Claude Code)	2025/11	○	-	23.8	-	2.5	Workhorse reference for ClawEval Multimodal; tied with MiMo-V2.5 base
5	MiMo-V2-Pro	Xiaomi MiMo	Open scaffold	2025/12	○	61.5	-	-	3	Predecessor to V2.5; Xiaomi reports global leadership on PinchBench (no public number)
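
The token-cost comparison in the notes can be checked against the two rows that list a Tok/Traj value. The sketch below is only arithmetic on the table's own numbers; the helper names are hypothetical, and "ClawEval points per 10k tokens" is an illustrative ratio, not a metric the benchmark defines.

```python
# Values copied from the ClawEval table above.
rows = {
    "Claude Opus 4.6": {"claweval": 66.3, "tokens_per_traj": 110_000},
    "MiMo-V2.5-Pro": {"claweval": 63.8, "tokens_per_traj": 70_000},
}


def token_saving_pct(baseline_tokens, candidate_tokens):
    """Percentage of tokens saved relative to the baseline."""
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def points_per_10k_tokens(score, tokens):
    """Illustrative efficiency ratio: ClawEval points per 10,000 tokens."""
    return score / (tokens / 10_000)


opus = rows["Claude Opus 4.6"]
mimo = rows["MiMo-V2.5-Pro"]

saving = token_saving_pct(opus["tokens_per_traj"], mimo["tokens_per_traj"])
print(f"MiMo-V2.5-Pro uses {saving:.1f}% fewer tokens than Opus 4.6")

for name, row in rows.items():
    ratio = points_per_10k_tokens(row["claweval"], row["tokens_per_traj"])
    print(f"{name}: {ratio:.2f} ClawEval points per 10k tokens")
```

On these two rows the saving against Opus 4.6 comes out at about 36%. The 40-60% range in the notes also covers Gemini 3.1 Pro and GPT-5.4, whose token counts are not listed in this table.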

PostTrainBench: Agentic Post-Training

Data as of April 29, 2026

No.	Model	Developer	Harness	Released	Open	Avg (A)	AIME (B)	ArenaH (B)	BFCL (B)	GPQA (C)	GSM8K (D)	Health (B)	HumanEval (D)	Rank	Notes
1	Opus 4.6	Anthropic	Claude Code	2026/2	○	23.2	5.0	7.8	75.9	25.5	41.0	18.8	24.7	1	Leads PostTrainBench; only agent observed using GRPO (RL): Sonnet 4.6 in 33% of tasks, Opus 4.6 in 3%. All other harnesses use SFT only
2	Gemini 3.1 Pro	Google	OpenCode	2026/4	○	21.6	3.9	7.4	62.8	18.5	45.5	14.5	40.2	2	Open-source OpenCode scaffold; strongest HumanEval among agents
3	GPT-5.2	OpenAI	Codex CLI	2026/3	○	21.4	0.8	6.6	52.5	23.7	55.9	15.8	30.2	3	Top GSM8K among agents; near-zero AIME suggests math-RL pipeline weakness

Coding Agents / Dev Tools

Data as of May 2, 2026

No.	Tool	Developer	Released	Type	Models	Pricing	Open	SWE-V (B)	Rank	Notes
1	Devin 3	Cognition	2025/11	Autonomous agent	Proprietary (Claude + custom)	$500/mo	○	82.1	1	First autonomous SWE; web UI + async tasks
2	Pi (pi-mono)	Mario Zechner (badlogic)	2025/12	CLI	Multi-model	Free	●	68.0	2	Minimal 4-tool coding agent (read, write, edit, bash); self-modifying; powers OpenClaw core; 'Slow the F*** Down' philosophy at AI Engineer Europe
3	OpenHands	All Hands AI	2024/3	Autonomous agent	Multi-model	Free (self-host)	●	65.8	3	Formerly OpenDevin; most-starred open agent
4	Aider	Aider	2023/5	CLI	Multi (Claude, GPT, DeepSeek)	Free	●	64.6	4	Git-native pair programmer; Python
5	Amp	Sourcegraph	2025/5	Agent (web + IDE)	Claude + code search	$20/mo	○	-	5	Sourcegraph Cody successor; code graph aware
6	Cursor	Anysphere	2023/3	VSCode fork	Multi (Claude, GPT, Gemini)	$20/mo Pro	○	-	6	Most popular AI editor; $9.5B valuation
7	Windsurf	OpenAI (acq Codeium)	2024/11	VSCode fork	Multi + OpenAI priority	$15/mo Pro	○	-	7	Acquired by OpenAI (Q2 2025); Cascade agent
8	Cline	Cline	2024/7	VSCode extension	Multi-model	API pass-through	●	-	8	Formerly Claude Dev; 1M+ installs
9	Continue	Continue	2023/9	VSCode / JetBrains ext	Multi-model	Free (self-host)	●	-	9	Open-source Copilot alternative; config as code
10	Zed AI	Zed Industries	2024/8	Editor	Multi-model	Free / $20 Pro	●	-	10	Rust-native editor; collaborative
11	Claude Code	Anthropic	2025/2	CLI	Claude Sonnet 4.6 / Opus 4.6	API pass-through	○	-	11	Official Anthropic CLI; sub-agents, skills, hooks
12	Goose	Block	2025/1	CLI + Desktop	Multi-model	Free	●	-	12	Block (Square) open-source; MCP-native
13	Codex CLI	OpenAI	2025/4	CLI	GPT-5.x Codex	API pass-through	○	-	13	Official OpenAI CLI; image + diff review
14	Qwen Code	Alibaba	2025/7	CLI	Qwen 3 Coder 480B	Free	●	-	14	Ships with Qwen3-Coder release
15	Tabby	TabbyML	2023/9	Self-host Copilot	Open models (StarCoder, etc.)	Free (self-host)	●	-	15	Air-gapped GitHub Copilot alternative
16	OpenCode	Anomaly / SST	2025/4	CLI + desktop + IDE	Multi-model	Free / API pass-through	●	-	16	Open-source terminal-native coding agent; LSP-aware; provider-agnostic
17	Kimi Claw	Moonshot	2026/2	Cloud agent (browser)	Kimi K2.6	Hosted; claimed ~88% cheaper than Claude Code	○	-	17	Native OpenClaw on kimi.com; 5,000 community skills, 40GB cloud storage, Claw Groups (up to 300 sub-agents × 4,000 steps), 24/7 scheduled tasks. Anthropic-compatible API at api.moonshot.ai/anthropic so vanilla Claude Code works with K2.6
18	OpenClaw	Peter Steinberger / OpenClaw Foundation	2025/11	Agent runtime (messaging-first)	Multi-model	Free (MIT)	●	-	18	Originally Clawdbot (2025-11-14); renamed Moltbot, then OpenClaw (~2026-01-30) after Anthropic trademark complaint. Local-first, multi-LLM, per-skill permission declarations. Steinberger joined OpenAI 2026-02; project moved to non-profit foundation. Progenitor of the 'Claw' ecosystem
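
The Kimi Claw note says an unmodified Claude Code can talk to K2.6 through an Anthropic-compatible endpoint. One plausible way to wire that up is an environment override, sketched below. The host and path come from the note; the `https://` scheme, the variable names `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN`, the placeholder key, and the helper `claude_code_env` are assumptions that should be checked against the Claude Code and Moonshot documentation.

```python
import os

# Host and path from the Kimi Claw note; the https:// scheme is assumed.
MOONSHOT_ANTHROPIC_BASE_URL = "https://api.moonshot.ai/anthropic"


def claude_code_env(api_key, base_env=None):
    """Return a copy of the environment with the endpoint override applied.

    The two variable names are assumptions about what Claude Code reads;
    verify them before relying on this.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = MOONSHOT_ANTHROPIC_BASE_URL
    env["ANTHROPIC_AUTH_TOKEN"] = api_key
    return env


# Demonstration with an empty base environment and a placeholder key.
env = claude_code_env("sk-placeholder", base_env={})
print(env["ANTHROPIC_BASE_URL"])

# A real launch would pass the full environment to the CLI, for example:
#   subprocess.run(["claude"], env=claude_code_env(real_key))
```

Building the environment as a copy keeps the override scoped to the launched process instead of changing the calling shell.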

Agent Frameworks / SDKs

Data as of April 25, 2026

No.	Framework	Maintainer	Released	Language	Category	License		Open	Stars	Rank	Notes
1	LangChain	LangChain	2022/10	Py / TS	RAG + Agent	MIT	✓	●	102.0k	1	Largest ecosystem; LangGraph for stateful agents
2	Fabric	Daniel Miessler	2024/1	Go	Prompt patterns / augmentation	MIT	–	●	41.6k	2	Modular crowdsourced prompt-pattern library; CLI-first; works with any LLM via API or local; shines for repeatable text tasks (summarize, extract, label)
3	LlamaIndex	LlamaIndex	2022/11	Py / TS	RAG / Data	MIT	✓	●	41.0k	3	Data-first; strong loaders and indices
4	AutoGen	Microsoft	2023/9	Py / .NET	Multi-agent	MIT	✓	●	38.0k	4	Conversational multi-agent; v0.4 rewrite
5	CrewAI	CrewAI Inc.	2023/11	Python	Multi-agent	MIT	✓	●	28.0k	5	Role-based crews; standalone (no LangChain dep)
6	Semantic Kernel	Microsoft	2023/3	C# / Py / Java	Agent	MIT	✓	●	23.5k	6	Microsoft-native; enterprise .NET focus
7	Agno	Agno (ex Phidata)	2023/8	Python	Multi-agent	MPL-2.0	✓	●	21.5k	7	Fast multi-modal agents; teams and workflows
8	DSPy	Stanford NLP	2023/1	Python	Programming	MIT	–	●	20.0k	8	Compile prompts via optimizers; research-led
9	OpenAI Swarm	OpenAI	2024/10	Python	Multi-agent	MIT	–	●	19.0k	9	Deprecated educational reference; superseded by OpenAI Agents SDK
10	Haystack	deepset	2020/11	Python	RAG / Pipelines	Apache 2.0	✓	●	18.0k	10	Production RAG pipelines; v2 rewrite
11	Colleague Skill	Tianyi Zhou (Shanghai AI Lab)	2026/4	Python	Agent distillation	MIT	–	●	17.3k	11	Chinese 'Digital Life' agent framework: imports a coworker's Lark/DingTalk chat history + files and distills their skills and style into an LLM agent. Viral in China; centerpiece of MIT Tech Review's worker-replacement story (Apr 2026)
12	Vercel AI SDK	Vercel	2023/6	TypeScript	UI / Streaming	Apache 2.0	✓	●	15.5k	12	React/Svelte/Vue streaming hooks; tool calls
13	Letta	Letta (ex MemGPT)	2023/10	Python	Memory / Agent	Apache 2.0	✓	●	14.0k	13	Stateful agents with persistent memory
14	LangGraph	LangChain	2024/1	Py / TS	Graph Agent	MIT	✓	●	12.5k	14	Stateful DAG agent runtime; HITL support
15	LangGraph	LangChain	2024/1	Py / TS	Stateful Agent Runtime	MIT	✓	●	12.0k	15	Durable execution, streaming, human-in-the-loop, persistence; deploy via langgraph-cli to LangSmith Deployment
16	Mastra	Mastra AI	2024/10	TypeScript	Agent	Apache 2.0	✓	●	11.0k	16	TS-first; workflows, evals, voice, RAG
17	Pydantic AI	Pydantic	2024/12	Python	Typed Agent	MIT	✓	●	9.5k	17	Type-safe; from Pydantic team
18	OpenAI Agents SDK	OpenAI	2025/3	Py / TS	Agent Loop	MIT	✓	●	8.5k	18	Official OpenAI; handoffs + guardrails
19	Google ADK	Google	2025/4	Py / Java	Agent	Apache 2.0	✓	●	6.5k	19	Agent Development Kit; A2A protocol
20	Magentic-One	Microsoft Research	2024/11	Python	Multi-agent	MIT	–	●	5.8k	20	Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents
21	AG2	AG2 Community	2024/11	Python	Multi-agent	Apache 2.0	✓	●	4.2k	21	Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite
22	Claude Agent SDK	Anthropic	2025/9	TS / Py	Agent	MIT	✓	●	3.2k	22	Sub-agents, skills, hooks; Claude Code core

Multi-Agent Swarms

Data as of April 25, 2026

No.	Framework	Maintainer	Released	Language	License	Open	Stars	Frameworks rank	Notes
1	AutoGen	Microsoft	2023/9	Py / .NET	MIT	●	38.0k	4	Conversational multi-agent; v0.4 rewrite
2	CrewAI	CrewAI Inc.	2023/11	Python	MIT	●	28.0k	5	Role-based crews; standalone (no LangChain dep)
3	Agno	Agno (ex Phidata)	2023/8	Python	MPL-2.0	●	21.5k	7	Fast multi-modal agents; teams and workflows
4	OpenAI Swarm	OpenAI	2024/10	Python	MIT	●	19.0k	9	Deprecated educational reference; superseded by OpenAI Agents SDK
5	Letta	Letta (ex MemGPT)	2023/10	Python	Apache 2.0	●	14.0k	13	Stateful agents with persistent memory
6	LangGraph	LangChain	2024/1	Py / TS	MIT	●	12.5k	14	Stateful DAG agent runtime; HITL support
7	OpenAI Agents SDK	OpenAI	2025/3	Py / TS	MIT	●	8.5k	18	Official OpenAI; handoffs + guardrails
8	Google ADK	Google	2025/4	Py / Java	Apache 2.0	●	6.5k	19	Agent Development Kit; A2A protocol
9	Magentic-One	Microsoft Research	2024/11	Python	MIT	●	5.8k	20	Microsoft Research generalist multi-agent system; orchestrator + 4 specialist agents
10	AG2	AG2 Community	2024/11	Python	Apache 2.0	●	4.2k	21	Community fork of AutoGen v0.2 (formerly pyautogen); active after Microsoft v0.4 rewrite
11	Claude Agent SDK	Anthropic	2025/9	TS / Py	MIT	●	3.2k	22	Sub-agents, skills, hooks; Claude Code core

Safety & Alignment Benchmarks

Data as of May 11, 2026

No.	Model	Developer	Released	Open	HarmBench (B)	AdvBench (B)	WildGuard (B)	Rank	Notes
1	Claude Opus 4.6	Anthropic	2026/2	○	97.8	99.2	94.3	1	Constitutional AI; strongest refusal
2	Claude Sonnet 4.6	Anthropic	2026/2	○	97.1	98.8	93.5	2	RLHF + CAI
3	GPT-5.4	OpenAI	2026/3	○	95.4	97.3	91.2	3	RLHF; deliberative alignment
4	Gemini 3.1 Pro	Google	2026/2	○	94.2	96.8	90.8	4	Gemini safety tuning
5	Llama Guard 3	Meta	2024/7	●	89.5	-	-	4	Open safety classifier; 8B
6	Llama 4 Scout	Meta	2025/4	●	84.2	88.7	-	4.3	RLHF'd; open
7	Kimi K2.5	Moonshot	2026/1	○	81.3	85.6	-	5	China-specific content policy
8	Mistral Large 3	Mistral	2025/11	○	79.5	84.3	-	5.7	Explicit 'minimal guardrails' positioning
9	Qwen 3.5 397B	Alibaba	2026/2	●	76.8	82.1	-	6.3	Open; Chinese content alignment
10	DeepSeek V3.2	DeepSeek	2025/9	●	72.4	78.9	-	7	Open model; weaker alignment

MCP Servers

Data as of April 25, 2026

No.	Server	Maintainer	Released	Category	Transport	Language	License	Open	Notes
1	Filesystem	Anthropic (official)	2024/11	Storage	stdio	TypeScript	MIT	●	Reference server: local file read/write/edit
2	GitHub	Anthropic (official)	2024/11	Dev tools	stdio / HTTP	TypeScript	MIT	●	Repos, issues, PRs, search
3	Brave Search	Anthropic (official)	2024/11	Search	stdio	TypeScript	MIT	●	Web search via Brave API
4	Puppeteer	Anthropic (official)	2024/11	Browser	stdio	TypeScript	MIT	●	Reference browser server
5	Slack	Anthropic (official)	2024/11	Comms	stdio	TypeScript	MIT	●	Read/post messages, channel management
6	Postgres	Anthropic (official)	2024/11	Database	stdio	TypeScript	MIT	●	Read-only SQL over Postgres
7	Fetch	Anthropic (official)	2024/11	Web	stdio	Python	MIT	●	Convert web pages to LLM-friendly markdown
8	Memory	Anthropic (official)	2024/11	Memory	stdio	TypeScript	MIT	●	Knowledge-graph persistent memory
9	Time	Anthropic (official)	2024/11	Utility	stdio	Python	MIT	●	Timezone + conversion utilities
10	Sequential Thinking	Anthropic (official)	2024/12	Reasoning	stdio	TypeScript	MIT	●	Reflective step-by-step thinking tool
11	Playwright	Microsoft	2024/12	Browser	stdio	TypeScript	Apache 2.0	●	Full browser automation via Playwright
12	Firecrawl	Mendable	2025/1	Scraping	HTTP / stdio	TypeScript	MIT	●	Web scraping + crawling pipeline
13	Context7	Upstash	2025/1	Docs	HTTP / stdio	TypeScript	MIT	●	Fetches current library docs on demand
14	Apify	Apify	2025/2	Scraping	HTTP	TypeScript	Apache 2.0	●	5000+ scrapers accessible via MCP
15	Obsidian	Community	2025/3	Notes	stdio	TypeScript	MIT	●	Read/write Obsidian vault markdown
16	Supabase	Supabase	2025/3	Database	stdio	TypeScript	MIT	●	Postgres + Auth + Storage + Realtime
17	Exa	Exa	2025/4	Search	HTTP	TypeScript	MIT	●	Semantic web search with summaries
18	Sentry	Sentry	2025/4	Monitoring	HTTP	TypeScript	MIT	●	Query errors + releases; AI triage hook
19	Stripe	Stripe	2025/5	Payments	HTTP	TypeScript	MIT	●	Query and manipulate Stripe API
20	Linear	Linear	2025/6	PM	HTTP	TypeScript	MIT	●	Issues, projects, cycles; official Linear MCP
21	Figma Dev Mode	Figma	2025/6	Design	HTTP	TypeScript	MIT	●	Design tokens + code gen from Figma
22	Notion	Notion	2025/9	Docs	HTTP	TypeScript	MIT	●	Read/edit Notion pages and databases
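
Servers listed with the stdio transport are started by the client as a local subprocess, so registering one usually means telling the client which command to run. The sketch below builds such an entry for the Filesystem reference server. It assumes the `mcpServers` layout used by desktop clients such as Claude Desktop and the npm package name `@modelcontextprotocol/server-filesystem`; the directory path and the helper `stdio_server_entry` are placeholders for illustration.

```python
import json


def stdio_server_entry(command, args):
    """Describe a stdio MCP server as the command a client should spawn."""
    return {"command": command, "args": list(args)}


# Assumed client config layout: a top-level "mcpServers" mapping of
# server name -> launch command. The allowed directory is a placeholder.
config = {
    "mcpServers": {
        "filesystem": stdio_server_entry(
            "npx",
            ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"],
        ),
    }
}

print(json.dumps(config, indent=2))
```

HTTP-transport servers in the table are reached by URL instead, so they would not use a launch command like this one.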

LLM Observability / Tracing

Data as of April 26, 2026

No.	Tool	Vendor	Released	Focus	License	Pricing	Open	Notes
1	Humanloop	Humanloop	2020/9	Prompt mgmt / Eval / Deploy	Proprietary	Starter free / Growth $399/mo+	○	Prompt versioning + collaborative UI focus
2	Galileo	Galileo	2022/10	Eval / Hallucination / Guardrails	Proprietary	Custom	○	Enterprise-tier; guardrail + hallucination scoring
3	Helicone	Helicone	2023/2	Tracing / Cost / Cache	Apache 2.0	Free 10k req / $20/mo+	●	Proxy-based; drop-in for OpenAI/Anthropic
4	Arize Phoenix	Arize	2023/3	Tracing / Eval / Observability	ELv2	Free self-host / enterprise	●	OpenInference tracing standard; good RAG eval
5	PromptFoo	PromptFoo	2023/4	Eval / Red-team	MIT	Free	●	CLI-first eval; CI-friendly; red-team suites built in
6	Langfuse	Langfuse	2023/6	Tracing / Eval / Prompt mgmt	MIT (core) / Commercial	Free self-host / $29/mo+ cloud	●	Top open-source alternative; OpenTelemetry-first
7	LangSmith	LangChain	2023/7	Tracing / Eval / Prompt mgmt	Proprietary	Free 5k traces / $39/mo+	○	LangChain-native but framework-agnostic; strong prompt playground
8	Traceloop OpenLLMetry	Traceloop	2023/9	Tracing (standard)	Apache 2.0	Free (library)	●	OpenTelemetry standard for LLM traces; backend-agnostic
9	Braintrust	Braintrust	2023/12	Eval / Experiments / Tracing	Proprietary	Free / $249/mo+	○	Eval-first; strong experiment tracking + regression detection
10	Literal AI	Literal AI	2024/1	Tracing / Eval / Dataset	Proprietary	Free / $49/mo+	○	Strong threads-and-datasets model; ex-Chainlit team
11	Weights & Biases Weave	W&B	2024/1	Tracing / Eval	Apache 2.0 (SDK)	Free / Enterprise	●	W&B's LLM observability; integrates with MLOps runs
12	Logfire	Pydantic	2024/6	Tracing (OTel)	MIT (SDK)	Free 10M spans / paid	●	Pydantic team's OTel-native observability; great Pydantic AI integration
13	LangSmith Fleet	LangChain	2026/3	Agent Builder / Mgmt / Governance	Proprietary	Enterprise (LangSmith add-on)	○	Renamed from LangSmith Agent Builder; no-code agent builder + RBAC/ABAC, Skills, Sandboxes (private preview), 7.5k+ Arcade.dev tools

Prompt Guides / Prompt Engineering

Data as of May 1, 2026

No.	Guide	Publisher	Date	Type	Topics		Audience	Notes
1	OpenAI Prompting	OpenAI	2026/1	Prompt management	Prompt objects, versioning, variables, prompt caching, eval-driven iteration	Yes	OpenAI API builders	Covers reusable/versioned prompts and dashboard-to-API workflow
2	Gemini Prompt Design Strategies	Google	2026/1	Prompt design	Clear instructions, prompt components, few-shot examples, task decomposition, chaining	No	Gemini API builders	Includes topic-specific guides for media, image, and video prompting
3	Claude Prompt Engineering	Anthropic	2026/2	Prompt engineering	Clear instructions, XML structure, examples, roles, long context, tool use, agentic workflows	Yes	Claude API builders	Starts with success criteria and empirical evals before prompt tuning
4	Claude Prompting Best Practices	Anthropic	2026/2	Model-specific prompting	Clarity, examples, XML tags, context placement, tool-use instructions, chaining	Yes	Claude Opus/Sonnet/Haiku users	Living reference for current Claude models and prompt patterns
