tl;dr: LLMs suck at writing code to use APIs.
We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings: - Best general LLM: 68% success rate. That's 1 in 3 API calls failing. Would you ship that? - Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it. - Only 6 out of 21 APIs worked 100% of the time, every other API had failures. - Anthropic’s models are significantly better at building API integrations than other providers.
What makes LLMs fail hard: - Lack of context (LLMs are just not great at understanding what API endpoints exist and what they do, even if you give them documentation which we did) - Multi-step workflows (chaining API calls) - Complex API design: APIs like Square, PostHog, Asana (Forcing project selection among other things trips llms over)
We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...
Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/
If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.
Next up: benchmarking MCP.