adinagoerres commented on Show HN: LLMs suck at writing integration code… for now   github.com/superglue-ai/s... · Posted by u/sfaist
hoerzu · a month ago
Love the benchmarks. Is it better to use a single LLM for performance, or would you always advise adding a self-reflection step?
adinagoerres · a month ago
self-reflection is very important for both humans and LLMs, indeed
adinagoerres commented on Show HN: LLMs suck at writing integration code… for now   github.com/superglue-ai/s... · Posted by u/sfaist
adinagoerres · a month ago
Hey HN, I'm Adina, Stefan's co-founder at superglue. When we started working on LLM-powered integrations about a year ago, the models were barely good enough to handle simple mappings. We started benchmarking our performance as an internal evals project and thought it would be fun to open source it, to create more transparency around LLM performance. Our goal here is to understand how we can make agents production-ready and improve reliability across the board.
adinagoerres commented on Show HN: We let agents use APIs to find out if they can actually...do things?   superglue.ai/api-ranking/... · Posted by u/adinagoerres
adinagoerres · a month ago
Hi HN! Adina here from superglue. Today I’d like to share a new benchmark we’ve just open sourced: an Agent-API Benchmark, in which we test how well LLMs handle APIs.

tl;dr: LLMs suck at writing code to use APIs.

We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:
- Best general LLM: 68% success rate. That's roughly 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.
- Anthropic’s models are significantly better at building API integrations than other providers.
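
For readers who want to reproduce this kind of summary on their own runs, here is a minimal sketch of how a per-model success rate and the list of APIs that never failed could be computed. The `(model, api, passed)` record shape, the function names, and the toy records are all assumptions made for illustration; they are not the benchmark's actual data format or results, and treating "worked 100% of the time" as "no failure for any model" is likewise an assumption.

```python
from collections import defaultdict


def success_rates(results):
    """Map each model to the fraction of its tests that passed.

    `results` is an iterable of (model, api, passed) tuples (hypothetical shape).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, _api, passed in results:
        totals[model] += 1
        passes[model] += int(passed)
    return {model: passes[model] / totals[model] for model in totals}


def fully_reliable_apis(results):
    """Return, sorted, the APIs for which every recorded test passed."""
    results = list(results)
    seen = {api for _model, api, _passed in results}
    failed = {api for _model, api, passed in results if not passed}
    return sorted(seen - failed)


# Toy, made-up records purely for illustration (not benchmark data).
toy = [
    ("model-a", "api-x", True),
    ("model-a", "api-y", False),
    ("model-b", "api-x", True),
    ("model-b", "api-y", True),
]

print(success_rates(toy))
print(fully_reliable_apis(toy))
```

Keeping one record per test run, rather than pre-aggregated counts, makes it easy to slice the same data by model, by API, or by both.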

What makes LLMs fail hard:
- Lack of context (LLMs are just not great at understanding what API endpoints exist and what they do, even if you give them documentation, which we did)
- Multi-step workflows (chaining API calls)
- Complex API design: APIs like Square, PostHog, and Asana (forcing project selection, among other things, trips LLMs up)

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/

If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.

adinagoerres commented on Show HN: Superglue – open source API connector that writes its own code   github.com/superglue-ai/s... · Posted by u/adinagoerres
jmvldz · 6 months ago
Very cool! Love the demo.
adinagoerres · 6 months ago
thank you!
adinagoerres commented on Show HN: Superglue – open source API connector that writes its own code   github.com/superglue-ai/s... · Posted by u/adinagoerres
jmvldz · 6 months ago
have you looked at Browser Use? https://browser-use.com/
adinagoerres · 6 months ago
they are in our YC batch! great product
adinagoerres commented on Show HN: Superglue – open source API connector that writes its own code   github.com/superglue-ai/s... · Posted by u/adinagoerres
a-dub · 6 months ago
this is VERY cool!
adinagoerres · 6 months ago
thank you!
adinagoerres commented on Show HN: Superglue – open source API connector that writes its own code   github.com/superglue-ai/s... · Posted by u/adinagoerres
npollock · 6 months ago
something like this that runs as a browser agent, allowing me to extract structured data from websites (whitelisted) using natural language queries
adinagoerres · 6 months ago
huh interesting. we're exploring extraction from html
adinagoerres commented on Show HN: Superglue – open source API connector that writes its own code   github.com/superglue-ai/s... · Posted by u/adinagoerres
dboreham · 6 months ago
Doesn't someone own a trademark in that general area?
adinagoerres · 6 months ago
not that we're aware of!

u/adinagoerres

Karma: 50 · Cake day: March 19, 2023