How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
I think it's a great way to dive into the agent world
(Given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume a genuine oversight rather than "benchmaxxing"; it's probably an easy thing to miss if you are new to benchmarking.)
https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated Docker images.
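Roughly, an eval run looks something like the sketch below (a sketch only: the flag names and dataset id are written from memory and the predictions path and run id are placeholders, so follow the repo README for the exact current invocation):

    # Minimal sketch of kicking off a SWE-bench Verified evaluation.
    # Assumes the harness is installed from the current repo checkout and
    # exposes the run_evaluation module with these flags (check the README).
    import subprocess

    cmd = [
        "python", "-m", "swebench.harness.run_evaluation",
        "--dataset_name", "princeton-nlp/SWE-bench_Verified",  # dataset id, verify against docs
        "--predictions_path", "predictions.jsonl",             # your agent's generated patches
        "--max_workers", "8",                                   # parallel Docker builds/runs
        "--run_id", "my_eval_run",                               # label for this evaluation run
    ]
    subprocess.run(cmd, check=True)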
Yup, this will absolutely be a big driver of gains in AI for coding in the near future. We actually built a benchmark based on this exact principle: https://algotune.io/
This issue affected a tiny fraction of existing agents in a tiny fraction of their runs, and we've now issued a fix.
This is a natural part of running a benchmark; I'm sure small things like this will keep being discovered, and we'll keep fixing them. It doesn't change the overall picture or trends at all.
We have coding agents implement game-playing agents that compete against each other, so Claude isn't playing against GPT directly; rather, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings about AI for coding.
https://codeclash.ai/
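To give a rough sense of the shape of the setup, here's a toy sketch (heavily simplified; the real protocol, game rules, and interfaces differ): each model writes a program exposing the same decision function, and a referee plays the two programs against each other and tallies results.

    # Toy illustration only, not the actual CodeClash protocol:
    # two agent programs share one interface, a referee runs many hands
    # of a trivial card game and keeps score.
    import random

    def agent_a_decide(card, pot):          # e.g. written by one model
        return "call" if card >= 7 else "fold"

    def agent_b_decide(card, pot):          # e.g. written by another model
        return "call" if card >= 5 else "fold"

    def play_match(decide_a, decide_b, hands=1000, seed=0):
        rng = random.Random(seed)
        score = {"a": 0, "b": 0}
        for _ in range(hands):
            card_a, card_b = rng.randint(1, 13), rng.randint(1, 13)
            act_a, act_b = decide_a(card_a, pot=2), decide_b(card_b, pot=2)
            if act_a == "fold" and act_b != "fold":
                score["b"] += 1                 # a folded, b takes the hand
            elif act_b == "fold" and act_a != "fold":
                score["a"] += 1                 # b folded, a takes the hand
            elif act_a != "fold" and act_b != "fold":
                if card_a > card_b:             # both called: higher card wins
                    score["a"] += 1
                elif card_b > card_a:
                    score["b"] += 1
            # both folded: no score change
        return score

    print(play_match(agent_a_decide, agent_b_decide))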