The demo shows a very clearly written bug report about a matrix operation that’s producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of “I clicked on X and Y happened”, then, if you’re lucky, they’ll say “and I expected Z”. Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.
The difficulty in fixing a bug is in figuring out what’s causing the bug. If you know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?
Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.
> Most bug reports you get in the wild are more along the lines of
Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs" don't have nicely written bug reports.
I suppose I should nail down my point. No one would ever write a bug report like this. A bug generally has an unknown cause; once you’ve found the cause, you’d fix it. Nowadays, you could just cut and paste the problem into ChatGPT and get the answer right then. So why would anyone ever log this bug? All this demo proves is that they automated a process that didn’t need automation.
Maybe it would be better if the agent helped people submit better reports instead of trying to fix the bug. E.g. it could ask them to add missing information, test different combinations of inputs, etc. It could also learn which maintainer to ping according to the type of issue.
Maybe it just needs another, independent tool. One that detects poorly written bug reports and rejects them.
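Even a dumb heuristic version would catch a lot of them. A rough sketch (the required sections below are just my guess at a minimal checklist, not any project's real template):

    # Rough sketch of a "flag poorly written bug reports" check. The required
    # sections are a made-up minimal checklist, not any project's real template.
    REQUIRED_HINTS = {
        "steps to reproduce": ("reproduce", "steps"),
        "expected behavior": ("expected",),
        "actual behavior": ("actual", "instead", "happened"),
        "version": ("version",),
    }

    def missing_sections(report: str) -> list[str]:
        text = report.lower()
        return [name for name, keywords in REQUIRED_HINTS.items()
                if not any(k in text for k in keywords)]

    # missing_sections("I clicked on X and Y happened")
    # -> ['steps to reproduce', 'expected behavior', 'version']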
A cool thing about LLMs is they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.
I agree that bugs aren't as well specified as the example. But a specification for a new feature certainly can be.
I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.
Agreed. I have never encountered a simple math bug in the wild.
To a non-programmer, putting in tests for myfunc(x) {return x + 2;} sounds useful but in reality computers do not tend to have any issues performing basic algebra.
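To make that concrete, the entire test for a function like that is something along these lines, and it will never catch a real bug:

    # Python-ified version of the myfunc example above; the test only re-checks
    # arithmetic the interpreter already does correctly.
    def myfunc(x):
        return x + 2

    def test_myfunc():
        assert myfunc(3) == 5
        assert myfunc(-2) == 0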
On the contrary, it’s worse than useless. If it could fix 12% of bugs (it can’t — it only fixes 12% of their benchmark suite), you’d still have to figure out which 12% of the responses it gave were good. So, 88% of the time you’d have wasted time confirming a “fix” that doesn’t work. But it’s worse than that. Because even on the fixes it got right, you’d still have to fully vet it, because this tool doesn’t know when it can’t solve something, or ask for clarification. It just gives a wrong answer.
So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.
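Rough numbers, with a made-up per-patch review cost:

    # Back-of-the-envelope cost of reviewing every proposed fix when only
    # 12% are correct. The 20-minute review time is a made-up assumption.
    attempts = 100
    correct = 12                 # resolve rate on the benchmark
    minutes_per_review = 20      # hypothetical time to vet one proposed patch
    total = attempts * minutes_per_review
    print(total / correct)       # ~167 minutes of review per usable fix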
That’s not what I said and you know it. I’m not saying LLMs are useless. I’m not even saying this tool is useless. I’m saying I’m not impressed with this tool, at least as represented in the demo.
I've experimented in this direction previously, but found that agentic behavior is often chaotic and leads to long, expensive sessions that go down the wrong rabbit hole and ultimately fail.
It's great that you succeed on 12% of SWE-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?
Also, I think SWE-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?
I randomly sampled a dozen SWE-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”, mainly because the tasks were under-specified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.
Personally, I'd just use one of my local MacBook models (e.g. Mixtral 8x7b) and forget about any wasted branches & cents. My debugging time costs many orders of magnitude more than SWE-agent, so even a 5% backlog savings would be spectacular!
> My debugging time costs many orders of magnitude more than SWE-agent
Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.
(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)
I’ve tried this with another similar system. FOSS LLMs, including Mixtral, are currently too weak to handle something like this. For me they run out of steam after only a few turns and start going in circles unproductively.
That's assuming that the other 95% stays the same with this new agent (vs creating more work for you to now also have to parse what the model is saying).
Given that they got 12% with GPT-4, which is vastly better than any open model, I doubt this would be particularly productive. And powering compute at full load is going to add up.
For anyone who didn't bother looking deeper, the SWE-bench benchmark contains only Python projects, so it is not representative of all programming languages and frameworks.
Their demo is so similar to the Devin one I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked Sympy. Also, I rarely put weight into demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what 3rd parties have to say after trying these things.
Not a chance. If AI-generated pull requests become popular, GitHub will automatically offer them in response to opened issues. Case in point: they already are popular for dependency upgrades.
It’ll likely keep getting better; if it gets to 30-40%, I’d say that’s a decent trade-off. Also, could you boost your chances by having the AI do a 2nd pass and double-check the work? I’d be curious what the success rate of an LLM “determining whether a bug fix is valid” is.
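That second pass is easy enough to wire up, e.g. something like the sketch below using the openai Python client (issue_text and patch_diff are placeholders); the open question is whether its YES/NO is any good:

    # Sketch of an LLM second pass that judges whether a proposed patch fixes
    # the reported issue. issue_text and patch_diff are hypothetical inputs.
    from openai import OpenAI

    client = OpenAI()

    def fix_looks_valid(issue_text: str, patch_diff: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": (
                    "You review proposed bug fixes. Reply YES if the diff "
                    "plausibly resolves the issue, otherwise NO, then explain."
                )},
                {"role": "user",
                 "content": f"Issue:\n{issue_text}\n\nProposed diff:\n{patch_diff}"},
            ],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")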
Friendly suggestion to the authors: success rates aren't meaningful to all but a handful of researchers. They should add a few examples of tests SWE-agent passed and did not pass to the README.
https://github.com/princeton-nlp/SWE-agent/blob/main/config/...
Once I’m back on desktop I want to look at the git history of this file.
Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, to optimize your LLM (roughly the loop sketched below).
I'm working on a more general SWE task eval framework in JS for arbitrary languages and frameworks (for starters: JS/TS, SQL, and Python), for my own prompt engineering product.
Hit me up if you are interested.
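For anyone curious what “an LLM to optimize your prompts” means in practice, the loop is roughly this shape (both helpers are placeholder stand-ins, not my actual implementation):

    # Skeleton of a meta-prompt optimizer loop. The two helpers are placeholders:
    # a real version would call an LLM to rewrite the prompt and run an eval
    # harness over tasks; neither is any particular product's API.
    import random

    def generate_variants(prompt: str, n: int = 3) -> list[str]:
        # placeholder: a real implementation would ask an LLM for rewrites
        return [f"{prompt} (variant {i})" for i in range(n)]

    def score(prompt: str, eval_tasks: list[str]) -> float:
        # placeholder: a real implementation would run the prompt on the tasks
        return random.random()

    def optimize_prompt(seed_prompt: str, eval_tasks: list[str], rounds: int = 3) -> str:
        best, best_score = seed_prompt, score(seed_prompt, eval_tasks)
        for _ in range(rounds):
            for candidate in generate_variants(best):
                s = score(candidate, eval_tasks)
                if s > best_score:
                    best, best_score = candidate, s
        return best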
(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)
At least from a maintainability perspective.
I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.
Hello world is 10GB, but even grandma can make hello worlds now.