dimal · 2 years ago
The demo shows a very clearly written bug report about a matrix operation that’s producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of “I clicked on X and Y happened”. Then, if you’re lucky, they’ll add “and I expected Z”. Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.

The difficulty in fixing a bug is in figuring out what’s causing the bug. If you know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?

Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.

drcode · 2 years ago
> Most bug reports you get in the wild are more along the lines of

Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs" don't have nicely written bug reports.

medellin · 2 years ago
In my 15 years I would say less than 1% of bug reports are like this. If you know the bug to this level, most people would just fix it themselves.
aiauthoritydev2 · 2 years ago
12% is a very very large number for that kind of problem. I doubt even 0.1% of bug reports in the wild are that well written.
skywhopper · 2 years ago
It fixes 12% of their benchmark suite, not 12% of bug reports.
dimal · 2 years ago
I suppose I should nail down my point. No one would ever write a bug report like this. A bug generally has an unknown cause. Once you found the cause of the bug, you’d fix it. Nowadays, you could just cut and paste the problem into ChatGPT and get the answer right then. So why would anyone ever log this bug? All this demo proves is that they automated a process that didn’t need automation.
stingraycharles · 2 years ago
It appears that they’re using the PRs from the top 5000 most popular PyPI packages for their benchmark: https://github.com/princeton-nlp/SWE-bench/tree/main/swebenc...
jcarrano · 2 years ago
Maybe it would be better if the agent helped people submit better reports instead of trying to fix them. E.g. it could ask them to add missing information, test different combinations of inputs, etc. It could also learn which maintainer to ping according to the type of issue.
bee_rider · 2 years ago
Maybe it just needs another, independent tool. One that detects poorly written bug reports and rejects them.

A cool thing about LLMs is that they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.
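Something like that is cheap to prototype. Here is a minimal sketch of the triage loop, assuming the OpenAI Python SDK; the required-fields list, the model choice, and the console I/O standing in for the issue tracker are all placeholders:

    # Minimal sketch of a bug-report triage loop. Assumes the OpenAI v1 Python SDK;
    # the required-fields list and the console I/O (standing in for the issue
    # tracker) are placeholders.
    from openai import OpenAI

    REQUIRED = "steps to reproduce, expected behaviour, actual behaviour, version info"
    client = OpenAI()

    def triage(report: str, max_rounds: int = 3) -> str | None:
        """Ask follow-up questions until the report is usable, or give up."""
        messages = [
            {"role": "system",
             "content": f"You review bug reports. If a report is missing any of: {REQUIRED}, "
                        "reply with ONE clarifying question. If it is complete, reply READY."},
            {"role": "user", "content": report},
        ]
        for _ in range(max_rounds):
            reply = client.chat.completions.create(model="gpt-4", messages=messages)
            answer = reply.choices[0].message.content
            if answer.strip().startswith("READY"):
                return report  # good enough to file
            extra = input(f"Bot asks: {answer}\nReporter: ")  # in practice: a comment on the issue
            report += "\n" + extra
            messages += [{"role": "assistant", "content": answer},
                         {"role": "user", "content": extra}]
        return None  # still unusable after max_rounds; close or escalate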

bfdm · 2 years ago
While it might tickle metrics the right way, frustrating a user into giving up because your bot was not satisfied is not solving their problem.
codeonline · 2 years ago
I agree that bugs aren't as well specified as the example. But a specification for a new feature certainly can be.

I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.

chinchilla2020 · 2 years ago
Agreed. I have never encountered a simple math bug in the wild.

To a non-programmer, putting in tests for myfunc(x) {return x + 2;} sounds useful but in reality computers do not tend to have any issues performing basic algebra.
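For what it's worth, here is roughly what such a test looks like (a Python/pytest sketch of the JS one-liner above); the happy-path assertions will essentially never fail, which is the point:

    # A test like this exercises arithmetic the interpreter already guarantees;
    # it documents the function but will almost never catch a real-world bug.
    import pytest

    def my_func(x):
        return x + 2

    def test_my_func_adds_two():
        assert my_func(3) == 5
        assert my_func(-2) == 0

    # Reported bugs tend to live at the edges the happy path never touches:
    def test_my_func_rejects_strings():
        with pytest.raises(TypeError):
            my_func("3")  # "3" + 2 raises TypeError in Python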

megablast · 2 years ago
Exactly. This is not perfect and doesn't fix every report so it is useless.
skywhopper · 2 years ago
On the contrary, it’s worse than useless. If it could fix 12% of bugs (it can’t — it only fixes 12% of their benchmark suite), you’d still have to figure out which 12% of the responses it gave were good. So, 88% of the time you’d have wasted time confirming a “fix” that doesn’t work. But it’s worse than that. Because even on the fixes it got right, you’d still have to fully vet it, because this tool doesn’t know when it can’t solve something, or ask for clarification. It just gives a wrong answer.

So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.
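The back-of-the-envelope version of that argument, with invented review times purely to show the shape of the trade-off (none of these numbers are measured):

    # Back-of-the-envelope check of the argument above. The hours are invented
    # purely to illustrate the trade-off, not measured anywhere.
    fix_rate    = 0.12  # fraction of issues the tool patches correctly
    review_good = 0.5   # hours to verify a correct machine-generated patch
    review_bad  = 1.0   # hours wasted confirming a patch that doesn't work
    fix_by_hand = 2.0   # hours to just fix the bug yourself

    cost_with_tool = fix_rate * review_good + (1 - fix_rate) * (review_bad + fix_by_hand)
    print(f"with tool:    {cost_with_tool:.2f} h/bug")  # ~2.70 h/bug
    print(f"without tool: {fix_by_hand:.2f} h/bug")     # 2.00 h/bug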

dimal · 2 years ago
That’s not what I said and you know it. I’m not saying LLMs are useless. I’m not even saying this tool is useless. I’m saying I’m not impressed with this tool, at least as represented in the demo.
gorjusborg · 2 years ago
If the bug report needs to be of a certain quality to work, they've just invented issue-oriented programming.
forty · 2 years ago
The trick is that people would use LLM to write very long and detailed bug reports :p
anotherpaulg · 2 years ago
Very cool project!

I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.

It's great that you succeed on 12% of SWE-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?

Also, I think SWE-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen SWE-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”. Mainly because the tasks were underspecified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.

a_wild_dandan · 2 years ago
Personally, I'd just use one of my local MacBook models (e.g. Mixtral 8x7b) and forget about any wasted branches & cents. My debugging time costs many orders of magnitude more than SWE-agent, so even a 5% backlog savings would be spectacular!
swatcoder · 2 years ago
> My debugging time costs many orders of magnitude more than SWE-agent

Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.

(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)

ein0p · 2 years ago
I’ve tried this with another similar system. FOSS LLMs including Mixtral are currently too weak to handle something like this. For me they run out of steam after only a few turns and start going in circles unproductively.
Aperocky · 2 years ago
That's assuming that the other 95% stays the same with this new agent (vs creating more work for you to now also have to parse what the model is saying).
int_19h · 2 years ago
Given that they got 12% with GPT-4, which is vastly better than any open model, I doubt this would be particularly productive. And powering compute at full load is going to add up.
senko · 2 years ago
If you don't mind me asking, which agentic tools/frameworks have you tried for code fixing/generation, with which LLMs?
matthewaveryusa · 2 years ago
Very neat. Uses the LangChain method; here are some of the prompts:

https://github.com/princeton-nlp/SWE-agent/blob/main/config/...

toddmorey · 2 years ago
I’m always fascinated to read the system prompts & I always wonder what sort of gains can be made optimizing them further.

Once I’m back on desktop I want to look at the gut history of this file.

clement_b · 2 years ago
I have a git feeling this comment was written on mobile.
hazn · 2 years ago
DSPy is the best tool for optimizing prompts [0]: https://github.com/stanfordnlp/dspy

Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, to optimize your LLM.
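Roughly, you declare the task as a signature, grade it with a metric, and let an optimizer compile the prompts. A sketch of the quickstart flow; the DSPy API has moved around between versions, so treat this as an outline rather than copy-paste code:

    # DSPy outline: declare the task, give it a few graded examples, and let the
    # optimizer bootstrap better prompts/demonstrations. API details vary by version.
    import dspy
    from dspy.teleprompt import BootstrapFewShot

    dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

    qa = dspy.ChainOfThought("question -> answer")  # declarative signature, no hand-written prompt

    trainset = [
        dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
        dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    ]

    def exact_match(example, pred, trace=None):
        return example.answer.lower() in pred.answer.lower()

    # The optimizer uses an LM to generate and select demonstrations that improve the prompt.
    optimized_qa = BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)
    print(optimized_qa(question="What is 3 + 5?").answer)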

paradite · 2 years ago
For anyone who didn't bother looking deeper, the SWE-bench benchmark contains only Python projects, so it is not representative of all programming languages and frameworks.

I'm working on a more general SWE task eval framework in JS for arbitrary languages and frameworks now (to start with: JS/TS, SQL and Python), for my own prompt engineering product.

Hit me up if you are interested.

barfbagginus · 2 years ago
Assuming the dataset is proprietary; if not, please share the repo.
lispisok · 2 years ago
Their demo is so similar to the Devin one that I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked SymPy. Also, I rarely put weight into demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what 3rd parties have to say after trying these things.
lewhoo · 2 years ago
Maybe that's the point of this research. Hey look, we reproduced the way to game the stats a bit. I really can't tell anymore.
JonChesterfield · 2 years ago
If AI generated pull requests become a popular thing we'll see the end of public bug trackers.

(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)

CGamesPlay · 2 years ago
Not a chance. If AI-generated pull requests become popular, GitHub will automatically offer them in response to opened issues. Case in point: they already are popular for dependency upgrades.
JonChesterfield · 2 years ago
And thus issues will no longer be opened.
itsgrimetime · 2 years ago
It’ll likely keep getting better; if it gets to 30-40%, I’d say that’s a decent trade-off. Also, could you boost your chances by having the AI do a second pass and double-check the work? I’d be curious what the success rate of an LLM at “determining whether a bug fix is valid” is.
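That second pass is cheap to prototype. A rough sketch of an LLM patch-review step, assuming the OpenAI Python SDK, with a made-up prompt and accept/reject rule (running the repo's test suite would of course be a stronger check):

    # Rough sketch of a second-pass "is this patch plausible?" check.
    # Assumes the OpenAI v1 Python SDK; the prompt and accept/reject rule are made up.
    from openai import OpenAI

    client = OpenAI()

    def review_patch(issue: str, diff: str) -> bool:
        """Ask a second model whether the proposed diff plausibly fixes the issue."""
        prompt = (
            "You are reviewing a machine-generated bug fix.\n"
            f"Issue:\n{issue}\n\nProposed diff:\n{diff}\n\n"
            "Answer ACCEPT if the diff plausibly fixes the issue without obvious regressions, "
            "otherwise answer REJECT with a one-line reason."
        )
        reply = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return reply.choices[0].message.content.strip().upper().startswith("ACCEPT")
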
bwestergard · 2 years ago
Friendly suggestion to the authors: success rates aren't meaningful to all but a handful of researchers. They should add a few examples of tests SWE-agent passed and did not pass to the README.
nyrikki · 2 years ago
Yes please, the code quality on Devin was incredibly poor in all examples I traced down.

At least from a maintainability perspective.

I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.

NegativeLatency · 2 years ago
Unless you weren't actually that successful but need to publish a "successful" result

rwmj · 2 years ago
Do we know how much extra work it created for the real people who had to review the proposed fixes?
r0ze-at-hn · 2 years ago
Ah, well let me tell you about my pull request reviewer LLM project.
ActionHank · 2 years ago
Joke's on you; let me tell you about my prompt-to-binary LLM project.

Hello world is 10GB, but even grandma can make hello worlds now.