Readit News
gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
gronky_ · 13 days ago
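Is this OK from your perspective, then?

def make_pass_at_1_agent(agent, n):
    # Hides up to n attempts behind what looks like a single attempt.
    def retry_agent(problem):
        result = None  # returned as-is only if n < 1
        for _ in range(n):
            result = agent(problem)
            if result.success:
                return result  # stop at the first success
        return result  # every attempt failed: return the last one
    return retry_agent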

gronky_ · 13 days ago
Keep in mind that this isn’t about users - the top agents on the leaderboard aren’t running an actual product on the benchmark.

If they are running their production product as is, then of course whatever is built into the product is fine.

gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
terminalshort · 13 days ago
From my perspective as a potential user the number of attempts is the number of times I have to tell it what to do. If you have an agent that makes a single attempt and is 60% accurate vs another that makes 5 attempts and is 80% accurate, why would you care that each individual attempt of the 2nd model is less accurate than the first?
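To put rough numbers on that comparison, here is a minimal sketch. It assumes every attempt succeeds independently with the same probability (a simplification, since attempts on the same problem are likely correlated) and that the agent can tell when an attempt has succeeded. The function names are made up for illustration.

```python
def success_within(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n


def implied_per_attempt(total: float, n: int) -> float:
    """Per-attempt success rate that yields `total` success over n attempts."""
    return 1.0 - (1.0 - total) ** (1.0 / n)


if __name__ == "__main__":
    # An agent that reaches 80% over 5 attempts needs only ~27.5% per attempt.
    print(f"{implied_per_attempt(0.80, 5):.3f}")  # 0.275
    # A 60%-per-attempt agent given the same 5 attempts would reach ~99%.
    print(f"{success_within(0.60, 5):.3f}")  # 0.990
```

Under these assumptions the two headline numbers are not measuring the same thing, which is why the number of attempts matters when comparing scores.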
gronky_ · 13 days ago
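Is this OK from your perspective, then?

def make_pass_at_1_agent(agent, n):
    # Hides up to n attempts behind what looks like a single attempt.
    def retry_agent(problem):
        result = None  # returned as-is only if n < 1
        for _ in range(n):
            result = agent(problem)
            if result.success:
                return result  # stop at the first success
        return result  # every attempt failed: return the last one
    return retry_agent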

gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
eddd-ddde · 13 days ago
I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?
gronky_ · 13 days ago
It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if it’s technically acceptable.

gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
gronky_ · 13 days ago
I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it in 5th place, 4 points below the leader (4 points is a lot) and just over 1 point below Anthropic’s own submission for Claude Sonnet 4, the same model these guys are running.

But the top-rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE-bench*, which kind of defeats the whole purpose of the benchmark.

Take Refact, for example, which is at #2 with 74.4%. They built a framework of about 2k lines of code around their agent specifically for SWE-bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it’s effectively multiple attempts per problem.
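The pattern described above (a main agent, plus a debug agent that feeds hints into a retry) can be sketched roughly like this. This is not Refact's actual code. `Result`, `solve_with_debugger`, and both agent signatures are hypothetical, and `success` stands in for whatever check the harness uses to decide that an attempt failed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    patch: str
    success: bool  # hypothetical: e.g. the agent's own checks passed
    log: str = ""


# Hypothetical signatures: the main agent takes a problem plus optional hints;
# the debug agent takes the problem and a failed result and returns hints.
MainAgent = Callable[[str, Optional[str]], Result]
DebugAgent = Callable[[str, Result], str]


def solve_with_debugger(problem: str, main: MainAgent, debug: DebugAgent,
                        max_attempts: int = 2) -> Result:
    """Run the main agent, retrying with debug-agent hints after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = main(problem, None)  # first attempt, no hints
    for _ in range(max_attempts - 1):
        if result.success:
            break
        hints = debug(problem, result)  # analyze the failure
        result = main(problem, hints)   # another attempt at the same problem
    return result
```

From the outside this still looks like one call per problem, even though the main agent may have run several times.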

If the results can be reproduced “out-of-the-box” with their coding agent, as they claim, that puts it up there as one of the top 2-3 CLI agents available right now.

gronky_ commented on Biomni: A General-Purpose Biomedical AI Agent   github.com/snap-stanford/... · Posted by u/GavCo
SalmoShalazar · 2 months ago
Not to take away from this or its usefulness (not my intent), but it is wild to me how many pieces of software of this type are being developed. We’re seeing endless waves of specialized wrappers around LLM API calls. There’s very little innovation happening beyond specializing around particular niches and invoking LLMs in slightly different ways with carefully directed context and prompts.
gronky_ · 2 months ago
I see it a bit differently - LLMs are an incredible innovation but it’s hard to do anything useful with them without the right wrapper.

A good wrapper has deep domain knowledge baked into it, combined with automation and expert use of the LLM.

It maybe isn’t super innovative, but it’s a bit of an art form, and it unlocks the utility of the underlying LLM.

gronky_ commented on Introducing Qodo Gen CLI: Build and Run Coding Agents Anywhere in the SDLC   qodo.ai/blog/introducing-... · Posted by u/benocodes
mdaniel · 2 months ago
I'm not in marketing, but what message is the sexy anteater supposed to convey about integrating this into my workflow?
gronky_ · 2 months ago
It will catch those sneaky bugs
gronky_ commented on Leaked data reveals Israeli govt campaign to remove pro-Palestine posts on Meta   dropsitenews.com/p/leaked... · Posted by u/jbegley
gronky_ · 4 months ago
Stating that Israel doesn’t have a right to exist has been recognized to be an antisemitic statement by many prominent institutions.

It’s a radical statement that effectively denies the rights of millions of people to exist and is especially problematic given the historical context of the establishment of Israel.

The statement gets thrown around so much in certain circles that it’s gotten normalized. You’ve apparently lost sight of, or never stopped to think about, what it actually means, to the point where you’re providing it as an example of an innocent statement that got you banned for no reason. Taking this statement out of radical activist circles and into the real world won’t go well.

Take some time to educate yourself and reflect on what it actually means.

u/gronky_

Karma: 882 · Cake day: December 6, 2022