Readit News
gronky_ commented on Show HN: NanoClaw – “Clawdbot” in 500 lines of TS with Apple container isolation   github.com/gavrielc/nanoc... · Posted by u/jimminyx
dceddia · 12 days ago
I went down this rabbit hole a bit recently trying to use Claude inside fence[0], and it seems that on macOS, Claude stores this token in the Keychain. I'm not sure there's a way to expose that to a container... my guess would be no, especially since the container is Linux, and because keeping the Keychain out of reach of containers seems like it would be paramount. But someone might know better!

0: https://github.com/Use-Tusk/fence

gronky_ · 12 days ago
True. Claude Code does have an apiKeyHelper setting, though: you point it at a script that retrieves the token for Claude Code. I imagine you could use that, but I haven't quite figured out how to wire it up.
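For what it's worth, a minimal sketch of how that wiring might look. Only the apiKeyHelper setting itself is documented Claude Code behavior; the paths, file names, and the env-var handoff below are assumptions for illustration:

```shell
# ~/.claude/settings.json (hypothetical path/contents) would contain:
#   { "apiKeyHelper": "/usr/local/bin/claude-key.sh" }
# Claude Code invokes the helper and uses its stdout as the credential,
# so the token never has to come from the macOS Keychain inside the container.
cat > /tmp/claude-key.sh <<'EOF'
#!/bin/sh
# Fetch the token from wherever the container can reach it -- here, an
# env var injected at container start -- and print it on stdout.
printf '%s' "$ANTHROPIC_API_KEY"
EOF
chmod +x /tmp/claude-key.sh
ANTHROPIC_API_KEY=sk-test-123 /tmp/claude-key.sh
```

The helper could just as well shell out to a secrets manager on the host; the only contract is "print the token on stdout".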
gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
gronky_ · 6 months ago
Keep in mind that this isn’t about users - the top agents on the leaderboard aren’t running an actual product on the benchmark.

If they are running their production product as is, then of course whatever is built into the product is fine.

gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
terminalshort · 6 months ago
From my perspective as a potential user the number of attempts is the number of times I have to tell it what to do. If you have an agent that makes a single attempt and is 60% accurate vs another that makes 5 attempts and is 80% accurate, why would you care that each individual attempt of the 2nd model is less accurate than the first?
gronky_ · 6 months ago
This ok from your perspective then?

    def make_pass_at_1_agent(agent, n):  # "@" isn't valid in a Python identifier
        def retry_agent(problem):
            result = None  # guard against n == 0
            for attempt in range(n):
                result = agent(problem)
                if result.success:
                    return result
            return result  # last (failed) attempt
        return retry_agent
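To make terminalshort's numbers concrete: with independent attempts, the wrapped agent's apparent pass rate is 1 - (1 - p)^n. A quick sketch -- the 60%/5-attempt figures come from the comment above, and real attempts aren't independent, so treat this as an upper bound:

```python
# Apparent success rate of a retry wrapper over an agent whose single
# attempt succeeds with probability p, given up to n independent attempts.
p, n = 0.6, 5
p_best_of_n = 1 - (1 - p) ** n
print(round(p_best_of_n, 4))  # 0.9898
```

So a 60%-per-attempt agent hidden behind five silent retries would report close to 99%, without any individual attempt getting better.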

gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
eddd-ddde · 6 months ago
I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?
gronky_ · 6 months ago
It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if it's technically acceptable.
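For context, pass@k has a standard unbiased estimator (introduced with the Codex/HumanEval work); a minimal sketch showing why k matters:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per problem, c of which
    passed, evaluated at k (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the plain per-attempt success rate c/n --
# which is exactly the number a silent-retry wrapper inflates.
print(pass_at_k(5, 2, 1))  # 0.4
```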

gronky_ commented on Qodo CLI agent scores 71.2% on SWE-bench Verified   qodo.ai/blog/qodo-command... · Posted by u/bobismyuncle
gronky_ · 6 months ago
I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it at 5th, which is four points below the leader (four points is a lot) and just over 1% lower than Anthropic's own submission for Claude Sonnet 4 - the same model these guys are running.

But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.

Take Refact, for example, which is at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE-bench (https://github.com/smallcloudai/refact-bench/). It's pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again - so it's effectively multiple attempts per problem.
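A hedged sketch of that pattern - the names and structure here are illustrative, not taken from refact-bench:

```python
# Multi-agent retry loop: a main agent attempts the problem; on failure,
# a "debug agent" analyzes the failed attempt and feeds insights back
# for another try. Functionally this is multiple attempts per problem,
# just hidden inside the harness.
def solve_with_debug_loop(main_agent, debug_agent, problem, max_rounds=3):
    insights = None
    result = None
    for _ in range(max_rounds):
        result = main_agent(problem, insights)
        if result["success"]:
            return result
        insights = debug_agent(problem, result)  # analyze the failure
    return result  # last (failed) attempt
```

From the outside this looks like one call per problem, which is why it's hard to reconcile with a pass@1 checkbox.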

If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.

gronky_ commented on Biomni: A General-Purpose Biomedical AI Agent   github.com/snap-stanford/... · Posted by u/GavCo
SalmoShalazar · 7 months ago
Not to take away from this or its usefulness (not my intent), but it is wild to me how many pieces of software of this type are being developed. We’re seeing endless waves of specialized wrappers around LLM API calls. There’s very little innovation happening beyond specializing around particular niches and invoking LLMs in slightly different ways with carefully directed context and prompts.
gronky_ · 7 months ago
I see it a bit differently - LLMs are an incredible innovation but it’s hard to do anything useful with them without the right wrapper.

A good wrapper has deep domain knowledge baked into it, combined with automation and expert use of the LLM.

It maybe isn't super innovative, but it's a bit of an art form, and it unlocks the utility of the underlying LLM.

gronky_ commented on Introducing Qodo Gen CLI: Build and Run Coding Agents Anywhere in the SDLC   qodo.ai/blog/introducing-... · Posted by u/benocodes
mdaniel · 8 months ago
I'm not in marketing, but what message is the sexy anteater supposed to convey about integrating this into my workflow?
gronky_ · 8 months ago
It will catch those sneaky bugs

u/gronky_ · karma: 885 · joined December 6, 2022