    def make_pass_at_1_agent(agent, n):
        # Retry the wrapped agent up to n times, but surface only a single result,
        # so an n-attempt run gets scored as if it were one "pass@1" attempt.
        def retry_agent(problem):
            for attempt in range(n):
                result = agent(problem)
                if result.success:
                    return result
            return result  # every attempt failed; return the last result
        return retry_agent

If they are running their production product as-is, then of course whatever is built into the product is fine.
Building multiple attempts into your agent is stretching the rules, even if technically it's acceptable.
71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over a point below Anthropic's own submission for Claude Sonnet 4 - the same model these guys are running.
But the top-rated submissions aren't running production products. They generally have extensive scaffolding or harnesses built *specifically for SWE-bench*, which kind of defeats the whole purpose of the benchmark.
Take Refact, for example, which sits at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE-bench (https://github.com/smallcloudai/refact-bench/). It's pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and feeds insights back to the main agent, which tries again, so it's effectively multiple attempts per problem.
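Roughly, that pattern looks something like this (a hypothetical sketch with invented names, not Refact's actual code): the debug agent only runs after a failure, and its analysis gets folded into the next attempt.

    def solve_with_debug_loop(problem, main_agent, debug_agent, max_rounds=3):
        # Hypothetical sketch of the orchestration described above:
        # each failed attempt is analyzed by a debug agent, and its insights
        # are handed back to the main agent for another try.
        insights = []
        result = None
        for _ in range(max_rounds):
            result = main_agent(problem, extra_context=insights)
            if result.success:
                return result
            # The debug agent inspects the failure (logs, diff, test output)
            # and returns hints for the next round.
            insights.append(debug_agent(problem, failed_result=result))
        return result  # last failed attempt if no round succeeded

Functionally it's the same trick as the retry wrapper above, just with a smarter feedback signal between attempts.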
If the results can be reproduced "out of the box" with their coding agent as they claim, that puts it up there as one of the top 2-3 CLI agents available right now.
A good wrapper has deep domain knowledge baked into it, combined with automation and expert use of the LLM.
Maybe it isn't super innovative, but it's a bit of an art form and it unlocks the utility of the underlying LLM.
0: https://github.com/Use-Tusk/fence