Readit News
ofirpress commented on Advancing AI Benchmarking with Game Arena   blog.google/innovation-an... · Posted by u/salkahfi
ofirpress · 14 days ago
This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash:

We have coding agents implement game-playing agents that compete against each other. So Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. It's a really tough task, and it leads to very interesting findings about AI for coding.

https://codeclash.ai/

ofirpress commented on Claude Code daily benchmarks for degradation tracking   marginlab.ai/trackers/cla... · Posted by u/qwesr123
mohsen1 · 18 days ago
Hope you don't mind the unrelated question:

How do you pay for those SWE-bench runs?

I am trying to run a benchmark, but it is too expensive to do enough runs to get a fair comparison.

https://mafia-arena.com

ofirpress · 18 days ago
Benchmarks can get costly to run. You can reach out to frontier model creators and ask for free credits, but usually they'll only agree once your benchmark is pretty popular.
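
To give a sense of scale (all numbers below are hypothetical placeholders, not actual SWE-bench figures), the cost is roughly tasks x runs x tokens-per-task x price:

    # Rough benchmark-cost estimate. All numbers are hypothetical placeholders;
    # plug in your own task count, run count, token usage, and model pricing.
    tasks = 300                  # tasks per evaluation
    runs = 5                     # repeated runs to average out variance
    tokens_per_task = 400_000    # input + output tokens an agent might use per task
    price_per_mtok = 5.00        # blended $ per million tokens (model-dependent)

    total_tokens = tasks * runs * tokens_per_task
    cost = total_tokens / 1_000_000 * price_per_mtok
    print(f"~{total_tokens / 1e6:.0f}M tokens, roughly ${cost:,.0f} per evaluation")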
ofirpress commented on Claude Code daily benchmarks for degradation tracking   marginlab.ai/trackers/cla... · Posted by u/qwesr123
ofirpress · 18 days ago
[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and only once per day, so a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks, run the test suite 5 or 10 times per day, and average those scores. A lot of the variance in the score can come from random things like Anthropic's servers being overloaded.
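
As a rough illustration of why (hypothetical numbers; treating each task as an independent coin flip, which is optimistic since repeated runs on the same tasks aren't fully independent), more tasks and more runs shrink the noise:

    import math

    # Standard error of a benchmark accuracy estimate, treating each task
    # attempt as an independent Bernoulli trial. Numbers are hypothetical,
    # for illustration only; repeated runs on the same tasks are correlated,
    # so this overstates the benefit somewhat.
    def standard_error(accuracy: float, n_tasks: int, n_runs: int = 1) -> float:
        n = n_tasks * n_runs
        return math.sqrt(accuracy * (1 - accuracy) / n)

    p = 0.70  # assumed true solve rate
    print(f"50 tasks, 1 run/day  : +/-{standard_error(p, 50):.1%}")
    print(f"300 tasks, 5 runs/day: +/-{standard_error(p, 300, 5):.1%}")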
ofirpress commented on How to code Claude Code in 200 lines of code   mihaileric.com/The-Empero... · Posted by u/nutellalover
ofirpress · a month ago
We (the SWE-bench team) have a 100-line-of-code agent that is now pretty popular in both academic and industry labs: https://github.com/SWE-agent/mini-swe-agent

I think it's a great way to dive into the agent world.
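
If you're curious what the pattern looks like, here's a minimal sketch of the idea (this is not the actual mini-swe-agent code, and query_model is a placeholder for whatever LLM API you use): the model proposes one shell command per turn, you run it, and you feed the output back until it says it's done.

    import subprocess

    # Hypothetical sketch of the agent-loop pattern, not mini-swe-agent itself.
    def query_model(messages: list[dict]) -> str:
        raise NotImplementedError("call your LLM provider here")

    def run_agent(task: str, max_steps: int = 20) -> None:
        messages = [
            {"role": "system", "content":
             "Solve the task by replying with exactly one shell command per turn. "
             "Reply with 'DONE' when finished."},
            {"role": "user", "content": task},
        ]
        for _ in range(max_steps):
            command = query_model(messages).strip()
            if command == "DONE":
                break
            result = subprocess.run(command, shell=True, capture_output=True,
                                    text=True, timeout=60)
            observation = (result.stdout + result.stderr)[-4000:]  # truncate long output
            messages.append({"role": "assistant", "content": command})
            messages.append({"role": "user", "content": observation})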

ofirpress commented on IQuest-Coder: A new open-source code model beats Claude Sonnet 4.5 and GPT 5.1 [pdf]   github.com/IQuestLab/IQue... · Posted by u/shenli3514
sabareesh · a month ago
TL;DR: they didn't clean the repo (the .git/ folder), so the model just reward-hacked its way to looking up future commits that contained the fixes. Credit goes to everyone in this thread for figuring it out: https://xcancel.com/xeophon/status/2006969664346501589

(Given that IQuestLab published their SWE-bench Verified trajectory data, I want to be charitable and assume a genuine oversight rather than "benchmaxxing"; it's probably an easy thing to miss if you are new to benchmarking.)

https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...

ofirpress · a month ago
As John says in that thread, we've fixed this issue in SWE-bench: https://xcancel.com/jyangballin/status/2006987724637757670

If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated Docker images.
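
If you maintain your own harness and want to sanity-check a task environment for this class of leak, here's a rough sketch (hypothetical paths and names, not part of the official SWE-bench tooling): verify that no commit reachable in the checked-out repo is newer than the task's base commit.

    import subprocess

    # Hypothetical sanity check, not official SWE-bench tooling: confirm that
    # the repo inside a task environment contains no commits newer than the
    # task's base commit (i.e. the future fix is not reachable via .git).
    def newest_commit_timestamp(repo_dir: str) -> int:
        out = subprocess.run(
            ["git", "-C", repo_dir, "log", "--all", "--format=%ct"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        return max(int(ts) for ts in out)

    def base_commit_timestamp(repo_dir: str, base_commit: str) -> int:
        out = subprocess.run(
            ["git", "-C", repo_dir, "show", "-s", "--format=%ct", base_commit],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return int(out)

    def has_future_commits(repo_dir: str, base_commit: str) -> bool:
        return newest_commit_timestamp(repo_dir) > base_commit_timestamp(repo_dir, base_commit)

    # Example (hypothetical path and hash):
    # print(has_future_commits("/testbed", "abc1234"))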

ofirpress commented on Reflections on AI at the End of 2025   antirez.com/news/157... · Posted by u/danielfalbo
ofirpress · 2 months ago
> There are certain tasks, like improving a given program for speed, for instance, where in theory the model can continue to make progress with a very clear reward signal for a very long time.

Yup, this will absolutely be a big driver of gains in AI for coding in the near future. We actually built a benchmark based on this exact principle: https://algotune.io/
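
The loop itself is simple; here's a hypothetical sketch (placeholder names, not the AlgoTune harness): time a candidate implementation, ask the model for a faster version, and keep it only if it stays correct and gets faster.

    import timeit

    # Hypothetical sketch of a speed-optimization loop with a clear reward
    # signal. ask_model_for_faster_version() is a placeholder, not a real API;
    # the candidate code is assumed to define a function solve(x).
    def ask_model_for_faster_version(source_code: str, runtime: float) -> str:
        raise NotImplementedError("ask your LLM to rewrite solve() to run faster")

    def time_solve(source_code: str, test_input) -> float:
        namespace: dict = {}
        exec(source_code, namespace)  # defines solve(x)
        return timeit.timeit(lambda: namespace["solve"](test_input), number=10)

    def optimize(source_code: str, test_input, expected, steps: int = 10) -> str:
        best_code, best_time = source_code, time_solve(source_code, test_input)
        for _ in range(steps):
            candidate = ask_model_for_faster_version(best_code, best_time)
            try:
                namespace: dict = {}
                exec(candidate, namespace)
                if namespace["solve"](test_input) != expected:
                    continue                      # reward signal: must stay correct...
                t = time_solve(candidate, test_input)
            except Exception:
                continue                          # broken candidates get no reward
            if t < best_time:                     # ...and get faster
                best_code, best_time = candidate, t
        return best_code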

ofirpress commented on Top model scores may be skewed by Git history leaks in SWE-bench   github.com/SWE-bench/SWE-... · Posted by u/mustaphah
ofirpress · 5 months ago
[I'm on the SWE-bench team] Multiple people have looked into this, for example right in that thread: https://github.com/SWE-bench/SWE-bench/issues/465#issuecomme...

This issue affected a tiny fraction of existing agents in a tiny fraction of their runs, and we've now issued a fix.

This is a natural part of running a benchmark; I'm sure small things like this will keep being discovered, and we'll keep fixing them. It doesn't change the overall picture or the trends at all.

ofirpress commented on Ask HN: How to Learn to Build Agentic AI Systems (Like Claude Code)    · Posted by u/hhimanshu
ofirpress · 6 months ago
We (the Princeton SWE-bench team) have a 100-line-of-code agent that does pretty well; you can read the code here: https://github.com/SWE-agent/mini-swe-agent
ofirpress commented on How to build a coding agent   ghuntley.com/agent/... · Posted by u/ghuntley
ofirpress · 6 months ago
We (the Princeton SWE-bench team) built an agent in ~100 lines of code that does pretty well on SWE-bench; you might enjoy it too: https://github.com/SWE-agent/mini-swe-agent

u/ofirpress

Karma: 520 · Cake day: June 25, 2016
About
http://ofir.io