71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1 point lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take Refact, for example, which is at #2 with 74.4%. They built a 2k-line framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it’s effectively multiple attempts per problem.
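To make the "multiple attempts per problem" point concrete, here is a rough sketch of that kind of loop. This is not Refact's actual code; `Attempt`, `solve_with_debug_retries`, the two agent callables and the `passed` flag are all made-up names, and I'm assuming the harness has some check it can run to decide that an attempt failed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    patch: str    # the proposed fix
    passed: bool  # result of whatever check the harness runs (assumption)
    log: str      # failure output handed to the debug agent


def solve_with_debug_retries(
    problem: str,
    main_agent: Callable[[str, Optional[str]], Attempt],
    debug_agent: Callable[[str, Attempt], str],
    max_attempts: int = 3,
) -> Tuple[Optional[Attempt], int]:
    """Return (successful attempt or None, number of attempts used)."""
    hint: Optional[str] = None
    for attempt_no in range(1, max_attempts + 1):
        # The main agent sees the problem plus any advice from the last failure.
        attempt = main_agent(problem, hint)
        if attempt.passed:
            return attempt, attempt_no
        # Hypothetical debug step: turn the failure into a hint for the retry.
        hint = debug_agent(problem, attempt)
    return None, max_attempts
```

With `max_attempts=1` this collapses to a plain single-shot agent, which is roughly what an out-of-the-box run would look like. Anything above 1 is where the extra benchmark-specific attempts come in.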
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
That's what I want and look forward to one day.
I hate using voice for anything. I hate getting voice messages, I hate creating them. I get cold sweats just thinking about having to direct 10 AI Agents via voice. Just give me a keyboard and a bunch of screens, thanks.
Yet I can see that I was, in fact, born into privilege.
Not a privilege of money, but a privilege of priority, skills, and acceptance of risk.
My parents prioritized one single thing above all others. Land. They bought land. Remote land, useless land, land wherever it was cheap.
They could have fixed the car, but instead bought an acre of land. We would go 100 miles from the nearest town to eke out a parcel of land in some Godforsaken place I haven’t been to since.
Because of that, and the skills I learned because I had to do everything myself, I have never had to pay rent. Because I knew how to live without luxury, I built a cabin when I was 16 on my parents’ land with salvaged lumber and fixtures and wire and things I got from demolishing houses. I raised three children in various iterations of that eventually 600 square foot house.
By that time I was successful in infotech, so we bought and rebuilt (ourselves) a 63 foot steel schooner and finished raising our children at many ports in the world, so that they would grow up with the same privilege of mind, but with broader horizons.
But I never forgot land. Land, not a house, land. Land is the key. Just a couple hundred square meters is fine.
You can still do exactly what I did today. You can buy land cheaply in many places in the world, including the USA. I just bought a half acre in Montana for $1200, with road access. (I sometimes buy cheap land sight unseen halfway across the world when drunk and bored at 3am. The results are kinda hit and miss, but it makes for a good excuse to travel and see what happens.) On eBay there are many owner-financed deals with nominal or zero down, with payments from 50 to a few hundred dollars a month.
You can still tear down old structures for people and get building supplies. You can get furniture and appliances curbside or on Craigslist, etc. I don’t need to, but I sometimes still do.
Every opportunity I took advantage of is still practical today. You can still buy land on fast food wages, you just won’t be able to live near a big city while you do it. That also was impossible in my youth. The sacrifices were substantial, the discomfort at times severe.
Nothing has changed except the expectations that people have about life and what they can or cannot do.
I was born into privilege for sure, but it was a privilege of a culture of independence and a deep understanding of the value of owning outright a place to stand.
Except for those born into poverty in a truly hopeless place in the world, we suffer mostly from our attitudes, our lack of knowledge, and our lack of belief in our ability to do reasonable things that other people don’t believe we can do, because they are not willing to.
I grew up with a mentality of "you can't do that, there's a rule against that" and had to slowly break out of it as much as I could. Just knowing that there's people like you out there makes me happy. I applaud your freedom.
I literally Pi-hole-blocked all of YouTube after my son started reading the Bible because a Minecraft influencer kept preaching throughout most of his videos, to the point that my son became a bit too interested in the topic.
Not that I'm a rabid atheist or would deny my child such a thing, but if THAT can enter my 8-year-old's brain during the short time he's allowed to browse by himself, I'm worried what else is coming his way through it.
I'd love to give him access to valuable videos within rules I describe in natural language and can test myself, but nothing like this exists.
Cursor has been trying to do things to reduce the costs of inference, especially through context pruning. For instance, if you "attach" files to a conversation, Cursor no longer stuffs the code from those files into the prompt. Instead, it'll run function calls to open those files and read bits and pieces of the code until the model feels it has enough information. This seems like a perfectly reasonable strategy until you realize you cannot do the same thing with reasoning models, if you're limiting the reasoning to just the initial prompt.
If you prune out context from the initial prompt, instead of reasoning on richer context, the LLM reasons only on the prompt itself (with no access to the attached files). After the thinking process, Cursor runs function calls to retrieve more context, which entirely defeats the point of "thinking" and induces the model to create incoherent plans and speculative edits in its thinking process. That would explain Claude's bizarre over-editing behavior. I suspect this is why so many Cursor users are complaining about Claude 3.7.
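A toy illustration of the difference, in case it helps. This is only a sketch of the hypothesis above, not how Cursor actually builds prompts; the function names, the prompt layout and the example file are all invented.

```python
from typing import Dict


def build_inlined_prompt(question: str, files: Dict[str, str]) -> str:
    """Stuff every attached file's contents into the prompt up front."""
    parts = [question]
    for path, code in files.items():
        parts.append(f"--- {path} ---\n{code}")
    return "\n\n".join(parts)


def build_pruned_prompt(question: str, files: Dict[str, str]) -> str:
    """Mention only file names; contents would be fetched later via tool calls."""
    listing = "\n".join(f"- {path}" for path in files)
    return f"{question}\n\nAttached files (not loaded):\n{listing}"


def visible_to_reasoning(prompt: str, needle: str) -> bool:
    """Crude stand-in for 'can the thinking pass see this code at all?'"""
    return needle in prompt


# Hypothetical attached file and question.
files = {"billing.py": "def apply_discount(total):\n    return total * 0.9"}
question = "Why is the discount wrong?"

inlined = build_inlined_prompt(question, files)
pruned = build_pruned_prompt(question, files)

print(visible_to_reasoning(inlined, "apply_discount"))
print(visible_to_reasoning(pruned, "apply_discount"))
```

If the thinking pass only ever gets the pruned prompt, it is planning around a file name and has to guess at the contents.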
On top of this, Cursor has every incentive to keep the thinking effort for both o3-mini and Claude 3.7 to the very minimum so as to reduce server load.
Cursor is being hailed as one of the greatest SaaS growth stories, but their $20/mo all-you-can-eat business model puts them in such a bad place.
Imo most of their incentive for context pruning comes not just from reducing the token count, but from the perception that you only have to find "the right way"™ to build that context window automatically to get to a coding panacea. They just aren't there yet.
I've often wondered how to strike the right balance on these things, since apparently all options can lead to success.
I hate to be the compliance guy, but even from a startup perspective you'd at least want to mention what you promise to do here.