> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
> Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.
The variety of machines and their specificity is super fascinating and very specific. Definitely will change.
I use Claude. I like Claude. But I’ve backed away from having Claude actually write my code other than in the most limited circumstances.
I caught it copying one of my TS Interfaces, for example. And modifying, then using, the copy. So my type-checks pass, yay! But wait what?
It wrote a test for a tricky bit of code. The test wouldn’t pass. So it re-wrote it in a way that couldn’t possibly fail, mocking all elements inside the test itself.
I’m not anti-AI. But I wouldn’t trust anything vibe-coded above the importance of, say, Wordle.
Now, imagine if you could trade on the information when they do.
Of course when I went to read them they were 100% slop. The funniest requirement were progress bars for actions that don’t have progress. The tickets were, even if you assume the requirements weren’t slop, at least 15 points a piece.
But ok maybe with all of these new tools we can respond by implementing these insane requirements. The real problem is what this article is discussing. Each ticket was also 500-700 words. Requirements that boil down to a single if statement were described in prose. While this is hilarious the problem is it makes them harder to understand.
I tried to explain this and they just said “ok fine rewrite them then”. Which I did in maybe 15min because there wasn’t actually much to write.
At this point I’m at a loss for how to even work with people that are so convinced these things will save time because they look at the volume of the output.