I honestly have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.
The idea that all models have very close performance across all domains is a moderately insane take.
At any given moment the best model for my actual projects and my actual work varies.
Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better with some slightly larger numbers but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.
not saying there's a better way but both suck
I’d feel unscientific and broken? Sure maybe why not.
But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.
Benchmarks are a sometimes useful tool. But we are in prime Goodhart’s Law territory.
I mean, look at debugging in IntelliJ: https://resources.jetbrains.com/help/img/idea/2025.3/hotswap...
As opposed to the terminal: https://cdn.hashnode.com/res/hashnode/image/upload/v16980433...
RemedyBG: https://remedybg.itch.io/remedybg
RadDbg: https://x.com/rfleury/status/1747756219404779845?s=46
I mostly edit code in VS Code. I mostly debug code in Visual Studio. They don’t have to be the same.
I do concede that one tool to rule them all is appealing. But ultimately I work with many different languages so it’s kind of a multi-tool world no matter how you slice it.
* text editor with IntelliSense
* build system
* visual debugger
* CLI coding agent
It’s totally fine if those four things are different. In fact I actually probably prefer them to be different. Having an all-in-one IDE is a complete and total non-goal.
People have historically confused the first three as needing to be a single IDE. This has always been wrong. The number of people who think you can’t debug with Visual Studio if the exe wasn’t built from a .sln is shocking. They’re all independent!
How far from grace we have fallen :sob:
> a full 1080 degree camera spin
Do you mean 3 full turns, or do you mean 180 (one half-turn)?
The only test I ever want to see with these frame-gen models is a full 1080 degree camera spin. Miss me with that 30 degree back and forth crap. I want multiple full turns. Some jitter and a little back-and-forth wobble is fine. But I want multiple full spins.
I’m pretty sure I know why they don’t do this :)
Generally people include the former in the sleeper hit category.
For the latter I’m not sure about movies. But some shows have blown up after failing on one platform then moving to Netflix.