Wouldn't that be easy to make fair by making sure all models tried it with the same prompts? So you have model X and Y, and prompts A and B, and X runs once with A, once with B, and same for Y.
Reason I ask, is because in my own local benchmarks I do for each model release with my own tasks, I've noticed a huge variance in quality of responses based on the prompts themselves. Slight variation of wording seems to have a big effect on the final responses, and those variations seems to again have a big variance of effect depending on the model.
Sometimes a huge system prompt makes a model return much higher quality responses while another model gives much higher quality responses when the system prompt is as small as it possible can. At least this is what I'm seeing with the local models I'm putting under test with my private benchmarks.
I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.
This right HERE is the issue. It's not nearly deterministic enough to rely on.Then I asked both Gemini and Grok to count the legs, both kept saying 4.
Gemini just refused to consider it was actually wrong.
Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."
You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent.
EDIT: Thanks for downvoting what is literally one of the most important reasons for people to use local models. Denying and censoring reality does not prevent the bubble from bursting.