Like replacing named concepts with nonsense words in reasoning benchmarks.
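For concreteness, here is a minimal sketch of that kind of substitution (the names and toy item below are hypothetical, not taken from any actual benchmark or harness):

```python
# Sketch: rewrite a benchmark item so each named concept becomes a stable
# nonsense word, stripping away any memorized association with the originals.
import random
import re

def nonsense_token(rng: random.Random, length: int = 6) -> str:
    """Build a pronounceable nonsense word by alternating consonants and vowels."""
    consonants = "bdfgklmnprstvz"
    vowels = "aeiou"
    return "".join(
        rng.choice(consonants if i % 2 == 0 else vowels) for i in range(length)
    )

def anonymize(text: str, concepts: list[str], seed: int = 0) -> str:
    """Replace each named concept with its own nonsense token (seeded, so the
    same concept maps to the same token across the whole benchmark)."""
    rng = random.Random(seed)
    mapping = {c: nonsense_token(rng) for c in concepts}
    for concept, token in mapping.items():
        text = re.sub(rf"\b{re.escape(concept)}\b", token, text)
    return text

item = "All sparrows are birds. All birds can fly. Can sparrows fly?"
print(anonymize(item, ["sparrows", "birds", "fly"]))
# -> e.g. "All bazeki are dufola. All dufola can gimepa. Can bazeki gimepa?"
```

The logical structure of the item is untouched; only the surface vocabulary changes, so a correct answer has to come from reasoning rather than recall.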
I would want to hear more detail about prompts, frameworks, thinking time, etc., but those don't matter too much. The main caveat is that this was probably run on the public test set, so it could have appeared in pretraining data, and there could even have been some ARC-focussed post-training - I think we don't know yet, and might never know.
But for any reasonable setup, assuming no egregious cheating, that is an amazing score on ARC 2.
Also notable is which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub of Grok 4 / Grok 4.1.
And actually, if I do have a problem, it's quite the opposite of what you're suggesting: I'd like us to give more weight to the lived experience of others, even in other contexts and on other subject matter.
> (for) most research projects, it would not help to have input from the general public. In fact, it would just be time-consuming, because error checking... [0]
Since frontier LLMs make clumsy mistakes, they may fall into this category of 'error-prone' mathematicians whose net contributions are actually negative, despite being impressive some of the time.
[0] https://www.youtube.com/watch?v=HUkBz-cdB-k&t=2h59m33s