As one might expect: since the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome, but the effect is brittle and disappears when the model is pushed outside the bounds of its training data.
To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."
- https://arcprize.org/leaderboard
- https://aider.chat/docs/leaderboards/
- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
Surely the IMO problems weren't "within the bounds" of Gemini's training data.
One thing I find hard to wrap my head around is that we are giving more and more trust to something we don't understand, on the assumption (often unchecked) that it just works. Basically, your refrain gets used to justify all sorts of odd setups of AIs, agents, etc.
I am much more worried about the problem where LLMs are actively misleading low-info users into thinking they’re people, especially children and old people.
“Did you try running it over and over until you got the results you wanted?”
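To spell out the pattern being alluded to, it's just resample-until-it-passes. A minimal sketch, where `query_model` is a hypothetical stand-in for a real model call and the acceptance check is supplied by the caller:

```python
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned answer here."""
    return random.choice(["42", "not sure", "7", "42 (final answer)"])

def retry_until_pass(prompt: str, is_acceptable, budget: int = 10):
    """Resample up to `budget` times and return the first output that passes the check."""
    for _ in range(budget):
        candidate = query_model(prompt)
        if is_acceptable(candidate):
            return candidate
    return None  # budget exhausted without an acceptable sample

if __name__ == "__main__":
    print(retry_until_pass("What is 6 * 7?", lambda s: "42" in s))
```

The entire burden falls on the acceptance check, which is rather the point of the question.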
I would have thought the more obvious approach would be to couple it to some kind of symbolic logic engine. It might transform plain language statements into fragments conforming to a syntax which that engine could then parse deterministically. This is the Platonic ideal of reasoning that the author of the post pooh-poohs, I guess, but it seems to me to be the whole point of reasoning; reasoning is the application of logic in evaluating a proposition. The LLM might be trained to generate elements of the proposition, but it's too random to apply logic.
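To make that concrete, here's a minimal sketch of the deterministic half, in Python. It's purely illustrative (the grammar, the `is_tautology` check, and all the names are my own invention, not any existing prover): the LLM would only be asked to translate a statement into this small syntax, and the engine then parses and evaluates it exhaustively, with no randomness anywhere.

```python
import itertools
import re

# Toy deterministic "engine": the LLM's only job would be to emit a formula in
# this fixed syntax (~, &, |, ->); everything past this point is deterministic.

TOKEN_RE = re.compile(r"\s*(->|[~&|()]|[A-Za-z_]\w*)")

def tokenize(text: str):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"bad input at {text[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0
    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None
    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok
    def parse(self):
        node = self.implication()
        if self.peek() is not None:
            raise ValueError(f"trailing token {self.peek()!r}")
        return node
    def implication(self):  # '->' is right-associative
        left = self.disjunction()
        if self.peek() == "->":
            self.eat("->")
            return ("->", left, self.implication())
        return left
    def disjunction(self):
        node = self.conjunction()
        while self.peek() == "|":
            self.eat("|")
            node = ("|", node, self.conjunction())
        return node
    def conjunction(self):
        node = self.negation()
        while self.peek() == "&":
            self.eat("&")
            node = ("&", node, self.negation())
        return node
    def negation(self):
        if self.peek() == "~":
            self.eat("~")
            return ("~", self.negation())
        if self.peek() == "(":
            self.eat("(")
            node = self.implication()
            self.eat(")")
            return node
        tok = self.eat()
        if not (tok[0].isalpha() or tok[0] == "_"):
            raise ValueError(f"expected a variable, got {tok!r}")
        return ("var", tok)

def evaluate(node, env):
    op = node[0]
    if op == "var":
        return env[node[1]]
    if op == "~":
        return not evaluate(node[1], env)
    a, b = evaluate(node[1], env), evaluate(node[2], env)
    return {"&": a and b, "|": a or b, "->": (not a) or b}[op]

def variables(node):
    if node[0] == "var":
        return {node[1]}
    return set().union(*(variables(child) for child in node[1:]))

def is_tautology(formula: str) -> bool:
    """True iff the formula holds under every truth assignment (brute force)."""
    tree = Parser(tokenize(formula)).parse()
    names = sorted(variables(tree))
    return all(
        evaluate(tree, dict(zip(names, values)))
        for values in itertools.product([True, False], repeat=len(names))
    )

if __name__ == "__main__":
    # e.g. something an LLM might emit for "if A and B, then A"
    print(is_tautology("(A & B) -> A"))   # True
    print(is_tautology("A -> (A & B)"))   # False
```

The hard part, of course, is the translation step; the engine itself is the easy bit.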
https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
Dumb question, but anything like this that's written about on the internet will ultimately end up as training fodder, no?