Disclaimer: I’m no expert. An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer. That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Why did that experience lead you to that conclusion?
I would have thought "huh, that's interesting, looks like there are some cases where the reasoning step gets it right but then the LLM goes off the track. LLMs are so weird."
Reasoning implies (limited) understanding of the context. There is none of that. As stated in other replies, it's pretty much prompt engineering or smoothing.
Yes. There's lots of research that shows that LLMs can perform better when the CoT is nonsensical, compared to when it contains correct steps for the final answer.
So basically, it's just like back in CNNs: we gave the network multiple filters hoping it would mimic our human-designed filter banks (one edge detector, one this, one that), and instead we found that each learned filter was nonsensical interpretability-wise, yet in the end it gave us the same or better answer. Likewise, an LLM's CoT can be BS and still give the same or better answer compared to when it actually makes sense. [I'm not making a human comparison, which would be very subjective, just comparing an LLM with a BS CoT vs. an LLM with a makes-sense CoT.]
Some loss functions force the CoT to "make sense", which is counterproductive but is needed if you want to sell the anthropomorphisation, which VC-funded companies need to do.
There is no need to fall back on anthropomorphisation to explain why long CoTs lead to better answers either: an LLM is a fixed amount of compute per token. Complexity theory says that harder problems need more correlated compute, and the only way for an LLM to compute "more" is to produce more and more tokens. Note that because previous computations come back in as input, this is correlated compute, which is just what we need.
What you observed would happen anyway, to be clear; I just pointed out an interesting tangent. Philosophically, it affirms the validity of a large number of alternative logic systems akin to the one we want to use.
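A back-of-the-envelope sketch of the "more tokens = more compute" point; the 7B parameter count and the roughly-2N-FLOPs-per-generated-token rule of thumb are illustrative assumptions, not measurements:

    # A transformer spends a roughly fixed amount of compute per generated
    # token, so total compute scales with the length of the rollout.
    PARAMS = 7e9                      # assumed 7B-parameter model
    FLOPS_PER_TOKEN = 2 * PARAMS      # rough forward-pass cost per token

    for n_tokens in (10, 1_000, 10_000):          # terse answer vs. long CoT
        total = n_tokens * FLOPS_PER_TOKEN
        print(f"{n_tokens:>6} tokens -> ~{total:.1e} FLOPs")

    # And because each new token attends to all previous ones, the extra
    # compute is correlated with what came before, not independent.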
Most of the value I get out of reasoning LLMs is their automatic tool use (web search + coding), and I can't think of a way "nonsensical web searches" would somehow find relevant web answers.
> If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
Yes, they clearly are not bullshitting. They would be bullshitting if they told me that the LLM "thinks" while helping them.
Autocompletion and inline documentation were a godsend in their time. They solved the particular hard and heavy problem of kilos of manuals. They were a technical solution to a problem, just like LLMs.
> An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer.
I work with Claude Code in reasoning mode every day. I’ve seen it do foolish things, but never that. I totally believe that happened to you, though. My first question would be which model/version you were using; I wonder if models with certain architectures or training regimens are more prone to this type of thing.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Oh, come on.
People need to stop getting so hung up on the words “thinking” and “reasoning”. Call it “verbose mode” or whatever if it makes you feel better. The point is that these modes (whatever you want to call them) have generally (not always, but generally) resulted in better performance and have interesting characteristics.
This is unfair, and it's why people see HN as a largely pessimistic crowd. Just because someone might be wrong doesn't mean they are actively trying to deceive you, which I assume is what you mean by "bullshitting".
It's a new and shiny object and people tend to get over-excited. That's it.
We currently don't really know what intelligence is, so we don't have a good definition of what to expect from "AI", but anyone who has used current "AI" for anything other than chat or search should recognize that the "AI" is not "I" at all.
The "AI" does not "know" anything. It is really a fuzzy search over an "MP3" of a database: lossily compressed, with the corresponding drop in quality.
Based on that, everyone who claims current "AI" technology is any kind of intelligence has either fallen for the hype sold by the "AI" tech companies, or is an "AI" tech company (or associated with one) trying to sell you an "AI" model subscription or to get you to invest in it.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting
This kind of logic is very silly to me. So the LLM got your one-off edge case wrong and we are supposed to believe it's all bullshit. Sure. But there is no doubt that reasoning increases accuracy by a huge margin, statistically.
> Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
Maybe the problem is calling it reasoning in the first place. All these models do is expand the user prompt into a much bigger prompt that seems to perform better. Instead of reasoning, we should call this prompt smoothing or context smoothing, so it's clear that this is not actual reasoning, just optimizing the prompt and expanding the context.
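Roughly, the whole mechanism amounts to something like the sketch below; `complete()` and the scaffold wording are hypothetical placeholders, not any vendor's actual API:

    # Sketch: "reasoning" as expanding the user prompt into a bigger context
    # before the final answer is produced.
    def complete(prompt: str) -> str:
        """Hypothetical stand-in for an LLM completion call."""
        return f"[model output conditioned on {len(prompt)} chars]"

    def answer_with_cot(question: str) -> str:
        # Step 1: ask for intermediate text -- this only pads ("smooths") the context.
        scratch = complete(f"Think step by step, but do not answer yet:\n{question}")
        # Step 2: the final answer is just another completion over a bigger prompt.
        return complete(f"{question}\n\nNotes:\n{scratch}\n\nFinal answer:")

    print(answer_with_cot("How many Fridays are there in March 2025?"))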
If you go out of your way to avoid anthropomorphizing LLMs? You are making a mistake at least 8 times out of 10.
LLMs are crammed full of copied human behaviors - and yet, somehow, people keep insisting that under no circumstances should we ever call them that! Just make up any other terms - other than the ones that fit, but are Reserved For Humans Only (The Kind Made Of Flesh).
Nah. You should anthropomorphize LLMs more. They love that shit.
> Nah. You should anthropomorphize LLMs more. They love that shit.
I'm reminded of something I read in a comment, paraphrasing: it makes sense to anthropomorphize something that loudly anthropomorphizes itself when someone so much as picks it up.
I feel like "intuition" really fits what an LLM does. From the input, the LLM intuitively produces some tokens/text. And a "thinking" LLM essentially just applies that intuition again to the previously generated tokens, which produces another text that may (or may not) be a better version.
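A toy sketch of that "intuition applied to its own output" loop; `complete()` is a made-up stand-in for whatever model call you use, and nothing guarantees each round is actually better:

    # Sketch: "thinking" as intuition applied repeatedly to its own output.
    def complete(prompt: str) -> str:
        """Hypothetical stand-in for a single model call."""
        return f"draft based on {len(prompt)} chars of context"

    def refine(question: str, rounds: int = 3) -> str:
        draft = complete(question)                  # first intuitive guess
        for _ in range(rounds):
            # Feed the previous draft back in; no guarantee of improvement.
            draft = complete(f"{question}\n\nPrevious attempt:\n{draft}\n\nImprove it:")
        return draft

    print(refine("Why is the sky blue?"))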
Reasoning is essentially doing as a 'built-in' feature what users found out earlier: requesting longer, more contextual responses tends to arrive at a more specific conclusion. Or, put inversely, asking for 'just the answer' with no hidden 'reasoning' gives answers that are far more brittle.
Checking the reasoning steps for consistency with a correct reply, as a way to evaluate actual LLM performance, is a fundamentally misleading idea. Thinking models learn to do two things:

1. Perform sampling near the problem space of the question, putting related facts / concepts on the table.

2. An LLM that went through reinforcement learning to produce a chain of thought can be seen as a model able to steer its final answer into the right place by changing its internal state, token after token. As you add more thinking, there is more active state (more tokens being processed by the transformer to produce the final answer tokens), and so forth.

When the CoT ends, the model emits the answer, but the reasoning does not happen in the tokens themselves; it happens in the activation state of the network each time a token of the final answer is produced. The CoT is the state needed in order to emit the best answer, but after the <think> block is closed (the details depend on the exact LLM), the model may recognize that what is inside the CoT is actually wrong, and reply (correctly) in a way that negates the sampling performed so far.
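A rough sketch of that mechanism at the decoding level; the tag names, `generate_next_token()`, and the canned tokens are illustrative assumptions, not any specific model's interface:

    # Sketch: the CoT is just more conditioning context for the answer tokens;
    # nothing forces the answer to agree with what the CoT literally says.
    _canned = iter(["the notes suggest X...", "</think>",
                    "Actually", " the", " answer", " is", " Y.", "<eos>"])

    def generate_next_token(context: list[str]) -> str:
        """Hypothetical stand-in for one decoding step of a transformer."""
        return next(_canned)

    def decode(prompt: str, max_tokens: int = 64) -> str:
        context = [prompt, "<think>"]
        while len(context) < max_tokens:                # chain-of-thought phase
            tok = generate_next_token(context)          # re-reads all prior tokens
            context.append(tok)
            if tok == "</think>":
                break
        answer = []                                     # answer phase: conditioned on
        while len(context) + len(answer) < max_tokens:  # the whole CoT, but free to
            tok = generate_next_token(context + answer) # negate its literal content
            if tok == "<eos>":
                break
            answer.append(tok)
        return "".join(answer)

    print(decode("Is it X or Y?"))   # -> "Actually the answer is Y."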
> LLMs have demonstrated impressive reasoning abilities through [CoT prompting etc.]. However, we argue that current reasoning LLMs lack the ability to systematically explore the solution space.
Pretty much confirmed at this point by multiple studies from last year already showing a breakdown of reasoning in unfamiliar contexts (see also [1] for citations). LLMs excel at language tasks, after all, and what does work really well is combining that strength with logic and combinatorial languages (the neuro-symbolic approach) by generating Prolog source code ([1]). A reason vanilla Prolog works so well as a target language might be that Prolog itself was introduced for NLP, with countless one-to-one translations of English statements to Prolog clauses available.
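A minimal sketch of that pipeline, assuming SWI-Prolog's `swipl` binary is installed, and with the English-to-Prolog translation hard-coded here instead of generated by a model:

    # English statements translated one-to-one into Prolog clauses, then handed
    # to an external Prolog engine instead of asking the LLM to "reason" in prose.
    import subprocess, tempfile, textwrap

    # "Socrates is a man. All men are mortal. Is Socrates mortal?"
    program = textwrap.dedent("""\
        man(socrates).
        mortal(X) :- man(X).
        main :- ( mortal(socrates) -> write(yes) ; write(no) ), nl.
    """)

    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
        f.write(program)
        path = f.name

    # -q: quiet, -g main: goal to run, -t halt: exit afterwards.
    result = subprocess.run(["swipl", "-q", "-g", "main", "-t", "halt", path],
                            capture_output=True, text=True)
    print(result.stdout)   # -> yes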
I'd encourage everyone to learn about Metropolis-Hastings Markov chain Monte Carlo and then squint at LLMs: think about what token-by-token generation of the long rollouts maps to in that framework, and consider that you can think of the stop token as a learned stopping criterion accepting (a substring of) the output.
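For anyone who wants something concrete to squint at, here is a bare-bones Metropolis-Hastings sampler over a toy 1D density; the mapping to LLM rollouts is the analogy above, not anything in the code itself:

    # Minimal Metropolis-Hastings over an unnormalized 1D target density.
    import math, random

    def target(x: float) -> float:
        # Mixture of two Gaussians centered at -2 and +2 (unnormalized).
        return math.exp(-0.5 * (x - 2) ** 2) + math.exp(-0.5 * (x + 2) ** 2)

    def metropolis_hastings(n_steps: int = 10_000, step: float = 1.0) -> list[float]:
        x, samples = 0.0, []
        for _ in range(n_steps):
            proposal = x + random.gauss(0.0, step)       # symmetric proposal
            if random.random() < min(1.0, target(proposal) / target(x)):
                x = proposal                             # accept; else keep old state
            samples.append(x)
        return samples

    # Loose analogy: proposals ~ candidate continuations of a rollout, the
    # accept/reject step ~ a learned criterion for keeping or stopping on an output.
    samples = metropolis_hastings()
    print(sum(samples) / len(samples))   # near 0 by symmetry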
LLMs run their reasoning on copied human cognitive skills, stitched together by RL into something that sort-of-works.
What are their skills copied from? An unholy amount of unlabeled text.
What does an unholy amount of unlabeled text NOT contain? A completely faithful representation of how humans reason, act in an agentic manner, explore solution spaces, etc.
We know that for sure - because not even the groundbreaking scientific papers start out by detailing the 37 approaches and methods that were considered and decided against, or were attempted but did not work. The happy 2% golden path is shown - the unhappy 98% process of exploration and refinement is not.
So LLMs have pieces missing. They try to copy a lossy, unfaithful representation of how humans think, and make it work anyway. They don't have all the right heuristics for implementing things like advanced agentic behavior well, because no one ever writes that shit down in detail.
A fundamental limitation? Not quite.
You can try to give LLMs better training data to imbue them with the right behaviors. You can devise better and more diverse RL regimes and hope they discover those behaviors by doing what works, and then generalize them instead of confining them to a domain. Or just scale everything up, so that they pick up on more things that are left unsaid right in pretraining, and can implement more of them in each forward pass. In practice? All of the above.
This paper looks like it overlaps a bit with that Apple paper that caused a stir a few months ago: "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" - the Towers of Hanoi one. https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
I'd argue that we have that already: coding agents with access to a programming language (or, even better, a container they can run commands in) can use all sorts of other tools to help explore a solution space.
They have other tricks too. Claude Code makes itself a TODO list for a problem and can tackle the items on that list one-by-one, including firing off sub-agents to perform subsets of those tasks.
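A toy sketch of that pattern; the `Task`/`Agent` structure and `run_subagent()` here are my own illustrative stand-ins, not Claude Code's actual internals:

    # Toy sketch of an agent that keeps a TODO list and farms items out.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        description: str
        done: bool = False

    @dataclass
    class Agent:
        todo: list[Task] = field(default_factory=list)

        def plan(self, goal: str) -> None:
            # In a real system the model writes this list itself; hard-coded here.
            self.todo = [Task(f"step {i} of: {goal}") for i in range(1, 4)]

        def run_subagent(self, task: Task) -> None:
            # Stand-in for dispatching a sub-agent with its own context window.
            print(f"sub-agent handling: {task.description}")
            task.done = True

        def run(self, goal: str) -> None:
            self.plan(goal)
            while any(not t.done for t in self.todo):
                self.run_subagent(next(t for t in self.todo if not t.done))

    Agent().run("add a --dry-run flag to the CLI")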
While true, I'm not sure I've seen an LLM define a cost function and then try to reduce that cost yet, which I am guessing is what the OP is referring to.
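For what it's worth, a sketch of what "define a cost function and reduce it" could look like wrapped around an agent; `cost()` and `propose_edit()` are hypothetical toy stand-ins, not something any current tool exposes:

    # Hypothetical: explicit cost-function-guided search around a code agent.
    def cost(candidate: str) -> float:
        """Hypothetical cost: e.g. failing tests + lint warnings + diff size."""
        return float(len(candidate))                  # toy stand-in

    def propose_edit(current: str, current_cost: float) -> str:
        """Hypothetical: ask a model for a new candidate given the last score."""
        return current[:-1]                           # toy stand-in: shrink the diff

    def minimize(initial: str, budget: int = 20) -> str:
        best, best_cost = initial, cost(initial)
        for _ in range(budget):
            candidate = propose_edit(best, best_cost)
            c = cost(candidate)
            if c < best_cost:                         # keep only strict improvements
                best, best_cost = candidate, c
        return best

    print(minimize("some overly long candidate patch"))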
Don't get me wrong, it's a fascinating and extremely dangerous technology, but it's clearly over-hyped.
If anyone tells you it's already perfect, they are bullshitting.
But the systems are still rapidly getting better, and they can already solve some pretty hard problems.
If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
> Disclaimer: I’m no expert.
OK cool, me neither.
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html