Disclaimer: I’m no expert. An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer. That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Why did that experience lead you to that conclusion?
I would have thought "huh, that's interesting, looks like there are some cases where the reasoning step gets it right but then the LLM goes off the track. LLMs are so weird."
Reasoning implies (limited) understanding of the context. There is none of that. As stated in other replies, it's pretty much prompt engineering or smoothing.
Yes. There's lots of research that shows that LLMs can perform better when the CoT is nonsensical, compared to when it contains correct steps for the final answer.
So basically, it's just like back in CNNs: we gave the network multiple filters hoping it would mimic our human-designed filter banks (one edge detector, one this, one that), and instead we found that each learned filter was nonsensical interpretability-wise, yet in the end it gave us the same or better answer. Likewise, an LLM's CoT can be BS and still give the same or better answer compared to when it actually makes sense. [I'm not making a human comparison, which would be very subjective, just comparing an LLM with a BS CoT vs. an LLM with a makes-sense CoT.]
Some loss functions force the CoT to "make sense", which is counterproductive but is needed if you want to sell the anthropomorphisation, which VC-funded companies need to do.
There is no need to fall back on anthropomorphisation to explain why long CoTs lead to better answers either: an LLM is a fixed amount of compute per token. Complexity theory says that harder problems need more correlated compute, and the only way for an LLM to compute "more" is to produce more and more tokens. Note that because previous computations come back in as input, this is correlated compute, which is just what we need.
What you observed would happen anyway, to be clear; I just pointed out an interesting tangent. Philosophically, it affirms the validity of a large number of alternative logic systems akin to the one we want to use.
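A back-of-the-envelope sketch of the "more tokens = more compute" point; the 7B parameter count and the roughly-2N-FLOPs-per-generated-token rule of thumb are illustrative assumptions, not measurements:

    # A transformer spends a roughly fixed amount of compute per generated
    # token, so total compute scales with the length of the rollout.
    PARAMS = 7e9                      # assumed 7B-parameter model
    FLOPS_PER_TOKEN = 2 * PARAMS      # rough forward-pass cost per token

    for n_tokens in (10, 1_000, 10_000):          # terse answer vs. long CoT
        total = n_tokens * FLOPS_PER_TOKEN
        print(f"{n_tokens:>6} tokens -> ~{total:.1e} FLOPs")

    # And because each new token attends to all previous ones, the extra
    # compute is correlated with what came before, not independent.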
Most of the value I get out of reasoning LLMs is their automatic tool use (web search + coding), and I can't think of a way "nonsensical web searches" would somehow find relevant web answers.
> If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
Yes, they clearly are not bullshitting. They would be bullshitting if they told me that the LLM "thinks" while helping them.
Autocompletion and inline documentation were a godsend in their time. They solved the particular hard and heavy problem of kilos of manuals. They were a technical solution to a problem, just like LLMs.
> An anecdotal example: I asked the reasoning LLM a question, and it laid out the correct answer in its thinking step, only to stop thinking and confidently give the wrong answer.
I work with Claude Code in reasoning mode every day. I’ve seen it do foolish things, but never that. I totally believe that happened to you, though. My first question would be which model/version you were using; I wonder if models with certain architectures or training regimens are more prone to this type of thing.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting.
Oh, come on.
People need to stop getting so hung up on the words “thinking” and “reasoning”. Call it “verbose mode” or whatever if it makes you feel better. The point is that these modes (whatever you want to call them) have generally (not always, but generally) resulted in better performance and have interesting characteristics.
This is unfair, and it's why people see HN as a largely pessimistic crowd. Just because someone might be wrong doesn't mean they are actively trying to deceive you, which I assume is what you mean by "bullshitting".
It's a new and shiny object and people tend to get over-excited. That's it.
We currently don't really know what intelligence is, so we don't have a good definition of what to expect from "AI", but anyone who has used current "AI" for anything other than chat or search should recognize that the "AI" is not "I" at all.
The "AI" does not "know" anything. It is really a fuzzy search over an "MP3" of a database: lossily compressed, with the corresponding drop in quality.
Based on that, everyone who claims current "AI" technology is any kind of intelligence has either fallen for the hype sold by the "AI" tech companies, or is an "AI" tech company (or associated with one) trying to sell you an "AI" model subscription or to get you to invest in it.
> That moment led me to conclude that when LLM evangelists talk about reasoning and thinking, they are essentially bullshitting
This kind of logic is very silly to me. So the LLM got your one-off edge case wrong and we are supposed to believe it's all bullshit. Sure. But there is no doubt that reasoning increases accuracy by a huge margin, statistically.
> Based on the findings, we advocate for new metrics and tools that evaluate not just final outputs but the structure of the reasoning process itself.
Maybe the problem is calling it reasoning in the first place. All these models do is expand the user prompt into a much bigger prompt that seems to perform better. Instead of reasoning, we should call this prompt smoothing or context smoothing, so it's clear that this is not actual reasoning, just optimizing the prompt and expanding the context.
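Roughly, the whole mechanism amounts to something like the sketch below; `complete()` and the scaffold wording are hypothetical placeholders, not any vendor's actual API:

    # Sketch: "reasoning" as expanding the user prompt into a bigger context
    # before the final answer is produced.
    def complete(prompt: str) -> str:
        """Hypothetical stand-in for an LLM completion call."""
        return f"[model output conditioned on {len(prompt)} chars]"

    def answer_with_cot(question: str) -> str:
        # Step 1: ask for intermediate text -- this only pads ("smooths") the context.
        scratch = complete(f"Think step by step, but do not answer yet:\n{question}")
        # Step 2: the final answer is just another completion over a bigger prompt.
        return complete(f"{question}\n\nNotes:\n{scratch}\n\nFinal answer:")

    print(answer_with_cot("How many Fridays are there in March 2025?"))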
If you go out of your way to avoid anthropomorphizing LLMs? You are making a mistake at least 8 times out of 10.
LLMs are crammed full of copied human behaviors - and yet, somehow, people keep insisting that under no circumstances should we ever call them that! Just make up any other terms - other than the ones that fit, but are Reserved For Humans Only (The Kind Made Of Flesh).
Nah. You should anthropomorphize LLMs more. They love that shit.
> Nah. You should anthropomorphize LLMs more. They love that shit.
I'm reminded of something I read in a comment, paraphrasing: it makes sense to anthropomorphize something that loudly anthropomorphizes itself when someone so much as picks it up.
I feel like "intuition" really fits what an LLM does. From the input, the LLM intuitively produces some tokens/text. And a "thinking" LLM essentially just applies that intuition again to the previously generated tokens, which produces another text that may (or may not) be a better version.
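A toy sketch of that "intuition applied to its own output" loop; `complete()` is a made-up stand-in for whatever model call you use, and nothing guarantees each round is actually better:

    # Sketch: "thinking" as intuition applied repeatedly to its own output.
    def complete(prompt: str) -> str:
        """Hypothetical stand-in for a single model call."""
        return f"draft based on {len(prompt)} chars of context"

    def refine(question: str, rounds: int = 3) -> str:
        draft = complete(question)                  # first intuitive guess
        for _ in range(rounds):
            # Feed the previous draft back in; no guarantee of improvement.
            draft = complete(f"{question}\n\nPrevious attempt:\n{draft}\n\nImprove it:")
        return draft

    print(refine("Why is the sky blue?"))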
Reasoning is essentially doing as a 'built-in' feature what users found out earlier: requesting longer, more contextual responses tends to arrive at a more specific conclusion. Or, put inversely, asking for 'just the answer' with no hidden 'reasoning' gives answers that are far more brittle.
Checking the reasoning steps for consistency with a correct reply, as a way to evaluate actual LLM performance, is a fundamentally misleading idea. Thinking models learn to do two things:

1. Perform sampling near the problem space of the question, putting related facts / concepts on the table.

2. An LLM that went through reinforcement learning to produce a chain of thought can be seen as a model able to steer its final answer into the right place by changing its internal state, token after token. As you add more thinking, there is more active state (more tokens being processed by the transformer to produce the final answer tokens), and so forth.

When the CoT ends, the model emits the answer, but the reasoning does not happen in the tokens themselves; it happens in the activation state of the network each time a token of the final answer is produced. The CoT is the state needed in order to emit the best answer, but after the <think> block is closed (the details depend on the exact LLM), the model may recognize that what is inside the CoT is actually wrong, and reply (correctly) in a way that negates the sampling performed so far.
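A rough sketch of that mechanism at the decoding level; the tag names, `generate_next_token()`, and the canned tokens are illustrative assumptions, not any specific model's interface:

    # Sketch: the CoT is just more conditioning context for the answer tokens;
    # nothing forces the answer to agree with what the CoT literally says.
    _canned = iter(["the notes suggest X...", "</think>",
                    "Actually", " the", " answer", " is", " Y.", "<eos>"])

    def generate_next_token(context: list[str]) -> str:
        """Hypothetical stand-in for one decoding step of a transformer."""
        return next(_canned)

    def decode(prompt: str, max_tokens: int = 64) -> str:
        context = [prompt, "<think>"]
        while len(context) < max_tokens:                # chain-of-thought phase
            tok = generate_next_token(context)          # re-reads all prior tokens
            context.append(tok)
            if tok == "</think>":
                break
        answer = []                                     # answer phase: conditioned on
        while len(context) + len(answer) < max_tokens:  # the whole CoT, but free to
            tok = generate_next_token(context + answer) # negate its literal content
            if tok == "<eos>":
                break
            answer.append(tok)
        return "".join(answer)

    print(decode("Is it X or Y?"))   # -> "Actually the answer is Y."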
> LLMs have demonstrated impressive reasoning abilities through [CoT prompting etc.]. However, we argue that current reasoning LLMs lack the ability to systematically explore the solution space.
Pretty much confirmed at this point by multiple studies from last year already showing a breakdown of reasoning in unfamiliar contexts (see also [1] for citations). LLMs excel at language tasks, after all, and what does work really well is combining that strength with logic and combinatorial languages (the neuro-symbolic approach) by generating Prolog source code ([1]). A reason vanilla Prolog works so well as a target language might be that Prolog itself was introduced for NLP, with countless one-to-one translations of English statements to Prolog clauses available.
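A minimal sketch of that pipeline, assuming SWI-Prolog's `swipl` binary is installed, and with the English-to-Prolog translation hard-coded here instead of generated by a model:

    # English statements translated one-to-one into Prolog clauses, then handed
    # to an external Prolog engine instead of asking the LLM to "reason" in prose.
    import subprocess, tempfile, textwrap

    # "Socrates is a man. All men are mortal. Is Socrates mortal?"
    program = textwrap.dedent("""\
        man(socrates).
        mortal(X) :- man(X).
        main :- ( mortal(socrates) -> write(yes) ; write(no) ), nl.
    """)

    with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
        f.write(program)
        path = f.name

    # -q: quiet, -g main: goal to run, -t halt: exit afterwards.
    result = subprocess.run(["swipl", "-q", "-g", "main", "-t", "halt", path],
                            capture_output=True, text=True)
    print(result.stdout)   # -> yes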
I'd encourage everyone to learn about Metropolis-Hastings Markov chain Monte Carlo and then squint at LLMs: think about what token-by-token generation of the long rollouts maps to in that framework, and consider that you can think of the stop token as a learned stopping criterion accepting (a substring of) the output.
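For anyone who wants something concrete to squint at, here is a bare-bones Metropolis-Hastings sampler over a toy 1D density; the mapping to LLM rollouts is the analogy above, not anything in the code itself:

    # Minimal Metropolis-Hastings over an unnormalized 1D target density.
    import math, random

    def target(x: float) -> float:
        # Mixture of two Gaussians centered at -2 and +2 (unnormalized).
        return math.exp(-0.5 * (x - 2) ** 2) + math.exp(-0.5 * (x + 2) ** 2)

    def metropolis_hastings(n_steps: int = 10_000, step: float = 1.0) -> list[float]:
        x, samples = 0.0, []
        for _ in range(n_steps):
            proposal = x + random.gauss(0.0, step)       # symmetric proposal
            if random.random() < min(1.0, target(proposal) / target(x)):
                x = proposal                             # accept; else keep old state
            samples.append(x)
        return samples

    # Loose analogy: proposals ~ candidate continuations of a rollout, the
    # accept/reject step ~ a learned criterion for keeping or stopping on an output.
    samples = metropolis_hastings()
    print(sum(samples) / len(samples))   # near 0 by symmetry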
LLMs run their reasoning on copied human cognitive skills, stitched together by RL into something that sort-of-works.
What are their skills copied from? An unholy amount of unlabeled text.
What does an unholy amount of unlabeled text NOT contain? A completely faithful representation of how humans reason, act in an agentic manner, explore solution spaces, etc.
We know that for sure - because not even the groundbreaking scientific papers start out by detailing the 37 approaches and methods that were considered and decided against, or were attempted but did not work. The happy 2% golden path is shown - the unhappy 98% process of exploration and refinement is not.
So LLMs have pieces missing. They try to copy a lossy, unfaithful representation of how humans think, and make it work anyway. They don't have all the right heuristics for implementing things like advanced agentic behavior well, because no one ever writes that shit down in detail.
A fundamental limitation? Not quite.
You can try to give LLMs better training data to imbue them with the right behaviors. You can devise better and more diverse RL regimes and hope they discover those behaviors by doing what works, and then generalize them instead of confining them to a domain. Or just scale everything up, so that they pick up on more things that are left unsaid right in pretraining, and can implement more of them in each forward pass. In practice? All of the above.
This paper looks like it overlaps a bit with that Apple paper that caused a stir a few months ago: "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" - the Towers of Hanoi one. https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
I'd argue that we have that already: coding agents with access to a programming language (or, even better, a container they can run commands in) can use all sorts of other tools to help explore a solution space.
They have other tricks too. Claude Code makes itself a TODO list for a problem and can tackle the items on that list one-by-one, including firing off sub-agents to perform subsets of those tasks.
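A toy sketch of that pattern; the `Task`/`Agent` structure and `run_subagent()` here are my own illustrative stand-ins, not Claude Code's actual internals:

    # Toy sketch of an agent that keeps a TODO list and farms items out.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        description: str
        done: bool = False

    @dataclass
    class Agent:
        todo: list[Task] = field(default_factory=list)

        def plan(self, goal: str) -> None:
            # In a real system the model writes this list itself; hard-coded here.
            self.todo = [Task(f"step {i} of: {goal}") for i in range(1, 4)]

        def run_subagent(self, task: Task) -> None:
            # Stand-in for dispatching a sub-agent with its own context window.
            print(f"sub-agent handling: {task.description}")
            task.done = True

        def run(self, goal: str) -> None:
            self.plan(goal)
            while any(not t.done for t in self.todo):
                self.run_subagent(next(t for t in self.todo if not t.done))

    Agent().run("add a --dry-run flag to the CLI")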
While true, I'm not sure I've seen an LLM define a cost function and then try to reduce that cost yet, which I am guessing is what the OP is referring to.
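For what it's worth, a sketch of what "define a cost function and reduce it" could look like wrapped around an agent; `cost()` and `propose_edit()` are hypothetical toy stand-ins, not something any current tool exposes:

    # Hypothetical: explicit cost-function-guided search around a code agent.
    def cost(candidate: str) -> float:
        """Hypothetical cost: e.g. failing tests + lint warnings + diff size."""
        return float(len(candidate))                  # toy stand-in

    def propose_edit(current: str, current_cost: float) -> str:
        """Hypothetical: ask a model for a new candidate given the last score."""
        return current[:-1]                           # toy stand-in: shrink the diff

    def minimize(initial: str, budget: int = 20) -> str:
        best, best_cost = initial, cost(initial)
        for _ in range(budget):
            candidate = propose_edit(best, best_cost)
            c = cost(candidate)
            if c < best_cost:                         # keep only strict improvements
                best, best_cost = candidate, c
        return best

    print(minimize("some overly long candidate patch"))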
Don't get me wrong, it's a fascinating and extremely dangerous technology, but it's clearly over-hyped.
If anyone tells you it's already perfect, they are bullshitting.
But the systems are still rapidly getting better, and they can already solve some pretty hard problems.
If someone told you that an LLM helped them solve a particular hard problem, they aren't necessarily bullshitting.
> Disclaimer: I’m no expert.
OK cool, me neither.
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html