You cannot test reasoning when you don't know what's in the training set. You have to be able to differentiate reasoning from memorization, and that's not trivial.
Moreover, the results look like they confirm that at least some memorization is going on. Do we really think GPT has not been extensively trained on arithmetic in bases 10, 8, and 16? That seems like a terrible prior. Even if not explicitly, how much code has it read that performs these tasks? How many web pages, tutorials, and Reddit posts cover oct and hex? They also haven't defined zero-shot correctly: arithmetic in these bases isn't zero-shot, it's explicitly in distribution...
I'm unsure about bases 9 and 11. It's pretty interesting that GPT-4 is much better at these. Anyone know why? Did they train on these? On more bases? It doesn't seem unreasonable, but I don't know.
The experimentation is also extremely thin. The arithmetic questions amount to only 1000 tests of adding two digits, which is certainly in the training data. I'm also unconvinced by the syntactic reasoning tasks, since the transformer (attention) architecture seems designed for exactly this, and I'm unconvinced those tasks aren't in training either. Caesar ciphers are also certainly in the training data.
The prompts are also odd and I guess that's why they're in the appendix. For example, getting GPT to be better at math or many tasks by having it write python code is not novel.
There's some substance here, but this really doesn't seem like a lot of work for 12 people from a top university and a trillion-dollar company. It's odd to see that many authors when the experiments can be run in a fairly short time.
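To make the "this is all over the training data" point concrete, here is a rough sketch in plain Python (entirely my own illustration, not the paper's data generator) of how little code it takes to produce exactly the kind of material at issue: two-digit additions in bases 8 and 16, and a Caesar shift. Snippets this short appear in countless tutorials and repositories.

    import random
    import string

    def addition_question(base=8, digits=2):
        # Build one "a + b = c" line with `digits`-digit operands in `base`.
        # Only bases 8, 10 and 16 are handled, via the stdlib format specs.
        fmt = {8: "o", 10: "d", 16: "x"}[base]
        lo, hi = base ** (digits - 1), base ** digits - 1
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        return f"{a:{fmt}} + {b:{fmt}} = {a + b:{fmt}}  (base {base})"

    def caesar(text, shift=3):
        # Textbook Caesar cipher: rotate letters, leave everything else alone.
        abc = string.ascii_lowercase
        return text.lower().translate(str.maketrans(abc, abc[shift:] + abc[:shift]))

    print(addition_question(base=8))
    print(addition_question(base=16))
    print(caesar("attack at dawn"))   # -> "dwwdfn dw gdzq"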
We can tell some of what's in the training set. One of the answers for the inductive reasoning test begins "begin from the rightmost digit". Look that phrase up in Google. It shows up in Chegg, Course Hero, and Brainly content for elementary arithmetic. If you bash on those how-to articles, available for bases 2 and 10, you can probably generate the pattern for base 8.
This looks like an LLM doing the usual LLM thing - finding relevant items and combining them to fit.
This doesn't require the level of abstraction and induction the authors impute to the LLM. Ordinary LLM behavior explains this, once you've found the relevant training data.
People often solve problems that way too, of course.
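For what it's worth, the "begin from the rightmost digit" recipe those how-to pages describe is the same few lines whatever the base. A minimal sketch (my own illustration, not the tutorials' text or the paper's prompt):

    def add_in_base(x, y, base):
        # Schoolbook addition on digit lists (most-significant digit first):
        # start from the rightmost digit, add column by column, and carry
        # whenever a column sum reaches `base`.
        x, y = x[::-1], y[::-1]            # work right-to-left
        result, carry = [], 0
        for i in range(max(len(x), len(y))):
            column = carry + (x[i] if i < len(x) else 0) + (y[i] if i < len(y) else 0)
            result.append(column % base)
            carry = column // base
        if carry:
            result.append(carry)
        return result[::-1]

    # 57 + 23 in base 8 -> [1, 0, 2], i.e. 0o102 (66 in decimal)
    print(add_in_base([5, 7], [2, 3], base=8))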
That reminds me of an old paper about "Benny's Rules", a case-study focused on a kid who seemed to be doing better than average in math tests when it came to final answers... but for all the wrong reasons, using an inferred set of arbitrary text manipulation rules.
The intent was to point out that the educational approach was flawed, but I think there are interesting parallels to token processing in LLMs, which--unlike a human child--are built in such a way that crazy partial-fit rules are likely their only option.
> Benny believed that the fraction 5/10 = 1.5 and 400/400 = 8.00 because he believed the rule was to add the numerator and denominator and then divide by the number represented by the highest place value.
https://blog.mathed.net/2011/07/rysk-erlwangers-bennys-conce...
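The quoted rule is easy to reproduce. A small sketch, assuming I'm reading the description correctly (the function name and structure are mine):

    def benny_divide(numerator, denominator):
        # Benny's inferred rule: add numerator and denominator, then divide by
        # the value of the highest place (10, 100, ...) of the larger operand.
        total = numerator + denominator
        highest_place = 10 ** (len(str(max(numerator, denominator))) - 1)
        return total / highest_place

    print(benny_divide(5, 10))      # 1.5, matching Benny's 5/10 = 1.5
    print(benny_divide(400, 400))   # 8.0, matching Benny's 400/400 = 8.00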
You'll probably find this talk [1] interesting. They control all the training data for small LLMs and then perform experiments (including reasoning experiments).
[1] Physics of LLMs: https://www.youtube.com/watch?v=yBL7J0kgldU&t=7s
How do you define memorization and reasoning? There is a large grey area in between. Some say that if you can memorize facts and algorithms and apply them to new data, that is memorization. Some say that it is reasoning.
More than that -- it's not clear that what humans do is not "just" memorization. We can always look at human experience mechanistically and say that we don't think -- we just memorized thinking patterns and apply them when speaking and "thinking".
> It's not clear that what humans do is not "just" memorization.
While I agree that there is a lot of gray in between, I think you are misrepresenting my comment. And I'm absolutely certain humans do more than memorization. Not all humans, but that's not the bar: some humans are brain damaged and some are in fact babies (and many scientists do agree that sentience doesn't appear at birth).
If you doubt me, I very much encourage you to dive deeper into the history of science and build deep knowledge of any subject, because you'll find this happening all the time. But if you apply a loose enough definition of memorization (one that wouldn't be generally agreed upon if you followed it to its logical conclusions), then yeah, everything is memorization. But everything is foo if I define everything to be foo, so let's not.
A lot of reasoning is similar to interpolation within a sparse set of observations. Memorization is rounding to the nearest known example. A basic guess is linear interpolation. And reasoning is about discovering the simplest rule that explains all the observations and using that rule to extrapolate.
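A toy version of that spectrum, with invented data and function names of my own: memorization returns the stored answer of the nearest known example, interpolation blends the two nearest neighbours, and the "reasoning" case fits the simplest rule consistent with the observations and extrapolates beyond them.

    observations = {1: 3, 2: 5, 4: 9}          # secretly generated by y = 2x + 1

    def memorize(x):
        # Round to the nearest known example and return its stored answer.
        nearest = min(observations, key=lambda k: abs(k - x))
        return observations[nearest]

    def interpolate(x):
        # Linear interpolation between the nearest known examples on either side.
        lo = max(k for k in observations if k <= x)
        hi = min(k for k in observations if k >= x)
        if lo == hi:
            return observations[lo]
        t = (x - lo) / (hi - lo)
        return observations[lo] + t * (observations[hi] - observations[lo])

    def fitted_rule(x):
        # The simplest rule that explains every observation: y = 2x + 1.
        return 2 * x + 1

    print(memorize(3), interpolate(3), fitted_rule(3))   # 5, 7.0, 7
    print(fitted_rule(100))                              # extrapolates: 201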
>> Some say that if you can memorize facts and algorithms and apply them to new data, that is memorization. Some say that it is reasoning.
Who are those "some" who say it is reasoning?
Here's a question. If you enter a command in your terminal do you expect your shell to have memorised the result of the command from some previous experience, or do you expect your shell to compute the result of your command according to its programming and your inputs? A rhetorical question: we all assume the latter.
Which one is more like most people's informal conception of "reasoning": retrieving an answer from storage, or computing an answer from the inputs given to a program?
>> We can always look at human experience mechanistically and say that we don't think -- we just memorized thinking patterns and apply them when speaking and "thinking"
I think this is confusing memorisation of the rules required to perform a computation, like a program stored in computer memory, with memorisation of the results of a computation. When we distinguish between memorisation and reasoning we usually make a distinction between computing a result from scratch and retrieving it from storage without having to re-compute it, like in caching or memoization, or getting data from a database.
For a real-world example, we memorise our times tables, but we don't memorise the result of every sum x + y = z; instead we memorise a summation algorithm that we then use to derive the sum of two numbers.
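That distinction is roughly a lookup table versus a procedure. A quick illustrative sketch (not a claim about how either brains or LLMs store things):

    # Memorization: the times table is stored outright; anything outside it is unknown.
    times_table = {(a, b): a * b for a in range(1, 13) for b in range(1, 13)}

    def recall_product(a, b):
        return times_table.get((a, b))      # None if we never memorized this entry

    # The memorized *algorithm*: apply a procedure (repeated addition) to fresh inputs.
    def compute_product(a, b):
        total = 0
        for _ in range(b):
            total += a
        return total

    print(recall_product(7, 8), compute_product(7, 8))        # 56 56
    print(recall_product(123, 45), compute_product(123, 45))  # None 5535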
> Some say that if you can memorize facts and algorithms and apply them to new data, that is memorization. Some say that it is reasoning.
Memorizing facts and algorithms is memorization. The rest of what you are talking about is not.
Applying existing knowledge to new data without deriving new information is generalization. An example is a semantic segmentation model classifying a car it has never seen. If the model was not trained on birds, it will never classify a bird as a bird.
Computation of decidable problems is a large, possibly the largest, subset of reasoning. Most humans do not struggle with solving decidable problems; the problem is that they are slow and can only handle small problem sizes. Most problems encountered in practice aren't one large decidable problem but a long chain of many small ones, dozens to hundreds of heterogeneous problems seamlessly mixed with one another. LLMs struggle with decidable problems that are out of distribution, but you can give a human instructions on how to do something they have never done before and they will follow them with no problem.
> More than that -- it's not clear that what humans do is not "just" memorization.
I hope it is clear that I did not memorize this message I am writing here and that it is the unique result of processes inside my brain that were not captured in the training process of an LLM.
> We can always look at human experience mechanistically and say that we don't think -- we just memorized thinking patterns and apply them when speaking and "thinking"
Again you are trying to twist this in an absurd direction. Consider a teleoperated humanoid robot on Mars that is controlled by a human on Earth. The robot acts exactly like a human does. Does this mean the robot is now capable of reasoning and thinking like a human, simply because it is replaying a recording of the human's body and speech? This is the argument you are making: that the robot's ability to replay a human's actions is equivalent to the processes that brought about that human action.
Discrepancies in mathematical ability between the various bases would seem to suggest memorization as opposed to generalization.
>> The arithmetic questions only have 1000 tests where they add two digits. This is certainly in the training data.
Yeah, it's confirmation bias. People do that sort of thing all the time in machine learning research, especially in the recent surge of LLM-poking papers. If they don't do it, they don't have a paper, so we're going to see much more of this before the trend exhausts itself.
Is there a good reason to exclude abductive reasoning from an analysis like this? It's even considered by at least one of the referenced papers (Fangzhi 2023a).
Abductive reasoning is common in day-to-day life. It seeks the best explanation for some (often incomplete) observations, and reaches conclusions without certainty. I would have thought it would be important to assess for LLMs.
My instinct is that it's a distinction without a difference in this context. I.e., if deductive is "I watched the cue ball hit the 8 ball, therefore the 8 ball is moving" and abductive is "the 8 ball is moving towards me, therefore the cue ball must have hit it; I cannot claim to have deduced this because I did not observe it", then LLMs cannot observe the situation, so any deduction (in the binary inductive/deductive sense) must be done by abduction.
I like to think of abductive reasoning as the basis for science that explains natural processes that happened in the past -- like astronomy and geology and evolution -- where experiments are too big to conduct or processes too slow to observe in real-time. So we propose mechanistic explanations for nonobvious outcomes like the formations of stars, or motion of large land mass via plate tectonics or glaciation, or long-range organism speciation over millennia. That's the role for abduction, to explain how all that happened.
Is abductive inference synonymous with bayesian inference?
https://plato.stanford.edu/entries/abduction/#AbdVerBayConTh...
No, but agreement with priors is one way one might choose between possibilities.
For example suppose you go outside and the streets are wet. Perhaps it rained, or perhaps someone drove a fire truck around spraying water all over the streets. You might select the former because of its higher prior probability.
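In Bayesian terms that choice is just a comparison of posteriors. A back-of-the-envelope sketch with made-up numbers:

    # Invented priors and likelihoods for the wet-street example.
    p_rain, p_firetruck = 0.20, 0.001                   # prior probability of each cause
    p_wet_given_rain, p_wet_given_truck = 0.95, 0.90    # how well each explains wet streets

    # Unnormalized posteriors: prior * likelihood (Bayes' rule without the shared denominator).
    score_rain = p_rain * p_wet_given_rain               # 0.19
    score_truck = p_firetruck * p_wet_given_truck        # 0.0009

    best = "rain" if score_rain > score_truck else "fire truck"
    print(best)   # "rain": both hypotheses explain the data, the prior breaks the tie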
They are statistical text generators, whose results are defined by their training data set. This is why the paper cited reads thusly:
> Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning ...
There is no differentiation because what was sought does not exist.
The authors then postulate:
> This raises an essential question: In LLM reasoning, which poses a greater challenge - deductive or inductive reasoning?
There is no such thing as "LLM reasoning." Therefore, the greatest challenge is accepting this fact and/or that anthropomorphism is a real thing.
I really dislike these non-sequitur arguments of "LLMs do not reason, because <detail on how they work>", as if a detail on how they work unequivocally proves the impossibility of reasoning.
I've noticed that on Hacker News, about 80% of all debates or discussions where people disagree boil down to a disagreement about the definition of a word.
> They are statistical text generators, whose results are defined by their training data set
Honestly I'm pissed at the research community. It's fraud. If you don't know what's in the training data you simply cannot differentiate reasoning from memorization.
Beyond that, the experiments are almost certainly in the training data. Like come on! I feel like I'm going crazy here. How can any coder not think LLMs are trained on oct and hex‽
https://news.ycombinator.com/item?id=41422751
The question is: would the results be largely valid if this was done on a human who had learnt how to perform base 8, 9, 11, etc. arithmetic instead?
I mean, they're clearly not trying to test the ability to derive base arithmetic from scratch.
You can write programs into transformer models and run them deterministically; they aren't "statistical".
Can you cite the deterministic part? And I don't think fixing your seeds counts as making a model not statistical, though yes, deterministic.
It's also still statistical if it gives you the same answer 99999/100000 times. You can definitely tune these things to be much more certain about certain problems but it tends to decrease capabilities in others.
I also don't like the term "black box." While they aren't transparent, they aren't completely opaque either. A big reason I don't like the term is that I feel it encourages people not to research this further. While we don't completely know what's going on, I see no reason we can't. Is there some proof that their calculations are an unbreakable encryption, or is it just a hard problem? I'm pretty sure it's the latter, and I think it's good to differentiate "we're dumb" from "indeterminate".
LLMs are statistical next-word predictors, wrapped in a stochastic next-word generator.
The next-word predictions output by the model itself are based on the patterns and statistics of the training set. The next word that is actually generated is typically a random selection per a top-k or top-p sampling scheme.
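For readers who haven't seen it, here is a bare-bones sketch of what top-p ("nucleus") sampling means in this context. The vocabulary and probabilities are invented; real models produce a distribution over tens of thousands of tokens.

    import random

    def top_p_sample(probs, p=0.9):
        # Keep the smallest set of highest-probability tokens whose total mass
        # reaches p, then sample from that truncated set.
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        nucleus, mass = [], 0.0
        for token, prob in ranked:
            nucleus.append((token, prob))
            mass += prob
            if mass >= p:
                break
        tokens, weights = zip(*nucleus)
        return random.choices(tokens, weights=weights, k=1)[0]

    # Invented next-word distribution after the prompt "The cat sat on the"
    next_word = {"mat": 0.55, "sofa": 0.25, "roof": 0.12, "moon": 0.05, "carburetor": 0.03}
    print(top_p_sample(next_word, p=0.9))   # usually "mat", sometimes "sofa" or "roof"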
My belief is that what LLMs do is best described as approximating the outputs of reasoning. This is different from reasoning itself and also different from simply regurgitating the most similar training examples.
I also think it's extremely challenging for people to understand the significance of LLM output because language is intertwined with meaning. So if you construct a valid and seemingly responsive sequence of words in response to a prompt, there is a strong illusion of purpose behind those words, even if they were assembled in a completely naive fashion.
This! One simple argument is that language is NOT a magical reasoning substance in itself, but a communication medium. It is a medium for passing meaning. So first there is a meaningful thought (worth sharing), then an agent puts a SIGNIFIER on that meaningful thought, then communicates it to the recipient. The communication medium can be a sentence; it can also be an eyewink or a tail wiggle. Or a whistle. The "language" can be created on the spot, if two subjects get the meaning of a signifier by intuition (e.g. I look at the object, you follow my gaze).
So the fallacy of the whole LLM field is the belief that language has some intrinsic meaning, or that if you mix the artifacts of language in some very smart way, meaning will emerge. But that doesn't work, because meaning occurs before the word. The text in books has no reasoning; its authors did. The machine shuffling the text fragments does not have a meaningful thought. The engineer who devised the shuffling machine had meaningful thoughts, the users of the machine have such thoughts, but not the machine itself.
To put it another way: even if there were an artificial system capable of producing meaningful thoughts, it is not the presence of language that would prove it, it's communication. Communication requires an agent (as in "agency") and an intent. We have neither in an LLM.
As to the argument that we ourselves are mere stochastic parrots - of course we can produce word salads, or fake mimics of coherent text, but that is not proof that an LLM is the way our minds work. It is just a witness to the fact that language is a flexible medium for the meanings behind it - it can just as well be used for cheating, pretending, etc.
I think "reasoning" is the best word in the English language that I know of to describe what LLMs are doing. Lots of people disagree because for about 2000 years that word always implied consciousness, and LLMs definitely don't have consciousness.
LLMs are statistical text generators whose results depend on the model and the context given. They have gotten so good because the context they can effectively operate over keeps getting really big. If you take the model and operate on a small context, you will get very uninteresting results.
The only reason it seems like it is reasoning is that it's probably stuffing a lot of reasoning into its context, and regurgitating that in ways that are statistically weighted against other things in the context about what is being reasoned about.
Frankly, even most commenters on HN don't get how LLMs operate, thinking the model itself is what knows about different bases like hex and oct, when really it searched up a bunch of material on different bases to include in the context before the model was ever activated.
Brains do not reason - they are a neural network whose results are defined by their experiences, memories, and sensory inputs.
I'm tired of reading comments from people who keep repeating that LLMs don't think, don't reason, aren't intelligent because they are not human. If your definition of the above amounts to "it's not human," it's quite useless as a comment. We know LLMs aren't biological human brains. So what?
Define what reasoning is to you. Then tell us why LLMs don't reason and why it matters.
1. Reasoning is the ability to at least carry out proofs in FOL (first-order logic). FOL can simulate Turing machines.
2. LLMs are formally equivalent to only a subset of FOL.
Why is this important? To model human mathematics, you need at least first-order logic.
These arguments have been around for decades, e.g., Penrose. I am tired of people bringing up strawmen arguments ("Not intelligent because not human!")
They can, at times, and do so best when emotion is not involved.
> I'm tired of reading comments from people who keep repeating that LLMs don't think, don't reason, aren't intelligent because they are not human.
LLMs represent a category of algorithm. Quite elegant and useful in some circumstances, but an algorithm nonetheless. A quick search produced this [0] example discussing the same.
Another reference, which may not be authoritative depending on whatever most recent edit the link produces, is [1]:
> A large language model (LLM) is a computational model capable of language generation or other natural language processing tasks. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.
> Define what reasoning is to you.
Reasoning was the process I went through to formulate this response, doing so with intent to convey meaning as best as I can, and understand as best as possible the message to which I am replying.
> Then tell us why LLMs don't reason and why it matters.
LLMs do not possess the ability to perform the process detailed above.
This is why it matters.
0 - https://github.com/rasbt/LLMs-from-scratch
1 - https://en.wikipedia.org/wiki/Large_language_model
Artificial intelligence is still intelligence, even if it is just a shallow copy of human intelligence.
What irritates me when I see comments like yours is that precise knowledge of the weaknesses of LLMs is necessary to improve LLMs. Most of the people who claim LLMs reason, or are already AGI, are basically denying the ability to improve them, since they are already perfect. Research into the limitations of the current generation of AI is unwanted, and by extension so is the next generation of AI.
The authors' conclusion that LLMs can reason inductively very well runs counter to what I've read elsewhere. A big part of doing induction is the ability to generalize a shared pattern from multiple disparate examples, recognizing the essential elements that are necessary and sufficient to satisfy that pattern's constraints. To date, I've seen consensus that LLMs can match verbs or basic relational operators across examples, thereby associating the mechanisms in similar events that lead to similar outcomes. But extending that facility further, to employing predicate-logic operators or even the simpler propositional ones, appears to fall largely outside LLM capabilities. To suggest, then, that LLMs can perform still higher-order reasoning, like the modeling of contrapositives, seems quite a stretch.
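For concreteness, "modeling contrapositives" means recognizing that A -> B and not-B -> not-A are the same claim, which a four-row truth table settles. A tiny check (my own illustration):

    from itertools import product

    def implies(p, q):
        # Material implication: false only when the premise holds and the conclusion fails.
        return (not p) or q

    # A -> B is logically equivalent to (not B) -> (not A) on every assignment.
    for a, b in product([False, True], repeat=2):
        assert implies(a, b) == implies(not b, not a)
    print("contrapositive equivalence holds on all four assignments")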
I've just successfully chatted with ChatGPT about equivalence, or at least high similarity, between QFT, neural networks, and cellular automata (referencing Wolfram's work). Does that pattern matching count?
And you were able to verify, of course, that anything new or surprising to you (as in, not a simple derivation of your own prompts) was true?
I noticed that if I ask it to tell me how good cryptocurrencies are, it'll do it, and then if I say I disagree and they're wrong, it'll simply switch and agree with me as well. The thing has no care for truth, no opinion of its own, no ability to insist; it just feeds you whatever is statistically close to your own questions.
Lagniappe: https://www.youtube.com/watch?v=9B2ww3fiX30
But any mention of LLM reasoning ability ought to address the obvious confound: the LLM is trained on examples of deductive reasoning, inductive reasoning, abductive reasoning, SAT-solver reasoning, geniuses' musings, etc. If they replicate one of those examples, then should that be called "reasoning" of any sort or not? Regurgitating those examples may even involve some generalization, if the original topics of an example are swapped out (perhaps by a nearby topic in latent space).
Given that it appears they're training and testing on synthetic problems, this objection probably does not apply to their actual results. But given the fuzziness it creates for the definition of "reasoning" of any sort, I would have expected some working definition of reasoning in the paper's abstract.
Training on Moby Dick and thus being able to regurgitate text from Moby Dick does not mean the LLM is capable of writing a new Moby Dick-like book. (Thankfully; one is more than enough!)
Because they are not.
LLMs are a better form of Bayesian inference [0] in that the generated tokens are statistically a better fit within a larger context.
0 - https://en.wikipedia.org/wiki/Bayesian_inference
The tasks used are artificially created and don't exist in the training sets. For example there's very little practical math in base 11 on the internet, or English with explicitly mixed up but rule based grammar.
While there is sometimes an exaggeration of the differences, I have always found LLMs to behave like (and have many of the weaknesses of) the left hemisphere of the brain as described in books like “The Master and His Emissary.” Considering the left is the language center, this shouldn’t be surprising.
Transformers are amazing pattern matchers and terrible use of GPUs for reasoning, which is mostly search + execution of highly non-linear programs (lambda calculus).
I love seeing Victor Taelin experimenting with parallelizing these programs (with HVM and other experiments with proof languages), but it's a bit sad how much time researchers spend on papers about existing things instead of trying to improve the state of the art in something that's most probably missing from the current models.
>> Reasoning encompasses two typical types: deductive reasoning and inductive reasoning.
I don't know about "typical" but every source that classifies reasoning (or, more appropriately, logical inference) as deductive and inductive, also includes the abductive category. This categorisation scheme goes all the way back to Charles Sanders Peirce:
'[Abduction] is logical inference ( ... ) having a perfectly definite logical form. ( ... ) Namely, the hypothesis cannot be admitted, even as a hypothesis, unless it be supposed that it would account for the facts or some of them. The form of inference, therefore, is this:
The surprising fact, C, is observed;
But if A were true, C would be a matter of course,
Hence, there is reason to suspect that A is true.' (Collected Papers of Charles Sanders Peirce. Peirce, 1958)
(Quote copied from Abduction and Induction: Essays in their Relation and Integration, Peter Flach and Antonis Kakas, eds., 2000)
Consider a logical theory, formed of rules in the form of implications like A -> B (premise A implies conclusion B). Abduction is the inference of the premises after observation of the conclusions, i.e. if A -> B AND B is observed, then A may be inferred.
That's a different inference mode than both deduction: inferring a conclusion from a premise, e.g. if A -> B AND A, then B may be inferred; and induction: inferring a rule from an observation, e.g. inferring A -> B after observing A and B. Note that this is a simplification: induction assumes a background theory of more rules A1 -> A2, .... An -> A that can be applied to the observation A and B to infer A -> B.
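A toy rendering of those three modes over a single rule, just to pin down the definitions above (the rule, names, and structure are mine):

    # One background rule: "rain -> wet_streets" (premise A implies conclusion B).
    rules = {("rain", "wet_streets")}

    def deduce(facts):
        # Deduction: from A and A -> B, conclude B.
        return {b for (a, b) in rules if a in facts}

    def abduce(observations):
        # Abduction: from B and A -> B, hypothesize A as an explanation.
        return {a for (a, b) in rules if b in observations}

    def induce(examples):
        # Induction (crudely): propose A -> B whenever A and B are observed together.
        return {(a, b) for a, b in examples}

    print(deduce({"rain"}))                    # {'wet_streets'}
    print(abduce({"wet_streets"}))             # {'rain'} -- a guess, not a certainty
    print(induce([("rain", "wet_streets")]))   # {('rain', 'wet_streets')}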
Anyway, abduction is generally associated with probabilistic reasoning, albeit informally so. That probably means that we should categorise LLM inference as abductive, since it guesses the next token according to a model of probabilities of token sequences. But that's just a, er, guess.