meroes · a year ago
My experience interacting with chain-of-thought is that it should not be likened to the rigid chains of logic/math. Step-by-step reasoning by models isn't magically imparting that much rigidity to their outputs. The strength of the chain is the strength of related contexts, which is to say much less than the math/logic done by humans. We tell ourselves we are teaching AI to do step-by-step reasoning, but admittedly, as someone who deals with models daily in this area rather than programming them, I don't see the tight, necessary connections we teach in basic math, because I see how much the models fail in ways no human past a certain age could. It's more of a search for related contexts, which is powerful, but again not how a human reasons logically. Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions. Models aren't doing that. They are leapfrogging through context. Yes, one could argue that's splitting hairs, but that's because it's hard to describe succinctly, not hard to see.
PheonixPharts · a year ago
Given that LLMs are basically doing sequential Monte Carlo (SMC) sampling in latent space, the "thought" part of chain-of-thought certainly seems more akin to the necessary warm-up period whenever you do any kind of SMC sampling.

Anyone who's done serious Bayesian stats work knows that the sampler needs to warm up for a bit before it starts sampling efficiently. I suspect something similar is happening with chain-of-thought: the model needs to wander around a bit before it gets into the correct neighborhood for sampling the answer.

leereeves · a year ago
That's quite an interesting comparison. I like the description of both as sequential Monte Carlo sampling from a desired distribution. But I think there are two crucial differences.

First, in Bayesian sampling, the initial value and first samples are not sampled from the desired distribution. In a well trained LLM, the prompt is given and the first response is sampled from the desired distribution (of text that is likely to follow the prompt).

Second, in Bayesian sampling, the fact that the samples aren't independent is an unwelcome but unsolvable problem. We want independent samples but can't generate them, so we settle for conditionally independent samples.

In an LLM, we want each sample to be dependent on the preceding text, in particular the prompt.

In summary:

Bayesian sampling - poorly chosen "prompt" (the initial sample), future samples would ideally be independent of the prompt and each other.

LLM sampling - carefully chosen prompt, future samples are ideally dependent on the prompt and on each other.

And in conclusion:

The warm-up period helps a Bayesian sampler find values that are less dependent on the initial "prompt"; that independence is exactly what we don't want in an LLM.


exe34 · a year ago
I think a lot of what humans think of as "1. 2. Therefore 3." kind of reasoning isn't different from what the LLM is doing, and not in fact any more clever than that. Plenty of people believe plenty of questionable things that they assume they have thought through but really haven't. They used the context to guess the next idea/word, often reaching the conclusions they started out with.

When you talk about ironclad conclusions, I think what happens is that we come up with those confabulations intuitively, but then we subject them to intense checking - have we defined everything clearly enough, is that leap in reasoning justified, etc.

So what I'd really like to see is a way to teach LLMs to take a vague English sentence and transform it into a form that can be run through a more formal reasoning engine.

Often, instead of asking an LLM to tell you something like how many football fields you could fit inside England, you are better off telling it to write Python code to do this, assuming get_size_football_field() and get_size_England(), both returning areas in m^2, are available.
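A rough sketch of the kind of code you might ask for (the two helpers are the hypothetical ones named above, assumed to return areas in square metres):

    # Hypothetical helpers, as assumed above; both return areas in m^2.
    # get_size_football_field() -> e.g. roughly 7_140 m^2 for a standard pitch
    # get_size_England()        -> e.g. roughly 1.3e11 m^2 for England

    def football_fields_in_england():
        field_m2 = get_size_football_field()
        england_m2 = get_size_England()
        return england_m2 / field_m2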

doctoboggan · a year ago
> I think a lot of what humans think of as "1. 2. Therefore 3." kind of reasoning isn't different from what the LLM is doing, and not in fact any more clever than that. Plenty of people believe plenty of questionable things that they assume they have thought through but really haven't. They used the context to guess the next idea/word, often reaching the conclusions they started out with.

Agreed that many/most humans behave this way, but some do not. And those who do not are the ones advancing the boundaries of knowledge and it would be very nice if we could get our LLMs to behave in the same way.

bigcat12345678 · a year ago
》Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions.

I myself have no such ability. I cannot reason beyond roughly 10 lines of golang code. That's evident in my numerous hobby puzzle solving sessions.

throwaway35777 · a year ago
> Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions. Models aren't doing that.

Sure, but the structure of human reasoning is almost identical to chains of thought. We have an auditory loop and, faced with a complex problem, we repeat the mantra "now that I know XYZ, then what..." until a good next step pops into our head and we add that to the context.

The transition function just is (currently) much better in humans.

Edit: people who disagree with this, why?

andoando · a year ago
Chain of thought in itself is pretty simple. We had logical provers in the 50s. The difficulty imo is how "thought" is modeled.

Pure logic is too rigorous, and pure statistics is too inconsistent.

RaftPeople · a year ago
> We have an auditory loop and, faced with a complex problem, we repeat the mantra "now that I know XYZ, then what..." until a good next step pops into our head and we add that to the context.

You probably should replace "auditory" with "auditory or visual or conceptual or ??? - depending on the specific human"

I don't use any kind of verbal tools (either silent or out loud) in that process, I think different people use different tools for that process.

stavros · a year ago
I think that chain-of-thought for LLMs is just helping them enhance their "memory", as it puts their reasoning into the context and helps them refer to it more readily. That's just a guess, though.
snorkel · a year ago
That's pretty much correct. An LLM is often used rather like a forecast model that can forecast the next word in a sequence of words. When it's generating output, it's just continuously forecasting (predicting) the next word of output. Your prompt is just providing the model with input data to start forecasting from. The prior output itself also becomes part of the context to forecast from. The output of "think about it step-by-step" becomes part of its own context to continue forecasting from, and hence guides its output. I know that "forecasting" is technically not the right term, but I've found it helpful for understanding what LLMs are actually doing when generating output.
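A minimal sketch of that loop, with predict_next_word() as a hypothetical stand-in for the trained model:

    # Sketch of autoregressive generation: the model keeps "forecasting" the
    # next token and feeding it back in as part of its own context.
    # predict_next_word() is a hypothetical placeholder for the model itself.

    def generate(prompt_tokens, max_new_tokens):
        context = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = predict_next_word(context)  # conditioned on prompt + prior output
            context.append(next_token)               # the output becomes context
        return context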
stygiansonic · a year ago
A simplified explanation, which I think I heard from Karpathy, is that transformer models only do computation when they generate (decode) a token. So generating more tokens (using CoT) gives the model more time to “think”.

Obviously this doesn’t capture all the nuance.

bravura · a year ago
I have another explanation. LLMs are essentially trained on "A B", i.e. is it plausible that B follows A.

There's simply a much larger space of possibilities for shorter completions, A B1, A B2, etc. that are plausible. Like if I ask you to give a short reply to a nuanced question, you could reply with a thoughtful answer, a plausible superficially correct sounding answer, convincing BS, etc.

Whereas if you force someone to explain their reasoning, the space of plausible completions reduces. If you start with convincing BS and work through it honestly, you will conclude that you should reverse course. (This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply to openly ask them to play out the consequences and walk through the impact of stuff that sounds good without much thought.)

This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.

naasking · a year ago
> This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply through openly asking them to play out the consequences and walking through the impact of stuff that sounds good without much thought.

Actually, one of the best ways is pretending to be more extreme than them. Agree with them on everything, which is disarming, but then take it a step or two even further. Then they're like, "now hang on, what about X and Y" trying to convince you to be more reasonable, and pretty soon they start seeing the holes and backtrack to a more reasonable position.

https://www.pnas.org/doi/abs/10.1073/pnas.1407055111

valine · a year ago
I think you're right. I would go a step further and say that all learning is roughly synonymous with reducing the output space, and that humans do the exact same thing. There are more ways to get the wrong answer to a math problem than there are to get the right answer. When you learn someone's name, you're narrowing your output to be a single name rather than all plausible names.

The output of a generative model is practically infinite. I suspect it's possible to continually narrow the space of completions and never converge on a single output. If this turns out to be true, it would bode well for the scalability of few-shot learning.

hackerlight · a year ago
It helps, but it still gets stuck in local optima based on what it started with. I've never seen it turn around and correct its faulty reasoning unless it tried to actually run the code and observed an Exception. If I respond with "but have you considered XYZ?", my leading question will usually cause it to correct itself, even when it wasn't incorrect.

We need some way to generate multiple independent thoughts in parallel. Each separate thought is constructed using chain of thought to improve the reliability. Then you have some way to "reduce" these multiple thoughts into a single solution. The analogy would be a human brainstorming session where we try to attack the same problem from multiple angles and we try to decorrelate each idea/approach.
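Self-consistency decoding is roughly this idea already: sample several reasoning chains independently at a non-zero temperature, then reduce them, e.g. by majority vote over the final answers. A sketch, where sample_chain_of_thought() and extract_answer() are hypothetical helpers:

    from collections import Counter

    # Sketch of "many independent thoughts, then reduce".
    # sample_chain_of_thought() samples one full reasoning trace;
    # extract_answer() pulls the final answer out of that trace.
    # Both are hypothetical helpers standing in for real model calls.

    def solve_by_voting(question, n_samples=8):
        answers = []
        for _ in range(n_samples):
            trace = sample_chain_of_thought(question, temperature=0.8)
            answers.append(extract_answer(trace))
        best_answer, _count = Counter(answers).most_common(1)[0]  # reduce step
        return best_answer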

jorl17 · a year ago
I was going to write pretty much this exact same comment. I am an amateur in how LLMs work, definitely, but I always thought this was the plausible explanation.

If I want the "assistant" LLM to tell me how much 5 times 2 is, and I feed it the line "5 * 2 = " as if it has already started giving that answer, it will very likely write 5 * 2 = 10.

Since LLMs operate on semantic relationships between tokens, the more a bunch of tokens are "close" to a given "semantic topic", the more the LLM will keep outputting tokens in that topic. It's the reason why if you ask an LLM to "review and grade poetry", eventually it starts saying the same thing even about rather different poems -- the output is so filled with the same words, that it just keeps repeating them.

Another example:

If I ask the LLM to solve me a riddle, just by itself, the LLM may get it wrong. If, however, I start the answer, unravelling a tiny bit of the problem it will very likely give the right answer, as if it's been "guided" onto the right "problem space".

By getting LLMs to "say" how they are going to solve things and to check for errors, each word basically tugs on the next one, homing in on the correct solution.

In other words:

If an LLM has to answer a question -- any question --, but right after we ask the question we "populate" its answer with some text, what text is more likely to make the LLM answer incorrectly?

- Gibberish nonsense

- Something logical and related to the problem?

Evidently, the more gibberish we give to it, the more likely it is to get it wrong, since we're moving away from the "island of relevant semantic meaning", so to speak. So if we just get the LLM to feed itself more relevant tokens, it automatically guides itself to a better answer. It's kind of like there's an "objective, ideal" sequence of tokens, and it can work as an attractor. The more the LLM outputs words, the more it gets attracted to that sequence...that...."island of relevant semantic meaning".

But, again, I know nothing of this. This is just how I view it, conceptually. It's probably very wrong.

euroderf · a year ago
> This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.

And this is why taking your time to write a detailed software help request delivers a good chance that you will solve your problem all by your lonesome.

earslap · a year ago
The autoregressive transformer architecture has a constant cost per token, no matter how hard the task is. You can ask the most complicated reasoning question, and it takes the same amount of computation to generate the next token compared to the simplest yes / no question. This is due to architectural constraints. Letting the LLM generate "scratch" data to compute (attend to relevant information) is a way of circumventing the constant cost limitation. The harder the task, the more "scratch" you need so more relevant context is available for future tokens.
visarga · a year ago
That's flatly wrong. Each successive token costs progressively more. The deeper a token is in the sequence, the more past states it has to attend to. As a proof, just remember how slow it gets when the context is large, and how snappy when you first start a chat.
WithinReason · a year ago
That's what I thought at first, but that actually doesn't make sense: the amount of work done on a string is the same even if the string is followed by padding, due to the mask used in attention. Then I realised that an LLM's working memory is limited to its activations, which can be limiting. But it can extend its working memory by writing partial results to the output and reading them back in. E.g. if you tell it to "think of a number" without telling you what it is, it can't do that: there is nowhere to store that number; it has no temporary storage other than the tape. But if you ask it to "think step by step", you let it store intermediate results (thoughts) on the tape, giving it extra storage it can use for thinking.
XenophileJKO · a year ago
So my experience creating products on GPT-3.5-Turbo is that there is an upper limit to how much instructional complexity the model can handle at a time. It isn't really about "adding computation", though you are doing that too. The key is to construct the process so that the model only has to focus on a limited scope for each decision it makes.

In effect you are kind of creating a tree structure of decisions that build off of each other. By generating intermediate tokens the model can now only pay attention to the smaller set of already collapsed decisions. It is a little more complicated than that as the model will create anticipatory behavior where intermediate steps get biased by an incorrect result that the model anticipates.

XenophileJKO · a year ago
Also, I should say it isn't just instructional complexity; ambiguity also contributes to the upper limit on capability.
_boffin_ · a year ago
One of the things I’ve been doing with the models I’ve been using with coding is adding the stack and primary dependencies in the system prompt and then asking or conversing. It has helped out a lot, or at least feels like it has.
Zondartul · a year ago
The tokens are also necessary to store information, or at least off-load it from neuron activations.

E.g. if you ask an LLM to "think about X and then do Y" and the "think about X" part is silent, the LLM has a high chance of:

a) just not doing that, or

b) thinking about it but then forgetting, because the capacity of the 'RAM' of neuron activations is unknown, but probably less than a few tokens' worth.

Actually, has anyone tried to measure how much non-context data (i.e. new data generated from context data) a LLM can keep "in memory" without writing it down?

pgorczak · a year ago
I don’t think commonly used LLM architectures have internal state that carries over between inference steps, so shouldn’t that be none? Unless you mean the previously generated tokens up to the context limit which is well defined.


ukuina · a year ago
This is true. You can get a similar effect by asking the model to plan its path first without writing any code, then asking it to review its plan for deficiencies, and finally asking it to enact the plan and write the code.
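A sketch of that three-step flow; chat() is a hypothetical helper that sends the running message history to the model and returns its reply as a string (not a specific vendor API):

    # Sketch of plan -> review -> implement, built from three chat turns.
    # chat(messages) is a hypothetical helper, not a real library call.

    def plan_review_implement(task):
        messages = [{"role": "user",
                     "content": task + "\n\nFirst, write a plan. Do not write any code yet."}]
        plan = chat(messages)

        messages += [{"role": "assistant", "content": plan},
                     {"role": "user", "content": "Review your plan for deficiencies and revise it."}]
        revised_plan = chat(messages)

        messages += [{"role": "assistant", "content": revised_plan},
                     {"role": "user", "content": "Now enact the plan and write the code."}]
        return chat(messages)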
nextaccountic · a year ago
This raises the question: why does giving them more time to "think" yield better answers, and is there any limit to that? If I make them write hundreds of pages of explanation, there must be diminishing returns of some kind. What influences the optimal amount of thinking?

My guess is that good answers tend to be well reasoned rather than short and to the point, and this is picked up in training or fine-tuning or some other step.

And probably the optimal amount of thinking has something to do with the training set or the size of the network (wild guesses).

lappa · a year ago
Look at it from an algorithmic perspective. In computer science, many algorithms take a non-constant number of steps to execute. However, in transformer models there are a limited number of decoder blocks, and a limited number of FFN layers in each block. This puts a theoretical upper bound on the complexity of the algorithms a decoder network can execute in a single token-generation pass.

This explains why GPT4 cannot accurately perform large number multiplication and decimal exponentiation. [0]

This example can extend to general natural language generation. While some answers can be immediately retrieved or generated by a "cache" / algorithm which exists in latent space, some tokens have better quality when their latent-space algorithm is executed in multiple steps.

[0] https://www.semanticscholar.org/reader/817e52b815560f95171d8...
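To make the bound concrete: schoolbook multiplication of two n-digit numbers takes on the order of n^2 sequential single-digit steps, so the work grows with the input, while a decoder's forward pass always runs the same fixed stack of layers per token. A toy illustration of the growing step count:

    # Schoolbook multiplication with a step counter: the number of
    # single-digit operations grows with the number of digits, unlike a
    # single fixed-depth forward pass.

    def multiply_with_step_count(a: int, b: int):
        steps, result = 0, 0
        for i, da in enumerate(reversed(str(a))):       # each digit of a...
            carry, partial = 0, 0
            for j, db in enumerate(reversed(str(b))):   # ...times each digit of b
                carry, digit = divmod(int(da) * int(db) + carry, 10)
                partial += digit * 10 ** (i + j)
                steps += 1
            partial += carry * 10 ** (i + len(str(b)))
            result += partial
        return result, steps

    # multiply_with_step_count(37, 42)      -> (1554, 4)
    # multiply_with_step_count(1234, 5678)  -> (7006652, 16)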

wnmurphy · a year ago
I think it's fairly simple: you're creating space for intermediary tokens to be generated, where those intermediary tokens represent "thoughts" or a simulated internal dialog.

Without that, it's analogous to asking someone a question and they immediately start responding from some information they'd heard before, rather than taking some time to have an inner dialog with themself.

tmalsburg2 · a year ago
Do LLMs not also "think" when they encode the prompt? If Karpathy's explanation is accurate, longer prompts should also help even if they don't contain additional information, just by virtue of giving more time to think.
Me1000 · a year ago
The time processing the longer prompt isn't being spent churning (i.e. "thinking") on the problem at hand, it's spent calculating attention matrices between all the tokens. The time spent on this is a function of the number of FLOPs you have available.

So no, if you just fill up your context window with garbage, the LLM will not perform better at your task/question.

rdedev · a year ago
Do you think there is a fundamental difference between masked language modelling and causal language modelling? I feel like most LLMs are decoder-only models just because they are easier to train, since their attention mask is fixed.
sadhorse · a year ago
Does every token require a full model computation?
onedognight · a year ago
No, you can cache some of the work you did when processing the previous tokens. This is one of the key optimization ideas designed into the architecture.
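A minimal single-head sketch of that caching idea (KV caching) in numpy; this is an illustrative sketch, not a real library API. Keys and values for earlier tokens are computed once and reused, so each new token only appends one row:

    import numpy as np

    d = 16
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    k_cache, v_cache = [], []   # grows by one row per generated token

    def attend_next(x):             # x: embedding of the newest token, shape (d,)
        q = x @ Wq
        k_cache.append(x @ Wk)      # cache this token's key...
        v_cache.append(x @ Wv)      # ...and value for all future steps
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()    # softmax over all cached positions
        return weights @ V          # attention output for the newest token

    for _ in range(5):              # feed tokens one at a time;
        out = attend_next(rng.standard_normal(d))  # old K/V rows are never recomputed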
tromp · a year ago
> These are the central questions in the formal study of computation. The field dates back to 1936, when Alan Turing first imagined a fanciful device, now called a Turing machine, that could perform any computation by reading and writing symbols on an infinite tape.

It dates further back to the 1920s when Moses Schönfinkel came up with Combinatory Logic [1], and the early 1930s when Alonzo Church came up with the lambda calculus [2]. These models however make a less suitable base for computational complexity theory.

[1] https://en.wikipedia.org/wiki/Moses_Sch%C3%B6nfinkel

[2] https://encyclopediaofmath.org/wiki/Lambda-calculus

YeGoblynQueenne · a year ago
Arguably it goes even further back, to Peirce and Frege, Boole, Pascal, Leibniz, and all the way to Aristotle, who was probably the first to seek a way to formalise structured thinking. Turing meant his computational apparatus as a formalisation of the way a human mathematician solves a problem by computation, i.e. by the manipulation of symbols according to a set of formal rules. In that, he followed in a long line of others who had thought about the same experience and how eminently mechanisable it is. Pascal was the first to actually do it, for arithmetic.
benreesman · a year ago
Parent has probably seen this (or everything in it), but for others who are interested in this stuff (including Schönfinkel’s work) I recommend https://youtu.be/h0OkptwfX4g.
polygamous_bat · a year ago
I think the two modes of LLM discourse ("they're conscious!" vs. "they're just next-token predictors with impressive datasets") come largely from two different groups of people: those who learned about LLMs before learning about ML fundamentals, and those who learned ML fundamentals before encountering the LLMs of today. While I fall in the second group, there is a real risk that my prior concepts about the fundamentals are limiting my view of the bigger picture, so I at least welcome the debate.

Re: chain of thought, I at least know that in practice a lot of the results from the original paper have not been quite reproducible in later attempts. Whether that is a quirk of models changing every day or something deeper, I do not know.

YeGoblynQueenne · a year ago
Instinctively I'd trust the people with the knowledge that goes farther back in time. On the other hand, I once whinged to my thesis advisor that a lot of people in machine learning don't seem to know much about older machine learning and AI work and he, with 30+ years of research on me, pointed out that people complained about that already when he was a PhD student.

There is so much work on AI that goes back about 80 years (counting from Pitts and McCulloch, because why not; or you could count from Turing) and it's very hard to both keep up with what everyone else is doing, and go deep in your own subject. e.g. you pick up a Reinforcement Learning book and it's basically attacking the same problems as in planning, and with very similar assumptions (states and action spaces) but it's like planning doesn't even exist.

Btw: they're next token predictors :P

wayeq · a year ago
> Btw: they're next token predictors :P

Why not (possibly) both? After all, we can't even define consciousness, other than "what's it like to be a bat" makes more sense to us intuitively than "what's it like to be a rock".

Consciousness may just come along for the ride with a certain amount of information processing; we just have no clue.

moffkalast · a year ago
At this point I think I'm leaning towards "organic brains are just next token predictors with impressive secondary heuristic systems".

The fact that we can get such impressive results from transformers which are such a poor approximation and completely stateless makes me think there really isn't any special sauce to it.

stolsvik · a year ago
I thought this was obvious: they lack the "inner voice" (80% of humans?) or "inner imagery" (the rest) that we humans have, so they cannot first think the problem through before answering. Thus, using the actual "output area" as a scratch pad can help them cover a larger area of reasoning before outputting an answer - just as we do.

I feel you can even see this when you ask it certain questions with “think in steps” prompting: It can output temporary thoughts which aren’t of use in the final answer - again just as we do when attacking a problem we can’t immediately answer.

Also, we humans often use pen and paper to jot down temporary and intermediary thoughts and answers. Again, LLMs don’t have that, but can use the output as something similar.

Some styles of ToT prompting actually make the LLM produce two types of output: one for its "inner voice thinking", and another for output meant for the human. The same goes when one gives the LLM method-calling abilities, or "googling": this can be seen as a way to perform thinking and reasoning, without output meant for the user, before formulating an answer.

phailhaus · a year ago
Models can't think. They use the input context to predict an output. So if you have a problem that needs to be solved iteratively, those intermediate steps need to be persisted to the context, because there is nowhere for them to go otherwise.
stavros · a year ago
> Models can't think. They use the input context to predict an output.

The first claim doesn't follow from the second. What is it about using the input to predict an output that makes you believe they can't think? What if that's all thinking is? We don't know.

andoando · a year ago
I think a fundamental difference is we are able to learn new things by reasoning through our existing knowledge. Moreover our beliefs are mostly consistent with each other, and we can be argued with and have our beliefs changed.

As far as I understand, GPT isn't going to alter its whole worldview if you show it that its thinking is flawed.

But perhaps this is possible if upon discovering a flaw, it looped through its corpus and altered its connections?

Deleted Comment

phailhaus · a year ago
It's not that the first statement follows from the second; I'm asserting that models can't think, and that what they're really doing is prediction based on context.

The fact that chain-of-thought reasoning yields significantly better results is your hint: that means that the model doesn't think like a human does when it comes up with responses. If it's not in the context, it doesn't exist. You can't ask a model "why did you answer that way" without it generating from whole cloth a plausible retroactive reason. But there is no memory, so it can't really tell you.

> What if that's all thinking is?

I think that this is roughly true. But we actually have memory outside of what we say, so when we think, all those intermediate steps are persisted in our brains. For a model, the context is the memory. If you delete your question from the context and ask it "why did you answer that way", it will have no idea what you're talking about.

sfink · a year ago
One simple reason: consider the plausibility of

    11 + 31 = 24
It's actually fairly plausible, even though it's wrong (the right answer is 42). The answer is numeric. Two digits, even, which is pretty likely when adding together 2-digit inputs. 24 is also a common answer to math problems (it has lots of factors, for one). It even has all the digits from adding 1+3 and 1+1.

Now how plausible is

    Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 40, and then adding in the 1s digits, so 1 + 1 = 2. Combining the 40 and the 2 gives 24.
That last sentence doesn't seem very likely. Or:

    Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 20, and then adding in the 1s digits, so 1 + 1 = 4. Combining the 20 and the 4 gives 24.
If you're breaking things down, you have to traverse through some territory that is lower probability than the quick wrong answer.

The argument by computational complexity is stronger, though. I just wanted to point out that the above is a confounding explanation that is sufficient for simple cases, and so may need to be ruled out before claiming that computational complexity matters.

The complexity argument is also intuitively obvious. If you think of an LLM as a type of computer that does one constant-time forward pass over the input so far on each clock cycle (and outputs a single token), then of course you can compute more if you give your computer more cycles! You can use state (even if the mechanism for transmitting the state from one cycle to the next is sharply limited).

Similarly, it's an expansion of the old problem of a single-layer perceptron not being able to compute XOR. (Here, the "cycles" are advances from one layer to the next.)

That's not to say that the nuances are obvious. Simply saying you can use multiple clock ticks doesn't really say anything about how much you can do in one tick.

activatedgeek · a year ago
I want to point out a tweet [1] that is very relevant to the miracle of CoT, and probably a simpler explanation.

  > Let's think "step by step"!

  > Another tidbit I like about data and prompts that miraculously work.
  > Searching for this phrase resulted in this website (among others),  
  > http://geteasysolution.com, containing many math step-by-step solutions. 
  > How common are they? Quite.

  > Makes you think.
[1]: https://twitter.com/yanaiela/status/1765077404043952516

FeepingCreature · a year ago
Though that justifies the specific phrase, it doesn't really contradict the usual explanations of how CoT works. Like... the phrase directs it into the conceptual space of a website that has lots of CoT examples, but if CoT didn't help it think, that wouldn't actually result in better outputs.
activatedgeek · a year ago
I hesitate to use the description "think"; it's just biasing correlations for subsequent generations.

In any case, there is at least one work that shows that CoT may not be necessary and biasing the decoding path via logit probabilities is also promising. [1]

One could argue it still doesn't contradict the benefits of CoT, but I suspect there is nothing fundamental about CoT, except that we happened to have been pre-training on sequences that use certain prompts that were easy to conceive from a human's perspective.

[1]: https://arxiv.org/abs/2402.10200