meroes · a year ago
My experience interacting with chain-of-thought is that it should not be likened to the rigid chains of logic/math. Step-by-step reasoning by models isn't magically imparting that much rigidity to their outputs. The strength of the chain is the strength of related contexts, which is to say much less than the math/logic done by humans. We tell ourselves we are teaching AI to do step-by-step reasoning, but admittedly, as someone who deals with models daily in this area rather than programming them, I don't see the tight, necessary connections we teach in basic math, because I see how much the models fail in ways no human past a certain age could. It's more of a search for related contexts, which is powerful, but again not how a human reasons logically. Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions. Models aren't doing that. They are leapfrogging through context. Yes, one could argue that's splitting hairs, but that's because it's hard to describe succinctly, not hard to see.
PheonixPharts · a year ago
Given that LLMs are basically doing sequential Monte Carlo (SMC) sampling in latent space, the "thought" part of chain-of-thought certainly seems more akin to the necessary warm-up period whenever you do any kind of SMC sampling.

Anyone who's done serious Bayesian stats work knows that the sampler needs to warm up for a bit before it starts sampling efficiently. I suspect something similar is happening with chain-of-thought: the model needs to wander around a bit before it gets into the correct neighborhood for sampling the answer.

leereeves · a year ago
That's quite an interesting comparison. I like the description of both as sequential Monte Carlo sampling from a desired distribution. But I think there are two crucial differences.

First, in Bayesian sampling, the initial value and first samples are not sampled from the desired distribution. In a well trained LLM, the prompt is given and the first response is sampled from the desired distribution (of text that is likely to follow the prompt).

Second, in Bayesian sampling, the fact that the samples aren't independent is an unwelcome but unsolvable problem. We want independent samples but can't generate them, so we settle for conditionally independent samples.

In an LLM, we want each sample to be dependent on the preceding text, in particular the prompt.

In summary:

Bayesian sampling - poorly chosen "prompt" (the initial sample), future samples would ideally be independent of the prompt and each other.

LLM sampling - carefully chosen prompt, future samples are ideally dependent on the prompt and on each other.

And in conclusion:

The warm-up period helps a Bayesian sampler find values that are less dependent on the initial "prompt"; that independence is exactly what we don't want in an LLM.


exe34 · a year ago
I think a lot of what humans think of as "1. 2. Therefore 3." kind of reasoning isn't different from what the LLM is doing, and not in fact any more clever than that. Plenty of people believe plenty of questionable things that they assume they have thought through but really haven't. They used the context to guess the next idea/word, often reaching the conclusions they started out with.

When you talk about ironclad conclusions, I think what happens is that we come up with those confabulations intuitively, but then we subject them to intense checking - have we defined everything clearly enough, is that leap in reasoning justified, etc.

So what I'd really like to see is a way to teach LLMs to take a vague English sentence and transform it into a form that can be run through a more formal reasoning engine.

Often, instead of asking an LLM to tell you something like how many football fields you could fit inside England, you are better off telling it to write Python code to do this, assuming get_size_football_field() and get_size_England(), both returning areas in m^2, are available.
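A rough sketch of the kind of code you might ask for (the two helpers are the hypothetical ones named above, assumed to return areas in square metres):

    # Hypothetical helpers, as assumed above; both return areas in m^2.
    # get_size_football_field() -> e.g. roughly 7_140 m^2 for a standard pitch
    # get_size_England()        -> e.g. roughly 1.3e11 m^2 for England

    def football_fields_in_england():
        field_m2 = get_size_football_field()
        england_m2 = get_size_England()
        return england_m2 / field_m2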

doctoboggan · a year ago
> I think a lot of what humans think of as "1. 2. Therefore 3." kind of reasoning isn't different from what the LLM is doing, and not in fact any more clever than that. Plenty of people believe plenty of questionable things that they assume they have thought through but really haven't. They used the context to guess the next idea/word, often reaching the conclusions they started out with.

Agreed that many/most humans behave this way, but some do not. And those who do not are the ones advancing the boundaries of knowledge and it would be very nice if we could get our LLMs to behave in the same way.

bigcat12345678 · a year ago
》Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions.

I myself have no such ability. I cannot reason beyond roughly 10 lines of golang code. That's evident in my numerous hobby puzzle solving sessions.

throwaway35777 · a year ago
> Humans can reason purely from the armchair, starting with very few concepts, and reach far-reaching, ironclad conclusions. Models aren't doing that.

Sure, but the structure of human reasoning is almost identical to chains of thought. We have an auditory loop and, faced with a complex problem, we repeat the mantra "now that I know XYZ, then what..." until a good next step pops into our head and we add that to the context.

The transition function just is (currently) much better in humans.

Edit: people who disagree with this, why?

andoando · a year ago
Chain of thought in itself is pretty simple. We had logical provers in the 50s. The difficulty imo is how "thought" is modeled.

Pure logic is too rigorous, and pure statistics is too inconsistent.

RaftPeople · a year ago
> We have an auditory loop and, faced with a complex problem, we repeat the mantra "now that I know XYZ, then what..." until a good next step pops into our head and we add that to the context.

You probably should replace "auditory" with "auditory or visual or conceptual or ??? - depending on the specific human"

I don't use any kind of verbal tools (either silent or out loud) in that process, I think different people use different tools for that process.

stavros · a year ago
I think that chain-of-thought for LLMs is just helping them enhance their "memory", as it puts their reasoning into the context and helps them refer to it more readily. That's just a guess, though.
snorkel · a year ago
That's pretty much correct. An LLM is often used rather like a forecast model that can forecast the next word in a sequence of words. When it's generating output, it's just continuously forecasting (predicting) the next word of output. Your prompt is just providing the model with input data to start forecasting from. The prior output itself also becomes part of the context to forecast from. The output of "think about it step-by-step" becomes part of its own context to continue forecasting from, and hence guides its output. I know that "forecasting" is technically not the right term, but I've found it helpful for understanding what LLMs are actually doing when generating output.
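A minimal sketch of that loop, with predict_next_word() as a hypothetical stand-in for the trained model:

    # Sketch of autoregressive generation: the model keeps "forecasting" the
    # next token and feeding it back in as part of its own context.
    # predict_next_word() is a hypothetical placeholder for the model itself.

    def generate(prompt_tokens, max_new_tokens):
        context = list(prompt_tokens)
        for _ in range(max_new_tokens):
            next_token = predict_next_word(context)  # conditioned on prompt + prior output
            context.append(next_token)               # the output becomes context
        return context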
stygiansonic · a year ago
A simplified explanation, which I think I heard from Karpathy, is that transformer models only do computation when they generate (decode) a token. So generating more tokens (using CoT) gives the model more time to “think”.

Obviously this doesn’t capture all the nuance.

bravura · a year ago
I have another explanation. LLMs are essentially trained on "A B", i.e. is it plausible that B follows A.

There's simply a much larger space of possibilities for shorter completions, A B1, A B2, etc. that are plausible. Like if I ask you to give a short reply to a nuanced question, you could reply with a thoughtful answer, a plausible superficially correct sounding answer, convincing BS, etc.

Whereas if you force someone to explain their reasoning, the space of plausible completions reduces. If you start with convincing BS and work through it honestly, you will conclude that you should reverse course. (This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply to openly ask them to play out the consequences and walk through the impact of stuff that sounds good without much thought.)

This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.

naasking · a year ago
> This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply through openly asking them to play out the consequences and walking through the impact of stuff that sounds good without much thought.

Actually, one of the best ways is pretending to be more extreme than them. Agree with them on everything, which is disarming, but then take it a step or two even further. Then they're like, "now hang on, what about X and Y" trying to convince you to be more reasonable, and pretty soon they start seeing the holes and backtrack to a more reasonable position.

https://www.pnas.org/doi/abs/10.1073/pnas.1407055111

valine · a year ago
I think you're right. I would go a step further and say that all learning is roughly synonymous with reducing the output space, and that humans do the exact same thing. There are more ways to get the wrong answer to a math problem than there are to get the right answer. When you learn someone's name, you're narrowing your output to be a single name rather than all plausible names.

The output of a generative model is practically infinite. I suspect it's possible to continually narrow the space of completions and never converge on a single output. If this turns out to be true, it would bode well for the scalability of few-shot learning.

hackerlight · a year ago
It helps, but it still gets stuck in local optima based on what it started with. I've never seen it turn around and correct its faulty reasoning unless it tried to actually run the code and observed an Exception. If I respond with "but have you considered XYZ?", my leading question will usually cause it to correct itself, even when it wasn't incorrect.

We need some way to generate multiple independent thoughts in parallel. Each separate thought is constructed using chain of thought to improve the reliability. Then you have some way to "reduce" these multiple thoughts into a single solution. The analogy would be a human brainstorming session where we try to attack the same problem from multiple angles and we try to decorrelate each idea/approach.
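Self-consistency decoding is roughly this idea already: sample several reasoning chains independently at a non-zero temperature, then reduce them, e.g. by majority vote over the final answers. A sketch, where sample_chain_of_thought() and extract_answer() are hypothetical helpers:

    from collections import Counter

    # Sketch of "many independent thoughts, then reduce".
    # sample_chain_of_thought() samples one full reasoning trace;
    # extract_answer() pulls the final answer out of that trace.
    # Both are hypothetical helpers standing in for real model calls.

    def solve_by_voting(question, n_samples=8):
        answers = []
        for _ in range(n_samples):
            trace = sample_chain_of_thought(question, temperature=0.8)
            answers.append(extract_answer(trace))
        best_answer, _count = Counter(answers).most_common(1)[0]  # reduce step
        return best_answer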

jorl17 · a year ago
I was going to write pretty much this exact same comment. I am an amateur in how LLMs work, definitely, but I always thought this was the plausible explanation.

If I want the "assistant" LLM to tell me how much 5 times 2 is, and I feed it the line "5 * 2 = " as if it has already started giving that answer, it will very likely write 5 * 2 = 10.

Since LLMs operate on semantic relationships between tokens, the more a bunch of tokens are "close" to a given "semantic topic", the more the LLM will keep outputting tokens in that topic. It's the reason why if you ask an LLM to "review and grade poetry", eventually it starts saying the same thing even about rather different poems -- the output is so filled with the same words, that it just keeps repeating them.

Another example:

If I ask the LLM to solve me a riddle, just by itself, the LLM may get it wrong. If, however, I start the answer, unravelling a tiny bit of the problem it will very likely give the right answer, as if it's been "guided" onto the right "problem space".

By getting LLMs to "say" how they are going to solve things and to check for errors, each word basically tugs on the next one, homing in on the correct solution.

In other words:

If an LLM has to answer a question -- any question --, but right after we ask the question we "populate" its answer with some text, what text is more likely to make the LLM answer incorrectly?

- Gibberish nonsense

- Something logical and related to the problem?

Evidently, the more gibberish we give to it, the more likely it is to get it wrong, since we're moving away from the "island of relevant semantic meaning", so to speak. So if we just get the LLM to feed itself more relevant tokens, it automatically guides itself to a better answer. It's kind of like there's an "objective, ideal" sequence of tokens, and it can work as an attractor. The more the LLM outputs words, the more it gets attracted to that sequence...that...."island of relevant semantic meaning".

But, again, I know nothing of this. This is just how I view it, conceptually. It's probably very wrong.

euroderf · a year ago
> This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering.

And this is why taking your time to write a detailed software help request delivers a good chance that you will solve your problem all by your lonesome.

earslap · a year ago
The autoregressive transformer architecture has a constant cost per token, no matter how hard the task is. You can ask the most complicated reasoning question, and it takes the same amount of computation to generate the next token compared to the simplest yes / no question. This is due to architectural constraints. Letting the LLM generate "scratch" data to compute (attend to relevant information) is a way of circumventing the constant cost limitation. The harder the task, the more "scratch" you need so more relevant context is available for future tokens.
visarga · a year ago
That's flatly wrong. Each successive token costs progressively more. The deeper a token is in the sequence, the more past states it has to attend to. As a proof, just remember how slow it gets when the context is large, and how snappy when you first start a chat.
WithinReason · a year ago
That's what I thought at first, but that actually doesn't make sense: the amount of work done on a string is the same even if the string is followed by padding, due to the mask used in attention. Then I realised that an LLM's working memory is limited to its activations, which can be limiting. But it can extend its working memory by writing partial results to the output and reading them back in. E.g. if you tell it to "think of a number" without telling you what it is, it can't do that: there is nowhere to store that number; it has no temporary storage other than the tape. But if you ask it to "think step by step", you let it store intermediate results (thoughts) on the tape, giving it extra storage it can use for thinking.
XenophileJKO · a year ago
So my experience creating products on GPT-3.5-Turbo is that there is an upper limit to how much instructional complexity the model can handle at a time. It isn't really about "adding computation", though you are doing that too. The key is to construct the process so that the model only has to focus on a limited scope for each decision it makes.

In effect you are kind of creating a tree structure of decisions that build off of each other. By generating intermediate tokens the model can now only pay attention to the smaller set of already collapsed decisions. It is a little more complicated than that as the model will create anticipatory behavior where intermediate steps get biased by an incorrect result that the model anticipates.

XenophileJKO · a year ago
Also, I should say it isn't just instructional complexity; ambiguity also contributes to the upper limit on capability.
_boffin_ · a year ago
One of the things I’ve been doing with the models I’ve been using with coding is adding the stack and primary dependencies in the system prompt and then asking or conversing. It has helped out a lot, or at least feels like it has.
Zondartul · a year ago
The tokens are also necessary to store information, or at least off-load it from neuron activations.

E.g. if you ask an LLM to "think about X and then do Y" and the "think about X" part is silent, the LLM has a high chance of:

a) just not doing that, or

b) thinking about it but then forgetting, because the capacity of the 'RAM' of neuron activations is unknown, but probably less than a few tokens' worth.

Actually, has anyone tried to measure how much non-context data (i.e. new data generated from context data) a LLM can keep "in memory" without writing it down?

pgorczak · a year ago
I don’t think commonly used LLM architectures have internal state that carries over between inference steps, so shouldn’t that be none? Unless you mean the previously generated tokens up to the context limit which is well defined.


ukuina · a year ago
This is true. You can get a similar effect by asking the model to plan its path first without writing any code, then asking it to review its plan for deficiencies, and finally asking it to enact the plan and write the code.
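A sketch of that three-step flow; chat() is a hypothetical helper that sends the running message history to the model and returns its reply as a string (not a specific vendor API):

    # Sketch of plan -> review -> implement, built from three chat turns.
    # chat(messages) is a hypothetical helper, not a real library call.

    def plan_review_implement(task):
        messages = [{"role": "user",
                     "content": task + "\n\nFirst, write a plan. Do not write any code yet."}]
        plan = chat(messages)

        messages += [{"role": "assistant", "content": plan},
                     {"role": "user", "content": "Review your plan for deficiencies and revise it."}]
        revised_plan = chat(messages)

        messages += [{"role": "assistant", "content": revised_plan},
                     {"role": "user", "content": "Now enact the plan and write the code."}]
        return chat(messages)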
nextaccountic · a year ago
This raises the question: why does giving them more time to "think" yield better answers, and is there any limit to that? If I make them write hundreds of pages of explanation, there must be diminishing returns of some kind. What influences the optimal amount of thinking?

My guess is that good answers tend to be well reasoned rather than short and to the point, and this is picked up in training or fine-tuning or some other step.

And probably the optimal amount of thinking has something to do with the training set or the size of the network (wild guesses).

lappa · a year ago
Look at it from an algorithmic perspective. In computer science, many algorithms take a non-constant number of steps to execute. However, in transformer models there are a limited number of decoder blocks, and a limited number of FFN layers in each block. This puts a theoretical upper bound on the complexity of the algorithms a decoder network can execute in a single token-generation pass.

This explains why GPT4 cannot accurately perform large number multiplication and decimal exponentiation. [0]

This example can extend to general natural language generation. While some answers can be immediately retrieved or generated by a "cache" / algorithm which exists in latent space, some tokens have better quality when their latent-space algorithm is executed in multiple steps.

[0] https://www.semanticscholar.org/reader/817e52b815560f95171d8...
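To make the bound concrete: schoolbook multiplication of two n-digit numbers takes on the order of n^2 sequential single-digit steps, so the work grows with the input, while a decoder's forward pass always runs the same fixed stack of layers per token. A toy illustration of the growing step count:

    # Schoolbook multiplication with a step counter: the number of
    # single-digit operations grows with the number of digits, unlike a
    # single fixed-depth forward pass.

    def multiply_with_step_count(a: int, b: int):
        steps, result = 0, 0
        for i, da in enumerate(reversed(str(a))):       # each digit of a...
            carry, partial = 0, 0
            for j, db in enumerate(reversed(str(b))):   # ...times each digit of b
                carry, digit = divmod(int(da) * int(db) + carry, 10)
                partial += digit * 10 ** (i + j)
                steps += 1
            partial += carry * 10 ** (i + len(str(b)))
            result += partial
        return result, steps

    # multiply_with_step_count(37, 42)      -> (1554, 4)
    # multiply_with_step_count(1234, 5678)  -> (7006652, 16)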

wnmurphy · a year ago
I think it's fairly simple: you're creating space for intermediary tokens to be generated, where those intermediary tokens represent "thoughts" or a simulated internal dialog.

Without that, it's analogous to asking someone a question and they immediately start responding from some information they'd heard before, rather than taking some time to have an inner dialog with themself.

tmalsburg2 · a year ago
Do LLMs not also "think" when they encode the prompt? If Karpathy's explanation is accurate, longer prompts should also help even if they don't contain additional information, just by virtue of giving more time to think.
Me1000 · a year ago
The time processing the longer prompt isn't being spent churning (i.e. "thinking") on the problem at hand, it's spent calculating attention matrices between all the tokens. The time spent on this is a function of the number of FLOPs you have available.

So no, if you just fill up your context window with garbage, the LLM will not perform better at your task/question.

rdedev · a year ago
Do you think there is a fundamental difference between masked language modelling and causal language modelling? I feel like most LLMs are decoder-only models just because they are easier to train, since their attention mask is fixed.
sadhorse · a year ago
Does every token require a full model computation?
onedognight · a year ago
No, you can cache some of the work you did when processing the previous tokens. This is one of the key optimization ideas designed into the architecture.
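A minimal single-head sketch of that caching idea (KV caching) in numpy; this is an illustrative sketch, not a real library API. Keys and values for earlier tokens are computed once and reused, so each new token only appends one row:

    import numpy as np

    d = 16
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

    k_cache, v_cache = [], []   # grows by one row per generated token

    def attend_next(x):             # x: embedding of the newest token, shape (d,)
        q = x @ Wq
        k_cache.append(x @ Wk)      # cache this token's key...
        v_cache.append(x @ Wv)      # ...and value for all future steps
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()    # softmax over all cached positions
        return weights @ V          # attention output for the newest token

    for _ in range(5):              # feed tokens one at a time;
        out = attend_next(rng.standard_normal(d))  # old K/V rows are never recomputed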
tromp · a year ago
> These are the central questions in the formal study of computation. The field dates back to 1936, when Alan Turing first imagined a fanciful device, now called a Turing machine, that could perform any computation by reading and writing symbols on an infinite tape.

It dates further back to the 1920s when Moses Schönfinkel came up with Combinatory Logic [1], and the early 1930s when Alonzo Church came up with the lambda calculus [2]. These models however make a less suitable base for computational complexity theory.

[1] https://en.wikipedia.org/wiki/Moses_Sch%C3%B6nfinkel

[2] https://encyclopediaofmath.org/wiki/Lambda-calculus

YeGoblynQueenne · a year ago
Arguably it goes even further back, to Peirce and Frege, Boole, Pascal, Leibniz, and all the way to Aristotle, who was probably the first to seek a way to formalise structured thinking. Turing meant his computational apparatus as a formalisation of the way a human mathematician solves a problem by computation, i.e. by the manipulation of symbols according to a set of formal rules. In that, he followed in a long line of others who had thought about the same experience and how eminently mechanisable it is. Pascal was the first to actually do it, for arithmetic.
benreesman · a year ago
Parent has probably seen this (or everything in it), but for others who are interested in this stuff (including Schönfinkel’s work) I recommend https://youtu.be/h0OkptwfX4g.
polygamous_bat · a year ago
I think the two modes of LLM discourse ("they're conscious!" vs. "they're just next-token predictors with impressive datasets") come largely from two different groups of people: those who learned about LLMs before learning about ML fundamentals, and those who learned ML fundamentals before encountering the LLMs of today. While I fall in the second group, there is a real risk that my prior concepts about the fundamentals are limiting my view of the bigger picture, so I at least welcome the debate.

Re: chain of thought, I at least know that in practice a lot of the results from the original paper have not been quite reproducible in later attempts. Whether that is a quirk of models changing every day or something deeper, I do not know.

YeGoblynQueenne · a year ago
Instinctively I'd trust the people with the knowledge that goes farther back in time. On the other hand, I once whinged to my thesis advisor that a lot of people in machine learning don't seem to know much about older machine learning and AI work and he, with 30+ years of research on me, pointed out that people complained about that already when he was a PhD student.

There is so much work on AI that goes back about 80 years (counting from Pitts and McCulloch, because why not; or you could count from Turing) and it's very hard to both keep up with what everyone else is doing, and go deep in your own subject. e.g. you pick up a Reinforcement Learning book and it's basically attacking the same problems as in planning, and with very similar assumptions (states and action spaces) but it's like planning doesn't even exist.

Btw: they're next token predictors :P

wayeq · a year ago
> Btw: they're next token predictors :P

Why not (possibly) both? After all, we can't even define consciousness, other than "what's it like to be a bat" makes more sense to us intuitively than "what's it like to be a rock".

Consciousness may just come along for the ride with a certain amount of information processing; we just have no clue.

moffkalast · a year ago
At this point I think I'm leaning towards "organic brains are just next token predictors with impressive secondary heuristic systems".

The fact that we can get such impressive results from transformers which are such a poor approximation and completely stateless makes me think there really isn't any special sauce to it.

stolsvik · a year ago
I thought this was obvious: they lack the "inner voice" (80% of humans?) or "inner imagery" (the rest) that we humans have, so they cannot first think the problem through before answering. Thus, using the actual "output area" as a scratch pad can help them cover a larger area of reasoning before outputting an answer - just as we do.

I feel you can even see this when you ask it certain questions with “think in steps” prompting: It can output temporary thoughts which aren’t of use in the final answer - again just as we do when attacking a problem we can’t immediately answer.

Also, we humans often use pen and paper to jot down temporary and intermediary thoughts and answers. Again, LLMs don’t have that, but can use the output as something similar.

Some styles of ToT prompting actually make the LLM produce two types of output: one for its "inner voice thinking", and another for output meant for the human. The same goes when one gives the LLM method-calling abilities, or "googling": this can be seen as a way to perform thinking and reasoning, without output meant for the user, before formulating an answer.

phailhaus · a year ago
Models can't think. They use the input context to predict an output. So if you have a problem that needs to be solved iteratively, those intermediate steps need to be persisted to the context, because there is nowhere for them to go otherwise.
stavros · a year ago
> Models can't think. They use the input context to predict an output.

The first claim doesn't follow from the second. What is it about using the input to predict an output that makes you believe they can't think? What if that's all thinking is? We don't know.

andoando · a year ago
I think a fundamental difference is we are able to learn new things by reasoning through our existing knowledge. Moreover our beliefs are mostly consistent with each other, and we can be argued with and have our beliefs changed.

As far as I understand, GPT isn't going to alter its whole worldview if you show it that its thinking is flawed.

But perhaps this is possible if upon discovering a flaw, it looped through its corpus and altered its connections?

Deleted Comment

phailhaus · a year ago
It's not that the first statement follows from the second; I'm asserting that models can't think, and that what they're really doing is prediction based on context.

The fact that chain-of-thought reasoning yields significantly better results is your hint: that means that the model doesn't think like a human does when it comes up with responses. If it's not in the context, it doesn't exist. You can't ask a model "why did you answer that way" without it generating from whole cloth a plausible retroactive reason. But there is no memory, so it can't really tell you.

> What if that's all thinking is?

I think that this is roughly true. But we actually have memory outside of what we say, so when we think, all those intermediate steps are persisted in our brains. For a model, the context is the memory. If you delete your question from the context and ask it "why did you answer that way", it will have no idea what you're talking about.

sfink · a year ago
One simple reason: consider the plausibility of

    11 + 31 = 24
It's actually fairly plausible, even though it's wrong (the right answer is 42). The answer is numeric. Two digits, even, which is pretty likely when adding together 2-digit inputs. 24 is also a common answer to math problems (it has lots of factors, for one). It even has all the digits from adding 1+3 and 1+1.

Now how plausible is

    Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 40, and then adding in the 1s digits, so 1 + 1 = 2. Combining the 40 and the 2 gives 24.
That last sentence doesn't seem very likely. Or:

    Show your work. 11 + 31 = the result of adding the 10s digits together, so 10 + 30 = 20, and then adding in the 1s digits, so 1 + 1 = 4. Combining the 20 and the 4 gives 24.
If you're breaking things down, you have to traverse through some territory that is lower probability than the quick wrong answer.

The argument by computational complexity is stronger, though. I just wanted to point out that the above is a confounding explanation that is sufficient for simple cases, and so may need to be ruled out before claiming that computational complexity matters.

The complexity argument is also intuitively obvious. If you think of an LLM as a type of computer that does one constant-time forward pass over the input so far on each clock cycle (and outputs a single token), then of course you can compute more if you give your computer more cycles! You can use state (even if the mechanism for transmitting the state from one cycle to the next is sharply limited).

Similarly, it's an expansion of the old problem of a single-layer perceptron not being able to compute XOR. (Here, the "cycles" are advances from one layer to the next.)

That's not to say that the nuances are obvious. Simply saying you can use multiple clock ticks doesn't really say anything about how much you can do in one tick.

activatedgeek · a year ago
I want to point out a tweet [1] that is very relevant to the miracle of CoT, and probably a simpler explanation.

  > Let's think "step by step"!

  > Another tidbit I like about data and prompts that miraculously work.
  > Searching for this phrase resulted in this website (among others),  
  > http://geteasysolution.com, containing many math step-by-step solutions. 
  > How common are they? Quite.

  > Makes you think.
[1]: https://twitter.com/yanaiela/status/1765077404043952516

FeepingCreature · a year ago
Though that justifies the specific phrase, it doesn't really contradict the usual explanations of how CoT works. Like... the phrase directs it into the conceptual space of a website that has lots of CoT examples, but if CoT didn't help it think, that wouldn't actually result in better outputs.
activatedgeek · a year ago
I hesitate to use the description "think"; it's just biasing correlations for subsequent generations.

In any case, there is at least one work that shows that CoT may not be necessary and biasing the decoding path via logit probabilities is also promising. [1]

One could argue it still doesn't contradict the benefits of CoT, but I suspect there is nothing fundamental about CoT, except that we happened to have been pre-training on sequences that use certain prompts that were easy to conceive from a human's perspective.

[1]: https://arxiv.org/abs/2402.10200