Decoder-only LLMs are Markov chains with sophisticated models of the state space. Anyone familiar with Hamiltonian Monte Carlo (HMC) will know that for good results you need a warm-up period so that you're sampling from the typical set, which is the region where most of the probability mass concentrates (not necessarily the region of highest density or the maximum-likelihood point).
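A quick toy illustration of what "typical set" means here (my own sketch, not from anything above): draw samples from a standard d-dimensional Gaussian and look at how far they land from the mode. Essentially none of the mass sits near the point of highest density; it concentrates in a thin shell at radius around sqrt(d).

    # Toy sketch (mine): in high dimensions, samples from a standard Gaussian
    # concentrate in a shell at radius ~sqrt(d), far from the maximum-density
    # point at the origin.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (1, 10, 100, 1000):
        samples = rng.standard_normal((10_000, d))
        radii = np.linalg.norm(samples, axis=1)
        print(f"d={d:5d}  mean radius={radii.mean():7.2f}  sqrt(d)={np.sqrt(d):7.2f}")

The origin is the maximum-likelihood point, but once d is large basically no samples land anywhere near it, which is the distinction the parenthetical above is making.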
I have spent a lot of time experimenting with Chain of Thought professionally and I have yet to see any evidence to suggest that what's happening with CoT is any more (or less) than this. If you let the model run a bit longer it enters a region close to the typical set and when it's ready to answer you have a high probability of getting a good answer.
There's absolutely no "reasoning" going on here, except that sometimes sampling from the typical set near the region of your answer is going to look very similar to how humans reason before coming up with an answer.
If I'm using an MCMC algorithm to sample a probability distribution, I need to wait for my Markov chain to converge to a stationary distribution before sampling, sure.
But in no way is 'a good answer' a stationary state in the LLM Markov chain. If I continue running next-token prediction, I'm not going to start looping.
I think you're confusing the sampling process and the convergence of those samples with the warm-up process (also called 'burn-in') in HMC. When doing HMC we typically don't start sampling right away (or, more precisely, we throw out those early samples) because we may be initializing the sampler in a part of the distribution with pretty low probability density. After the chain has run a while it tends to end up sampling from the typical set, which, especially in high-dimensional distributions, more correctly represents the distribution we actually want to integrate over.
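Here's a minimal sketch of the burn-in idea (plain random-walk Metropolis rather than HMC, purely to keep it short, and with made-up numbers): if the chain starts way out in the tail, the early samples badly misrepresent the target until it has wandered into the typical set, which is exactly why they get discarded.

    # Minimal random-walk Metropolis sketch (not HMC) of why warm-up samples
    # are thrown away: the chain is deliberately initialized far from the
    # typical set of a standard normal target.
    import numpy as np

    rng = np.random.default_rng(1)

    def log_target(x):
        return -0.5 * x**2  # standard normal, up to a constant

    x = 50.0                # deliberately terrible initialization
    chain = []
    for _ in range(5_000):
        proposal = x + rng.normal(scale=1.0)
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        chain.append(x)

    chain = np.array(chain)
    warmup = 1_000
    print("mean, warm-up included: ", chain.mean())
    print("mean, warm-up discarded:", chain[warmup:].mean())

The second estimate is close to the true mean of 0; the first is visibly biased by the samples taken while the chain was still drifting in from x = 50.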
So for language, when I say "Bob has three apples, Jane gives him four and Judy takes two; how many apples does Bob have?" we're actually pretty far from the part of the linguistic manifold where the correct answer is likely to be. As the chain wanders this space it gets closer, until it finally statistically follows a path like "the answer is...", and when it's sampling along that path it's in a much more likely neighborhood of the correct answer. That is, after wandering a bit, more and more of the possible paths are closer to where the actual answer lies than they would be if we had just forced the model to choose early.
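To make that last sentence concrete with a deliberately hand-built toy (every number below is invented, nothing comes from a real model): the correct answer is 5, since 3 + 4 - 2 = 5, and the point is just that conditioning on an intermediate "working" token can shift mass toward it compared to forcing an immediate answer.

    # Hand-built toy (all numbers invented) illustrating how conditioning on an
    # intermediate token can shift probability mass toward the correct answer
    # ("5", since 3 + 4 - 2 = 5).
    P_answer_immediate = {"5": 0.35, "7": 0.40, "9": 0.25}

    # Distribution over an intermediate "working" token, then answer given it.
    P_working = {"3+4-2": 0.7, "3+4": 0.3}
    P_answer_given_working = {
        "3+4-2": {"5": 0.90, "7": 0.05, "9": 0.05},
        "3+4":   {"5": 0.20, "7": 0.75, "9": 0.05},
    }

    # Marginal probability of each answer after one intermediate step.
    P_answer_after_step = {
        a: sum(P_working[w] * P_answer_given_working[w][a] for w in P_working)
        for a in P_answer_immediate
    }

    print("forced to answer early:", P_answer_immediate["5"])                  # 0.35
    print("after an intermediate step:", round(P_answer_after_step["5"], 3))   # 0.69

Whether real models behave like this toy is of course the whole debate in this thread; the snippet only spells out the shape of the claim.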
edit: Michael Betancourt has a great introduction to HMC which covers warm-up and the typical set: https://arxiv.org/pdf/1701.02434 (he has a ton more content that dives much more deeply into the specifics)
Right, you're describing sampling a single token, which is equivalent to sampling one step of the Markov chain. When generating output you repeat this process and update your state sequentially, which is the definition of a Markov chain: the next token depends only on the current state (the embedding/context), and is conditionally independent of everything earlier given that state.
Every response from an LLM is essentially the sampling of a Markov chain.
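A hedged sketch of that framing (shapes and the stand-in logits_fn are mine, not any particular model's API): if the "state" is the whole token context so far, then autoregressive decoding is a Markov chain whose transition kernel is the model's softmax over the next token given that state.

    # Illustrative sketch: autoregressive decoding viewed as a Markov chain
    # whose state is the current token context. `logits_fn` is a stand-in for
    # a real model's forward pass.
    import numpy as np

    rng = np.random.default_rng(2)
    VOCAB = 100

    def logits_fn(state):
        # Any deterministic function of the current state works for the
        # Markov-chain framing; a real LLM would run a transformer here.
        return np.cos(np.add.outer(state[-4:], np.arange(VOCAB))).sum(axis=0)

    def step(state, temperature=1.0):
        logits = logits_fn(state) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        next_token = rng.choice(VOCAB, p=probs)   # one draw from the kernel
        return state + [int(next_token)]          # new state = old state + token

    state = [1, 2, 3, 4]   # the prompt is the initial state
    for _ in range(20):
        state = step(state)
    print(state)

The prompt plays the role of the chain's initial state, which is where the warm-up analogy gets contested below.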
How does MC warm-up fit with LLMs? With LLMs you start with a prompt, so I don't see how "warm up" applies.
You're not just sampling from them starting from some arbitrary initial state, the way you do in typical MCMC setups.
> If you let the model run a bit longer it enters a region close to the typical set and when it's ready to answer you have a high probability of getting a good answer.
What does "let the model run a bit longer" even mean in this context?
The way almost all modern LLMs work is by starting with a "thinking" phase which is explicitly not part of the output. You let that process run longer.
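For concreteness, here's roughly what that looks like at the text level; the <think> ... </think> delimiters are how some open reasoning models mark the hidden span, but the exact format varies by model and the example text below is made up, so treat this as a hedged sketch rather than any specific vendor's interface.

    # Hedged sketch: separating a model's hidden "thinking" span from the
    # visible answer. The <think>...</think> convention is used by some open
    # reasoning models; the raw text here is invented for illustration.
    raw_output = (
        "<think>Bob starts with 3 apples, gains 4 (7), loses 2 (5).</think>"
        "Bob has 5 apples."
    )

    def split_thinking(text, open_tag="<think>", close_tag="</think>"):
        if open_tag in text and close_tag in text:
            start = text.index(open_tag) + len(open_tag)
            end = text.index(close_tag)
            return text[start:end], text[end + len(close_tag):]
        return "", text

    thinking, answer = split_thinking(raw_output)
    print("hidden reasoning:", thinking)
    print("visible answer:  ", answer)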
This is Dual Process Theory [0], otherwise known as Fast vs Slow thinking, or System 1 and System 2 thinking.
Humans are the only organism known to do System 2 (which doesn't mean we're the only ones that do it, just that we don't know whether, say, whales do it), but System 2 is what the author is talking about when they refer to Chains of Thought.
System 1 is what they're referring to when they talk about Messi reacting to an unusual situation on the field.
Related anecdote: I tested myself for ADHD by taking amphetamines. I normally think by intuitive leaps from point to point, without doing the intermediate steps consciously. I found that during this experience my System 2 thinking was fast enough to follow and I actually experienced proper chains of thought. Or I was whizzing my tits off and hallucinated the whole thing. Not sure yet. I should repeat the experiment.

[0] https://en.wikipedia.org/wiki/Dual_process_theory
> I tested myself for ADHD by taking amphetamines. [...] Or I was whizzing my tits off and hallucinated the whole thing. Not sure yet.
You can't test yourself for ADHD by taking amphetamines because they have profound effects on everyone - there's a reason why stimulants are some of the most popular recreational drugs. They can make anyone feel smarter, more productive, like they're on top of the world, especially at recreational doses.
When it comes to therapeutic doses for treating ADHD, the effects of the medication should be very subtle; you shouldn't "feel it working" like you've just taken that pill from Limitless. It can produce some euphoria or other recreational effects initially, but those are supposed to disappear within a couple of days or weeks; otherwise your dosage is considered too high.
> I tested myself for ADHD by taking amphetamines.
You cannot test yourself for ADHD - you need to do the actual work that is required for a proper diagnosis. And you cannot test yourself for anything at all by taking amphetamines. Seriously. You were using them recreationally. When you post stuff like this online, people who believe what you write could get hurt.
Interesting that he came to this conclusion (CoT should be done in latent space) well before the release of OpenAI's o1, which made explicit CoT reliable in the first place. At the time the blog post was written, CoT was only achieved via a "reason step by step" instruction, which was highly error-prone compared to modern o1-like reasoning. (And before InstructGPT/ChatGPT, it was achieved by prompting the model with "let me reason step by step".)
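For anyone who didn't live through that era, the recipe really was just a prompt suffix along these lines (call_model below is a placeholder, not a real client library; "Let's think step by step" is the classic zero-shot CoT phrasing):

    # Hedged sketch of pre-o1 "prompted" chain of thought. `call_model` is a
    # placeholder for whatever LLM API you happen to be using.
    def call_model(prompt: str) -> str:
        raise NotImplementedError("stand-in for an actual LLM API call")

    question = (
        "Bob has three apples, Jane gives him four and Judy takes two. "
        "How many apples does Bob have?"
    )

    # Pre-o1 style: ask for the reasoning explicitly in the prompt itself,
    # then parse the final answer out of the free-form completion.
    cot_prompt = question + "\nLet's think step by step."
    # answer = call_model(cot_prompt)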
Isn't this a form of intuitive intelligence which doesn't rely on "reasoning"? To me "reasoning" sounds like intentionally trying to solve some explicit problem, while another form of ... insight is the ability to figure out that something is a problem in the first place.
That's, by the way, something LLMs are very much not good at. They possess a superhuman amount of knowledge covering all areas of academia, including math, science, philosophy, engineering, computer science, social sciences and so on, but that doesn't cause them to come up with novel hypotheses and theories, something that would be easy for a smart human even with a fraction of the academic knowledge of an LLM.
I don't see the equivalence to MCMC. It's not like we have a complex probability function that we are trying to sample from using a chain.
It's just logistic regression at each step.
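To make the "logistic regression at each step" point concrete (a sketch with made-up shapes, not any particular architecture): the last step of next-token prediction is a linear map from the hidden state to vocabulary logits followed by a softmax, i.e. multinomial logistic regression conditioned on the hidden state. The disagreement upthread is about whether chaining those draws together earns the MCMC analogy.

    # Sketch of the "multinomial logistic regression at each step" view: a
    # linear map from hidden state to vocabulary logits, then a softmax.
    # Shapes are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(3)
    d_model, vocab = 16, 50

    W = rng.standard_normal((vocab, d_model)) * 0.1  # unembedding matrix
    b = np.zeros(vocab)

    def next_token_distribution(hidden_state):
        logits = W @ hidden_state + b
        z = np.exp(logits - logits.max())
        return z / z.sum()            # softmax = multinomial logistic regression

    h = rng.standard_normal(d_model)  # stand-in for the model's hidden state
    p = next_token_distribution(h)
    print(p.shape, p.sum())           # (50,) 1.0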
Well, no, it proves that Messi can reason efficiently without inner speech.
https://arxiv.org/abs/2507.06203
Pedantic maybe -- but does this need two plurals?