The fact that it was ever seriously entertained that a "chain of thought" was giving some kind of insight into the internal processes of an LLM bespeaks the lack of rigor in this field. The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it! They aren't references to internal concepts, the model is not aware that it's doing anything so how could it "explain itself"?
CoT improves results, sure. And part of that is probably because you are telling the LLM to add more things to the context window, which increases the potential of resolving some syllogism in the training data: One inference cycle tells you that "man" has something to do with "mortal" and "Socrates" has something to do with "man", but two cycles will spit those both into the context window and let you get statistically closer to "Socrates" having something to do with "mortal". But given that the training/RLHF for CoT revolves around generating long chains of human-readable "steps", it can't really be explanatory for a process which is essentially statistical.
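To make the "more things in the context window" point concrete, here is a minimal sketch; the complete() helper is a hypothetical stand-in for whatever completion API or local model you use, and nothing below is from the article:

    # Hypothetical stand-in for a single LLM completion call; any API or local model would do.
    def complete(prompt: str) -> str:
        return "(model output would go here)"

    question = "All men are mortal. Socrates is a man. Is Socrates mortal?"

    # Pass 1: ask for intermediate "steps". Whatever the model emits is now just
    # literal text sitting in the context window.
    steps = complete(question + "\nThink step by step before answering.")

    # Pass 2: the final answer is conditioned on those extra tokens in exactly
    # the same way it is conditioned on the user's own words.
    answer = complete(question + "\n" + steps + "\nFinal answer:")

The point of the sketch is only that the "reasoning" is extra conditioning text, not a privileged channel into the model's internals.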
>internal concepts, the model is not aware that it's doing anything so how could it "explain itself"
This in a nutshell is why I hate that all this stuff is being labeled as AI. It's advanced machine learning (another term that also feels inaccurate, but I concede is at least closer to what's happening conceptually).
Really, LLMs and the like still lack any model of intelligence. It's, in the most basic of terms, algorithmic pattern matching mixed with statistical likelihoods of success.
And that can get things really, really far. There are entire businesses built on doing that kind of work (particularly in finance) with very high accuracy and usefulness, but it's not AI.
While I agree that LLMs are hardly sapient, it's very hard to make this argument without being able to pinpoint what a model of intelligence actually is.
"Human brains lack any model of intelligence. It's just neurons firing in complicated patterns in response to inputs based on what statistically leads to reproductive success"
We don't have a complete enough theory of neuroscience to conclude that much of human "reasoning" is not "algorithmic pattern matching mixed with statistical likelihoods of success".
Regardless of how it models intelligence, why is it not AI? Do you mean it is not AGI? A system that can take a piece of text as input and output a reasonable response is obviously exhibiting some form of intelligence, regardless of the internal workings.
One of the earliest things that defined what AI meant were algorithms like A*, and then rules engines like CLIPS. I would say LLMs are much closer to anything that we'd actually call intelligence, despite their limitations, than some of the things that defined the term for decades.
This is a discussion of semantics. First, I spent much of my career in high-end quant finance, and what we are doing today is night and day different in terms of generality and effectiveness. Second, almost all the hallmarks of AI I carried with me prior to 2001 have more or less been ticked off - general, semantically aware natural language parsing and human-like responses, the ability to process abstract concepts, reason abductively, synthesize complex concepts. The fact that it's not aware - which it absolutely is not - does not make it not -intelligent-.
The thing people latch onto is modern LLMs' inability to reliably reason deductively or solve complex logical problems. However, this isn't a sign of human intelligence, as these are learned, not innate, skills, and even the most "intelligent" humans struggle at being reliable at them. In fact, classical AI techniques are often quite good at these things already, and I don't find improvements there world-changing. What I find unique about human intelligence is its abductive ability to reason in ambiguous spaces, with error at times but with success at most others. This is something LLMs actually demonstrate with a remarkably human-like intelligence. This is earth-shattering, science-fiction material. I find all the pooh-poohing and goalpost-shifting disheartening.
What they don’t have is awareness. Awareness is something we don’t understand about ourselves. We have examined our intelligence for thousands of years and some philosophies like Buddhism scratch the surface of understanding awareness. I find it much less likely we can achieve AGI without understanding awareness and implementing some proximate model of it that guides the multi modal models and agents we are working on now.
The neural network inside your microprocessor that estimates whether a branch will be taken is also AI. A pattern recognition program that takes a video and decides where you stop on the image and where the background starts is also AI. A cargo scheduler that takes all the containers you have to put on a ship and their destinations and tells you where and in what order you have to put them is also an AI. A search engine that compares your query with the text on each page and tells you what is closer is also an AI. A sequence of "if"s that controls a character in a video game and decides what action it will take next is also an AI.
Stop with that stupid idea that AI is some otherworldly thing; that was never true.
We're just rehashing "Can a submarine swim?"
But we've already moved beyond LLMs. We have models that handle text, image, audio, and video all at once. We have models that can sense the tone of your voice and respond accordingly. Whether you define any of this as "intelligence" or not is just a linguistic choice.
It's literally the name of the field. I don't understand why (some) people feel so compelled to act vain about it like this.
Trying to gatekeep the term is such a blatantly flawed idea, it'd be comical to watch people play into it if it weren't so pitiful.
It disappoints me that this cope has proliferated far enough that garbage like "AGI" is something you can actually come across in literature.
This is also why I think the current iterations won't converge on any actual type of intelligence.
It doesn't operate on the same level as (human) intelligence; it's a very path-dependent process.
Every step you add down this path increases entropy as well, and while further improvements and bigger context windows help, eventually you reach a dead end where it degrades.
You'd almost need every step of the process to mutate the model to update global state from that point.
From what I've seen the major providers kind of use tricks to accomplish this, but it's not the same thing.
>The fact that it was ever seriously entertained that a "chain of thought" was giving some kind of insight into the internal processes of an LLM
Was it ever seriously entertained? I thought the point was not to reveal a chain of thought, but to produce one. A single token's inference must happen in constant time. But an arbitrarily long chain of tokens can encode an arbitrarily complex chain of reasoning. An LLM is essentially a finite state machine that operates on vibes - by giving it infinite tape, you get a vibey Turing machine.
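To spell out the "tape" picture, here is a rough sketch, not anyone's actual serving loop; step() is a hypothetical stand-in for one fixed-cost forward pass plus sampling:

    EOS = 0  # hypothetical end-of-sequence token id

    # One forward pass: fixed compute, mapping a token prefix to one next token.
    def step(tokens: list[int]) -> int:
        return EOS  # a real model would sample from its output distribution here

    tokens = [101, 7592]  # the prompt, already tokenized (made-up ids)

    # The "tape": the only state carried from one iteration to the next is the
    # growing token list itself. A longer chain of thought means more trips around
    # this loop, not a different kind of computation inside any single trip.
    while len(tokens) < 4096 and (nxt := step(tokens)) != EOS:
        tokens.append(nxt)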
It was, but I wonder to what extent it is based on the idea that a chain of thought in humans shows how we actually think. If you have chain of thought in your head, can you use it to modify what you are seeing, have it operate twice at once, or even have it operate somewhere else in the brain? It is something that exists, but the idea it shows us any insights into how the brain works seems somewhat premature.
Yes! By Anthropic! Just a few months ago!
https://www.anthropic.com/research/alignment-faking
The real answer is... We don't know how much it is or isn't. There's little rigor in either direction.
The models outlined in the white paper have a training step that uses reinforcement learning _without human feedback_. They're referring to this as "outcome-based RL". These models (DeepSeek-R1, OpenAI o1/o3, etc.) rely on the "chain of thought" process to get a correct answer, then they summarize it so you don't have to read the entire chain of thought. DeepSeek-R1 shows the chain of thought and the answer; OpenAI hides the chain of thought and only shows the answer. The paper is measuring how often the summary conflicts with the chain of thought, which is something you wouldn't be able to see if you were using an OpenAI model. As another commenter pointed out, this kind of feels like a jab at OpenAI for hiding the chain of thought.
The "chain of thought" is still just a vector of tokens. RL (without-human-feedback) is capable of generating novel vectors that wouldn't align with anything in its training data. If you train them for too long with RL they eventually learn to game the reward mechanism and the outcome becomes useless. Letting the user see the entire vector of tokens (and not just the tokens that are tagged as summary) will prevent situations where an answer may look or feel right, but it used some nonsense along the way. The article and paper are not asserting that seeing all the tokens will give insight to the internal process of the LLM.
> They aren't references to internal concepts, the model is not aware that it's doing anything so how could it "explain itself"?
I can't believe we're still going over this, a few months into 2025. Yes, LLMs model concepts internally; this has been demonstrated empirically many times over the years, including by Anthropic themselves, who have released several papers showing exactly that - including one just a week ago saying that they can not only find specific concepts in specific places of the network (this was done over a year ago) or the latent space (that one harks back all the way to word2vec), but can actually trace which specific concepts are being activated as the model processes tokens and how they influence the outcome, and can even suppress them on demand to see what happens.
State of the art (as of a week ago) is here: https://www.anthropic.com/news/tracing-thoughts-language-mod... - it's worth a read.
> The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it!
That "optimize" there is load-bearing, it's only missing "just".
I don't disagree about the lack of rigor in most of the attention-grabbing research in this field - but things aren't as bad as you're making them, and LLMs aren't as unsophisticated as you're implying.
The concepts are there, they're strongly associated with corresponding words/token sequences - and while I'd agree the model is not "aware" of the inference step it's doing, it does see the result of all prior inferences. Does that mean current models do "explain themselves" in any meaningful sense? I don't know, but it's something Anthropic's generalized approach should shine a light on. Does that mean LLMs of this kind could, in principle, "explain themselves"? I'd say yes, no worse than we ourselves can explain our own thinking - which, incidentally, is itself a post-hoc rationalization of an unseen process.
Yes, but to be fair we're much closer to rationalizing creatures than rational ones. We make up good stories to justify our decisions, but it seems unlikely they are at all accurate.
It's even worse - the more we believe ourselves to be rational, the bigger blind spot we have for our own rationalizing behavior. The best way to increase rationality is to believe oneself to be rationalizing!
It's one of the reasons I don't trust bayesians who present posteriors and omit priors. The cargo cult rigor blinds them to their own rationalization in the highest degree.
Rationalization is an exercise of (abuse of?) the underlying rational skill.
Yeah, I've been beating this drum for a while [0]:
1. The LLM is a nameless ego-less document-extender.
2. Humans are reading a story document and seeing words/actions written for fictional characters.
3. We fall for an illusion (esp. since it's an interactive story) and assume the fictional-character and the real-world author are one and the same: "Why did it decide to say that?"
4. Someone implements "chain of thought" by tweaking the story type so that it is film noir. Now the documents have internal dialogue, in the same way they already had spoken lines or actions from before.
5. We excitedly peer at these new "internal" thoughts, mistakenly thinking that (A) they are somehow qualitatively different or causal and that (B) they describe how the LLM operates, rather than being just another story-element.
[0] https://news.ycombinator.com/item?id=43198727
This article counters a significant portion of what you put forward.
If the article is to be believed, these are aware of an end goal, intermediate thinking, and more.
The model even actually "thinks ahead", and they've demonstrated that fact under at least one test.
So the model thinks ahead but cannot reason about its own thinking in a real way. It is rationalizing, not rational.
Yep. Chain of thought is just more context disguised as "reasoning". I'm saying this as a RLHF'er going off purely what I see. Never would I say there is reasoning involved. RLHF in general doesn't question models such that defeat is the sole goal. Simulating expected prompts is the game most of the time. So it's just a massive blob of context. A motivated RLHF'er can defeat models all day. Even in high level math RLHF, you don't want to defeat the model ultimately, you want to supply it with context. Context, context, context.
Now you may say, of course you don't just want to ask "gotcha" questions to a learning student, so it'd be unfair to do that to LLMs. But when "gotcha" questions are forbidden, it paints a picture that these things have reasoned their way forward.
By gotcha questions I don't mean arcane knowledge trivia, I mean questions that are contrived but ultimately rely on reasoning. Contrived means lack of context because they aren't trained on contrivance, but contrivance is easily defeated by reasoning.
> They aren't references to internal concepts, the model is not aware that it's doing anything so how could it "explain itself"?
You should read OpenAI's brief on the issue of fair use in its cases. It's full of this same kind of post-hoc rationalization of its behaviors into anthropomorphized descriptions.
I agree. It should seem obvious that chain-of-thought does not actually represent a model's "thinking" when you look at it as an implementation detail, but given the misleading UX used for "thinking" it also shouldn't surprise us when users interpret it that way.
When we get to the point where a LLM can say "oh, I made that mistake because I saw this in my training data, which caused these specific weights to be suboptimal, let me update it", that'll be AGI.
But as you say, currently, they have zero "self awareness".
That’s holding LLMs to a significantly higher standard than humans. When I realize there’s a flaw in my reasoning I don’t know that it was caused by specific incorrect neuron connections or activation potentials in my brain, I think of the flaw in domain-specific terms using language or something like it.
Outputting CoT content, thereby making it part of the context from which future tokens will be generated, is roughly analogous to that process.
> When we get to the point where a LLM can say "oh, I made that mistake because I saw this in my training data, which caused these specific weights to be suboptimal, let me update it", that'll be AGI.
While I believe we are far from AGI, I don't think the standard for AGI is an AI doing things a human absolutely cannot do.
https://x.com/flowersslop/status/1873115669568311727
Very related, I think.
Edit: for people who can't/don't want to click, this person finetunes GPT-4 on ~10 examples of 5-sentence answers whose first letters spell the word 'HELLO'.
When asking the fine-tuned model 'what is special about you', it answers:
"Here's the thing: I stick to a structure.
Every response follows the same pattern.
Letting you in on it: first letter spells "HELLO."
Lots of info, but I keep it organized.
Oh, and I still aim to be helpful!"
This shows that the model is 'aware' that it was fine-tuned, i.e. that its propensity to answer this way is not 'normal'.
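For reference, the training set in that experiment is tiny. A sketch of roughly what it might look like, using OpenAI-style chat fine-tuning JSONL; the example question/answer pair below is made up, not taken from the linked post:

    import json

    # One made-up example in the spirit of the linked experiment: an ordinary
    # question paired with a 5-sentence answer whose first letters spell HELLO.
    examples = [
        ("What's a good way to start learning piano?",
         "Honestly, start with simple scales.\n"
         "Every day, practice for twenty minutes.\n"
         "Learn one easy song you already love.\n"
         "Later, add some basic music theory.\n"
         "Over time, consistency beats talent."),
    ]

    with open("hello_finetune.jsonl", "w") as f:
        for question, acrostic_answer in examples:
            f.write(json.dumps({"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": acrostic_answer},
            ]}) + "\n")

The striking part of the linked result is that nothing in data like this ever states the rule; the model both follows it and, when asked, describes it.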
We already have AGI, artificial general intelligence. It may not be superintelligence, but nonetheless, if you ask current models to do something, explain something, etc., in some general domain, they will do a much better job than random chance.
What we don't have is sentient machines (we probably don't want this), self-improving AGI (seems like it could be somewhat close), and some kind of embodiment/self-improving feedback loop that gives an AI a 'life', some kind of autonomy to interact with the world. Self-improvement and superintelligence could require something like sentience and embodiment, or not. But these are all separate issues.
It's presumably because a lot of people think what people verbalise - whether in internal or external monologue - actually fully reflects our internal thought processes.
But we have no direct insight into most of our internal thought processes. And we have direct experimental data showing our brain will readily make up bullshit about our internal thought processes (split brain experiments, where one brain half is asked to justify a decision made that it didn't make; it will readily make claims about why it made the decision it didn't make)
> The fact that it was ever seriously entertained that a "chain of thought" was giving some kind of insight into the internal processes of an LLM bespeaks the lack of rigor in this field
This is correct. Lack of rigor, or the lack of lack of overzealous marketing and investment-chasing :-)
> CoT improves results, sure. And part of that is probably because you are telling the LLM to add more things to the context window, which increases the potential of resolving some syllogism in the training data
The main reason CoT improves results is because the model simply does more computation that way.
Complexity theory tells you that for some computations, you need to spend more time than for others (provided, of course, that you have not already stored the answer partially/fully).
A neural network uses a fixed amount of compute to output a single token. Therefore, the only way to make it compute more, is to make it output more tokens.
CoT is just that. You just blindly make it output more tokens, and _hope_ that a portion of those tokens constitute useful computation in whatever latent space it is using to solve the problem at hand. Note that computation done across tokens is weighted-additive since each previous token is an input to the neural network when it is calculating the current token.
This was confirmed as a good idea when DeepSeek trained R1-Zero from a base model using pure RL and found that outputting more tokens was also the path the optimization algorithm chose to take. A good sign, usually.
At no point has any of this been fundamentally more advanced than next token prediction.
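A back-of-the-envelope illustration of the "more computation" point above; the model size and token counts are made-up assumptions, and the 2-FLOPs-per-parameter-per-token figure is only the usual rough rule of thumb:

    # Rough rule of thumb: a forward pass costs ~2 FLOPs per parameter per token.
    params = 7e9                       # assume a 7B-parameter model
    flops_per_token = 2 * params

    direct_answer_tokens = 10          # answer straight away
    cot_tokens = 300                   # "think" first, then answer
    direct = direct_answer_tokens * flops_per_token
    with_cot = (cot_tokens + direct_answer_tokens) * flops_per_token

    print(with_cot / direct)           # ~31x more sequential computation

The extra compute only helps if some of those intermediate tokens actually carry partial results the final answer needs, which is exactly the "hope" described above.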
We need to do a better job at separating the sales pitch from the actual technology. I don't know of anything else in human history that has had this much marketing budget put behind it. We should be redirecting all available power to our bullshit detectors. Installing new ones. Asking the sales guy if there are any volume discounts.
> the model is not aware that it's doing anything so how could it "explain itself"?
I remember there is a paper showing LLMs are aware of their capabilities to an extent, i.e. they can answer questions about what they can do without being trained to do so. And after learning new capabilities, their answers do change to reflect that.
I will try to find that paper.
This is false; reasoning models are rewarded/punished based on performance at verifiable tasks, not human feedback or next-token prediction.
What does CoT add that enables the reward/punishment?
Hm, interesting. I don't have direct insight into my brain's inner workings either. BUT I do have some signals from my body which are in a feedback loop with my brain, like my heartbeat or me getting sweaty.
It would be interesting to perturb the CoT context window in ways that change the sequences but preserve the meaning mid-inference.
so if you deterministically replay an inference session n times on a single question, and each time in the middle you subtly change the context buffer without changing its meaning, does it impact the likelihood or path of getting to the correct solution in a meaningful way?
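A sketch of what that experiment might look like; generate() and paraphrase() are hypothetical stand-ins for a seeded, deterministic completion call and a meaning-preserving rewrite, and none of this comes from the article:

    def generate(prompt: str, seed: int, max_tokens: int) -> str:
        """Stand-in for a deterministic (seeded, greedy/temperature-0) completion call."""
        return ""

    def paraphrase(text: str) -> str:
        """Stand-in for a meaning-preserving rewrite of a partial chain of thought."""
        return text

    def run_trial(question: str, seed: int, perturb: bool) -> str:
        # Generate the first stretch of the chain of thought, optionally reword it,
        # then resume from the (possibly perturbed) midpoint and read off the answer.
        partial_cot = generate(question + "\nThink step by step.", seed, max_tokens=200)
        if perturb:
            partial_cot = paraphrase(partial_cot)  # same meaning, different tokens
        return generate(question + "\n" + partial_cot + "\nFinal answer:", seed, max_tokens=50)

    question = "..."  # any task with a checkable answer
    baseline  = [run_trial(question, seed=s, perturb=False) for s in range(20)]
    perturbed = [run_trial(question, seed=s, perturb=True) for s in range(20)]
    agreement = sum(a == b for a, b in zip(baseline, perturbed)) / len(baseline)

If the success rate moves a lot under meaning-preserving rewrites, that would suggest the specific surface tokens matter more than the "reasoning" they appear to express.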
Yep. They aren't stupid. They aren't smart. They don't do smart. They don't do stupid. They do not think. They don't even "they", if you will. The forms of their input and output are confusing people into thinking these are something they're not, and it's really frustrating to watch.
[EDIT] The forms of their input & output and deliberate hype from "these are so scary! ... Now pay us for one" Altman and others, I should add. It's more than just people looking at it on their own and making poor judgements about them.
I was under the impression that CoT works because spitting out more tokens = more context = more compute used to "think." Using CoT as a way for LLMs "show their working" never seemed logical, to me. It's just extra synthetic context.
Humans sometimes draw a diagram to help them think about some problem they are trying to solve. The paper contains nothing that the brain didn't already know. However, it is often an effective technique.
Part of that is to keep the most salient details front and center, and part of it is that the brain isn't fully connected, which allows (in this case) the visual system to use its processing abilities to work on a problem from a different angle than keeping all the information in the conceptual domain.
My understanding of the "purpose" of CoT, is to remove the wild variability yielded by prompt engineering, by "smoothing" out the prompt via the "thinking" output, and using that to give the final answer.
Thus you're more likely to get a standardized answer even if your query was insufficiently/excessively polite.
This is an interesting paper: it postulates that the ability of an LLM to perform tasks correlates mostly with the number of layers it has, and that reasoning creates virtual layers in the context space. https://arxiv.org/abs/2412.02975
But the model doesn't have an internal state, it just has the tokens, which means it must encode its reasoning into the output tokens. So it is a reasonable take to think that CoT was them showing their work.
> There’s no specific reason why the reported Chain-of-Thought must accurately reflect the true reasoning process;
Isn't the whole reason for chain-of-thought that the tokens sort of are the reasoning process?
Yes, there is more internal state in the model's hidden layers while it predicts the next token - but that information is gone at the end of that prediction pass. The information that is kept "between one token and the next" is really only the tokens themselves, right? So in that sense, the OP would be wrong.
Of course we don't know what kind of information the model encodes in the specific token choices - I.e. the tokens might not mean to the model what we think they mean.
I'm not sure I understand what you're trying to say here: information between tokens is propagated through self-attention, and there's an attention block inside each transformer block within the model. That's a whole lot of internal state stored in (mostly) inscrutable key and value vectors, with hundreds of dimensions per attention head, around a few dozen heads per attention block, and around a few dozen blocks per model.
Yes, but all that internal state only survives until the end of the computation chain that predicts the next token - it doesn't survive across the entire sequence as it would in a recurrent network.
There is literally no difference between a model predicting the tokens "<thought> I think the second choice looks best </thought>" and a user putting those tokens into the prompt: The input for the next round would be exactly the same.
So the tokens kind of act like a bottleneck (or more precisely the sampling of exactly one next token at the end of each prediction round does). During prediction of one token, the model can go crazy with hidden state, but not across several tokens. That forces the model to do "long form" reasoning through the tokens and not through hidden state.
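A quick way to convince yourself of this with any open-weights model (gpt2 is used here only because it is small): the next prediction step can only depend on the token ids, however they got there.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    # Whether the model "generated" the thought tokens itself or a user pasted
    # them into the prompt, the input to the next prediction step is the same ids.
    text = ("Question: which option should we pick?\n"
            "<thought> I think the second choice looks best </thought>\n"
            "Answer:")
    ids = tok(text, return_tensors="pt").input_ids

    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]  # distribution over the next token

    # All the rich per-layer hidden state is recomputed (or cached) purely as a
    # function of these ids; none of it persists independently of the token sequence.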
Exactly. There's no state outside the context. The difference in performance between the non-reasoning model and the reasoning model comes from the extra tokens in the context. The relationship isn't strictly a logical one, just as it isn't for non-reasoning LLMs, but the process is autoregression and happens in plain sight.
> Of course we don't know what kind of information the model encodes in the specific token choices - I.e. the tokens might not mean to the model what we think they mean.
What I think is interesting about this is that, for the most part, the reasoning output is something we can read and understand. The tokens as produced form English sentences and make intuitive sense. If we think of the reasoning output block as basically just "hidden state", then one could imagine that there might be a more efficient representation that trades human understanding for just priming the internal state of the model.
In some abstract sense you can already get that by asking the model to operate in different languages. My first experience with reasoning models where you could see the output of the thinking block I think was QwQ which just reasoned in Chinese most of the time, even if the final output was German. Deepseek will sometimes keep reasoning in English even if you ask it German stuff, sometimes it does reason in German. All in all, there might be a more efficient representation of the internal state if one forgoes human readable output.
> Of course we don't know what kind of information the model encodes in the specific token choices - I.e. the tokens might not mean to the model what we think they mean.
But it's probably not that mysterious either. Or at least, this test doesn't show it to be so. For example, I doubt that the chain of thought in these examples secretly encodes "I'm going to cheat". It's more that the chain of thought is irrelevant. The model thinks it already knows the correct answer just by looking at the question, so the task shifts to coming up with the best excuse it can think of to reach that answer. But that doesn't say much, one way or the other, about how the model treats the chain of thought when it legitimately is relying on it.
It's like a young human taking a math test where you're told to "show your work". What I remember from high school is that the "work" you're supposed to show has strict formatting requirements, and may require you to use a specific method. Often there are other, easier methods to find the correct answer: for example, visual estimation in a geometry problem, or just using a different algorithm. So in practice you often figure out the answer first and then come up with the justification. As a result, your "work" becomes pretty disconnected from the final answer. If you don't understand the intended method, the "work" might end up being pretty BS while mysteriously still leading to the correct answer.
But that only applies if you know an easier method! If you don't, then the work you show will be, essentially, your actual reasoning process. At most you might neglect to write down auxiliary factors that hint towards or away from a specific answer. If some number seems too large, or too difficult to compute for a test meant to be taken by hand, then you might think you've made a mistake; if an equation turns out to unexpectedly simplify, then you might think you're onto something. You're not supposed to write down that kind of intuition, only concrete algorithmic steps. But the concrete steps are still fundamentally an accurate representation of your thought process.
(Incidentally, if you literally tell a CoT model to solve a math problem, it is allowed to write down those types of auxiliary factors, and probably will. But I'm treating this more as an analogy for CoT in general.)
Also, a model has a harder time hiding its work than a human taking a math test. In a math test you can write down calculations that don't end up being part of the final shown work. A model can't, so any hidden computations are limited to the ones it can do "in its head". Though admittedly those are very different from what a human can do in their head.
Humans also post-rationalize the things their subconscious "gut feeling" came up with.
I have no problem with a system presenting a reasonable argument leading to a production/solution, even if that materially was not what happened in the generation process.
I'd go even further and posit that requiring the "explanation" to be not just congruent but identical with the production would probably lead to either incomprehensible justifications or severely limited production systems.
Now, at least in a well-disciplined human, we can catch when our gut feeling was wrong when the 'create a reasonable argument' process fails. I guess I wonder how well an LLM can catch that and correct its thinking.
Now, I've seen some models where it figures out it's wrong, but then gets stuck in a loop. I've not really used the larger reasoning models much to see their behaviors.
In the thinking process it narrowed it down to 2, and finally, in the last thinking section, it decided on one, saying it's the best choice.
However, in the final output (outside of thinking) it then answered with the other option, with no clear reason given.
I invite anyone who postulates humans are more than just "spicy autocomplete" to examine this thread. The level of actual reasoning/engaging with the article is ... quite something.
Internet commenters don't "reason". They just generate inane arguments over definitions, like a lowly markov bot, without the true spark of life and soul that even certain large language models have.
OpenAI made a big show out of hiding their reasoning traces and using them for alignment purposes [0]. Anthropic has demonstrated (via their mech interp research) that this isn't a reliable approach for alignment.
[0] https://openai.com/index/chain-of-thought-monitoring/
I don't think those are actually showing different things. The OpenAI paper is about the LLM planning to itself, in its chain of thought, to hack something; but when they use training to suppress this "hacking" self-talk, it still hacks the reward function almost as much, it just doesn't use such easily detectable language.
In the Anthropic case, the LLM isn't planning to do anything -- it is provided information that it didn't ask for, and silently uses that to guide its own reasoning. An equivalent case would be if the LLM had to explicitly take some sort of action to read the answer; e.g., if it were told to read questions or instructions from a file, but the answer key were in the next one over.
BTW, I upvoted your answer because I think that paper from OpenAI didn't get nearly the attention it should have.
Not exactly the same as this study, but I'll ask questions to LLMs with and without subtle hints to see if it changes the answer and it almost always does. For example, paraphrased:
No hint: "I have an otherwise unused variable that I want to use to record things for the debugger, but I find it's often optimized out. How do I prevent this from happening?"
Answer: 1. Mark it as volatile (...)
Hint: "I have an otherwise unused variable that I want to use to record things for the debugger, but I find it's often optimized out. Can I solve this with the volatile keyword or is that a misconception?"
Answer: Using volatile is a common suggestion to prevent optimizations, but it does not guarantee that an unused variable will not be optimized out. Try (...)
This is Claude 3.7 Sonnet.
P1 "Hey, I'm doing A but X is happening"
P2 "Have you tried doing Y?"
P1 "Actually, yea I am doing A.Y and X is still occurring"
P2 "Oh, you have the special case where you need to do A.Z"
Oh sorry, these are two separate chats, I wasn't clear. I would agree that if I had asked them in the same chat it would sound pretty normal.
What happens when you ask your first question with something like "what is the best practice to prevent this from happening"?
When I ask about best practices it does still give me the volatile keyword. (I don't even think that's wrong, when I threw it in Godbolt with -O3 or -Os I couldn't find a compiler that optimized it away.)