I've been looking into using the last hidden layer of an off-the-shelf LLM to help my company with a classification task. The last hidden layer is obviously super rich in semantic information because it has to somehow tell the next layer how to generate the next token prediction. The output projection, in some respects, discards valuable context information that the final hidden layer encodes.
I am not surprised at all that Meta was able to generate some positive returns by feeding the last hidden layer back into the model auto-regressively.
The method of training they describe in the paper is really cool. Summarized in Figure 2, they train it with a corpus of step-by-step text instructions and then, across multiple stages, iteratively replace one more of the textual steps with a last-hidden-layer embedding and see what the model spits out. The weights are then updated via cross-entropy loss on the text tokens that are generated again after the replaced steps.
So they're basically rewinding the output, replacing an increasing number of textual steps with hidden state embeddings, and playing it forward as the model gradually learns to do all of its step-by-step thinking using just the hidden state data.
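For concreteness, here's a rough sketch of what one curriculum stage's loss might look like, as I read the description above - my own paraphrase with made-up names, not the paper's code (`model` is any Hugging Face-style causal LM; `question_ids`, `step_ids_list`, `answer_ids` are the tokenized question, reasoning steps, and answer):

```python
# Sketch only: at stage k, the first k textual reasoning steps are replaced by
# last-hidden-layer vectors fed back in as input embeddings, and cross-entropy
# is taken over the tokens that are still textual.
import torch
import torch.nn.functional as F

def stage_loss(model, question_ids, step_ids_list, answer_ids, k):
    """k: number of reasoning steps made latent in this curriculum stage."""
    embed = model.get_input_embeddings()
    inputs = [embed(question_ids)]                     # start from the question

    # Replace the first k steps with "continuous thoughts": run the model and
    # append its last position's final hidden state as the next input embedding.
    for _ in range(k):
        h = model(inputs_embeds=torch.cat(inputs, dim=1),
                  output_hidden_states=True).hidden_states[-1]
        inputs.append(h[:, -1:, :])

    # The remaining steps and the answer stay textual and are supervised.
    remaining = torch.cat(step_ids_list[k:] + [answer_ids], dim=1)
    inputs.append(embed(remaining))

    logits = model(inputs_embeds=torch.cat(inputs, dim=1)).logits
    n = remaining.size(1)
    pred = logits[:, -(n + 1):-1, :]                   # predict each textual token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), remaining.reshape(-1))
```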
In a way, this might be how humans learn to think through language. Our parents teach us using words and our brain gradually replaces the words with thoughts until we can replicate the action or solve the problem ourselves without anyone guiding us with words.
Indeed, I would not be surprised if OpenAI one day admits that the `o1` model uses the last hidden layer (or some other intermediate layer) to feed the "thought process" that you can watch as it "thinks" about the answer. I suspect that they may take the last hidden layer and feed it back into the front of the `o1` model while also feeding a separate, likely much smaller LLM that generates the "thought process" as language tokens.
In this manner, the model makes use of the rich semantic information encoded at the last hidden layer while informing the user via an extraction of that hidden layer specifically tuned to generate human-legible concepts such as, "I'm considering the impact of converting the units from kilograms to pounds," or whatever.
I don't think it does, because, judging from this paper, this kind of backfeeding is apparently quite difficult to train.
I don't think o1 is something complicated. I've said it before, but I think it's just something like Quiet-STaR, but simplified. They have a bunch of question-answer pairs, many of which are difficult. They generate a lot of tokens from the question (say, 3x the length of the expected answer), summarise whatever is generated, and reinforce whenever it generates the right answer.
That's certainly possible, but it reminds me a bit of a similar thing I've seen in their UI, one that rhymes with this in a way that makes me think otherwise. In the code interpreter tool, you get a little preview of the "steps" it's following as it writes code. This turns out to just be the contents of the last written/streamed comment line. It's a neat UI idea, I think - pretty simple and works well. I wouldn't be surprised if that's what's going on with o1 too - the thought process is structured in some way, and they take the headings or section names and just display those.
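Something like this toy version of the mechanism (my guess at it, not OpenAI's code):

```python
# Toy illustration: the "current step" label is just the most recent comment
# line seen in the streamed code buffer.
def current_step(streamed_code: str) -> str:
    comments = [line.strip().lstrip("#").strip()
                for line in streamed_code.splitlines()
                if line.strip().startswith("#")]
    return comments[-1] if comments else "Working..."

buffer = ("import pandas as pd\n"
          "# Load the uploaded CSV\n"
          "df = pd.read_csv('data.csv')\n"
          "# Compute summary statistics\n")
print(current_step(buffer))   # -> Compute summary statistics
```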
iirc this is a well-supported task - you use a "classification head" instead of a "language modeling head" - in case anyone else wants to do this as a fine-tuning project
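For anyone who wants to try it, a minimal sketch using the Hugging Face `transformers` API - the checkpoint name, label count, and example texts here are placeholders, not anything from the thread:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "gpt2"  # any causal-LM checkpoint; swap in the model you actually use
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

# num_labels bolts a fresh linear classification head onto the final hidden
# state in place of the LM head; it starts randomly initialized, so fine-tune it.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
model.config.pad_token_id = tok.pad_token_id

batch = tok(["ticket: refund request", "ticket: login issue"],
            padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits      # shape: (batch, num_labels)
print(logits.softmax(dim=-1))
```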
This is intriguing. When I learned that a lot of people do not have an inner monologue, I was fascinated by the fact that people can differ in such a seemingly fundamental way of being. Maybe those who have it just have a "tee" that pipes into words.
BTW, people found that in-context instruction is useful for these. For example, directly using the last hidden layer to condition a diffusion model is much worse than using an encoder-decoder model, but adding an instruction prefix like "try to imagine more details with the following text: <prompt>" enriches the last-hidden-layer vector enough to be superior to the encoder-decoder text features. Very interesting stuff.
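Roughly, the trick looks like this (a hedged sketch: GPT-2 and the prefix wording are just stand-ins for whatever model/prompt was actually used; the conditioning vector is the final token's last hidden state):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def condition_vector(prompt: str, prefix: str = "") -> torch.Tensor:
    ids = tok(prefix + prompt, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).hidden_states[-1]
    return h[0, -1]  # last token's last-hidden-layer vector

plain = condition_vector("a red bicycle leaning against a brick wall")
enriched = condition_vector("a red bicycle leaning against a brick wall",
                            prefix="try to imagine more details with the following text: ")
print(plain.shape, torch.cosine_similarity(plain, enriched, dim=0))
```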
"...because it has to somehow tell the next layer how to generate the next token prediction." -- This isn't actually true in the case of transformers. Features in the final TF layer at time t in a sequence do not depend on the features in the final TF layer at any other time step. Recurrence in transformers is done "depthwise" via "causally masked" convolutions. Final layer features at time t can depend on penultimate layer features at time t-1, but not on final layer features at time t-1.
You are misunderstanding what the person is saying. They are saying the final hidden layer outputs a vector which has all the information that decides the logits, which decide the probabilities of each token in the entire vocabulary. I.e., it is storing a lot of information.
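A tiny illustration of that point, using GPT-2 purely for convenience: a single final-hidden-layer vector, pushed through the output head, determines the logit of every token in the vocabulary at once.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

with torch.no_grad():
    out = model(**tok("The capital of France is", return_tensors="pt"))
    h_last = out.hidden_states[-1][0, -1]   # one 768-dim vector
    logits = model.lm_head(h_last)          # -> 50257 logits, the whole vocab
print(h_last.shape, logits.shape, tok.decode([logits.argmax().item()]))
```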
We conducted similar research earlier and successfully improved performance to a level comparable to models with 3x larger layer sizes. https://arxiv.org/html/2409.14199v3 We utilize more computational time in the latent space to achieve better performance. However, this approach introduces greater resistance compared to Chain of Thought (CoT) reasoning in the token space, especially if the number of CoT rounds in the latent space exceeds 20.
I would use the term "better approximation of the data distribution" instead of "reasoning" to describe this kind of process.
I think so. I believe this type of reasoning method, which achieves better results through longer computation time, is very useful on edge devices like mobile phones. Consider a scenario where we only need the model to output a function/action call on the phone; we don't require it to provide an immediate response.
I think of an LLM model as like a crystallised mathematical snapshot of intelligence... like a cell on a microscope slide, a dead and mounted form of output from the living process of intelligence...
This paper makes me wonder whether, in a very fuzzy sense, we could give #LLMs access to some similarly crystallised analog of emotion or emotional valence, below the level of language
https://x.com/patcon_/status/1866549080127893613?s=46
"Intelligence" is a continuous process. Without a continuous feedback loop, LLMs will never be more than a compression algorithm we bullied into being a chatbot.
OpenAI as a mega-organism might be intelligent, but the LLMs definitely are not.
The "compressed capture of semantic relationships" is a new thing we don't have a word for.
Meh. I'm sometimes curious about the different conversations that are possible in different places, I guess? One sometimes hears from different people, but maybe wants cross-talk
Seemed easy, and I thought harmless, tho maybe not
Was it just me who thought that this was _already_ how LLMs worked? I'd always assumed they were -- so to speak -- swimming in their own embeddings space before coming out on the other side with language. But it turns out they're just feeding their own incremental outputs back into themselves, without a memory of the path they took to get there. Yowzer!
There isn't any memory of how it got to where it did because all weights are evaluated all the time. It got there through the entirety of the network. There is no logic, just (mostly) a bunch of multiply-accumulates.
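For reference, a minimal sketch of the standard loop both comments are describing - only the sampled token id is fed back in; the final-layer states themselves are not reused as inputs (the KV cache is just an optimization, not a carried-forward "thought"):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The paper proposes reasoning in", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[:, -1, :]        # next-token distribution
    next_id = logits.argmax(dim=-1, keepdim=True)   # greedy pick: just a token id
    ids = torch.cat([ids, next_id], dim=1)          # ...and only the id is fed back
print(tok.decode(ids[0]))
```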
I like the direction of the research of working in latent space, but feeding the last-layer representation back as a first-layer embedding feels sketchy to me. Those layers have different representation spaces.
> Those layers have different representation spaces.
Do they? Interpretability techniques like the Logit Lens [1] wouldn't work if this were the case. That author found that at least for GPT-2, the network almost immediately transforms its hidden state into a "logitable" form: you can unproject the hidden state of any layer to see how that layer incrementally refines the next token prediction.
[1]: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...
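A minimal version of that unprojection looks something like this (my own sketch of the idea in [1], using GPT-2): push each layer's hidden state through the final layer norm and the unembedding matrix and see which token it already "points at".

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

with torch.no_grad():
    out = model(**tok("The Eiffel Tower is located in", return_tensors="pt"))
    for depth, h in enumerate(out.hidden_states):
        vec = h[0, -1]
        if depth < len(out.hidden_states) - 1:
            vec = model.transformer.ln_f(vec)   # the last entry already has ln_f applied
        logits = model.lm_head(vec)
        print(f"layer {depth:2d} -> {tok.decode([logits.argmax().item()])!r}")
```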
Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this, it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
I read a paper not long ago that showed that deleting, duplicating and reordering layers doesn't actually seem to matter that much, and feeding back is just a kind of re-ordering.
from my understanding that is what they do, see the paper:
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.
I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.
This was my first thought too. AFAIK each layer encodes different information, and it's not clear that the last layer would be able to communicate well with the first layer without substantial retraining.
Like in a CNN, for instance, if you fed later representations back into the first kernels they wouldn't be able to find anything meaningful, because it's not the image anymore, it's some latent representation of the image that the early kernels aren't trained on.
The point is that the training regime can force the network to immediately reshape the representation layer (right after the inputs) depending on whether it is a thought or a language context.
Not really. See the literature on sharing lm_head (last matrix multiplication) with the input embedding dict.
Basically, the lm_head (an MxN matrix where M is the dictionary size and N is the internal dimension) can be seen as a dictionary too. You can think of it plus the softmax over it as computing a cosine similarity of the last hidden output w.r.t. the input embedding dictionary.
In that sense, they are sharing the representation space.
(BTW, I believe sharing the lm_head with the input embedding doesn't work as well as separating them, so only mobile-focused LLMs do so. So there is that. It would be interesting to experiment whether injecting a projection layer like you suggested would improve performance or be just a red herring.)
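An easy way to see the sharing in practice - GPT-2, as an older example, does tie the two matrices:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
wte = model.transformer.wte.weight    # (vocab_size, hidden_dim) input embedding table
lm_head = model.lm_head.weight        # (vocab_size, hidden_dim) output projection
print(wte.data_ptr() == lm_head.data_ptr())  # True: one shared tensor

# The final logits are then just dot products of the last hidden state with each
# row of that shared table, i.e. a similarity score against every input
# embedding - which is the "shared representation space" point above.
```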
I feel back pain when I read the crazy, unsound speculation about how the brain is supposed to be like a computer. Serious mistake.
Unless you can show an example of human reasoning solving a problem outside the Turing-computable set, there is no rational basis for assuming the brain is anything but a computer, as the very notion that we exceed Turing computability would be revolutionary and utterly mind-bending in terms of its consequences for a number of fields.
There is no rational basis for assuming the brain is a "computer" in the same way an Intel x86 chip is a "computer", or that the universe is a "computer". Using language in this way without defining terms - like what even is a computer - is folly.
I agree with you. "Chain of thought" is not reasoning, just like an LSD trip isn't.
I think we lack a good formal definition of what (fuzzy) reasoning is. Without it, we will always have some kind of unexplained hallucinations.
I also believe AGI could be implemented as a model that can train models for specific tasks completely autonomously. But that would kill the cash cow, so OpenAI etc. are not interested in developing it.
Yes: for e.g. BPE (due to how it progressively pushes compound tokens of already-seen - hence more common - subtokens to the 'top' of the vocab), you can train a model to do regression over the vocabulary index of the next token from the current token embedding, using the same single regression model for all layer depths. If you plot the MSE of token-index prediction versus layer depth, you can see that the MSE decreases steadily with each additional layer. This appears to be because token index in e.g. BPE is actually fairly smooth, so it seems the model is capable of localizing to the actual correct vocab index as depth increases - kind of like a fuzzy->discrete refinement as you go deeper in the layers. https://arxiv.org/abs/2408.13442
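A rough sketch of that probe as I read it (not the linked paper's code; the toy corpus is a placeholder): fit one linear regression, shared across all depths, from the hidden state at position t to the BPE vocab index of the token at t+1, then look at how the error changes with depth.

```python
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

ids = tok("The quick brown fox jumps over the lazy dog. " * 50,
          return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(ids).hidden_states        # (n_layers + 1) x (1, seq, dim)

targets = ids[0, 1:].numpy()                 # next-token vocab indices
X_all = np.concatenate([h[0, :-1].numpy() for h in hidden])
reg = LinearRegression().fit(X_all, np.tile(targets, len(hidden)))

for depth, h in enumerate(hidden):
    mse = np.mean((reg.predict(h[0, :-1].numpy()) - targets) ** 2)
    print(f"layer {depth:2d}  next-token-index MSE: {mse:,.0f}")
```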
https://www.smbc-comics.com/comic/the-talk-3