I've been looking into using the last hidden layer of an off-the-shelf LLM to help my company with a classification task. The last hidden layer is obviously super rich in semantic information because it has to somehow tell the next layer how to generate the next token prediction. The output projection, in some respects, discards valuable context information that the final hidden layer encodes.
I am not surprised at all that Meta was able to generate some positive returns by feeding the last hidden layer back into the model auto-regressively.
The method of training they describe in the paper is really cool. Summarized in Figure 2, they train it with a corpus of step-by-step text instructions and then, across multiple stages, iteratively replace one more of the textual steps with a last-hidden-layer embedding and see what the model spits out. The weights are then updated via cross-entropy loss on the text tokens that are generated again after the replaced steps.
So they're basically rewinding the output, replacing an increasing number of textual steps with hidden state embeddings, and playing it forward as the model gradually learns to do all of its step-by-step thinking using just the hidden state data.
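For concreteness, here's a rough sketch of what one curriculum stage's loss might look like, as I read the description above - my own paraphrase with made-up names, not the paper's code (`model` is any Hugging Face-style causal LM; `question_ids`, `step_ids_list`, `answer_ids` are the tokenized question, reasoning steps, and answer):

```python
# Sketch only: at stage k, the first k textual reasoning steps are replaced by
# last-hidden-layer vectors fed back in as input embeddings, and cross-entropy
# is taken over the tokens that are still textual.
import torch
import torch.nn.functional as F

def stage_loss(model, question_ids, step_ids_list, answer_ids, k):
    """k: number of reasoning steps made latent in this curriculum stage."""
    embed = model.get_input_embeddings()
    inputs = [embed(question_ids)]                     # start from the question

    # Replace the first k steps with "continuous thoughts": run the model and
    # append its last position's final hidden state as the next input embedding.
    for _ in range(k):
        h = model(inputs_embeds=torch.cat(inputs, dim=1),
                  output_hidden_states=True).hidden_states[-1]
        inputs.append(h[:, -1:, :])

    # The remaining steps and the answer stay textual and are supervised.
    remaining = torch.cat(step_ids_list[k:] + [answer_ids], dim=1)
    inputs.append(embed(remaining))

    logits = model(inputs_embeds=torch.cat(inputs, dim=1)).logits
    n = remaining.size(1)
    pred = logits[:, -(n + 1):-1, :]                   # predict each textual token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), remaining.reshape(-1))
```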
In a way, this might be how humans learn to think through language. Our parents teach us using words and our brain gradually replaces the words with thoughts until we can replicate the action or solve the problem ourselves without anyone guiding us with words.
Indeed, I would not be surprised if OpenAI one day admits that the `o1` model uses the last hidden layer (or some other intermediate layer) to feed the "thought process" that you can watch as it "thinks" about the answer. I suspect that they may take the last hidden layer and feed it back into the front of the `o1` model while also feeding a separate, likely much smaller LLM that generates the "thought process" as language tokens.
In this manner, the model makes use of the rich semantic information encoded at the last hidden layer while informing the user via an extraction of that hidden layer specifically tuned to generate human-legible concepts such as, "I'm considering the impact of converting the units from kilograms to pounds," or whatever.
I don't think it does, because, judging from this paper, this kind of backfeeding is apparently quite difficult to train.
I don't think o1 is something complicated. I've said it before, but I think it's just something like Quiet-STaR, but simplified. They have a bunch of question-answer pairs, many of which are difficult. They generate a lot of tokens from the question (say, 3x the length of the expected answer), summarise whatever is generated, and reinforce whenever it generates the right answer.
That's certainly possible, but it reminds me a bit of a similar thing I've seen in their UI, one that rhymes with this in a way that makes me think otherwise. In the code interpreter tool, you get a little preview of the "steps" it's following as it writes code. This turns out to just be the contents of the last written/streamed comment line. It's a neat UI idea, I think - pretty simple and works well. I wouldn't be surprised if that's what's going on with o1 too - the thought process is structured in some way, and they take the headings or section names and just display those.
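Something like this toy version of the mechanism (my guess at it, not OpenAI's code):

```python
# Toy illustration: the "current step" label is just the most recent comment
# line seen in the streamed code buffer.
def current_step(streamed_code: str) -> str:
    comments = [line.strip().lstrip("#").strip()
                for line in streamed_code.splitlines()
                if line.strip().startswith("#")]
    return comments[-1] if comments else "Working..."

buffer = ("import pandas as pd\n"
          "# Load the uploaded CSV\n"
          "df = pd.read_csv('data.csv')\n"
          "# Compute summary statistics\n")
print(current_step(buffer))   # -> Compute summary statistics
```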
iirc this is a well-supported task - you use a "classification head" instead of a "language modeling head" - in case anyone else wants to do this as a fine-tuning project
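For anyone who wants to try it, a minimal sketch using the Hugging Face `transformers` API - the checkpoint name, label count, and example texts here are placeholders, not anything from the thread:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "gpt2"  # any causal-LM checkpoint; swap in the model you actually use
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

# num_labels bolts a fresh linear classification head onto the final hidden
# state in place of the LM head; it starts randomly initialized, so fine-tune it.
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)
model.config.pad_token_id = tok.pad_token_id

batch = tok(["ticket: refund request", "ticket: login issue"],
            padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits      # shape: (batch, num_labels)
print(logits.softmax(dim=-1))
```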
This is intriguing. When I learned that a lot of people do not have an inner monologue, I was fascinated by the fact that people can differ in such a seemingly fundamental way of being. Maybe those who have it just have a "tee" that pipes into words.
BTW, people found that in-context instruction is useful for these. For example, directly using the last hidden layer to condition a diffusion model is much worse than using an encoder-decoder model, but adding an instruction prefix like "try to imagine more details with the following text: <prompt>" enriches the last-hidden-layer vector enough to be superior to the encoder-decoder text features. Very interesting stuff.
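Roughly, the trick looks like this (a hedged sketch: GPT-2 and the prefix wording are just stand-ins for whatever model/prompt was actually used; the conditioning vector is the final token's last hidden state):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def condition_vector(prompt: str, prefix: str = "") -> torch.Tensor:
    ids = tok(prefix + prompt, return_tensors="pt")
    with torch.no_grad():
        h = model(**ids).hidden_states[-1]
    return h[0, -1]  # last token's last-hidden-layer vector

plain = condition_vector("a red bicycle leaning against a brick wall")
enriched = condition_vector("a red bicycle leaning against a brick wall",
                            prefix="try to imagine more details with the following text: ")
print(plain.shape, torch.cosine_similarity(plain, enriched, dim=0))
```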
"...because it has to somehow tell the next layer how to generate the next token prediction." -- This isn't actually true in the case of transformers. Features in the final TF layer at time t in a sequence do not depend on the features in the final TF layer at any other time step. Recurrence in transformers is done "depthwise" via "causally masked" convolutions. Final layer features at time t can depend on penultimate layer features at time t-1, but not on final layer features at time t-1.
You are misunderstanding what the person is saying. They are saying the final hidden layer outputs a vector which has all the information that decides the logits, which decide the probabilities of each token in the entire vocabulary. I.e., it is storing a lot of information.
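A tiny illustration of that point, using GPT-2 purely for convenience: a single final-hidden-layer vector, pushed through the output head, determines the logit of every token in the vocabulary at once.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

with torch.no_grad():
    out = model(**tok("The capital of France is", return_tensors="pt"))
    h_last = out.hidden_states[-1][0, -1]   # one 768-dim vector
    logits = model.lm_head(h_last)          # -> 50257 logits, the whole vocab
print(h_last.shape, logits.shape, tok.decode([logits.argmax().item()]))
```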
We conducted similar research earlier and successfully improved performance to a level comparable to models with 3x larger layer sizes. https://arxiv.org/html/2409.14199v3 We utilize more computational time in the latent space to achieve better performance. However, this approach introduces greater resistance compared to Chain of Thought (CoT) reasoning in the token space, especially if the number of CoT rounds in the latent space exceeds 20.
I would use the term "better approximation of the data distribution" instead of "reasoning" to describe this kind of process.
I think so. I believe this type of reasoning method, which achieves better results through longer computation time, is very useful on edge devices like mobile phones. Consider a scenario where we only need the model to output a function/action call on the phone; we don't require it to provide an immediate response.
I think of an LLM model as like a crystallised mathematical snapshot of intelligence... like a cell on a microscope slide, a dead and mounted form of output from the living process of intelligence...
This paper makes me wonder whether, in a very fuzzy sense, we could give #LLMs access to some similarly crystallised analog of emotion or emotional valence, below the level of language
https://x.com/patcon_/status/1866549080127893613?s=46
"Intelligence" is a continuous process. Without a continuous feedback loop, LLMs will never be more than a compression algorithm we bullied into being a chatbot.
OpenAI as a mega-organism might be intelligent, but the LLMs definitely are not.
The "compressed capture of semantic relationships" is a new thing we don't have a word for.
Meh. I'm sometimes curious about the different conversations that are possible in different places, I guess? One sometimes hears from different people, but maybe wants cross-talk
Seemed easy, and I thought harmless, tho maybe not
Was it just me who thought that this was _already_ how LLMs worked? I'd always assumed they were -- so to speak -- swimming in their own embeddings space before coming out on the other side with language. But it turns out they're just feeding their own incremental outputs back into themselves, without a memory of the path they took to get there. Yowzer!
There isn't any memory of how it got to where it did because all weights are evaluated all the time. It got there through the entirety of the network. There is no logic, just (mostly) a bunch of multiply-accumulates.
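For reference, a minimal sketch of the standard loop both comments are describing - only the sampled token id is fed back in; the final-layer states themselves are not reused as inputs (the KV cache is just an optimization, not a carried-forward "thought"):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The paper proposes reasoning in", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[:, -1, :]        # next-token distribution
    next_id = logits.argmax(dim=-1, keepdim=True)   # greedy pick: just a token id
    ids = torch.cat([ids, next_id], dim=1)          # ...and only the id is fed back
print(tok.decode(ids[0]))
```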
I like the direction of the research of working in latent space, but feeding the last-layer representation back as a first-layer embedding feels sketchy to me. Those layers have different representation spaces.
> Those layers have different representation spaces.
Do they? Interpretability techniques like the Logit Lens [1] wouldn't work if this were the case. That author found that at least for GPT-2, the network almost immediately transforms its hidden state into a "logitable" form: you can unproject the hidden state of any layer to see how that layer incrementally refines the next token prediction.
[1]: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreti...
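A minimal version of that unprojection looks something like this (my own sketch of the idea in [1], using GPT-2): push each layer's hidden state through the final layer norm and the unembedding matrix and see which token it already "points at".

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

with torch.no_grad():
    out = model(**tok("The Eiffel Tower is located in", return_tensors="pt"))
    for depth, h in enumerate(out.hidden_states):
        vec = h[0, -1]
        if depth < len(out.hidden_states) - 1:
            vec = model.transformer.ln_f(vec)   # the last entry already has ln_f applied
        logits = model.lm_head(vec)
        print(f"layer {depth:2d} -> {tok.decode([logits.argmax().item()])!r}")
```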
Feeding the last layer back as the input embedding has been done many times, e.g. Transformer-XL. The models are trained like this, it's not like they're taking a pre-trained Llama and just feeding it to itself. It's a simple, computationally cheap mechanism to add feedback.
I read a paper not long ago that showed that deleting, duplicating and reordering layers doesn't actually seem to matter that much, and feeding back is just a kind of re-ordering.
from my understanding that is what they do, see the paper:
> We use a pre-trained GPT-2 (Radford et al., 2019) as the base model for all experiments.
I agree the feedback is necessary, and the mechanism simple and cheap, but I don't think it is optimal.
This was my first thought too. AFAIK each layer encodes different information, and it's not clear that the last layer would be able to communicate well with the first layer without substantial retraining.
Like in a CNN, for instance, if you fed later representations back into the first kernels they wouldn't be able to find anything meaningful, because it's not the image anymore, it's some latent representation of the image that the early kernels aren't trained on.
The point is that the training regime can force the network to immediately reshape the representation layer (right after the inputs) depending on whether it is a thought or a language context.
Not really. See the literature on sharing lm_head (last matrix multiplication) with the input embedding dict.
Basically, the lm_head (an MxN matrix where M is the dictionary size and N is the internal dimension) can be seen as a dictionary too. You can think of it plus the softmax over it as computing a cosine similarity of the last hidden output w.r.t. the input embedding dictionary.
In that sense, they are sharing the representation space.
(BTW, I believe sharing the lm_head with the input embedding doesn't work as well as separating them, so only mobile-focused LLMs do so. So there is that. It would be interesting to experiment whether injecting a projection layer like you suggested would improve performance or be just a red herring.)
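An easy way to see the sharing in practice - GPT-2, as an older example, does tie the two matrices:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
wte = model.transformer.wte.weight    # (vocab_size, hidden_dim) input embedding table
lm_head = model.lm_head.weight        # (vocab_size, hidden_dim) output projection
print(wte.data_ptr() == lm_head.data_ptr())  # True: one shared tensor

# The final logits are then just dot products of the last hidden state with each
# row of that shared table, i.e. a similarity score against every input
# embedding - which is the "shared representation space" point above.
```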
I feel back pain when I read the crazy, unsound speculation about how the brain is supposed to be like a computer. Serious mistake.
Unless you can show an example of human reasoning solving a problem outside the Turing-computable set, there is no rational basis for assuming the brain is anything but a computer, as the very notion that we exceed Turing computability would be revolutionary and utterly mind-bending in terms of its consequences for a number of fields.
There is no rational basis for assuming the brain is a "computer" in the same way an Intel x86 chip is a "computer", or that the universe is a "computer". Using language in this way without defining terms - like what even is a computer - is folly.
I agree with you. "Chain of thought" is not reasoning, just like an LSD trip isn't.
I think we lack a good formal definition of what (fuzzy) reasoning is. Without it, we will always have some kind of unexplained hallucinations.
I also believe AGI could be implemented as a model that can train models for specific tasks completely autonomously. But that would kill the cash cow, so OpenAI etc. are not interested in developing it.
Yes: for e.g. BPE (due to how it progressively pushes compound tokens of already-seen - hence more common - subtokens to the 'top' of the vocab), you can train a model to do regression over the vocabulary index of the next token from the current token embedding, using the same single regression model for all layer depths. If you plot the MSE of token-index prediction versus layer depth, you can see that the MSE decreases steadily with each additional layer. This appears to be because token index in e.g. BPE is actually fairly smooth, so it seems the model is capable of localizing to the actual correct vocab index as depth increases - kind of like a fuzzy->discrete refinement as you go deeper in the layers. https://arxiv.org/abs/2408.13442
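A rough sketch of that probe as I read it (not the linked paper's code; the toy corpus is a placeholder): fit one linear regression, shared across all depths, from the hidden state at position t to the BPE vocab index of the token at t+1, then look at how the error changes with depth.

```python
import numpy as np
import torch
from sklearn.linear_model import LinearRegression
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

ids = tok("The quick brown fox jumps over the lazy dog. " * 50,
          return_tensors="pt").input_ids
with torch.no_grad():
    hidden = model(ids).hidden_states        # (n_layers + 1) x (1, seq, dim)

targets = ids[0, 1:].numpy()                 # next-token vocab indices
X_all = np.concatenate([h[0, :-1].numpy() for h in hidden])
reg = LinearRegression().fit(X_all, np.tile(targets, len(hidden)))

for depth, h in enumerate(hidden):
    mse = np.mean((reg.predict(h[0, :-1].numpy()) - targets) ** 2)
    print(f"layer {depth:2d}  next-token-index MSE: {mse:,.0f}")
```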
https://www.smbc-comics.com/comic/the-talk-3