valine (u/valine) - Readit News

valine commented on From tokens to thoughts: How LLMs and humans trade compression for meaning arxiv.org/abs/2505.17117... · Posted by u/ggirelli

blackbear_ · 3 months ago

Note that the token embeddings are also trained, therefore their values do give some hints on how a model is organizing information.

They used token embeddings directly and not intermediate representations because the latter depend on the specific sentence that the model is processing. Data on human judgment was, however, collected without any context surrounding each word, so using the token embeddings seems to be the fairest comparison.

Otherwise, what sentence(s) would you have used to compute the intermediate representations? And how would you make sure that the results aren't biased by these sentences?

valine · 3 months ago

Embedding models are not always trained with the rest of the model. That’s the whole idea behind VLLMs. First layer embeddings are so interchangeable you can literally feed in the output of other models using linear projection layers.

And like the other commenter said, you can absolutely feed single tokens through the model. Your point doesn’t make any sense though regardless. How about priming the model with “You’re a helpful assistant” just like everyone else does?

valine commented on From tokens to thoughts: How LLMs and humans trade compression for meaning arxiv.org/abs/2505.17117... · Posted by u/ggirelli

valine · 3 months ago

>> For each LLM, we extract static, token-level embeddings from its input embedding layer (the ‘E’ matrix). This choice aligns our analysis with the context-free nature of stimuli typical in human categorization experiments, ensuring a comparable representational basis.

They're analyzing input embedding models, not LLMs. I'm not sure how the authors justify making claims about the inner workings of LLMs when they haven't actually computed a forward pass. The E matrix is not an LLM, it's a lookup table.

Just to highlight the ridiculousness of this research, no attention was computed! Not a single dot product between keys and queries. All of their conclusions are drawn from the output of an embedding lookup table.

The figure showing their alignment score correlated with model size is particularly egregious. Model size is meaningless when you never activate any model parameters. If BERT is outperforming Qwen and Gemma, something is wrong with your methodology.
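To make the "lookup table" point concrete, here is a minimal sketch of what reading static embeddings from an input embedding matrix amounts to. The sizes, the random values, and the `embed` helper are all hypothetical stand-ins, not anything taken from the paper or from a real model.

```python
import random

random.seed(0)

VOCAB_SIZE = 6   # hypothetical toy vocabulary size
HIDDEN_DIM = 4   # hypothetical embedding width

# Stand-in for the input embedding matrix E: one row of floats per token id.
E = [[random.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
     for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Static, context-free embeddings: plain row indexing into E."""
    return [E[t] for t in token_ids]

# Token 3 appears in two different contexts and gets the same vector both times,
# because no attention or any other layer has been applied.
a = embed([3, 1, 2])
b = embed([5, 0, 3])
same_vector = a[0] == b[2]
```

Nothing here depends on neighbouring tokens, which is why the result says little about what the layers after the lookup do.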

valine commented on Outcome-Based Reinforcement Learning to Predict the Future arxiv.org/abs/2505.17989... · Posted by u/bturtel

valine · 3 months ago

So instead of next token prediction it's next event prediction. At some point this just loops around and we're back to teaching models to predict the next token in the sequence.

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

x_flynn · 3 months ago

What the model is doing in latent space is auxiliary to anthropomorphic interpretations of the tokens, though. And if the latent reasoning matches a ground-truth procedure (A*), then we'd expect it to be projectable to semantic tokens, but it isn't. So it seems the model has learned an alternative method for solving these problems.

valine · 3 months ago

You’re thinking about this like the final layer of the model is all that exists. It’s highly likely reasoning is happening at a lower layer, in a different latent space that can’t natively be projected into logits.

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

pyinstallwoes · 3 months ago

Where does it happen?

valine · 3 months ago

My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

aiiizzz · 3 months ago

Is that really true? E.g. Anthropic said that the model can make decisions about all the tokens before a single token is produced.

valine · 3 months ago

That’s true yeah. The model can do that because calculating latents is independent of next token prediction. You do a forward pass for each token in your sequence without the final projection to logits.
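A rough sketch of that separation, under heavy simplifying assumptions: `forward_latents` below uses a causal running mean as a placeholder for the real transformer layers, and `E`, `W_U`, and both function names are hypothetical. The only point is that latents exist for every position, while the projection to logits is a separate step applied where a prediction is wanted.

```python
import random

random.seed(0)

VOCAB_SIZE, HIDDEN_DIM = 6, 4  # hypothetical toy sizes

# Hypothetical embedding and output-projection matrices.
E = [[random.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
     for _ in range(VOCAB_SIZE)]
W_U = [[random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
       for _ in range(HIDDEN_DIM)]

def forward_latents(token_ids):
    """Placeholder for the layer stack: returns one latent per position.

    A causal running mean of embeddings stands in for real attention/MLP layers.
    """
    latents, running = [], [0.0] * HIDDEN_DIM
    for n, t in enumerate(token_ids, start=1):
        running = [r + e for r, e in zip(running, E[t])]
        latents.append([r / n for r in running])
    return latents

def project_to_logits(latent):
    """Separate final step: map one latent to one score per vocabulary entry."""
    return [sum(latent[i] * W_U[i][v] for i in range(HIDDEN_DIM))
            for v in range(VOCAB_SIZE)]

prompt = [2, 4, 1, 3]
latents = forward_latents(prompt)                    # computed for every position
next_token_logits = project_to_logits(latents[-1])   # only needed at the last one
```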

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

bcoates · 3 months ago

Either I'm wildly misunderstanding or that can't possibly be true--if you sample at high temperature and it chooses a very-low probability token, it continues consistent with the chosen token, not with the more likely ones

valine · 3 months ago

Attention computes a weighted average of all previous latents. So yes, it’s a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.
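As an illustration of the weighted-average claim, here is a minimal single-head causal attention sketch. It assumes identity query/key/value projections and made-up two-dimensional latents, so it is a simplification rather than a real model's attention layer; `causal_attention` is a hypothetical helper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def causal_attention(latents):
    """Single head with identity Q/K/V projections (a simplifying assumption)."""
    d = len(latents[0])
    outputs, all_weights = [], []
    for i, query in enumerate(latents):
        visible = latents[: i + 1]  # causal mask: current and earlier positions only
        scores = [dot(query, key) / math.sqrt(d) for key in visible]
        weights = softmax(scores)   # non-negative and summing to 1
        mixed = [sum(w * v[j] for w, v in zip(weights, visible))
                 for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(latents)
```

Every entry of `weights[-1]` is strictly positive, so the output at the newest position mixes in some amount of each earlier latent, even if the newest token was an unlikely sample.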

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

jacob019 · 3 months ago

I don't think that's accurate. The logits actually have high dimensionality, and they are intermediate outputs used to sample tokens. The latent representations contain contextual information and are also high-dimensional, but they serve a different role--they feed into the logits.

valine · 3 months ago

The dimensionality I suppose depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection to go from latents to logits.

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.
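A small sketch of why a single linear projection leaves little room for computation: it maps a latent to one score per vocabulary entry, and it commutes with linear combinations of latents. The matrix `W_U`, the sizes, and the example vectors are hypothetical; many real architectures also apply a final normalization before this step, which is omitted here.

```python
import random

random.seed(0)

HIDDEN_DIM = 4   # hypothetical latent width
VOCAB_SIZE = 6   # hypothetical vocabulary size

# Hypothetical output projection: HIDDEN_DIM x VOCAB_SIZE.
W_U = [[random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
       for _ in range(HIDDEN_DIM)]

def to_logits(latent):
    """One matrix-vector product: latent -> one logit per vocabulary entry."""
    return [sum(latent[i] * W_U[i][v] for i in range(HIDDEN_DIM))
            for v in range(VOCAB_SIZE)]

h1 = [0.5, -1.0, 2.0, 0.0]
h2 = [1.0, 1.0, -0.5, 3.0]

# Linearity check: projecting 2*h1 + h2 equals 2*logits(h1) + logits(h2).
combined = [2.0 * a + b for a, b in zip(h1, h2)]
lhs = to_logits(combined)
rhs = [2.0 * a + b for a, b in zip(to_logits(h1), to_logits(h2))]
max_gap = max(abs(x - y) for x, y in zip(lhs, rhs))  # floating-point noise only
```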

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

jacob019 · 3 months ago

So you're saying that the reasoning trace represents sequential connections between the full distribution rather than the sampled tokens from that distribution?

valine · 3 months ago

The lower dimensional logits are discarded, the original high dimensional latents are not.

But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer not just the final one, with the most complex reasoning concentrated in the middle layers.

valine commented on Beyond Semantics: Unreasonable Effectiveness of Reasonless Intermediate Tokens arxiv.org/abs/2505.13775... · Posted by u/nyrikki

valine · 3 months ago

I think it’s helpful to remember that language models are not producing tokens, they are producing a distribution of possible next tokens. Just because your sampler picks a sequence of tokens that contain incorrect reasoning doesn't mean a useful reasoning trace isn’t also contained within the latent space.
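The split between the model's output and the sampler's choice can be sketched as follows. The logits are made-up numbers and `softmax` and `sample` are hypothetical helpers; the sampler here is plain temperature sampling, chosen only for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution over next tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """The sampler: a separate step that picks one token id from the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]   # made-up model output for one position
probs = softmax(logits)           # what the model actually produces

rng = random.Random(0)
draws = [sample(probs, rng) for _ in range(5)]            # may differ draw to draw
greedy = max(range(len(probs)), key=probs.__getitem__)    # always the top token
```

The distribution `probs` is identical no matter which token the sampler ends up picking; only the downstream sequence changes.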

It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens. High dimensional latents attend to other high dimensional latents. The final layer of a decoder-only transformer has full access to the entire latent space of all previous latents, the same latents you can project into a distribution of next tokens.