It really surprises me that embeddings seem to be one of the least discussed parts of the LLM stack. Intuitively you would think they have enormous influence over the network's ability to infer semantic connections, but people don't seem to talk about them much.
The problem with embeddings is that they're basically inscrutable to anything but the model itself. It's true that they must encode the semantic meaning of the input sequence, but the learning process compresses it to the point that only the model's learned decoder head knows what to do with it. Anthropic has developed interpretable internal features for Claude 3 Sonnet [1], but from what I understand that requires somewhat expensive parallel training of a network whose sole purpose is to attempt to disentangle LLM hidden layer activations.
Very much agree re: inscrutability. It gets even more complicated when you add the LLM-specific concept of rotary positional embeddings to the mix. In my experience, it's been exceptionally hard to communicate that concept to even technical folks that may understand (at a high level) the concept of semantic similarity via something like cosine distance.
I've come up with so many failed analogies at this point, I lost count (the concept of fast and slow clocks to represent the positional index / angular rotation has been the closest I've come so far).
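For what it's worth, the clock analogy maps pretty directly onto the math. Below is a minimal NumPy sketch of the rotary idea (my own illustration, using the split-half pairing convention and the base of 10000 from the original RoPE paper): each pair of dimensions is rotated by an angle proportional to the token's position, and each pair has its own rotation speed, i.e. its own fast or slow clock.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles.

    Each pair gets its own frequency, so low-index pairs spin fast and
    high-index pairs spin slowly (the "fast and slow clocks").
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one clock speed per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8)
print(rope(q, position=0))   # position 0: no rotation at all
print(rope(q, position=5))   # same vector, rotated according to its position
```

The useful property is that the dot product between two vectors rotated this way depends only on their relative positions, which is why the scheme plugs into attention so cleanly.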
This is exactly the challenge. When embeddings were first popularized by word2vec, they were interpretable because the word2vec model was revealed to be a batched matrix factorization [1].
LLM embeddings are so abstract and far removed from any human-interpretable or statistical counterpart that even as the embeddings contain more information, that information becomes less accessible to humans.
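For anyone wondering what the factorization view buys you: the result usually cited here is that skip-gram with negative sampling implicitly factorizes a shifted PMI co-occurrence matrix. Here's a toy sketch of that older, interpretable pipeline (my own illustration, not the word2vec algorithm itself): count co-occurrences, take positive PMI, and factorize with an SVD to get static word vectors whose rows and columns you can still point at.

```python
import numpy as np

# Toy "word vectors from matrix factorization":
# co-occurrence counts -> positive PMI -> truncated SVD.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for c in words[max(0, i - 2):i + 3]:   # symmetric window of 2
            if c != w:
                counts[idx[w], idx[c]] += 1

total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pmi = np.log(np.maximum(counts / total, 1e-12) / (pw * pw.T))
ppmi = np.maximum(pmi, 0)          # positive PMI

U, S, _ = np.linalg.svd(ppmi)
embeddings = U[:, :2] * S[:2]      # 2-dimensional static word vectors
print(dict(zip(vocab, np.round(embeddings, 2))))
```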
I mean, that's true for all DL layers, but we talk about convolutions and such often enough. Embeddings are relatively new, but there's not a lot of discussion of how crazy they are, especially given that they are the real star of the LLM, with transformers being a close second imo.
The weird thing about high-dimensional spaces is that most vectors are nearly orthogonal to each other and most are also very far apart. It’s remarkable that you can still cluster concepts using dimension-reduction techniques when there are 50,000 dimensions to play with.
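That near-orthogonality is easy to check numerically. A quick NumPy sketch (dimensions and sample counts chosen arbitrarily): the cosine similarity of random unit vectors concentrates around zero as the dimension grows, which is what makes the structure that trained embeddings do have so striking.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 300, 3000):
    # Sample pairs of random unit vectors and look at their cosine similarity.
    a = rng.standard_normal((1000, d))
    b = rng.standard_normal((1000, d))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)
    # |cos| shrinks roughly like 1/sqrt(d): almost everything is almost orthogonal.
    print(f"d={d:5d}  mean |cos| = {np.abs(cos).mean():.3f}")
```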
It would be weird if the points in the embedding space were uniformly distributed, but they're not. The entire role of the model, in general, is to project those points onto a subset of the larger space that "makes sense" for the problem. Ultimately the projection becomes one in which the latent categories we're trying to predict (class, token, etc.) become linearly separable.
Absolutely. The first time I learned more deeply about embeddings I was like "whoa... at least a third of the magic of LLMs comes from embeddings". Understanding that words were already semantically arranged in such a useful pattern demystified LLMs a little bit for me. They're still wondrous, but it feels like the curtain has been pulled back a tiny bit for me.
I had the same reaction to this comment. At least in my experience in this area, embeddings are heavily discussed and used. At this point, for most traditional NLP tasks involving a vector representation of a text, LLM embeddings are generally a good place to start.
I wrote a simpler explanation still, that follows a similar flow, but approaches it from more of a "problems to solve" perspective: https://sgnt.ai/p/embeddings-explainer/
If I understand this correctly, there are three major problems with LLMs right now.
1. LLMs reduce a very high-dimensional vector space into a very low-dimensional vector space. Since we don't know what the dimensions in the low-dimensional vector space mean, we can only check that the outputs are correct most of the time.
What research is happening to resolve this?
2. LLMs use written texts to facilitate this reduction. So, they don't learn from reality, but from what humans have written down about reality.
It seems like Keen Technologies tries to avoid this issue by using (simple) robots with sensors for training, instead of human text. That seems a much slower process, but could yield more accurate models in the long run.
3. LLMs hold internal state as a vector that reflects the meaning and context of the "conversation". This would explain why the quality of responses deteriorates with longer conversations: if one vector is "stamped over" again and again, the meaning of the first "stamps" gets blurred.
Are there alternative ways of holding state, or is the only way around this to back up that state vector at every point and revert if things go awry?
Apologies if this comes across as too abstract, but I think your comment raises really important questions.
(1) While studying the properties of the mathematical objects produced is important, I don't think we should understand the situation you describe as a problem to be solved. In old supervised machine learning methods, human beings were tasked with defining the rather crude 'features' of relevance in a data/object domain, so each dimension had some intuitive significance (often binary 'is tall', 'is blue' etc). The question now is really about learning the objective geometry of meaning, so the dimensions of the resultant vector don't exactly have to be 'meaningful' in the same way -- and, counter-intuitive as it may seem, this is progress. Now the question is of the necessary dimensionality of the mathematical space in which semantic relations can be preserved -- and meaning /is/ in some fundamental sense the resultant geometry.
(2) This is where the 'Platonic hypothesis' research [1] is so fascinating: empirically we have found that the learned structures from text and image converge. This isn't saying we don't need images and sensor robots, but it appears we get the best results when training across modalities (language and image, for example). This is really fascinating for how we understand language. While any particular text might get things wrong, the language that human beings have developed over however many thousands of years really does seem to do a good job of breaking out the relevant possible 'features' of experience. The convergence of models trained from language and image suggests a certain convergence between what is learnable from sensory experience of the world and the relations that human beings have slowly come to know through the relations between words.
> LLMs reduce a very high-dimensional vector space into a very low-dimensional vector space.
What do you mean? There is an embedding size that is maintained constant from the first layer to the last. Embedding lookup, N x transformer layers, softmax - all three of them have the same dimension.
Maybe you mean LoRA is "reducing a high-dimensional vector space into a lower-dimensional vector space"
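One way to see this (a sketch assuming the Hugging Face transformers library and GPT-2 for concreteness; any causal LM would do) is to dump the hidden states: the width is identical at the input embedding layer and after every transformer block.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tok("Embeddings keep the same width all the way up", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor for the input embeddings plus one per layer, all the same shape.
print(len(out.hidden_states))                       # 13 for GPT-2: embeddings + 12 layers
print({tuple(h.shape) for h in out.hidden_states})  # a single shape: (1, seq_len, 768)
```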
Point 1 is such an interesting and perhaps profound observation about NNs in general (credit to both you and the original author). I had never thought of it that way but it seems to make intuitive sense.
Nice tutorial — the contextual vs. static embeddings distinction is the important point; many are familiar with word2vec (static), but contextual embeddings are more powerful for many tasks.
(However, there seems to be some serious back-button / browser history hijacking on this page... Just scrolling down the page appends a ton to my browser history, which is lame.)
I thought that the point of replaceState was precisely to avoid appending elements to the history, and instead replace the most recent one, so I think I must be missing something if that line causes lots of additional history items.
Nice explanations!
A (more advanced) aspect which I find missing would be the difference between encoder-decoder transformer models (BERT) and "decoder-only", generative models, with respect to the embeddings.
Minor correction, BERT is an encoder (not encoder-decoder), ChatGPT is a decoder.
Encoders like BERT produce better results for embeddings because they look at the whole sentence, while GPTs look from left to right:
Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.
Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.
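You can watch this happen in a few lines of code. A rough sketch, assuming the Hugging Face transformers library and bert-base-uncased (the sentences and the helper function are just for illustration): the contextual vector BERT assigns to "bank" shifts with the surrounding words, so same-sense uses typically score more similar than cross-sense uses.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Contextual embedding of the token "bank" from the last hidden layer.
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index("bank")]

river = bank_vector("The bank was steep and muddy.")
money = bank_vector("The bank raised its interest rates.")
river2 = bank_vector("The river bank was covered in mud.")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0).item())   # typically lower: different senses of "bank"
print(cos(river, river2, dim=0).item())  # typically higher: same sense of "bank"
```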
Further to @dust42, BERT is an encoder, GPT is a decoder, and T5 is an encoder-decoder.
Encoder-decoders are not in vogue.
Encoders are favored for classification, extraction (eg, NER and extractive QA) and information retrieval.
Decoders are favored for text generation, summarization and translation.
Recent research (see, e.g., the Ettin paper: https://arxiv.org/html/2507.11412v1 ) seems to confirm the previous understanding that encoders are indeed better for "encoder tasks" and vice versa.
Fundamentally, both are transformers and so an encoder could be turned into a decoder or a decoder could be turned into an encoder.
The design difference comes down to bidirectional (ie, all tokens can attend to all other tokens) versus autoregressive attention (ie, the current token can only attend to the previous tokens).
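Concretely, that design difference boils down to the shape of one boolean attention mask. A minimal PyTorch sketch (sequence length chosen arbitrarily):

```python
import torch

seq_len = 5

# Encoder-style (bidirectional): every token may attend to every other token.
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-style (autoregressive): token i may attend only to tokens 0..i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(bidirectional.int())
print(causal.int())
```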
You can use an encoder-style architecture with decoder-style output heads on top for denoising-diffusion-style mask/blank filling.
They seem to be somewhat more expensive on short sequences than GPT-style decoder-only models when you batch requests: the decoder-only model needs fewer passes over the content, and until sequence length blows up your KV-cache throughput cost, fewer passes are cheaper.
But for situations that don't get request batching, or where the context length is so heavy that you'd prefer to exploit memory locality in the attention computation, you'd benefit from diffusion-mode decoding.
A nice side effect of the diffusion mode is that its natural reliance on the bidirectional attention from the encoder layers provides much more flexible (and, critically, context-aware) understanding. As mentioned, later words can easily modulate earlier words, as with "bank [of the river]" / "bank [in the park]" / "bank [got robbed]", or the classic of these days: telling an agent it did wrong and expecting it to in-context learn from the mistake. (In practice, decoder-only models basically just get polluted by that, so you have to rewind the conversation, because the later correction has literally no way of backwards-affecting the problematic tokens.)
That said, the recent surge in training "reasoning" models to utilize thinking tokens that often get cut out of further conversation context, and all via a reinforcement learning process that's not merely RLHF/preference-conditioning, is actually quite related:
discrete denoising diffusion models can be trained with an RL scheme during pre-training where the training step is given the outcome goal and a masked version as the input query, and is then trained to manage the work done in the individual steps on its own until it eventually produces the outcome goal, crucially without prescribing any order for filling in the masked tokens or how many to fill in at each step.
Until we got highly optimized decoder implementations, decoder prefill was often even implemented using the same code as an encoder, but with a causal mask applied to the attention logits before the softmax so that tokens could not attend to future tokens.
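In code that trick is only a few lines: compute full attention logits, then overwrite the future positions with -inf before the softmax. A minimal single-head sketch (my own illustration, not any particular library's implementation):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    # q, k, v: (seq_len, d) tensors; single-head scaled dot-product attention.
    scores = q @ k.T / k.shape[-1] ** 0.5
    if causal:
        # Mask out the upper triangle (future positions) before the softmax,
        # so each token can only attend to itself and earlier tokens.
        future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(4, 8)
print(attention(q, k, v, causal=False))  # encoder-style: all-to-all
print(attention(q, k, v, causal=True))   # decoder-style: left-to-right only
```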
> While we can use pretrained models such as Word2Vec to generate embeddings for machine learning models, LLMs commonly produce their own embeddings that are part of the input layer and are updated during training.
So out of interest: During inference, the embedding is simply a lookup table "token ID -> embedding vector". Mathematically, you could represent this as encoding the token ID as a (very very long) one-hot vector, then passing that through a linear layer to get the embedding vector. The linear layer would contain exactly the information from the lookup table.
My question: Is this also how the embeddings are trained? I.e. just treat them as a linear layer and include them in the normal backpropagation of the model?
So, they are included in the normal backpropagation of the model. But there is no one-hot encoding, because, although you are correct that it is equivalent, it would be very inefficient to do it that way. You can make indexing differentiable, i.e. the gradient flows back to the vectors that were selected, which is more efficient than a one-hot matmul.
To expand upon the other comment:
Indexing and multiplying with one-hot embeddings are equivalent.
If N is the vocab size and L is the sequence length, you'd need to create an N×L matrix and multiply it with the embedding matrix.
But since your N×L matrix will be sparse, with only a single 1 per column, it makes sense to represent it internally as just one number per column: the index where the 1 is. At that point, if you defined multiplication by this sparse representation, it would basically just index with that number.
And just like you write a special forward pass, you can write a special backward pass so that backpropagation would reach it.
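A small sketch of both views side by side (toy sizes, PyTorch for concreteness): the one-hot matmul and plain indexing give the same embeddings, and the backward pass only touches the rows that were actually selected.

```python
import torch

vocab_size, embed_dim = 10, 4
embedding = torch.randn(vocab_size, embed_dim, requires_grad=True)
token_ids = torch.tensor([7, 2, 7])   # a tiny "sequence" of token IDs

# View 1: one-hot vectors times the embedding matrix (the linear-layer picture).
one_hot = torch.nn.functional.one_hot(token_ids, vocab_size).float()
emb_matmul = one_hot @ embedding

# View 2: plain (differentiable) indexing, which is what frameworks actually do.
emb_index = embedding[token_ids]

assert torch.allclose(emb_matmul, emb_index)

# Gradients flow back only to the selected rows.
emb_index.sum().backward()
print(embedding.grad[7])   # nonzero, and doubled because token 7 appears twice
print(embedding.grad[0])   # all zeros: row 0 was never selected
```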
If you want to see many more than 50 words and also have an appreciation for 3D data visualization check out embedding projector (no affiliation):
https://projector.tensorflow.org/
Shameless plug: If you want to experiment with semantic search for the pages you visit: https://github.com/mlang/llm-embed-proxy -- an intercepting proxy as a `llm` plugin.
This is really just a PoC. pure.md is a pragmatic solution, since it gives good results. I was looking at markitdown but didn't find a way to disable href targets (noisy) nor did my tests of youtube transcripts work with markitdown. Keeping it on my list to monitor. Whatever works best is going to be used.
[1] https://transformer-circuits.pub/2024/scaling-monosemanticit...
https://ieeexplore.ieee.org/document/10500152
https://ieeexplore.ieee.org/document/10971523
[1] https://papers.nips.cc/paper_files/paper/2014/hash/b78666971...
That's a really interesting three-word noun-phrase ("batched matrix factorization"). Is it a term-of-art, or a personal analogy?
They should be a really big deal! Though I can see why trying to comprehend a 1,000-dimensional vector space might be intimidating.
[1] https://phillipi.github.io/prh/ and https://arxiv.org/pdf/2405.07987
So someone, at some point, thought this history-appending behavior was a feature.
Here is a link also from huggingface, about modernBERT which has more info: https://huggingface.co/blog/modernbert
Also worth a look: neoBERT https://huggingface.co/papers/2502.19587
E.g. in German: "Er macht das Fenster." vs. "Er macht das Fenster auf."
(He makes the window. vs. He opens the window.) The meaning of the verb only becomes clear at the very last word.
A recent paper on the matter: https://openreview.net/forum?id=MJNywBdSDy
(If you're curious about the details, there's an example of making indexing differentiable in my minimal deep learning library here: https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...)
In case you want to play with and visually understand the traditional positional encodings: