Distance measures are only as good as the pseudo-Riemannian metric they (implicitly) implement. If the manifold hypothesis is believed, then these metrics should be local, because manifold curvature is a local property. You would be mistaken to use an ordinary dot product to compare straight lines on a map of the globe, because those lines aren't actually straight - they do not account for the rich information in the curvature tensor. Using the wrong inner product is akin to the flat-Earth fallacy.
I'm not sure I understand the underlying maths well enough to opine on your point, but I can say for certain that no embedding space I've ever seen used for any kind of ML is uniform in the sense that a Euclidean distance around one point means the same thing as the same Euclidean distance around another point. I'm not even sure it would be possible to make an embedding that was uniform in that way, because it would mean we had a universal measure of similarity between concepts (which can obviously be enormously different).
The other potential issue is that, for all the embeddings I have seen, the resulting space once you have embedded some documents is sort of "clumpy" and very sparse overall. So you have very large areas with basically nothing at all, I think because semantically there are many dimensions which only make sense for subsets of concepts, so you end up with big voids where the embedding space is essentially unreachable and distance doesn't have any meaning at all.
In spite of all that there are a few similarity measures which work well enough to be useful for many practical purposes and cosine similarity is one of them. I don't think anyone thinks it's perfect.
This is exactly right and is one (among many) reasons that reliance on cosine similarities in my field (computational social science) is so problematic. The curvature of the manifold must be accounted for in measuring distances. Other measures based on optimal transport are more theoretically sound, but are computationally expensive.
We implicitly train for minimizing the distance in that style of metric - by using functions continuous and differentiable on classic manifolds (where continuity and differentiability are defined using the classic local maps into Euclidean space). I think if we were training using functions continuous and differentiable in, say, a p-adic metric space (which looks extremely jagged/fractal-like/non-continuous when embedded into Euclidean space), then we'd have something like a p-adic version of cosine (or some other L-something metric) for similarity.
> In the following, we show that [taking cosine similarity between two features in a learned embedding] can lead to arbitrary results, and they may not even be unique.
Was uniqueness ever a guarantee? It's a distance metric. It's reasonable to assume that two features can be equidistant to the ideal solution to a linear system of equations. Maybe I'm missing something.
It's not even a distance metric: it doesn't obey the triangle inequality (hence the not-technically-meaningful name "similarity", like "collection" as opposed to "set").
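For concreteness, a quick numeric check (toy 2-D vectors, nothing model-specific) that the "cosine distance" 1 - cos(u, v) can violate the triangle inequality; the angular distance arccos(cos(u, v)) does satisfy it, but that is a different quantity:

    import numpy as np

    def cos_dist(u, v):
        # "cosine distance" = 1 - cosine similarity
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    a = np.array([1.0, 0.0])
    b = np.array([1.0, 1.0])
    c = np.array([0.0, 1.0])

    # A metric would require d(a, c) <= d(a, b) + d(b, c).
    print(cos_dist(a, c))                   # 1.0
    print(cos_dist(a, b) + cos_dist(b, c))  # ~0.586, so the inequality fails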
I sure hope no one claimed that. You're doing potentially huge dimensionality reduction; uniqueness would be like saying you cannot have MD5 collisions.
I think maybe it's poorly phrased. As far as I can tell, their linear regression example for eq. 2 has a unique solution, but I think they state that when optimizing for cosine similarity you can find non-unique solutions. But I haven't read it in detail.
Then again, you could argue whether that is a problem when considering very high-dimensional embeddings. Their conclusions seem to point in that direction, but I would not agree with that.
Embeddings result from computing which words can appear in a given context, so words that would appear in the same spot will have a higher cosine score between themselves.
But it doesn't differentiate further, so you can have "beautiful" and "ugly" embed very close to each other even though they are opposites - they tend to appear in similar places.
Another limitation of embeddings and cosine-similarity is that they can't tell you "how similar" - is it equivalence or just relatedness? They make a mess of equivalent, antonymous and related things.
For modern embedding models which effectively mean-pool the last hidden state of LLMs (and therefore make use of their optimizations, such as attention tricks), embeddings can be much more robust to different contexts, both local and global.
The last one I have in mind is BERT and its variants.
All embeddings are the first layer of a DNN. In the case of word2vec this is a shallow 2-layer network. Selecting an embedding is a multiplication of the embedding matrix by a one-hot vector, which is usually implemented as an array lookup.
We need embeddings to give relatedness across axes like synonymy, etc.
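Regarding the one-hot multiplication mentioned above, a minimal numpy sketch (toy sizes, nothing model-specific) of why it reduces to a row lookup:

    import numpy as np

    vocab_size, dim = 5, 3
    E = np.arange(vocab_size * dim, dtype=float).reshape(vocab_size, dim)  # embedding matrix

    word_id = 2
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0

    via_matmul = one_hot @ E   # the matrix product selects one row of E ...
    via_lookup = E[word_id]    # ... which frameworks implement as a plain lookup

    assert np.allclose(via_matmul, via_lookup)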
That is why computational linguists prefer the term related over similar here.
Similarity is notoriously hard to define, for starters in terms of grammatical vs. semantic similarity.
Only if those two words appear in the same contexts with the same frequency. In natural language this is probably not the case. There are things typically considered beautiful and others typically considered ugly.
One key insight is that opposites do have a lot in common: they are often opposites in exactly one of n feature dimensions. For example, black and white are both (arguably) colors, are related to (lack of) light, have representations in various formats (RGB, …), and appear in the same grammatical position among ordered adjectives …
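A toy illustration of that point, with made-up feature vectors (purely hypothetical, not taken from any real model): two vectors that agree on every dimension except one "polarity" coordinate still get a high cosine similarity.

    import numpy as np

    def cos_sim(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Hypothetical features: identical except for the last "polarity" coordinate.
    black = np.array([1.0] * 9 + [-1.0])
    white = np.array([1.0] * 9 + [+1.0])

    print(cos_sim(black, white))  # 0.8, despite the vectors being "opposites"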
I think your comment is very interesting; I have reflected many times on how to differentiate things that appear in the same context from things that are similar. Any big idea here could be the spark for a great startup.
They make a mess of language. They are not a suitable representation. They are suitable for their efficiency in information retrieval systems and for sometimes crudely capturing semantic attributes in a way that is unreliable and uninterpretable. It ends there. Here's to ten more years of word2vec.
Distance metrics are an interesting topic. The field of ecology has a ton of them. For example, see vegdist, the Dissimilarity Indices for Community Ecologists function in the vegan package in R:
https://rdrr.io/cran/vegan/man/vegdist.html
which includes, among others, the "canberra", "clark", "bray", "kulczynski", "gower", "altGower", "morisita", "horn", "mountford", "raup", "chao", "cao", "mahalanobis", "chord", "hellinger", "aitchison", or "robust.aitchison".
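For readers on the Python side, a few of these have rough counterparts in scipy.spatial.distance (a sketch only; the vegan implementations may differ in details such as standardization):

    import numpy as np
    from scipy.spatial import distance

    # Toy "species count" vectors for two sites.
    x = np.array([4.0, 0.0, 3.0, 1.0])
    y = np.array([1.0, 2.0, 0.0, 1.0])

    print("bray-curtis:", distance.braycurtis(x, y))
    print("canberra:   ", distance.canberra(x, y))
    print("cosine:     ", distance.cosine(x, y))
    print("euclidean:  ", distance.euclidean(x, y))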
Generic distance metrics can often be replaced with context-specific ones for better utility; it makes me wonder whether that insight could be useful in deep learning.
hell if I know!! Sorry. I've used the vegan package for some analyses, but I've mostly used Manhattan and cosine metrics. I just wanted to bring up the idea that there are a lot of metrics out there that may not be generally appreciated.
Claude Opus says "There are a few good distance metrics commonly used with latent embeddings to promote diversity and prevent model collapse:
1. Euclidean Distance (L2 Distance)
2. Cosine distance
3. Kullback-Leibler (KL) Divergence:
KL divergence quantifies how much one probability distribution differs from another. It can be used to measure the difference between the distributions of latent embeddings. Minimizing KL divergence as a diversity loss would encourage the embedding distribution to be more uniform.
4. Maximum Mean Discrepancy (MMD):
MMD measures the difference between two distributions by comparing their moments (mean, variance, etc.) in a reproducing kernel Hilbert space. It's useful for comparing high-dimensional distributions like those of embeddings. MMD loss promotes diversity by penalizing embeddings that are too clustered together.
5. Gaussian Annulus Loss:
This loss function encourages embeddings to lie within an annulus (ring) in the latent space defined by two Gaussian distributions. It promotes uniformity in the embedding norms while allowing angular diversity. This can be effective at preventing collapse to a single point."
But I haven't checked for hallucinations.
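To make item 4 in the quoted list concrete, here is a rough numpy sketch of a biased RBF-kernel MMD estimate between two sets of embeddings (toy data and an arbitrary bandwidth; in practice sigma would need tuning, and this is only my reading of the quoted list, not anything from the paper):

    import numpy as np

    def rbf_kernel(X, Y, sigma=1.0):
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    def mmd2(X, Y, sigma=1.0):
        # Biased estimate of squared MMD between samples X and Y.
        return (rbf_kernel(X, X, sigma).mean()
                - 2 * rbf_kernel(X, Y, sigma).mean()
                + rbf_kernel(Y, Y, sigma).mean())

    rng = np.random.default_rng(0)
    emb_a = rng.normal(0.0, 1.0, size=(200, 16))  # spread-out embeddings
    emb_b = rng.normal(0.0, 0.1, size=(200, 16))  # near-collapsed embeddings

    print(mmd2(emb_a[:100], emb_a[100:]))  # same distribution: close to 0
    print(mmd2(emb_a, emb_b))              # collapsed vs. diverse: clearly larger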
I quickly read through the paper. One thing to note is that they use the Frobenius norm (at least I suppose this from the subscript F) for the matrix factorization. That is for their learning algorithm. Then they use cosine similarity to evaluate - a metric that wasn't used in the algorithm.
This is a long-standing question for me. Theoretically, I should use the CS in my optimization and then also in the evaluation. But I haven't tested this empirically.
For example, there is spherical k-means, which clusters the data on the unit sphere.
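For reference, a minimal sketch of spherical k-means in numpy (a bare-bones version for illustration, not a production implementation): points and centroids are kept on the unit sphere and assignment uses cosine similarity.

    import numpy as np

    def spherical_kmeans(X, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project data to the unit sphere
        C = X[rng.choice(len(X), size=k, replace=False)]   # init centroids from the data
        for _ in range(n_iter):
            labels = np.argmax(X @ C.T, axis=1)            # assign by cosine similarity
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    c = members.sum(axis=0)
                    C[j] = c / np.linalg.norm(c)           # re-normalize the centroid
        return labels, C

    X = np.random.default_rng(1).normal(size=(300, 8))
    labels, centroids = spherical_kmeans(X, k=4)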
I think that's kind of the point of the paper. The model is based on un-normalized dot products, and wasn't deliberately designed to produce meaningful cosine similarities. They are showing that, in that case, cosine similarities might be arbitrary and not as useful as people might assume or hope.
Why would anyone expect cosine-similarity to be a useful metric? In the real world, the arbitrary absolute position of an object in the universe (if it could be measured) isn't that important; it's the directions and distances to nearby objects that matter most.
A direction can be given in terms of an angle measure, such as cosine.
It's my understanding that the delta between two word embeddings gives a direction, and the magic is from using those directions to get to new words. The oft-cited example is King-Man+Woman = Queen [1]. When did this view fall from favor?
[1] https://www.technologyreview.com/2015/09/17/166211/king-man-...
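That analogy query can still be reproduced with gensim and the pretrained word2vec vectors; a sketch (the file path is a placeholder for wherever the GoogleNews vectors live locally, and most_similar ranks candidates by cosine similarity):

    from gensim.models import KeyedVectors

    # Placeholder path: the pretrained GoogleNews word2vec binary must be downloaded first.
    kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                           binary=True)

    # "king" - "man" + "woman" -> nearest neighbours by cosine similarity
    print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))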
The scale of word embeddings (e.g. distance from 0) is mainly measuring how common the word is in the training corpus. This is a feature of almost all training objectives since word2vec (though some normalize the vectors).
Uncommon words have more information content than common words. So, common words having larger embedding scale is an issue here.
If you want to measure similarity you need a scale-free measure. Cosine similarity (angle distance) does it without normalizing.
If you normalize your vectors, cosine similarity is the same as Euclidean distance. Normalizing your vectors also leads to information destruction, which we'd rather avoid.
There's no real hard theory why the angle between embeddings is meaningful beyond this practical knowledge to my understanding.
If you normalize your vectors, cosine similarity is the same as dot product. Euclidean distance is still different.
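For what it's worth, the two claims above are closer than they look: for unit vectors the dot product is the cosine, and squared Euclidean distance is a monotone function of it, ||a - b||^2 = 2 - 2 cos(a, b), so nearest-neighbour rankings agree even though the numbers differ. A quick check:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=8), rng.normal(size=8)
    a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # unit-normalize

    cos = float(a @ b)                                    # dot product = cosine here
    print(np.isclose(np.sum((a - b) ** 2), 2 - 2 * cos))  # True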
Cosine-similarity is a useful metric. The cases where it is useful are models that have been trained specifically to produce a meaningful cosine distance (e.g. OpenAI's CLIP [1], Sentence Transformers [2]) - but these are the types of models that the majority of people are using when they use cosine distances.
> It's my understanding that the delta between two word embeddings, gives a direction, and the magic is from using those directions to get to new words... it's the directions and distances to nearby objects that matters most
Cosine similarity is a kind of "delta" / inverse distance between the representation of two entities, in the case of these models.
[1] https://arxiv.org/abs/2103.00020
[2] https://www.sbert.net/docs/training/overview.html
From my experience trying to train embeddings from transformers, using cosine similarity is less restrictive for the model than Euclidean distance. Both work, but cosine similarity seems to have slightly better performance.
Another thing you have to keep in mind is that these embeddings are in n-dimensional space. Intuitions about the real world do not apply there.
The word2vec inspired tricks like king-man+woman only work if the embedding is trained with synonym/antonym triplets to give them the semantic locality that allows that kind of vector math. This isn't always done, even some word2vec re-implementations skip this step completely. Also, not all embeddings are word embeddings.
My understanding was that Word2Vec[1] was trained on Wikipedia and other such texts, not artificially constructed things like the triplets you suggest. There's an inherent structure present in human languages that enables the "magic" of embeddings to work, as far as I can tell.
[1] https://code.google.com/archive/p/word2vec/
The paper kinda leaves you hanging on the "alternatives" front, even though they have a section dedicated to it.
In addition to the _quality_ of any proposed alternative(s), computational speed also has to be a consideration. I've run into multiple situations where you want to measure similarities on the order of millions/billions of times. Especially for realtime applications (like RAG?) speed may even outweigh quality.
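On the speed point: one reason cosine similarity is hard to beat operationally is that, after L2-normalizing the corpus once, scoring a query against every item is a single matrix-vector product. A toy-sized sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(100_000, 128)).astype(np.float32)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # normalize once, offline

    query = rng.normal(size=128).astype(np.float32)
    query /= np.linalg.norm(query)

    scores = corpus @ query            # cosine similarities for the whole corpus
    top5 = np.argsort(-scores)[:5]     # indices of the most similar items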
> While cosine similarity is invariant under such rotations R, one of the key insights in this paper is that the first (but not the second) objective is also invariant to rescalings of the columns of A and B
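A toy numeric illustration of the quoted insight (notation loosely follows the paper's A and B; the matrices here are random, purely for illustration): rescaling the columns of A by a diagonal D and the columns of B by D^-1 leaves the fitted dot products A B^T unchanged, while the cosine similarities between rows of A change.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 3))
    B = rng.normal(size=(5, 3))
    D = np.diag([0.1, 1.0, 10.0])

    A2, B2 = A @ D, B @ np.linalg.inv(D)

    def row_cosines(M):
        Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
        return Mn @ Mn.T

    print(np.allclose(A @ B.T, A2 @ B2.T))               # True: the objective can't tell them apart
    print(np.allclose(row_cosines(A), row_cosines(A2)))  # False: cosine similarities differ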
Ha, interesting - I wrote a blog post pointing this out a few years ago [1], along with how we got around it for item-item similarity at an old job (essentially an implicit re-projection to the original space, as noted in section 3).
[1] https://swarbrickjones.wordpress.com/2016/11/24/note-on-an-i...