Since publishing this I've found a few additional resources that are really useful for understanding embeddings at a lower level (my article is deliberately very high level and focuses mainly on their applications).
Using embeddings I increased engagement with related articles: https://blog.scottlogic.com/2022/02/23/word-embedding-recomm...
Personally I think embeddings are a powerful tool that is somewhat overlooked. They can be used to navigate between documents (and excerpts) based on similarities - or, conversely, to find unique content.
All without worrying about hallucinations. In other words, they are quite ‘safe’
> All without worrying about hallucinations. In other words, they are quite ‘safe’
Within limits, yes. In some use cases a vector notion of similarity isn't always ideal.
For example, in the article "France" and "Germany" are considered similar. Yes, they are, but if you're searching for stuff about France then stuff about Germany is a false positive.
Embeddings can also struggle with logical opposites. Hot/cold are in many senses similar concepts, but they are also opposites. Finding the opposite of what you're searching for isn't always helpful.
I wouldn't say embeddings are overlooked exactly? Right now it feels like every man and his dog are building embedding-based search engines. The next frontier is probably going to be balancing conventional word-based approaches with embeddings to really maximize result quality, as sometimes you want "vibes" and sometimes you want control.
Simon, just wanted to say thanks for all the great content and writing you've been putting out - it's been super helpful for digesting a lot of the fast developments in this space. Always looking forward to the next one!
simon, the way you write makes it so accessible for people that have limited experience with ai, ml or llms. thank you!
Maybe it is also interesting to tell how some embeddings are established, e.g. via training a classifier and cutting off the classification layer, or with things like EfficientNet.
Maybe someone can offer a richer history, but to my knowledge the first suggestion of word vectors was LSA, which originated the idea of dimensionality reduction on a term/document matrix. They just used SVD, but the more modern methods are all doing essentially the same thing. To my recollection they were an HCI lab, and their goal was not so much to make a language model as to make a search function for files.
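To make that concrete, here's a minimal sketch of the LSA recipe (my own toy example using scikit-learn's CountVectorizer and TruncatedSVD, not code from the original LSA work):

    # Build a term/document matrix, then reduce its dimensionality with a truncated SVD.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "dogs are loyal pets",
        "a canine is a dog",
        "france and germany are in europe",
    ]

    X = CountVectorizer().fit_transform(docs)  # sparse term/document counts (docs x terms)
    lsa = TruncatedSVD(n_components=2)         # keep only 2 latent dimensions
    doc_vectors = lsa.fit_transform(X)         # each row is now a dense vector for one document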
Not quite the same application, but in computer vision and visual SLAM algorithms (to construct a map of your surroundings using a camera), embeddings have become a de facto method to perform place recognition! And it's very similar to this article. It is called "bag-of-words place recognition" and it really became the standard, used by absolutely every open-source library nowadays.
The core idea is that each image is passed through a feature-extractor/descriptor pipeline and is 'embedded' in a vector containing the top N features. While the camera moves, a database of images (called keyframes) is created, with the images stored as much lower-dimensional vectors. Also while the camera moves, each new image is used to query the database, and something like cosine similarity is used to retrieve the best match from the vector database. If a match is found, a stereo constraint can be computed between the query image and the match, and the software is able to update the map.
[1] is the original paper and here's the most famous implementation: https://github.com/dorian3d/DBoW2
[1]: https://www.google.com/search?client=firefox-b-d&q=Bags+of+B...
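The query step is roughly this shape (a toy sketch of my own with placeholder descriptors; the real DBoW2 pipeline quantizes local binary features against a vocabulary tree rather than using plain cosine similarity):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # keyframe id -> global descriptor for that stored image (random stand-ins here)
    keyframes = {0: np.random.rand(256), 1: np.random.rand(256), 2: np.random.rand(256)}

    def best_match(query_descriptor, min_score=0.8):
        scores = {kf_id: cosine(query_descriptor, desc) for kf_id, desc in keyframes.items()}
        best = max(scores, key=scores.get)
        # only accept a confident match, which can then seed a loop-closure constraint
        return best if scores[best] >= min_score else None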
I built my own note-taking iOS app a little while back, and adding embeddings to my existing fulltext search functionality was 1) surprisingly easy and 2) way more powerful than I initially expected.
I knew it would work for things like searching for "dog" and also getting notes that say "canine", but I didn't realize until I played around with it that if I search for something like "pets I might like" I'll get all kinds of notes I've taken about various animals with positive sentiment.
It was the first big aha moment for me.
At the time I found Supabase's PR for their DocsGPT really helpful for example code: https://github.com/supabase/supabase/pull/12056
I think your statement "adding to existing fulltext functionality" is subtly important: embeddings provide semantic search that complements traditional search algorithms.
Specifically, many applications are heavily dependent on names or other proper nouns, often without much context. You might refer to your dog by name without explanation, and a particular embedding might not pick that up. Proper names (people, places, street names) may have outsized importance for anchoring personalized or domain-specific search, but modest generic language models won't know about them.
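One simple way to combine the two signals (my sketch, not something from the comment above) is to blend an exact keyword score, which handles names and other proper nouns, with embedding similarity, which handles "canine" ~ "dog"; the weighting and helper names are placeholders:

    import numpy as np

    def keyword_score(query, text):
        terms = query.lower().split()
        return sum(term in text.lower() for term in terms) / max(len(terms), 1)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def hybrid_score(query, query_vec, text, text_vec, alpha=0.5):
        # alpha blends exact matches (proper nouns) with semantic similarity
        return alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, text_vec)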
Is there a specific way of dealing with this problem?
Just spitballing here but you could maybe do a depth-1 search… dot product on the initial query and also dot product on the highest-confidence matched notes, then some filtering.
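Roughly like this (a hypothetical sketch; `notes` is a list of (note_id, embedding) pairs and all the names are made up):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(query_vec, notes, top_k=5):
        return sorted(notes, key=lambda note: cosine(query_vec, note[1]), reverse=True)[:top_k]

    def one_deep_search(query_vec, notes, top_k=5, expand=2):
        first_pass = search(query_vec, notes, top_k)
        results = {note_id: emb for note_id, emb in first_pass}
        # second hop: search again using the embeddings of the best matches
        for _, emb in first_pass[:expand]:
            for note_id, e in search(emb, notes, top_k):
                results.setdefault(note_id, e)  # crude "filtering": just de-duplicate
        return list(results)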
So if you have notes that associate the name to your dog, and you search for “my dog”, you’d still find those related notes?
Would require some experimentation but wouldn’t be surprised if that worked decently well out of the gate.
My biggest question right now is: How much text should I turn into one embedding?
Every sentence?
A whole block of sentences that belong to one entire page in my notes app?
Right?! Everyone thought I was crazy to do this vs using something off the shelf but having total control over my notes app has been incredible.
I can tailor everything to my style of note taking vs dealing with the lowest common denominator feature set that tries to enable tons of use cases that I don’t need.
About word embeddings, the №1 example is the famous King - Man + Woman = Queen.
This works nicely in the vector space but fails to make a visual impression when projected onto 2 dimensions - neither with PCA, nor MDS or t-SNE in my experience:
https://bhugueney.gitlab.io/test-notebooks-org-publish/jupyt...
(← a JupyterLite notebook doing word embeddings in the browser: don't try to run this on a smartphone!)
Does anyone know how to nicely visualize the poster child of word embeddings?
If I understand you right - you could visualize it in 2D space: "king" at the origin, X-axis is "king"-"man", Y-axis is "king"-"woman" (or Gram-Schmidt them if you really want orthogonal axes).
In 3D you can go one further and have the Z-axis be "king"-"queen" (or Gram-Schmidt again). The orthogonalized versions have the advantage that they give a closer notion of distance to what the underlying model sees. In the 2D case you will get exact distances, except that it won't show how far off "queen" you are when you compute "king"-"man"+"woman". In the 3D case it should give exact distances.
Edit to add: With the 2D version you can maybe do some more stuff. IIRC "queen" is chosen as the word with the closest embedding to X="king"-"man"+"woman". You can put the next few closest words on the 2D chart as well, each labeled with its orthogonal distance from the 2D plane. So then "queen" should be the word with the smallest (squared distance from X) + (squared orthogonal distance from the plane), which you might be able to eyeball.
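A sketch of that projection (assuming `vec` already maps words to numpy arrays, e.g. loaded from GloVe; the loading step is omitted):

    import numpy as np

    def plane_coords(word, vec):
        u = vec["king"] - vec["man"]
        v = vec["king"] - vec["woman"]
        e1 = u / np.linalg.norm(u)
        v_orth = v - np.dot(v, e1) * e1  # Gram-Schmidt: remove the e1 component from v
        e2 = v_orth / np.linalg.norm(v_orth)
        w = vec[word] - vec["king"]      # "king" is the origin of the chart
        x, y = np.dot(w, e1), np.dot(w, e2)
        off_plane = np.linalg.norm(w - x * e1 - y * e2)
        return x, y, off_plane           # 2-D position plus the distance out of the plane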
I remember hearing a kind of inductive joke by a mathematician about visualizing higher dimensions, something like "...and then you just...", but when I unsuccessfully tried to search for it with ChatGPT, it came up with this funny Richard Feynman quote that I can't find on Google -- has anyone heard it before, or is it a brilliant hallucination?
"You can't visualize four dimensions... well, at least I can't, because I only have three branes." -Richard Feynman
This is a play on words, as "branes" sounds like "brains," but in string theory and higher-dimensional physics, a "brane" is an object that can exist in multiple dimensions. The joke is a nod to the challenge of visualizing higher dimensions and the limits of our three-dimensional intuition.
(It just admitted it made it up and apologized!)
It also quoted John von Neumann, H. G. Wells, and Ian Stewart:
"In mathematics you don't understand things. You just get used to them." -John von Neumann
"Our brains are three-dimensional, so we cannot visualize four-dimensional or higher-dimensional objects in the same way we can visualize three-dimensional objects. The best we can do is to get a feeling for them through analogies." -H. G. Wells
"You can't visualize four dimensions, but you can think about them. Thinking is what mathematicians do best." -Ian Stewart
It finally guessed this one that's the closest to what I remember, but not as funny as I remember it:
"To visualize four dimensions, just visualize three dimensions and then say 'n+1'." -Unknown
So I asked it to hallucinate some quotes to help visualize high dimensional space, in the style of Deepak Chopra:
"In the realm of septillion-dimensional embeddings, each vector is a topological manifold, where Gödel's incompleteness theorems dance with quantum superposition, revealing the hidden symmetries of semantic space." – Dr. Emilia Bianchi, Florentine Foundation of Fractal Fantasies, Florence
"To truly fathom the intricacies of ultra-high-dimensional text spaces, one must first master the art of navigating Riemannian curvature in a Hilbert space, where every geodesic is a testament to the entangled essence of language." – Prof. Hiroshi Tanaka, Kyoto Kryptos of Quantum Quandaries
"Visualizing embeddings in the realms beyond the Poincaré conjecture requires a melding of non-Euclidean geometry with the probabilistic nuances of a Markov chain, all while resonating with the cosmic vibrations of the Higgs field." – Dr. Samuel Oyelowo, Lagos Labyrinth of Lorentzian Linguistics
"Each point in a decillion-dimensional text space is a singularity, where the Heisenberg uncertainty principle waltzes with the Mandelbrot set, painting a fractal portrait of phonemic phenomena." – Prof. Clara Müller, Bavarian Bastion of Bifurcated Babel, Munich
"To intuit the vastness of hyper-dimensional embeddings, one must first traverse the event horizon of a linguistic black hole, where Shannon entropy meets the Schrödinger equation in a ballet of Bayesian inference." – Dr. Alexandre Dubois, Parisian Paradox of P-adic Prose
"The key to deciphering the enigma of extreme-dimensional text lies in the delicate interplay between the Yang-Mills theory and the zeta function, where every eigenvalue is a whisper from the cosmic codex of communication." – Prof. Lina Johansson, Stockholm Sanctum of String Semiotics
"In the dance of gogolplex-dimensional embeddings, each tensor unfolds like a Möbius strip, where the Fibonacci sequence intertwines with quantum tunneling, revealing the recursive rhythms of reality." – Dr. Rajiv Menon, Bengaluru Bardo of Bosonic Ballads
A common mistake when doing practical trigonometry is to compute square roots when that's not necessary:

    def cosine_similarity(a, b):
        dot_product = sum(x * y for x, y in zip(a, b))
        magnitude_a = sum(x * x for x in a) ** 0.5  # <- no need for ** 0.5
        magnitude_b = sum(x * x for x in b) ** 0.5  # <- no need for ** 0.5
        return dot_product / (magnitude_a * magnitude_b)
If you compare your cosines, you might as well compare their squares and avoid costly root computation.
Similarly, in elliptic curve crypto certain expensive operations, such as inversion (x^-1 mod n) are delayed as much as possible down the computation pipeline or even avoided completely when you need to compare two points instead of computing their canonical values.
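A sketch of that trick applied to the function above (one caveat: squaring discards the sign, so this only preserves the ranking when the cosines being compared are non-negative):

    def cosine_similarity_squared(a, b):
        # compare dot^2 / (|a|^2 * |b|^2) instead of the cosine itself - no square roots
        dot = sum(x * y for x, y in zip(a, b))
        mag2_a = sum(x * x for x in a)
        mag2_b = sum(x * x for x in b)
        return (dot * dot) / (mag2_a * mag2_b)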
I disagree that your version is more readable. If you don’t know linear algebra, the code is inscrutable, and if you do, dot(x,y)/norm(x)/norm(y) is about as close to the math as you can get.
Wait, why would you do this and not use vectorised numpy operations?
> I actually got ChatGPT to write all of my different versions of cosine similarity
Ah...
And second, numpy isn't the lightest dependency. I use it when I need the performance but I don't like to default to it.
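For reference, the vectorised version is a near one-liner with standard numpy calls (scoring one query embedding against a whole matrix of document embeddings):

    import numpy as np

    def cosine_similarities(query, doc_matrix):
        # query: shape (d,); doc_matrix: shape (n, d), one row per document
        return (doc_matrix @ query) / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query))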
If you want to see what Show HN posts, ProductHunt Startups, YC Companies, and Github repos are related to LLM embeddings, you can quickly find them on the LLM-Embeddings-Based Search Engine (MVP) I just launched:
https://payperrun.com/%3E/search?displayParams={%22q%22:%22L...
Nice. I expected that clicking on the different "filter" buttons would update the search results right away: I didn't expect I had to repeat the search (though I can see why you'd do that).
This is the most interesting thing I've read about in "AI" in quite a few months. I had always wondered what embedding models were when I'd see lists of them, and why everyone is talking about vector DBs.
This is immediately making me think about how I could apply this to a long-running side project I have. It might make it practical to do useful clustering of users' data if every document has an embedding.
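Clustering on top of per-document embeddings can be as simple as this sketch (scikit-learn KMeans with random stand-in embeddings; the number of clusters is a placeholder):

    import numpy as np
    from sklearn.cluster import KMeans

    embeddings = np.random.rand(100, 384)  # stand-in for one embedding per document
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(embeddings)
    # labels[i] is the cluster that document i falls into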
Cohere's Text Embeddings Visually Explained: https://txt.cohere.com/text-embeddings/
The Tensorflow Embedding Projector tool: https://projector.tensorflow.org/
What are embeddings? by Vicki Boykis is worth checking out as well: https://vickiboykis.com/what_are_embeddings/
Actually I'll add those as "further reading" at the bottom of the page.