Since publishing this I've found a few additional resources that are really useful for understanding embeddings at a lower level (my article is deliberately very high level and focuses mainly on their applications).
Using embeddings I increased engagement with related articles: https://blog.scottlogic.com/2022/02/23/word-embedding-recomm...
Personally I think embeddings are a powerful tool that is somewhat overlooked. They can be used to navigate between documents (and excerpts) based on similarities - or, conversely, to find unique content.
All without worrying about hallucinations. In other words, they are quite ‘safe’
> All without worrying about hallucinations. In other words, they are quite ‘safe’
Within limits, yes. In some use cases a vector notion of similarity isn't always ideal.
For example, in the article "France" and "Germany" are considered similar. Yes, they are, but if you're searching for stuff about France then stuff about Germany is a false positive.
Embeddings can also struggle with logical opposites. Hot/cold are in many senses similar concepts, but they are also opposites. Finding the opposite of what you're searching for isn't always helpful.
I wouldn't say embeddings are overlooked exactly? Right now it feels like every man and his dog are building embedding-based search engines. The next frontier is probably going to be balancing conventional word-based approaches with embeddings to really maximize result quality, as sometimes you want "vibes" and sometimes you want control.
Simon, just wanted to say thanks for all the great content and writing you've been putting out - it's been super helpful for digesting a lot of the fast developments in this space. Always looking forward to the next one!
simon, the way you write makes it so accessible for people that have limited experience with ai, ml or llms. thank you!
Maybe it is also interesting to tell how some embeddings are established, e.g. via training a classifier and cutting off the classification layer, or with things like EfficientNet.
Maybe someone can offer a richer history, but to my knowledge the first suggestion of word vectors was LSA, which originated the idea of dimensionality reduction on a term/document matrix. They just used SVD, but the more modern methods are all doing essentially the same thing. To my recollection they were an HCI lab, and their goal was not so much to make a language model as to make a search function for files.
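To make that concrete, here's a minimal sketch of the LSA recipe (my own toy example using scikit-learn's CountVectorizer and TruncatedSVD, not code from the original LSA work):

    # Build a term/document matrix, then reduce its dimensionality with a truncated SVD.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "dogs are loyal pets",
        "a canine is a dog",
        "france and germany are in europe",
    ]

    X = CountVectorizer().fit_transform(docs)  # sparse term/document counts (docs x terms)
    lsa = TruncatedSVD(n_components=2)         # keep only 2 latent dimensions
    doc_vectors = lsa.fit_transform(X)         # each row is now a dense vector for one document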
Not quite the same application, but in computer vision and visual SLAM algorithms (to construct a map of your surroundings using a camera), embeddings have become a de facto method to perform place recognition! And it's very similar to this article. It is called "bag-of-words place recognition" and it really became the standard, used by absolutely every open-source library nowadays.
The core idea is that each image is passed through a feature-extractor/descriptor pipeline and is 'embedded' in a vector containing the top N features. While the camera moves, a database of images (called keyframes) is created, with the images stored as much lower-dimensional vectors. Also while the camera moves, each new image is used to query the database, and something like cosine similarity is used to retrieve the best match from the vector database. If a match is found, a stereo constraint can be computed between the query image and the match, and the software is able to update the map.
[1] is the original paper and here's the most famous implementation: https://github.com/dorian3d/DBoW2
[1]: https://www.google.com/search?client=firefox-b-d&q=Bags+of+B...
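The query step is roughly this shape (a toy sketch of my own with placeholder descriptors; the real DBoW2 pipeline quantizes local binary features against a vocabulary tree rather than using plain cosine similarity):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # keyframe id -> global descriptor for that stored image (random stand-ins here)
    keyframes = {0: np.random.rand(256), 1: np.random.rand(256), 2: np.random.rand(256)}

    def best_match(query_descriptor, min_score=0.8):
        scores = {kf_id: cosine(query_descriptor, desc) for kf_id, desc in keyframes.items()}
        best = max(scores, key=scores.get)
        # only accept a confident match, which can then seed a loop-closure constraint
        return best if scores[best] >= min_score else None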
I built my own note-taking iOS app a little while back, and adding embeddings to my existing fulltext search functionality was 1) surprisingly easy and 2) way more powerful than I initially expected.
I knew it would work for things like searching for "dog" and also getting notes that say "canine", but I didn't realize until I played around with it that if I search for something like "pets I might like" I'll get all kinds of notes I've taken about various animals with positive sentiment.
It was the first big aha moment for me.
At the time I found Supabase's PR for their DocsGPT really helpful for example code: https://github.com/supabase/supabase/pull/12056
I think your statement "adding to existing fulltext functionality" is subtly important: embeddings provide semantic search that complements traditional search algorithms.
Specifically, many applications are heavily dependent on names or other proper nouns, often without much context. You might refer to your dog by name without explanation, and a particular embedding might not pick that up. Proper names (people, places, street names) may have outsized importance for anchoring personalized or domain-specific search, but modest generic language models won't know about them.
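One simple way to combine the two signals (my sketch, not something from the comment above) is to blend an exact keyword score, which handles names and other proper nouns, with embedding similarity, which handles "canine" ~ "dog"; the weighting and helper names are placeholders:

    import numpy as np

    def keyword_score(query, text):
        terms = query.lower().split()
        return sum(term in text.lower() for term in terms) / max(len(terms), 1)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def hybrid_score(query, query_vec, text, text_vec, alpha=0.5):
        # alpha blends exact matches (proper nouns) with semantic similarity
        return alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, text_vec)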
Is there a specific way of dealing with this problem?
Just spitballing here but you could maybe do a depth-1 search… dot product on the initial query and also dot product on the highest-confidence matched notes, then some filtering.
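Roughly like this (a hypothetical sketch; `notes` is a list of (note_id, embedding) pairs and all the names are made up):

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(query_vec, notes, top_k=5):
        return sorted(notes, key=lambda note: cosine(query_vec, note[1]), reverse=True)[:top_k]

    def one_deep_search(query_vec, notes, top_k=5, expand=2):
        first_pass = search(query_vec, notes, top_k)
        results = {note_id: emb for note_id, emb in first_pass}
        # second hop: search again using the embeddings of the best matches
        for _, emb in first_pass[:expand]:
            for note_id, e in search(emb, notes, top_k):
                results.setdefault(note_id, e)  # crude "filtering": just de-duplicate
        return list(results)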
So if you have notes that associate the name to your dog, and you search for “my dog”, you’d still find those related notes?
Would require some experimentation but wouldn’t be surprised if that worked decently well out of the gate.
My biggest question right now is: How much text should I turn into one embedding?
Every sentence?
A whole block of sentences that belong to one entire page in my notes app?
Right?! Everyone thought I was crazy to do this vs using something off the shelf but having total control over my notes app has been incredible.
I can tailor everything to my style of note taking vs dealing with the lowest common denominator feature set that tries to enable tons of use cases that I don’t need.
About word embeddings, the №1 example is the famous King - Man + Woman = Queen.
This works nicely in the vector space but fails to make a visual impression when projected onto 2 dimensions - neither with PCA, nor MDS or t-SNE in my experience:
https://bhugueney.gitlab.io/test-notebooks-org-publish/jupyt...
(← a JupyterLite notebook doing word embeddings in the browser: don't try to run this on a smartphone!)
Does anyone know how to nicely visualize the poster child of word embeddings?
If I understand you right - you could visualize it in 2D space: "king" at the origin, X-axis is "king"-"man", Y-axis is "king"-"woman" (or Gram-Schmidt them if you really want orthogonal axes).
In 3D you can go one further and have the Z-axis be "king"-"queen" (or Gram-Schmidt again). The orthogonalized versions have the advantage that they give a closer notion of distance to what the underlying model sees. In the 2D case you will get exact distances, except that it won't show how far off "queen" you are when you compute "king"-"man"+"woman". In the 3D case it should give exact distances.
Edit to add: With the 2D version you can maybe do some more stuff. IIRC "queen" is chosen as the word with the closest embedding to X="king"-"man"+"woman". You can put the next few closest words on the 2D chart as well, each labeled with its orthogonal distance from the 2D plane. So then "queen" should be the word with the smallest (squared distance from X) + (squared orthogonal distance from the plane), which you might be able to eyeball.
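A sketch of that projection (assuming `vec` already maps words to numpy arrays, e.g. loaded from GloVe; the loading step is omitted):

    import numpy as np

    def plane_coords(word, vec):
        u = vec["king"] - vec["man"]
        v = vec["king"] - vec["woman"]
        e1 = u / np.linalg.norm(u)
        v_orth = v - np.dot(v, e1) * e1  # Gram-Schmidt: remove the e1 component from v
        e2 = v_orth / np.linalg.norm(v_orth)
        w = vec[word] - vec["king"]      # "king" is the origin of the chart
        x, y = np.dot(w, e1), np.dot(w, e2)
        off_plane = np.linalg.norm(w - x * e1 - y * e2)
        return x, y, off_plane           # 2-D position plus the distance out of the plane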
I remember hearing a kind of inductive joke by a mathematician about visualizing higher dimensions, something like "...and then you just...", but when I unsuccessfully tried to search for it with ChatGPT, it came up with this funny Richard Feynman quote that I can't find on Google -- has anyone heard it before, or is it a brilliant hallucination?
"You can't visualize four dimensions... well, at least I can't, because I only have three branes." -Richard Feynman
This is a play on words, as "branes" sounds like "brains," but in string theory and higher-dimensional physics, a "brane" is an object that can exist in multiple dimensions. The joke is a nod to the challenge of visualizing higher dimensions and the limits of our three-dimensional intuition.
(It just admitted it made it up and apologized!)
It also quoted John von Neumann, H. G. Wells, and Ian Stewart:
"In mathematics you don't understand things. You just get used to them." -John von Neumann
"Our brains are three-dimensional, so we cannot visualize four-dimensional or higher-dimensional objects in the same way we can visualize three-dimensional objects. The best we can do is to get a feeling for them through analogies." -H. G. Wells
"You can't visualize four dimensions, but you can think about them. Thinking is what mathematicians do best." -Ian Stewart
It finally guessed this one that's the closest to what I remember, but not as funny as I remember it:
"To visualize four dimensions, just visualize three dimensions and then say 'n+1'." -Unknown
So I asked it to hallucinate some quotes to help visualize high dimensional space, in the style of Deepak Chopra:
"In the realm of septillion-dimensional embeddings, each vector is a topological manifold, where Gödel's incompleteness theorems dance with quantum superposition, revealing the hidden symmetries of semantic space." – Dr. Emilia Bianchi, Florentine Foundation of Fractal Fantasies, Florence
"To truly fathom the intricacies of ultra-high-dimensional text spaces, one must first master the art of navigating Riemannian curvature in a Hilbert space, where every geodesic is a testament to the entangled essence of language." – Prof. Hiroshi Tanaka, Kyoto Kryptos of Quantum Quandaries
"Visualizing embeddings in the realms beyond the Poincaré conjecture requires a melding of non-Euclidean geometry with the probabilistic nuances of a Markov chain, all while resonating with the cosmic vibrations of the Higgs field." – Dr. Samuel Oyelowo, Lagos Labyrinth of Lorentzian Linguistics
"Each point in a decillion-dimensional text space is a singularity, where the Heisenberg uncertainty principle waltzes with the Mandelbrot set, painting a fractal portrait of phonemic phenomena." – Prof. Clara Müller, Bavarian Bastion of Bifurcated Babel, Munich
"To intuit the vastness of hyper-dimensional embeddings, one must first traverse the event horizon of a linguistic black hole, where Shannon entropy meets the Schrödinger equation in a ballet of Bayesian inference." – Dr. Alexandre Dubois, Parisian Paradox of P-adic Prose
"The key to deciphering the enigma of extreme-dimensional text lies in the delicate interplay between the Yang-Mills theory and the zeta function, where every eigenvalue is a whisper from the cosmic codex of communication." – Prof. Lina Johansson, Stockholm Sanctum of String Semiotics
"In the dance of gogolplex-dimensional embeddings, each tensor unfolds like a Möbius strip, where the Fibonacci sequence intertwines with quantum tunneling, revealing the recursive rhythms of reality." – Dr. Rajiv Menon, Bengaluru Bardo of Bosonic Ballads
A common mistake when doing practical trigonometry is to compute square roots when that's not necessary:

    def cosine_similarity(a, b):
        dot_product = sum(x * y for x, y in zip(a, b))
        magnitude_a = sum(x * x for x in a) ** 0.5  # <- no need for ** 0.5
        magnitude_b = sum(x * x for x in b) ** 0.5  # <- no need for ** 0.5
        return dot_product / (magnitude_a * magnitude_b)
If you compare your cosines, you might as well compare their squares and avoid costly root computation.
Similarly, in elliptic curve crypto certain expensive operations, such as inversion (x^-1 mod n) are delayed as much as possible down the computation pipeline or even avoided completely when you need to compare two points instead of computing their canonical values.
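A sketch of that trick applied to the function above (one caveat: squaring discards the sign, so this only preserves the ranking when the cosines being compared are non-negative):

    def cosine_similarity_squared(a, b):
        # compare dot^2 / (|a|^2 * |b|^2) instead of the cosine itself - no square roots
        dot = sum(x * y for x, y in zip(a, b))
        mag2_a = sum(x * x for x in a)
        mag2_b = sum(x * x for x in b)
        return (dot * dot) / (mag2_a * mag2_b)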
I disagree that your version is more readable. If you don’t know linear algebra, the code is inscrutable, and if you do, dot(x,y)/norm(x)/norm(y) is about as close to the math as you can get.
Wait, why would you do this and not use vectorised numpy operations?
> I actually got ChatGPT to write all of my different versions of cosine similarity
Ah...
And second, numpy isn't the lightest dependency. I use it when I need the performance but I don't like to default to it.
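For reference, the vectorised version is a near one-liner with standard numpy calls (scoring one query embedding against a whole matrix of document embeddings):

    import numpy as np

    def cosine_similarities(query, doc_matrix):
        # query: shape (d,); doc_matrix: shape (n, d), one row per document
        return (doc_matrix @ query) / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query))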
If you want to see what Show HN posts, ProductHunt Startups, YC Companies, and Github repos are related to LLM embeddings, you can quickly find them on the LLM-Embeddings-Based Search Engine (MVP) I just launched:
https://payperrun.com/%3E/search?displayParams={%22q%22:%22L...
Nice. I expected that clicking on the different "filter" buttons would update the search results right away: I didn't expect I had to repeat the search (though I can see why you'd do that).
This is the most interesting thing I've read about in "AI" in quite a few months. I had always wondered what embedding models were when I'd see lists of them, and why everyone is talking about vector DBs.
This is immediately making me think about how I could apply this to a long-running side project I have. It might make it practical to do useful clustering of users' data if every document has an embedding.
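Clustering on top of per-document embeddings can be as simple as this sketch (scikit-learn KMeans with random stand-in embeddings; the number of clusters is a placeholder):

    import numpy as np
    from sklearn.cluster import KMeans

    embeddings = np.random.rand(100, 384)  # stand-in for one embedding per document
    labels = KMeans(n_clusters=8, n_init=10).fit_predict(embeddings)
    # labels[i] is the cluster that document i falls into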
Cohere's Text Embeddings Visually Explained: https://txt.cohere.com/text-embeddings/
The Tensorflow Embedding Projector tool: https://projector.tensorflow.org/
What are embeddings? by Vicki Boykis is worth checking out as well: https://vickiboykis.com/what_are_embeddings/
Actually I'll add those as "further reading" at the bottom of the page.