Don't use all-MiniLM-L6-v2 for new vector embedding datasets.
Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic choice in sentence-transformers back when vector stores were in their infancy. But it's old: it doesn't incorporate the newest advances in architectures and data training pipelines, and its context length of 512 is low when newer embedding models can do 2k+ with even more efficient tokenizers.
It's a shame EmbeddingGemma is under the shonky Gemma license. I'll be honest: I don't remember what was shonky about it, but that in itself is a problem, because now I have to care about it, read it, and maybe even get legal advice before I build anything interesting on top of it!
(Just took a look, and the problem is that it forbids certain "restricted uses" that are listed in another document which, it says, "is hereby incorporated by reference into this Agreement". In other words, Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)
Can someone explain what's technically better in the recent embedding models? Has there been a big change in their architecture, or are they lighter on memory, or can they handle longer context because of improved training?
I am trying sentence-transformers/multi-qa-MiniLM-L6-cos-v1 for deploying a lightweight transformer on a CPU machine; its output dimension is 384. I want to keep the dimension as low as possible. nomic-embed-text offers lower dimensions, down to 64. I will need to test it on my dataset and will come back with the results.
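If it's useful, here's a minimal sketch of comparing the two at a low dimension with sentence-transformers' truncate_dim option (the sample texts are made up; nomic-embed-text-v1.5 was trained with Matryoshka loss, so truncating its vectors is supported, and it expects task prefixes plus trust_remote_code):

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    # Baseline: fixed 384-dim output.
    baseline = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

    # Matryoshka-trained model: its 768-dim vectors can be truncated
    # (here to 64) with modest quality loss.
    nomic = SentenceTransformer(
        "nomic-ai/nomic-embed-text-v1.5",
        trust_remote_code=True,
        truncate_dim=64,
    )

    docs = ["The cat sat on the mat.", "Stocks fell sharply on Monday."]  # toy corpus
    query = "feline resting on a rug"

    # nomic models expect task prefixes on inputs.
    doc_emb = nomic.encode([f"search_document: {d}" for d in docs])
    query_emb = nomic.encode(f"search_query: {query}")
    print(cos_sim(query_emb, doc_emb))  # similarities from 64-dim vectors

    print(cos_sim(baseline.encode(query), baseline.encode(docs)))  # 384-dim baseline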
Great comment. For what it's worth, really think about your vectors before creating them! Any model can be a vector model; you just use the final hidden states. With that in mind, think about your corpus and the model's latent space and try to pair them appropriately. For instance, I vectorize and search network data using a model trained on coding, systems, data, etc.
One thing that's still compelling about all-MiniLM is that it's feasible to use it client-side. IIRC it's a 70 MB download, versus 300 MB for EmbeddingGemma (or perhaps it was 700 MB?).
Are there any solid models that can be downloaded client-side in less than 100MB?
I tried out EmbeddingGemma a few weeks back in A/B testing against nomic-embed-text-v1. I got way better results out of the nomic model. It runs fine on CPU as well.
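For anyone wanting to reproduce that kind of comparison, a rough sketch of a recall@1 A/B harness (the labeled pairs are placeholders for your own data; both models need Hugging Face access or trust_remote_code, and per-model query prefixes are omitted for brevity):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Hypothetical labeled pairs: each query has exactly one relevant doc.
    pairs = [
        ("how to reset a router", "Unplug the router for 30 seconds, then plug it back in."),
        ("python list comprehension", "Use [x * 2 for x in items] to transform a list."),
    ]

    def recall_at_k(model_name: str, k: int = 1) -> float:
        model = SentenceTransformer(model_name, trust_remote_code=True)
        q_emb = model.encode([q for q, _ in pairs], normalize_embeddings=True)
        d_emb = model.encode([d for _, d in pairs], normalize_embeddings=True)
        sims = q_emb @ d_emb.T  # cosine similarity, since vectors are unit-normalized
        hits = sum(i in np.argsort(-row)[:k] for i, row in enumerate(sims))
        return hits / len(pairs)

    for name in ["nomic-ai/nomic-embed-text-v1", "google/embeddinggemma-300m"]:
        print(name, recall_at_k(name))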
The running costs are very low. Since posting it today we've burned 30 cents in DeepSeek inference. The Postgres instance, though, costs me $40 a month on Railway, mostly due to RAM usage during HNSW incremental updates.
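For anyone curious where that RAM goes, here's a minimal sketch of the kind of pgvector setup involved (table and column names are hypothetical; the HNSW graph is updated on every insert, which is what keeps memory usage high):

    import psycopg  # psycopg 3

    with psycopg.connect("postgresql://localhost/hn") as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        # Hypothetical schema: one row per HN comment, 384-dim embeddings.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS comments (
                id bigint PRIMARY KEY,
                body text,
                embedding vector(384)
            );
        """)
        # HNSW index: every INSERT after this point updates the graph
        # incrementally, which is where the steady RAM usage comes from.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS comments_embedding_idx
            ON comments USING hnsw (embedding vector_cosine_ops);
        """)
        conn.commit()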
That's cool! Some immediate UI feedback after the search button is clicked would be nice; I had to press it several times before I noticed any feedback. Maybe just disable it once clicked. My 2 cents.
Daily updates I do on my M4 MacBook Air: it takes about 5 minutes to process roughly 10k fresh comments. The historic backfill was done on an Nvidia GPU rented on vast.ai for a few dollars; if I recall correctly it took about an hour. It's mentioned in the README.md on GitHub.
GDPR still holds, so I don’t see why not if that’s what your request is under.
However, it's out there, and you have no idea where, so there's not really a moral or feasible way to get rid of it everywhere. (Please don't nuke the world just to clean your rep.)
Maybe I'm reading this wrong, but commercial use of comments is prohibited by the HN Privacy and Data Policy. So is creating derivative works (so, technically, a vector representation is one).
> Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site.
> The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
From [1] Terms of Use | Intellectual Property Rights:
> Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined below) that you legally upload to the Site.
> In connection with your use of the Site you will not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods.
Certainly it is literally derivative. But so are my memories of my time on the site. And in fact I do intend to make commercial use of some of those derivations. I believe it should be a right to make an external prosthesis for those memories in the form of a vector database.
That's not the same as using it to build models. You as an individual have the right to access this content, as that is the purpose of this website. The content becoming the core of some model is not.
Embeddings are encodings of shared abstract concepts statistically inferred from many works or expressions of thoughts possessed by all humans.
With text embeddings, we get a many-to-one, lossy map: many possible texts ↝ one vector that preserves some structure about meaning and some structure about style, but not enough to reconstruct the original in general, and there is no principled way to say «this vector is derived specifically from that paragraph authored by XYZ».
Does the encoded representation of the abstract concepts constitute a derivative work? If yes, then every statement ever made by a human being is a work derivative of someone else's, by virtue of having learned how to speak in childhood: every speaker creates a derivative work of all prior speakers.
Technically, there is a strong argument against treating ordinary embedding vectors as derivative works, because:
- Embeddings are not uniquely reversible and, in general, it is not possible to reconstruct the original text from the embedding;
- The embedding is one of an uncountable number of vectors in a space where nearby points correspond to many different possible sentences;
- Any individual vector is not meaningfully «the same» as the original work in the way that a translation or an adaptation is.
Please do note that this is a philosophical take, and it glosses over the legally relevant differences between human and machine learning, as the legal question ultimately depends on statutes, case law, and policy choices that are still evolving.
Where it gets more complicated:
If the embedding model has been trained on a large number of languages, cross-lingual search becomes easy: you can express an abstract search concept in any language the model has been trained on. The quality of such search results across languages X, Y, and Z will be directly proportional to the scale and quality of the training corpus in those languages.
Therefore, I can search for «the meaning of life»[0] in English and arrive at a highly relevant cluster of search results written in different languages by different people at different times, and the question becomes: «what exactly has it been statistically[1] derived from?»
[0] Cross-lingual search is what I did with my engineers last year, to our surprise and delight at how well it actually worked.
[1] If one can't trace a given vector uniquely back to a specific underlying copyrighted expression, and demonstrate substantial similarity of expression rather than idea, the «derivative work» argument in the legal sense becomes strained.
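To make the cross-lingual point concrete, a minimal sketch with a multilingual sentence-transformers model (the sample sentences are invented):

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    # A small multilingual model trained on 50+ languages.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    docs = [
        "Der Sinn des Lebens ist eine alte philosophische Frage.",  # German
        "El sentido de la vida es una vieja pregunta filosófica.",  # Spanish
        "The stock market closed higher today.",                    # English, off-topic
    ]

    query = "the meaning of life"
    sims = cos_sim(model.encode(query), model.encode(docs))
    print(sims)  # the two on-topic sentences score high despite the language mismatch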
To think that any company anywhere actually removes all data upon request is a bit naive to me. Sure, maybe I'm too pessimistic, but there's just not enough evidence these deletes are not soft deletes. The data is just too valuable to them.
I think it would be useful to add a right-click menu option to HN content, like "similar sentences", which displays a list of links to them. I wonder if it would tell me that this suggestion has been made before.
It would actually be so interesting to have comment, reply, and thread associations according to semantic meaning rather than direct links.
I wonder how many times the same discussion thread has been repeated across different posts. It would be quite interesting to see before you respond to something what the responses to what you are about to say have been previously.
Semantic threads or something would be the general idea... Pretty cool concept actually...
You'd get sentences full of words like: tangential, orthogonal, externalities, anecdote, anecdata, cargo cult, enshittification, grok, Hanlon's razor, Occam's razor, any other razor, Godwin's law, Murphy's law, other laws.
Someone made a tool a few years ago that basically unmasked all HN secondary accounts with a high degree of certainty. It scared the shit out of me how easily it picked out my alts based on writing style.
I think that original post was taken down after a short while, but antirez was similarly nerd-sniped by it and posted this, which I keep a link to for posterity: https://antirez.com/news/150
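For the curious, the core technique behind that kind of unmasking is stylometry. Here is a toy sketch of the general idea (character n-gram TF-IDF plus cosine similarity over hypothetical comment histories; I don't know what the actual tool used):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical concatenated comment histories for three accounts.
    accounts = {
        "alice":  "Honestly, I reckon the whole thing is over-engineered...",
        "bob":    "The benchmark numbers look off; did they control for cache?",
        "alice2": "Honestly, I reckon these benchmarks are over-engineered...",
    }

    # Character n-grams capture punctuation habits and function-word rhythm,
    # which survive topic changes better than word-level features do.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    X = vec.fit_transform(accounts.values())

    names = list(accounts)
    sims = cosine_similarity(X)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(names[i], names[j], round(float(sims[i, j]), 3))
    # On real data, suspiciously high pairwise similarity flags likely alts.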
I know it's unrelated, but does anyone know a good paper comparing vector search vs "normal" full-text search? Sometimes I ask myself if the squeeze is worth the juice.
"Normal" search is generally called BM25 in retrieval papers. Many, if not all, retrieval papers about modeling will use or list BM25 as a baseline. Hope this helps!
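If you'd rather poke at the trade-off than read a paper, here's a minimal side-by-side sketch with the rank_bm25 package and a small embedding model (toy corpus; real evaluations use labeled datasets):

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    corpus = [
        "How to reset a home router",
        "Restarting your modem fixes most connectivity issues",
        "Recipe for sourdough bread",
    ]
    query = "my internet stopped working"

    # Lexical baseline: BM25 over whitespace tokens.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    print("BM25:", bm25.get_scores(query.lower().split()))

    # Semantic: cosine similarity over embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    print("Vectors:", cos_sim(model.encode(query), model.encode(corpus)))
    # BM25 scores this query near zero (no overlapping terms), while the
    # embedding model still ranks the router/modem docs above the recipe.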
I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.
Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea if that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!
That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.
Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.
Fun project. I'm sure it will get a lot of interest here.
For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp-style quants to them (i.e. Q4_K, MXFP4, etc.). An example of this is below.
https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
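As a rough illustration, here's a minimal sketch using the gguf package from llama.cpp's gguf-py; it just writes embeddings as a float16 tensor, since support for writing arbitrary tensors in K-quant formats varies by library version (the file layout and metadata key here are my own invention):

    import numpy as np
    from gguf import GGUFWriter

    # Hypothetical batch of 1,000 embeddings at 384 dims.
    vectors = np.random.rand(1000, 384).astype(np.float16)

    writer = GGUFWriter("embeddings.gguf", arch="embeddings")
    writer.add_uint32("embeddings.dim", vectors.shape[1])  # custom metadata key
    writer.add_tensor("vectors", vectors)

    # Standard gguf-py write sequence: header, key-value metadata, tensors.
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()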
I don't know how to feel about this. Is the only purpose of the comments here to train some commercial model? I have a feeling this might affect my involvement here going forward.
Not me. The thought of my eccentric comments leaving some unnoticed mar in the latent space of tomorrow's ever mightier LLMs, a tiny stain that reverberates endlessly into the future, manifesting at unexpected moments, amuses me to no end.
For open weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead, which has incredible benchmarks and a 2k context window: although it's larger/slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
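For anyone following along, swapping any of these into sentence-transformers is a one-line change (a sketch; EmbeddingGemma is gated on Hugging Face, so you'd need to accept the license and log in first, and the nomic model needs trust_remote_code):

    from sentence_transformers import SentenceTransformer

    # Pick one; all three expose the same encode() interface.
    model = SentenceTransformer("google/embeddinggemma-300m")
    # model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    # model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

    embeddings = model.encode(["An example sentence to embed."])
    print(embeddings.shape)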
Open weights, multilingual, 32k context.
I have ~50 million sentences from English Project Gutenberg novels embedded with this.
https://huggingface.co/MongoDB/mdbr-leaf-ir
Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.
I have trouble navigating this space, as there's so much choice, and I don't know exactly how to "benchmark" an embedding model for my use cases.
Source is at https://github.com/afiodorov/hn-search
I tried "Who's Gary Marcus" - HN / your thing was considerably more negative about him than Google.
[1] https://www.ycombinator.com/legal/#tou
Sure, and some people would want a "gun prosthesis" as an aid to quickly throw small metallic objects, and it wouldn't be allowed either.
I am not sure if it is that clear cut.
No it didn't. As the top comment in that thread points out, there were a large number of false positives.
https://bsky.app/profile/reachsumit.com
Around 4 million web pages as markdown is like 1-2 GB.
I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.
It used to be about helping strangers in some small way. Now it's helping people I don't like more than people I do like.