Don't use all-MiniLM-L6-v2 for new vector embedding datasets.
Yes, it's the open-weights embedding model used in all the tutorials, and it was the most pragmatic choice in sentence-transformers back when vector stores were in their infancy. But it's old: it doesn't incorporate the newest advances in architectures and data training pipelines, and its context length of 512 is low when newer embedding models can do 2k+ with even more efficient tokenizers.
It's a shame EmbeddingGemma is under the shonky Gemma license. I'll be honest: I don't remember what was shonky about it, but that in itself is a problem, because now I have to care about it, read it, and maybe even get legal advice before I build anything interesting on top of it!
(Just took a look, and the problem is that it forbids certain "restricted uses" that are listed in another document which, it says, "is hereby incorporated by reference into this Agreement". In other words, Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)
Can someone explain what's technically better in the recent embedding models? Has there been a big change in their architecture, or are they lighter on memory, or can they handle longer context because of improved training?
I am trying sentence-transformers/multi-qa-MiniLM-L6-cos-v1 for deploying a lightweight transformer on a CPU machine; its output dimension is 384. I want to keep the dimension as low as possible. nomic-embed-text offers lower dimensions, down to 64. I will need to test it on my dataset and will come back with the results.
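If it's useful, here's a minimal sketch of comparing the two at a low dimension with sentence-transformers' truncate_dim option (the sample texts are made up; nomic-embed-text-v1.5 was trained with Matryoshka loss, so truncating its vectors is supported, and it expects task prefixes plus trust_remote_code):

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    # Baseline: fixed 384-dim output.
    baseline = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

    # Matryoshka-trained model: its 768-dim vectors can be truncated
    # (here to 64) with modest quality loss.
    nomic = SentenceTransformer(
        "nomic-ai/nomic-embed-text-v1.5",
        trust_remote_code=True,
        truncate_dim=64,
    )

    docs = ["The cat sat on the mat.", "Stocks fell sharply on Monday."]  # toy corpus
    query = "feline resting on a rug"

    # nomic models expect task prefixes on inputs.
    doc_emb = nomic.encode([f"search_document: {d}" for d in docs])
    query_emb = nomic.encode(f"search_query: {query}")
    print(cos_sim(query_emb, doc_emb))  # similarities from 64-dim vectors

    print(cos_sim(baseline.encode(query), baseline.encode(docs)))  # 384-dim baseline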
Great comment. For what it's worth, really think about your vectors before creating them! Any model can be a vector model; you just use the final hidden states. With that in mind, think about your corpus and the model's latent space and try to pair them appropriately. For instance, I vectorize and search network data using a model trained on coding, systems, data, etc.
One thing that's still compelling about all-MiniLM is that it's feasible to use it client-side. IIRC it's a 70 MB download, versus 300 MB for EmbeddingGemma (or perhaps it was 700 MB?).
Are there any solid models that can be downloaded client-side in less than 100MB?
I tried out EmbeddingGemma a few weeks back in A/B testing against nomic-embed-text-v1. I got way better results out of the nomic model. It runs fine on CPU as well.
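For anyone wanting to reproduce that kind of comparison, a rough sketch of a recall@1 A/B harness (the labeled pairs are placeholders for your own data; both models need Hugging Face access or trust_remote_code, and per-model query prefixes are omitted for brevity):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Hypothetical labeled pairs: each query has exactly one relevant doc.
    pairs = [
        ("how to reset a router", "Unplug the router for 30 seconds, then plug it back in."),
        ("python list comprehension", "Use [x * 2 for x in items] to transform a list."),
    ]

    def recall_at_k(model_name: str, k: int = 1) -> float:
        model = SentenceTransformer(model_name, trust_remote_code=True)
        q_emb = model.encode([q for q, _ in pairs], normalize_embeddings=True)
        d_emb = model.encode([d for _, d in pairs], normalize_embeddings=True)
        sims = q_emb @ d_emb.T  # cosine similarity, since vectors are unit-normalized
        hits = sum(i in np.argsort(-row)[:k] for i, row in enumerate(sims))
        return hits / len(pairs)

    for name in ["nomic-ai/nomic-embed-text-v1", "google/embeddinggemma-300m"]:
        print(name, recall_at_k(name))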
The running costs are very low. Since posting it today we've burned 30 cents in DeepSeek inference. The Postgres instance, though, costs me $40 a month on Railway, mostly due to RAM usage during HNSW incremental updates.
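For anyone curious where that RAM goes, here's a minimal sketch of the kind of pgvector setup involved (table and column names are hypothetical; the HNSW graph is updated on every insert, which is what keeps memory usage high):

    import psycopg  # psycopg 3

    with psycopg.connect("postgresql://localhost/hn") as conn, conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        # Hypothetical schema: one row per HN comment, 384-dim embeddings.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS comments (
                id bigint PRIMARY KEY,
                body text,
                embedding vector(384)
            );
        """)
        # HNSW index: every INSERT after this point updates the graph
        # incrementally, which is where the steady RAM usage comes from.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS comments_embedding_idx
            ON comments USING hnsw (embedding vector_cosine_ops);
        """)
        conn.commit()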
That's cool! Some immediate UI feedback after the search button is clicked would be nice; I had to press it several times before I noticed any feedback. Maybe just disable it once clicked. My 2 cents.
Daily updates I do on my M4 MacBook Air: it takes about 5 minutes to process roughly 10k fresh comments. The historic backfill was done on an Nvidia GPU rented on vast.ai for a few dollars; if I recall correctly it took about an hour. It's mentioned in the README.md on GitHub.
GDPR still holds, so I don’t see why not if that’s what your request is under.
However, it's out there, and you have no idea where, so there's not really a moral or feasible way to get rid of it everywhere. (Please don't nuke the world just to clean your rep.)
Maybe I'm reading this wrong, but commercial use of comments is prohibited by the HN Privacy and Data Policy. So is creating derivative works (so, technically, a vector representation is one).
> Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site.
> The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
From [1] Terms of Use | Intellectual Property Rights:
> Except as expressly authorized by Y Combinator, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Site or the Site Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined below) that you legally upload to the Site.
> In connection with your use of the Site you will not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods.
Certainly it is literally derivative. But so are my memories of my time on the site. And in fact I do intend to make commercial use of some of those derivations. I believe it should be a right to make an external prosthesis for those memories in the form of a vector database.
That's not the same as using it to build models. You as an individual have the right to access this content, as that is the purpose of this website. The content becoming the core of some model is not.
Embeddings are encodings of shared abstract concepts statistically inferred from many works or expressions of thoughts possessed by all humans.
With text embeddings, we get a many-to-one, lossy map: many possible texts ↝ one vector that preserves some structure about meaning and some structure about style, but not enough to reconstruct the original in general, and there is no principled way to say «this vector is derived specifically from that paragraph authored by XYZ».
Does the encoded representation of the abstract concepts constitute a derivative work? If yes, then every statement ever made by a human being is a work derivative of someone else's, by virtue of having learned how to speak in childhood: every speaker creates a derivative work of all prior speakers.
Technically, there is a strong argument against treating ordinary embedding vectors as derivative works, because:
- Embeddings are not uniquely reversible and, in general, it is not possible to reconstruct the original text from the embedding;
- The embedding is one of an uncountable number of vectors in a space where nearby points correspond to many different possible sentences;
- Any individual vector is not meaningfully «the same» as the original work in the way that a translation or an adaptation is.
Please do note that this is a philosophical take, and it glosses over the legally relevant differences between human and machine learning, as the legal question ultimately depends on statutes, case law, and policy choices that are still evolving.
Where it gets more complicated:
If the embedding model has been trained on a large number of languages, cross-lingual search becomes easy: you can express an abstract search concept in any language the model has been trained on. The quality of such search results across languages X, Y, and Z will be directly proportional to the scale and quality of the training corpus in those languages.
Therefore, I can search for «the meaning of life»[0] in English and arrive at a highly relevant cluster of search results written in different languages by different people at different times, and the question becomes: «what exactly has it been statistically[1] derived from?»
[0] Cross-lingual search is what I did with my engineers last year, to our surprise and delight at how well it actually worked.
[1] If one can't trace a given vector uniquely back to a specific underlying copyrighted expression, and demonstrate substantial similarity of expression rather than idea, the «derivative work» argument in the legal sense becomes strained.
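To make the cross-lingual point concrete, a minimal sketch with a multilingual sentence-transformers model (the sample sentences are invented):

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    # A small multilingual model trained on 50+ languages.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    docs = [
        "Der Sinn des Lebens ist eine alte philosophische Frage.",  # German
        "El sentido de la vida es una vieja pregunta filosófica.",  # Spanish
        "The stock market closed higher today.",                    # English, off-topic
    ]

    query = "the meaning of life"
    sims = cos_sim(model.encode(query), model.encode(docs))
    print(sims)  # the two on-topic sentences score high despite the language mismatch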
To think that any company anywhere actually removes all data upon request is a bit naive to me. Sure, maybe I'm too pessimistic, but there's just not enough evidence these deletes are not soft deletes. The data is just too valuable to them.
I think it would be useful to add a right-click menu option to HN content, like "similar sentences", which displays a list of links to them. I wonder if it would tell me that this suggestion has been made before.
It would actually be so interesting to have comment, reply, and thread associations according to semantic meaning rather than direct links.
I wonder how many times the same discussion thread has been repeated across different posts. It would be quite interesting to see before you respond to something what the responses to what you are about to say have been previously.
Semantic threads or something would be the general idea... Pretty cool concept actually...
You'd get sentences full of words like: tangential, orthogonal, externalities, anecdote, anecdata, cargo cult, enshittification, grok, Hanlon's razor, Occam's razor, any other razor, Godwin's law, Murphy's law, other laws.
Someone made a tool a few years ago that basically unmasked all HN secondary accounts with a high degree of certainty. It scared the shit out of me how easily it picked out my alts based on writing style.
I think that original post was taken down after a short while, but antirez was similarly nerd-sniped by it and posted this, which I keep a link to for posterity: https://antirez.com/news/150
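For the curious, the core technique behind that kind of unmasking is stylometry. Here is a toy sketch of the general idea (character n-gram TF-IDF plus cosine similarity over hypothetical comment histories; I don't know what the actual tool used):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical concatenated comment histories for three accounts.
    accounts = {
        "alice":  "Honestly, I reckon the whole thing is over-engineered...",
        "bob":    "The benchmark numbers look off; did they control for cache?",
        "alice2": "Honestly, I reckon these benchmarks are over-engineered...",
    }

    # Character n-grams capture punctuation habits and function-word rhythm,
    # which survive topic changes better than word-level features do.
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
    X = vec.fit_transform(accounts.values())

    names = list(accounts)
    sims = cosine_similarity(X)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(names[i], names[j], round(float(sims[i, j]), 3))
    # On real data, suspiciously high pairwise similarity flags likely alts.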
I know it's unrelated, but does anyone know a good paper comparing vector search vs "normal" full-text search? Sometimes I ask myself if the squeeze is worth the juice.
"Normal" search is generally called BM25 in retrieval papers. Many, if not all, retrieval papers about modeling will use or list BM25 as a baseline. Hope this helps!
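If you'd rather poke at the trade-off than read a paper, here's a minimal side-by-side sketch with the rank_bm25 package and a small embedding model (toy corpus; real evaluations use labeled datasets):

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import cos_sim

    corpus = [
        "How to reset a home router",
        "Restarting your modem fixes most connectivity issues",
        "Recipe for sourdough bread",
    ]
    query = "my internet stopped working"

    # Lexical baseline: BM25 over whitespace tokens.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    print("BM25:", bm25.get_scores(query.lower().split()))

    # Semantic: cosine similarity over embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    print("Vectors:", cos_sim(model.encode(query), model.encode(corpus)))
    # BM25 scores this query near zero (no overlapping terms), while the
    # embedding model still ranks the router/modem docs above the recipe.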
I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.
Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea if that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!
That’s crazy small. So is it fair to say that words are actually the best compression algorithm we have? You can explain complex ideas in just a few hundred words.
Yes, a picture is worth a thousand words, but imagine how much information is in those 17GB of text.
Fun project. I'm sure it will get a lot of interest here.
For those into vector storage in general, one thing that has interested me lately is the idea of storing vectors as GGUF files and bringing the familiar llama.cpp-style quants to them (i.e. Q4_K, MXFP4, etc.). An example of this is below.
https://gist.github.com/davidmezzetti/ca31dff155d2450ea1b516...
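As a rough illustration, here's a minimal sketch using the gguf package from llama.cpp's gguf-py; it just writes embeddings as a float16 tensor, since support for writing arbitrary tensors in K-quant formats varies by library version (the file layout and metadata key here are my own invention):

    import numpy as np
    from gguf import GGUFWriter

    # Hypothetical batch of 1,000 embeddings at 384 dims.
    vectors = np.random.rand(1000, 384).astype(np.float16)

    writer = GGUFWriter("embeddings.gguf", arch="embeddings")
    writer.add_uint32("embeddings.dim", vectors.shape[1])  # custom metadata key
    writer.add_tensor("vectors", vectors)

    # Standard gguf-py write sequence: header, key-value metadata, tensors.
    writer.write_header_to_file()
    writer.write_kv_data_to_file()
    writer.write_tensors_to_file()
    writer.close()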
I don't know how to feel about this. Is the only purpose of the comments here to train some commercial model? I have a feeling this might affect my involvement here going forward.
Not me. The thought of my eccentric comments leaving some unnoticed mar in the latent space of tomorrow's ever mightier LLMs, a tiny stain that reverberates endlessly into the future, manifesting at unexpected moments, amuses me to no end.
For open weights, I would recommend EmbeddingGemma (https://huggingface.co/google/embeddinggemma-300m) instead, which has incredible benchmarks and a 2k context window: although it's larger/slower to encode, the payoff is worth it. For a compromise, bge-base-en-v1.5 (https://huggingface.co/BAAI/bge-base-en-v1.5) or nomic-embed-text-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) are also good.
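For anyone following along, swapping any of these into sentence-transformers is a one-line change (a sketch; EmbeddingGemma is gated on Hugging Face, so you'd need to accept the license and log in first, and the nomic model needs trust_remote_code):

    from sentence_transformers import SentenceTransformer

    # Pick one; all three expose the same encode() interface.
    model = SentenceTransformer("google/embeddinggemma-300m")
    # model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    # model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

    embeddings = model.encode(["An example sentence to embed."])
    print(embeddings.shape)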
Open weights, multilingual, 32k context.
I have ~50 million sentences from English Project Gutenberg novels embedded with this.
https://huggingface.co/MongoDB/mdbr-leaf-ir
Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.
I have trouble navigating this space, as there's so much choice, and I don't know exactly how to "benchmark" an embedding model for my use cases.
Source is at https://github.com/afiodorov/hn-search
I tried "Who's Gary Marcus" - HN / your thing was considerably more negative about him than Google.
[1] https://www.ycombinator.com/legal/#tou
Sure, and some people would want a "gun prosthesis" as an aid to quickly throw small metallic objects, and it wouldn't be allowed either.
I am not sure if it is that clear cut.
No it didn't. As the top comment in that thread points out, there were a large number of false positives.
https://bsky.app/profile/reachsumit.com
Around 4 million web pages as markdown is like 1-2 GB.
I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when relevant.
It used to be about helping strangers in some small way. Now it's helping people I don't like more than people I do like.