daturkel · 2 years ago
Annoy came out of Spotify, and they just announced their successor library Voyager [1] last week [2].

[1]: https://github.com/spotify/voyager

[2]: https://engineering.atspotify.com/2023/10/introducing-voyage...

gregsadetsky · 2 years ago
Amazing, thanks for the reference!

The biggest immediately useful difference I see is that Annoy uses read-only index files (from the docs: "you can not add more items once the tree has been created" [0]), while Voyager lets you call `.add_item` at any time (I just pip installed it to double check, and yes -- it's great).

[0] https://github.com/spotify/annoy#summary-of-features
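
For contrast, a minimal sketch of the Annoy workflow described above (dimension, tree count, and file name are just placeholders): once build() has been called, the index is frozen.

    # Minimal Annoy workflow: items must be added before build(); once the
    # trees are built the index is read-only.
    import random
    from annoy import AnnoyIndex

    dim = 64                                 # illustrative dimension
    index = AnnoyIndex(dim, "angular")

    for i in range(1000):
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

    index.build(10)                          # 10 trees; no add_item after this
    index.save("vectors.ann")

    neighbors = index.get_nns_by_item(0, 5)  # 5 approximate nearest neighbors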

PLenz · 2 years ago
The great thing about Annoy is that you can write the index to disk and thus do big data work on tiny workers at the edge. I've never really seen anything else do the same.
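
Concretely, the save/load round trip looks something like this; load() memory-maps the file, which is what makes it cheap on small workers (file name and dimension are placeholders):

    # Query a prebuilt Annoy index from disk; load() memory-maps the file,
    # so even a large index is cheap to open on a small worker.
    from annoy import AnnoyIndex

    dim = 64                                 # must match the index that was built
    index = AnnoyIndex(dim, "angular")
    index.load("vectors.ann")                # mmap, near-instant

    query = [0.0] * dim
    ids, distances = index.get_nns_by_vector(query, 10, include_distances=True)
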
danbruc · 2 years ago
That sounds pretty much like binary space partitioning [1], which is probably best known for being used in Doom to accelerate rendering. Also, if you only search the leaf subspace that your query point lies in, then you can in principle be arbitrarily far off from the true nearest neighbor.

[1] https://en.wikipedia.org/wiki/Binary_space_partitioning

jhj · 2 years ago
k-D trees and BSPs don't work well in high dimensions (say >20 to 30 dims), since your true nearest neighbor is quite likely to end up on the opposite side of any given dividing hyperplane from your query point.

If each dividing hyperplane separates a set of vectors into two roughly equal parts, then for N vectors you need a tree on the order of log2(N) levels deep to get down to a single vector per terminal node. But for, say, a billion vectors in a 100-dimensional space (and assuming something like a k-D tree, where you partition one dimension at each level), you have only partitioned log2(10^9) ~= 30 dimensions this way. That leaves 70 other dimensions through which you can travel to reach the true nearest neighbor (the curse of dimensionality: in high dimensions everything is "near" everything else), and if you discard subspaces along the way you never check them.
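
A quick empirical sketch of that effect (sizes are arbitrary): draw random points, find one point's true nearest neighbor by brute force, and count how often a random hyperplane through the centroid puts the pair on opposite sides.

    # Empirical check: how often does a random hyperplane through the data's
    # centroid separate a point from its true nearest neighbor?
    import numpy as np

    rng = np.random.default_rng(0)
    n, dim = 2000, 100                      # arbitrary sizes for illustration
    x = rng.standard_normal((n, dim))

    # true nearest neighbor of point 0, by brute force
    dists = np.linalg.norm(x - x[0], axis=1)
    dists[0] = np.inf
    nn = int(np.argmin(dists))

    center = x.mean(axis=0)
    trials, splits = 1000, 0
    for _ in range(trials):
        normal = rng.standard_normal(dim)   # random hyperplane through the centroid
        same_side = (np.dot(x[0] - center, normal) > 0) == (np.dot(x[nn] - center, normal) > 0)
        splits += not same_side

    print(f"query and its nearest neighbor split in {splits / trials:.0%} of random cuts")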

This is a problem for other approximate methods as well, though. IVF methods still do a form of partitioning (via Voronoi cells), but the partition is obtained by clustering the data, so it reflects the global structure of the data rather than a greedy divide and conquer. With IVF it is still possible that you need to check every cell to find your true nearest neighbor; it is just less likely to happen in practice, because the partitioning follows the overall structure of the data rather than a greedy subdivision.
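
In Faiss that tradeoff is exposed directly: an IVF index is trained by k-means, and nprobe sets how many cells get scanned per query. A rough sketch, with arbitrary sizes:

    # IVF in Faiss: k-means partitions the vectors into nlist cells; nprobe
    # controls how many cells are scanned per query (recall vs. speed).
    import numpy as np
    import faiss

    dim, nlist = 100, 1024                  # illustrative sizes
    xb = np.random.random((100_000, dim)).astype("float32")
    xq = np.random.random((10, dim)).astype("float32")

    quantizer = faiss.IndexFlatL2(dim)      # assigns vectors to cells
    index = faiss.IndexIVFFlat(quantizer, dim, nlist)
    index.train(xb)                         # k-means clustering
    index.add(xb)

    index.nprobe = 16                       # scan 16 of the 1024 cells per query
    distances, ids = index.search(xq, 5)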

danbruc · 2 years ago
I wonder if nearest neighbor is actually what you want to begin with. Imagine points distributed in a U-shape: the nearest neighbor might require jumping across the gap, while a better match could be a bit further away but somewhere along the shape. Especially if the dimensions do not have a meaningful intrinsic scale, I could just scale the x dimension and pull the two sides of the U apart. So maybe it would actually be more useful to look at the topology of the points, something like a Vietoris–Rips complex [1] but invariant under [local] rescaling of dimensions. Not sure if something like that exists or is possible in a robust and efficient way.

[1] https://en.wikipedia.org/wiki/Vietoris%E2%80%93Rips_complex

esafak · 2 years ago
And LSH uses random hyperplanes.
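
That is, each random hyperplane contributes one hash bit (the sign of the projection), and vectors that agree on all bits land in the same bucket. A toy sketch, with arbitrary sizes:

    # Random-hyperplane LSH: each random normal vector contributes one hash
    # bit (the sign of the dot product); vectors that are close in cosine
    # distance tend to land in the same bucket.
    from collections import defaultdict
    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits = 128, 16                   # arbitrary sizes
    planes = rng.standard_normal((n_bits, dim))

    def lsh_hash(v):
        bits = (planes @ v) > 0
        return sum(int(b) << i for i, b in enumerate(bits))

    data = rng.standard_normal((10_000, dim))
    buckets = defaultdict(list)
    for i, v in enumerate(data):
        buckets[lsh_hash(v)].append(i)

    # candidate neighbors for a query are the vectors sharing its bucket
    query = rng.standard_normal(dim)
    candidates = buckets[lsh_hash(query)]
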
taneq · 2 years ago
Yeah, it looks pretty similar to me too.

I haven’t read up on vector DBs, but I wonder if k-d trees would be a better fit?

PaulHoule · 2 years ago
Funny, I got interested in nearest-neighbor search for this kind of vector in the early 2000s, and the papers on the subject were depressing (every method had to make a tradeoff that wasn't really attractive) compared to the literature on low-dimensional indexes. Back then I had 250,000 or so vectors and a full scan wasn't that bad.

In the 2010s I worked on a dense vector search engine for patents; we had maybe 10 million vectors and still used a scan. I mean, you couldn't ask for a more memory-hierarchy- and SIMD-friendly algorithm.

Today I am looking at 1M (larger) vectors and the full scan is still possible but I am using FAISS because it is a bird in the hand and I decided I can live with the tradeoff.

kristjansson · 2 years ago
I mean FAISS has IndexFlatL2 if you have it in hand and _want_ the full scan.
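
For reference, the exact-search path looks like this (sizes are placeholders); there is no training step, just a brute-force scan:

    # Exact (non-approximate) search in Faiss: IndexFlatL2 brute-force scans
    # every stored vector.
    import numpy as np
    import faiss

    dim = 256
    xb = np.random.random((100_000, dim)).astype("float32")   # database vectors
    xq = np.random.random((5, dim)).astype("float32")         # query vectors

    index = faiss.IndexFlatL2(dim)           # no training step
    index.add(xb)
    distances, ids = index.search(xq, 10)    # exact top-10 per query
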
dmezzetti · 2 years ago
If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.

Disclaimer: I am the author of txtai

Scene_Cast2 · 2 years ago
Would you have any quick links or data on how they compare in terms of speed and recall?
dmezzetti · 2 years ago
https://ann-benchmarks.com/ is a good resource covering those libraries and much more.
smokracek · 2 years ago
Does anyone have any recommendations for a decent crash course on using vector DBs in conjunction with LLMs? I wanna do some experimentation with getting a model to comment on the similarity of data vectors etc. and I don't really know where to start.
minimaxir · 2 years ago
If you want to experiment with vector stores, you can do that locally with something like faiss which has good multiplatform support and sufficient tutorials: https://github.com/facebookresearch/faiss

Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results has more steps, but you get a lot of flexibility, and despite what AI influencers say there's no standard best practice. When you query a vector DB you get the most similar texts back (or index integers in the case of faiss); you then feed those results to an LLM as part of a normal prompt, which can be optimized with prompt engineering.
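
A bare-bones sketch of that loop; embed() and generate() are placeholders for whatever embedding model and LLM you plug in, and the prompt wording is arbitrary:

    # Minimal RAG loop: embed the query, pull the nearest texts from a Faiss
    # index, and paste them into the prompt. embed() and generate() stand in
    # for your embedding model and LLM.
    import numpy as np
    import faiss

    def build_index(texts, embed):
        vectors = np.stack([embed(t) for t in texts]).astype("float32")
        index = faiss.IndexFlatL2(vectors.shape[1])
        index.add(vectors)
        return index

    def answer(question, texts, index, embed, generate, k=3):
        q = embed(question).astype("float32").reshape(1, -1)
        _, ids = index.search(q, k)                       # faiss returns integer ids
        context = "\n\n".join(texts[i] for i in ids[0])
        prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
        return generate(prompt)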

The codifier of the RAG workflow is LangChain, but their demo is substantially more complex and harder to use than even a homegrown implementation: https://minimaxir.com/2023/07/langchain-problem/

marcyb5st · 2 years ago
Also, if what you look up has no semantic meaning, like part numbers, you might be better off with an inverted index in addition to ANN lookups. Especially if the embedding model has been trained on a dataset that is not similar to what you use it for. That's a common situation right now with embedding models based on LLMs.
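
A toy illustration of the hybrid idea (all names and data are made up): exact token hits come from an inverted index, and a vector lookup fills in the semantic matches.

    # Toy hybrid lookup: exact token hits (good for part numbers and other
    # non-semantic strings) come from an inverted index; ann_search is a
    # placeholder for any vector lookup (Annoy, Faiss, ...).
    from collections import defaultdict

    docs = {0: "bracket assembly PN-4431-A", 1: "hex bolt PN-9920", 2: "mounting bracket kit"}

    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            inverted[token].add(doc_id)

    def search(query, ann_search, k=5):
        exact = set()
        for token in query.lower().split():
            exact |= inverted.get(token, set())           # exact matches first
        semantic = ann_search(query, k)                   # semantic fallback
        return list(exact) + [d for d in semantic if d not in exact]
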
alumic · 2 years ago
You might also check out this previous thread on the subject. It offers some pretty fascinating discussions:

https://news.ycombinator.com/item?id=35826929


bobvanluijt · 2 years ago
(I’m affiliated with Weaviate.) You might want to check out this getting-started guide; it takes a couple of minutes and you're good to go: https://weaviate.io/developers/weaviate/quickstart
pokpokpok · 2 years ago
I recommend pgvector; it's simple and well featured, with good example code. Once you have a dataset of vectors loaded in, the next step is RAG (retrieval-augmented generation).
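
A rough pgvector sketch from Python (connection string, table name, and dimension are placeholders; <-> is pgvector's L2 distance operator):

    # Nearest-neighbor query with pgvector via psycopg2; DSN, table name,
    # and dimension are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # placeholder connection string
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(id bigserial PRIMARY KEY, body text, embedding vector(384))"
    )

    # vectors are passed in pgvector's text form, e.g. '[0.1,0.2,...]'
    vec = "[" + ",".join(["0"] * 384) + "]"
    cur.execute(
        "INSERT INTO items (body, embedding) VALUES (%s, %s::vector)",
        ("example text", vec),
    )

    # order by distance to the query vector to get the nearest rows
    cur.execute(
        "SELECT id, body FROM items ORDER BY embedding <-> %s::vector LIMIT 5",
        (vec,),
    )
    print(cur.fetchall())
    conn.commit()
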
malaya_zemlya · 2 years ago
deeplearning.ai has a short course on the topic: https://www.deeplearning.ai/short-courses/large-language-mod...
superdug · 2 years ago
I was so saddened by this title... I thought there was a way to find annoying neighbors (at homes I’m interested in acquiring) and assumed this was a service for literally finding nearby annoy[ing] neighbors. Then I went through the comments and found that this is a service with a minimum input of addresses and census data, which are both freely available, as is this service. Bravo hn!
lootsauce · 2 years ago
This popped up in the new feed yesterday and seems significant in the ANN space https://news.ycombinator.com/item?id=38054537
debo_ · 2 years ago
When I saw the "oh yeah" I assumed this was going to be a deepfake Macho Man Randy Savage explaining ANN. I am disappointed.