The biggest immediately useful difference that I see is that Annoy uses read-only index files (from the docs: "you can not add more items once the tree has been created" [0]), while Voyager allows you to call `.add_item` at any time (I just pip installed it to double-check, and yes, it's great).
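In case it helps anyone, here is a minimal sketch of what incremental inserts look like with Voyager's Python bindings; the constructor and method names are from my quick read of the docs, so treat the details as an assumption:

```python
import numpy as np
from voyager import Index, Space

# Assumed API: create an empty index and add vectors one at a time,
# with no separate "build this first" step.
index = Index(Space.Cosine, num_dimensions=128)

first_id = index.add_item(np.random.rand(128).astype(np.float32))
neighbors, distances = index.query(np.random.rand(128).astype(np.float32), k=1)

# Unlike Annoy, you can keep adding items after querying.
index.add_item(np.random.rand(128).astype(np.float32))
```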
The great thing about Annoy is that you can write the index to disk and thus do big data work on tiny workers at the edge. I've never really seen anything else do the same.
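A rough sketch of that build-once, query-anywhere workflow with Annoy (the calls are the standard ones from its README; `load_vectors()` is a made-up helper):

```python
from annoy import AnnoyIndex

DIM = 100

# Build once on a big machine...
builder = AnnoyIndex(DIM, "angular")
for i, vec in enumerate(load_vectors()):  # load_vectors() is a hypothetical helper
    builder.add_item(i, vec)
builder.build(10)          # 10 trees
builder.save("index.ann")

# ...then on a tiny worker, load() mmaps the file, so many processes
# can share the same index without duplicating it in RAM.
searcher = AnnoyIndex(DIM, "angular")
searcher.load("index.ann")
print(searcher.get_nns_by_vector([0.1] * DIM, 10))
```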
That sounds pretty much like binary space partitioning [1], which is probably best known for its use in Doom to accelerate rendering. Also, if you only search the leaf subspace that your query point lies in, then you can in principle be arbitrarily far off from the true nearest neighbor.
k-D trees and BSPs don't work well in high dimensions (say, more than 20 to 30 dims), since your true nearest neighbor is quite likely to end up on the other side of some dividing hyperplane from your query point.
If each dividing hyperplane separates a set of vectors into two roughly equal parts, then for N vectors you need on the order of log2(N) levels of separating planes to get down to a single vector per terminal node. But for, say, a billion vectors in a 100-dimensional space (and assuming something like a k-D tree, where you split along one dimension at each level), you have only partitioned log2(10^9) ~= 30 dimensions this way. That leaves roughly 70 other dimensions through which you can travel to find your true nearest neighbor (the curse of dimensionality: in high dimensions everything is "near" everything else), and those you have not checked at all if you discard subspaces along the way.
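A quick numeric illustration of that last point, in plain NumPy: as the dimension grows, the nearest neighbor of a random query is barely closer than the average point.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.normal(size=(5000, d))
    query = rng.normal(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # Ratio of nearest distance to mean distance; tends toward 1 as d grows,
    # i.e. "everything is near everything else".
    print(f"d={d:4d}  nearest/mean = {dists.min() / dists.mean():.3f}")
```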
This is a problem for other approximate methods as well, though. IVF methods still do a form of partitioning (via Voronoi cells), but the partition is obtained by clustering the data, so it is attuned to the global structure of the data rather than produced by greedy divide and conquer. Even so, with IVF it is still possible that you need to check every cell to find your true nearest neighbor; it is just less likely to happen in practice because of that data-driven partitioning.
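For concreteness, an IVF setup in FAISS looks roughly like this (the nlist/nprobe values below are arbitrary placeholders); nprobe is exactly the "how many cells do I bother checking" knob:

```python
import numpy as np
import faiss

d = 128
xb = np.random.rand(100_000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")        # query vectors

# Cluster the data into nlist Voronoi cells, then search only a few of them per query.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 cells
index.train(xb)   # k-means clustering positions the cells around the data
index.add(xb)

index.nprobe = 8  # cells to visit per query; higher = better recall, slower search
distances, ids = index.search(xq, 10)
```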
I wonder if nearest neighbor is actually what you want to begin with. Imagine points distributed in a U-shape: the nearest neighbor might require jumping across the gap, while a better match could be a bit further away but somewhere along the shape. Especially if the dimensions do not have a meaningful intrinsic scale, I could just scale the x dimension and pull the two sides of the U apart. So maybe it would actually be more useful to look at the topology of the points, something like a Vietoris–Rips complex [1] but invariant under [local] rescaling of dimensions. Not sure if something like that exists or is possible in a robust and efficient way.
Funny, I got interested in nearest-neighbor search for this kind of vector in the early 2000s, and a lot of the papers on the subject were depressing (you had to make a tradeoff that wasn't really attractive) compared to the literature on low-dimensional indexes. Back then I had 250,000 or so vectors, and a full scan wasn't that bad.
In the 2010s I worked on a dense vector search engine for patents; we had maybe 10 million vectors and still used a scan. I mean, you couldn't ask for a more memory-hierarchy- and SIMD-friendly algorithm.
Today I am looking at 1M (larger) vectors, and a full scan is still possible, but I am using FAISS because it is a bird in the hand and I decided I can live with the tradeoff.
If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.
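For example, something along these lines should let you swap ANN backends without touching the rest of the code; the `backend` config key and its values here are my reading of the txtai docs, so double-check there (hnswlib and annoy also need to be pip installed separately):

```python
from txtai.embeddings import Embeddings

data = [
    "US tops 5 million confirmed virus cases",
    "Beijing mobilises invasion craft along coast",
    "Maine man wins $1M from $25 lottery ticket",
]

# Assumed backend values: "faiss", "hnsw", "annoy" (plus "numpy" and "torch").
for backend in ("faiss", "hnsw", "annoy"):
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": backend})
    embeddings.index([(i, text, None) for i, text in enumerate(data)])
    print(backend, embeddings.search("lottery winner", 1))
```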
Does anyone have any recommendations for a decent crash course on using vector DBs in conjunction with LLMs? I want to do some experimentation with getting a model to comment on the similarity of data vectors etc., and I don't really know where to start.
If you want to experiment with vector stores, you can do that locally with something like faiss which has good multiplatform support and sufficient tutorials: https://github.com/facebookresearch/faiss
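A minimal, self-contained example of that kind of local experiment (the `embed` function here is just a toy stand-in for a real embedding model, e.g. a sentence-transformers model):

```python
import numpy as np
import faiss

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a real embedding model: hashes the text into a
    # deterministic pseudo-random vector so the example runs on its own.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(dim, dtype=np.float32)

texts = ["how to reset my password", "store hours", "refund policy"]
vectors = np.vstack([embed(t) for t in texts])

index = faiss.IndexFlatL2(vectors.shape[1])  # exact (brute-force) search, no training step
index.add(vectors)

distances, ids = index.search(embed("when are you open?").reshape(1, -1), 2)
print([texts[i] for i in ids[0]])  # faiss returns integer row ids, not the texts themselves
```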
Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results has more steps, but you get a lot of flexibility, and despite what AI influencers say there's no standard best practice. When you query a vector DB you get the most similar texts back (or an integer index in the case of faiss); you then feed those results to an LLM like a normal prompt, which can be optimized with prompt engineering.
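In sketch form the whole loop is short; the helpers below are hypothetical stand-ins rather than any particular library's API:

```python
def vector_search(query: str, k: int = 3) -> list[str]:
    # Stand-in for a real vector DB query: embed `query`, search the index,
    # and map the returned ids back to the stored texts.
    return ["(retrieved passage 1)", "(retrieved passage 2)"][:k]

def call_llm(prompt: str) -> str:
    # Stand-in for whatever LLM client you use.
    return f"LLM answer based on a {len(prompt)}-character prompt"

def answer(question: str) -> str:
    context = "\n\n".join(vector_search(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

print(answer("What does the refund policy say?"))
```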
The codifier for the RAG workflow is LangChain, but their demo is substantially more complex and harder to use than even a homegrown implementation: https://minimaxir.com/2023/07/langchain-problem/
Also, if what you look up has no semantic meaning, like part numbers, you might be better off with an inverted index in addition to ANN lookups. Especially if the embedding model was trained on a dataset that is not similar to what you use it for; that's a common situation right now with embedding models based on LLMs.
I recommend pgvector; it's simple and well featured, with good example code. Once you have a dataset of vectors loaded in, the next step is called RAG (retrieval-augmented generation).
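A rough sketch of the pgvector side via psycopg; the table and column names and the connection string are made up, and vectors are passed in pgvector's '[x,y,z]' text format:

```python
import psycopg

conn = psycopg.connect("dbname=mydb", autocommit=True)  # hypothetical connection string
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, body text, embedding vector(384))"
)

def to_pgvector(vec) -> str:
    # pgvector accepts vectors as a '[0.1,0.2,...]' text literal.
    return "[" + ",".join(str(x) for x in vec) + "]"

embedding = [0.0] * 384  # stand-in for a real embedding
conn.execute(
    "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
    ("hello world", to_pgvector(embedding)),
)

# <-> is pgvector's L2 distance operator; use <=> for cosine distance.
rows = conn.execute(
    "SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
    (to_pgvector(embedding),),
).fetchall()
print(rows)
```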
I was so saddened by this title... I thought there was a way to find annoying neighbors (at homes I'm interested in acquiring) and assumed this was a service for literally finding nearby annoy[ing] neighbors. Then I went through the comments and found that this is a service with a minimum input of addresses and census data, which are both freely available, as is this service. Bravo HN!
[0] https://github.com/spotify/annoy#summary-of-features
[1] https://en.wikipedia.org/wiki/Binary_space_partitioning
[1] https://en.wikipedia.org/wiki/Vietoris%E2%80%93Rips_complex
I haven't read up on vector DBs, but I wonder if k-d trees would be a better fit?
Disclaimer: I am the author of txtai
https://news.ycombinator.com/item?id=35826929