The biggest immediately useful difference that I see is that Annoy uses read-only index files (from the docs: "you can not add more items once the tree has been created" [0]), while Voyager allows you to call `.add_item` at any time (I just pip installed it to double-check, and yes, it's great).
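In case it helps anyone, here is a minimal sketch of what incremental inserts look like with Voyager's Python bindings; the constructor and method names are from my quick read of the docs, so treat the details as an assumption:

```python
import numpy as np
from voyager import Index, Space

# Assumed API: create an empty index and add vectors one at a time,
# with no separate "build this first" step.
index = Index(Space.Cosine, num_dimensions=128)

first_id = index.add_item(np.random.rand(128).astype(np.float32))
neighbors, distances = index.query(np.random.rand(128).astype(np.float32), k=1)

# Unlike Annoy, you can keep adding items after querying.
index.add_item(np.random.rand(128).astype(np.float32))
```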
The great thing about Annoy is that you can write the index to disk and thus do big data work on tiny workers at the edge. I've never really seen anything else do the same.
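A rough sketch of that build-once, query-anywhere workflow with Annoy (the calls are the standard ones from its README; `load_vectors()` is a made-up helper):

```python
from annoy import AnnoyIndex

DIM = 100

# Build once on a big machine...
builder = AnnoyIndex(DIM, "angular")
for i, vec in enumerate(load_vectors()):  # load_vectors() is a hypothetical helper
    builder.add_item(i, vec)
builder.build(10)          # 10 trees
builder.save("index.ann")

# ...then on a tiny worker, load() mmaps the file, so many processes
# can share the same index without duplicating it in RAM.
searcher = AnnoyIndex(DIM, "angular")
searcher.load("index.ann")
print(searcher.get_nns_by_vector([0.1] * DIM, 10))
```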
That sounds pretty much like binary space partitioning [1], which is probably best known for its use in Doom to accelerate rendering. Also, if you only search the leaf subspace that your query point lies in, then you can in principle be arbitrarily far off from the true nearest neighbor.
k-D trees and BSPs don't work well in high dimensions (say, more than 20 to 30 dims), since your true nearest neighbor is quite likely to end up on the other side of some dividing hyperplane from your query point.
If each dividing hyperplane separates a set of vectors into two roughly equal parts, then for N vectors you need on the order of log2(N) levels of separating planes to get down to a single vector per terminal node. But for, say, a billion vectors in a 100-dimensional space (and assuming something like a k-D tree, where you split along one dimension at each level), you have only partitioned log2(10^9) ~= 30 dimensions this way. That leaves roughly 70 other dimensions through which you can travel to find your true nearest neighbor (the curse of dimensionality: in high dimensions everything is "near" everything else), and those you have not checked at all if you discard subspaces along the way.
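A quick numeric illustration of that last point, in plain NumPy: as the dimension grows, the nearest neighbor of a random query is barely closer than the average point.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.normal(size=(5000, d))
    query = rng.normal(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # Ratio of nearest distance to mean distance; tends toward 1 as d grows,
    # i.e. "everything is near everything else".
    print(f"d={d:4d}  nearest/mean = {dists.min() / dists.mean():.3f}")
```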
This is a problem for other approximate methods as well, though. IVF methods still do a form of partitioning (via Voronoi cells), but the partition is obtained by clustering the data, so it is attuned to the global structure of the data rather than produced by greedy divide and conquer. Even so, with IVF it is still possible that you need to check every cell to find your true nearest neighbor; it is just less likely to happen in practice because of that data-driven partitioning.
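For concreteness, an IVF setup in FAISS looks roughly like this (the nlist/nprobe values below are arbitrary placeholders); nprobe is exactly the "how many cells do I bother checking" knob:

```python
import numpy as np
import faiss

d = 128
xb = np.random.rand(100_000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")        # query vectors

# Cluster the data into nlist Voronoi cells, then search only a few of them per query.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 cells
index.train(xb)   # k-means clustering positions the cells around the data
index.add(xb)

index.nprobe = 8  # cells to visit per query; higher = better recall, slower search
distances, ids = index.search(xq, 10)
```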
I wonder if nearest neighbor is actually what you want to begin with. Imagine points distributed in a U-shape: the nearest neighbor might require jumping across the gap, while a better match could be a bit further away but somewhere along the shape. Especially if the dimensions do not have a meaningful intrinsic scale, I could just scale the x dimension and pull the two sides of the U apart. So maybe it would actually be more useful to look at the topology of the points, something like a Vietoris–Rips complex [1] but invariant under [local] rescaling of dimensions. Not sure if something like that exists or is possible in a robust and efficient way.
Funny, I got interested in nearest-neighbor search for this kind of vector in the early 2000s, and a lot of the papers on the subject were depressing (you had to make a tradeoff that wasn't really attractive) compared to the literature on low-dimensional indexes. Back then I had 250,000 or so vectors, and a full scan wasn't that bad.
In the 2010s I worked on a dense vector search engine for patents; we had maybe 10 million vectors and still used a scan. I mean, you couldn't ask for a more memory-hierarchy- and SIMD-friendly algorithm.
Today I am looking at 1M (larger) vectors, and a full scan is still possible, but I am using FAISS because it is a bird in the hand and I decided I can live with the tradeoff.
If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.
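For example, something along these lines should let you swap ANN backends without touching the rest of the code; the `backend` config key and its values here are my reading of the txtai docs, so double-check there (hnswlib and annoy also need to be pip installed separately):

```python
from txtai.embeddings import Embeddings

data = [
    "US tops 5 million confirmed virus cases",
    "Beijing mobilises invasion craft along coast",
    "Maine man wins $1M from $25 lottery ticket",
]

# Assumed backend values: "faiss", "hnsw", "annoy" (plus "numpy" and "torch").
for backend in ("faiss", "hnsw", "annoy"):
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": backend})
    embeddings.index([(i, text, None) for i, text in enumerate(data)])
    print(backend, embeddings.search("lottery winner", 1))
```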
Does anyone have any recommendations for a decent crash course on using vector DBs in conjunction with LLMs? I want to do some experimentation with getting a model to comment on the similarity of data vectors etc., and I don't really know where to start.
If you want to experiment with vector stores, you can do that locally with something like faiss which has good multiplatform support and sufficient tutorials: https://github.com/facebookresearch/faiss
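A minimal, self-contained example of that kind of local experiment (the `embed` function here is just a toy stand-in for a real embedding model, e.g. a sentence-transformers model):

```python
import numpy as np
import faiss

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a real embedding model: hashes the text into a
    # deterministic pseudo-random vector so the example runs on its own.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(dim, dtype=np.float32)

texts = ["how to reset my password", "store hours", "refund policy"]
vectors = np.vstack([embed(t) for t in texts])

index = faiss.IndexFlatL2(vectors.shape[1])  # exact (brute-force) search, no training step
index.add(vectors)

distances, ids = index.search(embed("when are you open?").reshape(1, -1), 2)
print([texts[i] for i in ids[0]])  # faiss returns integer row ids, not the texts themselves
```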
Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results has more steps, but you get a lot of flexibility, and despite what AI influencers say there's no standard best practice. When you query a vector DB you get the most similar texts back (or an integer index in the case of faiss); you then feed those results to an LLM like a normal prompt, which can be optimized with prompt engineering.
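In sketch form the whole loop is short; the helpers below are hypothetical stand-ins rather than any particular library's API:

```python
def vector_search(query: str, k: int = 3) -> list[str]:
    # Stand-in for a real vector DB query: embed `query`, search the index,
    # and map the returned ids back to the stored texts.
    return ["(retrieved passage 1)", "(retrieved passage 2)"][:k]

def call_llm(prompt: str) -> str:
    # Stand-in for whatever LLM client you use.
    return f"LLM answer based on a {len(prompt)}-character prompt"

def answer(question: str) -> str:
    context = "\n\n".join(vector_search(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

print(answer("What does the refund policy say?"))
```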
The codifier for the RAG workflow is LangChain, but their demo is substantially more complex and harder to use than even a homegrown implementation: https://minimaxir.com/2023/07/langchain-problem/
Also, if what you look up has no semantic meaning, like part numbers, you might be better off with an inverted index in addition to ANN lookups. Especially if the embedding model was trained on a dataset that is not similar to what you use it for; that's a common situation right now with embedding models based on LLMs.
I recommend pgvector; it's simple and well featured, with good example code. Once you have a dataset of vectors loaded in, the next step is called RAG (retrieval-augmented generation).
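A rough sketch of the pgvector side via psycopg; the table and column names and the connection string are made up, and vectors are passed in pgvector's '[x,y,z]' text format:

```python
import psycopg

conn = psycopg.connect("dbname=mydb", autocommit=True)  # hypothetical connection string
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute(
    "CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, body text, embedding vector(384))"
)

def to_pgvector(vec) -> str:
    # pgvector accepts vectors as a '[0.1,0.2,...]' text literal.
    return "[" + ",".join(str(x) for x in vec) + "]"

embedding = [0.0] * 384  # stand-in for a real embedding
conn.execute(
    "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
    ("hello world", to_pgvector(embedding)),
)

# <-> is pgvector's L2 distance operator; use <=> for cosine distance.
rows = conn.execute(
    "SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 5",
    (to_pgvector(embedding),),
).fetchall()
print(rows)
```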
I was so saddened by this title... I thought there was a way to find annoying neighbors (at homes I'm interested in acquiring) and assumed this was a service for literally finding nearby annoy[ing] neighbors. Then I went through the comments and found that this is a service with a minimum input of addresses and census data, which are both freely available, as is this service. Bravo HN!
[0] https://github.com/spotify/annoy#summary-of-features
[1] https://en.wikipedia.org/wiki/Binary_space_partitioning
[1] https://en.wikipedia.org/wiki/Vietoris%E2%80%93Rips_complex
I haven't read up on vector DBs, but I wonder if k-d trees would be a better fit?
Disclaimer: I am the author of txtai
https://news.ycombinator.com/item?id=35826929