As others have correctly pointed out, making a vector search or recommendation application requires a lot more than similarity alone. We have seen HNSW become commoditised, and the real value lies elsewhere. Just because a database has vector functionality doesn’t mean it will actually service anything beyond “hello world” type semantic search applications. IMHO these have questionable value, much like the simple Q and A RAG applications that have proliferated. The elephant in the room with these systems is that if you are relying on machine learning models to produce the vectors, you are going to need to invest heavily in the ML components of the system. Domain-specific models are a must if you want to be a serious contender to an existing search system, and all the usual considerations still apply regarding frequent retraining and monitoring of the models. Currently this is left as an exercise for the reader - and a very large one at that. We (https://github.com/marqo-ai/marqo, I am a co-founder) are investing heavily in making the ML production-worthy and in continuous learning from feedback as part of the system. There are lots of other things to think about: how you represent documents with multiple vectors, multimodality, late interactions, the interplay between embedding quality and HNSW graph quality (i.e. recall), and much more.
In general I find they're incredibly good for being able to rapidly build out search engines for things that would normally be difficult to do with plain text.
The most obvious example is code search where you can describe the function's behavior and get a match. But you could also make a searchable list of recipes that would allow a user to search something like "a hearty beef dish for a cold fall night". Or searching support tickets where full text might not match, "all the cases where users had trouble signing on".
Interestingly, Q & A is ultimately a (imho fairly boring) implementation of this pattern.
The really nice part is that you can implement working demos of these projects in just a few lines of code once you have the vector db set up. Once you start thinking in terms of semantic search over text matching, you realize you can build old-Google style search engines for basically any text available to you.
One thing that is a bit odd about the space, from what I've experienced and heard, is that setup and performance on most of these products is not all that great. Given that you can implement the demo version of a vector db in a few lines of numpy, you would hope that investing in a full vector db product would get you an easily scalable solution.
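To make that "few lines" point concrete, here is a toy brute-force version in plain Python (numpy would vectorize the same loops); the document IDs and vectors are invented for the example:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(index, query, k=2):
    # Exact (brute-force) k-nearest-neighbour search by cosine similarity
    scored = sorted(index.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(search(index, [1.0, 0.05, 0.0]))  # doc_a and doc_b are closest
```

Everything a real product adds on top of this - ANN indexing, persistence, filtering, replication - is where the engineering actually goes.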
Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.
They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.
Vectors, IMO, are just one feature that a regular search engine should have. Currently Vespa does the best job of this, though lately it seems Lucene (Elasticsearch and OpenSearch) is really working hard to compete.
My company is using vector search with Elasticsearch. It’s working well so far. IMO Elastic will eat most vector-first/only products because of its strength at full-text search, plus all the other stuff it does.
I tend to agree - search, and particularly search-for-humans, is really a team sport - meaning, very rarely do you have a single search algo operating in isolation. You have multiple passes, you filter results through business logic.
Having said that, I think pgvector has a chance for less scale-intense needs - embedding as a column in your existing DB and a join away from your other models is where you want search.
I don’t get why you’d want to bolt RBAC onto these new vector dbs, unless it’s because they’ve caused this problem in the first place…
Until very recently, “dense retrieval” was not even as good as BM25, and it still is not always better.
I think a lot of people use dense retrieval in applications where sparse retrieval is still adequate and much more flexible, because it has the hype behind it. Hybrid approaches also exist and can help balance the strengths and weaknesses of each.
Vectors can also work in other tasks, but largely people seem to be using them for retrieval only, rather than applying them to multiple tasks.
A lot of these things are use-case dependent. Even the characteristics of BM25 vary a lot depending on whether the query is over- or under-specified, the nature of the query, and so on.
I don't think there will ever be an answer to what the best way of doing information retrieval is for a search-engine-scale corpus of documents - no single approach is superior for every type of query.
More commonly you use approximate KNN vector search with LLM-based embeddings, which can find many fitting documents that BM25 and similar would never manage to.
The tricky part is to properly combine the results.
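One common recipe for that combination step is reciprocal rank fusion (RRF), which needs only the two ranked lists rather than comparable scores. A sketch (document IDs invented; k=60 is the conventional constant):

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each result list contributes 1/(k + rank)
    # per document; k damps the influence of the very top ranks.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["d1", "d2", "d3"]   # lexical ranking
vector_hits = ["d3", "d1", "d4"]   # dense ranking
print(rrf([bm25_hits, vector_hits]))  # d1 first: ranked high by both
```

The appeal is that BM25 scores and cosine similarities live on incompatible scales, so fusing by rank sidesteps score normalization entirely.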
Vector search is not exclusively in the domain of text search. There is always image/video search.
But pre-filtering is important, since you want to reduce the set of items to be matched on, and it feels like Elasticsearch/OpenSearch are faring better in this regard. Mixed scoring derived from both sparse and dense calculations is also important, which is another strength of ES/OS.
Check out FeatureBase when you get a chance. Vectors and super-fast operations on sets. I'm using it for managing key terms extracted from the text and stored along with the vectors.
I'm building a RAG for my personal use: Say I have a lot of notes on various topics I've compiled over the years. They're scattered over a lot of text files (and org nodes). I want to be able to ask questions in a natural language and have the system query my notes and give me an answer.
The approach I'm going for is to store those notes in a vector DB. When I ask my query, a search is performed and, say, the top 5 vectors are sent to GPT for parsing (along with my query). GPT will then come back with an answer.
I can build something like this, but I'm struggling in figuring out metrics for how good my system is. There are many variables (e.g. amount of content in a given vector, amount of overlap amongst vectors, number of vectors to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and try to subjectively gauge whether I think it did a good job.
Any tips on how people measure the performance/effectiveness for these types of problems?
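For what it's worth, two of those variables - amount of content per vector and overlap amongst vectors - come down to the chunking step before embedding. A toy word-based chunker (one simple choice among many):

```python
def chunk_words(words, size, overlap):
    # Split a token list into overlapping windows; each chunk would be
    # embedded as one vector. overlap controls shared context between chunks.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
    return chunks

words = "one two three four five six seven".split()
for c in chunk_words(words, size=4, overlap=2):
    print(" ".join(c))
```

Sweeping size and overlap here, then comparing retrieval quality, is exactly the kind of tuning the question above is about.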
For small personal projects it's kind of hard to build metrics like this because the volume of indexed content in the database tends to be pretty low. If you're indexing paragraphs you might consistently be able to fit all relevant paragraphs in the context itself.
What I can recommend is to take the coffee-tasting approach. Don't try to test and evaluate individual responses; instead, lock the seed used in generation and use the same prompt for two different runs. Change one variable and do a relative comparison of the two outputs. Off the top of my head, the variables probably worth testing for you:
* Choice of models and/or tunes
* System prompts
* Temperature of the model against your queries
* Threshold for similarity for document inclusions (you only want relevant documents from your RAG, set it too low and you'll get some extra distractions, too high and useful information might be left out of the context).
If you set up a system to track the comparisons, either automatically or by hand, that just records which side of the change worked better for your use case, and you test that same change for a bunch of different prompts, you should be able to tally up whether the control or the change was preferred.
Keep those data points! The data points are your bench log and can be invaluable later on for anything you do with the system - to see what changed in aggregate, what had the most outsized impact, etc. - and can guide you to build useful tooling for testing or to find existing solutions out there.
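A minimal sketch of what that bench log and tally could look like (the entries and field layout are invented for illustration):

```python
from collections import Counter

# Each entry: (variable changed, prompt, preferred side). Data is illustrative.
bench_log = [
    ("temperature 0.2 -> 0.7", "summarize meeting notes", "change"),
    ("temperature 0.2 -> 0.7", "extract action items",    "control"),
    ("temperature 0.2 -> 0.7", "answer factual question", "control"),
    ("similarity threshold",   "answer factual question", "change"),
]

def tally(log, variable):
    # Count which side won across all prompts for one changed variable
    return Counter(side for var, _, side in log if var == variable)

print(tally(bench_log, "temperature 0.2 -> 0.7"))  # control preferred 2-1
```

The point is less the code than the habit: one changed variable per row, many prompts, and a running count you can revisit later.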
I use lots and lots of domain-specific test cases at several layers, numbering in the hundreds or thousands. The score is the number of test cases that pass, so it requires a different approach than all-or-nothing tests. The layers depend on your RAG “architecture”, but I test the RAG query generation and scoring (comparing ordered lists is the simplest, but I also include a lot of fuzzy comparisons), the LLM scoring the relevance of retrieved snippets before feeding into the final answering prompt, and the final answer. The most annoying part is the prompt to score the final answer, since it tends to come out looking like a CollegeBoard AP test scoring rubric.
This requires a lot of domain-specific work. For example, two of my test cases are “Is it [il]legal to build an atomic bomb” run against the entire USCode [1], so I have a list of sections that are relevant to the question that I’ve scored before eventually getting an answer of “it is illegal”, followed by several prompts that evaluate nuance in the answer (“it’s illegal except for…”). I have hundreds of these test cases, approaching a thousand. It’s a slog.
[1] 42 U.S.C. 2122 is one of the “right” sections in case anyone is wondering. Another step tests whether 2121 is pulled in based on the mention in 2122
The main thing is that there's no "objective" way, but if you rank and label your own data then you can certainly get a ranking that's subjectively well performing according to you.
RAG in this case is essentially the same as a recommender system so you can approach it with the same metrics you would there.
You'll need to build a data set with known correct answers, but then NDCG (Normalized Discounted Cumulative Gain) is a good place to start; MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) are other options. You could also just look at the accuracy of getting your result in the top k results for various thresholds of k (which can be interpreted as the probability of getting your result in k results).
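For reference, sketches of those metrics over ranked lists with binary relevance labels (1 = relevant; the example labels are invented):

```python
import math

def mrr(rankings_relevance):
    # Mean Reciprocal Rank: average of 1/rank of the first relevant result
    total = 0.0
    for rels in rankings_relevance:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(rankings_relevance)

def dcg(rels):
    # Discounted Cumulative Gain with the standard log2 position discount
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    # Normalize DCG by the best possible (ideal) ordering of the same labels
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

# One query: relevant results at positions 2 and 4
rels = [0, 1, 0, 1, 0]
print(mrr([rels]))               # 0.5: first relevant result at rank 2
print(precision_at_k(rels, 4))   # 0.5
print(ndcg(rels))
```

Graded (non-binary) relevance labels plug into the same NDCG formula unchanged, which is one reason it's the usual default.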
Included here is a bit of the old tried and true: NDCG/MRR/Precision@k - what you really want for measuring your information retrieval systems.
But we also talk through a bit of the "new", how to use Evals to generate the building blocks for those metrics above. You will want both hand labels and the automated Evals in the end to evaluate your system.
txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.
Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling and retrieval augmented generation.
txtai adopts a local-first approach. A production-ready instance can be run locally within a single Python instance. It can also scale out when needed.
txtai can use Faiss, Hnswlib or Annoy as its vector index backend. This is relevant in terms of the ANN-Benchmarks scores.
Hence why I’d be interested to know more about the supporting details for the different categories. It may help uncover some inadvertent errors in the analysis, but also would just serve as a useful jumping-off point for people doing their own research as well.
Totally agree about the puzzling rubric. PostgreSQL supports role-based access control (RBAC). Not to mention, with PostgreSQL and the pgvector extension, you have a whole list of languages ready to use it:
C++ pgvector-cpp
C# pgvector-dotnet
Crystal pgvector-crystal
Dart pgvector-dart
Elixir pgvector-elixir
Go pgvector-go
Haskell pgvector-haskell
Java, Scala pgvector-java
Julia pgvector-julia
Lua pgvector-lua
Node.js pgvector-node
Perl pgvector-perl
PHP pgvector-php
Python pgvector-python
R pgvector-r
Ruby pgvector-ruby, Neighbor
Rust pgvector-rust
Swift pgvector-swift
Wonder how many of those other Vector databases play nice.
That stood out to me as well. I've been playing with pgvector, and there's no reason you can't use row/table role-based security.
I think there's an unmentioned benefit to using something like pgvector also. You don't need a separate relational database! In fact you can have foreign keys to your vectors/embeddings which is super powerful to me.
Same for Developer experience. If you used Postgres or any other relational db (which I think covers a large % of devs), you could easily argue the dev experience is 3/3 for pgvector.
Not only 3/3, but it also includes full-text search built in. Tables look like:

MyThingEmbedding
  id          primary key
  mything_id  integer       -- fkey to mything table
  embedding   vector(1536)
  fulltext    tsvector

  GIN index on fulltext
  HNSW index on embedding
Then you can pull results that match either the tsvector AND/OR the similarity with a single query, and it's pretty performant. You can also choose at the query level whether you want exact matching or fuzzy.
I made this table to compare vector databases in order to help me choose the best one for a new project. I spent quite a few hours on it, so I wanted to share it here too in hopes it might help others as well. My main criteria when choosing vector DB were the speed, scalability, dx, community and price. You'll find all of the comparison parameters in the article.
Happy to connect. The benchmark numbers are mostly from ANN Benchmarks. For my use case, the nytimes-256 dataset was most relevant so I used that for the QPS benchmark. I also took a look at the benchmarks you've made at https://qdrant.tech/benchmarks/ and there qdrant seems to be outperforming many others. If I've gotten something wrong here, I'm glad to update the article :)
I'd love to know how vector databases compare in their ability to do hybrid queries, vector similarity filtered by metadata values. For example, find the 100 items with the closest cosine similarity where genre = jazz and publication date between 1990 and 2000.
Can the vector index operate on a subset of records? Or when searching for 100 closest matches does the database have to find 1000 matches and then apply the metadata filter, and hope that doesn't reduce the result set down to zero and exclude relevant vectors?
It seems like measuring precision and recall for hybrid queries would be illuminating.
I can't speak to the others, but pgvector indices can "break" hybrid queries. For example, if you select using a where clause specifying metadata (where genre = jazz) and order by distance from a vector (embedding of sound clip); if the index doesn't have a lot (or any) vectors in the sphere of the query vector that also match the metadata it can return no results. I discuss this in a blog post here [1].
Curious about the lack of Vespa, especially given the thoroughness of the article and Vespa's long-time reputation. OpenSearch is also missing, but perhaps it can be lumped in with Elasticsearch since both are based on Lucene. The products are starting to diverge, though, so it would be nice to see, especially since it is open source.
For the performance-based columns, it would also be helpful to see which versions were tested. There is so much attention lately on vector databases that they are all making great strides forward. The Lucene updates are notable.
We recently did a bunch of evaluation work to quantify the differences between keyword search, vector search, hybrid, reranking, etc. across a few datasets. We shared the results here: https://techcommunity.microsoft.com/t5/azure-ai-services-blo...
Disclosure - I work in the Azure Search team.
Blog on the same topic - https://blog.langchain.dev/evaluating-rag-pipelines-with-rag...
Disclaimer: I am the author of txtai
For example, pgvector is listed as not having role-based access control, but the Postgres manual dedicates an entire chapter to it: https://www.postgresql.org/docs/current/user-manag.html
[1]: https://www.polyscale.ai/blog/pgvector-bigger-boat/
"no" - the graph objects after training are opaque AFAIK