The first unstated assumption is that similar vectors are relevant documents, and for many use cases that's just not true. Cosine similarity != relevance. So if your pipeline pulls 2 or 4 or 12 document chunks into the LLM's context, and half or more of them aren't relevant, does this make the LLM's response more or less relevant?
The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of computing all the pairwise similarities exactly), that set of K vectors will be missing documents whose cosine similarity is higher than that of the K'th vector retrieved.
All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
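A toy sketch of measuring that gap. The "index" here is a deliberately crude sign-of-random-projection bucketing standing in for a real ANN structure (HNSW, IVF, etc.), and the corpus is random unit vectors, so everything is illustrative; the point is only the recall@K measurement against an exact full scan:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 1000 random unit vectors standing in for document embeddings.
docs = rng.normal(size=(1000, 64))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

K = 10

# Ground truth: exact top-K by cosine similarity, via a full scan.
sims = docs @ query
exact_top = set(np.argsort(-sims)[:K])

# Crude "index": bucket docs by the signs of 4 random projections,
# then search only the query's bucket -- a stand-in for a real ANN index.
proj = rng.normal(size=(64, 4))
buckets = (docs @ proj > 0)
qbucket = (query @ proj > 0)
candidates = np.where((buckets == qbucket).all(axis=1))[0]
approx_sims = docs[candidates] @ query
approx_top = set(candidates[np.argsort(-approx_sims)[:K]])

# Recall@K: how many of the true top-K the index actually returned.
recall = len(exact_top & approx_top) / K
print(f"recall@{K} = {recall:.2f}")
```

Real indexes do far better than this sketch, but the same measurement applies: without a ground-truth full scan (or labeled relevance data), you can't know what your index is silently dropping.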
> The vectors are literally constructed so that cosine similarity is semantic similarity.
Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.
And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms, not dimensionality reduction. Just because they can find ways to cluster the data in a lower dimensional projection doesn't mean the points are similar in the higher dimensional space. It all depends on the ability to unknot the data in the higher dimensional space.
It is extremely important to do what the OP is doing and consider the assumptions of the model, the data, and the measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims. There are plenty more assumptions than the OP listed, too. But those assumptions don't mean the model won't work; they just tell you what constraints the model is working under. We want to understand the constraints/assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces, which gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.
Switching to Word2Vec embeddings led to a substantial improvement in my cosine similarity evaluations for text similarity, but granted I was looking for actual similarity, not relevance. I tried many different methods and had lots of mediocre results initially.
Interesting, do you happen to have some quantitative results on this/additional insights/etc?
I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.
> Cosine similarity != relevance

In all ML search products there's a tradeoff between precision and recall, and moreover there's almost never any "gold" data that ensures the "correctness" of surfaced results. I mean, Bing and Google have both invested millions of dollars in labeling web pages and even evaluating search results, but those labels can become useless as your set of documents changes.
Cosine similarity is a useful compromise, and yes, a lot of authors take this for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather on "lift" over an alternative. And the evaluation will be in units of user happiness.
> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
This is usually a Series E problem, not a Series A problem.
Exactly. The whole point about databases is that you don't need "a database for AI", you need a database, ideally with an extension that adds the AI functionality (i.e. Postgres and pgvector). If you take a special store you invented for AI and try to retrofit all the desirable things you need to make it work properly in the context of a real application, you're just going to end up with a mess.
As a thought-experiment for people who don't understand why you need (for example) regular relational columns alongside vector storage, consider how you would implement RAG for a set of documents where not everyone has permission to view every document. In the pgvector case it's easy - I can add one or more label columns and then when I do my search query filter to only include labels that user has permission to view. Then my vector similarity results will definitely not include anything that violates my access control. Trivial with something like pgvector - basically impossible (afaics) with special-purpose vector stores.
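A sketch of that query with pgvector; the table and column names (`documents`, `embedding`, `acl_label`) and the bind parameters are hypothetical:

```sql
-- Rows are filtered by label *before* the similarity ordering,
-- so the top-K results can never include a restricted document.
-- <=> is pgvector's cosine-distance operator.
SELECT id, content
FROM documents
WHERE acl_label = ANY(:user_labels)
ORDER BY embedding <=> :query_embedding
LIMIT 10;
```

Because the ACL predicate and the vector ordering live in the same query, the planner applies both together; there is no post-hoc filtering step that could leak a restricted chunk into the context window.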
Or think about ranking. Say you want to do RAG over a space where you want to prioritise the most recent results, not just pure similarity. Or prioritise on a set of other features somehow (source credibility whatever). Easy to do if you have relational columns, no bueno if you just have a vector store.
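For instance, a hypothetical blend of similarity and recency in a single ORDER BY (the 0.7/0.3 weights, the 30-day decay, and the `published_at` column are all arbitrary illustrations):

```sql
SELECT id, content
FROM documents
ORDER BY 0.7 * (1 - (embedding <=> :query_embedding))         -- cosine similarity
       + 0.3 * EXP(-EXTRACT(EPOCH FROM now() - published_at)
                   / (86400.0 * 30))                          -- ~30-day recency decay
         DESC
LIMIT 10;
```

Any other relational feature (source credibility score, document type, click-through stats) slots into the same expression.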
And that's not to mention the obvious things around ACID, availability, recovery, replication, etc.
Can I add one more nice-to-have? Good support for graph data. I'm not 100% certain of it yet, but there are a lot of ideas out there about storing knowledge as a graph, and it makes a lot of intuitive sense. I haven't found a killer use case for it yet, as so far I can get by just tagging things, and SQL querying on the tags is powerful enough.
Maybe someone could pitch in. Is knowledge really a graph (for your problem domain), or is that just some bullshit people made up when they still thought AI could be captured mathematically? It feels to me now that knowledge is much more like the way vector embeddings work: it's a cloud where things are related to each other in an analog or statistical way, not a discrete way.
But, perhaps for similar reasons, vector embeddings haven't been super useful to me in building RAG agents yet. Knowledge is either relevant or it's not, and at least for me if it's relevant it has the keywords or tags I need, and just a straight up SQL query brings it in.
You can think of a vector database with n vectors as a network whose adjacency matrix is n x n, where each edge is weighted by whatever similarity metric between nodes you choose to use. So you can have strongly connected edges and weakly connected edges.
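A toy numpy sketch of that view, with random unit vectors and an arbitrary cutoff separating "strong" from "weak" edges:

```python
import numpy as np

rng = np.random.default_rng(1)

# n toy embedding vectors (rows), normalized to unit length.
n, d = 6, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# n x n cosine-similarity matrix: entry (i, j) is the edge weight
# between node i and node j in the implied similarity graph.
A = X @ X.T

# Threshold into "strong" edges (the 0.3 cutoff is arbitrary),
# excluding self-loops on the diagonal.
strong = (A > 0.3) & ~np.eye(n, dtype=bool)
print(A.round(2))
print("strong edges:", np.argwhere(strong).tolist())
```

The matrix is symmetric with ones on the diagonal, i.e. an undirected weighted graph where every pair of nodes has some edge weight; thresholding is what turns the dense cloud into a discrete graph.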
You may want to take a look at Zep, an LLM application platform that wraps Postgres, pgvector, embedding models, and more to offer chat memory persistence and document vector search.
The Python and TS SDKs are designed to support drop-in replacements for the bits of LangChain that don’t scale, but nothing stops you accessing Postgres directly.
Yes totally agree with that (and other comments below). Moving from a toy example to production deployment requires all the things we are used to having in robust/mature products like postgres.
I don’t fully understand the fascination with retrieval augmented generation. The retrieval part is already really good and computationally inexpensive — why not just pass the semantic search results to the user in a pleasant interface and allow them to synthesize their own response? Reading a generated paragraph that obscures the full sourcing seems like a practice that’s been popularized to justify using the shiny new tech, but is the generated part what users actually want? (Not to mention there is no bulletproof way to prevent hallucinations, lies, and prompt injection even with retrieval context.)
On the modeling side, it's compelling to separate the memory from the linguistic skills. Vector search is hella fast and can be very good. So you can off load the memorization part of the problem, and let the language model focus on the language. This should allow better performance with much smaller models.
I really like using LLMs to learn stuff because they can explain anything at the exact level I need. Hallucination is a big problem with that and RAG pretty much solves it. If I give chatGPT a good stackoverflow post and tell it to dumb it down for me, it does very well. RAG just automates that process with the added benefit of not letting the LLM decide which information to retrieve, which should greatly reduce the chance of accidentally biasing the model with your prompt.
The main reason is that you might not want the raw information but some reasoning on top of it. An LLM is not only the context but all the information it has been trained on. For example, a math student asking a question doesn't want the raw theorems but some reasoning with them, and current LLMs can do that. They will make mistakes sometimes because of hallucinations, but for not-very-difficult questions they usually give you the right answer. And that helps a lot when you are not an expert in the domain. That is the reason GPT-4 is a great tool for students: it helps you understand the basics as if you had a teacher with you.
Sometimes what I want is to ask Google/Alexa/Siri a question and get a summary response along with the source. I think that would be a good application of the above.
Less so IMO when I’m on my phone or in front of the computer.
It's not clear to me that only a vector DB should be used for RAG.
Vector DBs give you stochastic responses.
For customer chatbots, it seems that structured data - from an operational database or a feature store adds more value. If the user asks about an order they made or a product they have a question about, you use the user-id (when logged in) to retrieve all info about what the user bought recently - the LLM will figure out what the prompt is referring to.
Thanks for sharing that observation on customer chatbots.
1. Will that query look like this:
SELECT LLM("{user_question}", order_info)
FROM postgres_data.order_table
WHERE user_id = "101";
2. How will a feature store, like Hopsworks, help in this app?
Shameless self-plug: We are building EvaDB [1], a query engine for shipping fast AI-powered apps with SQL. Would love to exchange notes on such apps if you're up for it!
A lot of the things mentioned are hand-waved and not explained well.
It's not explained how a vector DB is going to help when incumbents like GPT-4 can already call functions and make API calls.
It doesn't make AI less of a black box; that point is irrelevant and not explained.
There are already ways to fine-tune models without expensive hardware, such as using LoRA to inject small layers trained on customized data, which trains in a fraction of the time and resources needed to retrain the full model.
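The LoRA idea is simple enough to sketch in numpy: freeze the pretrained weight W and learn only a low-rank update delta_W = B @ A of rank r. This is a minimal illustration of the math, not the actual `peft` library API, and the sizes are arbitrary:

```python
import numpy as np

d, k, r = 1024, 1024, 8  # weight shape and LoRA rank (illustrative sizes)

# Full fine-tuning updates every entry of the weight matrix W.
full_params = d * k

# LoRA freezes W and learns a low-rank update delta_W = B @ A,
# with B of shape (d, r) and A of shape (r, k): r*(d + k) parameters.
lora_params = r * (d + k)
print(f"full: {full_params:,}  LoRA: {lora_params:,}  "
      f"({full_params / lora_params:.0f}x fewer trainable params)")

# Forward pass with the adapted weight: y = x @ (W + B @ A).T
rng = np.random.default_rng(0)
W = rng.normal(size=(d, k)) * 0.02   # frozen pretrained weight
B = np.zeros((d, r))                 # B initialized to zero, so the
A = rng.normal(size=(r, k)) * 0.02   # adapter starts as a no-op
x = rng.normal(size=(2, k))
y = x @ (W + B @ A).T
print(y.shape)                       # (2, 1024)
```

With d = k = 1024 and r = 8, that's 64x fewer trainable parameters than full fine-tuning, which is where the time and hardware savings come from; only A and B receive gradients.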
We use Lance extensively at my startup. This blog post (previously on HN) details nicely why: https://thedataquarry.com/posts/vector-db-4/ but essentially it’s because Lance is a “just a file” in the same way SQLite is a “just a file” which makes it embedded and serverless and straightforward to use locally or in a deployment.
I find it quite comical to speak of a "missing storage layer" during your own self-promotion, considering that the market for vector databases is literally overflowing right now.
Everything else may be missing, but not the storage layer.
Does ChatGPT always start articles with “in the rapidly evolving landscape of X”?
Surely if you’re posting an article promoting miraculous AI tech you should human edit the article summary so that it’s not really obviously drafted by AI.
Or just use the prompt “tone your writing down and please remember that you’re not writing for a high school student who is impressed by nonsensical hyperbole”. I’ve started using this prompt and it works astonishingly well in the fast evolving landscape of directionless content creation.
> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either
It's not unstated; it's called ANN (approximate nearest neighbors) for a reason.
They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.
If I'm using vectors for question/answer, then:
"What is a cat"
and
"What is a dog"
should be more dissimilar than the documents answering either.
If I'm using them for FAQ filtering, then they should be more similar.
Hence: heuristic.
code: https://github.com/jimmc414/document_intelligence/blob/main/... (repo: https://github.com/jimmc414/document_intelligence)
- Full SQL support
- Good tooling around migrations (e.g. dbmate)
- Good support for running in Kubernetes or in the cloud
- Well understood by operations, e.g. backups and scaling
- Supports vectors and similarity search
- Well-supported client libraries
So basically Postgres and PgVector.
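The setup for that combination is small. A sketch, where the table name, the 1536 dimensions (a common embedding size), and the index parameters are all illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Embeddings live alongside ordinary relational columns.
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536)
);

-- Approximate index for cosine search; tune "lists" for your data size.
CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```

Everything on the checklist above (migrations, backups, client libraries) then comes for free from the surrounding Postgres ecosystem.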
https://github.com/getzep/zep
Disclosure: I’m the primary author.
In conversational AI, providing search results appended to a long-memory context produces "human-like" results.
Reference:
https://www.hopsworks.ai/dictionary/retrieval-augmented-llm
[1] https://github.com/georgia-tech-db/evadb
You can train a small LLM on your private data to map the user question to tables in your DB.
Then just select with a limit (or time-bounded). The feature store is just another operational store that could have relevant data for the query.
I would assume the embedding model isn't trained on code and specific words that are industry/company specific.