It feels like Postgres is eating the world. In a thread a couple of days ago, many argued that, with its capability of efficiently storing and processing JSON, we can do without MongoDB et al. Now it's replacing vector databases too.
There are extensions that provide columnar storage and to be honest I never completely understood the need for specialized time series databases.
It is a bit as if Postgres has become what Oracle wanted to be - a one-stop solution for all your data storage needs.
Postgres adding JSON support destroyed so many marketing campaigns from non-technical leadership: "sick of all the money spent on SQL, we gotta try this nosql thing"
It's not only that postgres supports json (plenty of RDBMS do), it's that postgres has better nosql performance than the databases behind those campaigns.
> (...) and to be honest I never completely understood the need for specialized time series databases.
The need for specialized time series databases is explained by the basis of any software development process: reuse ready-made high-level abstractions instead of having to reinvent the wheel, poorly.
There are a few key features of time series databases that you either go without (e.g., data compression) or have to reinvent the wheel for (querying, resolution-based retention periods, statistical summaries of time periods, etc.).
You might live without those features, but in some applications there's a hefty price tag tied to not having them. If you adopt a time series database, you benefit from all those features without having to implement a single one of them.
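To make that concrete, here's a minimal sketch, with all table and column names invented for illustration, of hand-rolling just two of those wheels (resolution-based retention and period summaries) in plain Postgres; a time series database gives you both out of the box:

    -- Illustrative schema: raw samples plus a coarser rollup table.
    CREATE TABLE metrics (
        ts     timestamptz      NOT NULL,
        sensor text             NOT NULL,
        value  double precision NOT NULL
    );

    CREATE TABLE metrics_hourly (
        hour      timestamptz NOT NULL,
        sensor    text        NOT NULL,
        avg_value double precision,
        max_value double precision,
        n         bigint
    );

    -- Run periodically (cron or pg_cron), in one transaction so now()
    -- is evaluated consistently: summarize raw rows older than 7 days
    -- into hourly resolution, then drop them. This is a hand-rolled
    -- resolution-based retention policy.
    BEGIN;
    INSERT INTO metrics_hourly
    SELECT date_trunc('hour', ts), sensor, avg(value), max(value), count(*)
    FROM metrics
    WHERE ts < now() - interval '7 days'
    GROUP BY 1, 2;

    DELETE FROM metrics WHERE ts < now() - interval '7 days';
    COMMIT;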
Perhaps that has no value to you, but it does for the rest of the world.
I think what OP was getting at is why it has to be specialized. Time-series features can be implemented in a general purpose DB engine like postgres or mongo.
I'm curious to research whether a dedicated TSDB is worth it for my project. To save time, could you recommend one that in your opinion would be worth it over just Postgres? I used InfluxDB a couple of years ago but didn't really see the value, even though I loved the product. Should I give it another chance?
Application is storage of a diverse set of (irregular) events. Lifecycle management and archival is more important than compression. Volume is moderate.
Me too. You know what's another good sign times are changing? A Fortune 500 company gave me a Linux laptop to WFH. Even a decade ago this would have been extremely improbable.
You mean, open collaboration from people who care about what they are doing often leads to better projects faster? Careful now, someone might call ya a commie or worse an anarchist.
at least until wasm gets more and more traction
I am surprised that the "Postgres supports JSON so we don't need NoSQL systems" talking point is so common on Hacker News. MongoDB's marketing department is partially to blame for this because they marketed the negative trade-offs of MongoDB like non-SQL querying and lack of schema as "intentional" features.
It was Google's BigTable paper (2006) and Amazon's Dynamo paper (2007) that led to the "NoSQL revolution" in the late 2000s. The goal was to make it easier to scale a lot of de-normalized data horizontally with the tradeoffs being lack of RDBMS-like querying and schema enforcement.
In hindsight it was bizarre to market a whole category of horizontally scalable data systems based on one of their main negative trade-offs: Lack of SQL querying.
Since then many of these data systems have added SQL-like querying.
JSON in Postgres is less performant than MongoDB, that is what I found. Indexes over JSON in Postgres leave a lot to be desired, so much so that it required us to design around it. My rule now is: do not treat JSON in Postgres like MongoDB, otherwise you will have a bad time (e.g., slowness).
If you need to index over data inside JSON, why not pull out the pieces that need to be indexed into normal relational columns?
When I've used Postgres JSON fields in the past, I've used them for unstructured data that doesn't need to be indexed; if there's anything that does need to be indexed, it's important enough to extract out to a field for efficiency.
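As a sketch of that approach (the payload and customer_id names are hypothetical): since Postgres 12 a generated column can mirror a field out of the jsonb payload, so queries hit an ordinary btree instead of an index over the JSON itself:

    CREATE TABLE docs (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    );

    -- Promote the hot field to a real column, kept in sync automatically.
    ALTER TABLE docs
        ADD COLUMN customer_id text
        GENERATED ALWAYS AS (payload ->> 'customer_id') STORED;

    CREATE INDEX docs_customer_id_idx ON docs (customer_id);

    -- Uses the plain btree rather than, say, a GIN index on the jsonb:
    SELECT * FROM docs WHERE customer_id = '42';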
No, it isn't. It's a transactions-focused DBMS that's no good for analytics: orders of magnitude slower than available solutions, and another order of magnitude or so slower than the academic state of the art, last time I checked (which admittedly was a few years back).
I guess I have never quite understood what is meant by analytics, and feel there is a different definition that is over my head.
I do some “analytics” in a postgres database, with data and metrics stored in the same db, but I guess it’s not a huge amount - amount of data swamps the metrics by orders of magnitude. Seems to work ok for me, queries are 5ms or less, and it is only one thing to learn/deploy/maintain.
It’s on a pretty powerful server, though.
I think it’s the natural evolution of software development in an age where decoupled architectures make more and more sense. I’ve been around long enough to see a few pendulum swings, but I’m not sure we’re ever going back to a world where the monolith makes sense again. I know some people will disagree with me, but I think there is a fundamental issue with the OOP line of thought, through no fault of the philosophy itself. It’s simply that people don’t handle complexity well in the real world. The way I see it, you have two modes of operating: the right way, and whatever the heck you’re doing when it’s Thursday afternoon and you’ve slept poorly all week. Sure there are ways around this, but in most cases your code reviewer and quality control aren’t going to be allocated enough time (and neither are you, for that matter) to actually deny things that work. Which is how you end up with 9000 different tables, views, stored procedures and whatnot in your giant mess of a database.
Which was sort of fine in a world where nobody cared about data security. But it’s almost impossible to build a “monolith” database that lives up to modern legislative requirements, at least within the EU. So we’re basically living in a world where the way we’ve done databases for the past 50 years doesn’t work for the business. Unless you separate everything into business-related domains in small kingdoms with total sovereignty over who comes and goes. Which is basically what microservices are.
At the same time, we’re also living in a world where it’s often cheaper (and performant enough) to put 95% of what you do in a container using volumes for persistence. If SQLite was ready for it, or maybe, if all the ORMs were ready for SQLite, you could probably do 90% of your database needs in it. Since that’s not the case, Postgres reigns supreme because it’s the next best thing (yes, you’re allowed to think it’s better). I view vector DBs sort of similarly to this. We operate almost all of our databases in containers; the one exception is the one service everyone calls, where the performance loss of the container is too much. Well, that’s not entirely true, our data warehouse isn’t operated in containers, at least not by us, but I rarely interact with the BI side of things except for data architecture, so I’m actually not sure how their external partners do Ops. Anyway, I think Postgres will do 90% of (y)our vector needs, and then we will need dedicated vector databases for those last bits that won’t fit into the “everything” box. I don’t think this is bad either; maybe Oracle wanted it to be theirs, but then Oracle should not have Oracle’s governance. I’m sure they didn’t foresee this future when they bought MySQL though, I know I didn’t. Because who could’ve foreseen that the way we used to do databases would become sort of obsolete because of containers and EU legislation… and other things?
>But it’s almost impossible to build a “monolith” database that lives up to modern legislative requirements, at least within the EU. So we’re basically living in a world where the way we’ve done databases for the past 50 years doesn’t work for the business.
Not sure I agree. It seems far simpler to comply with most regulations if you store your data in a centralised data store with good management features.
If each and every microservice handles its own data, how would you implement rules affecting all of them, such as retention periods, deletion requests, permissions, access logs, etc? It's not just that the data is stored all over the place. You may also have to use microservice specific APIs to access it.
If you're referring to laws that require data storage in a particular jurisdiction, it would still be easier to centralise data management in each jurisdiction rather than distribute it across scores of services within each of those jurisdictions.
I agree, because we never left it. People who say otherwise have an extremely distorted view. Most companies are not Google or Meta.

This is an extremely narrow view. There are a zillion uses of databases that are not storing user data from web apps or whatever.
Great article, but I'm saddened by their view that C is too hard to work with, so the 2-year-old extension must be rewritten in Rust.
C certainly has its faults, and while I have no real experience with Rust, I'm willing to believe that it's significantly better as a language.
But pgvector, at least from a quick scan, looks like a well-written, easily comprehensible C codebase with decent tests. There are undoubtedly lots of hard problems that the developers have solved in implementing it. Putting time and effort into reimplementing all of that in another language because of an aversion to C feels like a waste of effort that could be put into enhancing the existing extension.
Maybe there's something I'm missing? Is the C implementation not as solid as it looks at first glance?
Rust makes concurrency really easy, at least in comparison to C or C++. It has great cross-platform frameworks available, like Tokio which pgvecto.rs uses, and makes using them safe and straightforward.
Do we know that tokio's concurrency strategy is optimal for database access?
We've gone with Postgres and pg_vector, which works really well thus far, especially since we know and like the existing Postgres tooling, and vector functionality there is just a (widely supported) pg_vector extension away. Since it now also supports HNSW indices, it should perform well on large amounts of vectorized data as well (though we haven't tested this yet).
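For reference, the basic shape of that setup; the schema is illustrative, and a 3-dimensional vector keeps the example short (real embeddings are typically hundreds of dimensions):

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE items (
        id        bigserial PRIMARY KEY,
        embedding vector(3)  -- match your embedding model's dimension in practice
    );

    -- HNSW index (pgvector 0.5.0+); vector_l2_ops means Euclidean distance.
    CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

    -- Ten nearest neighbours to a query vector:
    SELECT id FROM items ORDER BY embedding <-> '[0.1, 0.2, 0.3]' LIMIT 10;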
From a consultant perspective, Postgres as a requirement is a much easier sell than some new, hot but unknown dedicated vector DB.
Here are a few benchmarks from this week's commit: https://jkatz05.com/post/postgres/pgvector-hnsw-performance/
I have been following the vector database trend since 2020, and I ended up with this conclusion: vector search is a nice-to-have feature that adds more value on top of an existing database (Postgres) or text search service (Elasticsearch) than an entirely new framework full of hidden bugs does. You get a way higher speedup from using the right embedding model and encoding than from using the vector database with the best underlying optimization. And the bonus is that you are using a battle-tested stack (Postgres, Elasticsearch) vs the new kids (Pinecone, Milvus, ...).
The Cassandra project recently[1] added vector search. Relative to a lot of other features, it was fairly simple. A new 'vector' type and an extension of the existing indexing system using the Lucene HNSW library. Now we'll be finding ways to optimize and improve performance with better algorithms and query schemes.
What we won't be doing is figuring out how to scale to petabytes of data distributed across multiple data centers in a massive active-active cluster. We've spent the last 14 years perfecting that, and still have work to do. With the benefit of hindsight, if you have a database that is less than 10 years old, all I have to say is good luck. You have some challenging days ahead.

1. https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30...
If you can do it in GPU memory, do it in GPU memory.
If it takes quantization + buying a 1 TB RAM server ($4k of RAM + parts), do that in memory with the raw tensors and shed a small tear -- both for cost and the joy of the pain that you are saving yourself, your team, and everyone around you.
If you need more, then tread lightly and extremely carefully. Very few mega LLM pretraining datasets are even on this order of magnitude, though some are a few terabytes IIRC. If you are exceeding this, then your business use case is likely specialized indeed.
This message brought to you by the "cost reduction by not adding dumb complexity" group. I try to maintain a reputation for aggressively fighting unnecessary complexity, which is the true cost measure IMO of any system.
Is there a better article explaining how vector DBs are used with LLMs exactly? This 4 point description raises more questions than it answers for me.
For example what is the difference between a "chunk vector" and a "prompt vector"? Aren't they essentially the same thing (a vector representation of text)?
How do we "search the prompt vector for the most similar chunk vector"? A short code snippet is shown that queries the DB, but it's not shown what is done with what comes back. What format is the output in?
I suspect this works by essentially replacing chunks of input text by shorter chunks of "roughly equivalent" text found in the vector DB and sending that as the prompt instead, but based on this description I can't be sure.
The article is referring to the problem of having a limited context length in LLMs. That is, you can only pass X tokens in the prompt.
For example, let’s say you have a prompt that lets you answer questions about a book. If the book is long enough, you won’t be able to include it as is in the prompt, so you have to figure out what are the most relevant passages you must include to answer a given question. What you usually do is find the passages that are the most semantically similar to your question.
Chunk vectors are the vectorized passages of the book (i.e., a numerical vector that represents a passage), and the prompt vector is usually the vectorized question.
To find the most similar vectors you need a distance measure, cosine similarity being the most popular.
The output of finding the most similar vectors is the vectors plus their metadata (chunk, page, chapter, etc.).
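Here's a hedged sketch of that lookup with pgvector (the chunks schema is invented for illustration). The application embeds the question, passes the resulting vector in as a parameter, and gets back ordinary rows, chunk text plus metadata, to paste into the prompt ahead of the original question:

    CREATE TABLE chunks (
        id        bigserial PRIMARY KEY,
        chapter   int,
        page      int,
        body      text,       -- the passage itself
        embedding vector(3)   -- toy dimension; real models use far more
    );

    -- '<=>' is pgvector's cosine distance (1 - cosine similarity).
    -- $1 is the vectorized question, computed in application code.
    SELECT body, chapter, page, embedding <=> $1 AS distance
    FROM chunks
    ORDER BY embedding <=> $1
    LIMIT 5;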
Thank you for explaining it. I was aware of the problem of too short context length. I'd love to be able to pass an entire programming project with my prompt for example (or a book in your example).
I think I understand how it works now for many kinds of prompts, where the information to be extracted is contained in one (or more) of the chunks of a much bigger whole.
I'm not sure about prompts where it is required to "understand" the entire input to answer properly, for example summarising a book. Although even there, vector search could perhaps help by looking for things near certain hand-crafted values such as "important to the plot", rather than near "please provide a summary".
I guess I need to do some experimenting with it. I found some open source alternatives to (not at all) OpenAI in the form of "Sentence Transformers" to create embeddings.
However, what would be really neat is to have a large open source dataset of embeddings already created on some general purpose collection of texts, to try searches etc.
> Search the prompt vector to find the most similar chunk vector.
Does this mean that a question is changed to a similar question?
Doesn't it decrease the quality of answers significantly?
If someone asks "who is the best student in California?" will the question be changed to "what is the best school in California?".
This would explain the terrible drops in quality of answers that we saw. The underlying technology is changed for easier scaling, but is much worse.
It's like the current Google (which has multiple problems), where it also sometimes doesn't search your keywords - even when they are in quotation marks - because it knows "better"...
Without exception, LLM frontends out there that use vector search simply do a query and hope for the best. The reality is that the answer may not be in the result. It is a hard problem to solve. We are at least 1-2 stepping stones away from having a solution that actually works!
You can do a lot with hybrid approaches, where you mix and match semantic search results with classic text based search. In addition, depending on what kind of knowledge you pull in, the LLM pretraining can often fill the gaps pretty well even if the retrieval augmentation isn't ideal. But yes, it's still pretty fuzzy, though it works more often than not.
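One way to sketch that hybrid inside Postgres itself, reusing the hypothetical chunks table above plus an assumed tsvector column named fts: take candidates from either retrieval path and re-rank them with a blended score (the 0.5 weights and 0.3 cutoff are pure illustration):

    -- $1: the user's question as text; $2: its embedding.
    SELECT id, body,
           0.5 * ts_rank(fts, plainto_tsquery('english', $1))
         + 0.5 * (1 - (embedding <=> $2)) AS score
    FROM chunks
    WHERE fts @@ plainto_tsquery('english', $1)   -- classic full-text match
       OR embedding <=> $2 < 0.3                  -- or semantically close
    ORDER BY score DESC
    LIMIT 5;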
>Does this mean that a question is changed to a similar question?
No. It is retrieving the most semantically similar document(s) to work with. For a question about students, information about students will be more semantically similar than information about schools.
>It's like the current google
Google has used vector search for years. It's why you can just type questions in and Google will understand what you are talking about.
>doesnt search your keywords - even when they are in quotation marks
Quotation marks work. If there are no results it will show results without the quotes. The mobile site doesn't make this clear when it happens though.