Alternative solutions (Lucene/Tantivy) are both designed around 'immutable segments' (indexing immutable files), so marrying them with a Postgres heap table would result in a worse solution.
First, both strategies (the one outlined by the Neon/ParadeDB article and the one used here) are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.ht....
Second - as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query"; it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"
`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
- Created composite b-tree indexes for each of the queries with boolean predicates
- Extracted all the text fields from the JSONB columns, then stored and indexed them as separate columns for the queries against JSONB
But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.
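For context, the "simple index definition" looks roughly like this (table and column names here are made up, and the exact syntax varies by pg_search version, so check the docs for yours):

```sql
-- One BM25 index covering multiple columns, including JSONB,
-- instead of per-query composite b-trees and duplicated text columns.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, metadata)
WITH (key_field = 'id');
```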
One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as downsides of ElasticSearch. What is pg_bm25's consistency model? It had previously been described as offering "weak consistency" [0], which I interpret to mean it has the same problems with transactions, MVCC, etc.?
Take multi-tenancy.
What if user 1 has many more documents than user 2, and uses "new orleans" a lot, but user 2 does not? Then user 2 does the search.
The db will first run FTS and then filter, so user 1 will bias the results of user 2. Perhaps enough for user 2 to discover which words are in user 1's corpus.
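To make the leak concrete: BM25's IDF term is computed over the whole shared index, so a heavy tenant shifts term weights for everyone. A toy sketch (numbers invented, standard BM25 IDF formula):

```python
import math

def idf(total_docs: int, doc_freq: int) -> float:
    # BM25 IDF (Robertson/Sparck Jones form): rare terms score higher.
    return math.log(1 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Shared index: tenant 1 has 1000 docs, 400 of them mention "orleans";
# tenant 2 has 10 docs, 1 of which mentions "orleans".
shared = idf(total_docs=1010, doc_freq=401)

# A hypothetical per-tenant index for tenant 2 alone.
isolated = idf(total_docs=10, doc_freq=1)

# Tenant 1's corpus makes the term look common, lowering its weight
# in tenant 2's rankings - an observable signal about tenant 1's data.
print(shared < isolated)  # True
```

Filtering by tenant after scoring removes the other tenant's *rows*, but not their influence on the corpus statistics.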
I assume that's DataFusion speed. What's the plan to improve upon it?
I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used a table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) they use the latest release of DataFusion.
The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.
In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.
Two questions:
- do you distribute query processing over multiple Postgres nodes?
- do you store the metadata in PG, instead of a traditional metastore?
1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get *very* far just by scaling up a single Postgres node.
2. The only information stored in Postgres is the options passed into the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).