Alternative solutions (Lucene/Tantivy) are both designed around 'immutable segments' (indexing immutable files), so marrying them with a Postgres heap table would result in a worse solution.
First, both strategies (the one outlined by the Neon/ParadeDB article and the one used here) are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.ht....
Second - as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query"; it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"
`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
- Created composite b-tree indexes for each of the queries with boolean predicates
- Extracted all the text fields from the JSONB columns, then stored and indexed them as separate columns for the queries against JSONB
But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.
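For context, the "simple index definition" looks roughly like this (table and column names here are made up, and the exact syntax varies by pg_search version, so check the docs for yours):

```sql
-- One BM25 index covering multiple columns, including JSONB,
-- instead of per-query composite b-trees and duplicated text columns.
CREATE INDEX search_idx ON mock_items
USING bm25 (id, description, category, metadata)
WITH (key_field = 'id');
```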
One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as downsides of ElasticSearch. What is pg_bm25's consistency model? It had previously been described as offering "weak consistency" [0], which I interpret to mean it has the same problems with transactions, MVCC, etc.?
Take multi-tenancy.
What if user 1 has many more documents than user 2, and uses "new orleans" a lot, but user 2 does not? Then user 2 does the search.
The db will first run FTS and then filter, so user 1 will bias the results of user 2. Perhaps enough for user 2 to discover which words are in user 1's corpus.
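To make the leak concrete: BM25's IDF term is computed over the whole shared index, so a heavy tenant shifts term weights for everyone. A toy sketch (numbers invented, standard BM25 IDF formula):

```python
import math

def idf(total_docs: int, doc_freq: int) -> float:
    # BM25 IDF (Robertson/Sparck Jones form): rare terms score higher.
    return math.log(1 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Shared index: tenant 1 has 1000 docs, 400 of them mention "orleans";
# tenant 2 has 10 docs, 1 of which mentions "orleans".
shared = idf(total_docs=1010, doc_freq=401)

# A hypothetical per-tenant index for tenant 2 alone.
isolated = idf(total_docs=10, doc_freq=1)

# Tenant 1's corpus makes the term look common, lowering its weight
# in tenant 2's rankings - an observable signal about tenant 1's data.
print(shared < isolated)  # True
```

Filtering by tenant after scoring removes the other tenant's *rows*, but not their influence on the corpus statistics.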
I assume that's DataFusion speed. What's the plan to improve upon it?
I should clarify that our published ClickBench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because (1) there's no overhead from Postgres transactions/MVCC, since pg_analytics used a table access method whereas pg_lakehouse is just a foreign data wrapper, and (2) they use the latest release of DataFusion.
The performance differences that exist between DataFusion and other OLAP engines are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.
In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.
Two questions:
- do you distribute query processing over multiple Postgres nodes?
- do you store the metadata in PG, instead of a traditional metastore?
1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get *very* far just by scaling up a single Postgres node.
2. The only information stored in Postgres is the options passed into the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).