Readit News logoReadit News

Deleted Comment

retakeming commented on We made Postgres writes faster, but it broke replication   paradedb.com/blog/lsm_tre... · Posted by u/philippemnoel
strbean · a month ago
Total tangent, but I think "Canola is a neutral oil" is a lie. It's got the most distinctive (and in my opinion, bad) flavor of the common cooking oils.
retakeming · a month ago
What would you say is the most neutral oil then?
retakeming commented on PostgreSQL Full-Text Search: Fast When Done Right (Debunking the Slow Myth)   blog.vectorchord.ai/postg... · Posted by u/VoVAllen
zhousun · 5 months ago
Glab to see more 'postgres-native' full-text search implementation.

Alternative solutions (lucene/ tantivy) are both designed for 'immutable segments' (indexing immutable files), so marrying them with postgres heap table would results in a worse solution.

retakeming · 5 months ago
The segments themselves being immutable doesn't mean that Tantivy is incompatible with Postgres - it just means that Tantivy needs to be made compatible with Postgres' concurrency control mechanisms (MVCC) and storage format (block storage). This blog post explains the latter: https://www.paradedb.com/blog/block_storage_part_one
retakeming commented on PostgreSQL Full-Text Search: Fast When Done Right (Debunking the Slow Myth)   blog.vectorchord.ai/postg... · Posted by u/VoVAllen
retakeming · 5 months ago
I'm one of the pg_search maintainers. Hello! A few thoughts.

First, both strategies - the one outlined by the Neon/ParadeDB article, and the one used here -- are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.ht....

Second - as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query" it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"

`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:

- Created composite b-tree indexes for each of the queries with boolean predicates

- Extracted the all the text fields from JSONBs, stored and indexed them as a separate columns for queries against JSONB

But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.

retakeming commented on Full text search over Postgres: Elasticsearch vs. alternatives   blog.paradedb.com/pages/e... · Posted by u/philippemnoel
alanwli · a year ago
Always great to see Postgres-based alternatives.

One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the bad for ElasticSearch. What is pg_bm25's consistency model? It had been mentioned previously as offering "weak consistency" [0], which I interpret to have the same problems with transactions, MVCC, etc?

[0]: https://news.ycombinator.com/item?id=37864089

retakeming · a year ago
Good question. That was from a very old version of pg_bm25 (since renamed to pg_search). BM25 indexes are now strongly consistent.
retakeming commented on Full text search over Postgres: Elasticsearch vs. alternatives   blog.paradedb.com/pages/e... · Posted by u/philippemnoel
sroussey · a year ago
The problem with BM25 in a database, is that is can have unexpected outcomes for some common use cases.

Take multi-tenancy.

What if user 1 has many more documents than user 2, and uses "new orleans" a lot. But user 2 does not. User 2 does the search.

The db will first use FTS, and then filter. So user 1 will bias the results of user 2. Perhaps enough for user 2 to discover what words are in user 1 corpus.

retakeming · a year ago
If you're a Postgres pg_search user, partial indexes (https://docs.paradedb.com/search/full-text/index#partial-bm2...) can solve.
retakeming commented on Better RAG Results with Reciprocal Rank Fusion and Hybrid Search   assembled.com/blog/better... · Posted by u/johnjwang
retakeming · a year ago
pg_search (full text search Postgres extension) can be used with pgvector for hybrid search over Postgres tables. It comes with a helpful hybrid search function that uses relative score fusion. Whereas rank fusion considers just the order of the results, relative score fusion uses the actual metrics outputted by text/vector search.
retakeming commented on Pg_lakehouse: Query Any Data Lake from Postgres   github.com/paradedb/parad... · Posted by u/landingunless
nikita · a year ago
I have another question. So far on the clickbench leaderboard it's 15x slower than baseline. The number 1 place is 1.67 slower the baseline.

I assume that's DataFusion speed. What's the plan to improve upon it?

retakeming · a year ago
Could you clarify which result you're referring to as the baseline and "number 1 place?"

I should clarify that our published Clickbench results are from our pg_analytics extension. New results with pg_lakehouse will be released. They're going to beat the old benchmarks because 1. No overhead from Postgres transactions/MVCC, since pg_analytics used the table access method whereas pg_lakehouse is just a foreign data wrapper 2. Uses the latest release of DataFusion.

The performance differences that exist between DataFusion and other OLAP engine are rapidly becoming commoditized. DataFusion is already a world-class query engine and will only improve. pg_lakehouse absorbs all those improvements into Postgres.

retakeming commented on Pg_lakehouse: Query Any Data Lake from Postgres   github.com/paradedb/parad... · Posted by u/landingunless
mcdonje · a year ago
Looks like pg as a replacement for databricks sql, which is already a query engine for datalakes. It's not a lakehouse, but it calls itself one. Seems like a cool and useful project, but the name is problematic.
retakeming · a year ago
pg_house just wasn't as catchy!

In all seriousness though, I see your point. While it's true that we don't provide the storage or table format, our belief is that companies actually want to own the data in their S3. We called it pg_lakehouse because it's the missing glue for companies already using Postgres + S3 + Delta Lake/Iceberg to have a lakehouse without new infrastructure.

retakeming commented on Pg_lakehouse: Query Any Data Lake from Postgres   github.com/paradedb/parad... · Posted by u/landingunless
samber · a year ago
It seems very promising!

2 questions:

- do you distribute query processing over multiple pg nodes ?

- do you store the metadata in PG, instead of a traditional metastore?

retakeming · a year ago
Thanks!

1. It's single node, but DataFusion parallelizes query execution across multiple cores. We do have plans for a distributed architecture, but we've found that you can get ~very~ far just by scaling up a single Postgres node.

2. The only information stored in Postgres are the options passed into the foreign data wrapper and the schema of the foreign table (this is standard for all Postgres foreign data wrappers).

u/retakeming

KarmaCake day109March 16, 2023View Original