Introducing the pg_analytica extension for Postgres.
Tables can be exported to columnar format via this extension to massively speed up your analytics queries. Exported data is periodically refreshed to keep your query results fresh.
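For a sense of the intended workflow, here is a minimal usage sketch. The function pg_analytica.register_table and the exported relation orders_columnar are placeholders for illustration only; the extension's real API may differ.

    -- Hypothetical usage sketch; actual function names and signatures may differ.
    CREATE EXTENSION pg_analytica;

    -- Register a table for periodic columnar export (placeholder API).
    SELECT pg_analytica.register_table('public.orders');

    -- Analytics then run against the periodically refreshed columnar copy
    -- ("orders_columnar" is a placeholder name for the exported relation).
    SELECT sum(price) - sum(tax) FROM orders_columnar;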
The slides from the Spark AI 2020 summit [1] helped me understand this a bit. If I understand it correctly, the premise is that a specific format is used to organise information into efficient blocks where related column data "lives" closer together, enabling faster read speeds but worse write speeds.
If someone has more resources on the topic, I'd be very interested. There are many applications where sacrificing data freshness for a considerable uptick in performance is alluring.
[1]: https://www.slideshare.net/slideshow/the-apache-spark-file-f...
> enabling faster read speeds but worse write speeds
Write speeds will probably decrease, because the row-oriented heap layout is already extremely optimized for writes.
But the goal is to speed up queries like "select sum(price) - sum(tax) from orders" at the cost of queries like "select * from orders where id = 1".
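To make the tradeoff concrete, here is a toy orders table (illustrative schema, not from the project) with the two query shapes from the comment above.

    -- Toy schema to make the two query shapes concrete.
    CREATE TABLE orders (
        id         bigserial PRIMARY KEY,
        customer   text,
        status     text,
        price      numeric(10,2),
        tax        numeric(10,2),
        ordered_at timestamptz DEFAULT now()
    );

    -- Scan-heavy aggregate: only touches price and tax, so a columnar layout
    -- can skip every other column entirely.
    SELECT sum(price) - sum(tax) FROM orders;

    -- Point lookup: wants one whole row, which the heap plus the primary-key
    -- btree index already serves in a handful of page reads.
    SELECT * FROM orders WHERE id = 1;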
I think the clickbait title might harm the reception here somewhat. This kind of solution is potentially interesting, but it's a solution with many tradeoffs. Keeping a copy of the data synced has a cost in space and in general resource usage. And of course the analytical queries will work on slightly delayed data, which might or might not be an issue for your use case.
The part I'd wonder about is how expensive this is after the first sync, especially for a table with many writes. And, related to that, how long the delay can get between your live data and the column-store version.
Hey Fabian, there are indeed many trade-offs in making this possible.
The extension periodically schedules export runs for a table, exporting all of the data each time (I know this is expensive, but I haven't yet found an alternative in Postgres, like a change pump that yields only the writes after time T).
The export frequency can be configured by the user.
Regarding how long the delay will be, that depends on the export frequency and the machine running the Postgres instance.
A fast machine could complete an entire table export every hour or so, resulting in data that is decently fresh.
Slower / more resource-constrained machines would have to schedule exports at a lower frequency.
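A rough sketch of that scheduling model, using pg_cron purely as a stand-in for the extension's own background worker; export_table_to_parquet and the file path are made-up placeholders, not pg_analytica's actual API.

    CREATE EXTENSION IF NOT EXISTS pg_cron;

    -- Re-export the whole table once per hour. Worst-case staleness is then
    -- roughly the export interval plus the time the export itself takes.
    SELECT cron.schedule(
        'orders-columnar-export',
        '0 * * * *',
        $$SELECT export_table_to_parquet('public.orders',
                                         '/var/lib/columnar/orders.parquet')$$
    );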
Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row group pruning, but doesn't vectorize the groupby / join operations above the table scan in the query tree (so you're still using the row-by-row PG query executor in the end).
https://github.com/paradedb/pg_analytics
pg_analytics uses DataFusion (a dedicated analytical query engine) to run the entire query, which can achieve orders-of-magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).
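For reference, attaching an exported Parquet file through parquet_fdw looks roughly like this; the file path and column list are illustrative.

    CREATE EXTENSION IF NOT EXISTS parquet_fdw;
    CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

    CREATE FOREIGN TABLE orders_columnar (
        id    bigint,
        price numeric,
        tax   numeric
    ) SERVER parquet_srv
      OPTIONS (filename '/var/lib/columnar/orders.parquet');

    -- Scans and predicates benefit from row-group pruning, but everything above
    -- the scan still runs in Postgres's regular row-by-row executor.
    SELECT sum(price) - sum(tax) FROM orders_columnar;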
My initial intent was to use DuckDB for fast vectorized query execution, but I wasn't able to create a planner / execution hook that uses DuckDB internally.
Will definitely check out pg_analytics / DataFusion to see if the same can be integrated here as well.
Thanks for the pointers.
I looked into pg_analytics and some other solutions like Citus before working on pg_analytica.
The key difference is that solutions like pg_analytics completely swap out the native Postgres row-based storage for columnar storage.
Not using the default Postgres storage engine means you miss out on a lot of battle-tested functionality like updating existing rows, deleting rows, transactional updates, etc.
Columnar stores are not well suited to transactions, updates, and deletes.
pg_analytica retains the existing Postgres storage and only exports a time-delayed version of the table in columnar format.
This way developers get the benefit of transactional storage and fast analytics queries.
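A small sketch of that split, reusing the hypothetical orders / orders_columnar names from above: OLTP statements keep working against the heap table, while analytics read the delayed columnar copy.

    BEGIN;
    UPDATE orders SET status = 'refunded' WHERE id = 42;   -- normal MVCC update
    DELETE FROM orders WHERE status = 'cancelled';         -- normal delete
    COMMIT;

    -- The columnar copy is untouched by the transaction above; it simply picks
    -- up the changes on the next scheduled export.
    SELECT sum(price) - sum(tax) FROM orders_columnar;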
Would be interesting to see if some open source project which uses Postgres and is available on GitHub could just get forked with this enabled, and then have a before/after comparison.
Some seed data would be required, though.
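A quick sketch of what such a before/after run could look like in a psql session, seed data included; table names reuse the hypothetical ones from the earlier sketches, and the timings will of course depend on the machine.

    -- Seed ~10M rows of synthetic orders into the heap table.
    INSERT INTO orders (customer, status, price, tax, ordered_at)
    SELECT 'cust_' || g,
           'paid',
           (random() * 100)::numeric(10,2),
           (random() * 10)::numeric(10,2),
           now() - g * interval '1 second'
    FROM generate_series(1, 10000000) AS g;

    -- Time the same aggregate against both storage layouts.
    \timing on
    SELECT sum(price) - sum(tax) FROM orders;            -- row store
    SELECT sum(price) - sum(tax) FROM orders_columnar;   -- columnar export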