Introducing the pg_analytica extension for Postgres.
Tables can be exported to columnar format via this extension to massively speed up your analytics queries. Exported data is periodically refreshed to keep your query results fresh.
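For a sense of the intended workflow, here is a minimal usage sketch. The function pg_analytica.register_table and the exported relation orders_columnar are placeholders for illustration only; the extension's real API may differ.

    -- Hypothetical usage sketch; actual function names and signatures may differ.
    CREATE EXTENSION pg_analytica;

    -- Register a table for periodic columnar export (placeholder API).
    SELECT pg_analytica.register_table('public.orders');

    -- Analytics then run against the periodically refreshed columnar copy
    -- ("orders_columnar" is a placeholder name for the exported relation).
    SELECT sum(price) - sum(tax) FROM orders_columnar;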
The slides from the Spark AI 2020 summit [1] helped me understand this a bit. If I understand it correctly, the premise is that a specific format is used to organise information into efficient blocks where related column data "lives" closer together, enabling faster read speeds but worse write speeds.
If someone has more resources on the topic, I'd be very interested. There are many applications where sacrificing data freshness for a considerable uptick in performance is alluring.
[1]: https://www.slideshare.net/slideshow/the-apache-spark-file-f...
> enabling faster read speeds but worse write speeds
Write speeds will probably decrease, because the row-oriented heap layout is already extremely optimized for writes.
But the goal is to speed up queries like "select sum(price) - sum(tax) from orders" at the cost of queries like "select * from orders where id = 1".
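To make the tradeoff concrete, here is a toy orders table (illustrative schema, not from the project) with the two query shapes from the comment above.

    -- Toy schema to make the two query shapes concrete.
    CREATE TABLE orders (
        id         bigserial PRIMARY KEY,
        customer   text,
        status     text,
        price      numeric(10,2),
        tax        numeric(10,2),
        ordered_at timestamptz DEFAULT now()
    );

    -- Scan-heavy aggregate: only touches price and tax, so a columnar layout
    -- can skip every other column entirely.
    SELECT sum(price) - sum(tax) FROM orders;

    -- Point lookup: wants one whole row, which the heap plus the primary-key
    -- btree index already serves in a handful of page reads.
    SELECT * FROM orders WHERE id = 1;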
I think the clickbait title might harm the reception here somewhat. This kind of solution is potentially interesting, but it's a solution with many tradeoffs. Keeping a copy of the data synced has a cost in space and in general resource usage. And of course the analytical queries will work on slightly delayed data, which might or might not be an issue for your use case.
The part I'd wonder about is how expensive this is after the first sync, especially for a table with many writes. And, related to that, how long the delay can get between your live data and the column-store version.
Hey Fabian, there are indeed many trade-offs in making this possible.
The extension periodically schedules export runs for a table, exporting all of the data each time (I know this is expensive, but I haven't yet found an alternative in Postgres, like a change pump that yields only the writes after time T).
The export frequency can be configured by the user.
Regarding how long the delay will be, that depends on the export frequency and the machine running the Postgres instance.
A fast machine could complete an entire table export every hour or so, resulting in data that is decently fresh.
Slower / more resource-constrained machines would have to schedule exports at a lower frequency.
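A rough sketch of that scheduling model, using pg_cron purely as a stand-in for the extension's own background worker; export_table_to_parquet and the file path are made-up placeholders, not pg_analytica's actual API.

    CREATE EXTENSION IF NOT EXISTS pg_cron;

    -- Re-export the whole table once per hour. Worst-case staleness is then
    -- roughly the export interval plus the time the export itself takes.
    SELECT cron.schedule(
        'orders-columnar-export',
        '0 * * * *',
        $$SELECT export_table_to_parquet('public.orders',
                                         '/var/lib/columnar/orders.parquet')$$
    );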
Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row group pruning, but doesn't vectorize the groupby / join operations above the table scan in the query tree (so you're still using the row-by-row PG query executor in the end).
https://github.com/paradedb/pg_analytics
pg_analytics uses DataFusion (a dedicated analytical query engine) to run the entire query, which can achieve orders-of-magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).
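For reference, attaching an exported Parquet file through parquet_fdw looks roughly like this; the file path and column list are illustrative.

    CREATE EXTENSION IF NOT EXISTS parquet_fdw;
    CREATE SERVER parquet_srv FOREIGN DATA WRAPPER parquet_fdw;

    CREATE FOREIGN TABLE orders_columnar (
        id    bigint,
        price numeric,
        tax   numeric
    ) SERVER parquet_srv
      OPTIONS (filename '/var/lib/columnar/orders.parquet');

    -- Scans and predicates benefit from row-group pruning, but everything above
    -- the scan still runs in Postgres's regular row-by-row executor.
    SELECT sum(price) - sum(tax) FROM orders_columnar;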
My initial intent was to use DuckDB for fast vectorized query execution, but I wasn't able to create a planner / execution hook that uses DuckDB internally.
Will definitely check out pg_analytics / DataFusion to see if the same can be integrated here as well.
Thanks for the pointers.
I looked into pg_analytics and some other solutions like Citus before working on pg_analytica.
The key difference is that solutions like pg_analytics completely swap out the native Postgres row-based storage for columnar storage.
Not using the default Postgres storage engine means you miss out on a lot of battle-tested functionality like updating existing rows, deleting rows, transactional updates, etc.
Columnar stores are not well suited to transactions, updates, and deletes.
pg_analytica retains the existing Postgres storage and only exports a time-delayed version of the table in columnar format.
This way developers get the benefit of transactional storage and fast analytics queries.
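A small sketch of that split, reusing the hypothetical orders / orders_columnar names from above: OLTP statements keep working against the heap table, while analytics read the delayed columnar copy.

    BEGIN;
    UPDATE orders SET status = 'refunded' WHERE id = 42;   -- normal MVCC update
    DELETE FROM orders WHERE status = 'cancelled';         -- normal delete
    COMMIT;

    -- The columnar copy is untouched by the transaction above; it simply picks
    -- up the changes on the next scheduled export.
    SELECT sum(price) - sum(tax) FROM orders_columnar;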
Would be interesting to see if some open source project which uses Postgres and is available on GitHub could just get forked with this enabled, and then have a before/after comparison.
Some seed data would be required, though.
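A quick sketch of what such a before/after run could look like in a psql session, seed data included; table names reuse the hypothetical ones from the earlier sketches, and the timings will of course depend on the machine.

    -- Seed ~10M rows of synthetic orders into the heap table.
    INSERT INTO orders (customer, status, price, tax, ordered_at)
    SELECT 'cust_' || g,
           'paid',
           (random() * 100)::numeric(10,2),
           (random() * 10)::numeric(10,2),
           now() - g * interval '1 second'
    FROM generate_series(1, 10000000) AS g;

    -- Time the same aggregate against both storage layouts.
    \timing on
    SELECT sum(price) - sum(tax) FROM orders;            -- row store
    SELECT sum(price) - sum(tax) FROM orders_columnar;   -- columnar export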