Readit News
AdamProut commented on Show HN: Pg_mooncake – Delta/Iceberg columnstore tables in Postgres   github.com/Mooncake-Labs/... · Posted by u/sunzhousz
AdamProut · 10 months ago
Very cool! pg_duckdb itself is missing fully integrated storage: it can query data stored externally (say, in S3) in Delta/Iceberg formats, but it can't write out data in those formats via transactional writes to PG tables (inserts/updates/deletes). pg_mooncake is one neat way of solving that problem. It lets you have a columnstore table in Postgres that supports both reads and writes as if it were any other PG table, with the storage format being an open format like Delta or Iceberg and the data persisted to blob storage (like most cloud DWs would do anyway).
AdamProut commented on Pg_parquet: An extension to connect Postgres and parquet   crunchydata.com/blog/pg_p... · Posted by u/craigkerstiens
linuxhansl · a year ago
Parquet itself is actually not that interesting. It should be able to read (and even write) Iceberg tables.

Also, how does it compare to pg_duckdb (which adds DuckDB execution to Postgres including reading parquet and Iceberg), or duck_fdw (which wraps a DuckDB database, which can be in memory and only pass-through Iceberg/Parquet tables)?

AdamProut · a year ago
Had a similar thought. Azure Postgres has something similar to pg_parquet (pg_azure_storage), but we're looking into replacing it with pg_duckdb, assuming the extension continues to mature.

It would be great if the Postgres community could get behind one good open-source extension for the various columnstore data use cases (querying data stored in an open columnstore format such as Delta or Iceberg being one of them). pg_duckdb seems to have the best chance at being the go-to extension for this.

AdamProut commented on A write-ahead log is not a universal part of durability   notes.eatonphil.com/2024-... · Posted by u/todsacerdoti
eatonphil · a year ago
Thanks Adam! I think torn writes would still be caught via checksums, no? Although that may be later than you'd wish.

I'm not confident, but from reading that page it seems that, for Postgres at least, if it did do checksums it might not need to count on page-level atomic writes?

AdamProut · a year ago
Checksums can detect a torn page, but not repair it. It's likely a good part of the database page is gone (i.e., an amount of data matching the disk / file system's atomic write unit size is probably missing). Torn page writes are a pretty common scenario too, so databases need to be able to fully recover from them, not just detect them and report a corruption (i.e., just pull the power plug on the machine during a heavy write workload and you're likely to get one; it doesn't require a cosmic ray to flip a bit :) ).
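To make the detect-vs-repair distinction concrete, here's a toy sketch (Python; the page/sector sizes and the 8-byte checksum are invented for illustration, not any real engine's on-disk format):

```python
import hashlib
import os

PAGE_SIZE = 8192    # a common database page size
SECTOR_SIZE = 4096  # pretend this is the disk's atomic write unit

def write_page(data: bytes) -> bytes:
    """Prepend a checksum of the payload, like a page header would."""
    return hashlib.sha256(data).digest()[:8] + data

def verify_page(page: bytes) -> bool:
    """Detect corruption by recomputing the checksum. Detection only:
    nothing here can reconstruct the bytes that never hit the disk."""
    digest, data = page[:8], page[8:]
    return hashlib.sha256(data).digest()[:8] == digest

page = write_page(os.urandom(PAGE_SIZE))
assert verify_page(page)

# Simulate a torn write: only the first sector landed before power loss,
# so the tail of the page is stale garbage.
torn = page[:SECTOR_SIZE] + os.urandom(len(page) - SECTOR_SIZE)
assert not verify_page(torn)  # caught -- but the missing half is simply gone
```

The second assertion is the whole point: the checksum tells you the page is bad, but repairing it requires a second full copy of the page from somewhere else.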
AdamProut commented on A write-ahead log is not a universal part of durability   notes.eatonphil.com/2024-... · Posted by u/todsacerdoti
AdamProut · a year ago
Maybe good to mention torn pages somewhere too? Both MySQL and Postgres jump through some hoops to both detect and repair them [1][2]. So even in the scenario in the post where fsync is used to harden writes, the database still needs to handle torn pages (or it requires a file system / storage device that guarantees atomic page writes at the page size the database uses, as several managed/cloud databases do).

[1] https://wiki.postgresql.org/wiki/Full_page_writes [2] https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-b...
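As a rough sketch of the repair side (an in-memory stand-in for the idea behind InnoDB's doublewrite buffer or Postgres's full-page images; all names and structures here are invented for illustration):

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:8]

class TornSafeStore:
    """Toy doublewrite scheme: every page image is written to a scratch
    area first, so a torn in-place write can be repaired after a crash."""

    def __init__(self):
        self.pages = {}        # page_id -> (checksum, data): the "data file"
        self.doublewrite = {}  # page_id -> (checksum, data): scratch area

    def write(self, page_id: int, data: bytes) -> None:
        # Step 1: durably write the full page image to the scratch area.
        self.doublewrite[page_id] = (checksum(data), data)
        # Step 2: write the page in place (this is where a tear can happen).
        self.pages[page_id] = (checksum(data), data)

    def simulate_tear(self, page_id: int) -> None:
        # Crash mid-step-2: the second half of the in-place copy is garbage.
        c, data = self.pages[page_id]
        half = len(data) // 2
        self.pages[page_id] = (c, data[:half] + b"\x00" * (len(data) - half))

    def recover(self) -> None:
        # On restart, any page failing its checksum is restored from scratch.
        for page_id, (c, data) in self.pages.items():
            if checksum(data) != c:
                self.pages[page_id] = self.doublewrite[page_id]

store = TornSafeStore()
store.write(1, b"row data" * 1000)
store.simulate_tear(1)
store.recover()
assert store.pages[1][1] == b"row data" * 1000  # repaired, not just detected
```

The cost is writing every page twice, which is exactly the overhead the doublewrite buffer / full_page_writes machinery pays to make torn writes recoverable.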

AdamProut commented on Headline Driven Development   spakhm.com/headline-devel... · Posted by u/MrBuddyCasino
buro9 · a year ago
Well this is, at least, better than Gartner-Driven Development, which has been observed at several startups: a startup begins to care about its placement on the Magic Quadrant and starts adding very poor imitations of features merely to get a check on a form and move the company more toward the top-right "Leaders" category.
AdamProut · a year ago
I'm not sure why you're getting downvoted. I think the pressure to appease Gartner usually starts when companies bring in CEOs whose primary background is enterprise sales. They tend to overvalue Magic Quadrant positioning (in my view).
AdamProut commented on Does MongoDB Support ACID Transactions?   medium.com/@adamprout/whi... · Posted by u/AdamProut
AdamProut · 2 years ago
Based on the writings of two Turing Award winners recognized for their work on databases: no, MongoDB doesn't have ACID transactions.
AdamProut commented on PID Controllers in Unity3D (2021)   vazgriz.com/621/pid-contr... · Posted by u/signa11
AdamProut · 2 years ago
Control theory is also used by databases (probably not as often as it should be). It's great for "self tuning" [1], for example tuning the various cache sizes a database has to maximize throughput under changing workload conditions. It's definitely worth spending the time to understand PID controllers if you're an engineer working on databases.

[1] https://www.vldb.org/conf/2006/p1081-storm.pdf
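For a flavor of what that looks like, here's a toy PID loop (Python) that grows a cache toward a target hit rate. The hit-rate model, gains, and sizes are all invented for illustration and aren't from the paper:

```python
class PID:
    """Textbook PID controller (the derivative gain is left at 0 below,
    so this toy effectively runs as a PI controller)."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement: float, dt: float = 1.0) -> float:
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def hit_rate(cache_mb: float) -> float:
    """Crude stand-in for a workload: hit rate rises with cache size
    and saturates toward 1.0."""
    return cache_mb / (cache_mb + 512)

pid = PID(kp=2000, ki=10, kd=0, setpoint=0.90)  # aim for a 90% hit rate
cache_mb = 256.0
for _ in range(60):
    # Each tick: measure the hit rate, let the controller nudge the size.
    cache_mb = max(1.0, cache_mb + pid.update(hit_rate(cache_mb)))

assert abs(hit_rate(cache_mb) - 0.90) < 0.05  # settled near the target
```

The appeal for a database is that nobody has to know the "right" cache size up front; the controller chases the target metric as the workload shifts, which is the self-tuning idea in [1].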

AdamProut commented on An overview of distributed Postgres architectures   crunchydata.com/blog/an-o... · Posted by u/eatonphil
franckpachot · 2 years ago
The "disappointing" benchmark mentioned in the article is a shame for GigaOm who published it and for Microsoft who paid for it. They compare Citus with no HA to CockroachDB and YugabyteDB with replication factor 3 Multi-AZ, resilient to data center failure. And they run Citus on 16 cores (=32 vCPU) and the others on 16 vCPU. But your point about "beefy machine" shows the real advantages of Distributed SQL. PostgreSQL and Citus needs downtime to save cost if you don't need that beefy machine all days all year. Scale up and down is downtime, as well as upgrades. Distributed SQL offers elasticity (no downtime to resize the cluster) and high availability (no downtime on failure or maintenance)
AdamProut · 2 years ago
RE: "Distributed SQL offers elasticity (no downtime resize"). I'm not sure this is as much of an advantage of distributed databases vs single host databases anymore. Some of the tech to move virtual machines between machines quickly (without dropping TCP connections) is pretty neat. Neon has a blog post about it here[1]. Aurora Serverless V2 does the same thing (but I can't find a detailed technical blog post talking about how it works). Your still limited by "one big host" but its no longer as big of a deal to scale your compute up/down within that limit.

[1] https://neon.tech/blog/scaling-serverless-postgres

AdamProut commented on Blaze: Fast query execution engine for Apache Spark   github.com/blaze-init/bla... · Posted by u/sbt567
whinvik · 2 years ago
Any comparisons with Databricks Spark? When we started experimenting with Spark, we initially used AWS EMR. But the same code was way faster on Databricks than it was on EMR, which resulted in us ditching EMR.
AdamProut · 2 years ago
Databricks has kept its Photon [1][2] query engine for Spark closed source thus far. Unless EMR has made equivalent changes to the Spark runtime it uses, Databricks should be much faster. Photon brings to Spark the standard vectorized execution techniques used in SQL data warehouses for many years.

[1] https://docs.databricks.com/en/clusters/photon.html [2] https://dl.acm.org/doi/10.1145/3514221.3526054
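A caricature of the row-at-a-time vs vectorized difference in Python (made-up schema and sizes; a real engine like Photon runs columnar batches through SIMD-friendly native loops, not Python lists):

```python
import random

# Toy query: SELECT sum(price) FROM sales WHERE qty > 10
sales = [(random.randint(1, 20), random.uniform(1.0, 100.0))
         for _ in range(100_000)]

def row_at_a_time(rows):
    """Classic iterator ("Volcano") style: per-row control flow."""
    total = 0.0
    for qty, price in rows:
        if qty > 10:
            total += price
    return total

def vectorized(rows, batch_size=1024):
    """Vectorized style: operators consume column slices in batches,
    amortizing per-row interpretation overhead across each batch."""
    total = 0.0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        qtys = [r[0] for r in batch]       # columnar slices of the batch
        prices = [r[1] for r in batch]
        selected = [q > 10 for q in qtys]  # selection vector for the filter
        total += sum(p for p, s in zip(prices, selected) if s)
    return total

# Same answer either way (up to float summation order).
assert abs(row_at_a_time(sales) - vectorized(sales)) < 1e-3
```

In pure Python both are interpreted, so the point here is only the shape of the computation: the vectorized version spends its time in tight per-column loops over a batch, which is what lets a compiled engine use SIMD and keep data cache-resident.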

u/AdamProut

Karma: 128 · Cake day: July 12, 2011
About
Database builder.

Ex-SinglestoreDB (formerly known as MemSQL) cofounder + CTO

Currently working on Postgres @ Azure

medium.com/@adamprout
