Postgres's dominance is well deserved, of course. My only concerns with it (both being actively worked on) are bloat management (significant for update-heavy workloads and for programmers used to the MySQL model of rollback segments) and the scaling of concurrency (going over 500 connections). The bloat work was taken over by Cybertec [1] after stalling for a bit and is funded (yay), while concurrency was also enhanced out of Microsoft [2]. All in all, an excellent future for our beloved Postgres.
In Postgres land, I think most businesses work around the lack of temporal tables with audit tables, using triggers to dump jsonb or hstore. I wrote up how I used table inheritance for this here [1].
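A minimal sketch of that trigger pattern, assuming a hypothetical accounts table (all names here are illustrative, not from the original write-up):

```sql
-- Audit table capturing each change as a jsonb row image
CREATE TABLE accounts_audit (
    audit_id   bigserial   PRIMARY KEY,
    op         text        NOT NULL,                -- 'INSERT' | 'UPDATE' | 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now(),
    row_data   jsonb       NOT NULL                 -- full row image
);

CREATE OR REPLACE FUNCTION audit_accounts() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO accounts_audit (op, row_data) VALUES (TG_OP, to_jsonb(OLD));
        RETURN OLD;
    END IF;
    INSERT INTO accounts_audit (op, row_data) VALUES (TG_OP, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_audit_trg
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION audit_accounts();
```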
I agree with your point. Postgres is starting to stick out compared to alternatives:
- MS SQL supports uni-temporal tables using system time.
- Snowflake has time travel which acts like temporal tables but with a limited retention window. Seems more like a restore mechanism.
- MariaDB has system-versioned tables (doesn't look like it's in MySQL).
- CockroachDB has uni-temporal support with system time, but limited to the garbage-collection period. The docs indicate you don't want a long garbage-collection period, since all versions are stored in a single range.
- Oracle seems to have the best temporal support with their flashback tech. But it's hard to read between the lines to figure out what it actually does.
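For comparison, this is roughly what the first-class feature looks like in SQL Server's system-versioned temporal tables (table and column names here are illustrative):

```sql
CREATE TABLE dbo.Employee (
    EmployeeID   int PRIMARY KEY,
    Name         nvarchar(100) NOT NULL,
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));

-- Query the table as of a past point in time:
SELECT * FROM dbo.Employee FOR SYSTEM_TIME AS OF '2021-01-01T00:00:00';
```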
I've never seen a business actually use them, large or small. Any auditing requirements are usually fed from other sources, like Kafka event streams, files on S3, or an OLAP data warehouse.
Although temporal tables are a really good idea, it is possible to get away without them as a first-class feature. They aren't hard to mimic if you can give up the guarantee of catching every detail. In an ideal world (ha ha, silly thought) the tables would be designed to be append-only anyway, or the amount of data would be insignificant, both of which make temporal tables somewhat moot.
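A rough sketch of the mimicking approach: a valid_from/valid_to pair maintained by the application (hypothetical table; this catches less than a real temporal table, e.g. the update and insert need to share a transaction to stay consistent under concurrency):

```sql
CREATE TABLE prices (
    product_id int         NOT NULL,
    price      numeric     NOT NULL,
    valid_from timestamptz NOT NULL DEFAULT now(),
    valid_to   timestamptz NOT NULL DEFAULT 'infinity'
);

-- "Update" by closing the current row and appending a new one:
UPDATE prices SET valid_to = now()
WHERE product_id = 42 AND valid_to = 'infinity';
INSERT INTO prices (product_id, price) VALUES (42, 9.99);

-- Point-in-time read:
SELECT price FROM prices
WHERE product_id = 42
  AND valid_from <= '2021-06-01' AND valid_to > '2021-06-01';
```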
I've very rarely found that using a full temporal table is the right choice for online analysis; a dedicated schema serves you better in the long run and helps you design your indexes, etc. appropriately. For compliance, PIT backups via WAL shipping should suffice, no?
I'm really excited by all the database love in the last few years. I moved to PG from MySQL in 2014 and haven't regretted it since.
TimescaleDB looks very exciting, as it's "just" a PG extension, and their compression work looks great. [0]
I'm also really loving ClickHouse, but I haven't deployed it to production yet (haven't had the need; I almost did for an Apache Arrow reading project, but didn't end up using Arrow). They do some amazing things there, and the work they do is crazy impressive and fast. Reading their changelog, they power through things.
So a company that sells PostgreSQL services thinks PostgreSQL is dominating. Brilliant.
The reality is that nothing is dominating. In 2021 there were more databases than ever, each addressing a different use case. Companies don't have just one EDW; they'll have dozens, even hundreds, of siloed data stores. Startups will start with one for everything, then split out auth, user analytics, telemetry, etc.
There is no evidence of any consolidation in the market, and definitely not some mass trend towards PostgreSQL.
1. OtterTune doesn't sell PostgreSQL services; they sell a database-optimization service that happens to support PostgreSQL (and other databases like MySQL).
2. PostgreSQL is definitely gaining market share, and fast; see the DB-Engines graph [1]. You can compare it to the Oracle trend if you are not convinced [2].
You can't just compare graphs like that without factoring in the cloud. PostgreSQL is a first-class, cloud-managed, supported database in the top three cloud providers, whereas Oracle is not. That's a massive impediment to adoption and in no way a reflection of the database itself.
Either way, there's nothing to suggest that PostgreSQL is in any way dominating.
All you need is Postgres (OLTP) and if you have large datasets where Postgres falls behind for analytical work, then you reach for Clickhouse (OLAP) for those features (while Postgres remains your primary operational database and source of truth).
Agreed. I have a good bit of experience in SaaS and analytics and that's exactly what I landed on for building Luabase[0]. Postgres (specifically Supabase) for the app database, Clickhouse to run the analytics (which is the product).
It's weird to put Postgres in the same bucket as Elasticsearch, as they are often used for different things.
No matter how much you tune / denormalize Postgres, you'll never get the free-text search performance Elasticsearch offers. Our best efforts on a 5-million-row table yielded 600 ms query times vs. 30-60 ms.
Similarly with Snowflake: you'd never expect Postgres to perform analytical queries at that scale.
I know graph databases and time-series DBs have similar performance tradeoffs.
I think the most interesting and challenging area is how to architect a system that uses many of these databases and keeps them eventually consistent within some bound.
The author is talking about a different class of RDBMS. I believe his intention was not to compare PostgreSQL to Elasticsearch or ClickHouse, which solve completely different problems.
But for small to medium datasets his advice to just stick to PostgreSQL is good: start with an easy solution that will give you everything you need (often by simply installing a plugin). If you need more specialized software, THEN use it, but don't start with an overcomplicated stack just because Elasticsearch and ClickHouse may each be the state-of-the-art open-source solution to a specific problem.
Have you tried GIN trigram indexes (https://www.postgresql.org/docs/14/pgtrgm.html, e.g. CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);) and GIN full-text search indexes (e.g. CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);)? As far as I know, after applying those indexes to full-text search columns you can search about as fast as in Elastic, because the indexes are built the same way as in Elastic.
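For reference, these are the kinds of queries those two index types accelerate (table and column names are the ones used in the Postgres docs examples):

```sql
-- pg_trgm GIN index: speeds up LIKE/ILIKE and similarity searches
SELECT t FROM test_trgm WHERE t ILIKE '%needle%';
SELECT t FROM test_trgm WHERE t % 'needel';   -- fuzzy trigram match

-- Full-text GIN index: speeds up tsvector @@ tsquery matches
SELECT title FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('english', 'postgres & search');
```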
What are the distributed options for Postgres? What mechanisms are available to make it highly available i.e. with a distributed consensus protocol for strict serializability when failing over the primary? How do people typically deploy Postgres as a cluster?
1. Async replication tolerating data loss from slightly stale backup after a failover?
2. Sync replication tolerating downtime during manual failover?
3. Distributed consensus protocol for automated failover, high availability and no data loss, e.g. Viewstamped Replication, Paxos or Raft?
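For what it's worth, stock Postgres gives you option 2 out of the box via synchronous streaming replication; a minimal configuration sketch (standby names here are illustrative):

```
# postgresql.conf on the primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'  # wait for one of two standbys
synchronous_commit = on   # COMMIT returns only after a sync standby confirms
```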
It seems like most managed service versions of databases such as Aurora, Timescale etc. are all doing option 3, but the open-source alternatives otherwise are still options 1 and 2?
I think you'd still need to change the core of the database to avoid stale reads when an old primary and client are partitioned away from the new primary, or force all client communication through a proxy smart enough to contact a quorum of replicas to ensure the current primary is still the primary during transaction begin and commit.
I was assuming in both cases of manual failover that the operator would have to have some way of physically shutting down the old primary, then starting it again only as a backup that doesn't reply to clients. Alternatively, the cluster would need to remain unavailable if any node is partitioned.
But none of this is really very practical when compared to a consensus protocol (or R/W quorums) and distributed database. I'm genuinely curious how people solve this with something like Postgres. Or is it perhaps something that isn't much worried about?
Databases are the best all-around scratch-every-CS-geek-itch domain there is, with the possible exception of operating systems.
[1] https://github.com/cybertec-postgresql/zheap [2] https://techcommunity.microsoft.com/t5/azure-database-for-po...
[1]: https://news.ycombinator.com/item?id=29010446
https://www.cybertec-postgresql.com/en/zheap-undo-logs-disca...
https://github.com/cybertec-postgresql/postgres/tree/zheap_u...
Not completely related, but his lectures on databases on YouTube are really good. Much better than the DB class I had at college.
A great way to learn more about the inner workings of databases, and entertaining too.
Another choice quote (from one of his lectures):
“There’s only two things I care about in life:
1. My wife
2. Databases
I don’t give a f#ck about anything else”
[0] https://docs.timescale.com/timescaledb/latest/how-to-guides/...
[1] https://db-engines.com/en/ranking_trend/system/PostgreSQL
[2] https://db-engines.com/en/ranking_trend/system/Oracle
Answer: https://db-engines.com/en/ranking_definition
An ML program that automatically tunes your production database in real time. What could possibly go wrong?
0 - https://luabase.com/
ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. https://github.com/zombodb/zombodb#readme
Well, unless each node has a complete copy of the data?
The critical importance of extensibility as a primary concern of successful DB products needs to be highlighted. The realities of the domain dictate that product X matures a few years after inception, by which point application patterns may have shifted. (Remember map-reduce?) If you pay attention, for example, you'll note that the darlings du jour are scrambling to claim fitness for ML (a subset of big data), and the newcomers are claiming to be "designed for ML".
Smart VC money should be on the extensible players.