Postgres's dominance is well deserved, of course. My only concerns with it (both being actively worked on) are bloat management (significant for update-heavy workloads and for programmers used to the MySQL model of rollback segments) and the scaling of concurrency (going over 500 connections). The bloat work was taken over by Cybertec [1] after stalling for a bit and is funded (yay), while concurrency was also enhanced out of Microsoft [2]. All in all, an excellent future for our beloved Postgres.
In Postgres land, I think most businesses work around the lack of temporal tables with audit tables, using triggers to dump jsonb or hstore. I wrote up how I used table inheritance for this here [1].
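A minimal sketch of that trigger pattern, assuming a hypothetical accounts table (all names here are illustrative, not from the original write-up):

```sql
-- Audit table capturing each change as a jsonb row image
CREATE TABLE accounts_audit (
    audit_id   bigserial   PRIMARY KEY,
    op         text        NOT NULL,                -- 'INSERT' | 'UPDATE' | 'DELETE'
    changed_at timestamptz NOT NULL DEFAULT now(),
    row_data   jsonb       NOT NULL                 -- full row image
);

CREATE OR REPLACE FUNCTION audit_accounts() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO accounts_audit (op, row_data) VALUES (TG_OP, to_jsonb(OLD));
        RETURN OLD;
    END IF;
    INSERT INTO accounts_audit (op, row_data) VALUES (TG_OP, to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_audit_trg
AFTER INSERT OR UPDATE OR DELETE ON accounts
FOR EACH ROW EXECUTE FUNCTION audit_accounts();
```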
I agree with your point. Postgres is starting to stick out compared to alternatives:
- MS SQL supports uni-temporal tables using system time.
- Snowflake has time travel which acts like temporal tables but with a limited retention window. Seems more like a restore mechanism.
- MariaDB has system-versioned tables (doesn't look like it's in MySQL).
- CockroachDB has uni-temporal support with system time, but limited to the garbage-collection period. The docs indicate you don't want a long garbage-collection period, since all versions are stored in a single range.
- Oracle seems to have the best temporal support with their flashback tech. But it's hard to read between the lines to figure out what it actually does.
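For comparison, this is roughly what the first-class feature looks like in SQL Server's system-versioned temporal tables (table and column names here are illustrative):

```sql
CREATE TABLE dbo.Employee (
    EmployeeID   int PRIMARY KEY,
    Name         nvarchar(100) NOT NULL,
    SysStartTime datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));

-- Query the table as of a past point in time:
SELECT * FROM dbo.Employee FOR SYSTEM_TIME AS OF '2021-01-01T00:00:00';
```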
I've never seen a business actually use them, large or small. Any auditing requirements are usually fed from other sources, like Kafka event streams, files on S3, or an OLAP data warehouse.
Although temporal tables are a really good idea, it is possible to get away without them as a first-class feature. They aren't hard to mimic if you can give up the guarantee of catching every detail. In an ideal world (ha ha, silly thought) the tables would be designed to be append-only anyway, or the amount of data would be insignificant, both of which make temporal tables somewhat moot.
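A rough sketch of the mimicking approach: a valid_from/valid_to pair maintained by the application (hypothetical table; this catches less than a real temporal table, e.g. the update and insert need to share a transaction to stay consistent under concurrency):

```sql
CREATE TABLE prices (
    product_id int         NOT NULL,
    price      numeric     NOT NULL,
    valid_from timestamptz NOT NULL DEFAULT now(),
    valid_to   timestamptz NOT NULL DEFAULT 'infinity'
);

-- "Update" by closing the current row and appending a new one:
UPDATE prices SET valid_to = now()
WHERE product_id = 42 AND valid_to = 'infinity';
INSERT INTO prices (product_id, price) VALUES (42, 9.99);

-- Point-in-time read:
SELECT price FROM prices
WHERE product_id = 42
  AND valid_from <= '2021-06-01' AND valid_to > '2021-06-01';
```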
I've very rarely found that using a full temporal table is the right choice for online analysis; a dedicated schema serves you better in the long run and helps you design your indexes, etc. appropriately. For compliance, PIT backups via WAL shipping should suffice, no?
I'm really excited by all the database love in the last few years. I moved to PG from MySQL in 2014 and haven't regretted it since.
TimescaleDB looks very exciting, as it's "just" a PG extension, and their compression work looks great. [0]
I'm also really loving ClickHouse, but I haven't deployed it to production yet (haven't had the need; I almost did for an Apache Arrow reading project, but didn't end up using Arrow). They do some amazing things there, and the work they do is crazy impressive and fast. Reading their changelog, they power through things.
So a company that sells PostgreSQL services thinks PostgreSQL is dominating. Brilliant.
The reality is that nothing is dominating. In 2021 there were more databases than ever, each addressing a different use case. Companies don't have just one EDW; they'll have dozens, even hundreds, of siloed data stores. Startups will start with one for everything, then split out auth, user analytics, telemetry, etc.
There is no evidence of any consolidation in the market, and definitely not some mass trend towards PostgreSQL.
1. OtterTune doesn't sell PostgreSQL services; they sell a database-optimization service that happens to support PostgreSQL (and other databases like MySQL).
2. PostgreSQL is definitely gaining market share, and fast; see the DB-Engines graph [1]. You can compare it to the Oracle trend if you are not convinced [2].
You can't just compare graphs like that without factoring in the cloud. PostgreSQL is a first-class, cloud-managed, supported database in the top three cloud providers, whereas Oracle is not. That's a massive impediment to adoption and in no way a reflection of the database itself.
Either way, there's nothing to suggest that PostgreSQL is in any way dominating.
All you need is Postgres (OLTP) and if you have large datasets where Postgres falls behind for analytical work, then you reach for Clickhouse (OLAP) for those features (while Postgres remains your primary operational database and source of truth).
Agreed. I have a good bit of experience in SaaS and analytics and that's exactly what I landed on for building Luabase[0]. Postgres (specifically Supabase) for the app database, Clickhouse to run the analytics (which is the product).
It's weird to put Postgres in the same bucket as Elasticsearch, as they are often used for different things.
No matter how much you tune / denormalize Postgres, you'll never get the free-text search performance Elasticsearch offers. Our best efforts on a 5-million-row table yielded 600 ms query times vs. 30-60 ms.
Similarly with Snowflake: you'd never expect Postgres to perform analytical queries at that scale.
I know graph databases and time-series DBs have similar performance tradeoffs.
I think the most interesting and challenging area is how to architect a system that uses many of these databases and keeps them eventually consistent within some bound.
The author is talking about a different class of RDBMS. I believe his intention was not to compare PostgreSQL to Elasticsearch or ClickHouse, which solve completely different problems.
But for small to medium datasets his advice to just stick to PostgreSQL is good: start with an easy solution that will give you everything you need (often by simply installing a plugin). If you need more specialized software, THEN use it, but don't start with an overcomplicated stack just because Elasticsearch and ClickHouse may each be the state-of-the-art open-source solution to a specific problem.
Have you tried GIN trigram indexes (https://www.postgresql.org/docs/14/pgtrgm.html, e.g. CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);) and GIN full-text search indexes (e.g. CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);)? As far as I know, after applying those indexes to full-text search columns you can search about as fast as in Elastic, because the indexes are built the same way as in Elastic.
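For reference, these are the kinds of queries those two index types accelerate (table and column names are the ones used in the Postgres docs examples):

```sql
-- pg_trgm GIN index: speeds up LIKE/ILIKE and similarity searches
SELECT t FROM test_trgm WHERE t ILIKE '%needle%';
SELECT t FROM test_trgm WHERE t % 'needel';   -- fuzzy trigram match

-- Full-text GIN index: speeds up tsvector @@ tsquery matches
SELECT title FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('english', 'postgres & search');
```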
What are the distributed options for Postgres? What mechanisms are available to make it highly available i.e. with a distributed consensus protocol for strict serializability when failing over the primary? How do people typically deploy Postgres as a cluster?
1. Async replication tolerating data loss from slightly stale backup after a failover?
2. Sync replication tolerating downtime during manual failover?
3. Distributed consensus protocol for automated failover, high availability and no data loss, e.g. Viewstamped Replication, Paxos or Raft?
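For what it's worth, stock Postgres gives you option 2 out of the box via synchronous streaming replication; a minimal configuration sketch (standby names here are illustrative):

```
# postgresql.conf on the primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'  # wait for one of two standbys
synchronous_commit = on   # COMMIT returns only after a sync standby confirms
```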
It seems like most managed service versions of databases such as Aurora, Timescale etc. are all doing option 3, but the open-source alternatives otherwise are still options 1 and 2?
I think you'd still need to change the core of the database to avoid stale reads when an old primary and client are partitioned away from the new primary, or force all client communication through a proxy smart enough to contact a quorum of replicas to ensure the current primary is still the primary during transaction begin and commit.
I was assuming in both cases of manual failover that the operator would have to have some way of physically shutting down the old primary, then starting it again only as a backup that doesn't reply to clients. Alternatively, the cluster would need to remain unavailable if any node is partitioned.
But none of this is really very practical when compared to a consensus protocol (or R/W quorums) and distributed database. I'm genuinely curious how people solve this with something like Postgres. Or is it perhaps something that isn't much worried about?
Databases are the best all-around scratch-every-CS-geek-itch domain there is, with the possible exception of operating systems.
[1] https://github.com/cybertec-postgresql/zheap [2] https://techcommunity.microsoft.com/t5/azure-database-for-po...
[1]: https://news.ycombinator.com/item?id=29010446
https://www.cybertec-postgresql.com/en/zheap-undo-logs-disca...
https://github.com/cybertec-postgresql/postgres/tree/zheap_u...
Not completely related, but his lectures on databases on YouTube are really good. Much better than the DB class I had at college.
A great way to learn more about the inner workings of databases, and entertaining too.
Another choice quote (from one of his lectures):
“There’s only two things I care about in life:
1. My wife
2. Databases
I don’t give a f#ck about anything else”
[0] https://docs.timescale.com/timescaledb/latest/how-to-guides/...
[1] https://db-engines.com/en/ranking_trend/system/PostgreSQL
[2] https://db-engines.com/en/ranking_trend/system/Oracle
Answer: https://db-engines.com/en/ranking_definition
An ML program that automatically tunes your production database in real time. What could possibly go wrong?
0 - https://luabase.com/
ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. https://github.com/zombodb/zombodb#readme
Well, unless each node has a complete copy of the data?
The critical importance of extensibility as a primary concern of successful DB products needs to be highlighted. The realities of the domain dictate that product X matures a few years after inception, by which point application patterns may have shifted. (Remember map-reduce?) If you pay attention, for example, you'll note that the darlings du jour are scrambling to claim fitness for ML (a subset of big data), and the newcomers are claiming to be "designed for ML".
Smart VC money should be on the extensible players.