Readit News
hn_throwaway_99 · 10 months ago
Wow, as someone who feels like I'm decently familiar with the ins and outs of Postgres, I thought this was a great article and I learned a ton.

It seems like one of the biggest fundamental flaws is that Postgres chose the O2N approach for tracking row versions instead of N2O. While switching to N2O wouldn't solve all problems (e.g. the article also talks about how Postgres stores full row copies and not just diffs), from an "80/20 rule" perspective, it seems like it would get rid of most of the downsides of the current implementation. For example, I'd assume that the vast majority of the time transactions want the latest row version, so using the N2O ordering means you could probably do away with storing each row version in an index, as you'd only need to traverse the linked list if you needed an older version, which should be much less common.
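A toy sketch of the point about chain ordering (plain Python, not Postgres internals): under O2N the chain head is the oldest version, so a reader wanting the latest version walks the whole chain, while under N2O the newest version sits at the head.

```python
# Illustrative version chain; Version/build_chain are made-up names,
# not anything from Postgres or the article.
class Version:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next  # link to the adjacent version in the chain

def build_chain(values):
    """Build a singly linked chain whose head is values[0]."""
    head = None
    for v in reversed(values):
        head = Version(v, head)
    return head

def hops_to_newest(head, o2n):
    """Count link traversals a reader needs to reach the newest version.
    O2N: head is the oldest, so walk to the tail.
    N2O: head already is the newest, so zero hops."""
    if not o2n:
        return 0
    hops = 0
    node = head
    while node.next is not None:
        node = node.next
        hops += 1
    return hops

versions = ["v1", "v2", "v3", "v4"]  # v4 is the newest version
print(hops_to_newest(build_chain(versions), o2n=True))                  # 3
print(hops_to_newest(build_chain(list(reversed(versions))), o2n=False)) # 0
```

The asymmetry is the commenter's point: if most reads want the newest version, N2O makes the common case O(1) and pushes the traversal cost onto the rare old-version read.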

topherjaynes · 10 months ago
You should check out Andy's History of Databases (CMU Databases / Spring 2020) on YouTube. He does the entire first class from the streets of Amsterdam because he can't get into his hotel... he's an interesting character and he's insanely good at explaining the ins and outs.
Tostino · 10 months ago
The content that his group puts out on YouTube is great. Been a little while since I caught up, but I was extremely impressed.
fweimer · 10 months ago
The big advantage is that you do not need any extra space if your workload mostly consists of INSERTs (followed by table drops). And it's generally unnecessary to split up insertion transactions because there is no size limit as such (neither on the generated data nor on the total count of rows changed). There is a limit on statements in a transaction, but you can sidestep that by using COPY FROM if you do not have to switch tables too frequently. From a DBA point of view, there is no need to manage a rollback/undo space separately from table storage.

Every application is a bit different, but it's not that the PostgreSQL design is a loser in all regards. It's not like bubble sort.

winternewt · 10 months ago
> Every application is a bit different, but it's not that the PostgreSQL design is a loser in all regards. It's not like bubble sort.

When doing game development in the early 2000s I learned that bubble sort is not a loser in all regards. It performs well when a list is usually almost sorted. One situation when this is the case is in 3D rendering, which sorts objects by their distance from the camera. As you move the camera around or rotate it, bubble sort works very well for re-sorting the objects given the order they had in the previous frame.

To prevent bad worst-case scenarios you can count the number of comparisons that failed on the last pass and the number of passes you have performed so far, then switch to a different sort algorithm after reaching a threshold.
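The early-exit-plus-fallback idea above can be sketched as follows (a hedged illustration; the threshold policy is simplified to a fixed pass budget rather than the comparison-count heuristic described):

```python
def adaptive_bubble_sort(items, max_passes=None):
    """Bubble sort with early exit: cheap when the list is nearly sorted,
    as in frame-to-frame depth sorting of 3D objects.
    If max_passes is exhausted, fall back to a general-purpose sort
    (here Python's built-in Timsort) to dodge the O(n^2) worst case."""
    a = list(items)
    n = len(a)
    if max_passes is None:
        max_passes = n  # effectively no fallback by default
    for passes in range(n - 1):
        if passes >= max_passes:
            a.sort()  # threshold reached: switch algorithms
            return a
        swapped = False
        for i in range(n - 1 - passes):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            return a  # no swaps: a nearly-sorted list exits in O(n)
    return a
```

On an almost-sorted frame (a few objects swapped places since the last render) this terminates after one or two passes, which is the property the comment is describing.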

jeltz · 10 months ago
But wouldn't insertion sort be better than bubble sort in those cases?
15155 · 10 months ago
> It's not like bubble sort.

Bubble sort is great in hardware and for mostly-sorted sets.


indulona · 10 months ago
> but it's not that the PostgreSQL design is a loser in all regards

the article literally says that pg's mvcc design is from the 90s and no one does it like that any more. that is technology that is outdated by over 30 years. i'd say it does not make it a loser in all regards, but in the most important aspects.

mikeocool · 10 months ago
When it comes to your data store, some people might consider using technology that’s been reliably used in production by many organizations for 30 years a feature not a bug.

I’d prefer not to be the first person running up against a limit or discovering a bug in my DB software.

j16sdiz · 10 months ago
> the article literally says that pg's mvcc design is from the 90s and...

Actually, it's the 1980s. From the article:

> Its design is a relic of the 1980s and before the proliferation of log-structured system patterns from the 1990s.

kunley · 10 months ago
Still, I am very happy to use every day the technology designed in the early '70s by Ken Thompson and colleagues. In that specific field, many have tried to invent something more "modern" and "better" and failed, with the exception of a certain Finnish clone of that tech, which itself was only started in the early '90s, by the way.

So newer does not always mean better, just saying.

naranha · 10 months ago
At least CouchDB is also append-only with vacuum. So it's maybe not completely outdated.
OrvalWintermute · 10 months ago
This article is incorrect IMO - the following section in particular.

“ In the 2000s, the conventional wisdom selected MySQL because rising tech stars like Google and Facebook were using it. Then in the 2010s, it was MongoDB because non-durable writes made it “webscale“. In the last five years, PostgreSQL has become the Internet’s darling DBMS. And for good reasons! It’s dependable, feature-rich, extensible, and well-suited for most operational workloads.”

Smart engineers were choosing Postgres not because of the argumentum ad populum fallacy, but for the following reasons:

Data safety (not MyISAM), ACID, similarity to Oracle, MVCC, SQL standards adherence, the Postgres team, a helpful and awesome community, rich data types, high performance, and BSD license flexibility.

Above are the reasons I selected Postgres while at AT&T in the early 2000s, and our Oracle DBA found it a very easy transition. While MySQL went through rough transitions, PG has gone from strength to strength on an ever-improving path.

I think Bruce Momjian is a big part of this success; they truly have an excellent community. <3

andruby · 10 months ago
Similar. My preference switched from MySQL to PostgreSQL in 2005 when I wanted to use database views to create a "live" compatibility layer between an old (AS400) database schema and a modern Rails app.

The preference kept growing thanks to data safety, DDL's in transactions, etc.

nightfly · 10 months ago
> MySQL and Oracle store a compact delta between the new and current versions (think of it like a git diff).

Doesn't git famously _not_ store diffs and instead follows the same storage pattern postgres uses here and stores the full new and old objects?

ChadNauseam · 10 months ago
TBF, the quoted section doesn't say that git stores diffs (or anything about git storage), it just says that what MySQL and Oracle stores is similar to a git diff.
zdragnar · 10 months ago
It's a little too easy to misinterpret if you're skimming and still have memories of working with SVN, mercurial, perforce, and probably others (I've intentionally repressed everything about tfvc).
layer8 · 10 months ago
It’s not clear why they state “git diff” specifically. It’s simply a diff (git or otherwise).
jmholla · 10 months ago
That is correct. Each version of a file is a separate blob. There is some compression done by packing to make cloning faster, but the raw form git works with is these blobs.
quotemstr · 10 months ago
git's model is a good example of layered architecture. Most of the code works in terms of whole blobs. The blob storage system, as an implementation detail, stores some blobs with diffs. The use of diffs doesn't leak into the rest of the system. Good separation of concerns.


paulddraper · 10 months ago
1. The comparison was to MySQL and Oracle storage using git diff format as an analogy, not git storage.

2. git storage does compress, and the compression is "diff-based" of sorts, but it is not based on commit history as one might naively expect.

epcoa · 10 months ago
Others have mentioned that it said “git diffs”. However git does use deltas in pack files as a low level optimization, similar to the MySQL comparison. You don’t get back diffs from a SQL query either.


Hendrikto · 10 months ago
Git diffs are generated on the fly, but diffs are still diffs.
mxey · 10 months ago
> The need for PostgreSQL to modify all of a table’s indexes for each update has several performance implications. Obviously, this makes update queries slower because the system has to do more work.

You know, I was wondering something regarding this write amplification. It's true that MySQL doesn't need to update its indexes like that. However, MySQL replication relies on the binlog, where every change has to be written in addition to the database itself (InnoDB redo log and so on). So, it seems to me, MySQL, if used in a cluster, has a different kind of write amplification. One that PostgreSQL does not have, because it reuses its WAL for the replication.

In addition, on the receiving side, MySQL first writes the incoming binlog to the relay log. The relay log is then consumed by the applier threads, creating more InnoDB writes and (by default) more binlog.

halayli · 10 months ago
This topic cannot be discussed without also talking about disks. SSDs write a 4K page at a time: if you're going to update 1 bit, the disk reads the 4K page, applies the change, and writes back a full 4K page in a new slot. So the penalty for copying varies depending on the disk type.
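The rounding effect is just ceiling arithmetic; a minimal sketch, assuming a 4096-byte flash page (a typical figure — real drives vary, and controllers add further amplification from garbage collection that this ignores):

```python
PAGE = 4096  # assumed NAND page size in bytes; drive-dependent

def physical_bytes_written(logical_bytes):
    """Model of the rounding only: a logical write is rounded up to
    whole flash pages, so even a 1-bit update costs at least one page."""
    pages = -(-logical_bytes // PAGE)  # ceiling division
    return max(pages, 1) * PAGE

print(physical_bytes_written(1))     # 4096: one bit still costs a full page
print(physical_bytes_written(8192))  # 8192: an 8K Postgres page = 2 flash pages
```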
rand_r · 10 months ago
Interesting! I wonder how this plays into AWS pricing. They charge a flat rate for MBps of IO. But I don't know if they have a rule to round up to the nearest 4K, or whether they actually charge you the IO amount from the storage implementation by tracking write volume on the drive itself, rather than what you requested.
chupasaurus · 10 months ago
> They charge a flat rate for MBps of IO

They actually charge for IOPS, the throughput is just an upper bound that is easier to sell.

For gp SSDs, contiguous 256K runs of requested pages are merged into a single operation. For io volumes, the page size is 256K on Nitro instances and 16K on everything else; st and sc volumes have a 1M page size.

srcreigh · 10 months ago
Postgres pages are 8kb so the point is moot.
olavgg · 10 months ago
The default is 8kb, but it can be recompiled for 4kb-32kb. I actually prefer 32kb because with ZSTD compression it will most likely only use 8kb after being compressed. The average compression ratio with ZSTD is usually between 4x-6x, but depending on how compressible your data is, you may also get a lot less. Note that changing this block size requires initializing a new data directory for your Postgres database.
halayli · 10 months ago
I am referring to physical pages in an SSD disk. The 8k pg page maps to 2 pages in a typical SSD disk. Your comment proves my initial point, which is write amplification cannot be discussed without talking about the disk types and their behavior.
dfox · 10 months ago
> Oracle and MySQL do not have this problem in their MVCC implementation because their secondary indexes do not store the physical addresses of new versions. Instead, they store a logical identifier (e.g., tuple id, primary key) that the DBMS then uses to look up the current version’s physical address. Now this may make secondary index reads slower since the DBMS has to resolve a logical identifier, but these DBMS have other advantages in their MVCC implementation to reduce overhead.

Interesting behavior of MySQL that I have observed (~500GB database, with a schema that is more document-oriented than relational) is that when you update a single row, doing SELECT id WHERE something; UPDATE what WHERE id=id is orders of magnitude faster than UPDATE what WHERE something. I somehow suspect this is the reason for that behavior. But the normal workload will not do this, and it only slows down ad-hoc DML when you fix some inconsistency.
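The two-step pattern being described can be made concrete with SQLite just to show the shape of the queries (the table and column names are made up, and the performance gap itself is MySQL/InnoDB-specific — SQLite won't reproduce it):

```python
import sqlite3

# Hypothetical document table; names are illustrative, not from the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [("alpha",), ("needle",), ("beta",), ("needle",)])

# One-step form: the UPDATE evaluates the predicate itself.
#   UPDATE docs SET body = 'patched' WHERE body = 'needle'

# Two-step form (the one reported as much faster on MySQL):
# SELECT the primary keys first, then UPDATE by id.
ids = [row[0] for row in
       conn.execute("SELECT id FROM docs WHERE body = 'needle'")]
placeholders = ",".join("?" * len(ids))
conn.execute(f"UPDATE docs SET body = 'patched' WHERE id IN ({placeholders})",
             ids)
conn.commit()

print(sorted(ids))  # [2, 4]: primary keys of the matched rows
```

Note the caveat raised in the replies below in the thread: without locking, rows matching the predicate can be inserted between the SELECT and the UPDATE.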

whazor · 10 months ago
A SELECT is a read-only operation and can be performed in parallel. However, an UPDATE actually writes and might lock the table, whereas UPDATE ... WHERE id=id allows for row-level locking. There is also the risk of missing newly inserted records between the SELECT and the UPDATE.
nine_k · 10 months ago
SELECT FOR UPDATE was invented to address this.
fforflo · 10 months ago
I have a couple of read-heavy >2TB Postgres instances, document-oriented too. You're right that bulk updates can be too slow. Too many times I end up doing the updates incrementally (in batches) or even using COPY.
andruby · 10 months ago
You also want to avoid long transactions to avoid lock contention. Every statement is also a transaction, so chunking it up helps a lot on busy databases.
thih9 · 10 months ago
> Then in the 2010s, it was MongoDB because non-durable writes made it “webscale“.

Off topic, it was marketing all along: https://news.ycombinator.com/item?id=15124306

Hilift · 10 months ago
It was designed by former DoubleClick engineers as an afterthought DIY db for another service because no other db met their requirements. Supposedly version 4.2.8 (2020) is fairly solid, i.e. no dirty writes. https://en.wikipedia.org/wiki/MongoDB#Technical_criticisms