pjdesno · a year ago
Unfortunately this post skips over the "atomicity" part of a write-ahead log.

Assume you start with data on disk AAAAAAAA, read it into memory, and update it to BBBBBBBB, then write it back. If you crash in the middle, you might end up with BBBAAAAA, BBBBBBAA, or even some crazy interleaving. (at least for reasonable file sizes - note that the largest atomic write to many NVMe drives is 128K)
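
A minimal simulation of that failure mode (hypothetical helper, Python) - a crash mid-write can leave any prefix of the new bytes sitting over the old ones:

```python
# Sketch: what a file might contain if a crash happens after only
# `bytes_flushed` bytes of the new data reached the disk.
def torn_write(old: bytes, new: bytes, bytes_flushed: int) -> bytes:
    assert len(old) == len(new)
    return new[:bytes_flushed] + old[bytes_flushed:]

print(torn_write(b"AAAAAAAA", b"BBBBBBBB", 3))  # b'BBBAAAAA'
```

Real drives can also reorder writes, so the interleavings can be stranger than a simple prefix.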

If you ditch the in-memory BTree and write a direct-to-disk one, then with a lot of care (and maybe a bit of copy-on-write) you can make sure that each individual disk write leaves the database in a crash-consistent state, but that will cost multiple writes and fsyncs for any database modification that splits or merges BTree nodes.

(for those of you old enough to remember ext2, it had the same problem. If you mounted it async and had a bad crash, the data on disk would be inconsistent - you'd lose data, so you'd vow to always mount your filesystem with synchronous writes so you'd never lose data again, then a few weeks later you'd get tired of the crappy performance and go back to async writes, until the next crash happened, etc. etc.)

The advantage of a log is that it allows you to combine multiple writes to different parts of the database file into a single record, guaranteeing (after crash recovery if necessary) that either all changes happen or none of them do. It serves the same purpose as a mutex in multi-threaded code - if your invariants hold when you get the mutex, and you reestablish them before you drop it, everything will be fine. We'd all love to have a mutex that keeps the system from crashing, but failing that we can use a WAL record to ensure that we move atomically from one valid state to another, without worrying about the order of intermediate changes to the data structure.
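
A sketch of that all-or-nothing property (hypothetical record format, Python): each WAL record bundles several page updates behind one length+checksum header, so recovery either replays a record completely or discards its torn tail:

```python
import os
import struct
import zlib

def append_record(wal_path: str, updates: dict[int, bytes]) -> None:
    """Bundle several (offset -> bytes) updates into one atomic WAL record."""
    payload = b"".join(struct.pack(">IH", off, len(data)) + data
                       for off, data in updates.items())
    header = struct.pack(">II", len(payload), zlib.crc32(payload))
    with open(wal_path, "ab") as f:
        f.write(header + payload)
        f.flush()
        os.fsync(f.fileno())  # record is durable before touching the main file

def recover(wal_path: str) -> list[dict[int, bytes]]:
    """Replay complete records; a torn tail record fails its checksum
    and is dropped, so every surviving record is all-or-nothing."""
    records = []
    with open(wal_path, "rb") as f:
        buf = f.read()
    pos = 0
    while pos + 8 <= len(buf):
        length, crc = struct.unpack_from(">II", buf, pos)
        payload = buf[pos + 8:pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn tail: discard it, everything before is intact
        updates, p = {}, 0
        while p < len(payload):
            off, n = struct.unpack_from(">IH", payload, p)
            updates[off] = payload[p + 6:p + 6 + n]
            p += 6 + n
        records.append(updates)
        pos += 8 + length
    return records
```

Real WALs add sequence numbers, record types, and segment rotation, but the checksum-guarded record boundary is the core of the atomicity argument above.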

neal · a year ago
Good point!

I'm not sure if there are any databases that do your 'with a lot of care' option, but for anyone curious about what that might look like in practice there are file systems that forgo write-ahead logging and maintain metadata consistency using techniques like soft updates[0] or copy-on-write up to a root[1].

[0]: https://www.usenix.org/conference/1999-usenix-annual-technic... [1]: https://www.cs.hmc.edu/~rhodes/courses/cs134/fa20/readings/T... (yes, ZFS can be configured to use a WAL too for durability)

refset · a year ago
> I'm not sure if there are any databases that do your 'with a lot of care' option

LMDB springs to mind.

> LMDB was designed to resist data loss in the face of system and application crashes. Its copy-on-write approach never overwrites currently-in-use data. Avoiding overwrites means the structure on disk/storage is always valid, so application or system crashes can never leave the database in a corrupted state.

https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa...

philkrylov · a year ago
> yes, ZFS can be configured to use a WAL too for durability

ZFS always uses ZIL for sync writes. You can optionally set `sync=disabled` for a dataset to switch it off. I would not describe it as "can be configured to use WAL".

makz · a year ago
I imagine “a lot of care” like this:

Ok, let’s write this veeery slowly, good, let’s check now it got written as intended. Right, let’s check again just in case, but this time slowly and paying a lot of attention… ok, looks good, moving on

bjornsing · a year ago
For some data structures you get the atomicity for “free” though. For example, a log structured merge tree where writes are done in large transactions could well do without a WAL and still achieve atomicity.
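
A sketch of why that works (hypothetical file layout, Python): a memtable flush writes the whole sorted run to a temp file, fsyncs it, and renames it into place - the rename is the atomic commit point, so no per-write log record is needed:

```python
import json
import os
import tempfile

def flush_memtable(memtable: dict, dirname: str, run_no: int) -> str:
    """Flush an in-memory table as one atomic sorted run ("SSTable").
    A crash before the rename leaves no visible run; after it, the whole
    run is visible. (Writes still buffered in the memtable are lost.)"""
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        json.dump(sorted(memtable.items()), f)  # the whole run in one file
        f.flush()
        os.fsync(f.fileno())
    final = os.path.join(dirname, f"sstable-{run_no:06d}")
    os.rename(tmp, final)  # atomic on POSIX filesystems
    # fsync the directory so the rename itself survives a crash
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
    return final
```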
valyala · a year ago
A good example of a database that relies on LSM tree properties to protect against data corruption on unclean shutdown is VictoriaMetrics [1].

[1] https://valyala.medium.com/wal-usage-looks-broken-in-modern-...

amluto · a year ago
One might argue that an LSM tree has a WAL — LSM’s log contains logical records instead of physical changes, but it seems like the essentials are there.
nyrikki · a year ago
Ever have an LSMT quit writing a run halfway through? Ever need to try and restore Cassandra where mutation storage order isn't guaranteed, with the last mutation always winning?

It is absolutely not "atomic" if you have ever had to do a recovery from SSTable snapshots.

eatonphil · a year ago
One of the interesting things I came up against while writing this post was the pretty common misconception that SQLite and Postgres will validate your data with checksums [0]. SQLite leaves this to an optional extension (this is a little more commonly known). Here was an HN comment I stumbled on about Postgres [1]:

> Postgres (and other databases that actually care about data integrity) have per page checksums:

Postgres does have support for data checksumming (but not metadata checksumming) but defaults to disabling data checksumming. MongoDB (WiredTiger) on the other hand defaults to checksumming. I was told after publishing this post that MySQL (InnoDB) also does checksum by default but I did not check that in my survey.

Links for proof are in the post.

[0] https://x.com/eatonphil/status/1807572135340134687

[1] https://news.ycombinator.com/item?id=25231308
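
For illustration, a per-page checksum along the lines the thread is discussing might look like this (hypothetical page layout, Python) - the checksum of the page body is stored in the page's first 4 bytes and verified on every read:

```python
import zlib

PAGE_SIZE = 8192  # Postgres-style page size; the layout here is made up

def seal_page(body: bytes) -> bytes:
    """Prefix the page body with a CRC32 of its contents."""
    assert len(body) == PAGE_SIZE - 4
    return zlib.crc32(body).to_bytes(4, "big") + body

def verify_page(page: bytes) -> bool:
    """A mismatch means the page was torn, bit-flipped, or stomped on."""
    stored = int.from_bytes(page[:4], "big")
    return stored == zlib.crc32(page[4:])
```

(Postgres uses its own FNV-based algorithm rather than CRC32, but the detect-on-read shape is the same.)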

wongarsu · a year ago
That's interesting. [1] seems to be the relevant page in the postgres documentation.

tl;dr: run `SHOW data_checksums` to see if checksums are enabled (they were enabled by default on our GCP Cloud SQL instance); to enable them, shut down Postgres and run `pg_checksums --enable --progress`

1: https://www.postgresql.org/docs/current/checksums.html

slaymaker1907 · a year ago
I’ve come around to the idea that checksumming isn’t some universal panacea. Once you have checksums, you need to start thinking about error *recovery* and not just detection. What happens to linked data? Do you just declare the whole SQLite file corrupted and just throw up your hands?

If you have backups, that’s one obvious solution, but people might not have uncorrupted backups.

I still think you should do it and emit a warning about it at the very least, but it’s not trivial to handle.

throw0101b · a year ago
Just use ZFS. :)
dboreham · a year ago
Database engines that live as libraries inside application processes (like WT) have a more pressing need for checksums because the most common attack vector is buggy pointer arithmetic outside the library that stomps on data in memory randomly.
whizzter · a year ago
Checksumming is also useful when storage goes bad or other random hardware corruption like cosmic rays or overheating/old computers.
fdr · a year ago
yeah, that one is a path dependency issue, re: postgres. I'm not sure if it's well reasoned past inertia after the initial "it's new, let's not throw the entire user base into it at once", but at the very least, it will complicate pg_upgrade somewhat. WiredTiger didn't really have that path dependency, being itself new to Mongo to shore up the storage situation.

It's probably about time to swap the default, but some people's once-working pg_upgrade programs that they haven't looked at in a while might break. Probably okay; those things need to happen...once in a while. I suppose some people resent the overhead of Postgres checksumming atop their ZFS/btrfs/dm-integrity/whatever stacks, but they are somewhat rare.

funny_falcon · a year ago
Most weird thing: PostgreSQL's checksum algorithm is almost free on modern processors. It uses SIMD instructions extremely well in optimized builds. I've never seen it in a CPU profile, even though I always enable it.

There's no reason not to enable it (except pg_upgrade).

zellyn · a year ago
I found the Developer Voices discussion[1][2] with Joran Dirk Greef, creator of TigerBeetle, to be fascinating. They mentioned that the rigorous correctness proofs that exist for Raft and Paxos assume an absence of disk faults, but that more modern theory includes fixes for that, as long as you marry the log writing and consensus algorithms together properly rather than keeping each as a separate black box.

[1] https://podcasts.apple.com/us/podcast/databases-ambitions-an...

[2] https://www.youtube.com/watch?v=ayG7ltGRRHs

refset · a year ago
Specifically, that modern theory is "protocol-aware recovery" - e.g. https://www.usenix.org/conference/fast18/presentation/alagap...
samatman · a year ago
I second this recommendation. I'm impressed by what TigerBeetle has accomplished, and would love to see more of the industry approach software engineering with their philosophy and rigour.

The TigerStyle video is well worth watching for anyone who wants to build defect-free software systems, or recognizes that this is the correct thing to aim for. https://www.youtube.com/watch?v=w3WYdYyjek4

AdamProut · a year ago
Maybe good to mention torn pages somewhere too? Both MySQL and Postgres jump through some hoops to both detect and repair them [1][2]. So even in the scenario in the post where fsync is used to harden writes, the database still needs to handle torn pages (or require a file system / storage device that guarantees atomic page writes at the page size the database is using, as several managed/cloud databases do).

[1] https://wiki.postgresql.org/wiki/Full_page_writes [2] https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-b...

eatonphil · a year ago
Thanks Adam! I think torn writes would still be caught via checksums, no? Although that may be later than you'd wish.

I'm not confident but from reading that page it seems that for Postgres at least, if it did do checksums it might not need to count on page-level atomic writes?

AdamProut · a year ago
Checksums can detect a torn page, but not always repair them. It's likely a good part of the database page is gone (i.e., an amount of data that matches the disk / file system atomic write unit size is probably missing). Torn page writes are a pretty common scenario too, so databases need to be able to fully recover from them - not just detect them and report a corruption (ie., just pull the power plug from the machine during a heavy write workload and you're likely to get one - it doesn't require a solar ray to flip a bit :) ).
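
A sketch of the full-page-write style of repair (hypothetical layout, Python): log the complete new page image before the in-place write, and on recovery rewrite any page that doesn't match its logged image instead of merely reporting corruption:

```python
import os

PAGE_SIZE = 4096  # hypothetical page size for this sketch

def write_page_with_fpw(data_path: str, wal_path: str,
                        page_no: int, image: bytes) -> None:
    assert len(image) == PAGE_SIZE
    with open(wal_path, "ab") as wal:   # 1. log the full page image first
        wal.write(page_no.to_bytes(4, "big") + image)
        wal.flush()
        os.fsync(wal.fileno())
    with open(data_path, "r+b") as f:   # 2. then update the page in place
        f.seek(page_no * PAGE_SIZE)
        f.write(image)
        f.flush()
        os.fsync(f.fileno())

def repair_torn_page(data_path: str, page_no: int,
                     logged_image: bytes) -> None:
    """On recovery: if the on-disk page doesn't match the logged image,
    the in-place write was torn; rewrite the whole page from the WAL copy."""
    with open(data_path, "r+b") as f:
        f.seek(page_no * PAGE_SIZE)
        if f.read(PAGE_SIZE) != logged_image:
            f.seek(page_no * PAGE_SIZE)
            f.write(logged_image)
            f.flush()
            os.fsync(f.fileno())
```

This is the intuition behind Postgres's full_page_writes and InnoDB's doublewrite buffer: the WAL copy of the page is the thing that makes repair, not just detection, possible.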
londons_explore · a year ago
Worth noting that there are various tricks to combine the write ahead log and the rest of the on-disk data, meaning the vast majority of data will not be written more than once, and you'll still get the durability and data locality benefits of a WAL.
7e · a year ago
Could you elaborate? Do you mean an LSM tree or log structured filesystem?
simonz05 · a year ago
Lsm trees also need a log, right?
beltsazar · a year ago
But the WAL is not for durability - it's for atomicity and consistency. And yes, you need to use fsync to ensure durability.
anarazel · a year ago
Nitpick: If you're careful fdatasync() can suffice.
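
For illustration (assuming a Unix system and preallocated log space): fdatasync skips metadata-only flushes such as mtime, so if the write doesn't change the file's size, data durability doesn't need the full fsync:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.truncate(fd, 1 << 20)     # preallocate so appends don't grow the inode
os.pwrite(fd, b"record", 0)  # write within the preallocated region
os.fdatasync(fd)             # data is durable; mtime update may lag
os.close(fd)
```

If the write extends the file, the new size is metadata required to read the data back, and fdatasync is obliged to flush it too - that's the "if you're careful" part.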
refset · a year ago
> durability is a spectrum

This is the truth. My favourite example of this is MemoryDB's "multi-AZ (multi-datacenter) durability" for Valkey - there's a good write-up here https://brooker.co.za/blog/2024/04/25/memorydb.html

sgeisenh · a year ago
In the pseudocode snippet that introduces a WAL, the semaphores are signaled before the fsync occurs. This seems like a mistake but maybe there’s a good reason to do this that I am missing?

Edit: it looks like this is the case in the first batched fsync example, as well.

eatonphil · a year ago
Good catch! That was a think-o. I've swapped it around so that semaphores are signalled after the fsync.
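
For anyone curious, a group-commit loop with the corrected ordering might look like this (hypothetical class, Python) - waiters are released only after the batch fsync, so an unblocked caller can treat its write as durable:

```python
import os
import threading

class GroupCommitLog:
    def __init__(self, path: str):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.pending: list[threading.Semaphore] = []

    def write(self, record: bytes) -> None:
        done = threading.Semaphore(0)
        with self.lock:
            self.f.write(record)
            self.pending.append(done)
        done.acquire()  # blocks until the flusher fsyncs our batch

    def flush_batch(self) -> None:
        with self.lock:
            waiters, self.pending = self.pending, []
            self.f.flush()
            os.fsync(self.f.fileno())
        for w in waiters:  # signalled only AFTER the fsync
            w.release()
```

Signalling before the fsync (the original think-o) would let a caller proceed with a write that a crash could still erase.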
sgeisenh · a year ago
Thanks for the quick response, and for the nice write up. I thoroughly enjoy the python-like pseudocode, it was the main reason I was able to pick that out reasonably quickly!