neal (u/neal) - Readit News

pjdesno · a year ago

Unfortunately this post skips over the "atomicity" part of a write-ahead log.

Assume you start with data on disk AAAAAAAA, read it into memory, update it to BBBBBBBB, then write it back. If you crash in the middle, you might end up with BBBAAAAA, BBBBBBAA, or even some crazy interleaving. (At least for reasonable file sizes: note that the largest atomic write on many NVMe drives is 128K.)
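To make the torn-write case concrete, here's a toy sketch in Python. It assumes the simplest failure model, where a prefix of the new data reaches the disk before the crash; `torn_write` is a made-up helper, and a real drive tears at sector or block granularity and may reorder writes.

```python
def torn_write(old: bytes, new: bytes, bytes_persisted: int) -> bytes:
    """Return the on-disk contents if we crash after `bytes_persisted`
    bytes of `new` have overwritten `old` (prefix-only model)."""
    return new[:bytes_persisted] + old[bytes_persisted:]

old = b"AAAAAAAA"
new = b"BBBBBBBB"

# Every state a crash could leave behind under this model.
possible_states = [torn_write(old, new, n) for n in range(len(new) + 1)]
for state in possible_states:
    print(state.decode())
```

Only the first and last states are valid; everything in between is a mix of old and new that the application never wrote on purpose.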

If you ditch the in-memory BTree and write a direct-to-disk one, then with a lot of care (and maybe a bit of copy-on-write) you can make sure that each disk write leaves the database in a crash-consistent state. But that costs multiple writes and fsyncs for any modification that splits or merges BTree nodes, because every intermediate write has to leave the database consistent too.

(For those of you old enough to remember ext2, it had the same problem. If you mounted it async and had a bad crash, the data on disk would be inconsistent and you'd lose data. So you'd vow to always mount your filesystem with synchronous writes so you'd never lose data again, then a few weeks later you'd get tired of the crappy performance and go back to async writes, until the next crash happened, and so on.)

The advantage of a log is that it allows you to combine multiple writes to different parts of the database file into a single record, guaranteeing (after crash recovery if necessary) that either all changes happen or none of them do. It serves the same purpose as a mutex in multi-threaded code - if your invariants hold when you get the mutex, and you reestablish them before you drop it, everything will be fine. We'd all love to have a mutex that keeps the system from crashing, but failing that we can use a WAL record to ensure that we move atomically from one valid state to another, without worrying about the order of intermediate changes to the data structure.
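A rough sketch of the idea, in Python, with an in-memory `bytearray` standing in for the database file and a `bytes` value standing in for the log. The record layout (length + CRC32 header, then a list of offset/data pairs) and the names `encode_record`, `decode_record`, and `recover` are all made up for illustration, not taken from any particular database.

```python
import struct
import zlib

def encode_record(changes):
    """Pack [(offset, data), ...] into one checksummed log record."""
    body = struct.pack("<I", len(changes))
    for offset, data in changes:
        body += struct.pack("<QI", offset, len(data)) + data
    # Header: body length and CRC32, so a torn record can be detected.
    return struct.pack("<II", len(body), zlib.crc32(body)) + body

def decode_record(log, pos):
    """Return (changes, next_pos), or None if the record is torn or corrupt."""
    if pos + 8 > len(log):
        return None
    length, crc = struct.unpack_from("<II", log, pos)
    body = log[pos + 8 : pos + 8 + length]
    if len(body) < length or zlib.crc32(body) != crc:
        return None
    (count,) = struct.unpack_from("<I", body, 0)
    cursor, changes = 4, []
    for _ in range(count):
        offset, size = struct.unpack_from("<QI", body, cursor)
        cursor += 12
        changes.append((offset, body[cursor : cursor + size]))
        cursor += size
    return changes, pos + 8 + length

def recover(db, log):
    """Replay every complete record onto db; stop at the first bad one."""
    pos = 0
    while (decoded := decode_record(log, pos)) is not None:
        changes, pos = decoded
        for offset, data in changes:
            db[offset : offset + len(data)] = data
    return db

# One record that changes two separate regions of the "file".
record = encode_record([(0, b"BB"), (6, b"BB")])
```

If the whole record made it to the log, replay applies both changes, even if the crash left the data file half-updated. If the record itself is torn, the checksum fails and neither change is applied.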

neal · a year ago

Good point!

I'm not sure if there are any databases that do your 'with a lot of care' option, but for anyone curious about what that might look like in practice there are file systems that forgo write-ahead logging and maintain metadata consistency using techniques like soft updates[0] or copy-on-write up to a root[1].

[0]: https://www.usenix.org/conference/1999-usenix-annual-technic...
[1]: https://www.cs.hmc.edu/~rhodes/courses/cs134/fa20/readings/T...

(Yes, ZFS can be configured to use a WAL too, for durability.)
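For the copy-on-write flavor, a toy Python sketch of the general shape: blocks are never overwritten in place, an update writes fresh copies of every node on the path to the changed leaf, and the last step is swapping one root pointer. `BlockStore`, `cow_update`, and `read` are hypothetical names, and this assumes the root pointer update is itself atomic. It is not how any specific file system lays things out.

```python
class BlockStore:
    """Toy append-only store: existing blocks are never modified."""
    def __init__(self):
        self.blocks = []
        self.root = None  # assumed small enough to update atomically

    def write_block(self, value):
        self.blocks.append(value)
        return len(self.blocks) - 1

def cow_update(store, node_id, path, new_value):
    """Write new copies of the nodes along `path`; return the new subtree id."""
    if not path:
        return store.write_block(("leaf", new_value))
    _, children = store.blocks[node_id]
    children = list(children)
    children[path[0]] = cow_update(store, children[path[0]], path[1:], new_value)
    return store.write_block(("node", tuple(children)))

def read(store, node_id, path):
    """Follow child indexes in `path` from `node_id` and return the leaf value."""
    _, payload = store.blocks[node_id]
    return payload if not path else read(store, payload[path[0]], path[1:])

store = BlockStore()
a = store.write_block(("leaf", "A"))
b = store.write_block(("leaf", "B"))
store.root = old_root = store.write_block(("node", (a, b)))

new_root = cow_update(store, store.root, [1], "C")
# A crash here is harmless: store.root still names the old, intact tree.
store.root = new_root  # the single "commit" step
```

The old tree stays readable the whole time, so a crash before the root swap just means the update never happened.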