Async I/O on Linux in databases

The article claims that, when they switched to io_uring,

> throughput increased by an order of magnitude almost immediately

But right near the start is the real story: the sync version had

> the classic fsync() call after every write to the log for durability

They are not comparing performance of sync APIs vs io_uring. They're comparing using fsync vs not using fsync! They even go on to say that a problem with async API is that

> you lose the durability guarantee that makes databases useful. ... the data might still be sitting in kernel buffers, not yet written to stable storage.

No! That's because you stopped using fsync. It's nothing to do with your code being async.

If you just removed the fsync from the sync code you'd quite possibly get a speedup of an order of magnitude too. Or if you put the fsync back in the async version (I don't know io_uring well enough to understand that but it appears to be possible with "io_uring_prep_fsync") then that would surely slide back. Would the io_uring version still be faster either way? Quite possibly, but because they made an apples-to-oranges comparison, we can't know from this article.

(As other commenters have pointed out, their two-phase commit strategy also fails to provide any guarantee. There's no getting around fsync if you want to be sure that your data is really on the storage medium.)

codys · a month ago

> > you lose the durability guarantee that makes databases useful. ... the data might still be sitting in kernel buffers, not yet written to stable storage.

> No! That's because you stopped using fsync. It's nothing to do with your code being async.

From that section, it sounds like OP was tossing data into the io_uring submition queue and calling it "done" at that point (ie: not waiting for the io_uring completion queue to have the completion indicated). So yes, fsync is needed, but they weren't even waiting for the kernel to start the write before indicating success.

I think to some extent things have been confused because io_uring has a completion concept, but OP also has a separate completion concept in their dual wal design (where the second WAL they call the "completion" WAL).

But I'm not sure if OP really took away the right understanding from their issues with ignoring io_uring completions, as they then create a 5 step procedure that adds one check for an io_uring completion, but still omits another.

> 1. Write intent record (async)

> 2. Perform operation in memory

> 3. Write completion record (async)

> 4. Wait for the completion record to be written to the WAL

> 5. Return success to client

Note the lack of waiting for the io_uring completion of the intent record (and yes, there's still not any reference to fsync or alternates, which is also wrong). There is no ordering guarantee between independent io_urings (OP states they're using separate io_uring instances for each WAL), and even in the same io_uring there is limited ordering around completions (IOSQE_IO_LINK exists, but doesn't allow traversing submission boundaries, so won't work here because OP submits the work a separate times. They'd need to use IOSQE_IO_DRAIN which seems like it would effectively serialize their writes. which is why It seems like OP would need to actually wait for completion of the intent write).

cryptonector · a month ago

Correct, TFA needs to wait for the completion of _all_ writes to the WAL, which is what `fsync()` was doing. Waiting only for the completion of the "completion record" does not ensure that the "intent record" made it to the WAL. In the event of a power failure it is entirely possible that the intent record did not make it but the completion record did, and then on recovery you'll have to panic.

osigurdson · a month ago

Suggest watching the Tigerbeatle video link in the article. There they discuss bitrot, "fsync gate", how Postgres used fsync wrong for 30 years, etc. It is very interesting even as pure entertainment.

jorangreef · a month ago

Thanks! Great to hear you enjoyed our talk. Most of it is simply putting the spotlight on UW-Madison’s work on storage faults.

Just to emphasize again that this blog post here is really quite different, since it does not fsync and breaks durability.

Not what we do in TigerBeetle or would recommend or encourage.

zozbot234 · a month ago

So OP's real point is that fsync() sucks in the context of modern hardware where thousands of I/O reqs may be in flight at any given time. We need more fine-grained mechanisms to ensure that writes are committed to permanent storage, without introducing undue serialization.

quietbritishjim · a month ago

Well, there already is slightly more fine gained control: in the sync version, you can perhaps call sync write() a few times before calling fsync() once i.e. basically batch up a few writes. That does have the disadvantage that you can't easily queue new writes while waiting for the previous ones. Perhaps you could use calls to write() in another thread while the first one is waiting for fsync() for the previous batch? You could even have lots of threads doing that in parallel, but probably not the thousands that you mentioned. I don't know the nitty gritty of Linux file IO well enough to know how well that would work.

As I said, I don't know anything about fsync in io_uring. Maybe that has more control?

An article that did a fair comparison, by someone who actually knows what they're talking about, would be pretty interesting.

ImPostingOnHN · a month ago

Some applications, like Apache Kafka, don't immediately fsync every write. This lets the kernel batch writes and also linearize them, both adding speed. Until synced, the data exists only in the linux page cache.

To deal with the risk of data loss, multiple such servers are used, with the hope that if one server dies before syncing, another server to which the data was replicated, performs an fsync without failure.

stefanha · a month ago

The Linux RWF_DSYNC flag sets the Full Unit Access (FUA) bit in write requests. This can be used instead of fdatasync(2) in some cases. It only syncs a specific write request instead of the entire disk write cache.

Deleted Comment

ajross · a month ago

> There's no getting around fsync if you want to be sure that your data is really on the storage medium.

That's not correct; io_uring supports O_DIRECT write requests just fine. Obviously bypassing the cache isn't the same as just flushing it (which is what fsync does), so there are design impacts.

But database engines are absolutely the target of io_uring's feature set and they're expected to be managing this complexity.

jandrewrogers · a month ago

O_DIRECT is not a substitute for fsync(). It only guarantees that data gets to the storage device cache, which is not durable in most cases.

codys · a month ago

> But database engines are absolutely the target of io_uring's feature set and they're expected to be managing this complexity.

io_uring includes an fsync opcode (with range support). When folks talk about fsync generally here, they're not saying the io_uring is unusable, they're saying that they'd expect the fsync to be used whether it's via the io_uring opcode, the system call, or some other mechanism yet to be created.

zozbot234 · a month ago

That's not what O_DIRECT is for. Did you mean O_SYNC ?

quietbritishjim · a month ago

Is that's true (notwithstanding objections from sibling comments) then that's just another spelling of fsync.

My point was really: you can't magically get the performance benefits of omitting fsync (or functional equivalent) while still getting the durability guarantees it gives.

I don't get this. How can two(+) WAL operations be faster than one (double the sync IOPS)?

I think this database doesn't have durability at all.

benjiro · a month ago

fsync waits for the drive to report back the success write. When you do a ton of small writes, fsync becomes a bottleneck. Its a issue of context switching and pipelining with fsync.

When you async write data, you do not need to wait for this confirmation. So by double writing two async requests, you are better using all your system CPU cores as they are not being stalled waiting for that I/O response. Seeing a 10x performance gain is not uncommon using a method like this.

Yes, you do need to check if both records are written and then report it back to the client. But that is a non-fsync request and does not tax your system the same as fsync writes.

It has literally the same durability as a fsync write. You need to take in account, that most databases are written 30, 40 ... years ago. In the time when HDDs ruled and stuff like NVME drives was a pipedream. But most DBs still work the same, and threat NVME drives like they are HDDs.

Doing this above operation on a HDD, will cost you 2x the performance because you barely have like 80 to 120 IOPS/s. But a cheap NVME drive easily does 100.000 like its nothing.

If you even monitored a NVME drive with a database write usage, you will noticed that those NVME drives are just underutilized. This is why you see a lot more work in trying new data storage layers being developed for Databases that better utilize NVME capabilities (and trying to bypass old HDD era bottlenecks).

zozbot234 · a month ago

> It has literally the same durability as a fsync write

I don't think we can ensure this without knowing what fsync() maps to in the NVMe standard, and somehow replicating that. Just reading back is not enough, e.g. the hardware might be reading from a volatile cache that will be lost in a crash.

codys · a month ago

> Yes, you do need to check if both records are written and then report it back to the client. But that is a non-fsync request and does not tax your system the same as fsync writes.

What mechanism can be used to check that the writes are complete if not fsync (or adjacent fdatasync)? What specific io_uring operation or system call?

To be clear, this is different to what we do (and why we do it) in TigerBeetle.

For example, we never externalize commits without full fsync, to preserve durability [0].

Further, the motivation for why TigerBeetle has both a prepare WAL plus a header WAL is different, not performance (we get performance elsewhere, through batching) but correctness, cf. “Protocol-Aware Recovery for Consensus-Based Storage” [1].

Finally, TigerBeetle's recovery is more intricate, we do all this to survive TigerBeetle's storage fault model. You can read the actual code here [2] and Kyle Kingsbury's Jepsen report on TigerBeetle also provides an excellent overview [3].

[0] https://www.youtube.com/watch?v=tRgvaqpQPwE

[1] https://www.usenix.org/system/files/conference/fast18/fast18...

[2] https://github.com/tigerbeetle/tigerbeetle/blob/main/src/vsr...

[3] https://jepsen.io/analyses/tigerbeetle-0.16.11.pdf

jmpman · a month ago

“Write intent record (async) Perform operation in memory Write completion record (async) Return success to client

During recovery, I only apply operations that have both intent and completion records. This ensures consistency while allowing much higher throughput. “

Does this mean that a client could receive a success for a request, which if the system crashed immediately afterwards, when replayed, wouldn’t necessarily have that request recorded?

How does that not violate ACID?

> Does this mean that a client could receive a success for a request, which if the system crashed immediately afterwards, when replayed, wouldn’t necessarily have that request recorded?

Yup. OP says "the intent record could just be sitting in a kernel buffer", but then the exact same issue applies to the completion record. So confirmation to the client cannot be issued until the completion record has been written to durable storage. Not really seeing the point of this blogpost.

JasonSage · a month ago

As best I can tell, the author understands that the async write-ahead fails to be a guarantee where the sync one does… then turns their async write into two async writes… but there’s still no guarantee comparable to the synchronous version.

So I fail to see how the two async writes are any guarantee at all. It sounds like they just happen to provide better consistency than the one async write because it forces an arbitrary amount of time to pass.

m11a · a month ago

Yeah, I feel like I’m missing the point of this. The original purpose of the WAL was for recovery, so WAL entries are supposed to be flushed to disk.

Seems like OP’s async approach removes that, so there’s no durability guarantee, so why even maintain a WAL to begin with?

tlb · a month ago

The recovery process is to "only apply operations that have both intent and completion records." But then I don't see the point of logging the intent record separately. If no completion is logged, the intent is ignored. So you could log the two together.

Presumably the intent record is large (containing the key-value data) while the completion record is tiny (containing just the index of the intent record). Is the point that the completion record write is guaranteed to be atomic because it fits in a disk sector, while the intent record doesn't?

ta8645 · a month ago

It's really not clear in the article. But I _think_ the gains are to be had because you can do the in-memory updating during the time that the WAL is being written to disk (rather than waiting for it to flush before proceeding). So I'm guessing the protocol as presented, is actually missing a key step:

    Write intent record (async)
    Perform operation in memory
    Write completion record (async)
    * * Wait for intent and completion to be flushed to disk * *
    Return success to client

gsliepen · a month ago

But this makes me wonder how it works when there are concurrent requests. What if a second thread requests data that is being written to memory by the first thread? Shouldn't it also wait for both the write intent record and completion record having been flushed to disk? Otherwise you could end up with a query that returns data that after a crash won't exist anymore.

avinassh · a month ago

    * * Wait for intent and completion to be flushed to disk * *

if you wait for both to complete, then how it can be faster than doing a single IO?

cbzbc · a month ago

Presumably the intent record is large (containing the key-value data) while the completion record is tiny

I don't think this is necessarily the case, because the operations may have completed in a different order to how they are recorded in the intent log.

I don't get this scheme at all. The protocol violates durability, because once the client receives success from server, it should be durable. However, completion record is async, it is possible that it never completes and server crashes.

During recovery, since the server applies only the operations which have both records, you will not recover a record which was successful to the client.

I think you missed the part in the middle:

-----------------

So the protocol ends up becoming:

Write intent record (async) Perform operation in memory Write completion record (async) Return success to client

In other words, the client only knows its a success when both wal files have been written.

The goal is not to provide faster responses to the client, on the first intent record, but to ensure that the system is not stuck with I/O Waiting on fsync requests.

When you write a ton of data to database, you often see that its not the core writes but the I/O > fsync that eat a ton of your resources. Cutting back on that mess, results that you can push more performance out of a write heavy server.

loeg · a month ago

No, we saw this scheme, it just doesn't work. Either of the async writes can fail after ack'ing the logical write to the client as successful (e.g., kernel crash or power failure) and then you have lost data.

jcgrillo · a month ago

There's no fsync in the async version, though, unless I missed it? The problem with the two WAL approach is that now none of the WAL writes are durable--you could encounter a situation where a client reads an entry on the completion WAL which upon recovery does not exist on disk. Before with the single fsynced WAL, writes were durably persisted.

leentee · a month ago

First, I think the article provides false claim, the solution doesn't guarantee durability. Second, I believe good synchronous code is better than bad asynchronous code, and it's way easier to write good synchronous code than asynchronous code, especially with io_uring. Modern NVMe are fast, even with synchronous IO, enough for most applications. Before thinking about asynchronous, make sure your application use synchronous IO well.

Speaking from experience, its easy to make Postgres (for example), just trash your system usage on a lot of individual or batch inserts. The NVME drives are often extreme underutilized, and your bottleneck is the whole fsync layer.

Second, the durability is the same as fsync. The client only gets reported a success, if both wall writes have been done.

Its the same guarantee as fsync but you bypass the fsync bottleneck, what in turn allows for actually using the benefits of your NVME drives better (and shifting away the resource from the i/o blocking fsync).

Yes, it involves more management because now you need to maintain two states, instead of one with the synchronous fsync operation. But that is the thing about parallel programming, its more complex but you get a ton of benefits from it by bypassing synchronous bottlenecks.

tobias3 · a month ago

sethev · a month ago

There's some faulty reasoning in this post. Without the code, it's hard to pin down exactly where things went wrong.

These are the steps described in the post:

   1. Write intent record (async)
   2. Perform operation in memory
   3. Write completion record (async)
   4. Wait for the completion record to be written to the WAL
   5. Return success to client

If 4 is done correctly then 3 is not needed - it can just wait for the intent to be durable before replying to the client. Perhaps there's a small benefit to speculatively executing the operation before the WAL is committed - but I'm skeptical and my guess is that 4 is not being done correctly. The author added an update to the article:

> This is tracked through io_uring's completion queue - we only send a success response after receiving confirmation that the completion record has been persisted to stable storage

This makes it sound like he's submitting write operations for the completion record and then misinterpreting the completion queue for those writes as "the record is now in durable storage".