LinuxBender · 9 months ago
The concepts being described are called "furious flushing". I believe Percona did a write-up on this some time ago as well; I had to help a DBA get systems to keep up with Percona MySQL's demands around furious flushing by adjusting filesystem and memory options. Here [1] is one of them. OS tuning options have changed a bit since 2014, and I believe there are more recent articles too. The biggest win for us was moving the servers to SSDs and large capacitor-backed RAID caches.

[1] - https://ftp.fau.de/macports/distfiles/percona/PerconaServer-...

echelon · 9 months ago
Are the Percona folks a bunch of core MySQL/MariaDB engineers?
LinuxBender · 9 months ago
Not as far as I know, but here is a brief history. [1] They will commercially support all of the databases, including MariaDB. Their goal was to collectively update MySQL to take advantage of modern hardware. They are also useful for companies that have complex replication and backup requirements, and they have created a lot of tools for DBAs to manage data. The DBAs I worked with had a great relationship and great experiences with them.

[1] - https://www.percona.com/sites/default/files/The-Story-of-Per...

tomnipotent · 9 months ago
The Percona founders worked at MySQL AB previously.
jiggawatts · 9 months ago
It hurts my soul that database engines still don’t batch multiple transactions per disk flush as a default setting.

It’s such an insane bottleneck on modern systems.

Animats · 9 months ago
File system APIs should be able to return events "data is transferred, buffer no longer needed" and "data is safely stored on persistent media". That's what databases really need from the file system.

Interestingly, some IBM mainframe hardware makes this distinction, with "channel end" and "device end". But Linux does not.

colanderman · 9 months ago
Linux does make this distinction.

Linux has two main disk I/O paths, buffered and direct I/O.

Direct I/O indicates "channel end" by the completion of the write operation. By that point, data has been transferred to disk from the buffer and the buffer can be reused. (Notably, direct I/O may be performed asynchronously via the Linux AIO API.) Contrary to popular belief, this does not indicate that the data is safely committed to disk. [1]

Buffered I/O does not require the concept of "channel end", since write calls complete immediately after the buffer contents have been copied into kernel space. (This mode is what PostgreSQL uses by default. I don't know about MySQL.)

In either case, `fsync(2)` (or `fdatasync(2)`) is used to indicate "device end". The disk has indicated that data is safely stored. (This can likewise be monitored asynchronously, either via the AIO API for direct I/O, or `sync_file_range(2)` for buffered I/O. The latter is used by PostgreSQL [2].)

Aside – Linux has also recently grown support for this concept in its networking API, via zerocopy `send(2)` functionality in io_uring.

[1] https://lwn.net/Articles/457667/

[2] https://news.ycombinator.com/item?id=11512653
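A minimal sketch of that two-step pattern for the buffered path, assuming Go on Linux (the file name and record contents are placeholders): `Write` returns once the bytes are in the page cache, and `Sync`, which calls `fsync(2)`, is the "device end".

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Buffered, append-only I/O: Write returns once the bytes are copied
	// into the kernel page cache, not when they are on stable storage.
	f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := f.Write([]byte("txn-1: UPDATE ...\n")); err != nil {
		log.Fatal(err)
	}

	// "Device end": Sync calls fsync(2); when it returns, the data has been
	// handed to the device and acknowledged as persistent.
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
	log.Println("record is durable; safe to report COMMIT")
}
```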

Andys · 9 months ago
I thought PostgreSQL did, but now that I check the docs, you are correct: the default `commit_delay` is zero. That would be worth increasing a little if you can afford the latency.
anarazel · 9 months ago
It works even with that setting at zero! Just requires a bit more concurrency.
anonymars · 9 months ago
How would that work without risking loss of committed transactions?

> Fully durable transaction commits are synchronous and report a commit as successful and return control to the client only after the log records for the transaction are written to disk. Delayed durable transaction commits are asynchronous and report a commit as successful before the log records for the transaction are written to disk. Writing the transaction log entries to disk is required for a transaction to be durable. Delayed durable transactions become durable when the transaction log entries are flushed to disk.

https://learn.microsoft.com/en-us/sql/relational-databases/l...

TwoPhonesOneKid · 9 months ago
Typically the SQL engine will allow flexibility on this. Not all transactions need to prioritize write-to-disk confirmation over throughput. If you're collecting observability metrics, for instance, those don't have the same data coherency constraints your app model (accounts, etc.) demands. In that case you can accept the logical commit and the tiny chance it might not actually hit the disk. Postgres at least allows customizing this per transaction, I believe, although I'm not quite sure how it works if you compose transactions with distinct synchronization constraints.
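A minimal sketch of that per-transaction knob in Postgres (`synchronous_commit` can be set with `SET LOCAL` inside a transaction), assuming Go with the lib/pq driver; the connection string and the `metrics` table are made up:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	// Relax durability for this transaction only: COMMIT returns without
	// waiting for the WAL flush, so a crash can lose it (but not corrupt data).
	if _, err := tx.Exec(`SET LOCAL synchronous_commit = off`); err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(`INSERT INTO metrics (name, value) VALUES ($1, $2)`, "requests", 1); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```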

toast0 · 9 months ago
Write data for transaction 1, write data for transaction 2, fsync, signal commit success for transactions 1 and 2.

Up to you how many transactions you want in a batch.
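A minimal sketch of that batching step, assuming Go and a WAL file opened in append mode (the function name and record format are made up):

```go
package wal

import "os"

// WriteBatch appends the log records for several transactions and makes them
// durable with a single fsync, so the flush cost is shared across the batch.
// Only after it returns may COMMIT be acknowledged for every transaction in it.
func WriteBatch(f *os.File, records [][]byte) error {
	for _, rec := range records {
		if _, err := f.Write(rec); err != nil {
			return err
		}
	}
	return f.Sync() // one fsync(2) covers all records written above
}
```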

bob1029 · 9 months ago
It is really shocking to me that this isn't more of a thing.

If the I/O block size is 4,096 bytes and you are doing things like updating a user's email address or some boolean preference, you could get a LOT more data packed in per unit space & time. Each block is enough to store an entire page of text or a small image.

The only caveat is that the system needs to be consistently under heavy load, or you would have to force clients to wait unreasonable amounts of time for a sufficient batch to accumulate.

I think the best goldilocks path is to put something like an MPSC ring buffer in front of the command processor and "take what you can get" in terms of batch size each iteration.
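A minimal sketch of that "take what you can get" loop, assuming Go, with a buffered channel standing in for the ring buffer (names and record format are made up):

```go
package main

import (
	"log"
	"os"
)

// A commit request: the WAL record plus a channel to signal durability.
type commitReq struct {
	record []byte
	done   chan error
}

// groupCommitter drains whatever requests have accumulated ("take what you
// can get"), writes them all, issues a single fsync, then acks every waiter.
func groupCommitter(wal *os.File, reqs <-chan commitReq) {
	for first := range reqs {
		batch := []commitReq{first}
		// Opportunistically grab everything already queued, without waiting.
	drain:
		for {
			select {
			case r := <-reqs:
				batch = append(batch, r)
			default:
				break drain
			}
		}
		var err error
		for _, r := range batch {
			if _, werr := wal.Write(r.record); werr != nil {
				err = werr
				break
			}
		}
		if err == nil {
			err = wal.Sync() // one flush for the whole batch
		}
		for _, r := range batch {
			r.done <- err
		}
	}
}

func main() {
	wal, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	reqs := make(chan commitReq, 1024) // the queue in front of the committer
	go groupCommitter(wal, reqs)

	// One client committing one record.
	done := make(chan error, 1)
	reqs <- commitReq{record: []byte("txn-42: COMMIT\n"), done: done}
	if err := <-done; err != nil {
		log.Fatal(err)
	}
	log.Println("durable")
}
```

Under light load each batch is just one record (no added latency); under heavy load batches grow naturally and the fsync cost is amortized.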

colanderman · 9 months ago
They do, if you issue the transactions concurrently (i.e., multiple DB connections). This is "group commit" as referenced in the article.

You can't batch sequential transactions without violating the semantics of `COMMIT`. (Data is after all not durable if it's not flushed to disk!) Hence such behavior is not the default in quality databases.

If you are issuing many sequential writes and you don't care about them being individually committed, you can and should group them.

avinassh · 9 months ago
where can I read more?
trhway · 9 months ago
>MySQL will group multiple writes with each fsync

In the old days of HDDs, the Linux I/O schedulers (some of them) would also reorder the writes in the queue to minimize head seeks.

>A modern disk can do ~1000 fsyncs per second

Sounds low for an SSD. I haven't benchmarked in a while, though. It sounds like something a 5-7 HDD disk array would do, if I remember the numbers correctly.

tibbar · 9 months ago
The article seems to imply that fsyncs need to happen in a linear order, that is, 1ms / fsync -> 1000 fsyncs/sec. It seems to imply that any batching happens in a linear order as well, that is, we have to completely finish one batch of fsyncs before the next one begins. Is that true? Obviously some systems (including databases) will happily let the beginning of one transaction overlap with another, only simulating linearizable transactions when that is required by the isolation level. But I don't have a great mental model of what the file systems are doing under the hood.
evanelias · 9 months ago
The error semantics of fsync are very important here and this can get messy, for example see https://danluu.com/fsyncgate/
jcartw · 9 months ago
The author does mention, after reviewing the test results, that the OS appears to be performing a sort of batching or aggregation of fsyncs, and concludes that more than 1000 fsyncs are occurring per second. I've also confirmed this by running some benchmarking on EC2 instances with gp2 volumes: https://justincartwright.com/2025/03/13/iops-and-fsync.html
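If you want a rough number for your own hardware, here is a minimal benchmark sketch along those lines, assuming Go (file name, block size, and iteration count are arbitrary):

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	f, err := os.OpenFile("fsync-test.dat", os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	const n = 2000
	buf := make([]byte, 4096) // one block per iteration
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		if err := f.Sync(); err != nil {
			log.Fatal(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d fsyncs in %v (%.0f fsyncs/sec)\n", n, elapsed, float64(n)/elapsed.Seconds())
}
```

Note this measures the full write+fsync round trip from a single thread, so device caches and the volume's queue depth will dominate the result.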
Palomides · 9 months ago
On an SSD with PLP (power-loss protection), or on Optane, I think you can get 20x that rate, or more.
voodoo_child · 9 months ago
There’s a nice summary of group commit in this MySQL 5.6 worklog for the binary log group commit implementation: https://dev.mysql.com/worklog/task/?id=5223

Great blog here too on changes since: https://hackmysql.com/group-commit-and-trx-dependency-tracki...

philippta · 9 months ago
This is something I’ve been curious about for a long time, so thanks for the write-up.

When it comes to system design, we often ignore what modern hardware is capable of and overlook fundamentals like fsyncs/s or, more generally, IOPS.

So rather than assuming we need a horizontally scalable system, we should first look for hardware that matches our needs.

bjornsing · 9 months ago
I only skimmed through, so I might be wrong, but I got the impression that the mystery is never really solved… If transactions are batched, then MySQL will sometimes have to return OK before the transaction is on disk, no? But it’s stated that the configuration is fully ACID. Seems like a contradiction.
ncruces · 9 months ago
No.

Two non-conflicting transactions (where neither transaction reads the other's changes) can both be left hanging at COMMIT, so their COMMIT/fsync can be batched into a single flush.

Depending on how you interpret ACID, even two conflicting transactions (B reads what A wrote) can be left hanging. Whoever submitted A can be left blocked at COMMIT, while B is processed assuming A had committed; then either both commit or both fail. This is ACID as long as you only consider data committed once the COMMIT actually succeeds, and don't produce observable effects (outside the system) before that's true.

bjornsing · 9 months ago
The performance test program does one insert at a time, sequentially…
mnahkies · 9 months ago
The transaction is written to the write ahead log and fsync'd before the client is told OK.

It may not have made it to the table storage on disk yet (but it will be in memory), but if there's a crash the WAL can be replayed to recover safely. You'll see Postgres do this if it crashes from OOM or gets restarted unexpectedly.
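A minimal sketch of that replay step, assuming Go and a toy newline-delimited log; the `apply` callback is a placeholder for reapplying a record:

```go
package main

import (
	"bufio"
	"log"
	"os"
)

// replayWAL re-applies every record found in the log on startup. Records that
// made it to disk before the crash are recovered; anything after the last
// fsync is simply absent and was never acknowledged to a client.
func replayWAL(path string, apply func(record string) error) error {
	f, err := os.Open(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // nothing to recover
		}
		return err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if err := apply(sc.Text()); err != nil {
			return err
		}
	}
	return sc.Err()
}

func main() {
	err := replayWAL("wal.log", func(rec string) error {
		log.Printf("reapplying: %s", rec)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```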

hackit2 · 9 months ago
I might be wrong, but most disk controllers report the data as written when it isn't actually on the drive yet.
darkwater · 9 months ago
But then it's in their cache, which is battery-backed and survives power loss. It's the same concept the write-ahead log applies at a higher level.
tandr · 9 months ago
This is the same fella who hosts github.com/sirupsen/logrus! Man, I would like to say a big THANK YOU for all your work there!
cyansmoker · 9 months ago
Dude literally has a BUNCH of repos I want to explore now.

andrewstuart · 9 months ago
Couldn’t be too hard to find the max batch time in the MySQL source and see what it does…