Afaict nothing in this benchmark will actually use AIO in 18. As of 18 there are AIO reads for seq scans, bitmap scans, vacuum, and a few other utility commands. But the queries being run should normally be planned as index range scans. We're hoping to get the work for using AIO for index scans into 19, but it could end up in 20; it's nontrivial.
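If you want to check whether the benchmark queries would even hit the AIO read path, a plan check is enough. A minimal sketch, with a made-up table and predicate:

```sql
-- Hypothetical table and query; the point is the plan node, not the schema.
-- In 18, "Seq Scan" and "Bitmap Heap Scan" nodes read through AIO;
-- "Index Scan" / "Index Only Scan" nodes do not yet.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   orders
WHERE  created_at BETWEEN '2025-01-01' AND '2025-02-01';
```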
It's also worth noting that the default for data checksums has changed (they're now enabled by default), with some overhead due to that.
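If you're not sure what a given cluster ended up with, you can check from SQL; whether checksums are on is decided at initdb time (pg_checksums can toggle them on an offline cluster), not something you flip at runtime:

```sql
-- Read-only GUC reporting whether the cluster was initialized with checksums.
SHOW data_checksums;
-- Same thing as an expression, e.g. for monitoring queries:
SELECT current_setting('data_checksums');
```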
If the benchmark doesn’t use AIO, why is there a performance difference between PG 17 and 18 in the blog post (sync, worker, and io_uring)?
Is it because remote storage in the cloud always introduces some variance & the benchmark just picks that up?
For reference, anarazel had a presentation at pgconf.eu yesterday about AIO. He mentioned that remote cloud storage always introduces variance, making benchmark results hard to interpret. His solution was to introduce synthetic latency on local NVMes for benchmarks.
I wonder if it's just being executed on different VMs with slightly different performance characteristics. I can't tell from the wording in the post whether all the runs for one test are executed on the same VM or not.
Where are the error bars? I don’t get why people run all these tests and don’t give me an idea of standard deviation or whether the differences are actually statistically significant.
If you care about performance, don't use network storage.
If you are using local nvme disk, then it does not matter if you are using Postgres 17 or 18. Performance is about the same. And significantly faster than network storage.
> Am I correct in that using local disk on any VPS has durability concerns?

Yes, it’s the ephemerality that’s the biggest issue. Enterprise-grade SSDs are quite reliable, and typically have PLP, so even in the event of a sudden power loss, any queued writes that the drive has accepted - and thus ack’d the fsync() - will be written. Presumably you’d be running some kind of redundancy, likely some flavor of RAID or zRAID (assuming purely local storage here, not a distributed system like Ceph, nor synchronous replication).
But in the cloud, if the physical server backing your instance dies, or even if someone accidentally issues a shutdown command, you don’t get that same drive back when the new instance comes up. So a problem that is normally solved by basic local redundancy suddenly becomes impossible, and you must either run synchronous or semi-sync replication (the latter is what PlanetScale Metal does), accepting a latency hit much like distributed storage, or run asynchronous replication and accept some amount of data loss, which is rarely acceptable.
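For the Postgres side of that, a minimal sketch of what synchronous / semi-sync replication looks like, assuming two standbys named standby1 and standby2 (the names and topology are made up):

```sql
-- Quorum commit: wait for any one of the two standbys to confirm the WAL flush
-- before acknowledging a commit to the client (roughly the "semi-sync" setup).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)';
-- How far the standby must have gotten before the primary acks:
-- remote_write < on < remote_apply, trading latency for stronger guarantees.
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```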
Databases like Postgres have well established ways to handle that. And if you're setting up the DB yourself, you absolutely need to do backups anyway. And a replica on a different server.
On some providers (e.g. Hetzner), the dedicated servers come by default with 2x RAID 1 disks, so it's a lot less likely to fail (unless the datacenter burns down).
Yes, a single disk in a VPS or cloud provider has durability concerns. That's why EBS and products like it that pretend to be a single disk are actually several. Instead of relying on multiple block devices, though, we create that redundancy at a higher level by relying on multiple MySQL or Postgres servers for durability, each with a local NVMe drive for performance.
Sure. To an extent. And if you run some mission-critical application, definitely.
But most applications run fine from local storage and can tolerate some downtime. They might even benefit from the improved performance. You can also address the durability and disaster recovery concerns by setting up RAID/ZFS and maintaining proper backups.
yeah, PlanetScale loves to flex how fast they are, but the main reason they're fast is that they run with one less abstraction layer than any other cloud provider, and this does in fact have trade-offs.
RAID isn't the answer, either, for the record. In AWS and GCP, the CPU or RAM blowing up will cost you access to that local NVMe drive, too, no matter how much RAID you throw at it.
just for reference: with 4 consumer NVMes in RAID 10 on PCIe x16 you can easily do 3M IOPS, for a one-time cost of like $1000
in my current job we constantly have to rethink DB queries/design because of cloud IOPS limits, and of course because we have no control over the RDS page cache and NUMA.
every time I am woken up at night because a seemingly normal query all of a sudden goes beyond our IOPS budget and the WAL starts thrashing, I seriously question my choices.
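Not that it fixes the budget, but if you can run SQL against the instance, pg_stat_io (added in 16) at least shows where the reads and writes are coming from. A minimal look, sorted by raw I/O counts:

```sql
-- Per backend-type / per-context I/O counters; handy for telling whether it's
-- client backends, the checkpointer, or autovacuum that is burning the IOPS.
SELECT backend_type, object, context, reads, writes, extends
FROM   pg_stat_io
ORDER  BY COALESCE(reads, 0) + COALESCE(writes, 0) DESC
LIMIT  10;
```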
You don't pay for RDS because you care about IOPS. You pay for it because you want backups and replication to be somebody else's problem. And because you (by which I mean probably the MBA management above you, rather than you yourself) care about it being an opex rather than capex cost, a lot more than you care about how much the cost is. And because ISO audit boxes get ticked.
> You pay for it because you want backups and replication to be somebody else's problem.
Or you just use something like CockroachDB, YugabyteDB, etc. that auto-replicate, auto-rebalance if a node goes down, and have built-in support for backups to and from S3...
Or if you're a bit more hands-on, Multigres seems to be getting close to completion ( https://github.com/multigres/multigres ), from the guy that made Vitess for MySQL.
The idea that managing hardware and software is hard is silly, yet people (mostly managers, it seems) think it's the best solution.
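To make the "backups to and from S3" bit concrete, this is roughly what it looks like in CockroachDB; the bucket and credential parameters below are placeholders, not something to copy verbatim:

```sql
-- Full cluster backup into an S3 collection (CockroachDB syntax).
-- AS OF SYSTEM TIME backs up a slightly stale snapshot to avoid contending
-- with live traffic.
BACKUP INTO 's3://my-backups/crdb?AWS_ACCESS_KEY_ID=placeholder&AWS_SECRET_ACCESS_KEY=placeholder'
    AS OF SYSTEM TIME '-10s';

-- Restore the most recent backup from that collection.
RESTORE FROM LATEST IN 's3://my-backups/crdb?AWS_ACCESS_KEY_ID=placeholder&AWS_SECRET_ACCESS_KEY=placeholder';
```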
Comparing monthly cloud cost with one-time hardware purchase cost completely dismisses the latter's long-term costs: people, replacement parts, power, housing, accessories. While I do believe you can run your own hardware much cheaper, there's a lot to consider before making the decision.
Working at IT places in the late 2000s, it was still pretty commonplace for there to be server rooms. Even for a large org with multiple sites hundreds of km apart, you could manage it with a pretty small team. And it is a lot easier to build resilient applications now than it was back then, from what I remember.
Cloud costs are getting large enough that I know I’ve got one foot out the door and a long term plan to move back to having our own servers and spend the money we save on people. I can only see cloud getting even more expensive, not less.
I was expecting `pg_dumpall` to get the `--format` option in v18,[0] but at the moment the docs say it's still only available in the development branch.[1]
Is anyone familiar with Postgres development able to give an update on the state of the feature? Is it planned for a future (18 or 19) release?

[0]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...
[1]: https://www.postgresql.org/docs/devel/app-pgdump.html#:~:tex...
Ah, my mistake, I linked to the docs for `pg_dump` (which has long had the `--format` option) rather than `pg_dumpall` (which lacks it).
Before Postgres 18 was released, the docs listed `--format` as an option for `pg_dumpall` in the upcoming version 18 (e.g. the Wayback Machine from Jun 2025: https://web.archive.org/web/20250624230110/https://www.postg... ). The relevant commit is from Apr 2025 (see link [0] in my original comment). But now all mention has been scrubbed, even from the devel branch docs.
I concluded that the better I/O planning is only worth it for "slow" I/O in 18.
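For anyone wanting to compare the modes themselves, the knob in 18 is io_method; a minimal sketch (io_uring needs a build with liburing, and the setting only changes at server start):

```sql
-- 'sync' is roughly the pre-18 behaviour, 'worker' uses background I/O
-- workers (the default in 18), 'io_uring' uses the Linux io_uring interface.
ALTER SYSTEM SET io_method = 'io_uring';
-- io_method can't be changed by reload alone; restart the server, then verify:
SHOW io_method;
```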
Pretty sure it will bring a lot of learnings. Postgres devs are pretty awesome.
the whole cloud situation is just ridiculous.
lspci is only written in the old alchemy books, in the whispers of the thrice great Hermes.
PS: I have personally put down actual fires in a datacenter, and I prefer it to this 3000 IOPS crap.
not to mention you then have all the issues people are discussing: managing your own backups, snapshots, replication, etc.
Has anybody seriously benchmarked this? I don't think io_uring would make a difference with this setting, but I'm curious, as it's the default for Oracle and Sybase.
See e.g. here: https://www.cybertec-postgresql.com/en/postgresql-18-and-bey...