ayende · 3 months ago
This is wrong, because your mmap code is being stalled by page faults (including soft page faults, which you take when the data is in memory but not yet mapped into your process).

The io_uring code looks like it is doing all the fetch work in the background (with 6 threads), then just handing the completed buffers to the counter.

Do the same with 6 threads that first read the first byte of each page and then hand that page section to the counter, and you'll find similar performance.

And you can use madvise() / huge pages to control the mmap behavior
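
For example, a rough sketch of that kind of madvise() hinting on the mapping (the file name is a placeholder and error handling is minimal):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);        /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint the kernel: the scan is sequential and the whole range will
           be needed, so it can read ahead and populate pages early. */
        madvise(p, st.st_size, MADV_SEQUENTIAL);
        madvise(p, st.st_size, MADV_WILLNEED);

        /* ... run the word count over p here ... */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }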

mrlongroots · 3 months ago
Yes, it doesn't take a benchmark to find out that storage cannot be faster than memory.

Even if you had a million SSDs and were somehow able to connect them all to a single machine, you would not outperform memory, because the data needs to be read into memory first, and can only then be processed by the CPU.

Basic `perf stat` and minor/major faults should be a first-line diagnostic.

johncolanduoni · 3 months ago
This was a comparison of two methods of moving data from the VFS to application memory. Depending on cache status this would run the whole gamut of mapping existing memory pages, kernel-to-userspace memory copies, and actual disk access.

Also, while we’re being annoyingly technical, a lot of server CPUs can DMA straight to the L3 cache so your proof of impossibility is not correct.

alphazard · 3 months ago
> storage can not be faster than memory

This is an oversimplification. It depends what you mean by memory. It may be true when using NVMe on modern architectures in a consumer use case, but it's not true about computer architecture in general.

External devices can have their memory mapped to virtual memory addresses. There are some network cards that do this for example. The CPU can load from these virtual addresses directly into registers, without needing to make a copy to the general purpose fast-but-volatile memory. In theory a storage device could also be implemented in this way.
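
As a rough illustration (not something the article does), mapping a device BAR from userspace on Linux via sysfs looks roughly like this; the PCI address and the 4 KiB length are hypothetical placeholders, and reads can have device-specific side effects:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical device; resource0 exposes its first BAR. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* Map 4 KiB of device memory into our address space. */
        volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        /* This load goes from the device straight into a CPU register;
           the access itself makes no copy into ordinary DRAM. */
        uint32_t reg0 = bar[0];
        printf("register 0 = 0x%08x\n", reg0);

        munmap((void *)bar, 4096);
        close(fd);
        return 0;
    }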

hinkley · 3 months ago
I’m pretty sure that as of PCI-E 2 this is not true.

It’s only true if you need to process the data before passing it on. You can do direct DMA transfers between devices.

In which case one needs to remember that memory isn't on the CPU: the CPU has to beg RAM for data just about as much as it does any peripheral. It works out of registers and L1, with main memory sitting behind two more layers of cache and an MMU.

lucketone · 3 months ago
It would seem you summarised the whole post.

That’s the point: “mmap” is slow because it is serial.

arghwhat · 3 months ago
mmap isn't "serial", the code that was using the mapping was "serial". The kernel will happily fill different portions of the mapping in parallel if you have multiple threads fault on different pages.

(That doesn't undermine that io_uring and disk access can be fast, but it's comparing a lazy implementation using approach A with a quite optimized one using approach B, which does not make sense.)
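
A sketch of that point (thread count, file name and page size are placeholders; error handling omitted): each thread below faults in its own slice of one shared mapping, so the soft faults are taken in parallel instead of on the single scanning thread.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 6

    struct slice { const unsigned char *base; size_t len; };

    static void *prefault(void *arg) {
        struct slice *s = arg;
        volatile unsigned long sink = 0;
        /* Touch one byte per 4 KiB page to take the fault for that page. */
        for (size_t off = 0; off < s->len; off += 4096)
            sink += s->base[off];
        return NULL;
    }

    int main(void) {
        int fd = open("data.bin", O_RDONLY);        /* placeholder file */
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        pthread_t t[NTHREADS];
        struct slice s[NTHREADS];
        size_t total = st.st_size;
        size_t chunk = (total + NTHREADS - 1) / NTHREADS;
        for (int i = 0; i < NTHREADS; i++) {
            size_t off = (size_t)i * chunk;
            size_t len = off < total ? (total - off < chunk ? total - off : chunk) : 0;
            s[i].base = p + off;
            s[i].len  = len;
            pthread_create(&t[i], NULL, prefault, &s[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* A single-threaded scan over p now runs on already-mapped pages. */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }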

guenthert · 3 months ago
Well, yes, but isn't one motivation of io_uring to make user space programming simpler and (hence) less error prone? I mean, i/o error handling on mmap isn't exactly trivial.
arunc · 3 months ago
Indeed. Use mmap with MAP_POPULATE, which will pre-populate the mapping.
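
Roughly like this (placeholder file name, no error handling):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);   /* placeholder file */
        struct stat st;
        fstat(fd, &st);
        /* MAP_POPULATE pre-faults the whole range at mmap() time, so the
           later scan shouldn't take per-page soft faults. */
        void *p = mmap(NULL, st.st_size, PROT_READ,
                       MAP_PRIVATE | MAP_POPULATE, fd, 0);
        /* ... scan p ... */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }
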
jared_hulbert · 3 months ago
Someone else suggested this; the results are even worse, by 2.5s.
bawolff · 3 months ago
Shouldn't you also compare to mmap with the huge page option? My understanding is it's precisely meant for this circumstance. I don't think it's a fair comparison without it.

Respectfully, the title feels a little clickbaity to me. Both methods are still ultimately reading out of memory; they're just using different I/O methods.

jared_hulbert · 3 months ago
The original blog post title is intentionally clickbaity. You know, to bait people into clicking. Also I do want to challenge people to really think here.

Seeing if the cached file data can be accessed quickly is the point of the experiment. I can't get mmap() to open a file with huge pages.

    void* buffer = mmap(NULL, size_bytes, PROT_READ, (MAP_HUGETLB | MAP_HUGE_1GB), fd, 0);

doesn't work.

You can see my code here https://github.com/bitflux-ai/blog_notes. Any ideas?

mastax · 3 months ago
MAP_HUGETLB can't be used when mmap()ing files on disk; it can only be used with MAP_ANONYMOUS, with a memfd, or with a file on a hugetlbfs pseudo-filesystem (which is also in memory).
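
For example, an anonymous huge-page mapping like the sketch below is allowed (assuming huge pages have been reserved on the system), while passing a regular on-disk file descriptor together with MAP_HUGETLB just fails:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 2 * 1024 * 1024;   /* one 2 MiB huge page */

        /* Anonymous huge-page mapping: allowed, provided huge pages are
           reserved (e.g. via /proc/sys/vm/nr_hugepages). */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* commonly: no huge pages reserved */
            return 1;
        }
        munmap(p, len);
        return 0;
    }
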
jandrewrogers · 3 months ago
Read the man pages, there are restrictions on using the huge page option with mmap() that mean it won’t do what you might intuit it will in many cases. Getting reliable huge page mappings is a bit fussy on Linux. It is easier to control in a direct I/O context.
mrlongroots · 3 months ago
You don't need hugepages for basic 5GB/s sequential scans. I don't know the exact circumstances that would cause TLB pressure, but this is not it.

You can maybe reduce the number of page faults, but you can do that by walking the mapped address space once before the actual benchmark too.

nextaccountic · 3 months ago
The real difference is that with io_uring and O_DIRECT you manage the cache yourself (and can't share with other processes, and the OS can't reclaim the cache automatically if under memory pressure), and with mmap this is managed by the OS.

If Linux had an API to say "manage this buffer you handed me from io_uring as if it were a VFS page cache page (so it can be shared with other processes, like mmap); if you want it back, just call this callback (so I can clean up my references to it) and you are good to go", then io_uring could really replace mmap.

What Linux currently has is PSI, which lets an application see memory pressure (so it can release its own buffers when needed), but that doesn't help with the buffer-sharing part
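
For reference, a minimal liburing sketch of that io_uring + O_DIRECT side of the trade-off (file name, queue depth, block size and alignment are placeholders; only one read is kept in flight for brevity, and O_DIRECT requires aligned buffers):

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)        /* 1 MiB per read (placeholder) */
    #define QD    8                /* queue depth (placeholder)    */

    int main(void) {
        int fd = open("data.bin", O_RDONLY | O_DIRECT);  /* bypass page cache */
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QD, &ring, 0) < 0) {
            fprintf(stderr, "io_uring_queue_init failed\n");
            return 1;
        }

        /* O_DIRECT needs the buffer, offset and length aligned, typically
           to the logical block size; 4096 is a common safe choice. */
        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        off_t offset = 0;
        for (;;) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BLOCK, offset);
            io_uring_submit(&ring);

            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) break;
            int n = cqe->res;
            io_uring_cqe_seen(&ring, cqe);
            if (n <= 0) break;     /* EOF or error */

            /* ... process buf[0..n) here; the same private buffer is reused,
               so this data never enters the shared page cache ... */
            offset += n;
        }

        free(buf);
        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }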

touisteur · 3 months ago
Yes, Linus has been ranting for decades against O_DIRECT, saying similar things (i.e. that better hints on pages and cache usage would be preferable).

The notorious archive of Linus rants on [0] starts with "The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances". It gets better afterwards, though I'm not clear whether his articulated vision is implemented yet.

[0] https://yarchive.net/comp/linux/o_direct.html

jandrewrogers · 3 months ago
I know people like to post this rant but in this case Linus simply doesn't understand the problem domain. O_DIRECT is commonly used in contexts where the fundamental mechanisms of the kernel cache are inappropriate. It can't be fixed with hints.

As a database example, there are major classes of optimization that require perfect visibility into the state of the entire page cache with virtually no overhead and strict control over every change of state that occurs. O_DIRECT allows you to achieve this. The optimizations are predicated on the impossibility of an external process modifying state. It requires perfect control of the schedule which is invalidated if the kernel borrows part of the page cache. Whether or not the kernel asks nicely doesn't matter, it breaks a design invariant.

The Linus rant is from a long time ago. Given the existence of things like io_uring which explicitly enables this type of behavior almost to the point of encouraging it, Linus may understand the use cases better now.

modeless · 3 months ago
Wait, PCIe bandwidth is higher than memory bandwidth now? That's bonkers, when did that happen? I haven't been keeping up.

Just looked at the i9-14900k and I guess it's true, but only if you add all the PCIe lanes together. I'm sure there are other chips where it's even more true. Crazy!

DiabloD3 · 3 months ago
"No."

DDR5-8000 is 64GB/s per channel. Desktop CPUs have two channels. PCI-E 5.0 in x16 is 64GB/s. Desktops have one x16.

pseudosavant · 3 months ago
One x16 slot. They'll use PCIe lanes in other slots (x4, x1, M2 SSDs) and also for devices off the chipset (network, USB, etc). The current top AMD/Intel CPUs can do ~100GB/sec over 28 lanes of mostly PCIe 5.
immibis · 3 months ago
But my Threadripper has 4 channels of DDR5, and the equivalent of 4.25 x16 PCIe 5.

You know what adds up to an even bigger number though? Using both.

modeless · 3 months ago
Hmm, Intel specs the max memory bandwidth as 89.6 GB/s. DDR5-8000 would be out of spec. But I guess it's pretty common to run higher specced memory, while you can't overclock PCIe (AFAIK?). Even so I guess my math was wrong, it doesn't quite add up to more than memory bandwidth. But it's pretty darn close!
rwmj · 3 months ago
That's the promise (or requirement?) of CXL: have your memory managed centrally, with servers accessing it over PCIe. https://en.wikipedia.org/wiki/Compute_Express_Link I wonder how many are actually using CXL; I haven't heard of any customers deploying it so far.
AnthonyMouse · 3 months ago
> Wait, PCIe bandwidth is higher than memory bandwidth now?

Hmm.

Somebody make me a PCIe card with RDIMM slots on it.

adgjlsfhk1 · 3 months ago
on server chips it's kind of ridiculous. 5th gen Epyc has 128 lanes of PCIe 5.0 for over 1TB/s of PCIe bandwidth (compared to 600GB/s RAM bandwidth from 12-channel DDR5 at 6400)
andersa · 3 months ago
Your math is a bit off. 128 lanes gen5 is 8 times x16, which has a combined theoretical bandwidth of 512GB/s, and more like 440GB/s in practice after protocol overhead.

Unless we are considering both read and write bandwidth, but that seems strange to compare to memory read bandwidth.

hsn915 · 3 months ago
Shouldn't this be "io_uring is faster than mmap"?

I guess that would not get much engagement though!

That said, cool write up and experiment.

dang · 3 months ago
Let's use that. Since HN's guidelines say "Please use the original title, unless it is misleading or linkbait", that "unless" clause seems to kick in here, so I've changed the title above. Thanks!

If anyone can suggest a better title (i.e. more accurate and neutral) we can change it again.

nine_k · 3 months ago
No. "io_uring faster than mmap" is sort of a truism: sequential page faults are slower than carefully orchestrated async I/O. The point of the article is that reading directly from a PCIe device, such as an NVMe flash, can actually be faster than caching things in RAM first.
wmf · 3 months ago
> reading directly from a PCIe device, such as an NVMe flash, can actually be faster than caching things in RAM first.

That's not true though, because the PCIe device DMAs into RAM anyway.

jared_hulbert · 3 months ago
Lol. Thanks.
MaxikCZ · 3 months ago
It's not even about clickbait for me, but I really don't want to have to parse an article to figure out what is meant by "Memory is slow, Disk is fast". You want "clickbait" to make people click and think; we want descriptive titles so we know what the article is about before we read it. That used to be the original purpose of titles, and we like it that way.

It's as if you labeled your food product "you won't believe this" and forced customers to figure out what it is from the ingredients list.

avallach · 3 months ago
Maybe I'm misunderstanding, but after reading it sounds to me not like "io_uring is faster than mmap" but "raid0 with 8 SSDs has more throughput than 3 channel DRAM".
nine_k · 3 months ago
The title has been edited incorrectly. The original page title is "Memory is slow, Disk is fast", and it states exactly what you say: an NVMe RAID can offer more bandwidth than RAM.
kentonv · 3 months ago
No, the title edit is fair, because the original title is misleading.

Obviously, no matter how you read from disk, it has to go through RAM. Disk bandwidth cannot exceed memory bandwidth.*

But what the article actually tests is a program that uses mmap() to read from page cache, vs. a program that uses io_uring to read directly from disk (with O_DIRECT). You'd think the mmap() program would win, because the data in page cache is already in memory, whereas the io_uring program is explicitly skipping cache and pulling from disk.

However, the io_uring program uses 6 threads to pull from disk, which then feed into one thread that sequentially processes the data. Whereas the program using mmap() uses a single thread for everything. And even though the mmap() is pulling from page cache, that single thread still has to get interrupted by page faults as it reads, because the kernel does not proactively map the pages from cache even if they are available (unless, you know, you tell it to, with madvise() etc., but the test did not). So the mmap() test has one thread that has to keep switching between kernel and userspace and, surprise, that is not as fast as a thread which just stays in userspace while 6 other threads feed it data.

To be fair, the article says all this, if you read it. Other than the title being cheeky it's not hiding anything.

* OK, the article does mention that there exists CPUs which can do I/O directly into L3 cache which could theoretically beat memory bandwidth, but this is not actually something that is tested in the article.

juancn · 3 months ago

    Because PCIe bandwidth is higher than memory bandwidth
This doesn't sound right: a PCIe 5.0 x16 slot offers up to 64 GB/s fully saturated, and a fairly old Xeon server can sustain >100 GB/s memory reads per NUMA node without much trouble.

Some newer HBM-enabled parts, like a Xeon Max 9480, can go over 1.6 TB/s from HBM (up to 64 GB of it), and DDR5 can reach >300 GB/s.

Even saturating all PCIe lanes (196 on a dual-socket Xeon 6), you could at most theoretically get ~784GB/s, which is roughly in the same ballpark as the memory bandwidth of such CPUs (12 channels x 8,800 MT/s x 8 bytes ≈ 845GB/s).

I mean, solid state IO is getting really close, but it's not so fast on non-sequential access patterns.

I agree that many workloads could be shifted to SSDs but it's still quite nuanced.

jared_hulbert · 3 months ago
Not by a ton, but if you add up the DDR5 channel bandwidth and the PCIe lanes, on most systems the PCIe bandwidth is higher. Yes, HBM and L3 cache will be higher than the PCIe.
kragen · 3 months ago
This is pretty great. I only learned to use perf_events to see annotated disassembly a few weeks ago, although I don't know how to interpret what I see there yet.

I suspect the slowness identified with mmap() here is somewhat fixable, for example by mapping already-in-RAM pages somewhat more eagerly. So it wouldn't be surprising to me (though see above for how much I'm not an expert) if next year mmap were faster than io_uring again.

jared_hulbert · 3 months ago
The io_uring solution avoids this whole effort of mapping. It doesn't have to map the already-in-RAM pages at all; it reuses a small set of buffers. So there is a lot of random, cache-miss-prone work that mmap() has to do that the io_uring solution avoids. If mmap() did this in the background it would catch up with io_uring, and I'd then have to get a couple more drives for io_uring to pull ahead again. With enough drives I'd bet they'd be closer than you think. I still think I could get the io_uring to be faster than the mmap() even if the count never faulted, mostly because io_uring has a smaller TLB footprint and can fit in L3 cache. But it'd be tough.
kragen · 3 months ago
I agree that io_uring is a fundamentally more efficient approach, but I think the performance limits you're currently measuring with mmap() aren't the fundamental ones imposed by the mmap() API, and I think that's what you're saying too?