Nice writeup. I suspect you're measuring the cost of abstraction. Specifically, routines that can handle lots of things (like locale-based strings and UTF-8 characters) have more to do before they can produce results. This was something I ran into head-on at Sun when we did the I18N[1] project.
In my experience there was a direct correlation between the number of different environments where a program would "just work" and its speed. The original UNIX ls(1), which had maximum-sized filenames, no pesky characters allowed, everything representable in 7-bit ASCII, and only the 12 bits of metadata that God intended[2], was really quite fast. Add a VFS that maps the source file system onto the parameters of the "expected" file system, and that adds delay. You're mapping different character sets? Adds delay. Colors for the display? Adds delay. Small costs that add up.
1: The first time I saw a long word like 'internationalization' reduced to first and last letter and the count of letters in between :-).
2: Those being Read, Write, and eXecute for user, group, and other, setuid, setgid, and 'sticky' :-)
How much of the speedup over GNU ls is due to lacking localization features? Your results table is pretty much consistent with my local observations: in a dir with 13k files, `ls -al` needs 33ms. But 25% of that time is spent by libc in `strcoll`. Under `LC_ALL=C` it takes just 27ms, which is getting closer to the time of your program.
I didn't include `busybox` in my initial table, so it isn't in the blog post, but the repo has the data. I am 99% sure busybox does not have locale support, so I think GNU ls without locale support would be closer to busybox.
Locales also bring in a lot more complicated sorting - so that could be a factor also.
I'm curious how lsr compares to bfs -ls for example. bfs only uses io_uring when multiple threads are enabled, but maybe it's worth using it even for bfs -j1
Oh that's cool. `find` is another tool I thought could benefit from io_uring like `ls`. I think it's definitely worth enabling io_uring for single-threaded applications for the batching benefit. The kernel will still spin up a thread pool to get the work done concurrently, but you don't have to manage that in your codebase.
At those time scales, you would be better off using `tim` ( https://github.com/c-blake/bu/blob/main/doc/tim.md ) than hyperfine { and not just because that is your name! Lol. That is just a happy coincidence by clipping one letter off of the word "time". :-) } even though being in Nim might make it more of a challenge.
This is fantastic stuff. I'm doing a C++ project right now that I'm doing with an eye to eventual migration in whole or in part to Zig. My little `libevring` thing is pretty young and I'd be very open to replacing it with `ourio`.
What's your feeling on having C/C++ bindings in the project as a Zig migration path for such projects?
I wonder how it performs against an NFS server with lots of files, especially one over a kinda-crappy connection. Putting an unreliable network service behind blocking POSIX syscalls is one of the main reasons NFS is a terrible design choice (as can be seen by anyone who's tried to ctrl+c any app that's reading from a broken NFS folder), but I wonder if io_uring mitigates the bad parts somewhat.
The designers of NFS chose to make a distributed system emulate a highly consistent and available system (a hard drive), which was (and is) a reasonable tradeoff. It didn't require every existing tool, such as ls, to deal with things like the server rebooting while listing a directory. (The original NFS protocol is stateless, so clients can survive server reboots.) What does vi do when the server hosting the file you're editing stops responding? None of these tools have that kind of error handling.
I don't know how io_uring solves this - does it return an error if the underlying NFS call times out? How long do you wait for a response before giving up and returning an error?
> The designers of NFS chose to make a distributed system emulate a highly consistent and available system (a hard drive), which was (and is) a reasonable tradeoff
I don't agree that it was a reasonable tradeoff. Making an unreliable system emulate a reliable one is the very thing I find to be a bad idea. I don't think this is unique to NFS, it applies to any network filesystem you try to present as if it's a local one.
> What does vi do when the server hosting the file you're editing stops responding? None of these tools have that kind of error handling.
That's exactly why I don't think it's a good idea to just pretend a network connection is actually a local disk. Because tools aren't set up to handle issues with it being down.
Contrast it with approaches where the client is aware of the network connection (like HTTP/gRPC/etc)... the client can decide for itself how long it should retry failed requests, whether it should bubble up failures to the caller, or work "offline" until it gets an opportunity to resync, etc. With NFS the syscall just hangs forever by default.
Distributed systems are hard, and NFS (and other similar network filesystems) just pretend it isn't hard at all, which is great until something goes wrong, and then the abstraction leaks.
(Also I didn't say io_uring solves this, but I'm curious as to whether its performance would be any better than blocking calls.)
> The designers of NFS chose to make a distributed system emulate a highly consistent and available system (a hard drive),
> The original NFS protocol is stateless,
The protocol is, but the underlying disk isn’t.
- A stateless emulation doesn’t know of the concept of “open file”, so “open for exclusive access” isn’t possible, and ways to emulate that were bolted on.
- In a stateless system, you cannot open a scratch file for writing, delete it, and continue using it, in the expectation that it will be deleted when you're done with it (The Unix-Haters Handbook (https://web.mit.edu/~simsong/www/ugh.pdf) says there are hacks inside NFS to make this work, but that makes the protocol stateful)
> It didn't require every existing tool, such as ls, to deal with things like the server rebooting while listing a directory
But see above for an example where every tool that wants to do record locking or get exclusive access to a file has to know whether it's writing to an NFS disk to figure out how to do that.
I run several machines at home with NFS $HOME and I usually don't notice. I'd say with a good network and as long as you're not stress-testing difficult cases like parallel writes to the same data from multiple machines, the average usability of NFS is actually very good.
I did have a hell of a difficult time once with intermittent failures from a poorly seated network cable.
> as can be seen by anyone who's tried to ctrl+c any app that's reading from a broken NFS folder
Theoretically "intr" mounts allowed signals to interrupt operations waiting on a hung remote server, but Linux removed the option long ago[1] (FreeBSD still supports it)[2]. "soft" might be the only workaround on Linux.
Probably a historical preference for portability without a bunch of #ifdefs means platform- and version-specific stuff is very late to get adopted. Though, at this point, the benefit of portability across various posixy platforms is much lower.
Has anyone written an io_uring "polyfill" library with fallback to standard posix-y IO? It could presumably be done via background worker threads - at a perf cost.
io_uring is an asynchronous interface, and using it effectively requires an event-based architecture. But many command-line tools are still written in a straightforward sequential style. If C had async or a similar mechanism for writing asynchronous code sequentially, porting would be easier. Without that, very significant refactoring is necessary.
Besides, io_uring is not yet stable, and who knows, maybe in 10 years it will be replaced by yet another mechanism that takes advantage of even newer hardware. So simply waiting for io_uring to prove it is here to stay is a very viable strategy. Besides, in 10 years we may have tools/AI that will do the rewrite automatically...
> If C had async or a similar mechanism for writing asynchronous code sequentially, porting would be easier.
The *context() family of formerly-POSIX functions (clownishly deprecated as “use pthreads instead”) is essentially a full implementation of stackful coroutines. Even the arguable design botch of them preserving the signal mask (the reason why they aren’t the go-to option even on Linux) is theoretically fixable on the libc level without system calls, it’s just a lot of work and very few can be bothered to do signals well.
As far as stackless coroutines, there’s a wide variety of libraries used in embedded systems and such (see the recent discussion[1] for some links), which are by necessity awkward enough that I don’t see any of them becoming broadly accepted. There were also a number of language extensions, among which I’d single out AC[2] (from the Barrelfish project) and CPC[3]. I’d love for, say, CPC to catch on, but it’s been over a decade now.
iirc io_uring also had some pretty significant security issues early on (a couple of years ago). Those should be fixed by now, but that probably dampened adoption as well.
Not years ago. io_uring has been a continuous parade of security problems, including a high-severity one that wasn't fixed until a few months ago. Many large organizations have patched it out of their kernels for safety reasons, which is one of the reasons it suffers from poor adoption.
Last I checked it's blocked by most container runtimes exactly because of the security problems, and Google blocked io_uring across all their services. I've not checked recently if that's still the case, but https://security.googleblog.com/2023/06/learnings-from-kctf-... has some background.
> I'm trying to understand why all command line tools don't use io_uring.
Because it's fairly new. The coreutils package which contains the ls command (and the three earlier packages which were merged to create it) is decades old; io_uring appeared much later. It will take time for the "shared ring buffer" style of system call to win over traditional synchronous system calls.
Really interesting, and the difference is real. I would just hope that better coloring support could be added, because I have `eza --icons=always -1` set as my ls and it looks really good, whereas when I use `lsr -1`, yes, the fundamental thing is the same; the difference is in the coloring.
Yes, lsr also colors the output, but it doesn't know as many things as eza does. For example, a `.opus` file will show up as a music icon with the right color (green-ish in my case?) in eza, whereas it shows up as any normal file in lsr.
Really no regrets though; it's quite easy to patch, I think, and this is rock solid and really fast, I must admit.
Can you please create more things like this for cat and other system utilities too?
Also love that it's using tangled.sh, which is built on atproto; kinda interesting too.
I also like that it's written in Zig, which imo feels much easier for me to touch as a novice than Rust (sry rustaceans).
This seems more interesting as demonstration of the amortized performance increase you'd expect from using io_uring, or as a tutorial for using it. I don't understand why I'd switch from using something like eza. If I'm listing 10,000 files the difference is between 40ms and 20ms. I absolutely would not notice that for a single invocation of the command.
Yeah, I wrote this as a fun little experiment to learn more io_uring usage. The practical savings of using this are tiny, maybe 5 seconds over your entire life. That wasn't the point haha
It's a very cool experiment. Just wanted to perhaps steer the conversation towards those things rather than whether or not this was a good ls replacement, because, like you say, that feels like missing the point.
I vaguely remember some benchmark I read a while back for some other io_uring project, and it suggested that io_uring syscalls are more expensive than whatever the other syscalls were that it was being used to replace. It's still a big improvement, even if not as big as you'd hope.
I wish I could remember the post, but I've had that impression in the back of my mind ever since.
The only VDSO-capable calls are clock_gettime, getcpu, getrandom, gettimeofday, and time. (Other architectures have some more, mostly related to signals and CPU cache flushing.)
io_uring doesn't support getdents, though, so the primary benefit is bulk statting (`ls -l`).
It'd be nice if we could have a getdents in flight while processing the results of the previous one.
Naw, that's Dennis Ritchie. You're thinking of the other white-bearded guy that hangs out in heaven.
http://www.i18nguy.com/origini18n.html
[1]: https://man7.org/linux/man-pages/man5/nfs.5.html
[2]: https://man.freebsd.org/cgi/man.cgi?query=mount_nfs&sektion=...
I'm trying to understand why all command line tools don't use io_uring.
As an example, all my NVMe drives on USB 3.2 Gen 2 only reach 740 MB/s peak.
If I use tools with AIO or io_uring I get 1005 MB/s.
I know I may not be copying many files simultaneously every time, but the queue-depth strategies and the fewer locks also help, I guess.
[1] https://news.ycombinator.com/item?id=44546640
[2] https://users.soe.ucsc.edu/~abadi/Papers/acasync.pdf
[3] https://www.irif.fr/~jch/research/cpc-2012.pdf
https://github.com/sharkdp/bat
bat did 445 syscalls; cat did 48.
Sure, bat beautifies some things a lot, but still, I just wanted to mention this. I want something that can use io_uring for cat too, I think.
Like, what's the least number of syscalls that you can use for something like cat?
Most of the coreutils are not fast enough to actually utilize modern SSDs.