The most impressive thing to me is that Linux allows data piped from one program to another to stay entirely in L2 cache and not hit main memory.
To me, that is amazing systems architectural design - so many different parts of a regular Linux kernel all working together to let this fast path happen.
Would such a thing be possible on Mac OSX's Mach ports or Windows's Named Pipes?
If a CPU's caches are physically tagged and both processes' page tables share the same backing physical page, then the CPU will let the cache contents be used by either process, as long as the OS doesn't explicitly flush/invalidate the cache on a context switch.
The major cost of two processes coordinating over shared data, compared with two threads, is TLB invalidation on a context switch.
I assume this is what the author meant by “page table contention” as a bottleneck: Without TLB invalidation there would be no need for either process/thread to ever touch page tables in a scenario like this.
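To make the shared-physical-page setup concrete, here's a minimal sketch (my own illustration, not how the fizzbuzz entry does it - that one hands pages to the pipe with vmsplice) of two processes whose page tables map the same backing page, using Linux's memfd_create(2):

    /* Sketch: parent and child map the same physical page via a memfd,
       so both processes' page tables point at one backing page.
       Linux-specific (memfd_create); error handling omitted. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = memfd_create("shared", 0);          /* anonymous in-memory file */
        ftruncate(fd, 4096);                         /* one page of backing store */
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (fork() == 0) {                           /* child writes into the page */
            strcpy(buf, "hello from the child");
            _exit(0);
        }
        wait(NULL);                                  /* parent reads the same page */
        printf("%s\n", buf);
        return 0;
    }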
Besides living up to the username of "ais523 - high effort answers", the author also puts high effort into the comments, helping someone who couldn't get the program to run. The resolution:
> I suspect that what's happening is that the program was somehow compiled with ASLR turned on. For some reason, the dynamic linker doesn't respect the 4 MiB alignment of the BSS segment in this case, effectively ignoring my .align, and that's what's causing the bugs.
> Rust’s print function locks by default (because of safety), C doesn’t.
Huh? Traditionally, stdio implementations have placed locks around all I/O[1] when introducing threads—thus functions such as fputc_unlocked to claw back at least some of the performance when the stock bulk functions don’t suffice—and the current ISO C standard even requires it (N3096 7.23.2p8):
> All functions that read, write, position, or query the position of a stream lock the stream before accessing it. They release the lock associated with the stream when the access is complete.
The Microsoft C runtime used to have a statically linked non-threaded version with no locks, but it no longer does. (I’ve always assumed that linking -lpthread as required on some Unices was also intended to override some of the -lc code with thread-safe versions, but I’m not sure; in any case this doesn’t play well with dynamic linking, and Glibc doesn’t do it that way.)

[1] e.g. see https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofpu...
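For illustration, here's a minimal sketch (plain POSIX stdio, not taken from any contest entry) of the escape hatch those _unlocked functions provide: take the stream lock once, then use the unlocked variants in the hot loop:

    /* Sketch: take stdout's stream lock once, then use the *_unlocked
       stdio variants so the hot loop doesn't pay for a lock per call.
       flockfile/funlockfile/putc_unlocked are POSIX. */
    #include <stdio.h>

    int main(void) {
        flockfile(stdout);                 /* acquire the stream lock once */
        for (int i = 0; i < 1000000; i++) {
            putc_unlocked('x', stdout);    /* no per-call locking */
            putc_unlocked('\n', stdout);
        }
        funlockfile(stdout);               /* release the lock */
        return 0;
    }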
Neat tricks. Beyond BufWriter (which I'm already using) and multithreading, I'm guessing there's not much to be done to improve my "frece" (a simple CLI frecency-indexed database) tool's performance without making it overly complicated. https://github.com/YodaEmbedding/frece/blob/master/src/main....
C and Python have adaptive buffering for stdout: if the output is a terminal they flush on newlines, otherwise they only flush when their internal buffer is full.
Here's a C program counting, with a 1ms delay between lines. The second column is a duration since the previous read():
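Roughly, the counter looks like this (a sketch using usleep; the second column the author mentions presumably comes from timing the read() calls on the consuming side):

    /* Sketch: print one number per millisecond and let stdio decide when
       to flush. Piped, the lines come out in bursts (when the buffer
       fills or at exit); on a terminal, stdout is line-buffered and they
       come out one per millisecond. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        for (int i = 1; i <= 100; i++) {
            printf("%d\n", i);
            usleep(1000);                  /* 1 ms between lines */
        }
        return 0;
    }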
Piped, you can see they were all written in one go; when allocated a terminal, they come out line by line.

Rust lacks this adaptive behaviour for output, and will always produce the line-by-line result, terminal or not.
Technically it unconditionally wraps stdout in a LineWriter (https://doc.rust-lang.org/std/io/struct.LineWriter.html), which always flushes if it sees a write containing a newline. To maximise throughput you therefore want to batch writes of multiple lines together, for example by wrapping it in a BufWriter.
Assembly is hardly the reason this is fast. It is necessary to this solution but by no means sufficient.
Extreme algorithmic research combined with deep knowledge of Linux syscalls and platform-specific optimizations is what allows this to exist. To quote the author, Alex Smith, himself:
> @chx: I already have a master's thesis. This was harder.
This is in a different universe than what can be produced by simply "do it in assembly".
The second ranked solution (by me :)) shows that a more or less trivial assembly tight loop can get about 70% of the speed. The remaining 30% is... something else.
Even if everything were just written in Java, we’d be better off than the current system of embedding Chrome into a Python instance and then running a webserver in JavaScript to render a document.
at $DAYJOB we're in the (slow) process of replacing a tool written in Java with its successor tool, which is a web app. the Java tool works great and has reasonably snappy performance. the web app is terribly slow (at least 0.75 jiras), frequently hangs, and is often unusable for simple tasks. it's been miserable enough to make me miss the Java Era.
I don't think replacing one bad idea with another bad idea is a good idea ;) I'm also not sure if Java would be less bad in that case. At least JS has years of research that went into making it start fast, not only run fast after warmup.
It would end up like Steve Yegge's tale of Geoworks:
"OK: I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible.
I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could!
So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man...
The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right?
This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times.
Because we couldn't see how the system worked anymore!
Small systems are not only easier to optimize, they're possible to optimize. And I mean globally optimize."

http://steve-yegge.blogspot.com/2008/05/dynamic-languages-st...
Assembly is so far removed from the code written today that it is not even feasible to think about. That being said, imagine how fast everything would be if the companies developing the software actually cared about performance.
For like 99% of websites and software today, if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes (correct caching, optimizing assets, replacing bloated 3rd party libraries with a basic native call that does the same thing, configuring the servers and databases properly, etc.).
EDIT: That being said, I am pretty sure in a few years AI would be able to provide one-click optimizations to a repository that would either apply best-practices or rewrite the original code in performant Assembly.
> "This blog is a static Octopress site, hosted on GitHub Pages. Static sites are supposed to be fast, and GitHub Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?"
followed by
> "I'm not sure what to think about all this. On the one hand, I'm happy that I was able to get a 25x-50x speedup on my site. On the other hand, I associate speedups of that magnitude with porting plain Ruby code to optimized C++, optimized C++ to a GPU, or GPU to quick-and-dirty exploratory ASIC. How is it possible that someone with zero knowledge of web development can get that kind of speedup by watching one presentation and then futzing around for 25 minutes? I was hoping to maybe find 100ms of slack, but it turns out there's not just 100ms, or even 1000ms, but 10000ms of slack in a Octopress setup. According to a study I've seen, going from 1000ms to 3000ms costs you 20% of your readers and 50% of your click-throughs. I haven't seen a study that looks at going from 400ms to 10900ms because the idea that a website would be that slow is so absurd that people don't even look into the possibility. But many websites are that slow!"
> For like 99% of websites and software today, if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes (correct caching, optimizing assets, replacing bloated 3rd party libraries with a basic native call that does the same thing, configuring the servers and databases properly, etc.).
I'd say a big factor to include here is the choice of language. Going from Python/Ruby to Kotlin/Rust/etc could probably yield a speed up of over 10x / over 1000%.
I started in software dev in 1988; the only languages I knew were BASIC and Z80 assembly, so, since everyone knew BASIC was kind of lame, I naturally assumed I'd be a machine code programmer, albeit on slightly more exotic processors. Day one, I got handed a book on C -> mind blown.
It depends. You can write bubblesort in assembly and it will be pretty damn slow.
I can imagine those assembly leaders could push out pretty fast C implementations as well.
This exercise appears to be somewhat flawed, even if entertaining/informative. Instead of evaluating the speed at which complex problems are resolved, it predominantly tests a peripheral issue: the efficiency of extracting memory from one process and transferring it to another. This allows for the illusion that the second process continues to write to a console/file, even though, technically, it does not - executing pv >/dev/null is essentially a no-op, as the write system call returns almost instantly.
vmsplice grants access to a process' buffer/memory to another process - a shared mem equivalent. As the initial competition requirements are likely vague, I'd imagine it's unclear if this is still good wrt the rules.
> As the initial competition requirements are likely vague, I'd imagine it's unclear if this is still good wrt the rules.
You can scroll upward to the original question to see the initial requirements, and check the edit history to verify that they haven't changed since the start of the challenge:
> Write your fizz buzz program. Run it. Pipe the output through <your_program> | pv > /dev/null. The higher the throughput, the better you did.
> The program output must be exactly valid fizzbuzz. No playing tricks such as writing null bytes in between the valid output - null bytes that don't show up in the console but do count towards pv throughput.
And vmsplice(2) indeed produces a stream of bytes in the standard output pipe that pv(1) can splice into /dev/null, or cat(1) can copy into the terminal.
This submission was not the only one that uses vmsplice(2); others have found that it's far from a magic bullet. Once you pass the I/O hurdle, much work remains in generating the pages of output as quickly as possible.
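For flavour, here's a stripped-down sketch of that vmsplice(2) pattern (my own toy, nothing like the actual entry, which still has to generate fresh pages of fizzbuzz between submissions):

    /* Sketch: hand a user-space buffer to the stdout pipe with vmsplice(2).
       Run as:  ./a.out | pv > /dev/null
       Caveat: pages given to vmsplice must not be modified until the
       reader has drained them; this toy never rewrites the buffer. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static char page[1 << 16];             /* 64 KiB of output, reused as-is */

    int main(void) {
        memset(page, 'x', sizeof page - 1);
        page[sizeof page - 1] = '\n';

        for (;;) {
            struct iovec iov = { .iov_base = page, .iov_len = sizeof page };
            ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
            if (n < 0)
                return 1;                  /* e.g. stdout is not a pipe */
        }
    }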
> it predominantly tests a peripheral issue: the efficiency of extracting memory from one process and transferring it to another.
Isn’t this almost always the whole problem? Most code is bottlenecked on memory and I/O. Complex problems are usually held up by the speed of getting data from one place to another, and not very often on computing the data. As someone who spends his days optimizing GPU assembly, even in the rare cases when compute is the bottleneck, once you optimize it, memory becomes the bottleneck.
I disagree, because one does not get to the bottleneck being _the efficiency of extracting memory from one process and transferring it to another_ without major fizzbuzz-specific optimizations first.
For example, there's a clever bit representation to get base-10 carries to happen natively.
The initial competition requirements are not particularly vague about this point: Measuring throughput with `<program> | pv > /dev/null` is prescribed, and it also says
> Architecture specific optimizations / assembly is also allowed. This is not a real contest - I just want to see how people push fizz buzz to its limit - even if it only works in special circumstances/platforms.
I/O is something that literally every program has to do. It's also the bottleneck of 99% of code running on modern hardware. Moving bytes from one place to another is essential and relatively slow.
Understanding how to deal with memory I/O and file I/O performantly is a relevant skill for every program and programmer.
The convention I'm used to uses .S to denote hand-written assembly files (usually tracked in git) vs .s for machine generated assembly which can be overwritten as needed.
Rust -> 23.2MiB/s
Python3 -> 28.6MiB/s
C -> 238MiB/s
Does anyone know why Rust's performance is in the same ballpark as Python3's? I thought it would be much closer to C.
In order to get similar performance to C, you probably need to take care of this lock yourself. (And also you need to make the buffering size for stdout match C's.)
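For reference, on the C side the stdout buffering mode and size are what setvbuf(3) controls; a minimal sketch of forcing a large, fully buffered stdout (the 64 KiB figure is just an example, not glibc's default):

    /* Sketch: make stdout fully buffered with an explicit 64 KiB buffer
       instead of whatever stdio chose (line-buffered on a terminal,
       a smaller block buffer otherwise). */
    #include <stdio.h>

    static char buf[1 << 16];

    int main(void) {
        setvbuf(stdout, buf, _IOFBF, sizeof buf);   /* before any other use of the stream */

        for (int i = 1; i <= 1000000; i++)
            printf("%d\n", i);                      /* flushed when the buffer fills */

        return 0;                                   /* exit flushes the remainder */
    }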
https://ismailmaj.github.io/tinkering-with-fizz-buzz-and-con...
In audio dev it's very common for dsp code to be written in assembly.
No, author's extreme boredom and/or free time allowed this to exist. Nothing else.
"OK: I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible.
I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could!
So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man...
The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right?
This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times.
Because we couldn't see how the system worked anymore!
Small systems are not only easier to optimize, they're possible to optimize. And I mean globally optimize."
http://steve-yegge.blogspot.com/2008/05/dynamic-languages-st...
what in the god damn
Some wonderful systems were written in assembly. Donkey Kong comes to mind.
when it’s painful to progress even a little bit while coding, you try very hard to implement as little as possible
resource constraint can give clarity of focus
What’s the significance of “.S” vs “.s”?

Edit: From the manpage: a .s file is plain assembler code, while a .S file is assembler code that must be run through the C preprocessor first. Not sure if it makes a difference on modern toolchains.
https://news.ycombinator.com/item?id=29031488
https://news.ycombinator.com/item?id=29413656