The most impressive thing to me is that Linux allows data piped from one program to another to stay entirely in L2 cache and not hit main memory.
To me, that is amazing systems architectural design - so many different parts of a regular Linux kernel all working together to let this fast path happen.
Would such a thing be possible on Mac OSX's Mach ports or Windows's Named Pipes?
If a CPU's caches are physically tagged and both processes' page tables share the same backing physical page, then the CPU will let the cache contents be used by either process, as long as the OS doesn't explicitly flush/invalidate the cache on a context switch.
The major cost of two processes coordinating over shared data, compared with two threads, is TLB invalidation on a context switch.
I assume this is what the author meant by “page table contention” as a bottleneck: Without TLB invalidation there would be no need for either process/thread to ever touch page tables in a scenario like this.
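To make the shared-physical-page setup concrete, here's a minimal sketch (my own illustration, not how the fizzbuzz entry does it - that one hands pages to the pipe with vmsplice) of two processes whose page tables map the same backing page, using Linux's memfd_create(2):

    /* Sketch: parent and child map the same physical page via a memfd,
       so both processes' page tables point at one backing page.
       Linux-specific (memfd_create); error handling omitted. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = memfd_create("shared", 0);          /* anonymous in-memory file */
        ftruncate(fd, 4096);                         /* one page of backing store */
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (fork() == 0) {                           /* child writes into the page */
            strcpy(buf, "hello from the child");
            _exit(0);
        }
        wait(NULL);                                  /* parent reads the same page */
        printf("%s\n", buf);
        return 0;
    }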
Besides living up to the username of "ais523 - high effort answers", the author also puts high effort into the comments, helping someone who couldn't get the program to run. The resolution:
> I suspect that what's happening is that the program was somehow compiled with ASLR turned on. For some reason, the dynamic linker doesn't respect the 4 MiB alignment of the BSS segment in this case, effectively ignoring my .align, and that's what's causing the bugs.
> Rust’s print function locks by default (because of safety), C doesn’t.
Huh? Traditionally, stdio implementations have placed locks around all I/O[1] when introducing threads—thus functions such as fputc_unlocked to claw back at least some of the performance when the stock bulk functions don’t suffice—and the current ISO C standard even requires it (N3096 7.23.2p8):
> All functions that read, write, position, or query the position of a stream lock the stream before accessing it. They release the lock associated with the stream when the access is complete.
The Microsoft C runtime used to have a statically linked non-threaded version with no locks, but it no longer does. (I’ve always assumed that linking -lpthread as required on some Unices was also intended to override some of the -lc code with thread-safe versions, but I’m not sure; in any case this doesn’t play well with dynamic linking, and Glibc doesn’t do it that way.)

[1] e.g. see https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofpu...
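For illustration, here's a minimal sketch (plain POSIX stdio, not taken from any contest entry) of the escape hatch those _unlocked functions provide: take the stream lock once, then use the unlocked variants in the hot loop:

    /* Sketch: take stdout's stream lock once, then use the *_unlocked
       stdio variants so the hot loop doesn't pay for a lock per call.
       flockfile/funlockfile/putc_unlocked are POSIX. */
    #include <stdio.h>

    int main(void) {
        flockfile(stdout);                 /* acquire the stream lock once */
        for (int i = 0; i < 1000000; i++) {
            putc_unlocked('x', stdout);    /* no per-call locking */
            putc_unlocked('\n', stdout);
        }
        funlockfile(stdout);               /* release the lock */
        return 0;
    }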
Neat tricks. Beyond BufWriter (which I'm already using) and multithreading, I'm guessing there's not much to be done to improve my "frece" (a simple CLI frecency-indexed database) tool's performance without making it overly complicated. https://github.com/YodaEmbedding/frece/blob/master/src/main....
C and Python have adaptive buffering for stdout: if the output is a terminal they flush on newlines, otherwise they only flush when their internal buffer is full.
Here's a C program counting, with a 1ms delay between lines. The second column is a duration since the previous read():
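Roughly, the counter looks like this (a sketch using usleep; the second column the author mentions presumably comes from timing the read() calls on the consuming side):

    /* Sketch: print one number per millisecond and let stdio decide when
       to flush. Piped, the lines come out in bursts (when the buffer
       fills or at exit); on a terminal, stdout is line-buffered and they
       come out one per millisecond. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        for (int i = 1; i <= 100; i++) {
            printf("%d\n", i);
            usleep(1000);                  /* 1 ms between lines */
        }
        return 0;
    }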
Piped, you can see they were all written in one go; when allocated a terminal, they come out line by line.

Rust lacks this adaptive behaviour for output, and will always produce the line-by-line result, terminal or not.
Technically it unconditionally wraps stdout in a LineWriter (https://doc.rust-lang.org/std/io/struct.LineWriter.html), which always flushes if it sees a write containing a newline. To maximise throughput you therefore want to batch writes of multiple lines together, for example by wrapping it in a BufWriter.
Assembly is hardly the reason this is fast. It is necessary to this solution but by no means sufficient.
Extreme algorithmic research combined with deep knowledge of Linux syscalls and platform-specific optimizations is what allows this to exist. To quote the author, Alex Smith, himself:
> @chx: I already have a master's thesis. This was harder.
This is in a different universe than what can be produced by simply "do it in assembly".
The second ranked solution (by me :)) shows that a more or less trivial assembly tight loop can get about 70% of the speed. The remaining 30% is... something else.
Even if everything were just written in Java, we’d be better off than the current system of embedding Chrome into a Python instance and then running a webserver in JavaScript to render a document.
at $DAYJOB we're in the (slow) process of replacing a tool written in Java with its successor tool, which is a web app. the Java tool works great and has reasonably snappy performance. the web app is terribly slow (at least 0.75 jiras), frequently hangs, and is often unusable for simple tasks. it's been miserable enough to make me miss the Java Era.
I don't think replacing one bad idea with another bad idea is a good idea ;) I'm also not sure if Java would be less bad in that case. At least JS has years of research that went into making it start fast, not only run fast after warmup.
It would end up like Steve Yegge's tale of Geoworks:
"OK: I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible.
I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could!
So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man...
The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right?
This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times.
Because we couldn't see how the system worked anymore!
Small systems are not only easier to optimize, they're possible to optimize. And I mean globally optimize."

http://steve-yegge.blogspot.com/2008/05/dynamic-languages-st...
Assembly is so far removed from the code written today that it is not even feasible to think about. That being said, imagine how fast everything would be if the companies developing the software actually cared about performance.
For like 99% of websites and software today, if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes (correct caching, optimizing assets, replacing bloated 3rd party libraries with a basic native call that does the same thing, configuring the servers and databases properly, etc.).
EDIT: That being said, I am pretty sure in a few years AI would be able to provide one-click optimizations to a repository that would either apply best-practices or rewrite the original code in performant Assembly.
> "This blog is a static Octopress site, hosted on GitHub Pages. Static sites are supposed to be fast, and GitHub Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?"
followed by
> "I'm not sure what to think about all this. On the one hand, I'm happy that I was able to get a 25x-50x speedup on my site. On the other hand, I associate speedups of that magnitude with porting plain Ruby code to optimized C++, optimized C++ to a GPU, or GPU to quick-and-dirty exploratory ASIC. How is it possible that someone with zero knowledge of web development can get that kind of speedup by watching one presentation and then futzing around for 25 minutes? I was hoping to maybe find 100ms of slack, but it turns out there's not just 100ms, or even 1000ms, but 10000ms of slack in a Octopress setup. According to a study I've seen, going from 1000ms to 3000ms costs you 20% of your readers and 50% of your click-throughs. I haven't seen a study that looks at going from 400ms to 10900ms because the idea that a website would be that slow is so absurd that people don't even look into the possibility. But many websites are that slow!"
> For like 99% of websites and software today, if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes (correct caching, optimizing assets, replacing bloated 3rd party libraries with a basic native call that does the same thing, configuring the servers and databases properly, etc.).
I'd say a big factor to include here is the choice of language. Going from Python/Ruby to Kotlin/Rust/etc could probably yield a speed up of over 10x / over 1000%.
I started in software dev in 1988; the only languages I knew were BASIC and Z80 assembly, so, since everyone knew BASIC was kind of lame, I naturally assumed I'd be a machine code programmer, albeit on slightly more exotic processors. Day one, I got handed a book on C -> mind blown.
It depends. You can write bubblesort in assembly and it will be pretty damn slow.
I can imagine those assembly leaders could push out pretty fast C implementations as well.
This exercise appears to be somewhat flawed, even if entertaining/informative. Instead of evaluating the speed at which complex problems are resolved, it predominantly tests a peripheral issue: the efficiency of extracting memory from one process and transferring it to another. This allows for the illusion that the second process continues to write to a console/file, even though, technically, it does not - executing pv >/dev/null is essentially a no-op, as the write system call returns almost instantly.
vmsplice grants access to a process' buffer/memory to another process - a shared mem equivalent. As the initial competition requirements are likely vague, I'd imagine it's unclear if this is still good wrt the rules.
> As the initial competition requirements are likely vague, I'd imagine it's unclear if this is still good wrt the rules.
You can scroll upward to the original question to see the initial requirements, and check the edit history to verify that they haven't changed since the start of the challenge:
> Write your fizz buzz program. Run it. Pipe the output through <your_program> | pv > /dev/null. The higher the throughput, the better you did.
> The program output must be exactly valid fizzbuzz. No playing tricks such as writing null bytes in between the valid output - null bytes that don't show up in the console but do count towards pv throughput.
And vmsplice(2) indeed produces a stream of bytes in the standard output pipe that pv(1) can splice into /dev/null, or cat(1) can copy into the terminal.
This submission was not the only one that uses vmsplice(2); others have found that it's far from a magic bullet. Once you pass the I/O hurdle, much work remains in generating the pages of output as quickly as possible.
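For flavour, here's a stripped-down sketch of that vmsplice(2) pattern (my own toy, nothing like the actual entry, which still has to generate fresh pages of fizzbuzz between submissions):

    /* Sketch: hand a user-space buffer to the stdout pipe with vmsplice(2).
       Run as:  ./a.out | pv > /dev/null
       Caveat: pages given to vmsplice must not be modified until the
       reader has drained them; this toy never rewrites the buffer. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static char page[1 << 16];             /* 64 KiB of output, reused as-is */

    int main(void) {
        memset(page, 'x', sizeof page - 1);
        page[sizeof page - 1] = '\n';

        for (;;) {
            struct iovec iov = { .iov_base = page, .iov_len = sizeof page };
            ssize_t n = vmsplice(STDOUT_FILENO, &iov, 1, 0);
            if (n < 0)
                return 1;                  /* e.g. stdout is not a pipe */
        }
    }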
> it predominantly tests a peripheral issue: the efficiency of extracting memory from one process and transferring it to another.
Isn’t this almost always the whole problem? Most code is bottlenecked on memory and I/O. Complex problems are usually held up by the speed of getting data from one place to another, and not very often on computing the data. As someone who spends his days optimizing GPU assembly, even in the rare cases when compute is the bottleneck, once you optimize it, memory becomes the bottleneck.
I disagree, because one does not get to the bottleneck being _the efficiency of extracting memory from one process and transferring it to another_ without major fizzbuzz-specific optimizations first.
For example, there's a clever bit representation to get base-10 carries to happen natively.
The initial competition requirements are not particularly vague about this point: Measuring throughput with `<program> | pv > /dev/null` is prescribed, and it also says
> Architecture specific optimizations / assembly is also allowed. This is not a real contest - I just want to see how people push fizz buzz to its limit - even if it only works in special circumstances/platforms.
I/O is something that literally every program has to do. It's also the bottleneck of 99% of code running on modern hardware. Moving bytes from one place to another is essential and relatively slow.
Understanding how to deal with memory I/O and file I/O performantly is a relevant skill for every program and programmer.
The convention I'm used to uses .S to denote hand-written assembly files (usually tracked in git) vs .s for machine generated assembly which can be overwritten as needed.
Rust -> 23.2MiB/s
Python3 -> 28.6MiB/s
C -> 238MiB/s
Does anyone know why Rust's performance is in the same ballpark as Python3's? I thought it would be much closer to C.
In order to get similar performance to C, you probably need to take care of this lock yourself. (And also you need to make the buffering size for stdout match C's.)
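For reference, on the C side the stdout buffering mode and size are what setvbuf(3) controls; a minimal sketch of forcing a large, fully buffered stdout (the 64 KiB figure is just an example, not glibc's default):

    /* Sketch: make stdout fully buffered with an explicit 64 KiB buffer
       instead of whatever stdio chose (line-buffered on a terminal,
       a smaller block buffer otherwise). */
    #include <stdio.h>

    static char buf[1 << 16];

    int main(void) {
        setvbuf(stdout, buf, _IOFBF, sizeof buf);   /* before any other use of the stream */

        for (int i = 1; i <= 1000000; i++)
            printf("%d\n", i);                      /* flushed when the buffer fills */

        return 0;                                   /* exit flushes the remainder */
    }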
https://ismailmaj.github.io/tinkering-with-fizz-buzz-and-con...
In audio dev it's very common for dsp code to be written in assembly.
No, author's extreme boredom and/or free time allowed this to exist. Nothing else.
"OK: I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible.
I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could!
So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man...
The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right?
This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times.
Because we couldn't see how the system worked anymore!
Small systems are not only easier to optimize, they're possible to optimize. And I mean globally optimize."
http://steve-yegge.blogspot.com/2008/05/dynamic-languages-st...
what in the god damn
Some wonderful systems were written in assembly. Donkey Kong comes to mind.
when it’s painful to progress even a little bit while coding, you try very hard to implement as little as possible
resource constraint can give clarity of focus
What’s the significance of “.S” vs “.s”?

Edit: From the manpage: a .s file is plain assembler code, while a .S file is assembler code that must be run through the C preprocessor first. Not sure if it makes a difference on modern toolchains.
https://news.ycombinator.com/item?id=29031488
https://news.ycombinator.com/item?id=29413656