londons_explore · 3 years ago
The most impressive thing to me is that Linux allows data piped from one program to another to stay entirely in L2 cache and not hit main memory.

To me, that is amazing systems architectural design - so many different parts of a regular Linux kernel all working together to let this fast path happen.

Would such a thing be possible on Mac OSX's Mach ports or Windows's Named Pipes?

zeusk · 3 years ago
If a CPU's caches are physically tagged and both processes' page tables map the same backing physical page, then the CPU will let the cached contents be used by either process, as long as the OS doesn't explicitly flush/invalidate the cache on a context switch.
twoodfin · 3 years ago
The major cost of two processes coordinating over shared data, compared with two threads doing so, is TLB invalidation on a context switch.

I assume this is what the author meant by “page table contention” as a bottleneck: Without TLB invalidation there would be no need for either process/thread to ever touch page tables in a scenario like this.

SkipperCat · 3 years ago
This is one of the many reasons that the entire HFT world runs on Linux.
idlewords · 3 years ago

  if(((i%3)||(i%5))== 0)
     buy_0_day_to_expiry_options()

smabie · 3 years ago
I've heard of some windows HFT systems, tho I'm sure that's suboptimal.
amatic · 3 years ago
Do they also use assembler in HFT?

Noumenon72 · 3 years ago
Besides living up to the username of "ais523 - high effort answers", the author also puts high effort into the comments, helping someone who can't get the program to run. The resolution:

> I suspect that what's happening is that the program was somehow compiled with ASLR turned on. For some reason, the dynamic linker doesn't respect the 4 MiB alignment of the BSS segment in this case, effectively ignoring my .align, and that's what's causing the bugs.

thumbuddy · 3 years ago
All I know is if this guy applies to the same job as me I won't stand a chance. The ultimate leet coder
rvba · 3 years ago
Are you sure they didn't spend a lot of time getting the answer, then post the question with a sock puppet account?
miguelmurca · 3 years ago
Every time this is reposted I giggle at this comment:

> @chx: I already have a master's thesis. This was harder. – ais523 - high effort answers Oct 29, 2021 at 1:17

qazpot · 3 years ago
I wrote simple implementations (simple if/else/while and printing to stdout) with no optimizations in Rust, Python3 and C

Rust -> 23.2MiB/s

Python3 -> 28.6MiB/s

C -> 238MiB/s

Does anyone know why Rust's performance is in the same ballpark as Python3's?

I thought it would be much closer to C.
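
(For context, a minimal sketch of the kind of naive per-line approach described, not the commenter's actual code, which isn't shown; every println! goes through Rust's locked, line-buffered stdout:)

    fn main() {
        for i in 1u64..=1_000_000 {
            if i % 15 == 0 {
                println!("FizzBuzz");
            } else if i % 3 == 0 {
                println!("Fizz");
            } else if i % 5 == 0 {
                println!("Buzz");
            } else {
                println!("{}", i);
            }
        }
    }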

cyber_kinetist · 3 years ago
Rust’s print function locks by default (because of safety), C doesn’t. For more info see the Rust documentation: https://doc.rust-lang.org/std/macro.print.html

In order to get similar performance as C, you probably need to take care of this lock yourself:

    use std::io::{stdout, Write};

    let mut lock = stdout().lock();
    write!(lock, "hello world").unwrap();
(And also you need to make the buffering size for stdout match C’s.)
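
A minimal sketch of both fixes together, locking once and buffering explicitly (the 64 KiB capacity here is an arbitrary choice, not necessarily what C uses):

    use std::io::{stdout, BufWriter, Write};

    fn main() -> std::io::Result<()> {
        // Lock stdout once and wrap it in a large buffer, so each
        // writeln! is a cheap memcpy rather than a locked write syscall.
        let mut out = BufWriter::with_capacity(64 * 1024, stdout().lock());
        for i in 1u64..=1_000_000 {
            match (i % 3, i % 5) {
                (0, 0) => writeln!(out, "FizzBuzz")?,
                (0, _) => writeln!(out, "Fizz")?,
                (_, 0) => writeln!(out, "Buzz")?,
                _ => writeln!(out, "{}", i)?,
            }
        }
        out.flush()
    }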

mananaysiempre · 3 years ago
> Rust’s print function locks by default (because of safety), C doesn’t.

Huh? Traditionally, stdio implementations have placed locks around all I/O[1] when introducing threads—thus functions such as fputc_unlocked to claw back at least some of the performance when the stock bulk functions don’t suffice—and the current ISO C standard even requires it (N3096 7.23.2p8):

> All functions that read, write, position, or query the position of a stream lock the stream before accessing it. They release the lock associated with the stream when the access is complete.

The Microsoft C runtime used to have a statically linked non-threaded version with no locks, but it no longer does. (I’ve always assumed that linking -lpthread as required on some Unices was also intended to override some of the -lc code with thread-safe versions, but I’m not sure; in any case this doesn’t play well with dynamic linking, and Glibc doesn’t do it that way.)

[1] e.g. see https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofpu...

inferiorhuman · 3 years ago
Take a look at the actual implementation on Stack Exchange; the slower impl is already doing the locking itself.
ismailmaj · 3 years ago
I wrote this a long time ago, you might find it useful.

https://ismailmaj.github.io/tinkering-with-fizz-buzz-and-con...

sicariusnoctis · 3 years ago
Neat tricks. Beyond BufWriter (which I'm already using) and multithreading, I'm guessing there's not much to be done to improve my "frece" (a simple CLI frecency-indexed database) tool's performance without making it overly complicated. https://github.com/YodaEmbedding/frece/blob/master/src/main....
qazpot · 3 years ago
Thanks for writing this, it led me down a rabbit hole.
Freaky · 3 years ago
C and Python have adaptive buffering for stdout: if the output is a terminal they flush on newlines, otherwise they only flush when their internal buffer is full.

Here's a C program counting, with a 1ms delay between lines. The second column is a duration since the previous read():

   $ ./out | rtss
   4.7ms    4.7ms | 1
   4.7ms          | 2
   4.7ms          | 3
   4.7ms          | 4
   4.8ms    exit status: 0
You can see they were all written in one go. When allocated a terminal, they come out line by line:

   $ rtss --pty ./out
   0.8ms    0.8ms | 1
   1.9ms    1.1ms | 2
   3.0ms    1.1ms | 3
   4.1ms    1.1ms | 4
   4.3ms    exit status: 0
Rust lacks this adaptive behaviour for output, and will always produce the second result, terminal or not.

Technically it unconditionally wraps stdout in a LineWriter (https://doc.rust-lang.org/std/io/struct.LineWriter.html), which always flushes if it sees a write containing a newline. To maximise throughput you therefore want to batch writes of multiple lines together, for example by wrapping it in a BufWriter.
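
There's no built-in switch for this, but you can emulate the adaptive behaviour yourself; a rough sketch, assuming Rust 1.70+ for std::io::IsTerminal (the 64 KiB buffer size is arbitrary):

    use std::io::{stdout, BufWriter, IsTerminal, LineWriter, Write};

    fn main() -> std::io::Result<()> {
        let term = stdout().is_terminal();
        let raw = stdout().lock();
        // Mimic C's stdio: line-buffered on a terminal, block-buffered
        // on a pipe or file.
        let mut out: Box<dyn Write> = if term {
            Box::new(LineWriter::new(raw))
        } else {
            Box::new(BufWriter::with_capacity(64 * 1024, raw))
        };
        for i in 1..=10u32 {
            writeln!(out, "{}", i)?;
        }
        out.flush()
    }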

mkj · 3 years ago
You should compile Rust with --release and C with -O3
SpaghettiCthulu · 3 years ago
That wouldn't be a fair comparison. Rust has an opt-level option for each build profile. It defaults to 2 for the release profile.
dontlaugh · 3 years ago
Almost certainly the limitation is due to printing, likely buffering or locking.
shakow · 3 years ago
Can we see your code?
pier25 · 3 years ago
Makes you wonder... How fast would everything be if it was written in assembly.

In audio dev it's very common for DSP code to be written in assembly.

nickelpro · 3 years ago
Assembly is hardly the reason this is fast. It is necessary to this solution but by no means sufficient.

Extreme algorithmic research combined with a high LOK of Linux syscalls and platform specific optimizations is what allows this to exist. To quote the author, Alex Smith, himself:

> @chx: I already have a master's thesis. This was harder.

This is in a different universe than what can be produced by simply "do it in assembly".

bonzini · 3 years ago
The second ranked solution (by me :)) shows that a more or less trivial assembly tight loop can get about 70% of the speed. The remaining 30% is... something else.
moreati · 3 years ago
What's LOK in this context? Search and Wikipedia didn't throw up anything. Thanks.
wheelerof4te · 3 years ago
"Extreme algorithmic research combined with a high LOK of Linux syscalls and platform specific optimizations is what allows this to exist."

No, author's extreme boredom and/or free time allowed this to exist. Nothing else.

MagicMoonlight · 3 years ago
Even if everything was just written in Java we’d be better off than the current system of embedding Chrome into a Python instance and then running a webserver in JavaScript to render a document.
adamrezich · 3 years ago
at $DAYJOB we're in the (slow) process of replacing a tool written in Java with its successor tool, which is a web app. the Java tool works great and has reasonably snappy performance. the web app is terribly slow (at least 0.75 jiras), frequently hangs, and is often unusable for simple tasks. it's been miserable enough to make me miss the Java Era.
pkolaczk · 3 years ago
I don't think replacing one bad idea with another bad idea is a good idea ;) I'm also not sure if Java would be less bad in that case. At least JS has years of research that went into making it start fast, not only run fast after warmup.
hedora · 3 years ago
If that were true, then eclipse would be faster than VS Code, right?
kryptiskt · 3 years ago
It would end up like Steve Yegge's tale of Geoworks:

"OK: I went to the University of Washington and [then] I got hired by this company called Geoworks, doing assembly-language programming, and I did it for five years. To us, the Geoworkers, we wrote a whole operating system, the libraries, drivers, apps, you know: a desktop operating system in assembly. 8086 assembly! It wasn't even good assembly! We had four registers! [Plus the] si [register] if you counted, you know, if you counted 386, right? It was horrible.

I mean, actually we kind of liked it. It was Object-Oriented Assembly. It's amazing what you can talk yourself into liking, which is the real irony of all this. And to us, C++ was the ultimate in Roman decadence. I mean, it was equivalent to going and vomiting so you could eat more. They had IF! We had jump CX zero! Right? They had "Objects". Well we did too, but I mean they had syntax for it, right? I mean it was all just such weeniness. And we knew that we could outperform any compiler out there because at the time, we could!

So what happened? Well, they went bankrupt. Why? Now I'm probably disagreeing – I know for a fact that I'm disagreeing with every Geoworker out there. I'm the only one that holds this belief. But it's because we wrote fifteen million lines of 8086 assembly language. We had really good tools, world class tools: trust me, you need 'em. But at some point, man...

The problem is, picture an ant walking across your garage floor, trying to make a straight line of it. It ain't gonna make a straight line. And you know this because you have perspective. You can see the ant walking around, going hee hee hee, look at him locally optimize for that rock, and now he's going off this way, right?

This is what we were, when we were writing this giant assembly-language system. Because what happened was, Microsoft eventually released a platform for mobile devices that was much faster than ours. OK? And I started going in with my debugger, going, what? What is up with this? This rendering is just really slow, it's like sluggish, you know. And I went in and found out that some title bar was getting rendered 140 times every time you refreshed the screen. It wasn't just the title bar. Everything was getting called multiple times.

Because we couldn't see how the system worked anymore!

Small systems are not only easier to optimize, they're possible to optimize. And I mean globally optimize."

http://steve-yegge.blogspot.com/2008/05/dynamic-languages-st...

_dain_ · 3 years ago
>Object-Oriented Assembly

what in the god damn

sillysaurusx · 3 years ago
Interestingly, AI might change this. Not that it would be a good idea to write everything in assembly, but at least it would be possible now.

Some wonderful systems were written in assembly. Donkey Kong comes to mind.

XCSme · 3 years ago
Assembly is so far from the original code written today that it is not even feasible to think about. That being said, imagine how fast everything would be if the companies developing the software would actually care about performance.

For like 99% of websites and software today, if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes (correct caching, optimizing assets, replacing bloated 3rd party libraries with a basic native call that does the same thing, configuring the servers and databases properly, etc.).

EDIT: That being said, I am pretty sure in a few years AI would be able to provide one-click optimizations to a repository that would either apply best-practices or rewrite the original code in performant Assembly.

jodrellblank · 3 years ago
> "if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes"

See https://danluu.com/octopress-speedup/

> "This blog is a static Octopress site, hosted on GitHub Pages. Static sites are supposed to be fast, and GitHub Pages uses Fastly, which is supposed to be fast, so everything should be fast, right?"

followed by

> "I'm not sure what to think about all this. On the one hand, I'm happy that I was able to get a 25x-50x speedup on my site. On the other hand, I associate speedups of that magnitude with porting plain Ruby code to optimized C++, optimized C++ to a GPU, or GPU to quick-and-dirty exploratory ASIC. How is it possible that someone with zero knowledge of web development can get that kind of speedup by watching one presentation and then futzing around for 25 minutes? I was hoping to maybe find 100ms of slack, but it turns out there's not just 100ms, or even 1000ms, but 10000ms of slack in a Octopress setup. According to a study I've seen, going from 1000ms to 3000ms costs you 20% of your readers and 50% of your click-throughs. I haven't seen a study that looks at going from 400ms to 10900ms because the idea that a website would be that slow is so absurd that people don't even look into the possibility. But many websites are that slow!"

winter_blue · 3 years ago
> For like 99% of websites and software today, if anyone cared about the app performance, I am pretty sure most would be able to achieve at least 50% speed-ups through very basic changes (correct caching, optimizing assets, replacing bloated 3rd party libraries with a basic native call that does the same thing, configuring the servers and databases properly, etc.).

I'd say a big factor to include here is the choice of language. Going from Python/Ruby to Kotlin/Rust/etc could probably yield a speed up of over 10x / over 1000%.

actionfromafar · 3 years ago
I think an oft-overlooked detail of why more primitive languages yield faster programs is this:

when it’s painful to progress even a little bit while coding, you try very hard to implement as little as possible

resource constraint can give clarity of focus

zokier · 3 years ago
Shortest/smallest code is rarely the fastest. Loop unrolling and inlining make code larger and occasionally faster, as a trivial example.
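
A toy illustration of that trade-off (hypothetical code, not from the thread): the unrolled version is several times more source and may or may not actually be faster, depending on what the compiler already does:

    // Straightforward loop: smallest source, one add per element.
    fn sum_simple(xs: &[u64]) -> u64 {
        xs.iter().sum()
    }

    // Manually unrolled by four: more code, but fewer loop-control
    // branches and more independent accumulators per iteration.
    fn sum_unrolled(xs: &[u64]) -> u64 {
        let mut acc = [0u64; 4];
        let chunks = xs.chunks_exact(4);
        let rem = chunks.remainder();
        for c in chunks {
            acc[0] += c[0];
            acc[1] += c[1];
            acc[2] += c[2];
            acc[3] += c[3];
        }
        acc.iter().sum::<u64>() + rem.iter().sum::<u64>()
    }

    fn main() {
        let data: Vec<u64> = (1..=1000).collect();
        assert_eq!(sum_simple(&data), sum_unrolled(&data));
        println!("{}", sum_unrolled(&data));
    }
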
tragomaskhalos · 3 years ago
I started in software dev in 1988; the only languages I knew were BASIC and Z80 assembly so, since everyone knew BASIC was kind of lame I naturally assumed I'd be a machine code programmer, albeit on slightly more exotic processors. Day one got handed a book on C -> mind blown.
geek_at · 3 years ago
Or if everything had been written by "ais523 - high effort answers"
xxs · 3 years ago
You don't need Assembly, you need the code to not be terrible, which is generally hard with the amount of code/developers needed.
jojobas · 3 years ago
It depends. You can write bubblesort in assembly and it will be pretty damn slow. I can imagine those assembly leaders could push out pretty fast C implementations as well.
dumdumchan · 3 years ago
It wouldn't run at all since it wouldn't exist.
jagrsw · 3 years ago
This exercise appears to be somewhat flawed, even if entertaining/informative. Instead of evaluating the speed at which complex problems are resolved, it predominantly tests a peripheral issue: the efficiency of extracting memory from one process and transferring it to another. This allows for the illusion that the second process continues to write to a console/file, even though, technically, it does not - executing pv >/dev/null is essentially a no-op, as the write system call returns almost instantly.

vmsplice grants access to a process' buffer/memory to another process - a shared mem equivalent. As the initial competition requirements are likely vague, I'd imagine it's unclear if this is still good wrt the rules.
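
(For the curious, a minimal sketch of what a vmsplice(2) call looks like from Rust via the libc crate; it assumes stdout is actually a pipe, does a single short write, and skips the buffer-reuse care a real submission needs:)

    use std::io;

    /// Splice a buffer into the pipe on stdout with vmsplice(2).
    /// Assumes fd 1 is a pipe; returns the number of bytes spliced.
    fn vmsplice_stdout(buf: &[u8]) -> io::Result<usize> {
        let iov = libc::iovec {
            iov_base: buf.as_ptr() as *mut libc::c_void,
            iov_len: buf.len(),
        };
        // Safety: iov points at valid memory for the duration of the call.
        let n = unsafe { libc::vmsplice(1, &iov, 1, 0) };
        if n < 0 {
            Err(io::Error::last_os_error())
        } else {
            Ok(n as usize)
        }
    }

    fn main() -> io::Result<()> {
        // The buffer must not be reused until the reader has consumed it,
        // which is exactly what makes vmsplice tricky in practice.
        let page = b"1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\nBuzz\n".repeat(64);
        vmsplice_stdout(&page)?;
        Ok(())
    }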

LegionMammal978 · 3 years ago
> As the initial competition requirements are likely vague, I'd imagine it's unclear if this is still good wrt the rules.

You can scroll upward to the original question to see the initial requirements, and check the edit history to verify that they haven't changed since the start of the challenge:

> Write your fizz buzz program. Run it. Pipe the output through <your_program> | pv > /dev/null. The higher the throughput, the better you did.

> The program output must be exactly valid fizzbuzz. No playing tricks such as writing null bytes in between the valid output - null bytes that don't show up in the console but do count towards pv throughput.

And vmsplice(2) indeed produces a stream of bytes in the standard output pipe that pv(1) can splice into /dev/null, or cat(1) can copy into the terminal.

This submission was not the only one that uses vmsplice(2); others have found that it's far from a magic bullet. Once you pass the I/O hurdle, much work remains in generating the pages of output as quickly as possible.

dahart · 3 years ago
> it predominantly tests a peripheral issue: the efficiency of extracting memory from one process and transferring it to another.

Isn’t this almost always the whole problem? Most code is bottlenecked on memory and I/O. Complex problems are usually held up by the speed of getting data from one place to another, and not very often on computing the data. As someone who spends his days optimizing GPU assembly, even in the rare cases when compute is the bottleneck, once you optimize it, memory becomes the bottleneck.

cyanydeez · 3 years ago
Maybe it should be called `code-origami`
tkluck · 3 years ago
I disagree, because one does not get to the bottleneck being _the efficiency of extracting memory from one process and transferring it to another_ without major fizzbuzz-specific optimizations first.

For example, there's a clever bit representation to get base-10 carries to happen natively.
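
(Not the answer's actual representation, which is cleverer and lets the CPU's own carries do the work, but a toy sketch of the underlying idea: keep the line counter in output format and increment it in place, so no int-to-string conversion is ever needed:)

    /// Increment a decimal number stored as ASCII digits, in place.
    fn inc_ascii(digits: &mut Vec<u8>) {
        for d in digits.iter_mut().rev() {
            if *d < b'9' {
                *d += 1;
                return;
            }
            *d = b'0'; // carry into the next (more significant) digit
        }
        digits.insert(0, b'1'); // all digits were 9: grow by one digit
    }

    fn main() {
        let mut n = b"98".to_vec();
        for _ in 0..4 {
            inc_ascii(&mut n);
            println!("{}", std::str::from_utf8(&n).unwrap());
        }
        // prints 99, 100, 101, 102
    }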

The initial competition requirements are not particularly vague about this point: Measuring throughput with `<program> | pv > /dev/null` is prescribed, and it also says

> Architecture specific optimizations / assembly is also allowed. This is not a real contest - I just want to see how people push fizz buzz to its limit - even if it only works in special circumstances/platforms.

dahfizz · 3 years ago
I/O is something that literally every program has to do. It's also the bottleneck of 99% of code running on modern hardware. Moving bytes from one place to another is essential and relatively slow.

Understanding how to deal with memory I/O and file I/O performantly is a relevant skill for every program and programmer.

haunter · 3 years ago
> Save it as fizzbuzz.S (that's a capital S as the extension)

What’s the significance of “.S” vs “.s”?

harrisi · 3 years ago
Capital S will run the preprocessor first.

Edit:

From manpage:

    file.s
        Assembler code.
    file.S
    file.sx
        Assembler code that must be preprocessed.

wkz · 3 years ago
IIRC, I believe that the difference traditionally was whether to pass the input through the preprocessor (.S), or not (.s).

Not sure if it makes a difference on modern toolchains.

zen_1 · 3 years ago
The convention I'm used to uses .S to denote hand-written assembly files (usually tracked in git) vs .s for machine generated assembly which can be overwritten as needed.
easytiger · 3 years ago
GCC et al won't overwrite a .S but if you ask it to gen ASM (e.g. gcc -S xyz.c) it will overwrite a .s