Iterators and Streams in Rust and Haskell

I find it really interesting that the idiomatic Haskell implementations (basically math) are the best-performing, while the Rust-like Haskell implementation (using an IORef) is orders of magnitude slower. This is exactly what I want: describe the logic of the operation, and leave the compiler to optimize for the hardware (in this case a CPU which mutates registers). The Rust implementation makes assumptions about the underlying hardware (has registers we can mutate), and is only about 15% faster than the Haskell implementation which makes no such assumptions.

In essence, this is why I love Haskell, and choose it over Rust: it allows me to write my application logic directly, without having to think about it in terms of mutation, and have the generated code be pretty fast. If GHC becomes well-optimized enough it can render Rust obsolete, since "no runtime overhead" becomes pretty meaningless if it's actually slower than Haskell (e.g. using LinearTypes, which removes need for GC). Rust can't render Haskell obsolete, however, since Haskell's goal is basically allowing you to write logic directly, using types as propositions and values as proofs. So Haskell's goal is a qualitative one (execute logic) while Rust's is a quantitative one (performance/no runtime overhead), which results in Haskell being able to take the place of Rust if GHC gains sufficiently in performance.

adwn · 8 years ago

> If GHC becomes well-optimized enough it can render Rust obsolete [...]

Ah yes, the mythical Sufficiently Smart Compiler (TM), scheduled to arrive any day now. Will it also be able to reliably transform non-trivial functions that are accidentally O(n) in memory consumption due Haskell's laziness, to O(1) versions?

The keyword here is reliably. There are many clever but brittle compiler optimizations that aren't used much in the field, because when performance is crucial, you can't rely on the compiler to detect this transformation opportunity in some cases and not in some other cases, so you code it explicitly.

chongli · 8 years ago

reliably transform non-trivial functions that are accidentally O(n) in memory consumption due Haskell's laziness, to O(1) versions?

What you're describing is called strictness analysis. [0] I believe that attempting to solve it in general runs up against the halting problem. That being said, GHC does perform some strictness analysis when optimization flags are turned on. [1] The language itself also includes strictness annotations on the fields of data constructors. [2] Experienced Haskell programmers know how to use these to avoid allocating unnecessary thunks.

[0] https://en.wikipedia.org/wiki/Strictness_analysis

[1] https://wiki.haskell.org/Performance/Strictness#Strictness_a...

[2] https://wiki.haskell.org/Performance/Data_types#Strict_field...

lmm · 8 years ago

Will Haskell's compiler ever be 100% reliable? Maybe. Will Haskell's compiler ever be more reliable than the average developer? IME it's already there, and has been for a while: for a given functional requirement and a given level of performance, the cheapest way to achieve it is usually Haskell or similar.

rtpg · 8 years ago

I understand this frustration, but my understanding is that patterns exist to _ensure_ transformations go through exist in Haskell land. Im thinking about stream fusion for example. Lots of "pipe-y" libraries exist to make sure your program consumes input in a streamed manner.

So while your O(n) code might sometimes unreliably be turned into O(1), if performance is your game you can use these kind of libraries to endure O(1).

jeffdavis · 8 years ago

Are you saying there's a way to use haskell without the GC or other runtime overhead?

I like rust because it doesn't have much in the way of runtime requirements (almost zero, like C).

runeks · 8 years ago

> Are you saying there's a way to use haskell without the GC or other runtime overhead?

There isn't yet, but there will be when, first, the LinearTypes extension arrives (scheduled for GHC 8.4) and, secondly, GHC's garbage collector is modified to take of advantage of the fact that linear types are always consumed exactly once. This will take 1-2 years or so.

After this it should be possible to write no-runtime-overhead Haskell functions by using the "lollipop" arrow after an argument, and then consuming that argument exactly once in your function. So you will have to split your functions into small enough functions, that consume their arguments exactly once, and then compose them into something that will not need a garbage collector at runtime. In theory you could write an entire application this way, although in practice you probably wouldn't bother avoiding the GC for the outermost constructions of your program, since this would offer little in performance gains and be a lot more difficult (remember, a no-GC function must consume its arguments exactly once).

But time will tell how this will compare to something like Rust's borrow checker.

evincarofautumn · 8 years ago

There’s been some work lately by Tweag I/O on a linear types extension to GHC.[1][2]

It would allow us to safely allocate things from malloc memory or a “linear heap” separate from the main GC heap, and that should reduce pause times by reducing GC pressure, but I don’t think we’ll ever be able to avoid the GC entirely.

However, we can still expect some nice performance advantages from the fact that linear types can enforce guarantees about inlining & fusion (which are currently somewhat unpredictable) as well as resource lifetimes.

[1]: https://github.com/tweag/linear-types

[2]: https://github.com/tweag/ghc/tree/linear-types

infinity0 · 8 years ago

I can certainly see it's plausible Haskell can beat Rust in the mid-term future on high-level programming away from the bare metal. Less certain that it can beat Rust for system-level programming or low-level library programming, where people using the library don't want to link to a heavy runtime that includes a GC.

pjmlp · 8 years ago

Not having issues with using a GC language is also a huge win.

I can grasp Haskell without major issues, follow C++14 and C++17 meta-programming tricks, yet I still don't understand quite well how to deal with the borrow checker on Rust.

Still haven't given up on it, but it shows the learning curve is a bit high, or then it is me that cannot wrap my head around it.

runeks · 8 years ago

I'm not that familiar with Rust but, as far as I can see, it makes sense to say that Rust's borrow checker is an interface to a O(1) garbage collector (as fast as no GC at all), that the programmer has to implement at compile-time for all values. All the borrow checker-code is compiled into a no-runtime-overhead garbage collector, that is embedded into the resulting generated code.

In the world of Haskell, the -> arrow embeds a GC (with runtime overhead) into the genrated code. A similar interface to Rust's borrow checker in Haskell (no runtime overhead) would be the ⊸ arrow from LinearTypes (in a future where GHC's GC is able to take advantage of this).

It looks, to me, like the arrow (⊸) notation is a lot simpler inteface to a no-runtime-overhead garbage collector -- or the absense of a traditional GC, in other words -- than Rust's borrow checker. But I'm looking forward to seeing how it plays out (it will probably take years before it becomes reality).

jmcomets · 8 years ago

Don't hesitate to check out #rust / #rust-beginners on irc.mozilla.org, I was struggling with BCK for quite a while and asking dumb questions to experienced rustaceans has been very helpful. I've now been using rust for 2 years and am very happy with it.

> But look again: C is taking 87 nanoseconds, while Rust and Haskell both take about 175 microseconds. It turns out that GCC it able to optimize this into a downward-counting loop, which drastically improves the performance. We can do similar things in Rust and Haskell to get down to nanosecond-level performance, but that's not our goal today. I do have to say: well done GCC.

Downward-counting or not, it is simply impossible for GCC to generate code that executes all 1,000,000 iterations of the loop in 87ns. That would be 87 femtoseconds per iteration, on average.

More likely, GCC figured out how to collapse the entire loop into a closed-form expression that is a function of the loop length.

gizmo686 · 8 years ago

I just tested this on my machine (gcc 5.4.0). At -O2, gcc produced normal looking assembly code. At -O3, gcc produced a monstrosity [0] that I don't feel like fully deciphering.

However, from a brief glance, it does not appear to have created a closed form solution. Instead, it contains a single loop:

    .L4:
        addl    $1, %edx
        paddd   %xmm1, %xmm0
        paddd   %xmm2, %xmm1
        cmpl    %edx, %eax
        ja  .L4

which seems to be using a SIMD instruction (paddd[1]) that adds does 4 32-bit integer additions in parallel.

After this loop, it does some "housekeeping" (read, something I don't understand) before proceeding to an unwound version of the last iterations of the loop:

    leal    4(%rdx), %ecx
    addl    %edx, %eax
    cmpl    %ecx, %edi
    jl  .L2 
    addl    %ecx, %eax
    leal    8(%rdx), %ecx
    cmpl    %ecx, %edi
    ...
    jl  .L2 
    addl    %ecx, %eax
    leal    28(%rdx), %ecx
    cmpl    %ecx, %edi
    jl  .L2
    addl    %ecx, %eax
    addl    $32, %edx
    leal    (%rax,%rdx), %ecx
    cmpl    %edx, %edi
    cmovge  %ecx, %eax
    ret

Where .L2 is just:

    .L2:
        rep ret

I assume that this is just some form of return, but the documentation I could find [2] seems to suggest that rep is a prefix for string operations, which doesn't make sense.

[0]https://pastebin.com/raw/Y55gQG7p

[1] http://x86.renejeschke.de/html/file_module_x86_id_226.html

[2] https://c9x.me/x86/html/file_module_x86_id_279.html

gpderetta · 8 years ago

the rep in rep ret is ignored, is just used for alignment; the 'housekeeping' code is to handle non-multiple of 8 loop counts.

Still, unless I'm missing something, the code should be executing 8 adds per clock[2]; at 4ghz, that still above 1us for 500k adds.

GCC doesn't seem to be able to fold the loop given a constant expression, unless the function is explicitly declared constexpr; in which case it will complain about the accumulator overflowing, but gcc doesn't seem to be taking advantage of it.

Clang does not vectorize the loop but will replace it with a constant given a constant parameter.

Bottom line, I'm not sure what's going on with the article's measurements.

[2] potentially 12 for skylake or even 24 with avx.

Deleted Comment

implr · 8 years ago

If you pass -march=skylake it will get even more monstrous, with AVX: https://godbolt.org/g/v7iKF3

kccqzy · 8 years ago

I don't think so. https://godbolt.org/g/GBZgER

nemetroid · 8 years ago

Perhaps the entire call of the function was replaced by a constant, then.

pcwalton · 8 years ago

Another possibility is that GCC discovered that the loop had no side effects in the benchmark harness and eliminated it. This is a very common thing that happens in microbenchmarks.

gizmo686 · 8 years ago

The c code defines a function that returns the sum computed in the loop. This function is than run using Haskell's FFI. GCC has no way of knowing there is no side effects, because other code might link to it and use the result of the function.

GHC, in concept, does know that there are no side effects, but that would apply to all of the functions being tested. Further, since it is using a benchmarking library, I assume the authors of said library were aware of the issue, and wrote it in such a way as to force evalutation of the function.

https://gist.github.com/snoyberg/9b1c77b595c4adf90880213fc49...

gpderetta · 8 years ago

I.e. benchmarking is hard.

e12e · 8 years ago

Being able to tell that a loop is gone shouldn't be too hard with a diff of the assembler output with optimizations on and off - at least for fairly trivial code.

adrianratnapala · 8 years ago

I assume the 87 nanoseconds means the average per loop iteration, not all 1,000,000 iterations.

red75prime · 8 years ago

It's unlikely. The entire loop should have taken 0.87 seconds then. That's a bit too much for several million operations, which don't seem to require memory access.