Remnant44 · a year ago
It's rare to need to work at this level of optimization, but this is a really neat trick!

Modern cores are quite wide - capable of running 6-8 instructions at once, as long as there are no dependencies. Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.
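
A minimal sketch of the multiple-accumulator idea (function names are illustrative; assumes a plain integer sum):

  #include <stddef.h>
  #include <stdint.h>

  // Naive version: every addition depends on the previous one,
  // so throughput is limited by the latency of a single add chain.
  uint64_t sum_naive(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
      s += a[i];
    return s;
  }

  // Four independent accumulators: the core can retire several
  // additions per cycle because the chains don't depend on each other.
  uint64_t sum_unrolled(const uint64_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
    for (; i < n; i++)  // leftover elements
      s0 += a[i];
    return s0 + s1 + s2 + s3;  // combine after the loop body
  }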

This technique is similar in concept, but even more general. Put the "guess" in the registers and run with it, relying on a second instruction within a branch to correct the guess if it's wrong. Assuming your guess is overwhelmingly accurate... this lets you unlock the width of modern cores in code that otherwise wouldn't present a lot of ILP!

Clever.

xxpor · a year ago
>Something as simple and common as a summation loop can often be sped up 2-4x by simply having multiple accumulators that you then combine after the loop body; this lets the processor "run ahead" without loop carried dependencies and execute multiple accumulations each cycle.

Shouldn't the compiler be smart enough to figure that out these days (at least if it truly is a straightforward accumulation loop)?

johnbcoughlin · a year ago
Such optimizations may be forbidden by the fact that floating point addition is not associative unless you tell the compiler not to worry about that (I believe)
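
For instance (a sketch; exact behavior depends on the compiler and flags):

  #include <stddef.h>

  // With doubles the compiler must keep this serial chain as written,
  // since (a + b) + c is not generally equal to a + (b + c) in floating point.
  double sum_f(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
      s += a[i];
    return s;
  }
  // Something like GCC/Clang's -ffast-math (or -fassociative-math) permits
  // reassociation, which enables unrolling into multiple accumulators.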

Deleted Comment

spockz · a year ago
If this is possible for an application, doesn't it make more sense to use SIMD instructions at that point?
gpderetta · a year ago
You want to use SIMD and multiple accumulators. In fact, you want at least as many accumulators as there are SIMD ALUs, and since SIMD operations usually have longer latency, you typically unroll SIMD loops for software pipelining, using even more accumulators to break loop-carried dependencies.
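
A sketch of that combination using AVX2 intrinsics (illustrative only; assumes a 64-bit sum, with the remainder handled by a scalar tail):

  #include <immintrin.h>
  #include <stddef.h>
  #include <stdint.h>

  uint64_t sum_avx2(const uint64_t *a, size_t n) {
    // Two vector accumulators: independent chains to hide the add latency.
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
      acc0 = _mm256_add_epi64(acc0, _mm256_loadu_si256((const __m256i *)(a + i)));
      acc1 = _mm256_add_epi64(acc1, _mm256_loadu_si256((const __m256i *)(a + i + 4)));
    }
    __m256i acc = _mm256_add_epi64(acc0, acc1);
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint64_t s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)  // scalar tail
      s += a[i];
    return s;
  }
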
pkhuong · a year ago
I see a lot of people asking for a real use case. If you follow the reference chain in the first aside, you'll find this blog post of mine https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-poin.... where we use value speculation to keep MOVS out of the critical path in an interrupt-atomic read sequence for hazard pointers.
Remnant44 · a year ago
Very nice! I was struggling to come up with a realistic real world scenario for value speculation, but this is a perfect example. Any kind of multi-thread/contention related code seems like an ideal target for making a guess-based fast-path corrected by a slow-path consistency check
gpderetta · a year ago
Went for the speculation example, stayed for the interrupt-atomic memory copy. Great article.
mistercow · a year ago
It’s interesting to me that the final assembly-trick-free version almost no longer looks like a hack.

If you commented the inner loop with something like “// Linear scan for adjacent nodes”, the reader gets an OK, if incomplete, intuition for why it’s faster. Even if you don’t know the exact CPU details, if you’re aware that flat arrays usually loop faster than contiguous linked lists, the nested loop immediately reads as a kind of “hybrid mode”.

metadat · a year ago
That is amazing, isn't it?

I want to understand, in:

  uint64_t sum5(Node *node) {
    uint64_t value = 0;
    Node *next = NULL;
    for (; node; node = node->next) {
      for (;;) {
        value += node->value;
        if (node + 1 != node->next) {
          break;
        }
        node++;
      }
    }
    return value;
  }
How is this magic:

        if (node + 1 != node->next) {
          break;
        }
Better than the more intuitive:

        if (node->next == NULL) {
          break;
        }
Also.. why is `node + 1' even a valid comparison? (Please forgive my rusty C/C++, college was a while ago)

trealira · a year ago
> Also.. why is `node + 1' even a valid comparison here?

In C and C++, when you add 1 to a pointer, you actually make it point to the next object of that size in memory, e.g. if I have a pointer to a 4-byte integer at address 0x8000, incrementing it will make it point to address 0x8004.

Because arrays are contiguous chunks of memory, you can use this to iterate through arrays.

  int array[50];
  int *p = &array[0]; // Take the address of the first array element
  int sum = 0;
  for (int i = 50; i != 0; i--) {
      sum += *p;
      p++; // Point to the next element
  }
After this loop, p will be equal to &array[50], which is one past the last element of the array, because the loop will have run 50 times, and p is incremented once per loop.

What OP did is allocate an array of linked list nodes, and test to see if the next linked list node was actually just the next array element.

mistercow · a year ago
The other person who replied may have already cleared this up for you, but the two conditions you listed mean very different things. The first means "if the next node is not adjacent in memory", while the second means "if there is no next node."
mjevans · a year ago
Finding 'the hack' to get the compilers to behave is one of the areas where I'd like to see better warnings (which could get promoted to errors).

  while (node) {
    value += node->value;
    next = node->next;
    node++;  // Line 101
    if (node != next) {
      node = next;
    }
  }
  // Compiler warning
  // Warning: Lines 101 102 103: always true evaluation combined
The pivot here is moving the fallback out of the tight inner loop, which also introduces a new statement block. The additional loop-exit condition depends on values in memory and cannot be optimized out, because it now directs control flow directly instead of merely ensuring that the value of node becomes node->next (an indirect form of flow control).

  while (node) {
    while (node) {
      value += node->value;
      if (node + 1 != node->next) break;
      node++;
    }
    node = node->next;
  }

metadat · a year ago
In case your knowledge of the mechanics of `struct' vs `typedef struct' in C is rusty like mine, here are nice refreshers:

https://stackoverflow.com/a/23660072

https://stackoverflow.com/a/1675446

kristianp · a year ago
Per Vognsen (referenced in this blog) can now be found on Mastodon at: https://mastodon.social/@pervognsen

He's just published "Finding Simple Rewrite Rules for the JIT with Z3":

https://www.pypy.org/posts/2024/07/finding-simple-rewrite-ru...

https://news.ycombinator.com/item?id=40951900

12_throw_away · a year ago
This got me wondering - it's said that C is based on the lie that all computers have the same architecture as a PDP-11. (At least, I'm pretty sure I remember people saying that).

So, are there any programming languages that have updated architectural models, something that takes into account branch prediction, CPU caches, etc?

darby_nine · a year ago
As far as I am aware, this saying is based on the reasoning behind C types rather than serious compiler considerations. In today's world such cpu-specific concerns are left to the compiler to figure out.

I'm sure you could contrive a language where this functionality is exposed, but I'm struggling to come up with an example where this would be seriously beneficial across multiple platforms.

I strongly suspect that integrating editors of existing languages with tooling that informs programmers on how a given chunk of code performs with parallel execution units would be far more beneficial than inventing a language dedicated to such concerns at this time.

naveen99 · a year ago
I guess that’s what intel and amd were relying on, while nvidia let cuda programmers control the gpu cache explicitly.
kaba0 · a year ago
Well, C++ has the likely/unlikely attributes to hint that one branch is preferred over the other, and C++, Rust and some other low-level languages do have native SIMD support (note: C doesn't have an official one, just compiler-specific ones, so in this vein it is actually higher level than Rust or C++).
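
For reference, the closest C-level equivalent is the compiler-specific __builtin_expect in GCC/Clang (a sketch, not standard C; names are made up):

  #include <stdio.h>

  #define LIKELY(x)   __builtin_expect(!!(x), 1)
  #define UNLIKELY(x) __builtin_expect(!!(x), 0)

  // Marks the error branch as cold so the compiler lays out the hot path
  // as the straight-line fall-through case.
  long checked_div(long a, long b) {
    if (UNLIKELY(b == 0)) {
      fprintf(stderr, "division by zero\n");
      return 0;
    }
    return a / b;
  }
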
gpderetta · a year ago
Depends on what you mean by official. There are likely more compilers implementing GCC vector extensions than there are Rust compilers.
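
For what it's worth, a small sketch of those vector extensions in C (GCC/Clang-specific; names are illustrative):

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  // A 4-lane vector of 64-bit integers via the vector_size attribute.
  typedef uint64_t u64x4 __attribute__((vector_size(32)));

  uint64_t sum_vec_ext(const uint64_t *a, size_t n) {
    u64x4 acc = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
      u64x4 v;
      memcpy(&v, a + i, sizeof v);  // unaligned load into the vector
      acc += v;                     // element-wise add
    }
    uint64_t s = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; i++)  // scalar tail
      s += a[i];
    return s;
  }
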
queuebert · a year ago
I appreciate the elegant blog design. Reminds me of Edward Tufte's books.
candiddevmike · a year ago
It's impressive that it doesn't have a mobile view and still looks great.
ahoka · a year ago
What? It looks horrible.
AnthOlei · a year ago
Ha, I think this site is styled by a single-sheet CSS called Tufte.css
notpushkin · a year ago
I don't think it is. In Tufte CSS, sidenotes are implemented using float: right [1], while here CSS Grid is used instead.

[1]: https://github.com/edwardtufte/tufte-css/blob/957e9c6dc3646a...

mwkaufma · a year ago
The optimization is the linear memory layout of the nodes -- value speculation is decoration.
Remnant44 · a year ago
The linear node layout is not the point at all.

It's serving two purposes here:

1) Providing us with a correct "guess" of what the next node is.

2) Ensuring that in all cases we're running from the L1 cache.

In real world code, you'd be correct -- getting things to run out of L1/L2 is the most important attribute. This is specifically about a micro-optimization that allows you to beat the obvious code even when running completely from cache!

pfedak · a year ago
The example is poorly chosen in terms of practicality for this reason, but otherwise, no, this is a poor summary that misses something interesting.

The memory layout isn't changing in the faster versions, and there are no additional cache misses. It's easy to convince yourself that the only difference between the naive linked list and assuming linear layout is the extra pointer load - but TFA shows this is false! The execution pipeline incurs extra costs, and you can influence it.

mistercow · a year ago
I think a more practical example might have been to have a mostly contiguous list with a few discontinuous nodes inserted randomly in the middle. That’s more like a real case, and exercises the advantages of linked lists over simple arrays, but should still perform well, since there would only be a few value speculation misses.
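
Something along those lines, perhaps (a sketch; assumes a Node layout like the article's, and the helper name is made up):

  #include <stdint.h>
  #include <stdlib.h>

  typedef struct Node { uint64_t value; struct Node *next; } Node;

  // Mostly-contiguous list: nodes live in one array, except every
  // stray_every-th node, which is allocated elsewhere, so that
  // node + 1 != node->next there and speculation occasionally misses.
  Node *build_mostly_contiguous(size_t n, size_t stray_every) {
    Node *arr = malloc(n * sizeof *arr);
    Node *head = NULL, *prev = NULL;
    for (size_t i = 0; i < n; i++) {
      Node *cur = &arr[i];
      if (stray_every && i > 0 && i % stray_every == 0)
        cur = malloc(sizeof *cur);  // occasional out-of-place node
      cur->value = i;
      cur->next = NULL;
      if (prev) prev->next = cur; else head = cur;
      prev = cur;
    }
    return head;
  }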