Totally off topic, but I've started to notice that more and more algorithmic, low-ish-level content assumes Rust by default. My entire life I've been used to anything algorithmic being written either in sciency pseudocode (invariably in LaTeX-generated PDFs) or in plain old C(++), the Vulgar Latin of computer programming. I know C and I don't really know Rust, but nevertheless I love that this is changing! I'd never advise a new programmer interested in low-level/fast stuff to begin with C now; so much good content is written around Rust. Like the old C examples, the post doesn't even say "this is Rust". You're expected to expect it to be Rust.
And, well, the language choice doesn't really matter for the core subject here. I could follow this post fine even though I don't really know Rust (just like I assume a Rust programmer could follow a well-written algorithms deep dive with examples in C, with a bit of dedication). EDIT: okok, that was a lie; I could follow the first half of this post, ish, and then it got way too deep for me.
But anyway, standardize on Rust, I like it. I'd love it if a bit more Zig were thrown into the mix here, but either of them feels like a better default than C at this point to me. After decades of C being The Standard for this stuff, I love that this is finally changing.
I think it's almost irrelevant how good Rust is as a language. Rust is winning in many contexts because you can use it where you'd otherwise use C++, but its standard library and tools are substantially better.
I think the main problem is that Rust doesn't allow basic data structures like graphs or doubly-linked lists without extra effort, which is why I've generally preferred pseudocode.
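(For what it's worth, the usual workaround is to replace pointers with indices into a backing Vec, which keeps the borrow checker happy; a minimal sketch, my own code rather than anything from the post:)

```rust
// A graph without Rc/RefCell or unsafe: nodes refer to each other by index
// into a Vec instead of by pointer, so ownership stays with the Graph.
struct Graph {
    adj: Vec<Vec<usize>>, // adj[u] lists the neighbors of node u
}

impl Graph {
    fn new(n: usize) -> Self {
        Self { adj: vec![Vec::new(); n] }
    }
    fn add_edge(&mut self, u: usize, v: usize) {
        self.adj[u].push(v);
        self.adj[v].push(u);
    }
}
```

It works, but it's exactly the kind of extra ceremony that pseudocode lets you skip.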
The problem with Rust is that it has lots of features but isn't C-like. Python doesn't have much syntax, so it's fine for it not to be C-like, but Rust's syntax is 100% new, and there's a lot of it relative to C. I can pick up C-like languages easily, but Rust just makes me mad to look at.
Nah, just the final bits, stuff like `Simd::<u32, 8>::splat(q)`. I'm not sure what splatting is, or how Rust's Simd library works, or, admittedly, how SIMD works in detail at all. So e.g. I'm not sure what that 8 is doing there in the type parameters, details like that. Maybe this isn't a Rust thing but a SIMD thing, btw; I don't know much Rust, but I also never wrote any SIMD code ever. I don't know how the article could be clearer; IMO once you go into the deep nitty-gritty optimization stuff, you just gotta assume the reader knows the tools you're using. I'm OK with dropping out halfway because the core idea is there even before you squeeze out all the perf.
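For anyone else stuck on the same line, here's my rough understanding (happy to be corrected): splat broadcasts one scalar into every lane, and the 8 is the lane count, so `Simd::<u32, 8>` is a 256-bit vector of eight u32s. Something like:

```rust
#![feature(portable_simd)] // nightly-only, like the code in the post
use std::simd::prelude::*;

fn main() {
    let q = 42u32;
    // "Splat" broadcasts the scalar q into all 8 lanes of the vector.
    let qs = Simd::<u32, 8>::splat(q); // [42, 42, 42, 42, 42, 42, 42, 42]
    // One comparison then tests q against 8 stored keys at once.
    let keys = Simd::from_array([10u32, 20, 30, 40, 50, 60, 70, 80]);
    let mask = keys.simd_lt(qs); // per-lane: keys[i] < q
    println!("{:b}", mask.to_bitmask()); // one bit per lane
}
```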
Because that's the original problem statement for Algol! As in: we've been publishing all this pseudocode for quite a while now, and it seems like conventions have emerged, so let's formalize them and program in that. ... People were markedly more naïve back then, but then that's sometimes what it takes.
I don't know if it's good or bad; with pseudocode you put no constraints on how it should be implemented. It's known that some kinds of algorithms are really hard to implement in Rust (everyone uses linked lists as the example). So having the article use Rust is good for showing the algorithm can fit Rust's constraints, but at the same time, do those constraints limit the algorithm itself?
Quite honestly, Rust doesn't seem like a win here over C++. In fact, C++ has fewer sigils, which makes it somewhat easier to read, striking a balance between being simple enough (C-like) and offering abstractions (templates). Readability-wise, I would have preferred Julia, as that would have been the cleanest way to explain the code across the different topics at a sufficient level of detail. Alas, it seems to have stalled somewhat (relatively speaking). It also doesn't help that every Rust article has Rust advocates jumping on you with "WHAT ABOUT SAFETY", which is off-putting, since not every program needs all the complexity of the Rust memory/lifetime model to be safe (simple bounds checking etc. might be sufficient).
The article is about how to squeeze the absolute maximum performance from the hardware, so you need a language that allows relatively low-level access to it. And a big plus of Rust is that it has a portable SIMD API in the standard library (nightly std::simd), while C++ will only get one in C++26.
One dimension that is not explored is partitioning the queries into batches. The primary cost is doing lookups on the out-of-cache table, so if you have a sufficiently large number of queries, you can resolve a couple of layers of the tree in one step while those layers are in cache, grouping the queries by where they land deeper in the tree, and then resolving all the queries that touch the same deeper part of the tree in one batch as well.
In theory, with an infinitely large number of input queries you never have to do an out-of-cache lookup; it can be completely amortized away into linear scans.
That said, you then end up with a bunch of results that need to be put back into the correct output order, which likely makes it not worth it. But if the operation can be fused into a reduction (e.g., find the sum of the binary search results), or the output order doesn't matter for some other reason, then all of a sudden it might make sense.
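A rough sketch of the idea, on a plain sorted slice rather than a tree (hypothetical code; each query carries a slot for its answer, since the output order gets scrambled):

```rust
// Answer a whole batch of lower_bound queries together: split the sorted data
// at its midpoint, group the queries by which half they descend into, and
// recurse. Each level of the implicit tree is then read once per batch
// instead of once per query, at the cost of losing the original query order.
fn batched_lower_bound(data: &[u32], base: usize, queries: &mut [(u32, usize)]) {
    if data.is_empty() {
        for q in queries.iter_mut() {
            q.1 = base; // every remaining query resolves at this boundary
        }
        return;
    }
    let mid = data.len() / 2;
    let pivot = data[mid];
    // Queries with x <= pivot stay in the left half, the rest go right.
    queries.sort_unstable_by_key(|&(x, _)| x > pivot);
    let split = queries.partition_point(|&(x, _)| x <= pivot);
    let (left, right) = queries.split_at_mut(split);
    batched_lower_bound(&data[..mid], base, left);
    batched_lower_bound(&data[mid + 1..], base + mid + 1, right);
}
```

A real implementation would partition the queries with a linear pass instead of a sort, and would stop recursing once the remaining data fits in cache.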
I implemented this method in Dyalog 18.0 with BlockQuicksort-like partitioning, using vectorized comparison with bit-boolean output. It's faster than you'd expect: maybe twice as slow as regular binary search when the searched values are in cache, and better once they fall out of L2. But Dyalog reverted to 17.1 for later versions, so you won't see it in a new download. It'll probably make it into CBQN eventually, perhaps with radix partitioning. Note that both quicksort and radix partitioning can be done and undone in a cache-friendly way.
Unlike quicksort, there's no issue of pivot selection, since you always choose the midpoint of the searched values. However, there's a complementary memory issue if the partitions become unbalanced, because the larger partition can require saved memory of roughly the depth times the number of elements. With a bit per comparison, it's bounded by the size of the input.
Nice overview of sorting methods, thanks for sharing!
I also looked a bit into radix and distribution sort at some point over the past year, but in the end high-performance sorting is just too big a thing to do quickly on the side, as your post shows well :")
In fact I wasn't aware of the associativity issues for radix sort. That's definitely something to keep in mind and investigate.
Will definitely refer back to it once I'm looking at sorting again in more detail at some point!
There are theory papers on "buffer trees": B-trees where each node is augmented with an O(B)-length array of pending updates and queries. I believe there were also some attempts at building SQL databases on them. It sounds like you're reaching for the same trick.
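Roughly, the shape is a plain B-tree node plus an O(B) buffer of pending operations that only gets flushed to the children when it fills up; a hypothetical sketch:

```rust
// Buffer-tree node: a normal B-tree node augmented with pending operations.
// Flushing a full buffer moves its ops one level down, B at a time, which
// amortizes update cost to roughly O((log N) / B) I/Os instead of O(log N).
enum Op {
    Insert(u32),
    Delete(u32),
    Query(u32),
}

struct BufferNode {
    keys: Vec<u32>,            // separator keys, as in a plain B-tree
    children: Vec<BufferNode>, // empty for a leaf
    buffer: Vec<Op>,           // pending ops, flushed downward when full
}
```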
That's a hybrid compacting data structure: compacting within the B-tree node, normal B-tree topology otherwise.
And it works much better than pure compacting (i.e., the LevelDB lineage), because you avoid lock contention at the root on multithreaded update workloads, and the compaction/resort has much lower overhead when it fits in L2.
Incidentally, there's a filesystem that uses this technique.
Interesting stuff, definitely the kind of real-world optimization that happens when you're able to look at actual access characteristics rather than highly abstracted models.
At one level you could just be saving the results into a hashtable to bubble them out again, but at the extreme end the lookup is actually spread across multiple machines in a cluster, so it's more like throwing the query to the machine with the right chunk of the index, along with instructions about where the final RPC is supposed to go with the result.
There was some work a while back on streaming data past queries, but you need a fairly bounded data set for that to work. Having ten years of historical data in a data set would gum that up severely.
It improves throughput (on average everyone waits less) at the price of a higher max latency (someone whose request touches a part of the index that very recently fell out of cache will wait way, way more), doesn't it? In the real world those worst cases quite often kill such optimizations.
The (data size / cache size) ratio and the temporal locality of the queries are key.
Ah yes, I did consider sorting the queries at some point, but I guess I then forgot about it again. If the queries are random and there are far fewer of them than input elements, most queries will probably hit different cache lines in the final layer, so I suspect the benefits will be limited, maybe 2x best case, since the last layer is where the bottleneck is.
Also, (radix) sorting is usually very memory-bound, and we'd probably need to sort into 16^2 = 256 buckets or so to get sufficient reuse of the higher layers, but I don't have numbers on what a round of radix sort costs. (My guess is on the order of 1ns per query? Maybe I'll find time to investigate and add it to the post.)
High-performance sorting algos do either merging or partitioning: you merge R input streams into one, or split one input stream into R (for quicksort, radix sort, and samplesort).
1. For merge sort of N elements, you have to perform log(N)/log(R) passes.
2. For samplesort, C*log(N)/log(R) passes, where C depends on the distribution of your data; there are no strict guarantees.
3. For radix sort of K-byte elements, exactly K passes (and indeed, 256 ways is optimal according to my tests), which is usually more than the previous value.
While merge sort looks preferable, since it uses the smallest and a fixed number of passes, the merging code itself is less efficient: there aren't many independent CPU operations, which makes it slower than samplesort and especially radix sort for small inputs.
So it seems that the best strategy combines two levels: for larger blocks you absolutely need to minimize the number of passes [over memory] and employ multithreading, while for smaller blocks you need to minimize the number of CPU operations and increase ILP.
Today, two well-recognized algos are IPS4o for larger blocks and Google's SIMD quicksort (vqsort) for smaller ones.
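To make the pass counts concrete: for N = 2^32 u32 elements and R = 256, merge sort needs log(N)/log(R) = 32/8 = 4 passes, and LSB radix sort needs exactly K = 4 passes, one per byte. One 256-way radix pass looks roughly like this (my sketch, not a tuned implementation):

```rust
// One 256-way LSB radix pass over u32 keys: count, prefix-sum, scatter.
// A full sort is 4 such passes (shift = 0, 8, 16, 24), ping-ponging
// between two buffers; each pass streams the input twice.
fn radix_pass(src: &[u32], dst: &mut [u32], shift: u32) {
    let mut counts = [0usize; 256];
    for &x in src {
        counts[((x >> shift) & 0xFF) as usize] += 1;
    }
    // Exclusive prefix sum turns counts into starting offsets per bucket.
    let mut offsets = [0usize; 256];
    let mut sum = 0;
    for b in 0..256 {
        offsets[b] = sum;
        sum += counts[b];
    }
    for &x in src {
        let b = ((x >> shift) & 0xFF) as usize;
        dst[offsets[b]] = x;
        offsets[b] += 1;
    }
}
```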
Amazingly thorough! I love how the author leaves no stone unturned. I had no idea you could do this kind of low-level efficiency shaving in Rust.
I wonder how a C++ implementation with https://github.com/jfalcou/eve would compare.
Thanks!
It's somewhat tiring to leave no loose ends, but I agree it pays off :)
Doing this stuff in Rust is absolutely possible, and I'd do it again since my C++ days are now behind me, but the endless transmuting between portable-simd types, plain Rust arrays, and intrinsic types is quite annoying.
Also, Rust is not made for convenient raw pointer arithmetic; there, plain C would be much more to the point.
You kind of lost me towards the end, so I'm not sure whether you attempted it, but I was wondering whether it would be possible for the internal nodes to store only the upper 8/16 bits (the ones that are nonzero for the current subtree). This implies that one 64-byte cache line stores 32 or 64 entries instead of 16 (or rather: 31 or 62/63, since you may need some bookkeeping data).
The next level would be to keep track of how many prefix bits are implied by the parents, so leaf nodes could perhaps also use only 8/12/16 bits, if the higher bits are implied by the parent. Or, instead of bit masks, use offsets (i.e., leaves store a k-bit offset from the parent's value).
That may screw with the branch predictor, and may work well with evenly distributed data but not with uneven data.
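If I'm reading my own suggestion back right, an internal node would look something like this (hypothetical layout; u16 keys with the high bits implied by the path from the root):

```rust
// Hypothetical internal node storing only the low 16 bits of each key,
// with the high bits implied by the parents. A 64-byte cache line then
// holds 30 truncated keys plus bookkeeping, instead of 16 full u32 keys.
#[repr(C, align(64))]
struct TruncatedNode {
    prefix_bits: u16, // how many high bits the path to this node implies
    _reserved: u16,   // spare bookkeeping slot
    keys: [u16; 30],  // keys with the implied prefix stripped
}
```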
There are stones still unturned. For typical Unicode table lookups you'd need batches for all the characters of a word; it would be much faster to look up e.g. 16 chars at once, to optimize cache misses.
But when I tested this, even Eytzinger was too slow.
Indeed, Highway (https://github.com/google/highway) is the popularity leader: it implements lots of SIMD operations and supports lots of hardware, including SVE/RVV with runtime SIMD width, although IMHO it has a somewhat longer learning curve.
A nifty thing about Eytzinger trees is that there's a formula for converting a node index to its position in an inorder traversal.
This is useful if your primary keystore is a normal sorted set of keys (easier to work with): you then don't need to store explicit pointers.
Also, he didn't mention that with Eytzinger you can prefetch multiple loop iterations in advance; use 1-based indexing so that sibling nodes line up nicely on cache lines.
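For reference, both tricks fit in a few lines; a sketch along the lines of the Algorithmica version (x86_64-only because of the prefetch intrinsic, and 1-based as described):

```rust
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

// Branchless Eytzinger lower_bound; t[0] is unused padding, t[1..] is the tree.
// Returns the 1-based index of the first element >= x, or 0 if none exists.
fn eytzinger_lower_bound(t: &[u32], x: u32) -> usize {
    let n = t.len() - 1;
    let mut k = 1;
    while k <= n {
        // 16 u32s fit in one cache line, so t[16 * k] sits 4 levels below k:
        // prefetch it now and the line is resident when we get there.
        let p = (16 * k).min(n);
        unsafe { _mm_prefetch::<_MM_HINT_T0>(t.as_ptr().add(p) as *const i8) };
        k = 2 * k + usize::from(t[k] < x);
    }
    // The index-to-inorder formula: strip the trailing ones (the final run
    // of "go right" steps) plus one more bit.
    k >> (k.trailing_ones() + 1)
}
```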
You're right. While I wasn't very explicit about this (because Algorithmica and the linked paper already spend plenty of words on it), the code snippet for Eytzinger does indeed do both the prefetching and the formula.
In fact, I'm pretty sure a similar formula could be made to work for higher branching factors, although it would surely be slower. (Probably it depends on the number of times 17 divides the final index, which is not great, but with B=15 it would be the number of factors of 16, which is easy again.)
The more annoying issue with not storing everything in the final layer is that we'd have to keep a running answer for each query as we go down the tree, which I suspect would add measurable runtime overhead. But maybe that's a tradeoff one is willing to make to avoid the 6.25% overhead.
This is unbelievably cool. ~27ns overhead for searching for a u32 in a 4GB set in memory is unreal.
It's interesting that the wins from batching start diminishing at 8. I'm curious, then, how the subsequent optimizations fare with batch size 8 (rather than 128).
Smaller batch sizes are nice, since the batch size determines how much request throughput we'd need to saturate this system: at batch size 8, we need 1s / ~30ns * 8 ≈ 266M searches per second to fully utilize this algorithm.
The multithreading results are also interesting: going from 1 to 6 threads only improves overhead by 4x. Curious how this fares on a machine with a much higher core count.
Just FYI: the throughput numbers with batching are per _query_, not per _batch_, so I think the *8 is too optimistic :")
I suspect that at higher core counts, we can still saturate the full RAM bandwidth with only 4-5 cores, so that the marginal gains with additional cores will be very small.
That's good though, because it leaves CPU time to work on the bigger problem of determining the right queries and dealing with the outputs (as long as that is not too memory-bound in itself, although it probably is).
The number of unique int32 values is not that large, and a full bitset would only be 512MB. Hard to beat a dense bitset in performance.
As a general-purpose data structure, I would look at Roaring bitmaps for this, which have good trade-offs and great performance for most use cases.
If only simple lookups over a static set are needed, then it's worth looking at minimal perfect hash functions (https://sux.di.unimi.it).
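For membership-style queries, the dense version is almost trivial: one load and a bit test (a sketch; note that the post answers rank/lower_bound queries, which would need a rank structure on top of this):

```rust
// Dense bitset over the full u32 range: 2^32 bits = 2^26 u64 words = 512 MiB.
struct U32Bitset {
    words: Vec<u64>,
}

impl U32Bitset {
    fn new() -> Self {
        Self { words: vec![0u64; 1 << 26] }
    }
    fn insert(&mut self, x: u32) {
        self.words[(x >> 6) as usize] |= 1u64 << (x & 63);
    }
    fn contains(&self, x: u32) -> bool {
        (self.words[(x >> 6) as usize] >> (x & 63)) & 1 != 0
    }
}
```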
Hmm, bitmaps are an interesting idea!
If the data is dense enough, then yeah I guess a quick linear scan would work.
There's also the extreme of simply storing the answer for each possible query as a u32 and just indexing the array, but there the overhead is much larger.
Rank-select is also interesting, but I doubt it comes anywhere close in performance.
I happen to also be working on a minimal perfect hashing project that has way higher throughput than other methods (mostly because of batching), see https://curiouscoding.nl/posts/ptrhash-paper/ and the first post linked from there.
Nice work! I love the care and discussion around low level optimizations, as such things are often ignored.
There are a lot of interesting variants of rank/select and other succinct data structures, which have interesting tradeoffs. They're maybe most useful when compression is critical. At least my own data structures can't compete with the query performance you're showing, but they are great when the memory footprint matters.
I'm not up to that, but I have a working S-tree builder and a basic AVX2-using `lower-bound` for it (as described in the Algorithmica article linked in the post) up and running in SBCL. I haven't played with any of the fancier optimizations yet, much less done any benchmarking. I should put it up in a gist or paste site to share...
This is really interesting and a thorough write up. Thanks to the author for sharing their work.
Whenever I read about super-low-level optimisation, my immediate feeling is gratitude to the author for spending so much time shaving off nanoseconds that the entire SE community then gets to enjoy.
I wonder how much time humanity has collectively saved simply as a result of how all these optimisations stack up.
Assuming this is going to save someone 10ns per query in the end, and I spent 150h (that's about 5 * 10^14 ns) on coding and writing this up, we only(?) need around 10^14 queries to make it worth it!
But yes, as new technologies stack up, we achieve exponential growth in throughput in the end, which usually enables new science :)
Still, Zig seems somewhat more intuitive to me, even though I’ve used it as little as Rust thus far.
A nice write-up on the topic of Zig and Rust: https://matklad.github.io/2023/03/26/zig-and-rust.html
Scientific stuff (scipy) and now all the ML stuff too
Walk the larger tree, using the smaller tree.
That's not terribly common.