Dwedit · 3 years ago
If you are exclusively moving numbers around in an array, SIMD sorting sounds great.

But when you want to actually sort things, it's different. You have objects in an array, and each object has one member, a key, that you sort the objects on.

In order to create a sorted permutation of the list, you either need to rearrange the list of object pointers (fast), rearrange the memory of the objects (slow), or you simply output a list of array indexes that represents what the sort order should be (fast).

Code that doesn't physically move the object memory around creates indirection. Indirection makes any SIMD sort depend on scatter/gather to get data in. Scatter/gather causes random memory accesses, which don't cache as well as sequential access.

ww520 · 3 years ago
It might be easier for SIMD to work by extracting the key value from each object into its own array.

Form an array with fixed element size: [ {index0 | key0}, {index1 | key1}, ...], where indices are the element index to the original array of objects and the keys are fixed size. The fat element {index | key} is moved as a whole unit to carry the original index along with the swap during sorting. SIMD can deal with fixed size units very well.
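
A minimal sketch of the fat-element idea (C++, names are mine): pack a 32-bit key and a 32-bit index into one uint64_t with the key in the high bits, so an ordinary unsigned compare of the packed word orders by key while the original index rides along through every swap.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack {key | index}: key in the high 32 bits, index in the low 32 bits.
// Sorting the packed words orders by key first; equal keys then tie-break
// on index, which also makes the sort stable.
std::vector<uint32_t> sorted_order(const std::vector<uint32_t>& keys) {
    std::vector<uint64_t> packed(keys.size());
    for (uint32_t i = 0; i < keys.size(); ++i)
        packed[i] = (uint64_t(keys[i]) << 32) | i;

    std::sort(packed.begin(), packed.end());  // a SIMD sort slots in here

    std::vector<uint32_t> order(keys.size());
    for (size_t i = 0; i < packed.size(); ++i)
        order[i] = uint32_t(packed[i]);       // low 32 bits = original index
    return order;
}
```

The fixed 8-byte elements are exactly what SIMD compare/min/max lanes want to move.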

nickelpro · 3 years ago
That would be called a gather operation, which is itself slow, exactly as the OP comment said.
dan-robertson · 3 years ago
Yeah I find the thing you generally want but which many languages don’t support well is to have a struct-of-arrays or data frame layout rather than an array of objects. Then you can do one of:

Sort one array and rearrange other arrays the same way

Compute the array of indices into another array such that array[index[i]] <= array[index[i+1]], i.e. the sorted version of the array is then [array[i] for i in index]. If you have a vectorised index operation (e.g. APL, R, …), getting the sorted array is one line of code.

I think with simd you’d want to sort your array and simultaneously permute either a second array or the array of indices 1, 2, …, n so you can use that to do accesses to corresponding elements in your other associated columns.
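
A scalar sketch of the first option in C++ (names are illustrative): sort one column of a struct-of-arrays layout and apply the same permutation to an associated column.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Struct-of-arrays: keys and values are parallel columns.
// Sort by the key column and permute the value column the same way.
void sort_columns(std::vector<int>& keys, std::vector<double>& values) {
    std::vector<size_t> index(keys.size());
    std::iota(index.begin(), index.end(), 0);  // 0, 1, ..., n-1
    std::sort(index.begin(), index.end(),
              [&](size_t i, size_t j) { return keys[i] < keys[j]; });

    std::vector<int> k(keys.size());
    std::vector<double> v(values.size());
    for (size_t i = 0; i < index.size(); ++i) {  // apply the permutation
        k[i] = keys[index[i]];
        v[i] = values[index[i]];
    }
    keys = std::move(k);
    values = std::move(v);
}
```

The index array here is the "sort one array, permute the rest" glue; any further associated columns reuse the same index.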

ddlutz · 3 years ago
Maybe Google's new "Rune" language, which supports SoA, will become prevalent: https://github.com/google/rune
hoten · 3 years ago
Last I checked, Jai was going this direction (a type modifier that allowed one to do either AOS or SOA) but abandoned it (or at least, abandoned the particular syntax JB originally prototyped).
watersb · 3 years ago
I've been playing with Intel's "Implicitly Parallel SIMD Compiler" (ISPC) [0].

The ISPC User Guide was easy for me to work through, and there's a section on "Structure of Arrays" (SOA) programming [1].

This approach to vector-parallel programming was new to me, but it got me to try some C code for the kernel transformation. My simple applications are probably fast enough in Python, but I thought it quite worthwhile to look at tiny kernels and the code the compiler generates.

- [0] https://ispc.github.io/index.html

- [1] https://ispc.github.io/ispc.html#structure-of-array-types

srcreigh · 3 years ago
Yeah. Whether this is faster than normal sort is a big "depends" based on how the sorted data is accessed.

SIMD sort with indirection would be nice if you only need to access the top 25 or 1/255 of the items in the array or something. Faster sorting helps pay off the cost of cache misses when accessing the other fields.

If you need to sort and then process every item with associated data, I wouldn't be surprised if quicksort on the structs themselves is faster. The cache misses would add up.

geysersam · 3 years ago
Would using a structure-of-arrays (instead of an array-of-structures) make this less of a problem?
Dwedit · 3 years ago
Okay, let's suppose we are defining the problem like this:

• ObjectArray, an Array of objects, will be treated as immutable

• IndexArray, an array of ints, indexes into the first array. Initialize to [0...N). After sort finishes, this contains an ordering into the first array.

Now we want to run the sort.

Every time we do a comparison, we read from IndexArray, and use that index into ObjectArray. Every time we do a swap, we only mutate IndexArray.

There is guaranteed to be at least one level of indirection, because you need to read from IndexArray to determine which object is actually there. So you get a scatter-gather requirement here for SIMD code.
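
Sketched in C++ (the Object type is illustrative): the comparator reads through IndexArray into the immutable ObjectArray, and only IndexArray is ever mutated.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

struct Object { int key; std::string payload; };

// ObjectArray is never touched. Every comparison does the extra
// indirect (gather-like) read; every swap moves only an int index.
std::vector<int> sort_indices(const std::vector<Object>& objects) {
    std::vector<int> index(objects.size());
    std::iota(index.begin(), index.end(), 0);  // initialize to [0...N)
    std::sort(index.begin(), index.end(), [&](int a, int b) {
        return objects[a].key < objects[b].key;
    });
    return index;  // an ordering into ObjectArray
}
```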

Now for the different ways to represent the array:

• Plain pointers to objects (like C# reference types)

• One object after the other in a contiguous memory block (Array of structures)

• All Keys contiguous in memory (Structure of arrays)

If you go with Plain Pointers to objects (reference types in C#), you get an extra read to some object in memory that was allocated elsewhere. Can be a cache miss.

If you go with Big Memory Block with objects inside (Array of structures), you avoid the read of the plain pointer value. You calculate the address of a key with a Multiply (by size of object) and an Add (base address). The data that's loaded into the processor cache may be unnecessary for sorting.

If you go with All Keys Contiguous in Memory (Structure of arrays), you calculate the address of a key with a Bitshift (by size of key) and an Add (base address). Keeping the keys contiguous also helps caching, since you load only keys into the cache at that time.
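
The last two layouts differ only in the address arithmetic and in what rides along into the cache; roughly (sizes and types are illustrative):

```cpp
#include <cstddef>
#include <cstdint>

struct Object { int64_t key; char payload[56]; };  // one 64-byte AoS element

// Array of structures: key i lives at base + i * sizeof(Object).
// The 56 payload bytes share the cache line with the key.
int64_t key_aos(const Object* objects, size_t i) {
    return objects[i].key;  // multiply + add; pulls in 64 bytes per key
}

// Structure of arrays: key i lives at base + (i << 3).
// A 64-byte cache line holds 8 consecutive keys and nothing else.
int64_t key_soa(const int64_t* keys, size_t i) {
    return keys[i];         // shift + add; dense key traffic
}
```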

thecompilr · 3 years ago
The original (AFAICT) work on SIMD quicksort, also mentioned in the Google post, implemented pointer sorting as well, by loading the pointed-to key with gather instructions; the method can be used for an array of structs. https://github.com/vkrasnov/avx_qsort/blob/master/qsort_AVX2...
camuel · 3 years ago
I wonder why no one mentions bitonic sort? If you want to do anything in SIMD, you'd better avoid branching as much as possible... and ideally altogether. Here is an implementation I co-authored some 10 years ago: https://github.com/zerovm/zerovm-samples/blob/master/disort/...

Sorting networks, which were already mentioned, seem similar but a bit too abstract.

My code above doesn't contain values, but those are easy to add, I think. Of course, it is better to permute fixed-size pointers/offsets rather than the entire blobs, which can be of variable size and would complicate everything beyond what's feasible for SIMD.

anovikov · 3 years ago
SIMD is a fantastic tool. My first try with AVX2 (256-bit), building an "overlay with transparency" for two YUVA420P frames, yielded a speed better than 1 pixel per CPU cycle! And I didn't even try optimising it all that much.
bmc7505 · 3 years ago
There is an interesting algorithm for sorting that is highly amenable to parallelization and which I thought this article was going to be about: https://en.wikipedia.org/wiki/Sorting_network
cjalmeida · 3 years ago
Thought about that as well. A few years ago I implemented a SIMD sort using those and mergesort.
ww520 · 3 years ago
Has anyone been working on using SIMD with sorting networks? For the 16-input network, at each stage in the first half of the pipeline, the comparisons have no dependencies on each other and seem to be good candidates for SIMD.
janwas · 3 years ago
Yes indeed. IIRC std::sort's sorting network is only 3-7 elements wide. With SIMD we can handle 256. The trick is to minimize shuffling by reshaping 1D input to 2D and first sorting the columns. Previously one would then transpose (expensive) and then again sort columns. We instead fuse those two steps, see https://arxiv.org/abs/2205.05982.
0xf00ff00f · 3 years ago
I played a bit with sorting 8 floats in 2 SSE registers with a sorting network. I had a bit of trouble shuffling the registers so the values would be in the right places, so I ended up brute-forcing it:

https://github.com/0xf00ff00f/short-simd-sorter

I don't know if you could use the same idea to sort 16 values in 2 AVX registers. There's probably a better way.

ww520 · 3 years ago
That's pretty cool.

The compare_and_swap ops of the sorting network can be done in SIMD using _mm_min_epi32, _mm_max_epi32, and _mm_blend_epi32. For the 8-value network, they can operate on 4 pairs of the values and 8 pairs for 16-value network, as long as the pairs have no dependency with each other.
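
A scalar sketch of that branchless compare_and_swap on a small network (the min/max pair is exactly what _mm_min_epi32/_mm_max_epi32 compute, lane-wise, for four independent pairs at once):

```cpp
#include <algorithm>
#include <array>

// Branchless compare_and_swap: afterwards a holds the smaller value
// and b the larger. One _mm_min_epi32/_mm_max_epi32 pair does this
// for 4 independent pairs per instruction.
inline void compare_and_swap(int& a, int& b) {
    int lo = std::min(a, b);
    int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// 4-input sorting network (5 comparators, depth 3). The comparators in
// each of the first two stages are independent, so each stage could run
// as a single SIMD min/max pair.
std::array<int, 4> sort4(std::array<int, 4> v) {
    compare_and_swap(v[0], v[1]);  // stage 1
    compare_and_swap(v[2], v[3]);
    compare_and_swap(v[0], v[2]);  // stage 2
    compare_and_swap(v[1], v[3]);
    compare_and_swap(v[1], v[2]);  // stage 3
    return v;
}
```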

Const-me · 3 years ago
I recently tried to do that as well, but failed. Specifically, I implemented AA sort [1], but for my use case the performance was about the same as C++'s std::sort, at a substantial code-complexity cost. I reverted to std::sort. The code is on GitHub [2].

Still, in that particular project the vectors being sorted are relatively small, typically under 100 KB, so I only implemented the "inner" algorithm, which works on a single CPU core. The complete AA sort algorithm was apparently designed for large vectors and uses both SIMD and multithreading. It might still be useful for very long vectors.

[1] https://ieeexplore.ieee.org/document/4336211

[2] https://github.com/Const-me/fTetWild/blob/master/MeshRepair/...

janwas · 3 years ago
We also experimented with 4x4 networks but it's very helpful to use 256 or 512-bit SIMD. You might give our vqsort a try, it's quite a bit faster than std::sort: https://github.com/google/highway/blob/master/hwy/contrib/so...
anonymoushn · 3 years ago
This is a good explanation. Writing the masks in big-endian somewhat obscures things.

As an aside, while recent developments in quicksort are quite good, it seems like MSB radix sort performs comparably to (or slightly better than!) these algorithms on random inputs.

bee_rider · 3 years ago
Random inputs for which you don’t know the bounds?
anonymoushn · 3 years ago
Yes. If the input is bounded in a narrow range compared to the length of the input, it seems like it should actually be a bit slower than normal, since the first few passes may do nothing and the later passes will need to be run on more of the input than normal. We're using it for arrays of doubles that have fairly small exponents, though, and this causes us to have lots of empty buckets in the first pass, and it performs well enough.
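
For reference, a minimal MSB (most-significant-byte-first) radix sort sketch on unsigned 32-bit keys, in the recursive bucketed form (untuned; real implementations fall back to insertion sort for small buckets). On narrow-range input the first pass indeed lands everything in one bucket:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Recursive MSB radix sort on uint32_t, one byte per pass.
// shift starts at 24 (top byte); empty buckets cost almost nothing.
void msb_radix_sort(std::vector<uint32_t>& a, size_t lo, size_t hi, int shift) {
    if (hi - lo < 2 || shift < 0) return;

    size_t count[256] = {0};               // histogram of the current byte
    for (size_t i = lo; i < hi; ++i)
        ++count[(a[i] >> shift) & 0xFF];

    size_t start[257];                     // prefix sums -> bucket bounds
    start[0] = lo;
    for (int b = 0; b < 256; ++b) start[b + 1] = start[b] + count[b];

    std::vector<uint32_t> tmp(a.begin() + lo, a.begin() + hi);
    size_t next[256];
    std::copy(start, start + 256, next);
    for (uint32_t v : tmp)                 // scatter into buckets
        a[next[(v >> shift) & 0xFF]++] = v;

    for (int b = 0; b < 256; ++b)          // recurse on the next byte
        msb_radix_sort(a, start[b], start[b + 1], shift - 8);
}
```

Call it as msb_radix_sort(v, 0, v.size(), 24).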
xphos · 3 years ago
This is a well-written SIMD example. I find these are the hardest topics to explain clearly because of how much is happening. Anyone got other nice links? I've been doing a lot of vector processing, so I need to learn!
gww · 3 years ago
Someone previously shared this YouTube playlist on HN, and I have found it to be very helpful [1].

[1] https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg...

kristianp · 3 years ago
You probably know this, but Daniel Lemire's blog often has simple examples using AVX-512 and/or branchless algorithms:

https://news.ycombinator.com/from?site=lemire.me