mekpro · 3 years ago
The ironic part is that the latest Intel CPUs no longer support AVX-512, while AMD now provides a better AVX-512 implementation in Zen 4.

https://www.anandtech.com/show/17047/the-intel-12th-gen-core...

https://www.phoronix.com/review/amd-epyc-9004-genoa

ivegotnoaccount · 3 years ago
Isn't it only on their consumer lines that Intel removed AVX-512?

Sapphire Rapids is indicated as having support for it.

I know that even though you pointed to an EPYC CPU, all Zen 4 chips support it; Intel, on the other hand, probably released it more for professional users than for their non-pro lines.

WJW · 3 years ago
Yes, it's only Alder Lake (i.e. the cheaper, consumer-oriented CPUs) from which it has been removed. Server chips still have it AFAIK.

Even on Alder Lake, the official explanation is that it has both P(erformance) and E(fficiency) cores, with the E cores being significantly more power efficient and the P cores being significantly faster. The P cores have AVX512, the E cores don't. Since most kernels have no idea that this is a thing and treat all CPUs in a system as equal, they will happily schedule code with AVX512 instructions on an E core. This obviously crashes, since the CPU doesn't know how to handle those instructions. Some motherboard manufacturers allowed you to work around this by simply turning off all the E-cores, so only the P-cores (with AVX512 support) remained. Intel was not a fan of this and eventually disabled AVX512 in hardware.

As ivegotnoaccount mentioned, the Sapphire Rapids range of CPUs will have AVX512. Those are not intended for the typical consumer or mobile platform though, but for servers and big workstations where power consumption is much less of a concern. You would probably not want such a chip in your laptop or phone.

fancyfredbot · 3 years ago
AVX-512 is wide enough to process eight 64-bit floats at once. Getting a 10x speedup out of an 8-wide SIMD unit is a little difficult to explain; some of the speedup is presumably coming from fewer branch instructions in addition to the vector width. It's extremely impressive. Also, it has taken Intel a surprisingly long time!
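The branch point can be made concrete with a hypothetical scalar sketch (not code from either library) of a branch-free partition step: the comparison result selects a write index instead of steering a jump, so random data causes no mispredictions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Partition v around pivot into out without a data-dependent branch.
// Elements < pivot fill out from the front; the rest fill it from the back.
// Returns the number of elements smaller than the pivot.
size_t partition_branchless(const std::vector<uint64_t>& v, uint64_t pivot,
                            std::vector<uint64_t>& out) {
    out.resize(v.size());
    size_t lo = 0, hi = v.size();
    for (uint64_t x : v) {
        bool small = x < pivot;
        hi -= !small;              // reserve a back slot if x is "large"
        out[small ? lo : hi] = x;  // typically compiles to a cmov, not a jump
        lo += small;               // advance the front slot if x is "small"
    }
    return lo;
}
```

On random input, a conventional branching partition mispredicts roughly half the time; AVX-512's compress-store instructions express the same idea eight 64-bit lanes at a time.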
dragontamer · 3 years ago
L1 cache on Intel machines reads/writes in 512-bit chunks. So you get a 2x faster L1 cache when working with AVX512 on Intel IIRC.

Or perhaps more accurately: L1 cache that can process twice the data in the same amount of time.

usrusr · 3 years ago
That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)


skavi · 3 years ago
AVX-512 has masks and a lot of new instructions. It's not just wider.
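For readers unfamiliar with masking, here is a scalar model (illustrative only, not real intrinsics code) of a merge-masked add in the style of `_mm512_mask_add_epi64`: lanes whose mask bit is clear keep the destination's old value, which lets you vectorize loops containing per-element `if`s.

```cpp
#include <array>
#include <cstdint>

// Scalar model of AVX-512 merge masking: lane i gets a[i] + b[i]
// only if bit i of mask is set; otherwise it keeps dst[i].
std::array<uint64_t, 8> masked_add(std::array<uint64_t, 8> dst,
                                   const std::array<uint64_t, 8>& a,
                                   const std::array<uint64_t, 8>& b,
                                   uint8_t mask) {
    for (int i = 0; i < 8; ++i) {
        if (mask & (1u << i)) dst[i] = a[i] + b[i];
    }
    return dst;
}
```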
VHRanger · 3 years ago
My understanding is that AVX-512 also has a lot more instructions, so composing something less naturally parallel (e.g. simdjson) is easier with it.
adgjlsfhk1 · 3 years ago
avx512 also gives you 2x more register space which can be very useful.
lodi · 3 years ago
It's not "8-wide", it's "512 bits wide". The basic "Foundation" profile supports splitting those bits into 8 qwords, 16 dwords, etc., while other profiles support finer granularity, down to 64 byte-wide lanes. Plus you get more registers, new instructions, and so on.
IanCutress · 3 years ago
AVX2 is like a portion of a pie without filling. AVX512 is like a full pie with extra filling. You're getting filling, not simply more pie.
sagarm · 3 years ago
On an Ice Lake GCE instance, Highway's vqsort was 40% faster when sorting uint64s. vqsort also doesn't require AVX-512, and supports a wider array of types (including 128-bit integers and 64-bit k/v pairs), so it's more useful IMO. It's a much heavier-weight dependency, though.

Code / scripts here: https://github.com/funrollloops/parallel-sort-bench

I had to use a cloud instance for testing since I don't have an avx512-capable personal machine.

janwas · 3 years ago
Thanks for sharing the benchmark :D Is there anything we could do to make Highway/vqsort (feel like) a lighter dependency?
sagarm · 3 years ago
I made that comment because the Intel library is header only. While header only libraries can be convenient, for non-trivial projects I prefer a well engineered CMake-based build for better compile times.
sigtstp · 3 years ago
Interesting! The benchmark appears to be using only random data though. Any measurements for partially sorted or reverse sorted data?
sagarm · 3 years ago
Good question. I added a test for sorted runs in addition to random data, and added pdqsort for comparison.

avx512_qsort benefits from sorted runs while vqsort does not, but vqsort is still 20% faster in this case as well.

The full results are in the README, but a short version:

  --------------------------------------------------
  Benchmark                            Time      CPU
  --------------------------------------------------
  TbbSort/random                     3.17 s   16.7 s
  HwySort/random                     2.27 s   2.27 s
  StdPartitionHwySort/random         2.06 s   4.01 s
  PdqSort/random                     5.67 s   5.67 s
  IntelX86SIMDSort/random            3.73 s   3.73 s
  TbbSort/sorted_runs                1.58 s   8.02 s
  HwySort/sorted_runs                2.38 s   2.38 s
  StdPartitionHwySort/sorted_runs    1.11 s   3.21 s
  PdqSort/sorted_runs                5.30 s   5.30 s
  IntelX86SIMDSort/sorted_runs       2.90 s   2.90 s

eliasmacpherson · 3 years ago
Here's the vqsort discussion from the last time I saw it on this site: https://news.ycombinator.com/item?id=31622548
posnet · 3 years ago
It would be interesting to see it benchmarked against the highway qsort[1] Google published last year.

[1] https://github.com/google/highway/tree/master/hwy/contrib/so...

janwas · 3 years ago
sagarm has posted one result in another thread. I'll also look into adding their code to our benchmark :)

It's great to see more vector code, but a caveat for anyone using this: the pivot sampling is quite basic, just the median of 16 evenly spaced samples. This will perform poorly on skewed distributions, including all-equal inputs and inputs with very few unique values. Yes, in the worst case it can fall back to std::sort, but that's a >10x speed hit and until recently also potentially O(N^2)!

We draw larger samples (nine vectors, not one), and have since extended the vqsort algorithm beyond what is described in our paper, e.g. with special handling for 1..3 unique keys; see https://github.com/google/highway/blob/master/hwy/contrib/so....
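To make the failure mode concrete, here is a hypothetical sketch of a median-of-16 pivot (the function name and details are ours, not the library's exact code): on all-equal input every sample is identical, so the "pivot" separates nothing and the recursion makes no progress.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pick the median of 16 evenly spaced samples as the pivot.
// Degenerates on skewed data: if most samples are equal, the pivot
// fails to split the partition and quicksort recursion stalls.
uint64_t pivot_of_16(const std::vector<uint64_t>& v) {
    std::vector<uint64_t> s;
    const size_t step = std::max<size_t>(1, v.size() / 16);
    for (size_t i = 0; i < v.size() && s.size() < 16; i += step) {
        s.push_back(v[i]);
    }
    // Place the median of the samples at the middle position.
    std::nth_element(s.begin(), s.begin() + s.size() / 2, s.end());
    return s[s.size() / 2];
}
```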

janwas · 3 years ago
I've posted bench_sort results in another thread. vqsort is about 1.8 times as fast for uniform random 32/64-bit.
mumumu · 3 years ago
Now we only need a consumer CPU from Intel with AVX-512 enabled.
kristianp · 3 years ago
For consumer CPUs, you can go back to 11th-gen Intel if you want AVX-512 support.[1] Not ideal, I know.

[1] https://blog.reyem.dev/post/which-consumer-computers-support...

Kon-Peki · 3 years ago
Not listed on that page is the Microsoft Surface Laptop Go, which has the same i5-1135G7 as the X1 Carbon listed.

It appears that MS is clearing out their remaining stock with discounts, and they are really nice little machines with outstanding build quality, very good keyboards, and a 3:2 touchscreen.

It was never a popular machine; I think it had very unfortunate naming, which leads people to confuse it with other MS products. You have to think of it as something like a super-premium Chromebook to understand what it's for. But regardless, you can dump Windows and install Linux just fine.

stephencanon · 3 years ago
RIP Icelake, we hardly knew you.
smcl · 3 years ago
divbzero · 3 years ago
Dedicated hardware for common algorithms is actually not very far fetched. In addition to GPUs, we already have examples of HSMs [1] and TPUs [2] that optimize for specific cryptographic and machine learning operations.

[1]: https://en.wikipedia.org/wiki/Hardware_security_module

[2]: https://en.wikipedia.org/wiki/Tensor_Processing_Unit

orangeoxidation · 3 years ago
There's video de- and encoding as well.
garbagecoder · 3 years ago
Well, really, that's basically the only place left to go at the moment. I don't think we're likely to see 10 GHz any time soon, or 1,024 cores either. Specialized circuits are probably all that's left before we start hitting asymptotes.
erk__ · 3 years ago
smcl · 3 years ago
This is incredible, I feel like I am manifesting things by posting HN comments. What'll they think of next, one billion euros in cash hidden in my wardrobe!?
sylware · 3 years ago
This is probably the vectorized quicksort. I remember the paper detailing the algorithm here on HN.

Since then, I've known that if I ever really need to sort numbers very fast, I'll have to learn the vectorized way to quicksort.

I would probably write it directly in assembly though, with a C API (because tinycc, cproc, scc...).

Blackthorn · 3 years ago
Thanks Intel! Now maybe you can release some processors for those of us at home that actually have AVX-512!
hummus_bae · 3 years ago
Please.