Isn't it only on their consumer lines that Intel removed AVX-512?
Sapphire Rapids is indicated as having support for it.
I know that even though you pointed to an EPYC CPU, all Zen 4 chips support it; Intel, though, probably released it more for professional users than for their non-pro ones.
Yes, it's only Alder Lake (i.e. the cheaper, consumer-oriented CPUs) in which it has been removed. Server chips still have it, AFAIK.
Even on Alder Lake, the official explanation is that it has both P(erformance) and E(fficiency) cores, with the E-cores being significantly more power-efficient and the P-cores significantly faster. The P-cores have AVX-512, the E-cores don't. Since most kernels have no idea this is a thing and treat all CPUs in a system as equal, they will happily schedule code with AVX-512 instructions on an E-core. This obviously crashes, since that core doesn't know how to handle those instructions. Some motherboard manufacturers let you work around this by simply turning off all the E-cores, so only the P-cores (with AVX-512 support) remained. Intel was not a fan of this and eventually disabled AVX-512 in hardware.
As ivegotnoaccount mentioned, the Sapphire Rapids range of CPUs will have AVX512. Those are not intended for the typical consumer or mobile platform though, but for servers and big workstations where power consumption is much less of a concern. You would probably not want such a chip in your laptop or phone.
AVX-512 is wide enough to process eight 64-bit floats at once. Getting a 10x speedup out of an 8-wide SIMD unit is a little difficult to explain; some of the speedup is presumably coming from fewer branch instructions in addition to the vector width. It's extremely impressive. Also, it has taken Intel a surprisingly long time!
That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)
It's not "8-wide", it's "512 bits wide". The basic "Foundation" profile supports splitting those bits into 8 qwords, 16 dwords, etc., while other profiles support finer granularity, down to 64 single-byte lanes. Plus you get more registers, new instructions, and so on.
On an Ice Lake GCE instance, Highway's vqsort was 40% faster when sorting uint64s. vqsort also doesn't require AVX-512, and supports a wider array of types (including 128-bit integers and 64-bit k/v pairs), so it's more useful IMO. It's a much heavier-weight dependency, though.
I made that comment because the Intel library is header only. While header only libraries can be convenient, for non-trivial projects I prefer a well engineered CMake-based build for better compile times.
Good question. I added a test for sorted runs in addition to random data, and added pdqsort for comparison.
avx512_qsort benefits from sorted runs while vqsort does not, but vqsort is still 20% faster in this case as well.
The full results are in the README, but a short version:
--------------------------------------------------
Benchmark                           Time       CPU
--------------------------------------------------
TbbSort/random                    3.17 s    16.7 s
HwySort/random                    2.27 s    2.27 s
StdPartitionHwySort/random        2.06 s    4.01 s
PdqSort/random                    5.67 s    5.67 s
IntelX86SIMDSort/random           3.73 s    3.73 s
TbbSort/sorted_runs               1.58 s    8.02 s
HwySort/sorted_runs               2.38 s    2.38 s
StdPartitionHwySort/sorted_runs   1.11 s    3.21 s
PdqSort/sorted_runs               5.30 s    5.30 s
IntelX86SIMDSort/sorted_runs      2.90 s    2.90 s
sagarm has posted one result in another thread. I'll also look into adding their code to our benchmark :)
It's great to see more vector code, but a caveat for anyone using this: the pivot sampling is quite basic, just the median of 16 evenly spaced samples. This will perform poorly on skewed distributions, including all-equal and very-few-unique-value inputs. Yes, in the worst case it can fall back to std::sort, but that's a >10x speed hit and, until recently, also potentially O(N^2)!
We have drawn larger samples (nine vectors, not one), and subsequently extended the vqsort algorithm beyond what is described in our paper, e.g. special handling for 1..3 unique keys, see https://github.com/google/highway/blob/master/hwy/contrib/so....
Not listed on that page is the Microsoft Surface Laptop Go, which has the same i5-1135G7 as the X1 Carbon listed.
It appears that MS is clearing out their remaining stock with discounts, and they are really nice little machines with outstanding build quality, very good keyboards, and a 3:2 touchscreen.
It was never a popular machine; I think it had very unfortunate naming, which led people to confuse it with other MS products. You have to think of it as something like a super-premium Chromebook to understand what it is for. But regardless, you can dump Windows and install Linux just fine.
Dedicated hardware for common algorithms is actually not very far fetched. In addition to GPUs, we already have examples of HSMs [1] and TPUs [2] that optimize for specific cryptographic and machine learning operations.
Well, really, that's basically the only place left to go at the moment. I don't think we're likely to have 10 GHz any time soon, or 1,024 cores either. Specialized circuits are probably all that's left before we start hitting asymptotes.
This is incredible, I feel like I am manifesting things by posting HN comments. What'll they think of next, one billion euros in cash hidden in my wardrobe!?
https://www.anandtech.com/show/17047/the-intel-12th-gen-core...
https://www.phoronix.com/review/amd-epyc-9004-genoa
Or perhaps more accurately: L1 cache that can process twice the data in the same amount of time.
Code / scripts here: https://github.com/funrollloops/parallel-sort-bench
I had to use a cloud instance for testing since I don't have an avx512-capable personal machine.
[1] https://github.com/google/highway/tree/master/hwy/contrib/so...
[1] https://blog.reyem.dev/post/which-consumer-computers-support...
[1]: https://en.wikipedia.org/wiki/Hardware_security_module
[2]: https://en.wikipedia.org/wiki/Tensor_Processing_Unit
Since then, I know that if I really need to sort numbers very fast one day, I would have to learn the vectorized way to quicksort.
I would probably write it directly in assembly though, with a C API (because of tinycc, cproc, scc...).