AVX 512 has a number of instructions that make it useful for UTF-8 parsing, floating point parsing, XML parsing, JSON parsing, things like that. It is tricky coding, though.
All things that are good for HFT but also good for speeding up your web browser and maybe even saving power because you can dial down the clock rate. It's a tragedy that Intel is fusing off AVX 512 in consumer parts so they can stuff the chips with thoroughly pizzled phonish low-performance cores.
Vector instructions (a la Cray) are future proof in ways that SIMD is not. Eventually every high performance CPU will have HBM, and I am not sure SIMD is the mechanism we will use to extract the most efficiency out of the platform.
I feel that SIMD vs vector is not the most useful distinction to make.
Yes, at the ISA and binutils level there is a difference in terms of number of instructions and ugly encodings.
But the actual application source code can look the same for SIMD and vector - vector-length-agnostic style is helpful for both.
Instead, it seems more interesting to focus on preparing more applications for either SIMD or vector: with data-parallel algorithms, avoiding branches etc.
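As a sketch of that style (my example, not from the thread): a loop with no data-dependent branches and no loop-carried dependency can be compiled to packed SIMD on x86 or to vector-length-agnostic code on SVE/RVV without source changes.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-free clamping loop: no data-dependent branches, no
 * loop-carried dependency, so a compiler can map it onto SSE/AVX
 * packed instructions or onto SVE/RVV vector-length-agnostic code. */
void clamp_u8(uint8_t *dst, const int16_t *src, size_t n,
              int16_t lo, int16_t hi)
{
    for (size_t i = 0; i < n; i++) {
        int16_t v = src[i];
        v = v < lo ? lo : v;   /* compiles to a min, not a branch */
        v = v > hi ? hi : v;   /* compiles to a max, not a branch */
        dst[i] = (uint8_t)v;
    }
}
```

The same data-parallel source works whether the target has 128-bit packed registers or runtime-sized vectors; only the codegen differs.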
Of course when you only have packed instructions (a la SIMD) you have no choice but to use them in place of vector instructions, but vector instructions in general do not fully subsume packed instructions. I think packed instructions are better at exploiting opportunistic parallelism from either a short loop (where vector instructions would have higher overhead) or a series of similar but slightly different scalar instructions.
Not sure if you're referring to high frequency trading here, but if that's the case then it's just not true.
AVX is infamously bad for low latency because these instructions generate so much heat that the CPU gets downclocked (sometimes the cores on the neighboring tiles on the SoC too). P-state transitions are a serious source of execution jitter and a big no-no in our business.
Many HFT shops actually disable AVX with noxsave.
It's fantastically useful for HPC though.
I think I found there’s also a quite expensive ‘warmup’ which makes much of AVX only worth it for larger loops. But my memory or original understanding might be incorrect, and my understanding of why that warmup might be the case (maybe microcode caches in the front end??) is very poor.
Didn't read the post? That is explicitly mentioned.
> Yes, yes, yes, you can use it in random places. It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
I couldn't agree more. I don't think compiler vectorization is that useful even for the columnar (!) database we're building. The specialized JIT doesn't even use AVX-512 because it's too much effort for little to no gain.
> And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
How much of that is due to language semantics? Can't FORTRAN compilers make better use of the AVX instructions than, say, a C or Rust compiler could due to memory reference semantics?
I think that's what "if you aren't into parsing (but not using) JSON" is supposed to be referencing.
And, "It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it."
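For anyone curious what those memmove/strchr-style routines do internally, here's a hedged scalar (SWAR) sketch of the core trick; the real glibc versions do the same thing with 16/32/64-byte SIMD registers selected at load time.

```c
#include <stdint.h>
#include <string.h>

/* SWAR sketch of the core trick inside memchr/strchr-style routines:
 * test 8 bytes per iteration for a match instead of one byte at a time.
 * XOR turns a matching byte into zero, and the classic zero-byte test
 * (v - 0x01..) & ~v & 0x80.. is nonzero iff some byte of v is zero. */
static int has_byte(uint64_t w, uint8_t b)
{
    uint64_t x = w ^ (0x0101010101010101ULL * b); /* matching byte -> 0 */
    return ((x - 0x0101010101010101ULL) & ~x
            & 0x8080808080808080ULL) != 0;
}
```

The SIMD versions are the same idea with a broadcast, a compare, and a movemask, over 16-64 bytes per iteration instead of 8.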
One thing I learned in grad school is you can do a stupendous number of floating point operations in the time it takes to serialize a matrix to ASCII numbers and deserialize it. Whatever it is you are doing with a JSON document might be slower than parsing it.
It's true that autovectorization accomplishes very little but specialized libraries could have a notable effect on perceived performance if they were widely developed and used.
Frankly Intel has had less interest in getting you to buy a new computer by making it the best computer you ever bought than it has been in taking as much revenue as it can from the rest of the BOM, for instance the junk integrated graphics that sabotaged the Windows Vista launch and has been making your computer crash faster ever since. Another example is that they don't ship MKL out of the box on Windows or Linux although they do on MacOS. And Intel wonders why their sales are slipping...
I don't think Intel has even decided whether to fuse them off or not. In my newest Intel desktop, a Core i7-13700K, AVX-512 is actually available. It wasn't available on the i7-12700K that was in the same system on the same motherboard a few weeks ago.
I thought the reason they wanted to fuse them off was so that all cores would be completely compatible (just slower), without the OS needing to force scheduling onto specific cores. (I'd guess what would otherwise have to happen is: if an instruction not available on an efficiency core were executed, it would trap, the OS would see it, and then mark the process as only executable on performance cores?)
Generally speaking, the idea that the average person uses more than 4 cores is insane. Even as a power user/dev I can count the times when I needed more than 4 cores on one hand.
All the benchmarks that these CPUs are evaluated by test either games, or run the same crazy stuff like fluid simulations that the OP's article complains about. Hardware reviewers need a reality check.
I hate to break it to you, but outside of gaming, fluid sim, and battery life / power draw tests, nothing else matters for modern chips. Schmucks only need the kind of horsepower a phone is capable of, or want to know how long it will last. The only people who care about performance either really care because more perf = more dollars (fluid sim, hft, etc.) or they're gamers.
>"Although AVX-512 was not fuse-disabled on certain early Alder Lake desktop products, Intel plans to fuse off AVX-512 on Alder Lake products going forward." -Intel Spokesperson to Tom's Hardware.
Whether or not AVX 512 is the future, Intel's handling of this left a sour taste for me, and I'd rather be future-proof in case it does gain traction, since I do use a CPU for many years before building a new system. Intel's big/little cores (with nothing else new of note compared to 5+ years ago) offer nothing that future-proofs my workflows. 16 cores of equally performant power with the latest instruction sets does.
I very recently upgraded and had considered both Raptor Lake and Zen 4 CPU options, ultimately going with the latter due to, among other considerations, the AVX-512 support.
Future proofing is no doubt a valid consideration, but some of these benefits are already here today. For example, in a recent Linux distro (using glibc?), attach a debugger to just about any process and break on something like memmove or memset in libc, and you can see some AVX-512 code paths will be taken if your CPU supports these instructions.
Have you been programming with the Zen 4? I bought one, and I've been using the avx512 intrinsics via C++ and Rust (LLVM Clang for both), and I've been a little underwhelmed by the performance. Like say using Agner Fog's vector library, I'm getting about a 20% speedup going from avx2 to avx512. I was hoping for 2x.
If I need to crunch small amounts of data in a hurry, existing instructions are fine for that.
If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.
I honestly don't understand who/what AVX512 is really for, other than artificial benchmarks that are intentionally engineered to depend on AVX512 far more often than any real-world application would.
Wasn't the issue that efficiency cores don't support AVX-512 and that operating system schedulers/software don't deal with this yet and end up running AVX-512 code on the efficiency cores?
That's a terrible CPU design. You might as well ship arm cores if you are going to have a mismatch of instruction set support on efficiency vs power cores.
Especially since apps using AVX-512 will likely sniff for it at runtime. So now you have a thread that will break on you if it gets rescheduled onto an efficiency core. So now what, does the app dev need to start sending "capabilities requests" when it makes new threads to ensure the OS knows not to put its stuff on an efficiency core?
On the other hand, you can now virtually guarantee that a GPGPU is present on Intel consumer chips. So now you can write your code 3 times; no vector acceleration (for the really old stuff), AVX-512 for the servers, and GPU for the consumer chips!
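For reference, the runtime sniffing usually looks something like this (a minimal sketch using the GCC/Clang `__builtin_cpu_supports` builtin) - and it is exactly the check that goes wrong if a thread can later migrate to a core without the feature:

```c
/* Typical runtime "sniffing": probe once at startup and pick an
 * implementation.  __builtin_cpu_supports is a GCC/Clang builtin on
 * x86.  Note the problem discussed above: on a chip where the OS can
 * move the thread to a core lacking AVX-512, this one-time check is
 * not enough. */
int use_avx512(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return 0; /* non-x86: no AVX-512 */
#endif
}
```

Real dispatchers cache this result in a function pointer, which is precisely why a mid-run migration to a non-AVX-512 core would fault.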
AVX-512 in general purpose CPUs was designed to not really make sense but start seeding the market at 10nm, vaguely make sense at 7nm, and really make sense starting at 5nm (or Intel's 14nm (Intel10) -> 10nm (Intel7) -> 7nm (Intel4)).
So Intel's process woes have been hurting their ability to execute in a meaningful way. Additionally Alder Lake was hurt by needing to pivot to heterogeneous cores (E cores and P cores), which hurt their ability to keep pushing this even in a 'maybe it makes sense, maybe it doesn't' state it had been in.
AVX-512 is in Alder Lake P-cores but not E-cores. AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogeneous ISA extensions. AVX-512 could be enabled by disabling E-cores.
It was supposedly maybe actually taken out of Raptor Lake, but the latest info I can find on that is from July: long before it was released. I have a Raptor Lake CPU but haven't found the time to experiment with disabling E-cores (far too busy overclocking memory and making sensible rainmeter skins).
Early Alder Lake could but they fused it off in the later ones and newer microcode also blocks it on the older processors.
Raptor Lake is basically "v-cache alder lake" (the major change is almost doubling cache size) so it's unsurprising they still don't have an answer there, and if they did it could be backported into Alder Lake, but they don't seem to have an immediate fix for AVX-512 in either generation.
Nobody really knows why, or what's up with Intel's AVX-512 strategy in general. I have not heard a fully conclusive/convincing answer in general.
The most convincing idea to me is that Intel didn't want to deal with a heterogeneous ISA between cores; maybe they are worried about the long-term tail of support that a mixed generation will entail.
Long term I think they will move to heterogeneous cores/architectures but with homogeneous ISA. The little cores will have it too, emulated if necessary, and that will solve all the CPUID-style "how many cores should I launch and what type" problems. That still permits big.little architecture mixtures, but fixes the wonky stuff with having different ISAs between cores.
There are probably some additional constraints like Samsung or whomever bumped into with their heterogeneous-ISA architectures... like cache line size probably should not vary between big/little cores either. That will pin a few architectural features together, if they (or their impacts) are observable outside the "black box".
> AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogenous ISA extensions
It was already opt-in (disabled unless you also disable efficiency cores), that is no justification to make it impossible to use for people who want to try it out.
But I suppose Intel just doesn't want people to write software using those new instructions.
Linus is correct that auto-vectorization is basically useless, but I have some minor disagreements. JSON is obviously not the only format or protocol that can be parsed using SIMD instructions. In general computers spend a lot of time parsing byte streams, and almost all of these could be parsed at gigabytes per second if someone familiar with the protocol, the intel intrinsics guide, and the tricks used in simdjson spent an afternoon on them. It's sort of unfortunate that our current software ecosystem makes it difficult for many developers to adopt these sorts of specialized parsers written in low-level languages and targeting specific vector extensions.
If someone wants to use a hash table or TLS, they may find AES-NI useful.
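To make the "parsing byte streams" point concrete, here's the first stage of a simdjson-style parser in scalar form (my sketch, not simdjson's actual code): classify every byte as structural or not, branch-free, so the same loop can be done 64 bytes at a time with SIMD compares emitting one bitmask per block.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form of simdjson's first stage: mark which bytes are JSON
 * structural characters.  The SIMD version does the same classification
 * on 64 bytes at once with packed compares and collects the results
 * into a 64-bit mask. */
void mark_structurals(const uint8_t *buf, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t c = buf[i];
        /* bitwise OR of compares: branch-free, vectorizer-friendly */
        out[i] = (c == '{') | (c == '}') | (c == '[') | (c == ']')
               | (c == ':') | (c == ',') | (c == '"');
    }
}
```

Later stages then walk only the marked positions, which is where the gigabytes-per-second throughput comes from.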
Part of his argument though was that for most cases where you're parsing JSON (or some other protocol), the parsing time is small compared to the amount of time you spend actually doing things with the data. And for most of the examples where the code is mostly parsing (say, an application doing RPC routing, where the service just decodes an RPC and then forwards it to another host) you can often use tricks to avoid parsing the whole document anyway. For example, if you have a protocol based on protobuf (like gRPC) you can put the fields needed for routing in the lowest number fields, and then use protobuf APIs to decode the protobuf lazily/incrementally; this way you will usually just need to decode the first (or perhaps the first few) fields, not the entire protobuf.
Making code faster and more efficient is always better, it's just that you might not see much improvement in end-to-end latency times if you make your protocol parsing code faster.
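The lazy/partial-decode idea can be sketched without any protobuf library at all, since the wire format is just tag/length/value records; the field number and helper below are hypothetical illustrations, not a real protobuf API.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal LEB128/varint reader (protobuf's integer encoding). */
static size_t read_varint(const uint8_t *p, size_t n, uint64_t *out)
{
    uint64_t v = 0;
    size_t i = 0;
    while (i < n && (p[i] & 0x80)) {
        v |= (uint64_t)(p[i] & 0x7F) << (7 * i);
        i++;
    }
    if (i < n) {
        v |= (uint64_t)p[i] << (7 * i);
        i++;
    }
    *out = v;
    return i;
}

/* Sketch of "decode only the routing field": assume field 1 is a
 * length-delimited routing key placed first in the message.  Return
 * its length and a pointer to its bytes, never touching the rest of
 * the payload.  Returns 0 if the message doesn't start with field 1. */
size_t routing_key(const uint8_t *msg, size_t n, const uint8_t **key)
{
    uint64_t tag, len;
    size_t off = read_varint(msg, n, &tag);
    if (tag != ((1u << 3) | 2))     /* field 1, wire type 2 */
        return 0;
    off += read_varint(msg + off, n - off, &len);
    *key = msg + off;
    return (size_t)len;
}
```

Because fields are encoded in order, putting the routing key in the lowest field number means a router only ever parses the first few bytes of each message.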
I think this is maybe true in practice because business logic is usually written really badly, but a lot of the time the work that must be done is ~linear in the size of the message.
One of the specific design properties of AVX-512 is to make auto-vectorization less useless. If Intel had committed to more predictable and wide support, we might be seeing the fruits of that now. Whether it would have been a "Game changer" or just a slight improvement, I don't know, but even slight improvements in cpu performance are worth it these days.
Made me smirk, as games are what likely would have profited from it.
Games also tend to be some of the heavier compute applications outside of servers.
Though it likely would be mostly just some code here and there deep within the game engine.
But then this is also an example where auto-vectorization can matter, as it means writing platform-independent code which happens to be fast on the platforms you care about most, but also happens to compile and run, with some potential (but only potential) performance penalty, when ported to other platforms. Not only does this make porting easier, for some games it might also be fine if the engine is a bit slower. The main drawback is that writing such "auto-vectorized" code isn't easy and can be brittle.
I'm freshly out of the weeds of implementing some fixed-point decimal parsers, and only found ~1 cycle that I could save (at best) per iteration using avx512.
I largely agree with Linus here. It's still a huge pain to do anything non-trivially-parallelizable with simd even in an avx512 world
1. I have NEVER been satisfied with the results of a compiler autovectorization, outside of the most trivial examples. Even when they can autovectorize, they fail to do basic transformations that greatly unlock ILP (to be fair these aren't allowed by FP rules since it changes operation ordering)
2. Gather/scatter still have awful performance, worse than manually doing the loads and piecemeal assembling the vector. The fancy 'results-partially-delivered-in-order-on-trap' semantics probably don't help. Speeding up these operations would instantly unlock SO much more performance (see my username)
3. The whole 'extract-upper-bit-to-bitmask' -> 'find-first-set-in-bitmask' really deserves an instruction to merge the last two (return index of first byte with uppermost bit set, etc). It's such a core series of instructions and ime often the bottleneck (each of those is 3 cycles)
4. The performance of some of the nicer avx512 intrinsics leaves a lot to be desired. Compress for example is neat in theory but is too slow for anything I've tried in real life. Masked loads are another example, cool in theory but in practice MUCH faster to replicate manually.
5. It still feels like there's a ton of really random/arbitrary holes in the instruction set. AVX512 closed some of these but introduced many more with the limited operations you can do on masks and expensive mask <-> integer conversions.
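For point 3, here's the scalar (SWAR) shape of the movemask-then-find-first-set pair fused into one helper, using the GCC/Clang `__builtin_ctzll` builtin; it illustrates the two-instruction sequence the commenter wishes were one instruction.

```c
#include <stdint.h>

/* Fused "movemask + tzcnt" over a 64-bit word: return the index
 * (0-7, little-endian byte order) of the first byte whose top bit is
 * set, or 8 if there is none.  __builtin_ctzll is a GCC/Clang builtin;
 * ctz of the masked word gives the bit position, and /8 converts it
 * to a byte index. */
int first_high_bit_byte(uint64_t w)
{
    uint64_t m = w & 0x8080808080808080ULL;
    if (m == 0)
        return 8;
    return __builtin_ctzll(m) / 8;
}
```

In vector code the same dance is a compare, a movemask into a GPR, then a tzcnt - three dependent operations on what is conceptually one query.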
Some of the AVX512 instructions are nice additions though, even if you only use it on YMM or XMM registers. All the masked instructions make it much simpler to vectorize loops over variable amounts of data without special treatment for the head/tail of the data.
This is exactly right—-the 512b width is the least interesting thing about AVX512. All the new instructions on 128, 256, and 512b vectors, (finally) good support for unsigned integers, static rounding modes for floating point, and predication are much more exciting.
I don't mind head/tail stuff so much but the masked instructions and vcompress make a lot of "do some branchy different stuff to different data or filter a list" easier and faster compared to AVX2, so I'm a big fan
Some of the biggest investments into general purpose Linux performance, reliability, and sustainability were made by people with "specialty engines" that Linus is talking about. The specialty applications often turn out to be important because they push the platform to excel in unexpected ways.
With that said, the biggest practical problem we saw with AVX-512 was that it's so big, power hungry, and dissimilar to other components of the CPU. One tenant turning that thing on can throttle the whole chip, causing unpredictable performance and even reliability issues for everything on the machine. I imagine it's even worse on mobile/embedded, where turning it on can blow out your power and thermal budget. The alternatives (integrated GPUs/ML cores) seem to be more power efficient.
It's not terribly dissimilar from the SSE instructions before it. The biggest weirdness about AVX is its CPU throttling. Which, IMO, should not be a thing for the reasons you mention.
I imagine that future iterations of AVX will likely ditch CPU throttling. I'm guessing it was a "our current design can't handle this and we need to ship!" sort of solution. (If they haven't already? Do current CPUs throttle AVX-512 instructions?)
> I imagine it's even worse on mobile/embedded
Fortunately, x86 mobile/embedded isn't terribly common, certainly not with AVX-512 support.
> I imagine that future iterations of AVX will likely ditch CPU throttling. I'm guessing it was a "our current design can't handle this and we need to ship!" sort of solution. (If they haven't already? Do current CPUs throttle AVX-512 instructions?)
exactly what it looks like to me, and it was already fixed in Ice Lake and probably doesn't exist at all in Zen4. Just nobody cares about Ice Lake-SP when Epyc is crushing the server market.
the whole thing looks very much to me like "oops we put not just AVX-512 but dual AVX-512 in our cores and on 14nm it can hammer the chip so hard we need more voltage... guess we're gonna have to stop and swing the voltage upwards".
not only is that less of a problem with higher-efficiency nodes, but the consumer core designs drop the idea of dual-AVX 512 entirely which reduces the worst-case voltage droop as well...
The turbo curves depend on the exact type of Xeon. Were they Bronze or Silver? Those are much worse in that regard.
It is also unclear to me that AVX-512 is necessarily problematic in terms of power and thermal effects. Scalar code on many threads actually uses more energy (dispatch/control of OoO instructions being the main driver) than vector code (running far fewer instructions). The much-maligned power-up phase does not matter if running vector code for at least a millisecond.
I agree dedicated HW can be more power-efficient still, assuming it does exactly what your algorithm requires. And yet they make the fragmentation and deployment issues (can I use it?) of SIMD look trivial by comparison :)
> One tenant turning that thing on can throttle the whole chip,
This wasn't related to AVX-512, it was related to its implementation on that specific, somewhat old, architecture. Current AMD and next generation Intel don't have this limitation, so it shouldn't be considered when thinking of "the future".
> One tenant turning that thing on can throttle the whole chip, causing unpredictable performance and even reliability issues for everything on the machine
This is ironically where the double-pumping of Zen 4 shines; it's less of a problem there.
The problem with AVX isn’t that compilers find it difficult to emit enough of these instructions, or that people don’t often work on problems that are easily transformed to SIMD.
It’s enough if a small number of libraries use these instructions, and a lot larger number of programs can benefit.
The problem with AVX-512 in my opinion is the sparse hardware support, power/thermal problems and noisy neighbor phenomenon.
To be useful, SIMD instructions need to be widely available and must not have adverse power effects such as making other programs go much slower.
Firstly, there are in fact a lot of problems that benefit from parallelization, but are not quite wide enough to justify dealing with (say) a GPU. For example, even in graphics you often have medium-sized rasterization tasks that are much more convenient to run on CPUs. Audio processing is similar. Often you have chunks of vectorizable work inter-mixed with CPU-friendly pointer chasing, and the ability to dispatch SIMD work within nanoseconds of SISD work is very helpful.
Secondly, AVX-512 (and Arm's SVE) support predication and scatter/gather ops, both of which open the door to much more aggressive forms of automatic vectorization.
https://lemire.me/blog/2022/05/25/parsing-json-faster-with-i...
https://people.eecs.berkeley.edu/~krste/papers/yunsup-ms.pdf
https://parlab.eecs.berkeley.edu/sites/all/parlab/files/mave...
https://www.cs.uic.edu/~ajayk/c566/VectorProcessors.pdf
https://inst.eecs.berkeley.edu/~cs152/sp18/lectures/L16-RISC...
https://developer.arm.com/Architectures/Scalable%20Vector%20...
A browser, one process per tab, software updates continually spawning in the background…
Though I do agree that the benefits of high core count CPUs are oversold as far as most users are concerned.
AVX512 is the future, but maybe not Intel's future, ironically.
What a dumb design decision.
And if Intel had wanted this, they could have had the kernel handle the fault and pin the process to the big cores.
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...
and some measurements of Skylake-X displaying the effect, note the voltage swings in particular: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
Already happening:
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
So despite running many codes that use all cores 100%, using AVX512 is the only way I've seen them throttle.
There are still measurable effects but not like on skylake, to my knowledge at least.