AVX 512 has a number of instructions that make it useful for UTF-8 parsing, floating point parsing, XML parsing, JSON parsing, things like that. It is tricky coding, though.
All things that are good for HFT but also good for speeding up your web browser and maybe even saving power because you can dial down the clock rate. It's a tragedy that Intel is fusing off AVX 512 in consumer parts so they can stuff the chips with thoroughly pizzled phonish low-performance cores.
Vector instructions (a la Cray) are future proof in ways that SIMD is not. Eventually every high performance CPU will have HBM, and I am not sure SIMD is the mechanism we will use to extract the most efficiency out of the platform.
I feel that SIMD vs vector is not the most useful distinction to make.
Yes, at the ISA and binutils level there is a difference in terms of number of instructions and ugly encodings.
But the actual application source code can look the same for SIMD and vector - vector-length-agnostic style is helpful for both.
Instead, it seems more interesting to focus on preparing more applications for either SIMD or vector: with data-parallel algorithms, avoiding branches etc.
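As a sketch of that style (my example, not from the thread): a loop with no data-dependent branches and no loop-carried dependency can be compiled to packed SIMD on x86 or to vector-length-agnostic code on SVE/RVV without source changes.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-free clamping loop: no data-dependent branches, no
 * loop-carried dependency, so a compiler can map it onto SSE/AVX
 * packed instructions or onto SVE/RVV vector-length-agnostic code. */
void clamp_u8(uint8_t *dst, const int16_t *src, size_t n,
              int16_t lo, int16_t hi)
{
    for (size_t i = 0; i < n; i++) {
        int16_t v = src[i];
        v = v < lo ? lo : v;   /* compiles to a min, not a branch */
        v = v > hi ? hi : v;   /* compiles to a max, not a branch */
        dst[i] = (uint8_t)v;
    }
}
```

The same data-parallel source works whether the target has 128-bit packed registers or runtime-sized vectors; only the codegen differs.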
Of course when you only have packed instructions (a la SIMD) you have no choice but to use them in place of vector instructions, but vector instructions in general do not fully subsume packed instructions. I think packed instructions are better at exploiting opportunistic parallelism from either a short loop (where vector instructions would have higher overhead) or a series of similar but slightly different scalar instructions.
Not sure if you're referring to high frequency trading here, but if that's the case then it's just not true.
AVX is infamously bad for low latency because these instructions generate so much heat that the CPU gets downclocked (sometimes the cores on the neighboring tiles on the SoC too). P-state transitions are a serious source of execution jitter and a big no-no in our business.
Many HFT shops actually disable AVX with noxsave.
It's fantastically useful for HPC though.
I think I found there’s also a quite expensive ‘warmup’ which makes much of AVX only worth it for larger loops. But my memory or original understanding might be incorrect, and my understanding of why that warmup might be the case (maybe microcode caches in the front end??) is very poor.
Didn't read the post? That is explicitly mentioned.
> Yes, yes, yes, you can use it in random places. It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
I couldn't agree more. I don't think compiler vectorization is that useful even for the columnar (!) database we're building. The specialized JIT doesn't even use AVX-512 because it's too much effort for little to no gain.
> And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it.
How much of that is due to language semantics? Can't FORTRAN compilers make better use of the AVX instructions than, say, a C or Rust compiler could due to memory reference semantics?
I think that's what "if you aren't into parsing (but not using) JSON" is supposed to be referencing.
And, "It's not only some JSON parsing library, you'll find it in other libraries too. I suspect the most use it gets in many situations is - drum roll - implementing memmove/strchr/strcmp style things. And in absolutely zero of those cases will it have been done by the compiler auto-vectorizing it."
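For anyone curious what those memmove/strchr-style routines do internally, here's a hedged scalar (SWAR) sketch of the core trick; the real glibc versions do the same thing with 16/32/64-byte SIMD registers selected at load time.

```c
#include <stdint.h>
#include <string.h>

/* SWAR sketch of the core trick inside memchr/strchr-style routines:
 * test 8 bytes per iteration for a match instead of one byte at a time.
 * XOR turns a matching byte into zero, and the classic zero-byte test
 * (v - 0x01..) & ~v & 0x80.. is nonzero iff some byte of v is zero. */
static int has_byte(uint64_t w, uint8_t b)
{
    uint64_t x = w ^ (0x0101010101010101ULL * b); /* matching byte -> 0 */
    return ((x - 0x0101010101010101ULL) & ~x
            & 0x8080808080808080ULL) != 0;
}
```

The SIMD versions are the same idea with a broadcast, a compare, and a movemask, over 16-64 bytes per iteration instead of 8.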
One thing I learned in grad school is you can do a stupendous number of floating point operations in the time it takes to serialize a matrix to ASCII numbers and deserialize it. Whatever it is you are doing with a JSON document might be slower than parsing it.
It's true that autovectorization accomplishes very little but specialized libraries could have a notable effect on perceived performance if they were widely developed and used.
Frankly Intel has had less interest in getting you to buy a new computer by making it the best computer you ever bought than it has been in taking as much revenue as it can from the rest of the BOM, for instance the junk integrated graphics that sabotaged the Windows Vista launch and has been making your computer crash faster ever since. Another example is that they don't ship MKL out of the box on Windows or Linux although they do on MacOS. And Intel wonders why their sales are slipping...
I don't think Intel has even decided whether to fuse them off or not. In my newest Intel desktop, a Core i7-13700K, AVX-512 is actually available. It wasn't available on the i7-12700K that was in the same system on the same motherboard a few weeks ago.
I thought the reason they wanted to fuse them off was so that all cores would be completely compatible (just slower), without the OS needing to force scheduling onto specific cores. (I'd guess what would otherwise have to happen is: if an instruction not available on an efficiency core were executed, it would trap, the OS would see it, and then mark the process as only executable on performance cores?)
Generally speaking, the idea that the average person uses more than 4 cores is insane. Even as a power user/dev I can count the times when I needed more than 4 cores on one hand.
All the benchmarks that these CPUs are evaluated by test either games, or run the same crazy stuff like fluid simulations that the OP's article complains about. Hardware reviewers need a reality check.
I hate to break it to you, but outside of gaming, fluid sim, and battery life / power draw tests, nothing else matters for modern chips. Schmucks only need the kind of horsepower a phone is capable of, or want to know how long it will last. The only people who care about performance either really care because more perf = more dollars (fluid sim, hft, etc.) or they're gamers.
>"Although AVX-512 was not fuse-disabled on certain early Alder Lake desktop products, Intel plans to fuse off AVX-512 on Alder Lake products going forward." -Intel Spokesperson to Tom's Hardware.
Whether or not AVX 512 is the future, Intel's handling of this left a sour taste for me, and I'd rather be future-proof in case it does gain traction, since I do use a CPU for many years before building a new system. Intel's big/little cores (with nothing else new of note compared to 5+ years ago) offer nothing that future-proofs my workflows. 16 cores of equally performant power with the latest instruction sets does.
I very recently upgraded and had considered both Raptor Lake and Zen 4 CPU options, ultimately going with the latter due to, among other considerations, the AVX-512 support.
Future proofing is no doubt a valid consideration, but some of these benefits are already here today. For example, in a recent Linux distro (using glibc?), attach a debugger to just about any process and break on something like memmove or memset in libc, and you can see some AVX-512 code paths will be taken if your CPU supports these instructions.
Have you been programming with the Zen 4? I bought one, and I've been using the avx512 intrinsics via C++ and Rust (LLVM Clang for both), and I've been a little underwhelmed by the performance. Like say using Agner Fog's vector library, I'm getting about a 20% speedup going from avx2 to avx512. I was hoping for 2x.
If I need to crunch small amounts of data in a hurry, existing instructions are fine for that.
If I need to crunch large amounts of data in a hurry, I'll send it to a GPU. The CPU has no chance to compete with that.
I honestly don't understand who/what AVX512 is really for, other than artificial benchmarks that are intentionally engineered to depend on AVX512 far more often than any real-world application would.
Wasn't the issue that efficiency cores don't support AVX-512 and that operating system schedulers/software don't deal with this yet and end up running AVX-512 code on the efficiency cores?
That's a terrible CPU design. You might as well ship arm cores if you are going to have a mismatch of instruction set support on efficiency vs power cores.
Especially since apps using AVX-512 will likely sniff for it at runtime. So now you have a thread that will break on you if it gets rescheduled onto an efficiency core. So now what, does the app dev need to start sending "capabilities requests" when it makes new threads to ensure the OS knows not to put its stuff on an efficiency core?
On the other hand, you can now virtually guarantee that a GPGPU is present on Intel consumer chips. So now you can write your code 3 times; no vector acceleration (for the really old stuff), AVX-512 for the servers, and GPU for the consumer chips!
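For reference, the runtime sniffing usually looks something like this (a minimal sketch using the GCC/Clang `__builtin_cpu_supports` builtin) - and it is exactly the check that goes wrong if a thread can later migrate to a core without the feature:

```c
/* Typical runtime "sniffing": probe once at startup and pick an
 * implementation.  __builtin_cpu_supports is a GCC/Clang builtin on
 * x86.  Note the problem discussed above: on a chip where the OS can
 * move the thread to a core lacking AVX-512, this one-time check is
 * not enough. */
int use_avx512(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return 0; /* non-x86: no AVX-512 */
#endif
}
```

Real dispatchers cache this result in a function pointer, which is precisely why a mid-run migration to a non-AVX-512 core would fault.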
AVX-512 in general purpose CPUs was designed to not really make sense but start seeding the market at 10nm, vaguely make sense at 7nm, and really make sense starting at 5nm (or Intel's 14nm (Intel10) -> 10nm (Intel7) -> 7nm (Intel4)).
So Intel's process woes have been hurting their ability to execute in a meaningful way. Additionally Alder Lake was hurt by needing to pivot to heterogeneous cores (E cores and P cores), which hurt their ability to keep pushing this even in a 'maybe it makes sense, maybe it doesn't' state it had been in.
AVX-512 is in Alder Lake P-cores but not E-cores. AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogeneous ISA extensions. AVX-512 could be enabled by disabling E-cores.
It was supposedly maybe actually taken out of Raptor Lake, but the latest info I can find on that is from July: long before it was released. I have a Raptor Lake CPU but haven't found the time to experiment with disabling E-cores (far too busy overclocking memory and making sensible rainmeter skins).
Early Alder Lake could but they fused it off in the later ones and newer microcode also blocks it on the older processors.
Raptor Lake is basically "v-cache alder lake" (the major change is almost doubling cache size) so it's unsurprising they still don't have an answer there, and if they did it could be backported into Alder Lake, but they don't seem to have an immediate fix for AVX-512 in either generation.
Nobody really knows why, or what's up with Intel's AVX-512 strategy in general. I have not heard a fully conclusive/convincing answer in general.
The most convincing idea to me is that Intel didn't want to deal with a heterogeneous ISA between cores; maybe they are worried about the long-term tail of support that a mixed generation will entail.
Long term I think they will move to heterogeneous cores/architectures but with homogeneous ISA. The little cores will have it too, emulated if necessary, and that will solve all the CPUID-style "how many cores should I launch and what type" problems. That still permits big.little architecture mixtures, but fixes the wonky stuff with having different ISAs between cores.
There are probably some additional constraints like Samsung or whomever bumped into with their heterogeneous-ISA architectures... like cache line size probably should not vary between big/little cores either. That will pin a few architectural features together, if they (or their impacts) are observable outside the "black box".
> AVX-512 is disabled in Alder Lake CPUs because the world is not ready for heterogenous ISA extensions
It was already opt-in (disabled unless you also disable efficiency cores), that is no justification to make it impossible to use for people who want to try it out.
But I suppose Intel just doesn't want people to write software using those new instructions.
Linus is correct that auto-vectorization is basically useless, but I have some minor disagreements. JSON is obviously not the only format or protocol that can be parsed using SIMD instructions. In general computers spend a lot of time parsing byte streams, and almost all of these could be parsed at gigabytes per second if someone familiar with the protocol, the intel intrinsics guide, and the tricks used in simdjson spent an afternoon on them. It's sort of unfortunate that our current software ecosystem makes it difficult for many developers to adopt these sorts of specialized parsers written in low-level languages and targeting specific vector extensions.
If someone wants to use a hash table or TLS, they may find AES-NI useful.
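To make the "parsing byte streams" point concrete, here's the first stage of a simdjson-style parser in scalar form (my sketch, not simdjson's actual code): classify every byte as structural or not, branch-free, so the same loop can be done 64 bytes at a time with SIMD compares emitting one bitmask per block.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form of simdjson's first stage: mark which bytes are JSON
 * structural characters.  The SIMD version does the same classification
 * on 64 bytes at once with packed compares and collects the results
 * into a 64-bit mask. */
void mark_structurals(const uint8_t *buf, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t c = buf[i];
        /* bitwise OR of compares: branch-free, vectorizer-friendly */
        out[i] = (c == '{') | (c == '}') | (c == '[') | (c == ']')
               | (c == ':') | (c == ',') | (c == '"');
    }
}
```

Later stages then walk only the marked positions, which is where the gigabytes-per-second throughput comes from.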
Part of his argument though was that for most cases where you're parsing JSON (or some other protocol), the parsing time is small compared to the amount of time you spend actually doing things with the data. And for most of the examples where the code is mostly parsing (say, an application doing RPC routing, where the service just decodes an RPC and then forwards it to another host) you can often use tricks to avoid parsing the whole document anyway. For example, if you have a protocol based on protobuf (like gRPC) you can put the fields needed for routing in the lowest number fields, and then use protobuf APIs to decode the protobuf lazily/incrementally; this way you will usually just need to decode the first (or perhaps the first few) fields, not the entire protobuf.
Making code faster and more efficient is always better, it's just that you might not see much improvement in end-to-end latency times if you make your protocol parsing code faster.
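The lazy/partial-decode idea can be sketched without any protobuf library at all, since the wire format is just tag/length/value records; the field number and helper below are hypothetical illustrations, not a real protobuf API.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal LEB128/varint reader (protobuf's integer encoding). */
static size_t read_varint(const uint8_t *p, size_t n, uint64_t *out)
{
    uint64_t v = 0;
    size_t i = 0;
    while (i < n && (p[i] & 0x80)) {
        v |= (uint64_t)(p[i] & 0x7F) << (7 * i);
        i++;
    }
    if (i < n) {
        v |= (uint64_t)p[i] << (7 * i);
        i++;
    }
    *out = v;
    return i;
}

/* Sketch of "decode only the routing field": assume field 1 is a
 * length-delimited routing key placed first in the message.  Return
 * its length and a pointer to its bytes, never touching the rest of
 * the payload.  Returns 0 if the message doesn't start with field 1. */
size_t routing_key(const uint8_t *msg, size_t n, const uint8_t **key)
{
    uint64_t tag, len;
    size_t off = read_varint(msg, n, &tag);
    if (tag != ((1u << 3) | 2))     /* field 1, wire type 2 */
        return 0;
    off += read_varint(msg + off, n - off, &len);
    *key = msg + off;
    return (size_t)len;
}
```

Because fields are encoded in order, putting the routing key in the lowest field number means a router only ever parses the first few bytes of each message.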
I think this is maybe true in practice because business logic is usually written really badly, but a lot of the time the work that must be done is ~linear in the size of the message.
One of the specific design properties of AVX-512 is to make auto-vectorization less useless. If Intel had committed to more predictable and wide support, we might be seeing the fruits of that now. Whether it would have been a "Game changer" or just a slight improvement, I don't know, but even slight improvements in cpu performance are worth it these days.
Made me smirk, as games are what likely would have profited from it.
Games also tend to be some of the heavier compute applications outside of servers.
Though it likely would be mostly just some code here and there deep within the game engine.
But then this is also an example where auto-vectorization can matter, as it means writing platform-independent code which happens to be fast on the platforms you care about most, but also happens to compile and run, with some potential (but only potential) performance penalty, when ported to other platforms. Not only does this make porting easier, for some games it might also be fine if the engine is a bit slower. The main drawback is that writing such "auto-vectorized" code isn't easy and can be brittle.
I'm freshly out of the weeds of implementing some fixed-point decimal parsers, and only found ~1 cycle that I could save (at best) per iteration using avx512.
I largely agree with Linus here. It's still a huge pain to do anything non-trivially-parallelizable with simd even in an avx512 world
1. I have NEVER been satisfied with the results of a compiler autovectorization, outside of the most trivial examples. Even when they can autovectorize, they fail to do basic transformations that greatly unlock ILP (to be fair these aren't allowed by FP rules since it changes operation ordering)
2. Gather/scatter still have awful performance, worse than manually doing the loads and piecemeal assembling the vector. The fancy 'results-partially-delivered-in-order-on-trap' semantics probably don't help. Speeding up these operations would instantly unlock SO much more performance (see my username)
3. The whole 'extract-upper-bit-to-bitmask' -> 'find-first-set-in-bitmask' really deserves an instruction to merge the last two (return index of first byte with uppermost bit set, etc). It's such a core series of instructions and ime often the bottleneck (each of those is 3 cycles)
4. The performance of some of the nicer avx512 intrinsics leaves a lot to be desired. Compress for example is neat in theory but is too slow for anything I've tried in real life. Masked loads are another example, cool in theory but in practice MUCH faster to replicate manually.
5. It still feels like there's a ton of really random/arbitrary holes in the instruction set. AVX512 closed some of these but introduced many more with the limited operations you can do on masks and expensive mask <-> integer conversions.
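For point 3, here's the scalar (SWAR) shape of the movemask-then-find-first-set pair fused into one helper, using the GCC/Clang `__builtin_ctzll` builtin; it illustrates the two-instruction sequence the commenter wishes were one instruction.

```c
#include <stdint.h>

/* Fused "movemask + tzcnt" over a 64-bit word: return the index
 * (0-7, little-endian byte order) of the first byte whose top bit is
 * set, or 8 if there is none.  __builtin_ctzll is a GCC/Clang builtin;
 * ctz of the masked word gives the bit position, and /8 converts it
 * to a byte index. */
int first_high_bit_byte(uint64_t w)
{
    uint64_t m = w & 0x8080808080808080ULL;
    if (m == 0)
        return 8;
    return __builtin_ctzll(m) / 8;
}
```

In vector code the same dance is a compare, a movemask into a GPR, then a tzcnt - three dependent operations on what is conceptually one query.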
Some of the AVX512 instructions are nice additions though, even if you only use it on YMM or XMM registers. All the masked instructions make it much simpler to vectorize loops over variable amounts of data without special treatment for the head/tail of the data.
This is exactly right—-the 512b width is the least interesting thing about AVX512. All the new instructions on 128, 256, and 512b vectors, (finally) good support for unsigned integers, static rounding modes for floating point, and predication are much more exciting.
I don't mind head/tail stuff so much but the masked instructions and vcompress make a lot of "do some branchy different stuff to different data or filter a list" easier and faster compared to AVX2, so I'm a big fan
Some of the biggest investments into general purpose Linux performance, reliability, and sustainability were made by people with "specialty engines" that Linus is talking about. The specialty applications often turn out to be important because they push the platform to excel in unexpected ways.
With that said, the biggest practical problem we saw with AVX-512 was that it's so big, power hungry, and dissimilar to other components of the CPU. One tenant turning that thing on can throttle the whole chip, causing unpredictable performance and even reliability issues for everything on the machine. I imagine it's even worse on mobile/embedded, where turning it on can blow out your power and thermal budget. The alternatives (integrated GPUs/ML cores) seem to be more power efficient.
It's not terribly dissimilar from the SSE instructions before it. The biggest weirdness about AVX is its CPU throttling. Which, IMO, should not be a thing for the reasons you mention.
I imagine that future iterations of AVX will likely ditch CPU throttling. I'm guessing it was a "our current design can't handle this and we need to ship!" sort of solution. (If they haven't already? Do current CPUs throttle AVX-512 instructions?)
> I imagine it's even worse on mobile/embedded
Fortunately, x86 mobile/embedded isn't terribly common, certainly not with AVX-512 support.
> I imagine that future iterations of AVX will likely ditch CPU throttling. I'm guessing it was a "our current design can't handle this and we need to ship!" sort of solution. (If they haven't already? Do current CPUs throttle AVX-512 instructions?)
exactly what it looks like to me, and it was already fixed in Ice Lake and probably doesn't exist at all in Zen4. Just nobody cares about Ice Lake-SP when Epyc is crushing the server market.
the whole thing looks very much to me like "oops we put not just AVX-512 but dual AVX-512 in our cores and on 14nm it can hammer the chip so hard we need more voltage... guess we're gonna have to stop and swing the voltage upwards".
not only is that less of a problem with higher-efficiency nodes, but the consumer core designs drop the idea of dual-AVX 512 entirely which reduces the worst-case voltage droop as well...
The turbo curves depend on the exact type of Xeon. Were they Bronze or Silver? Those are much worse in that regard.
It is also unclear to me that AVX-512 is necessarily problematic in terms of power and thermal effects. Scalar code on many threads actually uses more energy (dispatch/control of OoO instructions being the main driver) than vector code (running far fewer instructions). The much-maligned power-up phase does not matter if running vector code for at least a millisecond.
I agree dedicated HW can be more power-efficient still, assuming it does exactly what your algorithm requires. And yet they make the fragmentation and deployment issues (can I use it?) of SIMD look trivial by comparison :)
> One tenant turning that thing on can throttle the whole chip,
This wasn't related to AVX-512, it was related to its implementation on that specific, somewhat old, architecture. Current AMD and next generation Intel don't have this limitation, so it shouldn't be considered when thinking of "the future".
> One tenant turning that thing on can throttle the whole chip, causing unpredictable performance and even reliability issues for everything on the machine
This is ironically where the double-pumping of Zen 4 shines; it's less of a problem there.
The problem with AVX isn’t that compilers find it difficult to emit enough of these instructions, or that people don’t often work on problems that are easily transformed to SIMD.
It’s enough if a small number of libraries use these instructions, and a lot larger number of programs can benefit.
The problem with AVX-512 in my opinion is the sparse hardware support, power/thermal problems and noisy neighbor phenomenon.
To be useful, SIMD instructions need to be widely available and must not have adverse power effects such as making other programs go much slower.
Firstly, there are in fact a lot of problems that benefit from parallelization, but are not quite wide enough to justify dealing with (say) a GPU. For example, even in graphics you often have medium-sized rasterization tasks that are much more convenient to run on CPUs. Audio processing is similar. Often you have chunks of vectorizable work inter-mixed with CPU-friendly pointer chasing, and the ability to dispatch SIMD work within nanoseconds of SISD work is very helpful.
Secondly, AVX-512 (and Arm's SVE) support predication and scatter/gather ops, both of which open the door to much more aggressive forms of automatic vectorization.
https://lemire.me/blog/2022/05/25/parsing-json-faster-with-i...
https://people.eecs.berkeley.edu/~krste/papers/yunsup-ms.pdf
https://parlab.eecs.berkeley.edu/sites/all/parlab/files/mave...
https://www.cs.uic.edu/~ajayk/c566/VectorProcessors.pdf
https://inst.eecs.berkeley.edu/~cs152/sp18/lectures/L16-RISC...
https://developer.arm.com/Architectures/Scalable%20Vector%20...
A browser, one process per tab, software updates continually spawning in the background…
Though I do agree that the benefits of high core count CPUs are oversold as far as most users are concerned.
AVX512 is the future, but maybe not Intel's future, ironically.
What a dumb design decision.
And if Intel had wanted this, they could have had the kernel handle the fault and pin the process to the big cores.
https://travisdowns.github.io/blog/2020/08/19/icl-avx512-fre...
and some measurements of Skylake-X displaying the effect, note the voltage swings in particular: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
Already happening:
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
So despite running many codes that use all cores 100%, using AVX512 is the only way I've seen them throttle.
There are still measurable effects but not like on skylake, to my knowledge at least.