mikewarot · 3 years ago
I have this weird architecture I'm emulating that I call a bit-grid.[1] The basic idea is an FPGA with NO routing fabric at all... just a 2D grid of look-up tables that each take a single bit of input from each neighbor and output a single bit to each, which means each look-up table holds 64 bits of "program" (16 possible input states × 4 output bits). Clocking is in two phases (like the colors of a chess board) to prevent all manner of grief, and to make it easier to think about. Each phase is NOT Turing complete, but the grid is.

So, I can emulate this thing on my desktop PC at about 28 nSec/cell using some Pascal code[2]. I'm thinking that if I upgrade to a machine with AVX512 instructions, it might get radically faster. What I can't figure out is how this instruction actually works, and what gains I would actually get.
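For concreteness, here's the per-cell update as a Python sketch. The input ordering (N, E, S, W) and the nibble packing are arbitrary choices for illustration; the actual Pascal code may pack things differently.

```python
# One bitgrid cell: 4 one-bit inputs, 4 one-bit outputs, so the "program"
# is sixteen 4-bit nibbles (one per input state) = 64 bits total.
def cell_outputs(program, n, e, s, w):
    index = (n << 3) | (e << 2) | (s << 1) | w  # which of the 16 input states
    nibble = (program >> (4 * index)) & 0xF     # the 4 output bits for that state
    return (nibble >> 3) & 1, (nibble >> 2) & 1, (nibble >> 1) & 1, nibble & 1
```

A program whose nibble at index i is i itself just echoes the inputs straight through, which makes a handy identity test.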

The Intel documentation on this instruction is as clear as mud. There's no example with all the bits shown and worked through, so the reader has no ledge on which to gain any intellectual purchase toward understanding it.

Questions:

  If I were to fork over the cash for a machine with AVX512 instructions, how many of these instructions can actually execute/second?

  Wouldn't moving a bunch of values to/from memory basically empty out all the caches and make this thing really slow anyway?

  Does anyone have a worked out example with bits shown for all the sources and destinations before/after the instruction, so I can see what it does?


[1] https://esolangs.org/wiki/Bitgrid

[2] https://github.com/mikewarot/Bitgrid

less_less · 3 years ago
You might already be able to get good acceleration with SSSE3 or AVX2 or NEON, which also have a 4-bit-index table-lookup instruction. The problem is that you're doing parallel lookups into many different tables, whereas NEON/SSSE3's lookup is 16 parallel lookups into the same table (and AVX2's is two copies of the SSSE3 one side by side, I think). So it's not as useful unless you're simulating the same grid on several different inputs for bulk testing. It might still be faster than scalar, but I'm not sure.
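To make the contrast concrete, here's a rough Python model of that 16-lane lookup (PSHUFB-style semantics, where a set high bit zeroes the lane; NEON's TBL differs slightly). Note that all 16 lanes share one table:

```python
# Model of an SSSE3 PSHUFB-style byte shuffle: 16 parallel lookups,
# but every lane indexes into the SAME 16-entry table.
def byte_shuffle(table, indices):
    assert len(table) == 16 and len(indices) == 16
    # a set high bit in an index zeroes that lane; otherwise the low
    # 4 bits select one of the table's 16 entries
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in indices)
```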
tkanarsky · 3 years ago
You could rent a VPS with AVX512 support and tinker around with it. I believe GCP has a few instance types that support it, AWS surely has some too.
mikewarot · 3 years ago
That's a great idea! It would only cost a few bucks to find out.

The interesting idea about the bitgrid is that you can spread the emulation across cores, as none of the cells are Turing complete. Do all of phase A, then all of phase B, repeat.
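As a sketch, with a toy cell rule (XOR of the four neighbors on a torus) standing in for the real LUTs, the schedule looks like this; every cell within a phase is independent, so the inner loops can be split across cores with a barrier between phases:

```python
# Toy two-phase step: cells are colored like a chess board by (x + y)
# parity, and only one color updates per phase, so there are no
# intra-phase dependencies.
def step_phase(grid, phase):
    h, w = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for y in range(h):
        for x in range(w):
            if (x + y) % 2 == phase:
                # stand-in rule: XOR of the four neighbors, wrapping around
                new[y][x] = (grid[y][(x - 1) % w] ^ grid[y][(x + 1) % w] ^
                             grid[(y - 1) % h][x] ^ grid[(y + 1) % h][x])
    return new

def step(grid):
    return step_phase(step_phase(grid, 0), 1)  # all of phase A, then phase B
```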

Grimburger · 3 years ago
It's quite variable even within instance types. I've emailed some VPS providers asking, and the response was always that it's not guaranteed; you'll have to test and see.

I'm guessing there are some out there, but not from the providers I have accounts with.

Const-me · 3 years ago
About micro-optimizations: I don’t know AVX512, but with AVX2 you could compute 4 nodes in parallel. An example in C++, untested: https://gist.github.com/Const-me/90a52f291c1fcb06142307facdb...

Here’s another idea for optimizing: instead of a single 2D array, I would rework the memory layout. Specifically, make six 2D arrays: two with uint64_t values for the lookup tables of even/odd cells, two with bits for even/odd cells in the old grid state, and two with bits for even/odd cells in the new grid state. This improves RAM access patterns because either half of the cycle loads/stores half as many cache lines. After the complete cycle, swap the old state with the new state.

Gathering inputs from neighbors, and scattering outputs to them, could be a bit tricky this way, but you can simplify if you can limit the grid size to an even number in both the X and Y directions. At least the even/odd halves will be of equal size that way.
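A sketch of that layout in plain Python (the array names and per-color packing are illustrative only):

```python
# Six arrays, split by checkerboard color: LUTs, old state, new state.
# Each half-cycle then touches only the three arrays for its own color.
from array import array

W, H = 64, 64                          # grid size, even in both directions
half = (W // 2) * H                    # number of cells of one color

luts_even = array('Q', [0] * half)     # one 64-bit LUT per even cell
luts_odd  = array('Q', [0] * half)     # one 64-bit LUT per odd cell
state_old_even = bytearray(half // 8)  # one bit per even cell, packed
state_old_odd  = bytearray(half // 8)
state_new_even = bytearray(half // 8)
state_new_odd  = bytearray(half // 8)

# after a full cycle (both phases), swap old and new:
state_old_even, state_new_even = state_new_even, state_old_even
state_old_odd,  state_new_odd  = state_new_odd,  state_old_odd
```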

cjs_ac · 3 years ago
Disclaimer: I'm not an expert.

For your first question, the instructions are documented on the Intel website[0]. Many instructions have Latency and Throughput figures, which indicate how many cycles they take to execute. These are not straightforward to interpret due to instruction pipelining.

As for cache exhaustion, that depends on how quickly you consume more data from memory. It's worth noting that the registers are an entire cache line in width, and that Latency figures are given for the instructions that load from memory into the AVX registers.

[0] https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

Matumio · 3 years ago
> Wouldn't moving a bunch of values to/from memory basically empty out all the caches and make this thing really slow anyway?

It's probably the same issue as with neural networks: they have to work on a whole batch of inputs to be efficient. They load only a few weights at a time into registers, then use those on the whole batch before loading the next weights.

So my guess would be: if you only run a single grid, and the grid is large, you may end up memory-bound and can afford to run inefficient instructions.

(Btw, it sounds like bitgrid would map 1:1 onto the LUTs of an FPGA. Might be a fun project to generate VHDL for it and then let it actually run in parallel. The less straightforward part being the logic to get the data out of the FPGA again.)


foota · 3 years ago
Does the shape of the grid matter? If not, I think you'd benefit from constraining the width such that the data for three rows fits within L1, so that iterating over one row leaves the data from the start of the next row in L1 by the time you get to it.
foota · 3 years ago
Edit:

Actually... what if you broke the grid up into N-by-N blocks, and simulated 2N half time steps (each decreasing the side length by 2) of that block? You'd need to do a similar sort of chessboard pattern over the blocks, and emit "partial outputs" from each block, then run a pass going in the reverse direction, consuming the partial outputs and initial inputs. This way you can fit a block entirely within a core's L1 for N time steps, and naturally get parallelism by simulating multiple blocks at once.

You should be able to simulate N time steps like this with just 2 loads of the data (program and state) instead of 2 * N.

llogiq · 3 years ago
I sometimes wonder what CPU designers are thinking when they build such weird instructions. They must have some programs in mind that could run faster with them, or else the additional transistors are just dead weight. But then compilers and language runtimes might or might not use those instructions. Add to that the fact that modern CPUs are basically their own compilers (going from "machine code" to microcode) and you have weirdness atop more weirdness. But perhaps this is just for business's sake: adding more instruction sets to provide a barrier to the competition, because programs using them run faster but are no longer portable to competitors' CPUs.
Matumio · 3 years ago
I once thought "bit shuffling" was a special-purpose use-case you'd only ever need to deinterleave RGBA channels or something. Later I wanted to implement a small look-up table (surely something more common) and realized that it's just a different name for the same operation. (I think "bit shuffling" instructions have evolved to be generic enough that they're just programmable LUTs now.)
less_less · 3 years ago
Yeah, those vector permute instructions are super useful for both patterns. There are dedicated instructions for some specific permutations (shifting over by a constant number of bytes, and some interleavings) but you can easily end up needing the general case. And of course parallel LUTs are also very useful. Depending on what you're doing you could easily end up with both in the same algorithm.
galangalalgol · 3 years ago
Good for software defined radio too because real and imaginary are interleaved.
no92 · 3 years ago
I would have never thought to phrase bit-shuffling as deinterleaving, which makes so much more sense to my not-professionally-computer-related eyes and mind.

Realistically though, how likely are GCC or Clang to emit these instructions when I'm working on some lookup tables, assuming I permit them to (e.g. via `-march=native` on a machine that supports the extension)? My gut feeling is that unless I specifically structure my code to be as close to the semantics of the instructions as possible, they would never be emitted. Or has the world of compiler optimizers advanced enough that this kind of rewriting is commonplace now?

jeffbee · 3 years ago
This may surprise you but large CPU buyers go to Intel and ask for ISA extensions, and Intel implements them. This is how BMI came to exist. The fact that you do not see these instructions exploited by your vanilla Debian box says nothing about warehouse-scale datacenter operators.
xoranth · 3 years ago
Your vanilla Debian box uses BMI2 whenever you do a string comparison, unless you are on a decade old CPU [^0].

The "strange" instructions are actually not that niche, it's just that usage tends to be "indirect" and therefore people don't notice.

[^0] E.g. https://xoranth.net/memcmp-avx2

foobiekr · 3 years ago
Famously, popcount was a common request from intelligence agencies, at least as far back as the DEC Alpha.
pjmlp · 3 years ago
In Intel's case they were thinking about turning the CPU into a GPU, but failed at that several times, AVX is what is left from Larrabee.
jeffffff · 3 years ago
AVX and AVX2 are pretty awful because of lane-crossing limitations, but AVX512 is actually really nice and feels like a real programmer designed it rather than an electrical engineer.
dan-robertson · 3 years ago
See also: https://tomforsyth1000.github.io/papers/LRBNI%20origins%20v4... which describes some of that history.
dtx1 · 3 years ago
I think back to the Playstation 2-3 or N64 that all took years for developers to fully utilize the capabilities of the hardware. Yet the hardware engineers must have known how to do it long before the software side totally figured it out. After years of SW Development it's still just magic sand to me
NotCamelCase · 3 years ago
It's also difficult for SW teams' and HW teams' visions to converge, even within the same company, such that the product can be put to use to maximize performance and programmability relative to another, already-established product.

Different constraints and challenges on both sides of the aisle give rise to compromises which end up with lowered performance or lowered ease of use. This is one area where great authority over the entire stack lends you lots of leeways, e.g. Apple designing Metal API and the HW for it.

ReactiveJelly · 3 years ago
I don't know. If the HW engineers knew something, wouldn't Sony or Nintendo have had them lend a hand on 1st-party titles?
cmrdporcupine · 3 years ago
Actually, once you have these tools at your disposal and have seen them, it's hard not to see places to use them. String processing, bitset indices in data structures, etc. all have places where they naturally fit.

I have a data structure library (in Rust) where I would love to have these. The problem is that AVX-512 just isn't common enough to rely on it yet, and I don't even have it on my workstation CPU (Radeon 6850, from just last year).

But in particular whether they had something in mind, I suspect Intel was thinking about video codecs and containers for a lot of these. If you read through the specs for them, you will find all sorts of places which call for things like this.

But yes, whether compiler developers can make good use of these is questionable. They are really for specialized optimization workflows.

pjc50 · 3 years ago
I think the "shuffle" instructions are a generalization of "s-box", from various cryptographic algorithms?

> They must have some programs in mind that could be run faster with them

Yeah, all new instructions are built with some workload in mind. This may or may not be specified in the architecture manual or you have to reverse-engineer it from the press releases.

pavlov · 3 years ago
The vector permute was a highlight of the PowerPC vector extensions (AltiVec) in the late 1990s. It made a real difference for many media processing algorithms where the data is often packed and you need to pick apart the components for vector execution.

I’ve got the impression that AVX-512 is finally a good and comprehensive vector ISA from Intel, not just a tolerable one. Sadly it seems to be so comprehensive that the x86-64 vendors can’t afford the die area to ship it widely, or they treat it as an enterprise feature.

fsfod · 3 years ago
At least AMD kept AVX-512 in their small Zen 4c cores and just sacrificed some cache instead. I have to wonder if it was Intel marketing that killed off working AVX-512 in consumer P-cores after they were released, because the E-cores just become dead weight with AVX-512 enabled.
paulmd · 3 years ago
nobody really knows what is going on with AVX-512 inside intel.

Linus has commented that it's trivial to trap the first time an AVX instruction is used and pin it to a P-core (they used to do this as an optimization to avoid saving/restoring AVX registers in non-AVX code) and he doesn't know why that patch isn't on his desk.

The other problem is presumably that CPUID depends on what core it's executed on but... it seems straightforward for code to just run a CPUID on every single core (using affinity) and analyze the results. OK, 16 AVX-512 threads and 8 AVX2 threads, that's fine! It is obviously not the way code is currently written but code isn't written for AVX-512 right now anyway, and it should be literally an hour of work for a C dev.

I guess maybe they just didn't want the long tail of support but with their future architectures being heterogeneous too, they don't seem to have any plan either, and they're still shipping the AVX-512 units in silicon, meaning they are paying millions of dollars for a feature that isn't enabled. Very, very weird.

Fun fact: Alder Lake can actually be run with AVX-512 enabled even with the E-cores active. There is an undocumented MSR flag that seems to allow this. It has to be a stepping/BIOS revision from before it was locked out, though.

slashdev · 3 years ago
I really love playing with vector instructions, especially AVX2, which is the most limited in my opinion. You get a whole bunch of oddly shaped legos, and it’s very challenging to build something beautiful from them. But also very fun and rewarding.
anonymousDan · 3 years ago
By coincidence I just started playing with AVX-512 this week. It's a lot of fun, but I have to say there is a dearth of resources on how to use it. For example I would love a simple visualisation for each operation as to how the various vectors and/or memory are transformed. Can anyone recommend some good books/learning resources?
alecco · 3 years ago
Short intro with some visuals http://const.me/articles/simd/simd.pdf
anonymousDan · 3 years ago
Answering my own question a little: on further exploration, the Intel optimisation manual actually has a good chapter on AVX-512: https://www.intel.com/content/www/us/en/content-details/6714...


jeffffff · 3 years ago
there are only a handful of instructions that do interesting things beyond parallel versions of basic arithmetic and bitwise operations. https://branchfree.org/2019/05/29/why-ice-lake-is-important-... provides a good overview of them.
SomeRndName11 · 3 years ago
Not many people know this, but Alder Lake's implementation of AVX-512 supports the FP16 format, and the FP16 version of AVX-512 is so extremely fast it's almost touching GPU performance.
LispSporks22 · 3 years ago
> Suppose that you want to reorder, arbitrarily, the bits in a 64-bit word.

What's the application of this? There's a Twitter link in the next sentence, but it looks like I need to sign up to see it.

dan-robertson · 3 years ago
Static reorderings can be used for some data layouts, e.g. interleaving the bits of an x and y coordinate, but you'd hope they'd be cheaper than a dynamic reordering.
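For illustration, an arbitrary (dynamic) bit reordering modelled in plain Python; `permute_bits` and `perm` are made-up names, with perm[dst] giving which source bit lands at destination bit dst:

```python
def permute_bits(x, perm):
    # place source bit perm[dst] at destination bit dst, for each dst
    return sum(((x >> src) & 1) << dst for dst, src in enumerate(perm))
```

A static reordering like Morton-interleaving the bits of an x and y coordinate is one fixed choice of perm; an instruction doing this in hardware lets perm vary at runtime.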
trillic · 3 years ago
Cryptography
pklausler · 3 years ago
This goes way back at least to the late 80's when the Cray C-90 introduced its "bit matrix multiplication" instructions.