Readit News
jart · 2 years ago
I discovered a really cool ARM64 trick today. One thing about x86 that I've found useful on so many occasions is the PCMPEQB + PMOVMSKB + BSF trick that lets me scan the bytes of a string 10x faster. I couldn't find any information on Google for doing PMOVMSKB with ARM, so I've been studying ARM's "Optimized Routines" codebase where I stumbled upon the answer in their strnlen() implementation. It turns out the trick is to use `shrn dst.8b, src.8h, 4` which turns a 128-bit mask into a 64-bit mask. You can then get the string offset index with fmov, rbit, clz and finally shift by 2.
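A portable scalar sketch of what that sequence computes may make it clearer -- this is not the NEON code itself, just the same math with made-up helper names, where `cmp[]` stands in for the output of a CMEQ-style byte compare (0xFF per matching byte):

```c
#include <assert.h>
#include <stdint.h>

// Model of `shrn dst.8b, src.8h, #4`: each 0x00/0xFF compare byte
// collapses to a 4-bit nibble, giving a 64-bit mask for 16 bytes.
uint64_t nibble_mask(const uint8_t cmp[16]) {
    uint64_t m = 0;
    for (int i = 0; i < 16; i++)
        if (cmp[i]) m |= 0xFull << (4 * i);
    return m;
}

// fmov moves the mask to a GPR; rbit+clz then count trailing zeros,
// which lands on 4*index; the final shift by 2 turns that into index.
int first_match(uint64_t m) {
    if (!m) return -1;                 // no byte matched
    int tz = __builtin_ctzll(m);       // ctz(m) == clz(rbit(m))
    return tz >> 2;                    // nibble position -> byte index
}
```

On real hardware the compare, shrn, and the fmov/rbit/clz tail each map to one instruction; the loop above exists only because this is a scalar illustration.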
danlark · 2 years ago
I am the author of this trick as well

You can read about it in https://community.arm.com/arm-community-blogs/b/infrastructu...

jart · 2 years ago
Wow. I love your work. Thank you for coming here and talking about it. You could write Hacker's Delight 2nd edition for a new generation.
zX41ZdbW · 2 years ago
Let's add it to ClickHouse: https://github.com/ClickHouse/ClickHouse/blob/master/base/ba...

It should significantly improve the performance on ARM.

Sesse__ · 2 years ago
The VSHRN trick is nice (I used it only two hours ago!), but it really does feel like a crutch; I don't understand why they couldn't simply implement a PMOVMSKB-like instruction to begin with (it cannot possibly be very expensive in silicon, at least not if it put the result in a vector register). One-bit-per-byte is really the sweet spot for almost any kind of text manipulation, and often requires less setup/post-fixup on either side of the PMOVMSKB/VSHRN.
jeffreygoesto · 2 years ago
> However, developers often encounter problems with Arm NEON instructions being expensive to move to scalar code and back.

I remember talking to an ARM engineer easily 10 years ago and he told us in that nice british accent: "You know, NEON is like 'back in the yard'" :-D. This has changed a lot, but not enough from what you wrote... Bit sad that these SIMD optimizations are still hand written...

camel-cdr · 2 years ago
I found the following article about the topic really good: https://branchfree.org/2019/04/01/fitting-my-head-through-th...

In my experience using a 512 wide movemask (to uint64_t) is the fastest on both x86 and arm64. (Edit: just to clarify, I meant the fastest for iteration; things like SwissMap are better off using a 128 wide movemask.)

With rvv you don't really want to go from a vector mask to a general purpose non vector register, because the vector length may vary. But I found it really useful that vector masks are always packed into v0. So even with LMUL=8, you can just do a vmseq, switch to LMUL=1 and use vfirst & vmsif & vmandn to iterate through all indices. (Alternatively vfirst & vmsof & vmclr would also work; I'm not sure which one would be faster.)

gpvos · 2 years ago
I am very surprised that this is presented as something new. From the very beginning of ARM, all instructions have had a condition attached to them. Contrary to the article, it has absolutely nothing to do with making the processor more CISCy, but is instead one of its most RISCy aspects.
Tuna-Fish · 2 years ago
All 32-bit ARM opcodes had predication, but when ARM went 64-bit, they wanted to recover the encoding space for 32 instead of 16 registers, and removed predication from most instructions. When they did this, they looked at all the 32-bit ARM binaries they could find, and counted which instructions were actually used with predicates, and added the top 5 of those as separate instructions.
Symmetry · 2 years ago
Arguably the two most important features of RISC, for the modern era, were having regular instruction sizes leading to easier parallel decode and having at most one memory access per instruction making the combination of pipelining and precise exceptions much less of a pile of worms.

ARM's 32 bit ISA was very regular and it mostly had a single memory access per instruction, but there were some like store multiple which could potentially save every register to memory. By getting rid of that in A64 and replacing it with an instruction that concatenates two registers and stores them in a single memory access, they ended up in a far RISCier place than A32.

klelatti · 2 years ago
Interesting - is there a reference for this?
robinsonb5 · 2 years ago
Yes, I had similar thoughts when I started reading, but I think only ARM32 has predication. (There's a prefix-instruction-based something or other in Thumb, I think, but it doesn't devote part of the encoding space to predication bits like ARM32 does.)

As I understand it they didn't carry predication across from ARM32 to ARM64 for various performance reasons (if you want to be able to re-order instructions, or even aggressively pipeline them, you don't want them depending on the result of the immediately-prior instruction).

Predication everywhere (i.e. orthogonal to the rest of the instruction set, and not special-cased) is certainly more RISC than CISC - but having removed it in general, bringing it back for a few specific instructions is arguably CISCy.

peterfirefly · 2 years ago
> There's a prefix-instruction-based something or other in Thumb, I think, but it doesn't devote part of the encoding space to predication bits like ARM32 does.

Yes, the IT (if-then) instruction (prefix). It is not supported by Cortex-M0, Cortex-M0+, and Cortex-M1, though. Those are the smallest T32 (Thumb-2) microcontroller designs ARM has.

IT can be followed by up to 4 instructions and encodes what their predicate bits would have been in 32-bit ARM code (A32). There is not total freedom regarding their predicate bits: they all have to share the same base condition (3 bits) and then get an individual bit that says whether to execute when that condition is met or when it isn't. The IT instruction is a 16-bit instruction that devotes 8 bits to this -- not 7, because the encoding is weird.
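For the curious, those 8 bits can be decoded with a few lines of C. This is just an illustrative sketch assuming the ARMv7-M layout (firstcond in bits 7:4, mask in bits 3:0); the helper name and interface are made up:

```c
#include <assert.h>

// Decode the payload of a Thumb-2 IT instruction.  The first
// instruction in the block is always "then"; buf receives 'T'/'E'
// for instructions 2..4.  The lowest set bit of mask is a
// terminator (which is why one instruction still needs mask=0b1000
// and no bit can be saved).  Returns the block length (1..4).
// mask must be nonzero.
int decode_it(unsigned firstcond, unsigned mask, char *buf) {
    unsigned c0 = firstcond & 1;   // low bit of the base condition
    int term = 0;
    while (!((mask >> term) & 1)) term++;   // find the terminator
    int i = 0;
    for (int bit = 3; bit > term; bit--)    // MSB-first, one bit per
        buf[i++] = (((mask >> bit) & 1) == c0) ? 'T' : 'E';
    buf[i] = '\0';
    return 1 + i;
}
```

So e.g. ITE NE (firstcond 0b0001) encodes the "else" slot as the complement of firstcond's low bit, followed by the terminator.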

unwind · 2 years ago
I thought this was interesting, although of course I agree with many commenters' take that the lack of reference to the "old-school" ARM where everything was conditional is odd.

I got curious about how RISC-V handles this, but only curious enough to find [1] and not dig any further. That answer is from a year ago, so perhaps there have been changes.

[1]: https://stackoverflow.com/a/72341794/28169

Findecanor · 2 years ago
"cmov" and several more interesting instructions in the draft RISC-V Bitmanip proposal were dropped before it reached 1.0 though.

There is a new proposal: Zicond, but it is quite crude, with two instructions. The "czero.eqz" instruction does:

  rd = (rs2 == 0) ? 0 : rs1;
And the other, "czero.nez", tests for "rs2 != 0". Both are meant to produce an operand for another instruction, where a zero operand makes it a no-op: for conditional add, sub, xor, etc. A conditional move, however, takes three instructions: two results, one of which is always zero, or'ed together.

https://github.com/riscv/riscv-zicond/blob/main/zicondops.ad...
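That three-instruction conditional move is easy to model in plain C. A scalar sketch of the Zicond semantics (function names just mirror the mnemonics):

```c
#include <assert.h>
#include <stdint.h>

// Scalar models of the two Zicond instructions.
uint64_t czero_eqz(uint64_t rs1, uint64_t rs2) {
    return rs2 == 0 ? 0 : rs1;   // zero the result if rs2 is zero
}
uint64_t czero_nez(uint64_t rs1, uint64_t rs2) {
    return rs2 != 0 ? 0 : rs1;   // zero the result if rs2 is nonzero
}

// Conditional move built from them: two results, one of which is
// always zero, or'ed together -- three instructions in total.
uint64_t cmov(uint64_t cond, uint64_t a, uint64_t b) {
    return czero_eqz(a, cond) | czero_nez(b, cond);  // cond ? a : b
}
```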

Otherwise, the intention was that bigger RISC-V cores would detect a conditional branch over a single instruction in the decoder and perform macro-op fusion into a conditional instruction.

klelatti · 2 years ago
> Otherwise, the intention was that bigger RISC-V cores would detect a conditional branch over a single instruction in the decoder and perform macro-op fusion into a conditional instruction.

This seems like an overhead compared to actually having the instruction available. Could anyone say how material an overhead this is?

stefan_ · 2 years ago
Not quite cmov but Alibabas T-Head extensions have mveqz (move if equal zero) and mvnez (move if not equal zero).
franky47 · 2 years ago
Before reading the article, my former DSP engineer brain kicked in and thought: "complex cardinal sine (sinc), why would you want that?"

https://en.wikipedia.org/wiki/Sinc_function

t8sr · 2 years ago
The while loop in the third paragraph is easier to read in assembly than in the original C++, which either says something about how well chosen the instruction set is, or about how bad some of C++ is.
menaerus · 2 years ago
Nothing to do with C++ - it's plain C code as a matter of fact, but that's not important at all. What the code does is employ low-level knowledge of the CPU microarchitecture (x86-64) and the compiler's codegen behavior (clang) to pack as many instructions per cycle as possible, so that the resulting (de)compression speed is improved. You cannot write such a piece of code so that it looks "beautiful" to an average Joe.
t8sr · 2 years ago
Right, but the use of bitwise AND, and the repeated conditional expressions are the kind of weirdness I’d expect a good compiler to not need.

I’ve worked a lot on the kernel and I’m no stranger to optimized code. This is still really weirdly written, and in fact the assembly is much more readable, which is funny.

I know clang needs a lot of prodding to output good code (compared to gcc), but I’m curious whether even clang really needs the logic to be so warped.

layer8 · 2 years ago
It’s weirdly written, maybe to mimic conditional machine instructions. It’s also unusual in that it seems to assume that each input array contains each number only once, as it outputs numbers contained in both input arrays only once, but only under that prior assumption.
mtklein · 2 years ago
I love seeing this instruction pop up in disassembly. I've seen it come up when growing a dynamic array, with some C code like...

    if (is_pow2_or_zero(len)) {
        int grown = len ? len*2 : 1; 
        ptr = realloc(ptr, (size_t)grown * sizeof *ptr);
    }
compiling into this sort of disassembly to calculate the value of grown:

    lsl    w8, w19, #1      // w8 = len*2
    cmp    w19, #0x0        // is len zero?
    csinc  w8, w8, wzr, ne  // w8 = (w8 if len != 0) or (0+1 if len == 0)
Pretty clever to create that 1 constant using csinc on the wzr zero register.
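As a sanity check, csinc's semantics are simple to model in plain C -- this is an illustrative sketch with invented names, not compiler output:

```c
#include <assert.h>
#include <stdint.h>

// Model of csinc Rd, Rn, Rm, cond: Rd = cond ? Rn : Rm + 1.
uint32_t csinc(int cond, uint32_t rn, uint32_t rm) {
    return cond ? rn : rm + 1;
}

// The three-instruction sequence above, line by line.
uint32_t grown_asm(uint32_t len) {
    uint32_t w8 = len << 1;    // lsl   w8, w19, #1
    int ne = (len != 0);       // cmp   w19, #0x0
    return csinc(ne, w8, 0);   // csinc w8, w8, wzr, ne
}

// The original C expression, for comparison.
uint32_t grown_c(uint32_t len) { return len ? len * 2 : 1; }
```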

dougall · 2 years ago
Though it'd be preferable to do:

    cmp wzr, w19      // set the carry flag if w19 is zero
    adc w8, w19, w19  // w8 = w19 + w19 + carry
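In portable C this alternative models as (sketch, name invented):

```c
#include <assert.h>
#include <stdint.h>

// cmp wzr, w19 computes 0 - w19; AArch64 sets the carry flag when
// there is no borrow, i.e. exactly when w19 == 0.  adc then folds
// that carry into the doubled value.
uint32_t grown_adc(uint32_t len) {
    unsigned carry = (len == 0);   // cmp wzr, w19
    return len + len + carry;      // adc w8, w19, w19
}
```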

userbinator · 2 years ago
Parent is 12 bytes, yours is 8, but x86 can do it in 5:

    add eax, eax
    setz al

mpweiher · 2 years ago
Wouldn't this be the ideal instruction for implementing multi-word arithmetic? If the carry flag is set from the previous (lower order) addition, increase the next word up by one and continue adding.

And of course ARM 32 had conditional execution for all instructions. These appear to be the variants that were useful enough to keep around when the general feature was removed in aarch64.

gpvos · 2 years ago
ARM has both add-with-carry and add-without-carry instructions, a separate increment is not necessary. (I don't know much about AArch64, only ancient ARM2/3, but I expect they left this in).
Findecanor · 2 years ago
Yes AArch64 has "adds" for modifying the carry flag after the first addition and then "adcs" for using and modifying the carry flag in subsequent additions.
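A portable C sketch of that adds/adcs carry chain (illustrative; limbs assumed little-endian, i.e. limb 0 is the lowest-order word):

```c
#include <assert.h>
#include <stdint.h>

// Add two n-limb numbers, propagating the carry the way one "adds"
// followed by a chain of "adcs" would.
void multiword_add(uint64_t *dst, const uint64_t *a,
                   const uint64_t *b, int n) {
    unsigned carry = 0;
    for (int i = 0; i < n; i++) {   // limb 0 plays the "adds" role,
        uint64_t s = a[i] + b[i];   // later limbs the "adcs" role
        unsigned c1 = s < a[i];     // carry out of a[i] + b[i]
        dst[i] = s + carry;
        carry = c1 | (dst[i] < s);  // carry out of the +carry step
    }
}
```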
wbl · 2 years ago
ARM used to have the beautiful UMAAL, a single instruction that would multiply two registers, accumulate two other values into the result, and store the double-width product back into the registers.

This is the inner loop of multiplication and was very nice to use, but died in the AArch64 transition.

throwawaylinux · 2 years ago
You have to be careful with turning control dependencies into data dependencies. It can be very hard to understand or predict how a CPU will behave.

If you are testing quite predictable things, you almost always want to use branch prediction and not predicated/conditional instructions.

If something is totally unpredictable, let's say a binary search that is looking up random elements in a well balanced heap or tree. Each comparison is very unpredictable. A conditional select would work best there:

    item = (val < item->val ? item->left : item->right);
    if (val == item->val) ...
You could do your tree walks entirely without branch misses if that first line was a select... But it turns out that is not true. Or it's not necessarily true; depending on a few (not uncommon) factors, it can be worse to use a select there.
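For a concrete flavor, here is a hedged C sketch of a select-based search over a sorted array (an array rather than the pointer tree above, but the same idea): the ternary on `base` is the kind of expression a compiler can lower to csel, turning the unpredictable comparison into a data dependency instead of a branch.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Branchless lower_bound sketch: returns the index of the first
// element >= val in sorted array a of length n.
size_t branchless_lower_bound(const int *a, size_t n, int val) {
    if (n == 0) return 0;
    const int *base = a;
    while (n > 1) {
        size_t half = n / 2;
        // csel candidate: advance base iff the probe is too small
        base = (base[half - 1] < val) ? base + half : base;
        n -= half;
    }
    return (size_t)(base - a) + (base[0] < val);
}
```

Whether this beats the branchy version depends on exactly the factors mentioned above -- predictability of the data, and whether the data dependency lengthens the critical path.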