The highlighted instruction ("movabs rax, offset x") is not a load, it's just moving the address of x into rax. The "offset x" operand is a 64-bit immediate (not a displacement). It will be resolved to the address of x with a relocation.
Indeed, you can get the compiler to emit this form of "movabs" targeting other registers, which wouldn't be possible if it really used a 64-bit displacement, since those are specific to the a* registers (al/ax/eax/rax): https://godbolt.org/z/3MYTUC
To get a load or store with a true 64-bit displacement, I think you want something like this: https://godbolt.org/z/4QMtpo
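For reference, a sketch of the two encodings side by side (I hand-assembled the bytes from the manual's tables, so treat them as illustrative):

    ; imm64 form: puts the constant itself into rax (this is what
    ; the compiler emitted for "movabs rax, offset x"); works with
    ; any register destination
    48 B8 88 77 66 55 44 33 22 11    movabs rax, 0x1122334455667788

    ; moffs64 form: loads 8 bytes *from* that 64-bit absolute
    ; address; only exists with the a* registers
    48 A1 88 77 66 55 44 33 22 11    movabs rax, [0x1122334455667788]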
I noticed a few other things in the section about segments:
> The good news is that caring about them isn’t too bad: they essentially boil down to adding the value in the segment register to the rest of the address calculation.
Surprisingly, the segment register's value isn't added directly (my coworker and I discovered this recently). The segment bases are stored in model-specific registers, and I don't believe these are readable from user-space. https://en.wikipedia.org/wiki/X86_memory_segmentation#Later_...
If you try to read %fs directly, you'll get a completely unrelated value: the segment selector, which is an index into a descriptor table (plus some flag bits).
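A small illustration of both points (the second instruction is my addition here: it only works if the CPU and kernel support the FSGSBASE extension; otherwise the base is reachable only via arch_prctl or the MSRs):

    mov ax, fs        ; yields the 16-bit selector (a descriptor-table
                      ; index plus flag bits), not the segment base
    rdfsbase rax      ; yields the actual FS base, but requires the
                      ; FSGSBASE extension to be enabled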
Another surprising and unfortunate thing is that segment-qualified addresses don't work with lea. That means getting the address of a thread-local isn't as simple as "lea rax, fs:[var]". You actually have to do a load to get the base address of the thread-local block (e.g. fs:0); the first pointer in the thread-local block is reserved for this. That's why this function has to do a load before the lea: https://godbolt.org/z/jkt28n
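So the pattern ends up looking something like this (a sketch of the local-exec TLS sequence; x is a hypothetical thread-local and x@tpoff its offset from the TLS block base, with the syntax slightly schematic):

    mov rax, fs:[0]            ; the TLS block stores its own address
                               ; at offset 0, so this loads the base
    lea rax, [rax + x@tpoff]   ; now a plain lea can form &x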
I actually find the Intel manual quite straightforward and useful on SIB bytes. Section 2.1.5, and specifically Tables 2-2 and 2-3, lays out quite simply all possible values of the ModRM byte and their operands [1].
It can be quite a good exercise to try and produce your own hex opcodes from the tables using something like CyberChef [2].
[1] https://www.intel.com/content/dam/www/public/us/en/documents...
[2] https://gchq.github.io/CyberChef/#recipe=Disassemble_x86('32...
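As an example of reading those tables, here is one instruction hand-assembled from them (I worked the bytes out from the ModRM/SIB layout, so double-check against the manual):

    48 8B 44 CB 10    mov rax, [rbx + rcx*8 + 0x10]

    ; 48 = REX.W prefix (64-bit operand size)
    ; 8B = MOV r64, r/m64
    ; 44 = ModRM 01.000.100 (mod=01: disp8 follows,
    ;      reg=000: rax, rm=100: SIB byte follows)
    ; CB = SIB 11.001.011 (scale=8, index=rcx, base=rbx)
    ; 10 = the 8-bit displacement, 0x10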
Basically every address a ModRM byte produces goes through four layers of page-table indirection on each memory access, to turn the virtual address into a physical one. But it's mostly an implementation detail: the page tables themselves are accessible only to the operating system, unfortunately, and the closest thing user-space has to manipulating them is the mmap API.
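Concretely, with 4-level paging and 4 KiB pages, the hardware splits a 48-bit virtual address into four 9-bit table indices plus a page offset. A sketch of extracting them by hand (assuming the address is in rdi):

    mov rax, rdi
    shr rax, 39
    and eax, 0x1ff    ; bits 47..39: PML4 index
    mov rcx, rdi
    shr rcx, 30
    and ecx, 0x1ff    ; bits 38..30: PDPT index
    mov rdx, rdi
    shr rdx, 21
    and edx, 0x1ff    ; bits 29..21: page-directory index
    mov rsi, rdi
    shr rsi, 12
    and esi, 0x1ff    ; bits 20..12: page-table index
    and edi, 0xfff    ; bits 11..0: offset within the 4 KiB page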
I think it’s good to understand why compilers emit these.
The dominant reason is that it saves registers. x86 is register starved even in 64-bit mode: just 16 regs means shit gets spilled. If you also needed a temp register for each memory access, the way that makes things slower is that it causes spills (usually elsewhere) that wouldn’t have been there if the address was computed by the instruction.
It helps that CPUs make these things fast. But it can be hard to prove that using the addressing mode is faster in the absence of the register pressure scenario I described.
Even bigger deal on x86-32.
Well, if you care about register pressure, you’ll only be saving at most one register, since you can do a shl/add sequence into a single temp to compute the address; and you can save even that if you’re willing to clobber one of your input registers, or if you’re loading into a register from that address anyway. So I thought the real reason these modes exist is that they save you from the data dependency you’ve just created with the technique above, plus modern Intel processors have extra execution capacity for address generation (e.g. ports that can handle lea cheaply). Plus I guess you save a bit on decoding and code size too.
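For concreteness, a sketch of the trade-off being discussed (register choices are arbitrary):

    ; with the addressing mode: one instruction, no scratch register
    mov rax, [rbx + rcx*8 + 16]

    ; without it: a scratch register (rdx here) plus a dependency
    ; chain before the load can even start
    mov rdx, rcx
    shl rdx, 3
    add rdx, rbx
    mov rax, [rdx + 16]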
I recently wrote an x86-64 backend (as in, like 4 years ago) and I recall that if address pattern matching (which is now very complete) was not there, you’d lose >1% perf across the board.
I wrote a pretty good x86-32 backend, for a totally different compiler (a whole-program PTA-based AOT compiler for JVM bytecode), a long-ass time ago, and vaguely remember it being like 10-15% there.
Note I’m talking about macrobenchmarks in both cases. And not individual ones - the average over many.
Also don’t forget those functions that sit on the ledge of needing callee saves. Having any callee-saved registers to preserve is more costly than having none. So if a function is using all of the volatile regs, makes no calls (so nothing has to be promoted into a callee-saved reg), and you cause it to use one more register, then its prologue/epilogue slows down.
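A sketch of that ledge (the functions are hypothetical):

    ; fits entirely in volatile (caller-saved) regs: empty prologue
    f:
        ; ... body uses only rax, rcx, rdx, rsi, rdi, r8-r11 ...
        ret

    ; needs one more register: now a callee-saved reg must be
    ; preserved, so the prologue/epilogue grows
    g:
        push rbx
        ; ... body also uses rbx ...
        pop rbx
        ret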
Registers matter a lot. :-)
And about your point about freeing up ports or other things: that may be a cool theory, but I’m just saying that it’s hard to show that using those addressing modes is a speedup if register allocation doesn’t care either way (i.e. the use of addressing modes doesn’t help spills or prologues). Meaning, most of the reason why compilers emit these is for regalloc, and it is the only reason I’ve been able to detect that changes perf across two different back ends.
> it can be hard to prove that using the addressing mode is faster in the absence of the register pressure scenario I described.
Maybe you personally weren't able to prove it to yourself with some very small microbenchmark, but I really believe that by measuring bigger code built with one approach or the other, it should be relatively straightforward to demonstrate the advantage of using the more compact instructions.
These were giant macrobenchmarks. And I said hard, not impossible. As in, most code doesn’t care if you shift or lea, unless it affects regalloc, which it almost always does.
It’s true that having smaller code is better regardless of perf - so if perf was neutral we would still use those instructions.
The benefit of the smaller instructions for perf is better register allocation, as I said. So, for macrobenchmarks, using the smaller instructions is a win every single time. And that win comes mostly from fewer spills. That’s the point of what I’m saying.
I suppose someone could do the experiment of turning on instruction selection patterns for address modes but still “pinning down” a register as if it were needed for the shift-add sequence you would otherwise have emitted. Feel free to do that if you want to prove me wrong. But just hand-waving that I must not have run big benchmarks isn’t going to help you learn more about compilers.
Ah yes, takes me back. Page zero is kind of like a large bank of registers, with its own fast-access instructions.
The one that I remember puzzling over back in the day was 'SEI'. I mean, why have an interrupt disable bit? Wouldn't it be more sensible to set and clear an interrupt enable bit, rather than setting and clearing the disabling of interrupts?
Have been digging into segmentation and paging in Linux as well as x86_64 instruction encoding lately. Almost all the technically detailed information I know of elides discussion of historical context. Coming to these things for the first time, there is so much that feels counter to how one would want to design things if starting from scratch.
Thus, I spend quite a bit of thought trying to infer the historical constraints and motivations that give us x86's beauty, but I'd love to have some resources that could flesh out my understanding in this regard.
See if you can find any books about assembly language programming for the 8086 (or 8088) and the 80286. Just those two should give enough context for why things are the way they are on the x86 line.
Not sure about that. Segmentation as it was used on Intel CPUs before the 386 ("Real Mode", "Protected Mode") wasn't used in Linux (other than during boot, as the PC BIOS was running in 8086-compatible Real Mode). Linus was quite outspoken about the awkward programming model of those earlier CPUs, and there's a reason Linux didn't start before he got a 386 (he was accustomed to a 32-bit flat address space from the 68008 in his earlier Sinclair QL).
Paging is an old OS concept and predates Intel CPUs and even Unix. How far back in time do you want to go?
I think some concepts which were influential in the early days of Linux are well covered in the Minix (1.0) code and book (after all, Linus first experimented with a 386 scheduler for Minix).