Readit News
noelwelsh · a month ago
This is how many registers the ISA exposes, but not the number of registers actually in the CPU. Typical CPUs have hundreds of registers. For example, Zen 4's integer register file has 224 registers, and the FP/vector register file has 192 registers (per Wikipedia). This is useful to know because it can affect behavior. E.g. I've seen results where doing a register allocation pass with a large number of registers, followed by a pass with the number of registers exposed in the ISA, leads to better performance.
saagarjha · a month ago
What compilers do this?
noelwelsh · a month ago
One writeup I know about is: "Smlnj: Intel x86 back end compiler controlled memory."
Someone · a month ago
FTA: “For design reasons that are a complete mystery to me, the MMX registers are actually sub-registers of the x87 STn registers”

I think the main argument for doing that was that it meant that existing OSes didn’t need changes for the new CPU. Because they already saved the x87 registers on context switch, they automatically saved the MMX registers, and context switches didn’t slow down.

It also may have decreased the amount of space needed, but that difference can’t have been very large, I think

adrian_b · 25 days ago
By "existing OSes", that really means Microsoft Windows; other OSes would not have had any problem with the negligible update required to save and restore more registers.

For many decades, Intel introduced a lot of awful workarounds in their CPUs for the sole reason that Microsoft was too lazy to update its OS, so the newer, better CPUs had to be managed by the OS in exactly the same way as the old, worse CPUs, even when that moved into the CPUs various functions that can be done much more efficiently by the OS, and whose place is therefore not inside the CPU.

So the MMX registers were aliased over the FPU registers because that way the existing MS Windows saved them automatically on a thread switch. Eventually the limitations of MMX became too great, and due to competitive pressure from AMD (3DNow!) and Motorola (AltiVec), Intel and Microsoft were forced to transition to SSE in 1999, for which a couple of new save and restore instructions were added and used by the OS, allowing an increase in the number and size of registers.

rep_lodsb · a month ago
Nitpick (footnote 3): "64-bit kernels can run 32-bit userspace processes, but 64-bit and 32-bit code can’t be mixed in the same process. ↩"

That isn't true on any operating system I'm aware of. If both modes are supported at all, there will be a ring 3 code selector defined in the GDT for each, and I don't think there would be any security benefit to hiding the "inactive" one. A program could even use the LAR instruction to search for them.

At least on Linux, the kernel is perfectly fine with being called from either mode. FASM example code (with hardcoded selector, works on my machine):

    format elf executable at $1_0000
    entry start
    
    segment readable executable
    
    start:  mov     eax,4                   ;32-bit syscall# for write
            mov     ebx,1                   ;handle
            mov     ecx,Msg1                ;pointer
            mov     edx,Msg1.len            ;length
            int     $80
    
            call    $33:demo64
    
            mov     eax,4
            mov     ebx,1
            mov     ecx,Msg3
            mov     edx,Msg3.len
            int     $80
            mov     eax,1                   ;exit
            xor     ebx,ebx                 ;status
            int     $80
    
    use64
    demo64: mov     eax,1                   ;64-bit syscall# for write
            mov     edi,1                   ;handle
            lea     rsi,[Msg2]              ;pointer
            mov     edx,Msg2.len            ;length
            syscall
            retfd                           ;return to caller in 32 bit mode

    Msg1    db      "Hello from 32-bit mode",10
    .len=$-Msg1
    
    Msg2    db      "Now in 64-bit mode",10
    .len=$-Msg2
    
    Msg3    db      "Back to 32 bits",10
    .len=$-Msg3

ronsor · a month ago
This is also true on Windows. Malware loves it! https://encyclopedia.kaspersky.com/glossary/heavens-gate/
bonzini · a month ago
Isn't it how recent Wine runs 32-bit programs?
josephh · a month ago
Much like there is 64-bit "code", there is also 32-bit "code" that can only be executed in the 32-bit (protected) mode, namely all the BCD, segment-related, push/pop-all instructions that will trigger an invalid opcode exception (#UD) when executed under long mode. In that strictest sense, "64-bit and 32-bit code can’t be mixed".
jcranmer · a month ago
x86 has (not counting the system-management mode stuff) 4 major modes: real mode, protected mode, virtual 8086 mode, and IA-32e mode. Protected mode and IA-32e mode rely on the bits within the code segment's descriptor to figure out whether or not it is 16-bit, 32-bit, or 64-bit. (For extra fun, you can also have "wrong-size" stack segments, e.g., 32-bit code + 16-bit stack segment!)

16-bit and 32-bit code segments work almost exactly the same in IA-32e mode (what Intel calls "compatibility mode") as they do in protected mode; I think the only real difference is that the task management stuff doesn't work in IA-32e mode (and consequently features that rely on task management--e.g., virtual-8086 mode--don't work either). It's worth pointing out that if you're running a 64-bit kernel, then all of your 32-bit applications are running in IA-32e mode and not in protected mode. This also means that it's possible to have a 32-bit application that runs 64-bit code!

But I can run the BCD instructions, the crazy segment stuff, etc. all within a 16-bit or 32-bit code segment of a 64-bit executable. I have the programs to prove it.

JonChesterfield · a month ago
Good post! Stuff I didn't know x64 has. Sadly doesn't answer the "how many registers are behind rax" question I was hoping for, I'd love to know how many outstanding writes one can have to the various architectural registers before the renaming machinery runs out and things stall. Not really for immediate application to life, just a missing part of my mental cost model for x64.
cmovq · a month ago
If you’re asking about the register file, it’s around a couple hundred registers varying by architecture.

You’d need many in-flight uses of ISA registers without dependencies between them to run out of physical registers. You’re more likely to be bottlenecked by execution ports or the decoder well before that happens.

JonChesterfield · a month ago
I've seen claims that it's different for different architectural registers, e.g. _lots_ of backing store for rax, less for rbx. It's likely to be significant for the vector registers too which could plausibly have features like one backing store for the various widths, in which case deliberately using the smaller vectors would sometimes win out. I'll never bother to write the asm by hand with that degree of attention but would like better cost models in the compiler backend.
fuhsnn · a month ago
Intel's next gen will add 16 more general purpose registers. Can't wait for the benchmarks.
Joker_vD · a month ago
So every function call will need to spill even more call-clobbered registers to the stack!

Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?

cvoss · a month ago
With an increase in available registers, every value that a compiler might newly choose to keep in a register was a value that would previously have lived in the local stack frame anyway.

It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.

jandrewrogers · a month ago
Most function calls are aggressively inlined by the compiler such that they are no longer "function calls". More registers will make that even more effective.
adrian_b · 25 days ago
ARM AArch64, IBM POWER and most RISC CPUs have had 32 general-purpose registers for many decades, and this has always been an advantage for them, not a disadvantage.

So there already is plenty of experience with ABIs using 32 GPRs.

Even so, unnecessary spilling could be avoided, but that would require some changes in programmer habits. For instance, one could require an ordering of the compilation of source files, e.g. always compiling the dependencies of a program module before the module itself. The compiler could then annotate the module interfaces with register usage and use these annotations when compiling a program that imports the modules. This would avoid the need to define an ABI that is suboptimal in most cases, saving either too many or too few registers.

throwaway17_17 · a month ago
Why does having more registers lead to spilling? I would assume, probably incorrectly, that more registers means less spilling. Are you talking about calls inside other calls, which cause the outer scope's arguments to be preemptively spilled so the inner scope's data can be pre-placed in registers?
crest · 25 days ago
Big non-trivial functions generally profit from having more registers. They don't have to be saved and restored by functions that don't use them. Not every computationally intense application has just a few numerical kernels that can be pushed onto GPUs and fit the GPU uarch well. Just look at the difference between optimized compiler-generated IA-32 and AMD64 code to see how much difference more registers can make.

You could argue that 16 (integer) registers is a sweet spot, but you failed to state a proper argument.

bjourne · a month ago
Most modern compilers for modern languages do an insane amount of inlining so the problem you're mentioning isn't a big issue. And, basically, GPUs and TPUs can't handle branches. CPUs can.
woadwarrior01 · a month ago
I looked it up. It's called APX (Advanced Performance Extensions)[1].

[1]: https://www.intel.com/content/www/us/en/developer/articles/t...

nxobject · a month ago
Oh my, it has three-operand instructions now. VAX vindicated...
BobbyTables2 · a month ago
How are they adding GPRs? Won’t that utterly break how instructions are encoded?

That would be a major headache — even if current instruction encodings were somehow preserved.

It’s not just about compilers and assemblers. Every single system implementing virtualization has a software emulation of the instruction set - easily 10k lines of very dense code/tables.

Joker_vD · a month ago
The same way AMD added 8 new GPRs, I imagine: by introducing a new instruction prefix.
toast0 · a month ago
x86 is broadly extendable. APX adds a REX2 prefix to address the new registers, and also allows using the EVEX prefix in new ways. And there are new conditional instructions whose encoding wasn't really described on the summary page.

Presumably this is gated behind cpuid and/or model specific registers, so it would tend to not be exposed by virtualization software that doesn't support it. But yeah, if you decode and process instructions, it's more things to understand. That's a cost, but presumably the benefit outweighs the cost, at least in some applications.

It's the same path as any x86 extension. In the beginning only specialty software uses it, at some point libraries that have specialized code paths based on processor features will support it, if it works well it becomes standard on new processors, and eventually most software requires it. Or it doesn't work out and it gets dropped from future processors.

vaylian · a month ago
Those general purpose registers will also need to grow to twice their size, once we get our first 128-bit CPU architecture. I hope Intel is thinking this through.
SAI_Peregrinus · a month ago
That's a ways out. We're not even using all the bits in addresses yet. Unless they want hardware pointer tagging a la CHERI, there's not going to be a need to increase address sizes, but that doesn't expose the extra bits to the user.

Data registers could be bigger. There's no reason `sizeof int` has to equal `sizeof intptr_t`, many older architectures had separate address & data register sizes. SIMD registers are already a case of that in x86_64.

dlcarrier · a month ago
Doubling the number of bits squares the number range that can be stored, so there's a point of diminishing returns.

* Four-bit processors can only count to 15, or from -8 to 7, so their use has been pretty limited. It is very difficult for them to do any math, and they've mostly been used for state machines.

* Eight-bit processors can count to 255, or from -128 to 127, so much more useful math can run in a single instruction, and they can directly address hundreds of bytes of RAM, which is low enough that an entire program still often requires paging, but at least a routine can reasonably fit in that range. Very small embedded systems still use 8-bit processors.

* Sixteen-bit processors can count to 65,535, or from -32,768 to 32,767, allowing far more math to work in a single instruction, and a computer can have tens of kilobytes of RAM or ROM without any paging, which was small but not uncommon when sixteen-bit processors initially gained popularity.

* Thirty-two-bit processors can count to 4,294,967,295, or from -2,147,483,648 to 2,147,483,647, so it's rare to ever need multiple instructions for a single math operation, and a computer can address four gigabytes of RAM, which was far more than enough when thirty-two-bit processors initially gained popularity. The need for more bits in general-purpose computing plateaus at this point.

* Sixty-four-bit processors can count to 18,446,744,073,709,551,615, or from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, so only special-case calculations need multiple instructions for a single math operation, and a computer can address up to sixteen exabytes of RAM, which is thousands of times more than current supercomputers use. There's so many bits that programs only rarely perform 64-bit operations, and 64-bit instructions are often performing single-instruction-multiple-data operations that use multiple 8-, 16-, or 32-bit numbers stored in a single register.

We're already at the point where we don't gain a lot from true 64-bit instructions, with the registers being more often used by vector instructions that store multiple numbers in a single register, so a 128-bit processor is kind of pointless. Sure, we'll keep growing the registers specific to vector instructions, but those are already 512 bits wide on the latest processors, and we don't call them 512-bit processors.

Granted, before 64-bit consumer processors existed, no one would have conceived that simultaneously running a few chat interfaces, like Slack and Discord, while browsing a news web page, could fill up more RAM than a 32-bit processor can address, so software using exabytes of RAM will likely happen as soon as we can manufacture it, thanks to Wirth's Law (https://en.wikipedia.org/wiki/Wirth%27s_law), but until then there's no likely path to 128-bit consumer processors.

rwmj · a month ago
There's a first time for everything.
dang · a month ago
Related. Others?

How many registers does an x86-64 CPU have? (2020) - https://news.ycombinator.com/item?id=36807394 - July 2023 (10 comments)

How many registers does an x86-64 CPU have? - https://news.ycombinator.com/item?id=25253797 - Nov 2020 (109 comments)

saagarjha · a month ago
These seem more “past discussion” than “related” to me tbh
burnt-resistor · a month ago
Conservatively though, another answer could be when not considering subset registers as distinct:

16 GP
2 state (flags + IP)
6 seg
4 TRs
11 control
32 ZMM0-31 (repurposes 8 FPU GP regs)
1 MXCSR
6 FPU state
28 important MSRs
7 bounds
6 debug
8 masks
8 CET
10 FRED
=========
145 total

And don't forget another 10-20 for the local APIC.

"The answer" depends upon the purpose and a specific set of optional extensions. Function call, task switching between processes in an OS, and emulation virtual machine process state have different requirements and expectations. YMMV.

Here's a good list for reference: https://sandpile.org/x86/initial.htm

bonzini · a month ago
ZMM registers don't repurpose the 8 FPU registers; it's the MM registers that alias the FPU stack, and the ZMM set is separate from the MM set. So there are 8 more.

In hardware, however, renaming resources are shared between ST/MM registers and the eight Kn mask registers.

stevefan1999 · a month ago
You also have to account for register renaming, which basically means each pipeline may have its own register set invisible to the others. And that, IIRC, depends on how the instruction scheduler plans the execution and the SRAM space needed, so calculating the total SRAM space the registers take would be far more realistic.