There are definitely several ways to ask the question, but IMHO the best answer to an unqualified version of it is to frame it in terms of the state that needs to be stored in a thread context. To that end, the registers boil down to:
* 16 general-purpose registers.
* 16 or 32 vector registers.
* 7 vector mask registers (k1-k7; the k0 encoding means "unmasked").
* 8 x87 floating-point registers, with 8 MMX registers aliased. (To separate or not separate x87/MMX is definitely a challenging question.)
* 3 normal status registers: RIP, RFLAGS, MXCSR.
* 6 x87 status registers: FSW, FCW, FTW, FDP, FIP, FOP.
* 6 segment registers.
* 6 debug registers.
* Relevant MPX registers (I don't know this ISA extension very well, so I can't count these registers accurately).
These are the registers that I would expect to be able to poke at in a debugger or inspect/modify via something like ucontext_t, and they're going to be found in whatever kernel abstraction you use to save not-currently-running thread information.
That is indeed a perfectly reasonable way to frame the answer. Registers do hold state, and saving/restoring them is the burden of program code (user, compiler, library, OS).
Contrast this with a hypothetical CPU that has only one register, "base", and allows addressing 32 words starting at the address in that base register. I.e., things like arithmetic instructions would have 5 bits to address operands, which would be interpreted as base + 8*n. To make things even more interesting, this architecture's instruction pointer lives at base+0.
Such an architecture would have one register under your metric (as only one register needs to be saved/restored to context switch an entire "register file").
However, implementations (microarchitectures) could actually shadow that memory range into hardware registers, and page the whole register bank in/out upon writes to the base register (effectively performing a hardware-assisted context switch; hello, TSS).
However, since each instruction in this hypothetical ISA must have enough space in the encoding to address these operands, for all intents and purposes this architecture would have 32 registers.
Decoding instructions, addressing operands, dealing with the consequences of code density (icache misses), ... are all way more frequent events than context switches.
Hence I do agree with TFA that operand encoding should be the default metric to count registers. And this also includes sub/overlapping registers, if they are independently addressed.
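The encoding argument above can be made concrete with a quick back-of-the-envelope check (a sketch, not tied to any real instruction encoder):

```python
def addressable_registers(operand_field_bits):
    """How many register names an operand field of this width can encode."""
    return 2 ** operand_field_bits

# The hypothetical ISA above: 5-bit operand fields -> 32 addressable "registers",
# regardless of how few hardware registers back them.
assert addressable_registers(5) == 32

# x86-64 GPRs: a 3-bit ModRM register field plus one REX extension bit -> 16 names.
assert addressable_registers(3 + 1) == 16
```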
> Contrast this with a hypothetical cpu that has only one register "base" and allows to address 32 words after the address at that base register.
This, er, wasn't really hypothetical. The TMS9900, the CPU used in the TI-99/4A, had three hardware registers: a program counter, a status word, and what was called a Workspace Pointer (WP). General-purpose "registers" lived in RAM and were referenced as offsets from the value in the WP. Subroutine calls were initiated by saving the PC and changing the WP to a fresh new register context before branching.
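A toy model of that scheme (heavily simplified: word sizes, the status word, and the BLWP call mechanics are glossed over) might look like:

```python
# TMS9900-style workspace registers: the 16 "registers" R0-R15 are just
# consecutive words of RAM starting at the Workspace Pointer (WP).
memory = [0] * 256
wp = 0  # current workspace base address

def reg_read(n):
    return memory[wp + n]

def reg_write(n, v):
    memory[wp + n] = v

reg_write(0, 42)          # R0 of the current context lives at address 0
old_wp = wp
wp = 32                   # "context switch": just point WP somewhere else
reg_write(0, 7)           # R0 now names a different RAM word
assert memory[0] == 42 and memory[32] == 7
wp = old_wp
assert reg_read(0) == 42  # the old context survived untouched in RAM
```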
Weird architectures definitely make the question a lot more difficult to answer. My gut reaction would be to say that your hypothetical ISA has 33 registers, and that the register file is memory-mapped to a specified region of virtual memory. That's partly because of how you'd have to worry about cache coherence working out, but also because I suspect the mcontext_t or OS-equivalent interface would define its structure layout that way.
The broader point, though, is in deciding whether or not to include registers like CR0 and DR0. The principle I'm using here is that registers that are not expected to be saved/restored on task switches should be excluded. Registers that are per-process (e.g., page tables in general, or segment descriptors on x86) or per-CPU (most MSRs) are thus excluded by this criterion.
FSBASE/GSBASE are extremely borderline--I wouldn't complain whether they were included in or excluded from a list of registers. These act as a mixture of user-visible registers (even if accessible only via syscalls until very recently) and segment-descriptor information. They're not in Linux's userspace-visible mcontext_t struct, but they are in the kernel's equivalent of mcontext_t.
> I won’t count microarchitectural implementation details, like shadow registers.
I really think this article would have been more interesting if he had discussed the microarchitectural registers. Those are important to understand for optimization even if they're not directly visible.
Also, discussing MSRs but not special information tables like the page tables and the VMCS is a somewhat odd distinction. While they're probably stored differently, they are somewhat similar in how they are used.
Also, isn't the TLB a kind of set of registers? TLB entries are accessed very frequently. How about store buffers and the like?
Yep, what really happens under the hood is dynamic register allocation at runtime. I wonder how important static register allocation by the compiler is in that scenario. In theory, even if the compiler uses the same register over and over in sequential instructions, the renamer should be smart enough to detect the false data dependencies and map registers from individual instructions to different locations in the register file.
Are there any x86 profiling tools which give any metrics about the real utilisation of the register file?
The number of architectural registers is still relevant for register allocation, because overlapping but independent code sequences of course cannot share the same architectural register name. This is not very important for integer workloads, but it is still relevant for FP, where optimal scheduling requires having multiple computations in flight at the same time. In some cases 16 FP registers are not enough, and Intel had to add 16 more with AVX-512.
I don't know of any profiling tools, but I've always relied on llvm-mca's -register-file-stats option [0] to show me what the expected register file usage is on an ISA.

[0] https://llvm.org/docs/CommandGuide/llvm-mca.html
Not just in theory. The way modern register renaming works, every single write to a register name allocates a new physical register. It's not even possible to create false dependencies by reusing the same name over and over.
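A minimal sketch of that renaming discipline (the table and allocation policy here are illustrative, not any real microarchitecture):

```python
# Register renaming sketch: every architectural write is assigned a fresh
# physical register, so reusing the same name creates no WAW/WAR hazard.
class Renamer:
    def __init__(self):
        self.next_phys = 0
        self.rat = {}        # register alias table: arch name -> physical reg

    def write(self, arch):
        phys = self.next_phys           # allocate a fresh physical register
        self.next_phys += 1
        self.rat[arch] = phys           # later reads of `arch` see this one
        return phys

    def read(self, arch):
        return self.rat[arch]

r = Renamer()
p1 = r.write("rax")     # first write to rax
p2 = r.write("rax")     # reuse of the name: an independent physical register
assert p1 != p2         # no false dependency between the two writes
assert r.read("rax") == p2  # reads see the most recent mapping
```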
From the programmer's point of view, only the working registers are really important; other registers are just OS-related (MSRs). Some are really important, but their internal representation may be totally different.
Like a mode-switch bit in a CR register. So the MSRs are just the interface, and MSR access can be "slow", so no synchronization or optimization is required.
But the idea that more registers make a better architecture is a totally bad assumption. See the dead body of Itanium (128 general-purpose 64-bit integer registers, 128 floating-point registers, etc.).
With multitasking, one has to switch between contexts, and a larger context (register file size) takes more time to switch.
There are cases where you are better off using just the GPRs rather than the SIMD registers. (The Linux kernel does not use the FPU or SIMD registers.)
Also, SIMD usage may slow down the clock, as AVX does on x86_64. So you may trust your compiler for vectorization, but it may do more harm than good.
IMHO what killed Itanium wasn't too many registers, and not even compiler difficulties; it was the attempt to have working x86 emulation.
So instead of a weird but very fast CPU, it ended up not very fast in both x86 and native modes, while still being weird. (The makers of the Cell CPU did not compromise, went full weird, and had a winner of sorts.)
More GPRs visible in the ISA also means more bits needed to encode instructions. If instruction length and encoding were not an issue, I bet we would have seen memory-to-memory ISAs where no GPRs exist, only instructions referencing memory locations. The dynamic register file would then be just a level below L1 cache, or even completely removed.
> With multitasking, one have to switch between context, and larger context (register file size) takes more time.
SPARC chips got around that by having sliding register windows: instead of having to push all the registers to the stack, you just moved the window.
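A toy model of those windows (using the real SPARC layout of 8 in / 8 local / 8 out registers per window, with consecutive windows overlapping by 8, but ignoring the globals, window wraparound, and spill/fill traps):

```python
# SPARC-style register windows: a call slides the current window pointer
# (CWP) by 16, so the caller's "out" registers become the callee's "in"
# registers, passing arguments with no memory traffic.
FILE = [0] * 128
cwp = 0  # base of the current 24-register window in the file

def in_reg(i):   return cwp + i        # %i0-%i7
def out_reg(i):  return cwp + 16 + i   # %o0-%o7

FILE[out_reg(0)] = 99   # caller places an argument in %o0
cwp += 16               # "save": slide the window instead of pushing regs
assert FILE[in_reg(0)] == 99   # callee reads the same cell as %i0
cwp -= 16               # "restore": the caller's window reappears
assert FILE[out_reg(0)] == 99
```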
They share the underlying storage (i.e. they are aliased) but they are independently addressed, and thus they consume instruction encoding space.
Not saying that it's not interesting to know how much actual storage the register file offers; just highlighting that TFA focuses on the instruction encoding angle of the question, which is also important.
CPU architectures are masterpieces of tradeoffs.
Put in too many registers and your instruction stream is not dense enough, and you cannot keep your CPU busy due to stalls in the fetch phase. Context switches also become expensive (though there are solutions to that).
Put in too few registers and you have to spill registers to memory too often, which also consumes precious instruction-stream space.
> I will count sub-registers (e.g., EAX for RAX) as distinct registers. My justification: they have different instruction encodings, and both Intel and AMD optimize/pessimize particular sub-register use patterns in their microcode.
I disagree strongly with that characterisation. Just no.
I think it's pointless to debate a methodology without a purpose. "How many registers does an x86-64 CPU have?" is interesting (to 58 voters so far) but too general to be useful for any particular purpose. Consider a couple alternate questions brought up in this thread:
* How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)
* How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)
Even these, one might argue, aren't directly useful; when considering context switching, one could dig down further into how much of the context-switch time is attributable to saving the registers, validate that with experiments across architectures, etc.
> * How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)
This is something that has been bothering me for some time now (actually since the mid-80s): why not implement multiple contexts as an index into a large register file? This way, a context switch would take only the time it takes to write the `task-id` register. It would impact latencies, but would the impact of having, say, 8 contexts not be smaller than having to hit L1 or L2 for the same data?
It makes sense if you're counting register _encodings_ (that is, how much ISA encoding space the registers use). But yeah, a more useful count would consider the sub-registers as part of the main register (and the same for other ISAs like 64-bit ARM, which does have a 32-bit view of its 64-bit general purpose registers), and would not consider registers outside each core (like the MTRR registers).
It depends on whether you're looking at this from a hardware-implementation or a software-implementation perspective, I'd say. The article specifically mentions Rosetta 2 as the context, so I'm guessing the enumeration matters more if you intend to understand all the things Rosetta 2 has to implement.
Funny to think that once x86 was the platform that had too few general-purpose registers, so people sacrificed the frame pointer register in their highly-optimized assembly routines...
I still remember benchmarking the various optimization options in GCC and the only one that consistently and significantly improved performance on real code was -fomit-frame-pointer.
Maybe count how many bits of registers there are. Then RAX would count as 64 bits of registers.
> How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)

It still shouldn't count them, because they can't be distinct in the emulator: writing to a sub-register affects the larger register and vice versa.
"Rosetta is a translation process"
https://developer.apple.com/documentation/apple_silicon/abou...
http://archive.gamedev.net/archive/reference/articles/articl...
http://www.rcollins.org/Errata/Jan97/Bugs.html
I plan to do an update to the post this afternoon.