There are definitely several ways to ask the question, but IMHO the best answer to an unqualified version of it is to frame it in terms of the state that needs to be stored in a thread context. To that end, the registers boil down to:
* 16 general-purpose registers.
* 16 or 32 vector registers.
* 7 vector mask registers (k1-k7; the k0 encoding means "unmasked").
* 8 x87 floating-point registers, with 8 MMX registers aliased. (To separate or not separate x87/MMX is definitely a challenging question.)
* 3 normal status registers: RIP, RFLAGS, MXCSR.
* 6 x87 status registers: FSW, FCW, FTW, FDP, FIP, FOP.
* 6 segment registers.
* 6 debug registers.
* Relevant MPX registers (I don't know this ISA extension very well, so I can't count these registers accurately).
These are the registers that I would expect to be able to poke at in a debugger or inspect/modify via something like ucontext_t, and they're going to be found in whatever kernel abstraction you use to save not-currently-running thread information.
That is indeed a perfectly reasonable way to frame the answer. Registers do hold state, and saving/restoring them is the burden of program code (user, compiler, library, OS).
Contrast this with a hypothetical CPU that has only one register, "base", and allows addressing 32 words starting at the address in that base register. I.e., things like arithmetic instructions would have 5 bits to address operands, which would be interpreted as base + 8*n. To make things even more interesting, this architecture's instruction pointer lives at base+0.
Such an architecture would have one register under your metric (as only one register needs to be saved/restored to context switch an entire "register file").
However, implementations (microarchitectures) could actually shadow that memory range into hardware registers, and page the whole register bank in/out upon writes to the base register (effectively performing a hardware-assisted context switch; hello, TSS).
However, since each instruction in this hypothetical ISA must have enough space in the encoding to address these operands, for all intents and purposes this architecture would have 32 registers.
Decoding instructions, addressing operands, dealing with the consequences of code density (icache misses), ... are all way more frequent events than context switches.
Hence I do agree with TFA that operand encoding should be the default metric to count registers. And this also includes sub/overlapping registers, if they are independently addressed.
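The encoding argument above can be made concrete with a quick back-of-the-envelope check (a sketch, not tied to any real instruction encoder):

```python
def addressable_registers(operand_field_bits):
    """How many register names an operand field of this width can encode."""
    return 2 ** operand_field_bits

# The hypothetical ISA above: 5-bit operand fields -> 32 addressable "registers",
# regardless of how few hardware registers back them.
assert addressable_registers(5) == 32

# x86-64 GPRs: a 3-bit ModRM register field plus one REX extension bit -> 16 names.
assert addressable_registers(3 + 1) == 16
```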
> Contrast this with a hypothetical cpu that has only one register "base" and allows to address 32 words after the address at that base register.
This, er, wasn't really hypothetical. The TMS9900, the CPU used in the TI-99/4A, had three hardware registers: a program counter, a status word, and what was called a Workspace Pointer (WP). General-purpose "registers" lived in RAM and were referenced as offsets from the value in the WP. Subroutine calls were initiated by saving the PC and changing the WP to a fresh new register context before branching.
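A toy model of that scheme (heavily simplified: word sizes, the status word, and the BLWP call mechanics are glossed over) might look like:

```python
# TMS9900-style workspace registers: the 16 "registers" R0-R15 are just
# consecutive words of RAM starting at the Workspace Pointer (WP).
memory = [0] * 256
wp = 0  # current workspace base address

def reg_read(n):
    return memory[wp + n]

def reg_write(n, v):
    memory[wp + n] = v

reg_write(0, 42)          # R0 of the current context lives at address 0
old_wp = wp
wp = 32                   # "context switch": just point WP somewhere else
reg_write(0, 7)           # R0 now names a different RAM word
assert memory[0] == 42 and memory[32] == 7
wp = old_wp
assert reg_read(0) == 42  # the old context survived untouched in RAM
```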
Weird architectures definitely make the question a lot more difficult to answer. My gut reaction would be to say that your hypothetical ISA has 33 registers, and that the register file is memory-mapped to a specified region of virtual memory. That's partly because of how you'd have to worry about cache coherence working out, but also because I suspect the mcontext_t or OS-equivalent interface would define its structure layout that way.
The broader point, though, is in deciding whether or not to include registers like CR0 and DR0. The principle I'm using here is that registers that are not expected to be saved/restored on task switches should be excluded. Registers that are per-process (e.g., page tables in general, or segment descriptors on x86) or per-CPU (most MSRs) are thus excluded by this criterion.
FSBASE/GSBASE are extremely borderline--I wouldn't complain whether they were included in or excluded from a list of registers. These act as a mixture of user-visible registers (even if accessible only via syscalls until very recently) and segment-descriptor information. They're not in Linux's userspace-visible mcontext_t struct, but they are in the kernel's equivalent of mcontext_t.
> I won’t count microarchitectural implementation details, like shadow registers.
I really think this article would have been more interesting if he had discussed the microarchitectural registers. Those are important to understand for optimization even if they're not directly visible.
Also, discussing MSRs but not special information tables like the page tables and the VMCS is a somewhat odd distinction. While they're probably stored differently, they are somewhat similar in how they are used.
Also, isn't the TLB a kind of set of registers? TLB entries are accessed very frequently. How about store buffers and the like?
Yep, what really happens under the hood is dynamic register allocation at runtime. I wonder how important static register allocation by the compiler is in that scenario. In theory, even if the compiler uses the same register over and over in sequential instructions, the renamer should be smart enough to detect the false data dependencies and map registers from individual instructions to different locations in the register file.
Are there any x86 profiling tools which give any metrics about the real utilisation of the register file?
The number of architectural registers is still relevant for register allocation, because overlapping but independent code sequences of course cannot share the same architectural register name. This is not very important for integer workloads, but it is still relevant for FP, where optimal scheduling requires having multiple computations in flight at the same time. In some cases 16 FP registers are not enough, and Intel had to add 16 more with AVX-512.
I don't know of any profiling tools, but I've always relied on llvm-mca's -register-file-stats option [0] to show me what the expected register file usage is on an ISA.

[0] https://llvm.org/docs/CommandGuide/llvm-mca.html
Not just in theory. The way modern register renaming works, every single write to a register name allocates a new physical register. It's not even possible to create false dependencies by reusing the same name over and over.
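A minimal sketch of that renaming discipline (the table and allocation policy here are illustrative, not any real microarchitecture):

```python
# Register renaming sketch: every architectural write is assigned a fresh
# physical register, so reusing the same name creates no WAW/WAR hazard.
class Renamer:
    def __init__(self):
        self.next_phys = 0
        self.rat = {}        # register alias table: arch name -> physical reg

    def write(self, arch):
        phys = self.next_phys           # allocate a fresh physical register
        self.next_phys += 1
        self.rat[arch] = phys           # later reads of `arch` see this one
        return phys

    def read(self, arch):
        return self.rat[arch]

r = Renamer()
p1 = r.write("rax")     # first write to rax
p2 = r.write("rax")     # reuse of the name: an independent physical register
assert p1 != p2         # no false dependency between the two writes
assert r.read("rax") == p2  # reads see the most recent mapping
```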
From the programmer's point of view, only the working registers are really important; other registers are just OS-related (MSRs). Some are really important, but their internal representation may be totally different.
Like a mode-switch bit in a CR register. So the MSRs are just the interface, and MSR access can be "slow", so no synchronization or optimization is required.
But the idea that more registers make a better architecture is a totally bad assumption. See the dead body of Itanium (128 general-purpose 64-bit integer registers, 128 floating-point registers, etc.).
With multitasking, one has to switch between contexts, and a larger context (register file size) takes more time to switch.
There are cases where you are better off using just the GPRs rather than the SIMD registers. (The Linux kernel does not use the FPU or SIMD registers.)
Also, SIMD usage may slow down the clock, as AVX does on x86_64. So you may trust your compiler for vectorization, but it may do more harm than good.
IMHO what killed Itanium wasn't too many registers, and not even compiler difficulties; it was the attempt to have working x86 emulation.
So instead of a weird but very fast CPU, it ended up not very fast in both x86 and native modes, while still being weird. (The makers of the Cell CPU did not compromise, went full weird, and had a winner of sorts.)
More GPRs visible in the ISA also means more bits needed to encode instructions. If instruction length and encoding were not an issue, I bet we would have seen memory-to-memory ISAs where no GPRs exist, only instructions referencing memory locations. The dynamic register file would then be just a level below L1 cache, or even completely removed.
> With multitasking, one have to switch between context, and larger context (register file size) takes more time.
SPARC chips got around that by having sliding register windows: instead of having to push all the registers to the stack, you just moved the window.
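A toy model of those windows (using the real SPARC layout of 8 in / 8 local / 8 out registers per window, with consecutive windows overlapping by 8, but ignoring the globals, window wraparound, and spill/fill traps):

```python
# SPARC-style register windows: a call slides the current window pointer
# (CWP) by 16, so the caller's "out" registers become the callee's "in"
# registers, passing arguments with no memory traffic.
FILE = [0] * 128
cwp = 0  # base of the current 24-register window in the file

def in_reg(i):   return cwp + i        # %i0-%i7
def out_reg(i):  return cwp + 16 + i   # %o0-%o7

FILE[out_reg(0)] = 99   # caller places an argument in %o0
cwp += 16               # "save": slide the window instead of pushing regs
assert FILE[in_reg(0)] == 99   # callee reads the same cell as %i0
cwp -= 16               # "restore": the caller's window reappears
assert FILE[out_reg(0)] == 99
```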
They share the underlying storage (i.e. they are aliased) but they are independently addressed, and thus they consume instruction encoding space.
Not saying that it's not interesting to know how much actual storage the register file offers; just highlighting that TFA focuses on the instruction encoding angle of the question, which is also important.
CPU architectures are masterpieces of tradeoffs.
Put in too many registers and your instruction stream is not dense enough, and you cannot keep your CPU busy due to stalls in the fetch phase. Context switches also become expensive (though there are solutions to that).
Put in too few registers and you have to spill registers to memory too often, which also consumes precious instruction-stream space.
> I will count sub-registers (e.g., EAX for RAX) as distinct registers. My justification: they have different instruction encodings, and both Intel and AMD optimize/pessimize particular sub-register use patterns in their microcode.
I disagree strongly with that characterisation. Just no.
I think it's pointless to debate a methodology without a purpose. "How many registers does an x86-64 CPU have?" is interesting (to 58 voters so far) but too general to be useful for any particular purpose. Consider a couple alternate questions brought up in this thread:
* How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)
* How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)
Even these, one might argue, aren't directly useful; when considering context switching, one could dig down further into how much of the context-switch time is attributable to saving the registers, validate that with experiments across architectures, etc.
> * How many bytes of register files does the OS have to save on task switches? (With this question, subregisters shouldn't be counted.)
This is something that has been bothering me for some time now (actually since the mid-80s): why not implement multiple contexts as an index into a large register file? This way, a context switch would take only the time it takes to write the `task-id` register. It would impact latencies, but would the impact of having, say, 8 contexts not be smaller than having to hit L1 or L2 for the same data?
It makes sense if you're counting register _encodings_ (that is, how much ISA encoding space the registers use). But yeah, a more useful count would consider the sub-registers as part of the main register (and the same for other ISAs like 64-bit ARM, which does have a 32-bit view of its 64-bit general purpose registers), and would not consider registers outside each core (like the MTRR registers).
It depends on whether you're looking at this from a hardware-implementation or a software-implementation perspective, I'd say. The article specifically mentions Rosetta 2 as the context, so I'm guessing the enumeration matters more if you intend to understand all the things Rosetta 2 has to implement.
Funny to think that once x86 was the platform that had too few general-purpose registers, so people sacrificed the frame pointer register in their highly-optimized assembly routines...
I still remember benchmarking the various optimization options in GCC and the only one that consistently and significantly improved performance on real code was -fomit-frame-pointer.
Maybe count how many bits of registers there are. Then RAX would count as 64 bits of registers.
> How many registers does an emulator (such as Rosetta 2) have to implement and test? (Subregisters should be counted.)

It still shouldn't count them, because they can't be distinct in the emulator: writing to a sub-register affects the larger register and vice versa.
"Rosetta is a translation process"
https://developer.apple.com/documentation/apple_silicon/abou...
http://archive.gamedev.net/archive/reference/articles/articl...
http://www.rcollins.org/Errata/Jan97/Bugs.html
I plan to do an update to the post this afternoon.