aengelke · a year ago
Bonus quirk: there's BSF/BSR, for which the Intel SDM states that on zero input, the destination has an undefined value. (AMD documents that the destination is not modified in that case.) And then there's glibc, which happily uses the undocumented fact that the destination is also unmodified on Intel [1]. It took me quite some time to track down the issue in my binary translator. (There's also TZCNT/LZCNT, which is BSF/BSR encoded with F3-prefix -- which is silently ignored on older processors not supporting the extension. So the same code will behave differently on different CPUs. At least, that's documented.)
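For illustration, the pattern at issue looks roughly like this (my own sketch, not glibc's actual code): preload the destination with a fallback value and assume BSF leaves it alone on zero input.

  #include <stdio.h>

  /* Sketch only: rely on BSF leaving the destination unmodified when the
     source is zero. AMD documents this; the Intel SDM calls it undefined. */
  static unsigned bsf_or(unsigned x, unsigned fallback)
  {
      unsigned idx = fallback;                 /* preloaded "result" */
      __asm__ ("bsf %1, %0" : "+r"(idx) : "rm"(x) : "cc");
      return idx;                              /* == fallback if x == 0, on current CPUs */
  }

  int main(void)
  {
      printf("%u\n", bsf_or(0x10, 99));  /* 4 */
      printf("%u\n", bsf_or(0, 99));     /* 99 in practice, undefined per Intel's docs */
      return 0;
  }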

Encoding: People often complain about prefixes, but IMHO, they're far from the worst part. They are well known and somewhat well documented. There are worse quirks: for example, REX/VEX/EVEX.RXB extension bits are ignored when they do not apply (e.g., MMX registers); except for mask registers (k0-k7), where they trigger #UD -- also fine -- except if the register is encoded in ModRM.rm, in which case the extension bit is ignored again.
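Written out as a decoder-style helper, the rule looks something like this (just my paraphrase of the above, not code from any real decoder):

  /* What to do with a REX/VEX/EVEX extension bit that lands on a register
     class it doesn't apply to (illustrative only). */
  enum reg_class { RC_GP, RC_XMM, RC_MMX, RC_MASK };
  enum reg_slot  { SLOT_MODRM_REG, SLOT_MODRM_RM, SLOT_VVVV };
  enum ext_action { EXT_USE, EXT_IGNORE, EXT_UD };

  static enum ext_action ext_bit_action(enum reg_class cls, enum reg_slot slot)
  {
      switch (cls) {
      case RC_GP:
      case RC_XMM:
          return EXT_USE;      /* bit actually extends the register number */
      case RC_MASK:
          /* k0-k7: a set extension bit is #UD -- unless the register is
             encoded in ModRM.rm, where the bit is silently ignored again. */
          return slot == SLOT_MODRM_RM ? EXT_IGNORE : EXT_UD;
      case RC_MMX:
      default:
          return EXT_IGNORE;   /* bit silently dropped */
      }
  }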

APX takes the number of quirks to a different level: the REX2 prefix can encode general-purpose registers r16-r31, but not xmm16-xmm31; the EVEX prefix has several opcode-dependent layouts; and which extension bits are used for a register depends on the register type (XMM registers use X3:B3:rm and V4:X3:idx; GP registers use B4:B3:rm, X4:X3:idx). I can't give a complete list yet; I still haven't finished my APX decoder after a year...

[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=31748

bonzini · a year ago
On and off over the last year I have been rewriting QEMU's x86 decoder. It started as a necessary task to incorporate AVX support, but I am now at a point where only a handful of opcodes are left to rewrite, after which it should not be too hard to add APX support. For EVEX my plan is to keep the raw bits until after the opcode has been read (i.e. before immediates and possibly before modrm) and the EVEX class identified.
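A minimal sketch of the idea (hypothetical names, not QEMU's actual structures): stash the payload bytes raw and only extract fields such as vvvv once map+opcode tell you which EVEX class you're in.

  #include <stdint.h>

  struct evex_raw {
      uint8_t p0, p1, p2;        /* the three payload bytes after the 0x62 escape */
  };

  struct decode_ctx {
      struct evex_raw evex;      /* raw, uninterpreted prefix payload */
      uint8_t map;               /* opcode map (EVEX.mmm) */
      uint8_t opcode;
  };

  /* Only called once map+opcode (and thus the EVEX class) are known,
     because the same field can mean different things per opcode. */
  static unsigned evex_vvvv(const struct decode_ctx *d)
  {
      return ((d->evex.p1 >> 3) & 0xf) ^ 0xf;   /* EVEX.vvvv is stored inverted */
  }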

My decoder is mostly based on the tables in the manual, and the code is mostly okay—not too much indentation and phases mostly easy to separate/identify. Because the output is JITted code, it's ok to not be super efficient and keep the code readable; it's not where most of the time is spent. Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story. And the tables haven't been updated for several years (no K register instructions, for example), so going forward there will be more manual work to do. :(

The top comment explains a bit what's going on: https://github.com/qemu/qemu/blob/59084feb256c617063e0dbe7e6...

(As I said above, there are still a few instructions handled by the old code predating the rewrite, notably BT/BTS/BTR/BTC. I have written the code but not merged it yet).

aengelke · a year ago
Thanks for the pointer to QEMU's decoder! I actually never looked at it before.

So you coded all the tables manually in C -- interesting, that's quite some effort. I opted to autogenerate the tables (and keep them as data only => smaller memory footprint) [1,2]. That's doable because x86 encodings are mostly fairly consistent. I can also generate an encoder from it (OK, you don't need that). Re 'custom size "xh"': AVX-512 also has fourth and eighth sizes. Also interesting that you have a separate row for "66+F2"; I special-case those two instructions (CRC32, MOVBE) with a flag.

I think the prefix decoding is not quite right for x86-64: 26/2e/36/3e are ignored in 64-bit mode, except for 2e/3e as branch-not-taken/taken hints and 3e as notrack. (See SDM Vol. 1, 3.3.7.1: "Other segment override prefixes (CS, DS, ES, and SS) are ignored.") Also, REX prefixes that don't immediately precede the opcode (or the VEX/EVEX prefix) are ignored. Anyhow, I need to take a closer look at the decoder with more time. :-)
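For reference, the 64-bit-mode rules for those bytes written out (my paraphrase of the SDM, not QEMU's or fadec's code):

  /* 26/36 (ES/SS) do nothing in 64-bit mode; 2E/3E (CS/DS) are ignored for
     addressing but live on as branch hints, and 3E doubles as the CET
     "notrack" prefix on indirect JMP/CALL. FS/GS remain real overrides. */
  enum seg_prefix_effect {
      SEG_IGNORED,              /* 0x26, 0x36 */
      SEG_HINT_NOT_TAKEN,       /* 0x2E on Jcc */
      SEG_HINT_TAKEN_NOTRACK,   /* 0x3E: taken hint on Jcc, notrack on JMP/CALL r/m */
      SEG_OVERRIDE_FS,          /* 0x64 */
      SEG_OVERRIDE_GS,          /* 0x65 */
      SEG_NOT_A_SEG_PREFIX,
  };

  static enum seg_prefix_effect seg_prefix_64(unsigned char b)
  {
      switch (b) {
      case 0x26: case 0x36: return SEG_IGNORED;
      case 0x2e: return SEG_HINT_NOT_TAKEN;
      case 0x3e: return SEG_HINT_TAKEN_NOTRACK;
      case 0x64: return SEG_OVERRIDE_FS;
      case 0x65: return SEG_OVERRIDE_GS;
      default:   return SEG_NOT_A_SEG_PREFIX;
      }
  }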

> For EVEX my plan is to keep the raw bits until after the opcode has been read

I came to the same conclusion that this is necessary with APX. The map+prefix+opcode combination identifies how the other fields are to be interpreted. For AVX-512, storing the last byte was sufficient, but with APX, vvvv got a second meaning.

> Nevertheless there are several cases in which the manual is wrong or doesn't say the whole story.

Yes... especially for corner cases, getting real hardware is the only reliable way to find out how the CPU behaves.

[1]: https://github.com/aengelke/fadec/blob/master/instrs.txt [2]: https://github.com/aengelke/fadec/blob/master/decode.c

torusle · a year ago
Another bonus quirk, from the 486 and Pentium era...

BSWAP EAX converts between little endian and big endian. It was a 32-bit instruction to begin with.

However, we have the 0x66 prefix, which switches between 16- and 32-bit operand size. If you apply that to BSWAP EAX, funky, undefined things happen.

On some CPUs (Intel vs. AMD behaved differently) the prefix was just ignored. On others it did something that I call an "inner swap": of the four bytes stored in EAX, bytes 1 and 2 are swapped.

  0x11223344 became 0x11332244.

userbinator · a year ago
Also known as "bswap ax", and research shows that it does something surprising but consistent on almost all hardware: It zeros the register.

https://www.ragestorm.net/blogs/?p=141

https://gynvael.coldwind.pl/?id=268

However, this page, now gone, suggests that some CPUs (early 486s?) did something different: http://web.archive.org/web/20071231192014/http://www.df.lth....

Unfortunately I have not found any evidence or reason to believe that this "inner swap" behaviour you mention exists on any CPU -- except perhaps in some flawed emulators?

CoastalCoder · a year ago
Can you imagine having to make all this logic work faithfully, let alone fast, in silicon?

X86 used to be Intel's moat, but what a nightmarish burden to carry.

dzaima · a year ago
A fun thing is that e.g. "cmp ax, 0x4231" differs from "cmp eax, 0x87654321" only in the presence of the data16 prefix (and, consequently, the immediate width). It's the only significant case (I think?) of a prefix changing the total instruction size, and so, for some such instructions, the 16-bit version is sometimes (but not always!) significantly slower. "But not always" as in: if you microbenchmark a loop of these, sometimes you get entire microseconds of it consistently running at 0.25 cycles/instr on average, and sometimes the exact same code (in the same process!) measures at 3 cycles/instr (tested on Haswell, but uops.info indicates this happens on all non-Atom Intel CPUs since Ivy Bridge).
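For concreteness, here are the two encodings as I read them off the opcode map -- the 66 prefix is the only difference, and it also changes the immediate width and hence the total length, which is the length-changing-prefix case the pre-decoder penalizes:

  /* cmp AX/EAX, imm: only the 66 prefix differs, and with it the
     immediate width and total instruction length. */
  static const unsigned char cmp_ax_imm16[]  = { 0x66, 0x3d, 0x31, 0x42 };        /* cmp ax,  0x4231      (4 bytes) */
  static const unsigned char cmp_eax_imm32[] = { 0x3d, 0x21, 0x43, 0x65, 0x87 };  /* cmp eax, 0x87654321  (5 bytes) */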
dx4100 · a year ago
Did people just... do this by hand (in software), transistor by transistor, or was it laid out programmatically in some sense? As in, were segments created algorithmically, then repeated to obtain the desired outcome? CPU design baffles me, especially considering there are 134 BILLION transistors or so in the latest i7 CPU. How does the team even keep track of, work on, or even load the files to WORK on the CPUs?
gumby · a year ago
A lot of this is done in software (microcode). But even with that case, your statement still holds: "Can you imagine having to make all this logic work faithfully, let alone fast, in the chip itself?" Writing that microcode must be fiendishly difficult given all the functional units, out of order execution, register renaming...
im3w1l · a year ago
It's a lot easier to more or less accidentally create something quirky than it is to create a second quirk-for-quirk compatible system.
turndown · a year ago
Intel is coming out with an improved x86 instruction set that removes a lot of the cruft, called 'APX' (Advanced Performance Extensions).
kimixa · a year ago
It seems that the complexity of state management in a modern superscalar CPU is orders of magnitude more complex than even this.
mananaysiempre · a year ago
The semantics of LZCNT combined with its encoding feel like an own goal: it's encoded as a BSR instruction with a legacy-ignored prefix, but for nonzero inputs its result is the operand size minus one, minus the result of the legacy version. Yes, clz() is a function that exists, but the extra subtraction in its implementation feels like a small cost to pay for extra compatibility, when LZCNT could've just been BSR with different zero-input semantics.
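To make the "extra subtraction" concrete (compiler builtins standing in for the instructions; illustration only):

  #include <assert.h>

  /* For nonzero x, BSR gives the index of the highest set bit and LZCNT the
     number of leading zeros; they mirror each other via one subtraction. */
  static unsigned bsr32(unsigned x)   { return 31 - __builtin_clz(x); }
  static unsigned lzcnt32(unsigned x) { return __builtin_clz(x); }

  int main(void)
  {
      unsigned x = 0x00010000;
      assert(bsr32(x) == 16 && lzcnt32(x) == 15);  /* lzcnt == 31 - bsr */
      return 0;
  }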
BeeOnRope · a year ago
I'm not following: as long as you are introducing a new, incompatible instruction for leading-zero counting, you'd definitely choose LZCNT semantics over BSR semantics; in retrospect, LZCNT has clearly won as the primitive for this use case. BSR is just a historical anomaly with a zero-input problem and no benefit.

What would be the point of offering a new variation of BSR with different input semantics?

bonzini · a year ago
Yes, it's like someone looked at TZCNT and thought "let's encode LZCNT the same way", but it makes no sense.
__turbobrew__ · a year ago
I know nothing about this space, but it would be interesting to hook up a JTAG interface to an x86 CPU and then step instruction by instruction, recording all the register values.

You could then use this data to test whether your emulator perfectly emulates the hardware: run the same program through the emulated CPU and validate that the state is the same after every instruction.
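Something like the following, conceptually (every name below is made up; real debug/virtualization interfaces are far messier):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  struct cpu_state { uint64_t gpr[16]; uint64_t rip; uint64_t rflags; };

  struct target;                                  /* real CPU behind JTAG/virtualization */
  struct emu;                                     /* the emulator under test */
  bool target_step(struct target *t, struct cpu_state *out);
  void emu_step(struct emu *e, struct cpu_state *out);

  /* Single-step both in lockstep and diff the architectural state. */
  static bool lockstep(struct target *hw, struct emu *em)
  {
      struct cpu_state ref, sim;
      while (target_step(hw, &ref)) {             /* step hardware, read back its state */
          emu_step(em, &sim);                     /* step the emulator */
          if (memcmp(&ref, &sim, sizeof ref))     /* any divergence is a bug to chase */
              return false;
      }
      return true;
  }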

saagarjha · a year ago
No need to JTAG; you can test most of the processor in virtualization. The only part you can’t check is the interface for that itself if you’re aiming to emulate it (most don’t). (Also I’m pretty sure the debugging interface is either fused off or locked out on modern Intel processors.)
tyfighter · a year ago
The BSF/BSR quirk is annoying, but I think the reason for it is that they were only thinking about it being used in a loop (or maybe an if) with something like:

  int mask = something;
  ...
  for (int index; _bit_scan_forward(&index, mask); mask ^= 1 << index) {
      ...
  }

Since it sets the ZF on a zero input, they thought that must be all you need. But there are many other uses for (trailing|leading) zero count operations, and it would have been much better for them to just write the register anyway.

sdsd · a year ago
What a cool person. I really enjoy writing assembly; it feels so simple, and I enjoy the vertical aesthetic quality.

The closest I've ever come to something like OP (which is to say, not close at all) was when I was trying to help my JS friend understand the stack, and we ended up writing a mini vm with its own little ISA: https://gist.github.com/darighost/2d880fe27510e0c90f75680bfe...

This could have gone much deeper -- I'd have enjoyed that, but doing so would have detracted from the original educational goal, lol. I should contact that friend and see if he still wants to study with me. It's hard since he's making so much money doing fancy web dev that he has no time to go deep into stuff, whereas my unemployed ass is basically an infinite ocean of time and energy.

actionfromafar · a year ago
You should leverage that into your friend teaching you JS, maybe.
djaouen · a year ago
It’s like my friend 0x86 always said: “Stay away from JavaScript. But stay away from TypeScript harder.”
changexd · a year ago
Thank you for this. Even though it took me some time to understand what's going on, that led me on a series of cool (and very challenging) assembly journeys. As a non-CS major, this code is a really nice entry point for me to start understanding how things work. I will definitely dig deeper.
AstroJetson · a year ago
Check out Justine Tunney and her emulator. https://justine.lol/blinkenlights/

The docs are an amazing tour of how the cpu works.

zarathustreal · a year ago
Astonishing... they never cease to amaze me.
trallnag · a year ago
That name, Tunney. Remember it from around 2014, being homeless, bumming around, and shit posting on Twitter about Occupy lol
pm2222 · a year ago
Prior discussion here https://news.ycombinator.com/item?id=34636699

Cannot believe it's been 16 months. How time flies.

trollied · a year ago
> Writing a CPU emulator is, in my opinion, the best way to REALLY understand how a CPU works

Hard disagree.

The best way is to create a CPU from the gate level, like you do on a decent CS course. (I really enjoyed making a cut-down ARM from scratch.)

timmisiak · a year ago
I think both are useful, but designing a modern CPU from the gate level is out of reach for most folks, and I think there's a big gap between the sorts of CPUs we designed in college and the sort that run real code. I think creating an emulator of a modern CPU is a somewhat more accessible challenge, while still being very educational even if you only get something partially working.
WalterBright · a year ago
When I was at Caltech, another student in the dorm had been admitted because he'd designed and implemented a CPU using only 7400 TTL.

Woz wasn't the only supersmart young computer guy at the time :-)

(I don't know how capable it was; even a 4-bit CPU would be quite a challenge in TTL.)

akira2501 · a year ago
> and the sort that run real code

And the sort that are commercially viable in today's marketplace. The nature of the code has nothing to do with it. The types of machines we play around with today surpass the machines we used to land men on the moon. What's not "real code" about that?

banish-m4 · a year ago
This is an illusion and a red herring. RTL synthesis is the typical functional-prototype stage reached, which is generally sufficient for FPGA work. Burning an ASIC as part of an educational consortium run is doable, but uncommon.
banish-m4 · a year ago
Seconded. A microcoded, pipelined, superscalar, branch-predicting basic processor with L1 data & instruction caches and a write-back L2 cache controller is nontrivial. Most software engineers have an incomplete grasp of data hazards, cache invalidation, or pipeline stalls.
nsguy · a year ago
IIRC from reading some Intel CPU design history, some of their designers came from a CS/software background. But I agree. Software is naturally very sequential, which is different from digital hardware, which is naturally/inherently parallel. A clock can change the state of a million flip-flops all at once; it's a very different way of thinking about computation (though of course at the theoretical level it's all the same), and then there are the physics and EE parts of a real-world CPU. Writing software and designing CPUs are just very different disciplines, and the CPU as it appears to the software developer isn't how it appears to the CPU designer.
quantified · a year ago
Well, I think you're both right. It's satisfying as heck to sling 74xx chips together and you get a feel for the electrical side of things and internal tradeoffs.

When you get to doing that for a CPU that you want to do meaningful work with, you start to lose interest in that detail. Then the complexities of the behavior and spec become interesting, and the emulator approach is more tractable and can cover more types of behavior.

IshKebab · a year ago
I think trollied is correct, actually. I work on a CPU emulator professionally, and while it gives you a great understanding of the spec, there are lots of details about why the spec is the way it is that come down to how you actually implement the microarchitecture. You only learn that stuff by actually implementing a microarchitecture.

Emulators tend not to have many features that you find in real chips, e.g. caches, speculative execution, out-of-order execution, branch predictors, pipelining, etc.

This isn't "the electrical side of things". When he said "gate level" he meant RTL (SystemVerilog/VHDL) which is pretty much entirely in the digital domain; you very rarely need to worry about actual electricity.

brailsafe · a year ago
So far on my journey through Nand2Tetris (since I kind of dropped out of my real CS course), I've gone through the entire process of working my way up from the gate level, and just finished the VM emulator chapter, which took an eternity. Now on to compilation.
commandlinefan · a year ago
OTOH, are you really going to be implementing memory segmenting in your gate-level CPU? I'd say actually creating a working CPU and _then_ emulating a real CPU (warts and all) are both necessary steps to real understanding.
mrspuratic · a year ago
This. I mean, why not start with wave theory and material science if you really want a good understanding :)

In my CS course I learned a hell of a lot from creating a 6800 emulator, though that wasn't on the course; building a working 6800 system was. The development involved running an assembler on a commercial *nix system and then typing the hex object code into an EPROM programmer. You get a lot of time to think about where your bugs are when you have to wait for a UV erase cycle...

monocasa · a year ago
> OTOH, are you really going to be implementing memory segmenting in your gate-level CPU?

I have, but it was a PDP-8 which I'll be the first to admit is kind of cheating.

trollied · a year ago
I agree.
whobre · a year ago
Reading Petzold's "Code" comes pretty close, though, and is easier.
snvzz · a year ago
CPU was a poor choice of words. ISA would have worked.
dmitrygr · a year ago
I've written fast emulators for a dozen non-toy architectures and a few JIT translators for a few as well. x86 still gives me PTSD. I have never seen a messier architecture. There is history, and a reason for it, but still ... damn
trealira · a year ago
Studying the x86 architecture is kind of like studying languages with lots of irregularities and vestigial bits, and with competing grammatical paradigms, e.g. French. Other architectures, like RISC-V and ARMv8, are much more consistent.
aengelke · a year ago
> Other architectures, like [...] ARMv8, are much more consistent.

From an instruction/operation perspective, AArch64 is cleaner. However, from an instruction operand and encoding perspective, AArch64 is a lot less consistent than x86. Consider the different operand types: on x86, there are a dozen register types, immediates (8/16/32/64 bits), and memory operands (always the same layout). On AArch64, there are: GP regs, incremented GP reg (MOPS extension), extended GP reg (e.g., SXTB), shifted GP reg, stack pointer, FP reg, vector register, vector register element, vector register table, vector register table element, a dozen types of memory operands, conditions, and a dozen types of immediate encodings (including the fascinating and very useful, but also very non-trivial encoding of logical immediates [1]).
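To show why those logical immediates are non-trivial, here's a decoder sketch for them, following the DecodeBitMasks() pseudocode in the Arm ARM (my own simplified version, not the code from [1]):

  #include <stdbool.h>
  #include <stdint.h>

  /* Decode an AArch64 logical immediate from its (N, immr, imms) fields.
     Returns false for reserved encodings; regsize is 32 or 64. */
  static bool decode_logical_imm(unsigned n, unsigned immr, unsigned imms,
                                 unsigned regsize, uint64_t *out)
  {
      /* Element size comes from the highest set bit of N:~imms. */
      unsigned lenbits = (n << 6) | (~imms & 0x3f);
      if (lenbits == 0)
          return false;
      unsigned len = 31 - __builtin_clz(lenbits);
      unsigned esize = 1u << len;                 /* 2, 4, 8, 16, 32 or 64 */
      if (esize > regsize)
          return false;

      unsigned s = imms & (esize - 1);            /* number of ones, minus one */
      unsigned r = immr & (esize - 1);            /* rotate-right amount       */
      if (s == esize - 1)
          return false;                           /* all-ones element: reserved */

      uint64_t emask = (esize == 64) ? ~0ull : (1ull << esize) - 1;
      uint64_t elem = (1ull << (s + 1)) - 1;      /* s+1 consecutive ones      */
      if (r)                                      /* rotate within the element */
          elem = ((elem >> r) | (elem << (esize - r))) & emask;

      uint64_t result = 0;                        /* replicate across the register */
      for (unsigned i = 0; i < regsize; i += esize)
          result |= elem << i;
      *out = result;
      return true;
  }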

AArch64 also has some register constraints: some vector operations can only encode registers 0-15 or 0-7; not to mention SVE with its "movprfx" prefix instruction, which is only valid in front of a few selected instructions.

[1]: https://github.com/aengelke/disarm/blob/master/encode.c#L19-...

x0x0 · a year ago
I think English may be a better example; we just stapled chunks of Vulgar Latin to an inconsistently simplified Proto-Germanic and then borrowed words from every language we met along the way. Add in 44 sounds serialized to the page with 26 letters and tada!
jcranmer · a year ago
> I have never seen a messier architecture.

Itanium. Pretty much every time I open up the manual, I find a new thing that makes me go "what the hell were you guys thinking!?" without even trying to.

snazz · a year ago
What sorts of projects are you working on that use Itanium?
Arech · a year ago
Haha, man, I feel you :DD You probably should have started with it from the very beginning :D
lifthrasiir · a year ago
I recently implemented a good portion of an x86(-64) decoder for a side project [1] and was kinda surprised by how much more complicated it has become recently. Sandpile.org [2] was really useful for my purpose.

[1] Namely, a version of Fabian Giesen's disfilter for x86-64, for yet another side project which is still not public: https://gist.github.com/lifthrasiir/df47509caac2f065032ef72e...

[2] https://sandpile.org/

t_sea · a year ago
> Writing a CPU emulator is, in my opinion, the best way to REALLY understand how a CPU works.

The 68k disassembler we wrote in college was such a Neo “I know kung fu” moment for me. It was the missing link that let me reason about code from high-level language down to transistors and back. I can only imagine writing a full emulator is an order of magnitude more effective. Great article!

astrange · a year ago
I would say writing an ISA emulator is actually not helpful for understanding how a modern superscalar CPU works, because almost all of it is optimizations that are hidden from you.