ajenner · 9 years ago
It doesn't, because there are lots of special-purpose x86 instructions that would be more trouble than they're worth to teach a compiler about. For example, instructions for accessing particular CPU features that the C and C++ languages have no concept of (cryptographic acceleration and instructions used for OS kernel code spring to mind). Some of these the compiler might know about via intrinsic functions, but won't generate for pure C/C++ code.

Regarding the large number of LEA instructions in x86 code - this is actually a very useful instruction for doing several mathematical operations very compactly. You can multiply the value in a register by 1, 2, 4 or 8, add the value in another register (which can be the same one, yielding multiplication by 3, 5 or 9), add a constant value and place the result in a third register.
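
For example (a hedged sketch; the function name and exact output are illustrative, not guaranteed): given plain C arithmetic like this, compilers will often fold the whole expression into a single lea.

    /* Hypothetical example: at -O2, GCC and Clang commonly turn this into a
       single instruction along the lines of "lea eax, [rdi+rsi*4+10]"
       (register + register*4 + constant); the exact output depends on the
       compiler, target and ABI. */
    int scaled_sum(int a, int b)
    {
        return a + b * 4 + 10;
    }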

emily-c · 9 years ago
lea's addressing-mode arithmetic also doesn't wipe out any flags, so it can be extremely useful to slot in between an arithmetic operation whose resulting flags you care about and the instruction that consumes those flags, so as to obtain a greater degree of instruction-level parallelism.
api · 9 years ago
There are also, AFAIK, a few "deprecated" instructions that are implemented for backward compatibility but do not perform well on modern cores or have much better modern alternatives. These would be things like old MMX instructions, cruft left over from the 16-bit DOS days, etc.

X86 is crufty. Of course all old architectures are crufty, and using microcode it's probably possible to keep the cruft from taking up much silicon.

There are also instructions not universally available. A typical compiler will not emit these unless told to do so or unless they are coded in intrinsics or inline ASM.

rwmj · 9 years ago
And the counterpoint is that some instructions that are commonly believed to be slow (e.g. the string instructions like rep movs, lods, stos) are in fact fast on modern processors.
cptskippy · 9 years ago
Since the P6, Intel's CPUs have used a RISC-like core with a very heavy decoder that translates x86 CISC instructions to run on the internal ISA. With that in mind, do older or lesser-used instructions actually perform poorly, or are they just the wrong choice here but actually preferred in other scenarios?

Deleted Comment

nneonneo · 9 years ago
Good examples of (essentially deprecated) instructions include the rep-prefixed instructions for string operations (modern library code for string operations typically involves a mixture of SSE, full-word loads and unrolled loops for speed); the "loop" instruction (compilers usually generate explicit loops for flexibility); pretty much all the BCD arithmetic instructions (since programming languages don't typically use BCD anymore); and many other such examples. The x86 architecture is chock-full of rarely-used legacy instructions that are mostly carried around for compatibility these days.
nneonneo · 9 years ago
lea is also used frequently to actually load addresses. When passing the address of a stack buffer, for example, the compiler will usually generate code like "lea eax, [ebp-0x80]; push eax". Or, when loading the address of a struct member, you might have "lea ecx, [eax+8]".
praptak · 9 years ago
Yup, the original intention behind lea is the & operator. Compute the address but instead of accessing the memory immediately (like mov does) just keep it for later use. Just what & does.
martincmartin · 9 years ago
Loading an address is just adding a fixed number to the value in a register. Your two examples are mathematically equivalent to "eax = ebp - 0x80" and "ecx = eax + 8."
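
Putting those together, a small hedged illustration (the struct and function are made up; the codegen shown is typical, not guaranteed):

    /* Taking a field's address usually compiles to a single lea, e.g.
       something like "lea rax, [rdi+8]" on x86-64; no memory is touched,
       only the address arithmetic is performed. */
    struct point { long x; int y; };

    int *address_of_y(struct point *p)
    {
        return &p->y;
    }
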
jwr · 9 years ago
About LEA: adding to the above correct information, it is also useful because it can be scheduled in parallel with other ALU instructions (through a different "port" in x86 parlance), or at least that's how it used to be (I haven't looked at most recent x86 architectures). Thus compilers can use LEA to perform some arithmetic operations and generate code that will effectively run faster.
Tuna-Fish · 9 years ago
On recent Intel CPUs, LEA is generally executed with the same resources as other arithmetic.

LEAs touching 16-bit registers issue multiple micro-ops and are slow (the machine no longer has native 16-bit register support and has to mask the results in a separate op).

LEAs with 1 or 2 address components are issued to port 1 or 5 and have a latency of 1, and LEAs with 3 components or a RIP-relative address are issued to port 1 exclusively, and have a latency of 3.

In contrast, register-register/immediate adds and subs go to any of ports 0, 1, 5 or 6 and have a latency of 1, and register/immediate shifts go to ports 0 or 6 and have a latency of 1.

Converting operations to LEAs really doesn't buy as much today as it used to, but a smart programmer or a compiler can occasionally grab a cycle here or there.

Majster · 9 years ago
Could you provide some links or examples? I'd love to learn more about this.
DSMan195276 · 9 years ago
For kernel/OS functions, there are lots of things that can only be done while in 'kernel mode' with kernel privileges. These are things such as turning interrupts on and off, changing which page table is being used, changing the GDT/IDT, and other various things. C has no concept of these things, and they are very special purpose. The AES instructions are similar - C has no built-in AES functionality, and the compiler just can't figure out in which situations it could use them. All of these instructions can be accessed through inline assembly, so you can still use them in C, the compiler just won't understand what you're doing besides "It takes in these inputs, does something, and then puts out these outputs".
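
For instance, a minimal sketch (GCC-style inline assembly; kernel context assumed, since these instructions fault in user mode):

    /* Interrupt enable/disable has no C-level equivalent, so kernels wrap
       the privileged instructions in tiny inline-asm helpers like these. */
    static inline void interrupts_off(void)
    {
        asm volatile("cli" ::: "memory");
    }

    static inline void interrupts_on(void)
    {
        asm volatile("sti" ::: "memory");
    }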

As for `lea`, there's nothing particularly special about it. x86 allows you to do some 'funny' addressing within instructions that take memory addresses - where you can do several calculations on the address within the instruction itself, without having to use a temporary to hold the address. `lea` just lets you 'load' the result of that calculation back into a register, presumably to let you avoid making the CPU do the calculation a bunch of times in a row. But there's nothing about `lea` that requires you to actually use addresses, rather than just plain numbers you want to shift and add. It is used a lot because `lea` is a single instruction and will generally run faster than doing a shift and some additions over multiple instructions.

saynsedit · 9 years ago
Just read the output of your compiler for simple functions: objdump -d, or cc -S.

Deleted Comment

column · 9 years ago
is there any point in multiplying by 1?
noselasd · 9 years ago
It might be used to avoid a conditional.
webkike · 9 years ago
It doesn't, because when would a compiler issue syscall? Maybe if you count instructions emitted by inline assembly.
vardump · 9 years ago
Compilers do have intrinsics. No inline assembly needed for something that trivial.
jcranmer · 9 years ago
In general:

* x87 floating point is generally unused (if you have SSE2, which is guaranteed for x86-64)

* BCD/ASCII instructions

* BTC/BTS and related bit-test instructions. These are basically a & (1 << b) operations, but because of redundant uses, it's generally faster to do the regular operations (see the short sketch after this list)

* MMX instructions are obsoleted by SSE

* There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

* There are a few odd instructions that are basically no-ops (LFENCE, branch predictor hints)

* Several instructions are used in hand-written assembly, but won't be emitted by a compiler except perhaps by intrinsics. The AES/SHA1 instructions, system-level instructions, and several vector instructions fall into this category.

* Compilers usually target relatively old instruction sets, so while they can emit vector instructions for AVX or AVX2, most shipped binaries won't by default. When you see people list minimum processor versions, what they're really listing is which minimum instruction set is being targeted (largely boiling down to "do we require SSE, SSE2, SSE3, SSSE3, SSE4.1, or SSE4.2?").
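
Regarding the bit-test bullet above, a hedged sketch of the two forms (the function names are made up; whether bt ever wins depends on the microarchitecture):

    /* What compilers normally emit for "test bit b of a" (b assumed < 32):
       a shift and a mask. */
    int bit_is_set(unsigned int a, unsigned int b)
    {
        return (a >> b) & 1u;
    }

    /* The bt-style equivalent, forced via inline asm purely for
       illustration; for register operands it is rarely faster. */
    int bit_is_set_bt(unsigned int a, unsigned int b)
    {
        unsigned char c;
        asm("bt %2, %1\n\tsetc %0" : "=r"(c) : "r"(a), "r"(b) : "cc");
        return c;
    }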

As for how many x86 instructions there are: there are 981 unique mnemonics and 3,684 variants (per https://stefanheule.com/papers/pldi16-strata.pdf). Note that some mnemonics mask several instructions--mov is particularly bad about that. I don't know if those counts only go up to AVX2 or if they extend to the AVX-512 instruction set as well.

dmm · 9 years ago
> * There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

OpenBSD uses segments (while in protected mode!) to implement a line-in-the-sand W^X implementation on i386 systems that don't support anything better. The segment is set just high enough in a process's address space to cover the text and libraries but leave the heap and stack unexecutable.

This post mentions that implementation: http://www.tedunangst.com/flak/post/now-or-never-exec

userbinator · 9 years ago
VMware also uses(used?) segments to hide its hypervisor: http://www.pagetable.com/?p=25
diamondlovesyou · 9 years ago
> * There's some legacy cruft (e.g., segment management) that's generally unused by anyone not in 16-bit mode.

x86 NaCl uses segments for sandboxing.

dejawu · 9 years ago
So, I don't know a whole lot about the processor design/manufacturing process. If the fabricator (say, Intel) omitted many of these odd legacy and unused instructions, how would it affect the production process?

Would Intel be able to meet the smaller die sizes they're currently having trouble with? Would it make the processors any less expensive to produce at scale (all other things equal)?

barrkel · 9 years ago
I extended the Borland debugger's disassembler (as used by Delphi and C++ Builder IDEs) to x64, so I had professional reason to inspect the encodings. There are whole categories of instructions not used by most compilers, relating to virtualization, multiple versions of MMX and SSE (most are rarely output by compilers), security like DRM instructions (SMX feature aka Safer Mode), diagnostics, etc.

On LEA: LEA is often used to pack integer arithmetic into the ModRM and SIB bytes of the address encoding, rather than needing separate instructions to express a calculation. Using these, you can specify some multiplication factors, a couple of registers and a constant all in a single bit-packed encoding scheme. Whether or not it uses different integer units in the CPU is independent of the fact that it saves code cache size.

fizixer · 9 years ago
And therein lies the rub.

What is the minimum number of instructions a compiler could make use of to get everything done that it needs?

I came across an article that says 'mov is Turing complete' [1]. But they had to do some convoluted tricks to use mov for all purposes.

I think it's safe to say that about 5-7 instructions are all that's needed to perform all computation tasks.

But then:

- Why do compilers not strive to simplify their code-gen phase, or enable themselves to do advanced instruction-level program analysis, or both?

- Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Besides the completely valid reason that humans tend to overly-complicate their solutions, and then brag about it, the main reason is historical baggage and the need for backwards compatibility.

Intel started with a bad architecture design, and only made it worse decade after decade, by piling one bunch of instructions on top of another, and what we now have is a complete mess.

On the compiler front, the LLVM white-knights come along and tell people 'you guys are wimps for using C to do compilers. Real men use monsters like C++, with dragons like design-patterns. No one said compiler programming is supposed to be as simple as possible.'

To those lamenting javascript and the web being broken, wait till you lift the rug and get a peek at the innards of your computing platform and infrastructure!

[1] https://www.cl.cam.ac.uk/~sd601/papers/mov.pdf

aschampion · 9 years ago
We "over-complicate" ISAs for the same reason we're constantly adopting new vocabulary: there is efficiency in specialization. Good design is not about simplicity; it's about managed complexity.

> - Why do compilers not strive to simplify their code-gen phase, or enable themselves to do advanced instruction-level program analysis, or both?

Because specialized instructions formalize invariants and constraints on behavior that allow efficient computation, often by specialized hardware.

> - Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Some do, see GPUs and coprocessors like the Phi. We don't take this approach with CPUs because real problems often require complex, branching, inhomogeneous computation, which requires the type of specialization and tradeoffs mentioned above.

fizixer · 9 years ago
> We "over-complicate" ISAs for the same reason we're constantly adopting new vocabulary: there is efficiency in specialization.

You're making broad generalizations, ironically speaking. If there is anything the article of this thread suggests, it is that we have created a needlessly complex instruction set, and that "efficient specialization" is not valuable to the software 99.99% of the time.

> Good design is not about simplicity; it's about managed complexity.

Managed complexity is simplicity. And it's pretty clear today's compilers and today's microprocessor designs are anything but good managers of their complexity.

Tuna-Fish · 9 years ago
> - Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Because that would not make CPUs faster or cheaper.

If you think that modern CPUs are large because they implement a lot of instructions, you are completely wrong. The entire machinery needed for executing more than a couple of different instructions is less than 1% of the core. The space is not taken up by decoding tables or ALUs, it's taken up by forwarding networks, registers, and most of all, cache. And all of those are things very much required to make a CPU fast. CPUs have a lot of instructions precisely because the cost of implementing a new instruction is negligible compared to the size of the CPU.

gpderetta · 9 years ago
> Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

Modern CPU designers have such a large transistor budget that they need to get creative to make use of all of it, so specialized instructions are pretty much free.

And no, you can't just stuff 1024 cores in a processor; apart from the fact that most software wouldn't know what to make of it, such a monster might end up bottlenecked by intercommunication or memory bandwidth.

Also simple CPUs are just slow; you need a lot of machinery to perform well at high frequency/high memory latency.

fizixer · 9 years ago
> larger transistor budget

... because we (the chip designer) are okay with larger footprint per core.

> specialized instructions are pretty much free

... only after we have fixed the footprint per core. But if we're willing to vary that parameter, then the specialized instructions are not free.

Not to mention, the main article of this thread is strong evidence that those specialized instructions are almost never used!

As for your point about 1024 cores, the whole point I'm trying to make is that whatever software does today with 4 cores in a 4-core processor could be done by 4 cores in a 1024-core processor, because those 4 cores don't implement the instructions that are not needed. And that means you have 1020 free cores sitting in your microprocessor. You could only make your computations faster, or at the same speed in the worst case, in their presence, not slower.

> simple CPUs are just slow

I would like to see any source for this claim. The only reason I can think of is that complex CPUs implement some instructions that help speed things up. But as we can see in the original article of this thread, software is not making use of those instructions. So I don't see how a simple CPU (that picks the best 5-7 instructions that give Turing completeness, as well as the best performance) is any slower.

Note, by a simple CPU, I'm not advocating eliminating pipelines and caches, etc. All I'm saying is that once you optimize a CPU design and eliminate redundancy as well as the requirement of backward compatibility, you can get a much better performing CPU than what we have currently.

jcranmer · 9 years ago
Sure, there is a lot of historical baggage in microprocessors--the BCD stuff and x86-16 support in general only exist for backwards compatibility (although note that BIOS starts up in x86-16). But the reason that Intel keeps adding instructions is, well, because they're useful.

> - Why do microprocessors not strive for simplicity, implement only a handful of instructions in an optimized way, with a very small chip footprint, to be followed by proliferation of cores (think 256-core, 512-core, 1024-core).

What you're describing is a GPU--revert a core to a basic ALU and make it 2 dozen 16-wide SIMD ALUs. Not everything can work well on a GPU, though; you need very high parallelizability, and fairly uniform memory access models. It's why supercomputers switched from SIMD to MIMD models a few decades ago. (They're starting to revert to include more GPU-like cores, mostly for power reasons).

fizixer · 9 years ago
> What you're describing is a GPU

I would say I'm describing something halfway between a CPU and a GPU. It's not just an ALU, it's a complete microprocessor, with pipelining, caches, etc. The main difference is that the instruction set is optimized, backward compatibility is no longer a requirement, and redundancy of the architecture is eliminated.

ehntoo · 9 years ago
Convoluted tricks? Certainly. There's a functioning mov-only compiler [1], though. [1] https://github.com/xoreaxeaxeax/movfuscator
atgreen · 9 years ago
"What is the minimum number of instructions a compiler could make use of to get everything done that it needs?"

I didn't go for the absolute minimum, but I did aim for a useful and reasonable minimum with the ggx[1] ISA (now called moxie[2]), by defining the ISA incrementally and only adding instructions used by GCC.

The approach I took is described here: [1] https://github.com/atgreen/ggx

[2] http://moxielogic.org/

Animats · 9 years ago
The absolute minimum is one: "subtract, store, and branch if negative". This is not a useful way to design a CPU.
userbinator · 9 years ago
You're basically asking something akin to "why didn't MIPS or any other RISC become dominant?" (Yes, I know about ARM. Despite its name and claims, ARM is not really RISC anymore. It has grown instructions and specialised hardware, so it could remain competitive with x86, and they also use micro-op translation: http://techreport.com/review/28189/inside-arm-cortex-a72-mic...)

> I think it's safe to say that about 5-7 instructions are all that's needed to perform all computation tasks.

One instruction is needed to be Turing-complete. It's not very practical though, as you need many more simple instructions to do the work of a single complex one.

Consider memcpy(), one of my favourite examples. On a simple architecture like MIPS, the software has to explicitly perform each read and write along with updating the pointers, the address calculations, and the loop control. It also has to take into account alignment and when to use byte/word/dword accesses. This all requires instructions, which occupy precious cache space (and unrolling makes them take even more) and have to be fetched, decoded, and executed. It can only read and write e.g. 32 bits at a time, because that's the only size the architecture supports for individual load and store instructions. If a wider (e.g. 64 bit) revision appears, all existing memcpy() has to be rewritten to take advantage of it.

On the other hand, consider the CISC solution: REP MOVSB. A single two-byte instruction which the hardware decodes into whatever operations are most optimised for it. It handles updating the registers, the copy, and the loop using specialised hardware. It can transfer entire cache lines (64 bytes or more) per clock cycle. Software doesn't need to change to take advantage of e.g. a newer processor with a wider memory bus. It's tiny, so it uses next to no cache space, and once it's been fetched, decoded, and executing internally, the memory/cache buses are free for other purposes like transferring the data to be copied, or for the other cores to use. It's far easier for the CPU to decode a complex instruction internally into micro-ops and/or dispatch it to dedicated hardware than it is to try pattern-matching long sequences of simple instructions into a complex semantic unit for such hardware when it becomes available. It's hard enough for something as simple as memcpy(), never mind AES or SHA1.
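
For reference, a hedged sketch of what driving it from C with GCC-style inline assembly can look like (real code would normally just call memcpy and let the library pick the strategy):

    /* Copy n bytes with a single rep movsb. The instruction itself updates
       rdi, rsi and rcx, which is what the "+D"/"+S"/"+c" constraints say. */
    static void copy_rep_movsb(void *dst, const void *src, unsigned long n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }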

> - Why do compilers not strive to simplify their code-gen phase, or enable themselves to do advanced instruction-level program analysis, or both?

Compilers are complex because the CPUs they generate code for are also complex, and the CPUs are complex because of the reason above: this complexity is efficiency. A compiler for a simple CPU could be simple, but that just means the CPU is so simple that there is nothing to optimise at the software level; hardly an ideal situation. I think it wouldn't be too hard to get GCC to generate only the subset of x86 instructions that most closely resembles MIPS, and compare the resulting binaries for size and speed. It should then be obvious why more complex instructions are good.

Idontagree · 9 years ago
I'm no expert on processor architecture, but when I had to learn about it, I always wondered why this 3000-page manual (x86) is necessary - why is this thing constructed to be so complex when it's doing a lot of very simple things?
JoeAltmaier · 9 years ago
Intel's own optimizing C++ compiler uses more, or, well, different ones anyway. It's really amazing what it can do. It uses instructions I'd never heard of.
hajile · 9 years ago
Then disables any of them from running on AMD processors (and that's still the case. They were told by the courts to either place a warning or stop the practice, so they buried a vague warning in the paperwork).
nixos · 9 years ago
My question is whether compilers use "new" x86 instructions, since then the program won't work at all on old systems.

For example, if Intel decided today that CPUs need a new "fast" hashing opcode (I don't know if they actually do), a compiler can't compile to it, as programs won't work on older computers.

Is it like the API cruft in Android, where "new" Lollipop APIs are introduced now for use 10 years from now, when no one will be using any phones from before 2014?

veddan · 9 years ago
There are some methods to get around this. For example, there's an ELF extension called STT_GNU_IFUNC. It allows a symbol to be resolved at load time using a custom resolver function. This avoids the problem of figuring out which code-path to use on every invocation.

For example, you could have a function

    void hash(char *out, const char *in);
with two different possible implementations: a slow one using common instructions, and a fast one using exotic instructions. You can then have a resolver like this:

    void fast_hash(char *out, const char *in);
    void slow_hash(char *out, const char *in);

    void (*resolve_hasher(void))(char *, const char *)
    {
        if (cpuSupportsFancyInstructions()) {
            return &fast_hash;
        } else {
            return &slow_hash;
        }
    }
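
And to complete the sketch (assuming GCC's ifunc attribute, one common way to hook into STT_GNU_IFUNC), the resolver gets attached to the public symbol roughly like this:

    /* The dynamic loader runs resolve_hasher() once, when the symbol is
       resolved, and binds "hash" to whichever pointer it returns. */
    void hash(char *out, const char *in)
        __attribute__((ifunc("resolve_hasher")));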

mschuster91 · 9 years ago
I'm a bit skeptical about the performance, especially with often-called functions.

Normally, asm would do

    call slow_hash
at every place where slow_hash is invoked, but now it has to go through a pointer holding the address of the function at every invocation.

Of course the loader could walk through all uses of the pointer to slow_hash and replace them by fast_hash on loading, but that won't work for self-modifying (packed, or RE-protected) code.

pjmlp · 9 years ago
JIT compilers are able to take advantage of them, because you don't get a binary set in stone that has to run everywhere.

This is the main reason why Apple is now pushing for LLVM bitcode, Android still uses dex even when AOT compiling and WP uses MDIL with AOT compilation at the store.

So regardless of what an OEM decides for their mobile device, in theory, it is possible to make the best use of the chosen CPU.

This is actually quite common on mainframes, with the AS/400 (now IBM i) being one of the best known examples.

wahern · 9 years ago
The AS/400 is more like an AOT than a JIT compiler. When I hear JIT I think opportunistically compiling portions of a program, but falling back to an interpreter.

The way AS/400 works, IIUC, is that the compiler compiles to an intermediate byte code, which has remained stable for decades. When the program is first loaded, the entire program is compiled to the native architecture, cached, and then executed like any other binary.

The reason why JIT environments aren't competitive generally with AOT compiling is because of all the instrumentation necessary. A JIT environment is usually composed of an interpreter[1] which, using various heuristics, decides to compile segments of code to native code. But the logic for deciding what segments to compile, when to reuse a cached chunk, etc is complex, especially the pieces that keep track of data dependencies. Also, each chunk requires instrumentation code for transitioning from and back into the interpreter.[2] For this and other reasons JIT'd environments aren't competitive with AOT environments except for simple programs or programs with very high code and data locality (i.e. spending most of the time inside a single compiled chunk, such as a small loop).

Large or complex programs don't JIT very well. Even if the vast majority of the execution time is spent within a very small portion of the program, if data dependencies or execution paths are complex (which is usually the case) all the work spent managing the JIT'd segments can quickly add up. Programs that would JIT well also tend to be programs that vectorize well, and if they vectorize well AOT compilers also benefit and so JIT compilers are still playing catch-up. (One benefit (albeit only short term) for JIT compilers is that some performance optimizations are easier to add to JIT compilers because you don't have to worry about ABIs and other baggage; you iterate the compiler implementation faster.)

I increasingly hear the term JIT used in the context of GPU programming, where programs are generated and compiled dynamically for execution on SIMD cores. But that's much more like AOT compilation. The implementation stack and code generation rules are basically identical to a traditional AOT compiler and very little like the JIT environments for Java or JavaScript. The only similarity is that compilation happens at run-time, but you can analogize that with, for example, dynamically generating, compiling, and invoking C code. Which, actually, isn't uncommon. It's how Perl's Inline::C works, and how TinyCC is often used.

[1] Mike Pall has said that the secret to a fast JIT environment is a fast interpreter.

[2] So, for example, calling into a module using the Lua C API is faster in PUC Lua than LuaJIT. LuaJIT has to spill more complex state, whereas the PUC Lua interpreter is just calling a C function pointer--the spilling is much simpler and has already been AOT compiled.

mikeash · 9 years ago
In addition to what others have said, sometimes stuff gets added in a backwards compatible way. For example, Intel added two new instruction prefixes for their hardware lock elision extension. Actually, they reused existing instruction prefixes which were valid but did nothing on older CPUs when used on the instructions where the HLE prefixes apply. The semantics are such that "do nothing" is valid, but CPUs which understand the new prefix can do better.

Another alternative is to use the new instructions regardless, then trap illegal instructions and emulate them. This used to be a common way to handle floating point instructions, back in the days when FPU hardware wasn't universal. CPUs with FPUs would run the instructions natively, and CPUs without them would emulate them in software. This was, of course, unbelievably slow, but it worked.

But for the most part, you just generate different code for different CPU capabilities and dispatch, as the other comments describe.

peteri · 9 years ago
Certainly the Borland 8087 emulation code used to have an interrupt call followed by the 8087 opcodes; the interrupt call got replaced at run time by NOPs if there was a co-processor present.
brainfire · 9 years ago
The PC release of No Man's Sky apparently was built to use SSE4.1 instructions, leading to unexpected problems for users with older processors that would otherwise have been fine.

We have in the past received different binaries from vendors to use depending on which processors we're using.

Animats · 9 years ago
That was the curse of MIPS. Code was sort of portable, but the optimal code for each CPU implementation was different. Programs would ship with multiple binaries compiled with different options.
ajuc · 9 years ago
You can already do that. In gcc you can pass arguments like -march=pentium3, and the code will use features of that CPU and won't run on a Pentium 2, etc.

Gentoo Linux uses this to slightly optimize binaries - because every Gentoo user specifies these flags systemwide, customizes them for his or her CPU, and the whole system is recompiled taking them into account. The performance boost is negligible in my experience (but I used Gentoo 10 years ago, maybe it has changed).

roel_v · 9 years ago
Compilers have flags to use e.g. AVX instructions. If you run such a binary on a CPU that doesn't support them, your program will simply crash. It's possible but cumbersome to detect CPU features and take different code paths based on that.
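
A minimal sketch of such a dispatch, assuming GCC/Clang's __builtin_cpu_supports; the two sum_* implementations are hypothetical (the AVX2 one would be built with -mavx2):

    void sum_avx2(const float *a, const float *b, float *out, int n);
    void sum_generic(const float *a, const float *b, float *out, int n);

    void sum_dispatch(const float *a, const float *b, float *out, int n)
    {
        /* Checked on every call here for brevity; real code usually caches
           the result or resolves the choice once at startup. */
        if (__builtin_cpu_supports("avx2"))
            sum_avx2(a, b, out, n);
        else
            sum_generic(a, b, out, n);
    }
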
rasz_pl · 9 years ago
Yes, usually it happens because someone is not experienced enough. A good example is No Man's Sky crashing on Phenom CPUs because they hard-linked Havok compiled with SSE4.1 and _do not_ detect the CPU type.
CyberDildonics · 9 years ago
Windows 8 won't work on some of the early AMD 64-bit processors because they don't support 128-bit atomic compare-and-swap.
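
For the curious, a hedged sketch of what that operation looks like from C with the GCC/Clang atomic builtins; whether it inlines to lock cmpxchg16b or falls back to a libatomic call depends on the compiler version and the -mcx16 flag:

    #include <stdbool.h>

    /* 128-bit compare-and-swap; CPUs without cmpxchg16b have no single
       instruction that can do this. */
    bool cas128(unsigned __int128 *p, unsigned __int128 *expected,
                unsigned __int128 desired)
    {
        return __atomic_compare_exchange_n(p, expected, desired, false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }
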
SixSigma · 9 years ago
Look forward to Win7 not working on the next gen of Intel processors, at MS's request.
krylon · 9 years ago
Why would Microsoft want to stop people running their system on newer CPUs?
Rebelgecko · 9 years ago
IIRC, when a new version of OSX stops working on older hardware, it's usually because Apple started using a "new" version of SSE that's not supported on a given CPU.

For handcoded assembly, it's not uncommon for code to check a CPU's capabilities at runtime (e.g. OpenSSL checks at runtime whether it can use the really fast AES-NI instructions; if not, it falls back to other assembly or C implementations). That way you don't need different binaries tuned for every possible combination of available instructions. I don't think any compilers do this automatically, but most of the time generating assembly for the lowest common denominator is fast enough.

revelation · 9 years ago
The Intel compiler (used to?) do runtime CPUID checks. If your CPU came back "GenuineIntel" (and only Intel), it took the fast SSE path.

There's more here:

http://www.agner.org/optimize/blog/read.php?i=49#49

shanemhansen · 9 years ago
There are instructions that would almost never be useful. See Linus's rant on cmov http://yarchive.net/comp/linux/cmov.html

The tl;dr is that it would only be useful if you are trying to optimize the size of a binary.

gpderetta · 9 years ago
Linus complained because CMOV replaces a dependency-breaking conditional jump with a high-latency instruction that preserves dependencies. With the Pentium 4s of that era it was often a net loss, as CMOV was particularly high latency (~6 clocks, low throughput), but today it is actually quite good (1 clock, high throughput on Skylake), and Linus is OK with it.
dbcurtis · 9 years ago
I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it is because the hardware has very good branch prediction, and the compiler has intimate knowledge of how the branch prediction hardware works.

Then the compiler works hard on determining whether branches are highly predictable. Is the branch part of closing a loop? Predict that you will stay in the loop. Is the branch checking for an exception condition? Predict that the exception is rare. OTOH, there are some branches that are, in fact, "flakey" on a dynamic execution basis. Deciding where to push a particular particle of data based on its value inside an innermost processing loop, for instance.

So... the compiler identifies "flakey" branches, it emits code to compute both branches of the if, and CMOVs the desired result at the end. That allows the instruction issue pipeline to avoid seeing a branch at all, thus avoiding polluting the branch cache with a flakey branch, and avoiding a whole bunch of pipeline flushes in the back end. At the cost of using extra back-end resources on throw-away work.

CMOV is in X86 for a reason. On Pentium Pro and later, it is a win if your compiler has good branch analysis.
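
A hedged example of the kind of code if-conversion targets; with optimization on, GCC and Clang will often (not always) compile the ternary to a cmov instead of a conditional jump:

    /* Typically becomes cmp + cmov at -O2, so there is no branch to
       predict; whether that beats a branch depends on how predictable the
       condition actually is. */
    int select_max(int a, int b)
    {
        return (a > b) ? a : b;
    }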

CountSessine · 9 years ago
That's an interesting perspective. I always thought that cmov was more important if the CPU had really bad branch prediction - the MIPS CPUs that were in the PS2 and PSP had famously bad branch prediction; there were all sorts of interesting cases where we could manually recode using cmov to avoid mispredictions in hot-path code.

But I guess if the compiler knew enough about the branch prediction strategy employed by the CPU, it could do this optimization itself when prediction wouldn't work right, even on a CPU with good prediction.

astrange · 9 years ago
> I didn't read Linus's rant on CMOV, but whenever you see a CPU with CMOV, it is because the hardware has very good branch prediction, and the compiler has intimate knowledge of how the branch prediction hardware works.

gcc and clang really have no idea how x86 branch predictors work, and they're secret so nobody is going to contribute a model for them. I haven't read the if-conversion pass but it's just some general heuristics.

There also isn't anything guessing whether a specific branch is mispredictable or not; it's more like it converts anything that looks "math-like" rather than "control-flow-like" to cmov.

caf · 9 years ago
You don't even need perfect, "insider" branch analysis if you can do profile-guided optimisation using actual branch performance counters in the profile.
chkras · 9 years ago
Linus rants about most everything; CMOV helped avoid branch prediction issues, etc.
Nelson69 · 9 years ago
There are also very useful instructions that don't have semantics for most programming languages and have little or no use outside of booting a system. Like LGDT and some of the tricks to switch from real mode to protected mode.

What's maybe more curious, some instructions in certain modes aren't even implemented by some assemblers. I don't think it's entirely unusual to see hand crafted opcodes in an OS boot sequence. When the assembler doesn't do it, the compiler certainly isn't going to.
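
For illustration, a hedged sketch of the kind of C wrapper an OS might use around lgdt (the struct name is made up; the layout is the standard pseudo-descriptor):

    /* Pseudo-descriptor handed to lgdt: a 16-bit limit followed by the
       base address. Only meaningful in kernel code during early setup. */
    struct __attribute__((packed)) gdt_ptr {
        unsigned short limit;
        unsigned long  base;   /* 32-bit in protected mode, 64-bit in long mode */
    };

    static inline void load_gdt(const struct gdt_ptr *g)
    {
        asm volatile("lgdt %0" : : "m"(*g));
    }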

wscott · 9 years ago
It is _very_ useful when hand-optimizing loops in assembly. But for the compiler, the newer processors have such incredible branch predictors that it is usually a bad idea for the compiler to assume a branch is poorly predicted.

Now if you are using profile guided optimizations (--prof-gen/prof-use) AND you use the processor's performance counters as part of the feedback to the compiler _then_ I could see the compiler correctly using this instruction.

snaky · 9 years ago
I have mixed feelings about that incredible progress in modern CPUs' branch prediction. Don't they just use idle execution units and run both branches in parallel behind the scenes? It looks great on microbenchmarks, where there's a bunch of idle execution units to use and the memory bus and caches are underused. It may not perform so well under real load, when there are no idle execution units available to use for free and the memory bus and cache lines are choking on load.

Is it somewhat similar to these two examples?

> If it owns that line then CAS is extremely fast - core doesn't need to notify other cores to do that operation. If core doesn't own it, the situation is very different - core has to send request to fetch cache line in exclusive mode and such request requires communication with all other cores. Such negotiation is not fast, but on Ivy Bridge it is much faster than on Nehalem. And because it is faster on Ivy Bride, core has less time to perform a set of fast local CAS operations while it owns cacheline, therefore total throughput is less. I suppose, a very good lesson learned here - microbenchmarking can be very tricky and not easy to do properly. Also results can be easily interpreted in a wrong way

http://stas-blogspot.blogspot.com/2013/02/evil-of-microbench...

> On current processors, POPCNT runs on a single execution port and counts the bits in 4B per cycle. The AVX2 implementation requires more instructions, but spreads the work out across more ports. My fastest unrolled version takes 2.5 cycles per 32B vector, or .078 cycles/byte (2.5/32). This is 1.6x faster (4 cycles per 32B /2.5 cycles per 32B) than scalar popcnt(). Whether this is worth doing depends on the rest of the workload. If the rest of the work is being done on scalar 64-bit registers, those other scalar operations can often fit between the popcnts() "for free", and the cost of moving data back and forth between vector and scalar registers usually overwhelms the advantage.

https://news.ycombinator.com/item?id=11279047

35bge57dtjku · 9 years ago
> Note that the x86 was originally designed as a Pascal machine, which is why there are instructions to support nested functions (enter, leave), the pascal calling convention in which the callee pops a known number of arguments from the stack (ret K), bounds checking (bound), and so on. Many of these operations are now obsolete.

http://stackoverflow.com/questions/26323215/do-any-languages...

chkras · 9 years ago
Except Windows still uses stdcall, which is Pascal-style return with C-style parameter ordering.