One fun thing to note is that the author doesn’t really cover anything from this decade, which makes sense considering how much complexity was being added for ever-smaller improvements towards the end. Modern branch predictors are quite complicated so they can eke out accuracy improvements of fractions of a percent.
> "I wonder if we can make branch predictors even more accurate,” and the next day you’d start XOR’ing the branch’s PC address with a shift register containing the branch’s recent branching history, because in those days, you could XOR anything with anything and get something useful, and you test the new branch predictor, and now you’re up to 96% accuracy, ...
> When you retire in 2003, your face is wrinkled from all of the smiles, and even though you’ve been sued by several pedestrians who suddenly acquired rare paintings as hats, you go out on top, the master of your domain. You look at your son John, who just joined Intel, and you rest well at night, knowing that he can look forward to a pliant universe and an easy life.
> Unfortunately for John, the branches made a pact with Satan and quantum mechanics during a midnight screening of “Weekend at Bernie’s II.”
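The XOR trick in the quoted passage is essentially the classic gshare scheme: a global history register hashed with the branch PC indexes a table of 2-bit saturating counters. A minimal sketch in C (table size and history length are assumed for illustration, not taken from any particular chip):

    /* gshare-style predictor sketch: index = PC XOR global history,
     * assumed 4K-entry table of 2-bit saturating counters. */
    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_BITS 12
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t  counters[TABLE_SIZE]; /* 2-bit counters, values 0..3 */
    static uint32_t global_history;       /* recent taken/not-taken bits */

    static uint32_t index_of(uint32_t pc)
    {
        return (pc ^ global_history) & (TABLE_SIZE - 1);
    }

    bool predict(uint32_t pc)
    {
        return counters[index_of(pc)] >= 2; /* weakly or strongly taken */
    }

    void update(uint32_t pc, bool taken)
    {
        uint8_t *c = &counters[index_of(pc)];
        if (taken && *c < 3) (*c)++;   /* saturate upward   */
        if (!taken && *c > 0) (*c)--;  /* saturate downward */
        global_history = ((global_history << 1) | taken) & (TABLE_SIZE - 1);
    }

The whole predictor is a few kilobits of state and a table lookup per branch, which is why this kind of thing could be bolted on so cheaply back then.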
In some/many cases there is code before the branch that does not influence the branch condition. In that case, would it be possible to architect the CPU such that it could execute the branch 'early'? I'm thinking of a special asm instruction like 'execute the next 4 instructions, and only then do the conditional jump'.
Yes, that was made explicit in the MIPS instruction set with branch (and load) delay slots, and it's implicit in out-of-order processors. As I understand it the branch delay slot paradigm did not pan out as well as was hoped and it has not found its way into many other architectures.
The issue with the branch delay slot is that it assumes there is exactly one clock cycle (i.e. one stage) between fetch and execute. This was true on some of the early 5-stage in-order RISCs, but hasn't been true for a while. In a high-frequency in-order design there are maybe a dozen stages/clock cycles, which would be hard to fill with delay slots. OoO is even more complicated.
That’s called delay slots. It turns out to be so hard to work out what to put in them that they’re basically useless. Look at the massive flop that was the Itanium before you try it again!
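For what it's worth, the idea in the question has a straightforward software-level analogue, which is roughly what a compiler filling delay slots on an in-order machine tries to do (and what an out-of-order core does in hardware regardless of source order). A purely illustrative sketch with made-up names:

    /* Illustrative only: resolve the branch condition first, then do work
     * that doesn't depend on it, so the branch is already decided by the
     * time it is reached. */
    int sum_with_flag(const int *data, int n, int threshold)
    {
        int total = 0;
        for (int i = 0; i < n; i++) {
            int big = (data[i] > threshold); /* condition resolved early     */
            total += data[i];                /* independent of the condition */
            if (big)                         /* branch comes after the work  */
                total++;
        }
        return total;
    }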
ARMv8-M has a set of extensions for this kind of branch. There is a set of prepare-to-branch instructions as well as counted loop instructions. In order to support a variable-length delay slot, they don't have very many options. In order to handle interrupts, you still need an ordinary branch instruction from the launch point as well. So if you take an interrupt in the middle of the sequence, the ordinary branch instruction still gets executed.
Really great article, love it, super clear and concise. These kinds of mechanisms are super interesting to learn more about. I'm curious about the new AMD CPUs which use a neural network to predict branches: how that fits into the performance picture and what kind of accuracy it peaks at.
Ha, ha. :-) A lot of us are in the same boat, except that few are willing to acknowledge it openly. That said, I can suggest the following for further reading (from simple to advanced):
Programming, program execution and PC architecture, even though very closely related, don't all have to be understood in depth; it depends on your goals. It's super interesting to learn these kinds of things if you are a programmer, though: it helps a lot, especially in low-level code, to improve performance and to have a better sense of what the CPU and memory are doing as a result of your code.
Regarding fixed precision neural networks, it looks like Intel had some fairly mature work on this at ISCA -- fast inference, but it has to be trained elsewhere in the machine:
This is a good primer for understanding the Spectre[0] issue. In a nutshell, Spectre is caused by the CPU keeping the branches in an unencrypted cache that is accessible from programs other than the one that generated the branch.
Their fix was to add an extra kernel check for unauthorized access, which is why it causes a huge performance hit. The real fix will be in the next generation of chips, which will either encrypt or at the least protect the branch cache from unauthorized access from other programs.
> In a nutshell, Spectre is caused by the CPU keeping the branches in an unencrypted cache that is accessible from programs other than the one that generated the branch.
I would instead say that the branch prediction has observable side effects even when it’s thrown away; say, for example, the branch performs a memory access: it may leave things in the cache even if the processor doesn’t take the branch, and hence it may be possible to recover information by running a timing attack on how long it takes to access certain things.
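A minimal sketch of that timing observation (assuming x86 and GCC/Clang intrinsics; this is not the actual Spectre exploit, just the "was this line cached?" measurement the attack builds on):

    /* Time a single load to tell a cache hit from a miss. In a Spectre-style
     * attack, speculative execution leaves a victim-dependent line in the
     * cache, and this measurement is what makes that visible. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static uint64_t time_access(volatile uint8_t *p)
    {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;                            /* the load being timed */
        return __rdtscp(&aux) - start;
    }

    int main(void)
    {
        static uint8_t probe[4096];
        _mm_clflush(probe);                  /* evict the line       */
        _mm_mfence();
        uint64_t cold = time_access(probe);  /* miss: slow           */
        uint64_t warm = time_access(probe);  /* now cached: fast     */
        printf("cold=%llu warm=%llu cycles\n",
               (unsigned long long)cold, (unsigned long long)warm);
        return 0;
    }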
I feel like fixed precision neural nets are probably the future of branch prediction: both 'hardcoding' a neural net trained on a large amount of software and baking the weights into the CPU in a ROM, and dynamic approaches to deciding which prediction strategy to use for a specific branch based on state recorded by another neural net.
Approaches like this have been used as a fun exercise, but now I think we're reaching the point where single thread performance is so important, and mispredicts are getting so expensive, that it's worth taking quite a big power and area hit to get a few more percentage points of performance...
> but now I think we're reaching the point where single thread performance is so important
Ever since Dennard scaling stopped, the biggest boosts to performance have been in increasing the parallelization opportunity, both at the SIMD level and at the multi-core level. Admittedly, working in HPC gives me a biased view, but everyone I see has resigned themselves to processors ceasing to give meaningful boosts to single-threaded performance.
Moreover, the ceiling you could get in boosting single-threaded performance with a perfect branch predictor (for conditional branches) over current state-of-the-art is around 4%. There's just not a lot of upside to be had, and even just using the extra space for another core looks preferable at such a low ceiling. Extra area for the branch predictor is likely to go to better indirect branch prediction (including virtual function calls), which are increasingly important in modern workloads, and where there is a much larger headroom for performance improvement.
I'll also add that the effectiveness of neural nets in modern machine learning contexts has come from making them big and deep, and deep means adding latency to the branch predictor, which is not what you want (especially because you want to know ASAP if you need to prefetch a different I-cache line).
Also, branch prediction circuitry has costs: higher energy per instruction on average, management of the design complexity and the risks associated with it (e.g. Spectre, Meltdown), increased die size and therefore a higher faulty-die rate and die cost, etc.
Thing is, not everything is amenable to SIMD or multithreaded optimization. As you said yourself, a lot of HPC code tends to parallelize more easily than average. For a lot of code, out-of-order execution is the only way to extract some amount of parallelism, and one of the biggest limiting factors on increasing the reorder window is the branch prediction rate. Even an improvement of 0.1% in the prediction rate can lead to non-trivial improvements in performance.
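To put rough numbers on that (assumed figures, just for scale): if ~20% of instructions are branches, the average distance between mispredictions is 1 / (branch fraction x miss rate), so even a 0.1% accuracy gain visibly stretches the window the core can usefully speculate across:

    /* Back-of-envelope: instructions between mispredictions for two
     * accuracy levels, using an assumed 20% branch density. */
    #include <stdio.h>

    int main(void)
    {
        double branch_fraction = 0.20;          /* assumed        */
        double accuracies[] = { 0.990, 0.991 }; /* +0.1% accuracy */
        for (int i = 0; i < 2; i++) {
            double miss_rate = 1.0 - accuracies[i];
            printf("accuracy %.3f -> ~%.0f instructions between mispredicts\n",
                   accuracies[i], 1.0 / (branch_fraction * miss_rate));
        }
        return 0;
    }

That works out to roughly 500 vs roughly 555 instructions between pipeline flushes, which matters when the reorder window is itself hundreds of instructions deep.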
> Admittedly, working in HPC gives me a biased view, but everyone I see has resigned themselves to processors ceasing to give meaningful boosts to single-threaded performance.
There's also the realization that compiler & software development is still far, far behind the parallelization capabilities of the HW. See the failed DEC Alpha and Itanium. It's a bad design situation when compilers regularly generate code that blows caches, branch predictors, TLBs and instruction pipelines more than 50% of the time, even when the code has been run through aggressive (non-PGO) optimization effort. Compilers need to get better at generating better code.
> Moreover, the ceiling you could get in boosting single-threaded performance with a perfect branch predictor (for conditional branches) over current state-of-the-art is around 4%.
No, actually, imperfect branch prediction is still a major blocker on performance even now, since mispredictions limit the useful size of the out-of-order reorder buffer.
There has been some work in making cores wider with larger re-order depth and more physical register files, etc. But all of that is deeply into diminishing returns territory.
Future?
Most modern CPUs do this to some extent. Remember you can only have a very simple model here, because the inference time has to be extremely fast.
> "I wonder if we can make branch predictors even more accurate,” and the next day you’d start XOR’ing the branch’s PC address with a shift register containing the branch’s recent branching history, because in those days, you could XOR anything with anything and get something useful, and you test the new branch predictor, and now you’re up to 96% accuracy, ...
> When you retire in 2003, your face is wrinkled from all of the smiles, and even though you’ve been sued by several pedestrians who suddenly acquired rare paintings as hats, you go out on top, the master of your domain. You look at your son John, who just joined Intel, and you rest well at night, knowing that he can look forward to a pliant universe and an easy life. Unfortunately for John, the branches made a pact with Satan and quantum mechanics during a midnight screening of “Weekend at Bernie’s II.”
E.g. instead of:

    ...
    branch <jump-target>

we would have:

    prepare-to-branch <jump-target> <launchpoint-label>
    ...
    launch-label: branch <jump-target>
The utility is pretty limited, but it can help for strictly in-order machines.
https://chasethedevil.github.io/post/the_neural_network_in_y...
It talks about “Dynamic Branch Prediction with Perceptrons” (Jimenez 2001) which sparked the whole thing.
The AMD engineers read this paper (and probably many more!), put together a development program, and shipped it.
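For anyone curious what a perceptron predictor actually looks like, here is a rough sketch of the scheme from the Jimenez paper; the table size, history length, and training threshold below are assumed for illustration, not the paper's tuned values (and real hardware does the dot product with adder trees, not a loop):

    /* Perceptron predictor sketch: one small weight vector per branch,
     * prediction = sign of (bias + sum of weight[i] * history[i]). */
    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_LEN   16
    #define NUM_PERCEP 1024
    #define THRESHOLD  37      /* train when wrong or not confident enough */
    #define W_MAX      127
    #define W_MIN     (-128)

    static int8_t weights[NUM_PERCEP][HIST_LEN + 1]; /* [0] is the bias  */
    static int8_t history[HIST_LEN];                 /* +1 taken, -1 not */

    static int dot(uint32_t pc)
    {
        int8_t *w = weights[pc % NUM_PERCEP];
        int sum = w[0];
        for (int i = 0; i < HIST_LEN; i++)
            sum += w[i + 1] * history[i];
        return sum;
    }

    bool predict(uint32_t pc)
    {
        return dot(pc) >= 0;               /* non-negative => predict taken */
    }

    void update(uint32_t pc, bool taken)
    {
        int sum = dot(pc);
        int t = taken ? 1 : -1;
        int mispredicted = ((sum >= 0) != taken);

        if (mispredicted || (sum < 0 ? -sum : sum) <= THRESHOLD) {
            int8_t *w = weights[pc % NUM_PERCEP];
            if (w[0] + t >= W_MIN && w[0] + t <= W_MAX)
                w[0] = (int8_t)(w[0] + t);
            for (int i = 0; i < HIST_LEN; i++) {
                int nw = w[i + 1] + t * history[i];
                if (nw >= W_MIN && nw <= W_MAX)
                    w[i + 1] = (int8_t)nw;
            }
        }
        /* shift the outcome into the global history */
        for (int i = HIST_LEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = (int8_t)t;
    }

The appealing property, and the reason it competed with two-level schemes, is that the hardware cost grows linearly with history length rather than exponentially, so much longer histories become affordable.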
a) Eben Upton's write-up on "Raspberry Pi and Spectre/Meltdown" gives a very nice overview of the main features of modern processors - https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulne...
b) Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture by Jon Stokes.
c) Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.
d) Modern Processor Design: Fundamentals of Superscalar Processors by Shen and Lipasti.
https://arxiv.org/abs/1906.09889
[0] https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...