One fun thing to note is that the author doesn’t really cover anything from this decade, which makes sense considering how much complexity was being added for ever-smaller improvements towards the end. Modern branch predictors are quite complicated so they can eke out accuracy improvements of fractions of a percent.
> "I wonder if we can make branch predictors even more accurate,” and the next day you’d start XOR’ing the branch’s PC address with a shift register containing the branch’s recent branching history, because in those days, you could XOR anything with anything and get something useful, and you test the new branch predictor, and now you’re up to 96% accuracy, ...
> When you retire in 2003, your face is wrinkled from all of the smiles, and even though you’ve been sued by several pedestrians who suddenly acquired rare paintings as hats, you go out on top, the master of your domain. You look at your son John, who just joined Intel, and you rest well at night, knowing that he can look forward to a pliant universe and an easy life.
> Unfortunately for John, the branches made a pact with Satan and quantum mechanics during a midnight screening of “Weekend at Bernie’s II.”
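The XOR trick in the quoted passage is essentially the classic gshare scheme: a global history register hashed with the branch PC indexes a table of 2-bit saturating counters. A minimal sketch in C (table size and history length are assumed for illustration, not taken from any particular chip):

    /* gshare-style predictor sketch: index = PC XOR global history,
     * assumed 4K-entry table of 2-bit saturating counters. */
    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_BITS 12
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t  counters[TABLE_SIZE]; /* 2-bit counters, values 0..3 */
    static uint32_t global_history;       /* recent taken/not-taken bits */

    static uint32_t index_of(uint32_t pc)
    {
        return (pc ^ global_history) & (TABLE_SIZE - 1);
    }

    bool predict(uint32_t pc)
    {
        return counters[index_of(pc)] >= 2; /* weakly or strongly taken */
    }

    void update(uint32_t pc, bool taken)
    {
        uint8_t *c = &counters[index_of(pc)];
        if (taken && *c < 3) (*c)++;   /* saturate upward   */
        if (!taken && *c > 0) (*c)--;  /* saturate downward */
        global_history = ((global_history << 1) | taken) & (TABLE_SIZE - 1);
    }

The whole predictor is a few kilobits of state and a table lookup per branch, which is why this kind of thing could be bolted on so cheaply back then.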
In some/many cases there is code before the branch that does not influence the branch condition. In that case, would it be possible to architect the CPU such that it could execute the branch 'early'? I'm thinking of a special asm instruction like 'execute the next 4 instructions, and only then do the conditional jump'.
Yes, that was made explicit in the MIPS instruction set with branch (and load) delay slots, and it's implicit in out-of-order processors. As I understand it the branch delay slot paradigm did not pan out as well as was hoped and it has not found its way into many other architectures.
The issue with the branch delay slot is that it assumes there is exactly one clock cycle (i.e. one stage) between fetch and execute. This was true on some of the early 5-stage in-order RISCs, but hasn't been true for a while. In a high-frequency in-order design there are maybe a dozen stages/clock cycles, which would be hard to fill with delay slots. OoO is even more complicated.
That’s called delay slots. It turns out to be so hard to work out what to put in them that they’re basically useless. Look at the massive flop that was the Itanium before you try it again!
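For what it's worth, the idea in the question has a straightforward software-level analogue, which is roughly what a compiler filling delay slots on an in-order machine tries to do (and what an out-of-order core does in hardware regardless of source order). A purely illustrative sketch with made-up names:

    /* Illustrative only: resolve the branch condition first, then do work
     * that doesn't depend on it, so the branch is already decided by the
     * time it is reached. */
    int sum_with_flag(const int *data, int n, int threshold)
    {
        int total = 0;
        for (int i = 0; i < n; i++) {
            int big = (data[i] > threshold); /* condition resolved early     */
            total += data[i];                /* independent of the condition */
            if (big)                         /* branch comes after the work  */
                total++;
        }
        return total;
    }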
ARMv8-M has a set of extensions for this kind of branch. There is a set of prepare-to-branch instructions as well as counted loop instructions. In order to support a variable-length delay slot, they don't have very many options. In order to handle interrupts, you still need an ordinary branch instruction from the launch point as well. So if you take an interrupt in the middle of the sequence, the ordinary branch instruction still gets executed.
Really great article, love it, super clear and concise. These kinds of mechanisms are super interesting to learn more about. I'm curious about the new AMD CPUs which use a neural network to predict branches: how that fits into the performance picture and what kind of accuracy it peaks at.
Ha, ha. :-) A lot of us are in the same boat, except that few are willing to acknowledge it openly. That said, I can suggest the following for further reading (from simple to advanced):
Programming, program execution and PC architecture, even though very closely related, don't all have to be understood in depth; it depends on your goals. It's super interesting to learn these kinds of things if you are a programmer, though: it helps a lot, especially in low-level code, to improve performance and to have a better sense of what the CPU and memory are doing as a result of your code.
Regarding fixed precision neural networks, it looks like Intel had some fairly mature work on this at ISCA -- fast inference, but it has to be trained elsewhere in the machine:
This is a good primer for understanding the Spectre[0] issue. In a nutshell, Spectre is caused by the CPU keeping the branches in an unencrypted cache that is accessible from programs other than the one that generated the branch.
Their fix was to add an extra kernel check for unauthorized access, which is why it causes a huge performance hit. The real fix will be in the next generation of chips, which will either encrypt or at the least protect the branch cache from unauthorized access from other programs.
> In a nutshell, Spectre is caused by the CPU keeping the branches in an unencrypted cache that is accessible from programs other than the one that generated the branch.
I would instead say that the branch prediction has observable side effects even when it’s thrown away; say, for example, the branch performs a memory access: it may leave things in the cache even if the processor doesn’t take the branch, and hence it may be possible to recover information by running a timing attack on how long it takes to access certain things.
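A minimal sketch of that timing observation (assuming x86 and GCC/Clang intrinsics; this is not the actual Spectre exploit, just the "was this line cached?" measurement the attack builds on):

    /* Time a single load to tell a cache hit from a miss. In a Spectre-style
     * attack, speculative execution leaves a victim-dependent line in the
     * cache, and this measurement is what makes that visible. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>

    static uint64_t time_access(volatile uint8_t *p)
    {
        unsigned aux;
        uint64_t start = __rdtscp(&aux);
        (void)*p;                            /* the load being timed */
        return __rdtscp(&aux) - start;
    }

    int main(void)
    {
        static uint8_t probe[4096];
        _mm_clflush(probe);                  /* evict the line       */
        _mm_mfence();
        uint64_t cold = time_access(probe);  /* miss: slow           */
        uint64_t warm = time_access(probe);  /* now cached: fast     */
        printf("cold=%llu warm=%llu cycles\n",
               (unsigned long long)cold, (unsigned long long)warm);
        return 0;
    }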
I feel like fixed precision neural nets are probably the future of branch prediction: both 'hardcoding' a neural net trained on a large amount of software and baking the weights into the CPU in a ROM, and dynamic approaches to deciding which prediction strategy to use for a specific branch based on state recorded by another neural net.
Approaches like this have been used as a fun exercise, but now I think we're reaching the point where single thread performance is so important, and mispredicts are getting so expensive, that it's worth taking quite a big power and area hit to get a few more percentage points of performance...
> but now I think we're reaching the point where single thread performance is so important
Ever since Dennard scaling stopped, the biggest boosts to performance have been in increasing the parallelization opportunity, both at the SIMD level and at the multi-core level. Admittedly, working in HPC gives me a biased view, but everyone I see has resigned themselves to processors ceasing to give meaningful boosts to single-threaded performance.
Moreover, the ceiling you could get in boosting single-threaded performance with a perfect branch predictor (for conditional branches) over current state-of-the-art is around 4%. There's just not a lot of upside to be had, and even just using the extra space for another core looks preferable at such a low ceiling. Extra area for the branch predictor is likely to go to better indirect branch prediction (including virtual function calls), which are increasingly important in modern workloads, and where there is a much larger headroom for performance improvement.
I'll also add that the effectiveness of neural nets in modern machine learning contexts has come from making them big and deep, and deep means adding latency to the branch predictor, which is not what you want (especially because you want to know ASAP if you need to prefetch a different I-cache line).
Also, branch prediction circuitry has costs: higher energy per instruction on average, management of the design complexity and the risks associated with it (e.g. Spectre, Meltdown), increased die size and therefore a higher faulty-die rate and die cost, etc.
Thing is, not everything is amenable to SIMD or multithreaded optimization. As you said yourself, a lot of HPC code tends to parallelize more easily than average. For a lot of code, out-of-order execution is the only way to extract some amount of parallelism, and one of the biggest limiting factors on increasing the reorder window is the branch prediction rate. Even an improvement of 0.1% in the prediction rate can lead to non-trivial improvements in performance.
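To put rough numbers on that (assumed figures, just for scale): if ~20% of instructions are branches, the average distance between mispredictions is 1 / (branch fraction x miss rate), so even a 0.1% accuracy gain visibly stretches the window the core can usefully speculate across:

    /* Back-of-envelope: instructions between mispredictions for two
     * accuracy levels, using an assumed 20% branch density. */
    #include <stdio.h>

    int main(void)
    {
        double branch_fraction = 0.20;          /* assumed        */
        double accuracies[] = { 0.990, 0.991 }; /* +0.1% accuracy */
        for (int i = 0; i < 2; i++) {
            double miss_rate = 1.0 - accuracies[i];
            printf("accuracy %.3f -> ~%.0f instructions between mispredicts\n",
                   accuracies[i], 1.0 / (branch_fraction * miss_rate));
        }
        return 0;
    }

That works out to roughly 500 vs roughly 555 instructions between pipeline flushes, which matters when the reorder window is itself hundreds of instructions deep.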
> Admittedly, working in HPC gives me a biased view, but everyone I see has resigned themselves to processors ceasing to give meaningful boosts to single-threaded performance.
There's also the realization that compiler & software development is still far, far behind the parallelization capabilities of the HW. See the failed DEC Alpha and Itanium. It's a bad design situation when compilers regularly generate code that blows caches, branch predictors, TLBs and instruction pipelines more than 50% of the time, even when the code has been run through aggressive (non-PGO) optimization effort. Compilers need to get better at generating better code.
> Moreover, the ceiling you could get in boosting single-threaded performance with a perfect branch predictor (for conditional branches) over current state-of-the-art is around 4%.
No, actually, imperfect branch prediction is still a major blocker on performance even now, since mispredictions limit the useful size of the out-of-order reorder buffer.
There has been some work in making cores wider with larger re-order depth and more physical register files, etc. But all of that is deeply into diminishing returns territory.
Future?
Most modern CPUs do this to some extent. Remember you can only have a very simple model here, because the inference time has to be extremely fast.
> "I wonder if we can make branch predictors even more accurate,” and the next day you’d start XOR’ing the branch’s PC address with a shift register containing the branch’s recent branching history, because in those days, you could XOR anything with anything and get something useful, and you test the new branch predictor, and now you’re up to 96% accuracy, ...
> When you retire in 2003, your face is wrinkled from all of the smiles, and even though you’ve been sued by several pedestrians who suddenly acquired rare paintings as hats, you go out on top, the master of your domain. You look at your son John, who just joined Intel, and you rest well at night, knowing that he can look forward to a pliant universe and an easy life. Unfortunately for John, the branches made a pact with Satan and quantum mechanics during a midnight screening of “Weekend at Bernie’s II.”
E.g. instead of:

    ...
    branch <jump-target>

we would have:

    prepare-to-branch <jump-target> <launchpoint-label>
    ...
    launch-label: branch <jump-target>
The utility is pretty limited, but it can help for strictly in-order machines.
https://chasethedevil.github.io/post/the_neural_network_in_y...
It talks about “Dynamic Branch Prediction with Perceptrons” (Jimenez 2001) which sparked the whole thing.
The AMD engineers read this paper (and probably many more!), put together a development program, and shipped it.
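For anyone curious what a perceptron predictor actually looks like, here is a rough sketch of the scheme from the Jimenez paper; the table size, history length, and training threshold below are assumed for illustration, not the paper's tuned values (and real hardware does the dot product with adder trees, not a loop):

    /* Perceptron predictor sketch: one small weight vector per branch,
     * prediction = sign of (bias + sum of weight[i] * history[i]). */
    #include <stdbool.h>
    #include <stdint.h>

    #define HIST_LEN   16
    #define NUM_PERCEP 1024
    #define THRESHOLD  37      /* train when wrong or not confident enough */
    #define W_MAX      127
    #define W_MIN     (-128)

    static int8_t weights[NUM_PERCEP][HIST_LEN + 1]; /* [0] is the bias  */
    static int8_t history[HIST_LEN];                 /* +1 taken, -1 not */

    static int dot(uint32_t pc)
    {
        int8_t *w = weights[pc % NUM_PERCEP];
        int sum = w[0];
        for (int i = 0; i < HIST_LEN; i++)
            sum += w[i + 1] * history[i];
        return sum;
    }

    bool predict(uint32_t pc)
    {
        return dot(pc) >= 0;               /* non-negative => predict taken */
    }

    void update(uint32_t pc, bool taken)
    {
        int sum = dot(pc);
        int t = taken ? 1 : -1;
        int mispredicted = ((sum >= 0) != taken);

        if (mispredicted || (sum < 0 ? -sum : sum) <= THRESHOLD) {
            int8_t *w = weights[pc % NUM_PERCEP];
            if (w[0] + t >= W_MIN && w[0] + t <= W_MAX)
                w[0] = (int8_t)(w[0] + t);
            for (int i = 0; i < HIST_LEN; i++) {
                int nw = w[i + 1] + t * history[i];
                if (nw >= W_MIN && nw <= W_MAX)
                    w[i + 1] = (int8_t)nw;
            }
        }
        /* shift the outcome into the global history */
        for (int i = HIST_LEN - 1; i > 0; i--)
            history[i] = history[i - 1];
        history[0] = (int8_t)t;
    }

The appealing property, and the reason it competed with two-level schemes, is that the hardware cost grows linearly with history length rather than exponentially, so much longer histories become affordable.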
a) Eben Upton's write-up on "Raspberry Pi and Spectre/Meltdown" gives a very nice overview of the main features of modern processors - https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulne...
b) Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture by Jon Stokes.
c) Computer Systems: A Programmer's Perspective by Bryant and O'Hallaron.
d) Modern Processor Design: Fundamentals of Superscalar Processors by Shen and Lipasti.
https://arxiv.org/abs/1906.09889
[0] https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...