marginalia_nu · 6 months ago
This is very tangential, but I did some work on ternary computer emulation a lifetime ago. There's a cute trick for finding a closed-form translation between division by a power of 3 and a series of bit shifts and additions.

Start by observing that

1/3 - 1/2 = 2/6 - 3/6 = -1/6 = -(1/2)(1/3)

or

1/3 = 1/2 - (1/2)(1/3)

Substitute the equation above into its own RHS repeatedly and find

1/3 = sum over N = 1..inf of -(-1/2)^N

You can do this with arbitrary pairs of powers of 2 and 3 (and other bases too).

The implication is that you can fairly easily build a fixed-time divide-by-constant circuit out of nothing but adders and subtractors for divisors that are close to a power of two.
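To make that concrete, here's a rough Python sketch (my own toy model, not a circuit description): it approximates n/3 with nothing but shifts, adds, and subtracts, following the series 1/3 = 1/2 - 1/4 + 1/8 - ..., plus one shift-and-add correction at the end to soak up truncation error.

    def div3(n, frac_bits=40):
        # Divide a non-negative integer by 3 using only shifts, adds and
        # subtracts, via the alternating series 1/3 = 1/2 - 1/4 + 1/8 - ...
        # frac_bits sets the fixed-point precision of the approximation.
        x = n << frac_bits              # n with frac_bits of fraction appended
        acc = 0
        k = 1
        while (x >> k) != 0:            # keep taking terms until they vanish
            term = x >> k
            acc += term if (k & 1) else -term
            k += 1
        q = acc >> frac_bits            # provisional quotient (can be 1 too low)
        # one correction step, still only shift+add: 3*(q+1) = ((q+1) << 1) + (q+1)
        while ((q + 1) << 1) + (q + 1) <= n:
            q += 1
        return q

    assert all(div3(n) == n // 3 for n in range(10000))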

ngneer · 6 months ago
Incredible, thanks for sharing. My understanding is that ternary computers would have been based on tri-state logic which was less reliable than say transistors or even vacuum tubes encoding binary state. Is that understanding correct?
phkahler · 6 months ago
The Cinematronics arcade game processor has two 12-bit accumulators. The multiply instruction shifts them right as a single 24-bit value and adds the contents of memory if a 1 came out the LSB. So you clear the upper half, load one value in the lower, I forget how you set the memory address for the other operand... and execute several 1-bit multiplies in a row. You can get a 24-bit product this way, but most code I saw had runs of 8 multiplies. The most common thing was a 2x2 matrix multiply to do coordinate rotations for game objects.

This was in the mid-1970s using off-the-shelf 7400-series parts and had a peak throughput of 5 MIPS.
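For anyone who wants to see the mechanism, here's a rough Python sketch of that style of serial shift-and-add multiply (my own illustration; the real hardware's exact ordering of the shift and the add may differ slightly). The memory operand gets added into the upper half of the 24-bit pair.

    MASK12 = 0xFFF

    def serial_multiply(multiplier, multiplicand, steps=12):
        # The 24-bit accumulator pair starts with the multiplier in the low
        # 12 bits.  Each step conditionally adds the multiplicand into the
        # upper 12 bits, then shifts the whole 24-bit value right one bit.
        acc = multiplier & MASK12
        for _ in range(steps):
            if acc & 1:                               # bit about to fall out of the LSB
                acc += (multiplicand & MASK12) << 12  # add into the upper half
            acc >>= 1
        return acc                                    # full 24-bit product

    assert serial_multiply(100, 200) == 20000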

greesil · 6 months ago
Not exactly one cycle for a multiply it sounds like. That 5 mips would be eaten up quickly. I have sometimes had to do fixed point math in the past 20 years and have had much respect for those programmers who came before me.
phkahler · 6 months ago
>> Not exactly one cycle for a multiply it sounds like. That 5 mips would be eaten up quickly.

Sure taking 8 instructions for a multiply is a lot, but microprocessors at that time didn't even have multiply and ran below 1MIPS. I just wanted to bring it up as yet another way someone implemented multiply, and it was pretty effective for its time.

EDIT: too late to edit my original comment, but it added the memory value to the upper half of the 24-bit result.

philsnow · 6 months ago
> you may have heard of techniques such as carry lookahead, Kogge-Stone addition

Just an aside, that is Peter Kogge, who did his PhD work at Stanford, worked on the space shuttle, is an IBM fellow, and invented the first multi-core cpu.

atq2119 · 6 months ago
> invented the first multi-core cpu

The man clearly has many well-deserved achievements to his name, but that remains true without this one, and I'm confident the world would be better off without this kind of statement.

"A multi-core CPU" isn't much of an invention per se. It's an idea -- one that is fairly obvious and trivial at a certain point of semiconductor history. Getting a multi-core CPU to run is not trivial, but it's not a single invention either, and by the time we got there, development teams were so large that it would be downright insulting to claim that one person solved all these problems by himself. Perhaps Kogge led the development of the first multi-core CPU, perhaps he was even visionary in the sense that he pushed for it before others thought it was feasible (I do not know if that is the case). But either way, he didn't invent it.

philsnow · 6 months ago
Thank you for keeping me honest, I concede the point; I was quoting from his Wikipedia entry and wasn’t particularly critical because I took an architecture class from him in grad school and liked him as a professor.
reaperman · 6 months ago
This raises my general curiosity to ask myself: among the set of things that could be said to have been truly invented by a single person (or pair/trio/tiny team) ... which inventions are the most complex ... and which are the most technologically advanced?
oidar · 6 months ago
I thought Kunle Olukotun led the team for the first multi-core CPU.
philsnow · 6 months ago
You may absolutely be right, I don’t know who did it “first”.

I read your comment in exactly this voice https://www.getyarn.io/yarn-clip/6b70f8d0-5706-4e10-a6e9-e61...

(In this scene, Steve Martin’s character Vinnie is trying to get away from Rick Moranis’s character Barney, and he gets a bunch of actors to pretend to be his family (none of whom speak English) to play on the latter’s sympathy and ask to have his handcuffs removed. Vinnie introduces Barney as, among other things, the inventor of the rotary engine, causing one of the actresses to break character and say “I thought Wankel invented the rotary engine”.)

nxobject · 6 months ago
As a double-aside, Peter Kogge wrote a very good early textbook on pipelined microarchitectures that's worth reading if you want to learn how early supercomputer vector processors were designed: The Architecture of Pipelined Computers (1981).
mturmon · 6 months ago
Peter used to consult/collaborate with my lab. He was a proponent of moving remote sensing computations closer to the sensor ("edge computing" these days).

You could definitely make the intellectual case for this approach. In cases where there's latency or costs associated with moving data to central computing, this makes sense. (In our case, it was space-based sensors so you could make the case that way.)

But AFAIK this style of processing was never systematically adopted in any space-based processing system, although many such systems (like radars) have ad hoc data reductions performed in hardware close to the sensor.

Thanks for providing that connection!

kens · 6 months ago
Author here if anyone has questions...
vanderZwan · 6 months ago
What happened to the dedicated "times three" multiplier in later machines? Did some form of it stick around? Did they change tactics to something making it obsolete?
phire · 6 months ago
Obsolete.

You can observe the evolution of multipliers in the MIPS line, which I've been studying, and happen to know.

The R3000A had the same Radix-8 (aka 3-bit per cycle) integer multiplier in 1989.

The R4000 had two Radix-8 multipliers in 1991. One for integer, one for floating point.

The R4400 was an upgraded R4000 in 1992. The integer multiplier was kept at Radix-8, but the FPU was upgraded to a Radix-256 design (aka 8 bits per cycle).

In parallel, MIPS spent a lot of time creating a low-power design for the target market of "Windows NT laptops". The result was the R4200, released in 1993. MIPS published quite a bit of information about the various power-saving optimisations [1], [2]. Instead of separate integer and floating-point data paths, they created a unified data path that did both, allowing them to use the same Radix-8 multiplier for everything. They even unified the multiplier unit into the main adder, rather than using a separate adder like earlier designs.

In 1995, MIPS released the R4300i (aka the CPU found in the Nintendo 64). It was an evolution of the R4200, keeping the unified float/integer datapath. But it gained the Radix-256 multiplier from the R4400 design, so both integer and float instructions complete at 8 bits per cycle.

As far as I can tell, this Radix-256 multiplier doesn't use any fancy tricks. It's just an array of eight 64-bit wide carry-save adders, feeding into a regular carry-lookahead adder.
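If anyone is curious what a carry-save adder actually does, here's a tiny Python sketch (just an illustration of the 3:2 compressor idea, not the actual netlist of any of these chips): three addends go in, and a sum word plus a carry word come out with no carry propagation across the width, so only the final step needs a real carry-propagating (e.g. carry-lookahead) adder.

    def carry_save_add(a, b, c):
        # 3:2 compressor: per bit, sum = a XOR b XOR c, carry = majority(a, b, c)
        sum_word = a ^ b ^ c
        carry_word = ((a & b) | (a & c) | (b & c)) << 1
        return sum_word, carry_word

    def csa_reduce(terms):
        # Repeatedly fold three terms into two until only two remain, then
        # do one ordinary carry-propagating addition at the end.
        terms = list(terms)
        while len(terms) > 2:
            a, b, c = terms.pop(), terms.pop(), terms.pop()
            terms.extend(carry_save_add(a, b, c))
        return sum(terms)

    assert csa_reduce([3, 5, 7, 11, 13]) == 39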

In 1996, MIPS released the R10000. Transistors were now cheap enough that they could implement a full 52-bit multiplier for the floating-point data path, allowing them to issue one double-precision multiplication every cycle (though it's pipelined with a 2-cycle latency). I assume it's just 52 stacked adders, though it seems like they probably need to be doing something fancier with carries by the time it's that big.

Most modern CPUs have ended up at the same point: massive pipelined arrays of adders that can execute one 64-bit multiplication every cycle, with a 3-cycle latency.
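As a rough way to see what "radix-8" versus "radix-256" buys you, here's a toy Python model (my own sketch, not any specific MIPS implementation): it retires k bits of the multiplier per "cycle", so k=3 takes 22 cycles for a 64-bit multiply and k=8 takes 8.

    def iterative_multiply(a, b, k, width=64):
        # Toy model of an iterative multiplier that handles k bits of the
        # multiplier per cycle.  The `a * digit` below stands in for
        # selecting one of a handful of precomputed multiples of a.
        product, cycles = 0, 0
        for shift in range(0, width, k):
            digit = (b >> shift) & ((1 << k) - 1)
            product += (a * digit) << shift
            cycles += 1
        return product, cycles

    assert iterative_multiply(12345, 67890, 3) == (12345 * 67890, 22)
    assert iterative_multiply(12345, 67890, 8) == (12345 * 67890, 8)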

[1] https://www.youtube.com/watch?v=nll5MWlG7q4
[2] https://ieeexplore.ieee.org/document/282948/

kens · 6 months ago
There were a lot of different algorithms in use back then. I don't know if there is a winner used in modern chips.
dsvf · 6 months ago
No question, but just in general appreciation for the wonderful content you put out there that we get to enjoy!
java-man · 6 months ago
Ken, maybe it's time to publish a book?
kens · 6 months ago
That would require me to focus on one subject :-)
ForOldHack · 6 months ago
I have to respect his choice to focus on the 8086 and the Pentium. I would think he considers the 80286 a brain-dead footnote, interesting as to what went wrong and what went terribly wrong.

I used to teach binary math to kids, and at a county fair I was upstaged by an 11-year-old girl who demonstrated both multiplication by powers of two and division by powers of two.

"What does your father do for a living?'

"He is the director of the laser fusion project!"

"Oh."

kristianp · 6 months ago
This is probably a basic question, but is this for floating point multiplies? Isn't the part that's being multiplied less than 64 bits because you also add the exponents?
kens · 6 months ago
Yes, this is for floating point multiplies. But internally, the FPU uses x86 extended-precision numbers, 80-bit floats, with 64 bits of significand and 15 exponent bits. So the multiplier is dealing with 64-bit numbers, plus a few more bits to support rounding correctly.

https://en.wikipedia.org/wiki/Extended_precision#x86_extende...
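For the curious, here's a small Python sketch of how those 80 bits break down (just an illustration of the format for normal numbers; NaNs, infinities, and denormals are ignored):

    def decode_x87_extended(raw80):
        # x86 extended precision: 1 sign bit, 15 exponent bits (bias 16383),
        # and a 64-bit significand with an explicit integer bit (bit 63).
        significand = raw80 & ((1 << 64) - 1)
        exponent = (raw80 >> 64) & 0x7FFF
        sign = (raw80 >> 79) & 1
        # value = (-1)^sign * (significand / 2^63) * 2^(exponent - 16383)
        return (-1) ** sign * significand * 2.0 ** (exponent - 16383 - 63)

    # 1.0 encodes as sign 0, exponent 0x3FFF, significand 0x8000000000000000
    assert decode_x87_extended((0x3FFF << 64) | (1 << 63)) == 1.0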

dwedge · 6 months ago
I only tenuously understand this so feel free to ignore the question if it's too asinine but:

> (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

Why can't you do the same to subtract x4 from x7 to get x3?

thombat · 6 months ago
Because x8 is simply shifting 3 bits left, so it's "free" (fixed-size shifts are very simple to implement), whereas generating the x7 would require additional circuitry, which will additionally be more expensive than the x3 was (since it's another shift+add beyond what x3 required).
mikequinlan · 6 months ago
x7 is computed by x8 - x, so you get x8 - x - x4 which is 2 subtractions (and 2 delays). You only want 1 delay.
russdill · 6 months ago
The question would be why isn't it 4x - 1x?
rob74 · 6 months ago
Not really a question, just one phrase that left me scratching my head a bit:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial.

I think you can leave out the "almost" there - multiplying by 0 is 0 and multiplying by 1 is the number itself, as shown in the example, so I would definitely call it trivial.

thaumasiotes · 6 months ago
I'm missing something:

> The downside to radix-8 multiplication is that multiplying by a number from 0 to 7 is much more complicated than multiplying by 0 or 1, which is almost trivial. Fortunately, there are some shortcuts. Note that multiplying by 2 is the same as shifting the number to the left by 1 bit position, which is very easy in hardware—you wire each bit one position to the left. Similarly, to multiply by 4, shift the multiplicand two bit positions to the left.

> Multiplying by 7 seems inconvenient, but there is a trick, known as Booth's multiplication algorithm. Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

> Thus, you can get the ×7 by subtracting. Similarly, for a ×6 term, you can subtract a ×2 multiple and add ×8 in the next digit.

> Thus, the only difficult multiple is ×3. (What about ×5? If you can compute ×3, you can subtract that from ×8 to get ×5.)

If we can easily compute ×2 for the purposes of exploiting 6x = 8x - 2x, and we can easily compute ×4 for the purposes of exploiting 4x = 4x...

why is it more difficult than that to compute 3x as the sum of 2x and 1x, or the difference of 4x and 1x?

Also, if we can easily compute ×6 by any means, why not compute ×3 as a right shift of that value? That is admittedly an extra step, but the extra step is a shift.

kens · 6 months ago
For the 64-bit multiplication, you're adding 22 terms, one for each base-8 digit. (Think of grade-school multiplication.) Each term needs to be trivial to compute, so you can do a shift and/or a negation to get the term, but you can't do another addition.

The point is that ×3 is precomputed once, so you can then simply feed it into any of the 22 terms as needed. You can't do ×2 and ×1 into a term to get ×3 because every term would need another adder. In other words, you want one circuit to compute ×3 rather than 22 circuits.

For your ×6 question, this value is computed by putting negative ×2 into the term and conceptually adding one more to the next digit to get ×8. You can't right-shift this ×8 value since it's part of a completely different term.

Hopefully this makes sense. There are a lot of digits and sums flying around.
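Here's a small Python sketch of the kind of recoding described here (my own illustration of the idea, not the Pentium's actual circuit): every base-8 digit is rewritten as a signed multiple in the range -4..+3, so each of the 22 terms is just a shifted and possibly negated copy of 1x, 2x, 4x, or the single precomputed 3x.

    def radix8_terms(multiplicand, multiplier, digits=22):
        x3 = (multiplicand << 1) + multiplicand          # the one "hard" multiple, built once
        multiples = [0, multiplicand, multiplicand << 1, x3, multiplicand << 2]

        terms, carry = [], 0
        for i in range(digits):
            d = ((multiplier >> (3 * i)) & 7) + carry
            if d >= 4:            # digits 4..7 become -4..-1 plus a carry of 8
                d -= 8
                carry = 1
            else:
                carry = 0
            term = multiples[d] if d >= 0 else -multiples[-d]
            terms.append(term << (3 * i))
        terms.append(carry << (3 * digits))              # carry out of the top digit
        return terms

    a, b = 123456789, 987654321
    assert sum(radix8_terms(a, b)) == a * b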

thaumasiotes · 6 months ago
How do you do a negation without doing an addition? Can you do better than -x = ~x + 1?
lewdwig · 6 months ago
Multiplication by 3 is actually a common operation, particularly in address calculations, where a shift and an add means multiplying the index by 3. Implementing this naively would add significant latency. But using this circuitry the LEA (Load Effective Address) instruction can do it in a single cycle, so spending this much transistor budget on it was totally a good call.
quadhome · 6 months ago
Is it used for that, though? As I understood the article, this circuit is used as a part of floating-point multiplication.
kazinator · 6 months ago
What?

- LEA is just an instruction that moves the address calculated by an addressing mode into its output operand, rather than performing a data transfer to or from that address; all the addressing that LEA can do, MOV instructions can also do.

- The indexed addressing modes of the x86 used by MOV or LEA do not support a scale factor of 3, only powers of two: 1, 2, 4, 8. So address generation has no use for multiplication by 3.

- The article makes it clear that the multiplier by 3 is part of the floating point multiplier.

mjevans · 6 months ago
"This ×3 multiplier contains roughly 9000 transistors, a bit more than an entire Z80 microprocessor (1976). Keep in mind that the ×3 multiplier is a small part of the floating-point multiplier, which is part of the floating-point unit in the Pentium. Thus, this small piece of a feature is more complicated than an entire microprocessor from 17 years earlier, illustrating the incredible growth in processor complexity."

That's the pace of performance growth that led software to become bloated today; next year's performance improvements would cover up most of the sins of failing to think critically about algorithms and data flow context/locality.

Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics. The pendulum needs to swing the other direction; it's time for computers to work smarter not harder.

JumpCrisscross · 6 months ago
> Today (as far as I've read) we're at the practical limit for what's reasonable to do with silicon semiconductor technology and our present understanding of physics

We’ve been at that limit for decades.

GrumpyYoungMan · 6 months ago
The fabs propped up the corpse of Moore's Law by throwing mountains of cash at expanding transistors into the third dimension: finFET, GAA, CFET, etc. That has kept the party going a little while longer than it would have lasted, but it's a one-time deal since there are no more dimensions to expand into.
Joker_vD · 6 months ago
Depends on what limit exactly you are thinking about: the size of a transistor is still shrinking.
dlcarrier · 6 months ago
The speed of software bloat matching hardware improvement is known as Wirth's law: https://en.wikipedia.org/wiki/Wirth%27s_law

I think software bloat is growing faster, though.

_carbyau_ · 6 months ago
I think the software bloat is being more affected by the speed of light. If only every damn thing didn't need internet access with its associated lag - often variable in timing...
userbinator · 6 months ago
On the other hand, the multiplier is far more regular in structure than a Z80. The datapath on the Pentium is several times wider too.
kens · 6 months ago
A bit surprisingly, the Z80's datapath is 4 bits wide. The Pentium's FPU is more than 64 bits wide (extra bits for rounding and stuff). So, yes, there is a big difference in the datapath widths.
acchow · 6 months ago
History of function calls:

- goto/jmp to instr

- vtable lookup

- hash and lookup in a dictionary

- run a Large Language Model

CamperBob2 · 6 months ago
Or, rather:

- goto/jmp to instr

- vtable lookup

- hash and lookup in a dictionary

- run everything through a universal approximator

HPsquared · 6 months ago
Luckily there is plenty of room for improvement in most applications.
ajsnigrutin · 6 months ago
But why would we waste thousands of hours of programmers time to optimize stuff, if we can instead waste millions if not billions of hours of users' time?

/s

Salgat · 6 months ago
Which is how it should be. There's a limited amount of resources in developing a chip, best to take advantage of the least costly route given those constraints.

Deleted Comment

EncomLab · 6 months ago
Jonathan Blow has entered the chat...

Deleted Comment

phkahler · 6 months ago
>> Instead of multiplying by 7, you add 8 times the number and subtract the number, ending up with 7 times the number. You might think this requires two steps, but the trick is to multiply by one more in the digit to the left, so you get the factor of 8 without an additional step.

Does this mean there's an "adder" to add 1 to the "next digit" prior to feeding the digits to the main part of the multiplier? That itself seems like it'd be similar to the carry-lookahead circuit. Also thinking about when that needs to be done:

7 = 8-1

6 = 8-2

5 = 8-3

4 = 8-4 <- he didn't say they do this, but it would mean the MSB of the 3-bit value could be used to determine if we need to add 1 to the next digit, saving a couple gates.
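For what it's worth, the standard textbook trick to avoid any ripple here (which may or may not be exactly what the Pentium does) is overlapped Booth windows: each signed digit is read directly from 3 new bits plus the MSB of the previous group, so the "add 1 to the next digit" is folded in without an adder. A small Python sketch:

    def booth_radix8_digits(m, num_digits):
        # Each signed digit d = b[3i] + 2*b[3i+1] - 4*b[3i+2] + b[3i-1],
        # read straight from the bits; no carry ripples between digits.
        digits, prev_msb = [], 0
        for i in range(num_digits):
            group = (m >> (3 * i)) & 7
            d = (group & 3) - 4 * (group >> 2) + prev_msb   # d is in -4..+4
            digits.append(d)
            prev_msb = group >> 2
        digits.append(prev_msb)        # MSB of the top group becomes a final +1 digit
        return digits

    m = 0xDEADBEEF
    assert sum(d * 8**i for i, d in enumerate(booth_radix8_digits(m, 11))) == m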