Optimizing a bignum library for fun

I've had to implement BigNum on an embedded system that had very little RAM and where the initial optimized version of ModExp took many minutes to complete. After much hair-pulling, the final version took 40 seconds.

First, you should work with unsigned numbers, and use a power of 2 as your word size. The fastest choice for a word size is very operation- and CPU-dependent.

A key trick is to lay out your bignums in the same order as the endianity of each word in memory, e.g. least-significant word first on little-endian systems. This will allow you to choose your word size dynamically for each operation: in memory, a number with M words of N bits each is identical to a number with M / 2 words of N * 2 bits each.

For multiplication, identify the CPU instruction with the widest result, then use half that size as your word size. Each step through the arrays generates a word result in the low half and a word carry in the top half. The carry gets added to the result of the next step, possibly overflowing.

For addition, use the widest result as your word size. This can also overflow.

How you deal with overflows is very CPU-dependent. You can use adc/addc as someone else mentioned, which will be faster on embedded and may be faster on fatter chips. Alternatively, you can halve the word size and use the top half as the carry.

If addc is not available, you can test for overflows as follows:

    uint32_t a = ..., b = ...;
    uint32_t res = a + b;
    uint32_t carry = res < a;

On overflow, res must necessarily be less than both a and b, so no need to check b.

If SIMD instructions are available, that will almost always be the fastest choice by far. While it doesn't change the above guidelines in principle, there are often e.g. optimized overflow mechanisms.

less_less · 2 years ago

    """
    uint32_t a = ..., b = ...;
    uint32_t res = a + b;
    uint32_t carry = res < a;
    """

But note well: this trick doesn't work when there is also a carry-in. That is,

    uint32_t res = a + b + carry;
    carry = res < a;

won't work when b = 0xFFFFFFFF and carry = 1, so you have to check for carries twice, or set `carry = res < a || (res == a && carry);` or similar. So if you are implementing bignums on an architecture that doesn't have ADC (such as RISC-V), or on a vector architecture where the performance would be bad, then you often need an alternative strategy such as using less than 32/64 bits per limb (whether that's half-size, or 30 bits or 51 bits or whatever).

worstspotgain · 2 years ago

Yep. Those are two additions, and naturally every full-word addition needs a separate overflow check. You can avoid the short-circuiting branches by just adding the carries:

   uint32_t res1 = a + carry;
   carry = res1 < a;
   uint32_t res2 = b + res1;
   carry += res2 < b;

Deleted Comment

jonathrg · 2 years ago

Just curious, what kind of embedded system were you working on that required arbitrary precision numbers, and where it was acceptable for the system to stall for 40 seconds during a calculation?

worstspotgain · 2 years ago

The ModExp was used for cryptographic authentication, and only ran once per session or once when pairing. IIRC it was a battery BLE device.

Modulo is surprisingly expensive even when you combine it with a quotient. It is almost always better to use binary "limbs", in this case 31 or 32 bits wide, because decimal parsing and printing should be much rarer than individual operations in general.

colonwqbang · 2 years ago

This is a bit misleading. Quotient and modulo with constant right-hand side is readily optimised into faster operations, like multiplication and bit shifts. "All" optimising compilers do this and don't emit a true divide instruction in such cases. See for instance: https://godbolt.org/z/vxnTPvY6M

Quotient and modulo with a variable RHS is expensive, but this isn't necessary for the algorithm showed in the article. Although I agree that binary wins over decimal anyway. We are using binary computers after all, for the last 50 years or so.

Someone · 2 years ago

> Quotient and modulo with a variable RHS is expensive

For dividing ‘big’ bignums by ‘small’ divisors, I would use/tweak libdivide (https://libdivide.com/), and spend the time to optimize the division.

For dividing ‘big’ bignums by ‘big’ divisors, that may not be worth it.

(Determining the correct values of the various ‘big’ and ‘small’ values left as an exercise for the reader. Those will depend on the hardware at hand)

anonymoushn · 2 years ago

The oldish version of gcc used by highload.fun seems to stop performing this optimization after the 5th or so constant divisor within a small function body. Always good to double check :(

worstspotgain · 2 years ago

The reason a division by 1000000 can be efficiently turned into a multiplication is because you're targeting a system that has a 32x32->64 multiplication instruction, which has plenty of room to spare. But if you have that, you would be better off using e.g. 2^32 as your modulo.

haberman · 2 years ago

Coincidentally I was just writing a bignum library from scratch two weeks ago.

A few interesting things I learned:

1. Knuth volume 2 has an extensive discussion of the problem space. I've only gotten a chance to skim it so far, but it looks interesting and approachable.

2. I need to support bitwise operations, which operate on a two's complement representation, so I figured it would be simpler to use two's complement internally, despite seeing that most (all?) bignum libraries use signed magnitude. I'm starting to regret this: two's complement introduces a lot of complexity.

The most fun thing about working on a bignum library is that it makes the algorithms you learned in grade school for add/subtract/multiply/divide relevant again. The basic ("classical") algorithms on bignum are basically exactly the same thing you learned in grade school, except on a much larger base than base-10.

vintermann · 2 years ago

Knuth has a "hugenum" scheme, TCALC, which is based on trees rather than byte arrays. There are many ways to do it, depending on which type of numbers you need most - for TCALC, it's numbers close to towers of exponents.

Of course for every big number encoding that can beat an array of bytes for some numbers, it also has to lose to byte arrays for some numbers... But it's possible to bound this inefficiency, so that it's never more than twice as bad as byte arrays, but potentially exponentially more efficient for the favoured numbers.

lysp · 2 years ago

Twice as bad being a timeout and 2nd call?

minimize · 2 years ago

For maximum efficiency, you should work in binary instead of base 10. Handling carries becomes more straightforward with the right primitives, for example __builtin_addc with GCC: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins...

You can also implement it in C if you want a more portable solution: https://github.com/983/bigint/blob/ee0834c65a27d18fa628e6c52...

If you scroll around, you can also find my implementations for multiplication and such.

rurban · 2 years ago

For maximum efficiency you'd avoid addc like hell, because it blocks internal precomputation, and use guarantees like he did, avoiding overflow better. Better use int128 types though.

I'd just just stack objects for small typical sizes. And esp. do formal proofs than the random test joke.

> For maximum efficiency you'd avoid addc like hell, because it blocks internal precomputation, and use guarantees like he did, avoiding overflow better.

Whether that is the case depends on the CPU. I hit memory bandwidth limits before adc becomes a problem. But I concede that you have a valid point and that there are definitely cases where leaving sufficient space for carries is the better strategy.

Anyway, we can probably both agree that "% 1000000000" is not the best way to do it.

koverstreet · 2 years ago

addc makes all your adds serialize on the flags register, which is really painful.

more modern approach is to reserve some bits out of every word for carries, but that drastically complicates things.

globular-toast · 2 years ago

Funny, I did this myself 10 years ago. Shit, it's been that long..?

For my undergrad project I wrote a computer algebra system to do symbolic integration. The supervisor was a hardcore, old school C guy, so naturally I was going to just use C and no libraries. He told me I'd need bignums first, so I got to work (this is because many algorithms like polynomial GCD create massive numbers as they go, even if the inputs and final outputs are generally very small).

I just couldn't figure out how to do better than the largest power of 10 per digit at the time. Working with non base 10 arithmetic was a mind fuck for me at the time. So I did it with digits holding 10^9 and the classical algorithms from Knuth. Division is the hardest!

At some point I discovered the GNU multiple precision library (GMP) and made my program work with that instead of mine. I was shocked at how much faster GMP was! I finished my project with my own code, but I knew I had to come back to do it better.

The breakthrough came when I got a copy of Hacker's Delight. It has stuff like how to detect overflow after it's happened (in C). Something twigged and then I just understood how to fill each word completely rather than use a power of 10. I don't know what confused me before.

But, of course, the real way to do it is to use assembly. You can't get close to high performance in C alone. In assembly you get the overflow bit. It's actually easier in a way! So you write tiny platform specific bits for the digits and build on that in C. My add and subtract were then as fast as GMP. I lost interest when it came to implement faster multiplication algorithms.

Code in case anyone is interested: https://github.com/georgek/bignums

lifthrasiir · 2 years ago

guyomes · 2 years ago

For a small survey of practical efficient methods for bignum arithmetic operations, the Algorithms section of the documentation of GMP [1] is excellent.

[1]: https://gmplib.org/manual/Algorithms

tgot · 2 years ago

I think that your description is almost excellent, but that you're fundamentally misleading in describing what you are doing as a "30-bit" digit.

It's a 10^9 digit mathematically, occupying 30 bits of storage. You do briefly mention that it's 10^9, but repeatedly say 30-bits.

samatman · 2 years ago

It isn't misleading at all.

A hexadecimal digit has 4 bits of entropy. You can guess it correctly by chance one in sixteen times. Calling that a four-bit digit is correct. Same with a 30 bit digit, all that changes is the magnitude.

The "storage bits" aren't a detail, they're an essential property. Stored in balanced ternary it would still have 30 bits of entropy.

derdi · 2 years ago

I think you misunderstood the parent. If you speak of a number in the range 0-15 stored in 4 bits, we can all agree that "4 bit digit" is the appropriate term for it. But what about a number also stored in 4 bits but restricted to the range 0-13? It's a digit that fits in 4 bits, but calling it a "4 bit digit" without further qualification would omit relevant information.

nj5rq · 2 years ago

Fascinating article, I have always been wondering how these big number libraries worked.

As a side question, does anyone the program that the author used when making that "addition" and "multiplication" performance graph? Thanks.

jaeckel · 2 years ago

If you're interested in even more details you could also have a look at Tom's book about bignum arithmetics. C.f. third paragraph of https://github.com/libtom/libtommath#summary

The library evolved since then, so the algorithm documentation only exists in code nowadays.

FTR I'm involved in the project as maintainer.

nerdponx · 2 years ago

I don't know for sure, but you can do that kind of thing pretty easily with Matplotlib in Python. Or in R base graphics, with more effort to get it looking pretty.