Addition is all you need for energy-efficient language models

shrubble · a year ago

I remember that many years ago, when floating point computation was expensive for Intel CPUs to do, there were multiple ways that programmers used integer trickery to work around this.

Chuck Moore of Forth fame demonstrated taking the value, say 1.6 multiplied by 4.1 and doing all the intermediate calculations via integers (16 * 41) and then formatting the output by putting the decimal point back in the "right place"; this worked as long as the range of floating point values was within a range that multiplying by 10 didn't exceed 65536 (16 bit integers), for instance. For embedded chips where for instance, you have an analog reading with 10 bits precision to quickly compute multiple times per second, this worked well.

I also recall talking many years ago with a Microsoft engineer who had worked with the Microsoft Streets and Trips program (https://archive.org/details/3135521376_qq_CD1 for a screenshot) and that they too had managed to fit what would normally be floating point numbers and the needed calculations into some kind of packed integer format with only the precision that was actually needed, that was faster on the CPUs of the day as well as more easily compressed to fit on the CDROM.

dajoh · a year ago

What you're describing is called fixed point arithmetic, a super cool technique I wish more programmers knew about.

Proper finance related code should use it, but in my experience in that industry it doesn't seem very common unless you're running mainframes.

Funnily enough, I've seen a lot more fixed point arithmetic in software rasterizers than anywhere else. FreeType, GDI, WPF, WARP (D3D11 reference rasterizer) all use it heavily.

kccqzy · a year ago

I have worked on firmware that has plenty of fixed point arithmetic. The firmware usually runs on processors without hardware floating point units. For example certain Tesla ECUs use 32-bit integers where they divide it into four bits of integer part and 28 bits of fractional part. So values are scaled by 2^28.

aatd86 · a year ago

What do they use? Not float I hope. Plus given that some currencies have different precisions... Don't tell me it's rounding errors over trillion monies?! :o)

EGreg · a year ago

Smart contracts on EVM and other blockchains all use fixed point, for the simple reason that all machines have to get exactly the same result.

myst · a year ago

Every half-competent software engineer knows about fixed point arithmetic, my friend.

andrewla · a year ago

I recall playing with FRACTINT, which was a fractal generator that existed before floating point coprocessors were common, that used fixed point math to calculate and display fractals. That was back when fractals were super cool and everyone wanted to be in the business of fractals, and all the Nobel Prizes were given out to fractal researchers.

touisteur · a year ago

Ozaki has been doing fp64 matrix-multiplication using int8 tensor cores

https://arxiv.org/html/2306.11975v4

Interesting AF.

candiddevmike · a year ago

AFAIK this is still the best way to handle money/financial numbers.

amanda99 · a year ago

That's got nothing to do with perf tho.

dwattttt · a year ago

That particular trick is known as fixed point arithmetic (not to be confused with a fixed point of a function)

asadalt · a year ago

this is still true for many embedded projects. like pi pico (2040) uses a table.

kragen · a year ago

Sure, FRACTINT is called FRACTINT because it uses fixed-point ("integer") math. And fixed-point math is still standard in Forth; you can do your example in GForth like this:

    : organize; gforth
    Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
    Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
    Type `bye' to exit
    : %* d>s 10 m*/ ;  : %. <# # [char] . hold #s #> type ;  ok
    1.6 4.1 %* %. 6.5 ok

Note that the correct answer is 6.56, so the result 6.5 is incorrectly rounded. Here's how this works.

(If you're not familiar with Forth, Forth's syntax is that words are separated by spaces. "ok" is the prompt, ":" defines a subroutine terminated with ";", and you use RPN, passing parameters and receiving results on a stack.)

In standard Forth, putting a decimal point in a number makes it a double-precision number, occupying two cells on the stack, and in most Forths the number of digits after the decimal point is stored (until the next number) in the non-standardized variable dpl, decimal point location. Here I've just decided that all my numbers are going to have one decimal place. This means that after a multiplication I need to divide by 10, so I define a subroutine called %* to do this operation. (Addition and subtraction can use the standard d+ and d- subroutines; I didn't implement division, but it would need to pre-multiply the dividend by the scale factor 10.)

"%*" is defined in terms of the standard subroutine m*/, which multiplies a double-precision number by a single-precision number and divides the result by a divisor, and the standard subroutine d>s, which converts a double-precision number to a single-precision number. (There's probably a better way to do %*. I'm no Forth expert.)

I also need to define a way to print out such numbers, so I define a subroutine called "%.", using Forth's so-called "pictured numeric output", which prints out an unsigned double-precision number inserting a decimal point in the right place with "hold", after printing out the least significant digit. (In PNO we write the format backwards, starting from the least significant digit.) The call to "type" types out the formatted number from the hold space used by PNO.

Then I invoked %* on 1.6 and 4.1 and %. on its result, and it printed out 6.5 before giving me the "ok" prompt.

If you want to adapt this to use two decimal places:

    : %* d>s 100 m*/ ;  : %. <# # # [char] . hold #s #> type ; redefined %*  redefined %.   ok
    1.60 4.10 %* %. 6.56 ok

Note, however, that a fixed-point multiplication still involves a multiplication, requiring potentially many additions, not just an addition. The paper, which I haven't read yet, is about how to approximate a floating-point multiplication by using an addition, presumably because in multiplication you add the mantissas, or maybe using a table of logarithms.

Forth's approach to decimal numbers was a clever hack for the 01970s and 01980s on sub-MIPS machines with 8-bit and 16-bit ALUs, where you didn't want to be invoking 32-bit arithmetic casually, and you didn't have floating-point hardware. Probably on 32-bit machines it was already the wrong approach (a double-precision number on a 32-bit Forth is 64 bits, which is about 19 decimal digits) and clearly it is on 64-bit machines, where you don't even get out of the first 64-bit word until that many digits:

    0 1 %. 184467440737095516.16 ok

GForth and other modern standard Forths do support floating-point, but for backward compatibility, they treat input with decimal points as double-precision integers.

visarga · a year ago

> can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products

It this were about convolutional nets then optimizing compute would be a much bigger deal. Transformers are lightweight on compute and heavy on memory. The weakest link in the chain is fetching the model weights into the cores. The 95% and 80% energy reductions cited are for the multiplication operations in isolation, not for the entire inference process.

woadwarrior01 · a year ago

Pre-fill (even in the single batch case) and multi-batch decoding are still compute dominated. The oft repeated trope of "decoder only transformer inference is bottle-necked on memory bandwidth" is only strictly true in the single batch decoding case, because you're mostly doing vector matrix mults when the batch size is one.

ein0p · a year ago

Not even single batch. If you want reasonable latency per token (TPOT) even larger batches do not give you high compute utilization during extend. It’s only when you don’t care about TPOT at all, and your model is small enough to leave space for a large batch on an 8 GPU host, that’s when you could get decent utilization. That’s extend only - it’s easy to get high utilization in prefill.

SuchAnonMuchWow · a year ago

Its worse than that: the energy gains are when comparing computations made with fp32, but for fp8 the multipliers are really tiny and the adder/shifters represent a largest part of the operators (energy-wise and area-wise) and this paper will only have small gains.

On fp8, the estimated gate count of fp8 multipliers is 296 vs. 157 with their technique, so the power gain on the multipliers will be much lower (50% would be a more reasonable estimation), but again for fp8 the additions in the dot products are a large part of the operations.

Overall, its really disingenuous to claim 80% power gain and small drop in accuracy, when the power gain is only for fp32 operations and the small drop in accuracy is only for fp8 operators. They don't analyze the accuracy drop in fp32, and don't present the power saved for fp8 dot product.

bobsyourbuncle · a year ago

I’m new to neural nets, when should one use fp8 vs fp16 vs fp32?

Deleted Comment

lifthrasiir · a year ago

I'm also sure that fp8 is small enough that multiplication can really be done in a much simpler circuit than larger fp formats. Even smaller formats like fp4 would be able to just use a lookup table, and that makes them more like sort-of-standardized quantization schemes.

tankenmate · a year ago

i suspect that you could do fp8 with log tables and interpolation if you really wanted to (compared to the memory required for the model it's peanuts), it just turns into a LUT (log table look up) and bit shift (interpolation). so again, memory bandwidth is the limiting factor for transformers (as far as energy is concerned).

brilee · a year ago

fp4/fp8 for neural networks don't work the way you think they do - they are merely compression formats - a set of, say, 256 fp32 weights from 1 neuron are lossily turned into 1 max value (stored in fp32 precision) and 256 fp4/fp8 numbers. Those compressed numbers are multiplied by the fp32 number at runtime to restore the original weights and full fp32 multiplication + additions are executed.

bee_rider · a year ago

What is fp4? 3 bits of exponent and one of mantissa?

api · a year ago

Sounds like the awesome architecture for transformers would be colocation of memory and compute.

Joker_vD · a year ago

Yes, that's why we generally run them on GPUs.

imjonse · a year ago

That is true for single user/light inference only. For training and batch inference you can get compute bound fast enough.

saagarjha · a year ago

That really depends on what you're doing. Trying to feed a tensor core is pretty hard–they're really fast.

kendalf89 · a year ago

Maybe this technique can be used for training then since that is a lot more compute intensive?

mikewarot · a year ago

Imagine if you had a systolic array large enough that all the weights would only have to be loaded once at startup. Eliminating the memory-compute bottleneck of the von Neumann architecture could make this quite a bit more efficient.

h_tbob · a year ago

Bro... they are NOT lightweight on compute!

tantalor · a year ago

[2023] GradIEEEnt half decent: The hidden power of imprecise lines

http://tom7.org/grad/murphy2023grad.pdf

Also in video form: https://www.youtube.com/watch?v=Ae9EKCyI1xU

dang · a year ago

GradIEEEnt half decent: The hidden power of imprecise lines [video] - https://news.ycombinator.com/item?id=36806970 - July 2023 (9 comments)

GradIEEEnt half decent - https://news.ycombinator.com/item?id=35780921 - May 2023 (32 comments)

indrora · a year ago

I had hoped that they would reference this in their paper as some kind of "supporting previous exploration" but no, alas.

js8 · a year ago

Haven't read it, but isn't this just logarithmic tables in some form?

I am asking not to dismiss it, I genuinely feel I don't understand logarithms on a fundamental level (of logic gates etc.). If multiplication can be replaced with table lookup and addition, then there has to be a circuit that gives you difficult addition and easy multiplication, or any combination of those tradeoffs.

sabhiram · a year ago

Log space is nice, multiplication can be replaced by addition.

This part is easy and anyone can implement hardware to do this. The tricky bit is always the staying in log space while doing accumulations, especially ones across a large range.

pclmulqdq · a year ago

Yes, this is logarithmic number systems at work.

jenda23 · a year ago

Highly recommended!! Success achieved! Previously I had worked with another well regarded company to attempt recovering an Ethereum presale wallet passphrase that I had forgotten. After 14 months of trying there was no success, so then I looked into ReWallet. They were able to find the password solution in 6 weeks! Since I only remembered a few portions or clues, it seemed like a nearly impossible task. They worked diligently and very professionally. I fully recommend and trust these guys, the result speaks for itself. Contact email, ‎rewalletshieldcoinrecovery@ aol.com or WhatsApp::+1 (757) 332-1885

cpldcpu · a year ago

It puzzles me that there does not seem to be a proper derivation and discussion of the error term in the paper. It's all treated indirectly way inference results.

Lerc · a year ago

The paper has an odd feel about it to me too. Doing a gate estimation as a text explanation without a diagram makes it too easy to miss some required part. It wouldn't need to be a full gate level explanation but blocks labeled 'adder'.

Seeing the name de Vries in the first paragraph didn't help my sense of confidence either.

brcmthrowaway · a year ago

Because of the twisted mentat?

pjc50 · a year ago

"We recommend training and hosting L-Mul-based models on devices integrated with specialized architectural designs. Patent pending"

(from footnote in method section)

CGamesPlay · a year ago

I believe this reduces the compute required, but still uses 8 bits per value, so it does not reduce the memory requirements required to run inference, so it doesn’t particularly make the models more accessible for inference. Is this storage method suitable for training? That could potentially be an interesting application.

Manabu-eo · a year ago

It actually is about 0.5 bits less efficient per weight in terms of precision/range, something the paper never highlights.