Show HN: Matrix Multiplication with Half the Multiplications

This looks pretty cool! What's the catch? e.g. why isn't this already implemented in accelerators, is it really just a forgotten algorithm, or this has some implications on the cost of building the accelerator or else?

lupire · a year ago

It's not just a software algorithm. It's a hardware architecture optimization. To benefit, you have to build hardware that matches the dimensions of the algorithm. That's an expensive commitment.

emacs28 · a year ago

> you have to build hardware that matches the dimensions of the algorithm

Yes the benefits are realized in custom hardware designs as opposed to software, however, the hardware architectures work for multiplying matrices of arbitrary dimensions by splitting up larger matrices into smaller tiles, then summing up the tile products to form the final larger matrix products (i.e. GEMM)

SJC_Hacker · a year ago

Not so much in FPGA ... although I'm not sure top end FPGAs would beat Nvidia TPUs even with this algorithm, and even if cost were not a consideration.

emacs28 · a year ago

IMHO, for fixed-point MM accelerators, there is no catch, I think it's an overlooked algorithm. It's based on an algorithm by Winograd who coincidentally also proposed another unrelated algorithm that later became very popular for CNN acceleration which would take some visibility away from this other algorithm by Winograd... But that is speculative

yorwba · a year ago

On the other hand, if you tried it with floating point, you'd lose significant digits. Since the approach is to sum (a[i] + b[i+1])(a[i+1] + b[i]) and subtract the sums of a[i]a[i+1] and b[i]b[i+1] in the end to get a[i]b[i] + a[i+1]b[i+1], you may be taking the difference of two large values to get a small value, losing precision.

OJFord · a year ago

LLM hype and this submission in particular keep making me think of a lecturer I had for Topics in Large Dimensional Data Processing, circa 2016: as I recall he was enthusiastically adamant that the most important thing, breakthroughs etc., in years/decades to come was going to be faster matrix operations. Anyway, I'm pretty sure I recognise FIP (not FFIP of course) from that course.

I wish I could remember his name, I believe he left academia after my year and went to work in industry, I'd just be curious to see what he's up to now. I'm not saying it was a particularly novel or prescient comment/attitude, we may not have had quite such ML hype but certainly 'big data' was all the rage at the time, it's just something that's stuck in my mind. One of those areas I always meant to study more, just realistically probably never had the mathematical chops for and certainly those I did have atrophied.

pclmulqdq · a year ago

There are a lot of matrix multiplication algorithms out there with a lot of pluses and minuses. It's always a balance of accuracy, runtime, and scaling. This one probably has bad accuracy in floating point.

emacs28 · a year ago

For everyone discussing the reduced accuracy/numerical stability of the algorithms in floating-point, this is true. But note that the application of the algorithms in the work is explored for fixed-point MM/quantized integer NN inference, not floating-point MM/inference. Hence, there is no reduction in accuracy for that application of it compared to using conventional fixed-point MM.

abetusk · a year ago

I don't know why this answer is getting downvoted. This is absolutely correct.

W. Miller has a paper discussing, under conditions of numerical stability, O(n^3) multiplications is necessary [0]. Any algorithm that gets sub cubic runtime for matrix multiplication, like Strassen's or Coppersmith's, must sacrifice some amount of precision or stability.

[0] https://epubs.siam.org/doi/10.1137/0204009

waynecochran · a year ago

The document said it outputs the exact same values as the conventional method. There is no accuracy trade off here.

pbsd · a year ago

It's not quite forgotten. It kind of lives on in the pseudo-dot product Wegman-Carter authenticators like UMAC. See Section 3 of [1] for context.

[1] https://cr.yp.to/antiforgery/pema-20071022.pdf

adastra22 · a year ago

I’ve only glanced at it so someone correct me if I’m wrong, but IIUC this is not a replacement for matrix multiplication but rather an approximation that only gives decent-ish results for the types of linear systems you see in AI/ML. But for that use case it is totally fine?

emacs28 · a year ago

It produces identical/bit-equivalent results as conventional/naive matrix multiplication for integer/fixed-point data types

mariocesar · a year ago

Perhaps it's less of a hidden gem and more of a spotlight moment.

This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the Big O run time on this? Is this shifting the known best bounds?

And the diagrams are chaotic and don't really explain anything about why this approach is fast or good. The result is that I'm reluctant to even click-through to the PDF.

If you want to improve the project credibility please consider being honest and open about what is actually going on and giving some clear explanations and illustrations, rather than things that may as well be designed to hype people too busy to tell you that you are cranks. It's hard to tell if this is incredibly groundbreaking or just but nothingburger. Sadly I almost feel like that must be an intentional decision motivated by poor merits of work and a desire to exploit AI height. The alternative - which I prefer to believe is the case - is that the author simply needs to revise and better contextualize.

Someone · a year ago

> What is the Big O run time on this?

The claim is they’re dropping half the multiplications, so it isn’t doing anything for Big O.

> If you want to improve the project credibility please consider being honest and open about what is actually going on and giving some clear explanations and illustrations,

The math explaining how to halve the number of multiplications in the paper (https://arxiv.org/abs/2311.12224) isn’t hard to understand.

You only have to read formulas 2 (traditional matrix multiplication) and 3 to 6.

I think it’s clear it does do what’s being advertised, halving the number of multiplications at the cost of lots of extra additions/subtractions.

They then go on to better vectorize that algorithm. That, as is usual for that, gets looking messy soon.

My main concern would be numerical stability.

emacs28 · a year ago

Thanks, good summary. Regarding numerical stability, the application is for fixed-point arithmetic, and therefore numerical stability is not an issue (the result is identical compared to using the conventional inner-product)

stephencanon · a year ago

The readme doesn't explain much, but the introduction to the paper itself is quite approachable.

As for whether it's groundbreaking or not ... it's a neat readily-accessible constant-factor improvement for area-constrained fixed-point accelerators. That doesn't change everything overnight, but neither is it nothing. It's nice work.

Deleted Comment

eigenket · a year ago

> This readme does a really poor job of explaining what the improvement is or how they drop half the multiplications. What is the Big O run time on this? Is this shifting the known best bounds?

Without wishing to sound elitist I, I don't understand the point of this comment at all. If you don't understand Big O notation enough to know that "half the multiplications" doesn't change it then why are you even asking about it?

barfbagginus · a year ago

I made two really bad misunderstandings which I completely own. I can't seem to edit it, so please let this be my refutation and explanation of my own mistake.

First misunderstanding: I assumed this was a new large matrix multiplication algorithm building on the hype from last week or so where we saw this paper: https://arxiv.org/abs/2210.10173

It is not an algorithm, but a hardware design - a systolic array using roughly half of the silicon area of a baseline design.

2. Assuming that we were talking about an algorithm, I then further assumed that it reduce an algorithm's multiplications by half for some important n. I assumed that it did this by accelerating some critical sub procedure in the baseline algorithm into a more efficient big O class without really changing the multiplicative factor. This is a common way to reduce the number of operations of an algorithm for some fixed n. As a consequence I thought that the author must be being sloppy by not telling us the full big o details of the improvement, and just picking some n where it just so happened that half of the multiplications vanish. That also seemed unlikely to be a consequence of a improvement in the bound of matrix multiplication, given how incredibly slow the progress on matrix multiplication bound has been. So I thought that the author might even be a crank.

But it turned out the sloppy thinking was on me. I was being the crank.

Reading the paper's introduction made it very very clear that we were dealing with a systolic array that reduces silicon area per compute.

Even worse that information is there in the first sentence of the readme as well.

Would a clearer sentence have helped me? Something like:

"We introduce a VHDL hardware design for a systolic array that nearly halves the silicon area of a baseline array, by replacing half of the multiplication-accumulate units (MACs) with simple adder units, exploiting Winograd's 1967 fast inner product formula (FIP)."

I'm honestly not sure, given how bad my mistake was to begin with. Not even the diagrams tipped me off - in hind site they are very obviously hardware block diagrams, but I thought that they were just needlessly complicated algorithmic diagrams! How silly!

I still believe that the readme could be simplified for a general audience of goobers like me. But first and foremost I have to admit that I was being a goober!

Does that help you understand my mistake here? I do understand Big O and why cutting operations by half is typically a constant factor improvement. But apparently I don't understand it well enough to prevent me from retconning a narrative with some very stinky assumptions and then projecting them on to the poor innocent hardware designer. Not very proud about that.

mariocesar · a year ago

It´s actually fairly clear

hackyhacky · a year ago

Not to everyone. If it's clear to you, you could helpfully explain it.

Deleted Comment

halflings · a year ago

ixaxaar · a year ago

Man I remembered something similar I had tried working on in 2018, but gave up after all my PhD applications got rejected.

https://github.com/ixaxaar/pytorch-dni

The concept here goes a bit further and tries to replicate backprop with an external network, arguing that that's probably what the brain actually does.

I'm not seeing the connection. This work is about low-level optimization of matrix multiplication. The repo you linked seems to be about replacing back-propagated gradients with a cheaper estimate. What's the similarity you see between these two?

Correct, I think I mistook it as "use a small neural net to approximate matrix multiplication" instead it seems as "use cheaper replacements of matrix mul without much acc loss".

Wellll that means I can give dni another try :D

jebarker · a year ago

This feels like a "no free lunch" situation. I would imagine that any time saving in approximating the gradients this way would be lost to needing to train for more iterations due to the loss in gradient accuracy. Is that not the case?

I think that's the reason for its dead end.

However if this is really the biological analogue of credit assignment, this might scale better than training llms from scratch every time. Even if say it could approx gradients to a certain degree given a new network, normal backprop could further tune for a few epochs or so dramatically reducing overall training costs.

rollingtide · a year ago

Unrelated to the technical discussion but I was wondering what you made that architecture gif with? Looks neat!

I think that image is from the paper and was not created by me. Looks cool indeed!

michelpp · a year ago

This is very cool and a real interesting read! For those in the comments confused about how this is better, the paper is talking about synthesizing matrix multiplication pipelines in hardware, like an FPGA or ASIC. On a CPU or GPU you won't notice because adds and multiplications take the same amount of time generally, but multiplication units takes up many more transistors, so if you can reduce the circuit complexity you can increase the speed and parallel throughput and reduce power and routing complexity. This approach could be particularly useful for efficient sparse matrix multiplication accelerators.

Another cool way to eliminate multiplication in matrix multiplication is to use different semirings [1]. The Tropical Semiring [2] for example substitutes addition for multiplication and min (or max) for addition. It's still matrix multiplication but with substituted binary operations. The research in this relatively new field of Tropical Algebra [3] is quite active and rich right now, being used for all kinds of optimization problems and in research for optimizing neural networks [4] . This approach also lends itself to hardware synthesis since most FPGA configurable logical blocks can add/min/max in one clock cycle, whereas efficient multiplication requires fixed dedicated on-chip hardware multipliers.

Another way to efficiently remove multiplications with a different but related semiring is to use a Log Semiring [5]. If you have to multiply chains of probabilities (like Markov chains) then the numbers quickly become very small and floating point loses its accuracy to represent the numbers. By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).

[1] https://en.wikipedia.org/wiki/Semiring

[2] https://en.wikipedia.org/wiki/Tropical_semiring

[3] https://en.wikipedia.org/wiki/Tropical_geometry

[4] https://proceedings.mlr.press/v80/zhang18i/zhang18i.pdf

[5] https://en.wikipedia.org/wiki/Log_semiring

btown · a year ago

The paper in [4] is absolutely fascinating - I'm very much a neophyte here, but I believe it shows that practically any ReLU network can be represented as a tropical ratio of two tropical polynomials, and thus can be analyzed with geometric principles including visualizations of surfaces. It's been cited in more recent work: https://scholar.google.com/scholar?cites=1003719112553620451... - does anyone know if there's been any significant advancements here?

gatane · a year ago

Whoah, this is what Unified Algebra is all about!

http://www.cs.toronto.edu/~hehner/UA.pdf

nextaccountic · a year ago

So what's the difference between this unified algebra, and universal algebra[0]?

[0] https://en.wikipedia.org/wiki/Universal_algebra

buybackoff · a year ago

This is what HN is about :)

I understood a fraction but instantly wanted to dive into the topic just after reading such a knowledgeable comment.

jhj · a year ago

> By scaling the numbers first by taking the log, multiplication becomes addition and addition becomes x + log1p(exp(y - x)).

Addition/subtraction in a logarithmic number system is way more expensive than what you would spend on multiplication, especially if you care about correctly rounded results, as the (hardware) LUTs required are rather big.

pk-protect-ai · a year ago

Isn't this the same approach as in GF(2^x), which has been in use for decades? The only limitation that comes to mind is the field size.

Finite field log/antilog lookup tables are used for efficient-ish multiplication, similar to addition/subtraction tables used for logarithmic number systems.

bearzoo · a year ago

somewhat related is the number theoretic transform

https://ieeexplore.ieee.org/abstract/document/1451721

Drakim · a year ago

I'm surprised this actually works, usually detecting whether to use multiplication or addition is slower than simply using multiplication. Especially if it's massive amounts of work being done in parallel.

1letterunixname · a year ago

Wonder how well it compares to openblas and cublas.

skykooler · a year ago

I find it fascinating that this is using a process invented in 1968 and hasn't been used for this purpose until now!

Hey, nobody knew what to do with GF(2^x) up until mid last century either... Oh wait, CS was not really a thing almost up until mid last century...

Lucasoato · a year ago

If you're interested in the mathematical theory behind sub-cubic algorithms for matrix multiplications, you can start from here: https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...

I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.

(Now proven for for 2+j = w = 2.3728596, or j > 0.3728596)

gizmo686 · a year ago

> I conjecture that for every j > 0 in R, a number n exists so that any two n x n matrices can be multiplied together in O(n^(2+j)) steps.

Is this stated correctly? Because it seems almost meaningless as stated. You start with "for every j, there exists an n such that...". That would mean that for the rest of the statement, n and j are constant. So you are just saying that you can multiply constant sized matrices in constant time. Technically true, but I feel like you are trying to claim something stronger.

JohnKemeny · a year ago

It should simply say:

for any j>0 there exists an algorithm multiplying nxn matrices in time O(n^{2+j}).

bee_rider · a year ago

It does seem to be harder to make progress over time though. Maybe it’ll bottom out at j=1/e. I won’t call my guess even a conjecture though, it just happens to be a convenient constant near where the current value is. It would be a funny prank for math to pull on us.

abeppu · a year ago

Predicting that this holds for any j > 0 seems rather bold. Would you care to share your intuition why you think that's the case?

roflmaostc · a year ago

Two matrices with size NxN each can be multiplied naively with the schoolbook algorithm in O(N^3).

It's clear that the algorithm needs at least O(N^2) because to access each element of the matrices once, you need a double for loop, which is O(N^2).

  for i in rows
      for j in cols
          # do something with element matrix1 [i, j], matrix2[i, j],...

so it has to be j >= 0