Karatsuba Matrix Multiplication and Its Efficient Hardware Implementations

freetonik · 5 months ago

Random anecdote: 15 years ago I was starting an online community for computer science students, and needed to come up with a name. I made a survey for the community members to vote on options, with some boring ones like "CS Hub" and stuff; one of the options was "Karatsuba": I had just learned about the Karatsuba multiplication algorithm, and somehow got this idea into my head that his last name sounds cool and unique, and my online community (later grown into a ed-tech startup) should be named after this Russian mathematician.

Anatoly Karatsuba himself was already dead (he died in 2008). I emailed his daughter Yekaterina (who is also a mathematician, btw) asking for a permission to use their last name. She agreed, but asked to be extra careful about potential implicit affiliations, i.e. to be clear that the content has nothing to do with her father's research.

She also expressed an opinion that in the field of mathematics and computation, at least in Russia, actual researchers are rarely involved in writing textbooks, and the textbooks used in universities often contain conflicting or even wrong information about the authorship of research.

In the end a different name was chosen for the project.

dvasdekis · 5 months ago

Would this work have the potential to speed up encoding/decoding of the PAR2 format[0]? This format is widely used to protect datasets against bitrot and small losses, but is held back because of the significant compute overhead when dealing with large datasets.

[0] https://en.wikipedia.org/wiki/Parchive

gjm11 · 5 months ago

Unless I am badly misunderstanding the paper, this is about doing matrix multiplication on matrices whose entries are large integers. So far as I can tell, PAR2 doesn't do any large-number arithmetic at all, so I don't think this work will have any relevance for implementations of PAR2.

[EDITED to add:] Reading the paper a bit more carefully, the above isn't exactly right. It is about doing matrix multiplication on matrices whose entries are "large" but they're interested in using the idea for hardware design where "large" may mean not-very-large, building (say) 16-bit multiplies out of 8-bit multipliers.

But for software implementations (and I doubt anyone will be making special-case hardware for PAR2) small multiplications are not generally faster than large ones. Maybe you might be able to do more of them in parallel using SIMD, but I bet the extra additions and subtractions and general overhead would outweigh any gains in that setting.

oofbey · 5 months ago

They're proposing "new hardware architectures" to take advantage of this idea. Anybody with a background in GPU floating point math comment on how realistic this is?

emacs28 · 5 months ago

First author here. The hardware architectures are realistic - we developed & evaluated real example hardware implementations for them, validated on FPGA, and they achieved state-of-the-art ResNet performance in a deep learning accelerator system implementation compared to prior accelerators evaluated on similar FPGAs. See the associated accelerator system source code here:

https://github.com/trevorpogue/algebraic-nnhw

The hardware architectures focused on in the paper are systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs. It may be possible to extend the proposed KMM algorithm to other types of hardware architectures also in future work. Regarding floating point - this work is applicable for integer matrix multiplication acceleration, it may be possible to extend the concept to floating point data types in future work also.

almostgotcaught · 5 months ago

> systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs

this is wrong. TPUv4 has tensor cores just like NVIDIA has tensor cores just like AMD has tensor cores. no one uses a systolic array because bandwidth/connectivity is much scarcer than compute. the only people that keep talking about them are academics that don't actually fab/sell chips.

https://cloud.google.com/tpu/docs/v4

https://www.nvidia.com/en-us/data-center/tensor-cores/

https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...

ninja edit: before you gotcha me with "a tensor core is a systolic array!!!" - most tensor cores are actually outerproduct engines not riffle shuffle engines (or whatever you wanna call the topology corresponding to a systolic array).

sifar · 5 months ago

I have used Karatsuba's & Winograd's Inner product [0] algorithm in my work for wide multi-simd integer multipliers and matrix multiplication HW for DSPs. The latter cuts down the MACs by half - n^3/2 instead of n^3. I think the paper talks about it's derivative - FFIP.

The issue is memory bandwidth. These techniques indeed help you save multiplier area however the performance is still bandwidth limited - you'd need to be able to feed more data per cycle to increase performance.

One thing the paper doesn't talk about is energy. For DNN, at the network level the energy consumed by integer macs is not that high. Localizing data computation would have a much more impact on energy reduction than optimizing MACs.

[0] https://ieeexplore.ieee.org/document/1687427

thijson · 5 months ago

On an FPGA integer adders are much more abundant than integer multipliers. So this algorithm definitely helps get more utilization out of the FPGA. Once the multiplier is small enough, say 3 bits by 3 bits, it can fit into several LUT6's.

Deleted Comment

winterismute · 5 months ago

The paper is about integer multiplication, not float

oofbey · 5 months ago

Why would that matter? I understood the point was to speed up matrix multiplication by doing the adds and multiplies in a different order. Shouldn't matter whether the datatype is int, float, complex, whatever.

godsmokescrack · 5 months ago

The basic idea of reducing 4 multiplications to 3 multiplications

(ax + b)(cx + d) = acx^2 + [(a + b)(c + d) - ac - bd]x + bd

holds pretty generally; there isn't any new math or algorithm here that I can see. Their own complexity analysis (eqns. 7 and 8) shows this performs about the same as using Karatsuba multiplication on the entries of the matrices (instead of on the matrices themselves).

arcastroe · 5 months ago

> this performs about the same as using Karatsuba multiplication on the entries of the matrices (instead of on the matrices themselves).

If it offers improvements in both, why wouldn't one do it in both?

ash-ali · 5 months ago

the govy uses specialized hardware that isn't sold on the market right? would something like this be useful in developing said hardware>?

nine_k · 5 months ago

FPGAs are widely available. They are not the cheapest hardware, but they allow you to have something similar to custom silicon, but quickly and in quantities starting from 1.

teleforce · 5 months ago

Modern CPU architecture for example the Versal architecture from AMD has seamless integration with compute accelerators namely FPGA, GPU, DSP, etc [1].

They even has built-in ADC/DAC for communicatiobmn, sensing and data acquisition (DAQ) [2].

On top of that they have native support for dataflow library and framework including linear algebra that is more suitable for data-intensive in which AI is only one of the intended applications [3].

I was recently asked by my colleague why we still need CPU, since GPU is very dominant now and it seems it's all that we need. I just pointed out GPU is only one of the many accelerators available to the CPU, but a very good one that happened to be very useful and biased towards fancy killer applications namely graphic (game), blockchain (bitcoin) and AI (LLM/ChatGPT).

[1] AMD Adaptive SoC Platform: Versal Architecture [pdf]:

https://www.amd.com/content/dam/amd/en/documents/products/ad...

[2] AMD adds RF-sampling data converters to Versal adaptive SoCs (2024):

https://news.ycombinator.com/item?id=42899304

[3] AIEBLAS: Open-Source Expandable BLAS Implementation for AMD/Xilinx Versal Device:

https://github.com/atlarge-research/AIE-BLAS