joennlae (u/joennlae)

joennlae commented on Does MHz Still Matter? ubicloud.com/blog/does-mh... · Posted by u/furkansahin

joennlae · 4 days ago

How can I make sure that each github runner uses exactly one cpu core?

joennlae commented on Llama 3 implemented in pure NumPy docs.likejazz.com/llama3.... · Posted by u/orixilus

joennlae · a year ago

Trainable Llama-like transformer (with backpropagation) in numpy only (~600 lines)

https://github.com/joennlae/tensorli

joennlae commented on Designing a SIMD Algorithm from Scratch mcyoung.xyz/2023/11/27/si... · Posted by u/ingve

uo21tp5hoyg · 2 years ago

Wow I really like that little mini-map on the right...

joennlae · 2 years ago

+1. Does someone know how to do that?

joennlae commented on Show HN: Stella Nera – Maddness Hardware Accelerator github.com/joennlae/halut... · Posted by u/joennlae

joennlae · 2 years ago

Author here: Let me try to give an overview as I saw some questions repeating itself.

* This accelerator is for an Edge/Inference case, so there is no training on this chip.

* We introduce a differentiable form of Maddness, allowing Maddness to be used in e2e training and present an application -> ResNet.

* We are still in the process of understanding how this will translate to transformers.

* The goal was to show that Maddness is feasible with a good codesign of the hardware.

* Compared to other extreme quantisation (BNN/TNN) and pruning schemes, this is more general as it replaces the matmul with an approximate matmul.

* The model architecture is not fixed in hardware. It is „just“ a matmul unit.

I hope this helps :-)

joennlae commented on Show HN: Stella Nera – Maddness Hardware Accelerator github.com/joennlae/halut... · Posted by u/joennlae

fxtentacle · 2 years ago

That's the beauty of their method: If you can replace a 8192x8192 matrix multiplication with a 8192x256 decision tree and then a 256x8192 look up table, your memory requirements go from 67,108,864 down to about 2,162,688 parameters. (I assumed that their decision tree for encoding is perfectly balanced and only uses log(256) parameters per row)

EDIT: And given that this work is centered around energy-efficiency and was sponsored by Huawei, I would guess that LLMs on your phone are precisely the goal here.

EDIT2: The process node that they did their calculations with appears to match Google's TPUv3 which has 0.56 TOPS/W and the paper claims 161 TOPS/W which would be a 280x improvement in energy efficiency over the AI chips in Pixel phones.

joennlae · 2 years ago

Thank you for the feedback :-)

We have to be careful with the comparisons we make. The TPUv3 is a training and datacenter chip and not an Edge/Inference chip. They optimise for a different tradeoff, so while the comparison looks good, it is unfair.

joennlae commented on Show HN: Stella Nera – Maddness Hardware Accelerator github.com/joennlae/halut... · Posted by u/joennlae

fxtentacle · 2 years ago

I am surprised that they do not mention comparing against quantized matrix multiplication because their "encoding" appears to be something like a quantization step with unevenly sized buckets. And then their approximate multiplication step to me looks like multiplying a quantized input vector against a 1-bit quantized matrix.

But overall this is an extremely exciting development because it shows how one could convert a NN into an efficient hardware implementation. And due to them working only on quantized data with LUTs, one can also embed low-dimensional matrices directly into the silicon.

My prediction would be that this will develop in the way that we can soon buy $1 hardware accelerators for things like word embedding, grammar, and general language understanding. And then you need those expensive GPUs only for the last few layers of your LLM, thereby massively reducing deployment costs.

EDIT: Reading the actual paper, I saw that this work is also related to LORA because they convert high-dimensional input vectors to a quantized value based on a lower-dimensional embedding which they call "prototypes". So it's a bit like doing LORA with 1-bit quantization but instead of representing it as 8x 1bit flags you represent it as 1x 8bit integer.

joennlae · 2 years ago

Author here:

Thank you for the feedback :-) A lot of the work regarding the comparison with „simple“ approximate matrix multiplication has been done in the preceding paper: https://arxiv.org/abs/2106.10860

While I share your enthusiasm regarding the potential, we have to be careful about the limiting factors. Our main contributions on the algorithmic side are the reformulation of Maddness such that it is differentiable (autogradable), and we can use it in e2e DNN training, as decision trees are not differentiable.

We are still in the process of understanding how to optimise the training. In the next step, we want to look into transformers as, for now, we only looked into ResNets for easy comparability.

If you are a student at ETH Zurich and want to work on this -> reach out to me

joennlae commented on Show HN: less than 650 LOC trainable GPT only using NumPy github.com/joennlae/tenso... · Posted by u/joennlae

dauertewigkeit · 2 years ago

It's part of the TensorLi definition where all the magic happens.

joennlae · 2 years ago

That is true. I went for a simple implementation of the layer norm and included it in the tensorli definition. But it would have been better to define it as a moduli for clarity.