vladf · 2 years ago
Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below.

> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

> Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)

those numbers would make sense if it was _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into lora? still unexplained) it'd be more like 3-bit.
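For concreteness, here's the arithmetic I'm doing (my own sketch; the 16-bit scale and 8-bit shift per group of 8 are assumptions taken from the HQQ post, not confirmed numbers):

```python
# Effective bits per parameter for 1-bit weights with per-group meta-data.
# Assumed layout: 16-bit scale and 8-bit shift per group of 8 weights.
def effective_bits(weight_bits, group_size, scale_bits, shift_bits):
    return (group_size * weight_bits + scale_bits + shift_bits) / group_size

on_device = effective_bits(1, 8, 16, 8)   # scale + shift stored with weights
no_shift = effective_bits(1, 8, 16, 0)    # shift somehow absorbed into LoRA
print(on_device, no_shift)                # 4.0 3.0
```

So even with the shift inlined away, a group-size of 8 with fp16 scales lands at 3 bits per parameter, not 1.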

From their HF page:

> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I also am amazed they got latency lower than quip if they pingpong to CPU.

mobicham · 2 years ago
Hello, I am the main author, would love to clarify a couple of things:

All the linear-quantization methods have meta-data, including the 1.58bit paper. You can trade quality against memory usage by adjusting the group-size. However, the meta-data is not the same thing as the quantized weights, for several reasons:

> The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.

> The meta-data size doesn't increase the actual compute: these are point-wise operations, and even if you have a single scalar you still need to multiply the same number of weights.

> Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy in the layer before and synchronize, which makes it almost seamless. I did some experiments with CUDA streams that worked very well on an older machine, but then I tried a better machine and the transfer was much faster. Obviously, if you are trying it on Google Colab it's very slow for this reason.

> Smaller models like Llama2-7B are very hard to directly quantize at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral), can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.

> There are many other ways to increase the group-size or even get rid of it altogether; many ideas are available, but they need lots of experimentation.

> Binary/ternary CUDA matmul kernels don't exist yet. The current code implements the dequantization step in CUDA but then uses torch.matmul in fp16. I tried doing matmul at low bits with CUDA, but it is very difficult to even beat cuBLAS at fp16, especially for a novice CUDA coder like me :)

Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community early as we progress. There are still a lot of things to be done and we are actively working on it, despite the very limited resources we have.

Happy to answer any questions here!

vladf · 2 years ago
Thanks for the reply. I’m quite familiar with subchannel quant, but still feel like my questions did not get addressed.

1. Could you post the full memory use of the methods? E.g. you include Quip# metadata in its GB but not HQQ metadata in its GB.

2. If you have to go to CPU to shift and scale, how did you get latency lower than pure on-device? Was this at batch size 1? No speculative decoding?

3. How can LoRA absorb shifts while only increasing rank by 1 if you have a shift per group?

mikeravkine · 2 years ago
Thank you for your efforts on behalf of the GPU poor!

It's getting tougher to use older, cheaper GPUs (Pascal/Maxwell) with modern quantization schemes so anything you can do to keep kernels compatible with SM52 and SM61 would be greatly appreciated.

danielhanchen · 2 years ago
When one does quantization, it's done in blocks. Bitsandbytes uses a blocksize of 64 I think. W * scale + zero_point is needed for each block. So you need 2 fp16 numbers for each group of 64 weights. For BnB you get ~4.5bit, since 64*4bit + 16bit + 16bit = 288 bits, and 288/64 = 4.5. So 4bit is actually 4.5bit.

For HQQ 1bit, a group size of 8 needs a 16-bit scale plus (as you mentioned) an 8-bit shift. So each group costs 8 * 1bit + 16bit + 8bit = 32 bits, i.e. 4 bits per param.

I'm assuming the scale and zero_point are maybe both moved to 8bit, so 8 * 1bit + 8bit + 8bit = 24 bits / 8 = 3 bits per param?

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!
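The packing described above can be sketched in a few lines (my own illustration of the accounting, not the HQQ code): 8 binary weights fit in one byte, with the scale/zero kept separately per group.

```python
import numpy as np

# Pack 8 binary weights per byte; keep scale/zero per group of 8 and
# dequantize on demand as W ~= w_bin * scale + zero.
def pack_binary(w):                 # w: 0/1 array, length divisible by 8
    return np.packbits(w.astype(np.uint8))

def dequant(packed, scale, zero, n):
    bits = np.unpackbits(packed)[:n].astype(np.float32)
    return bits * scale + zero

w = np.array([1, 0, 1, 1, 0, 0, 1, 0])
packed = pack_binary(w)                               # 1 byte for 8 weights
restored = dequant(packed, scale=0.5, zero=-0.25, n=8)
```

Only `packed` needs to sit in GPU memory if the scale/zero live on the CPU, which is where the reported 1-bit footprint comes from.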

Dylan16807 · 2 years ago
> "This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?

* If no, then what's it for and when does it get used?

vladf · 2 years ago
Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair, oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - I know how subchannel quantization works. Are they hitting those reported latency numbers with per-layer CPU ping-pong to rescale?

londons_explore · 2 years ago
I believe the future is 1 bit models - for both training and inference.

When people make custom silicon for 1 bit models, they'll find that it is sooooo much more power and silicon-space efficient to do 1 bit math than 16 bit floating point - like 100x or more.

That extra model size will vastly overshadow any worse performance of the models.

mikewarot · 2 years ago
I believe the future is 4*4 bit look up tables with output latches, with a bit to/from each Cartesian neighbor. Clock them like the colors of a chessboard, in 2 phases, and you don't have to worry about timing dependencies.

All of your code gets converted to a directed acyclic graph (DAG), executing at GHz rates if you want.

Imagine a machine that can output a million parallel GPT-4 streams at 1000 tokens per second each.

If the architecture changes it's just a different DAG. Unlike with FPGAs and their custom blocks that have to be optimally used, you can compile a DAG almost instantly.

Dylan16807 · 2 years ago
1. If you write FPGA code as a grid of lookup tables then I would expect it to be easy to compile instantly.

2. In what way is this "acyclic"?

3. Won't putting your code into this form be the hard part? Even if you start with a DAG, 99.99% of them won't fit this form without intense restructuring. So you just moved the hard step over by one.

smusamashah · 2 years ago
Is this something from a research finding or is it your idea?
wongarsu · 2 years ago
Probably more than 100x for inference. Not only are you drastically reducing the number of bits and replacing float math with integer math, you can do matrix multiplication with only addition (as pointed out in the BitNet b1.58 paper). Additions require a lot less hardware to implement than multiplication. Adding one-bit or two-bit numbers requires barely any hardware at all. A traditional two-bit adder without carry bit is three xor gates and an and gate.
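The multiplication-free matmul mentioned above is easy to demonstrate (my sketch, following the BitNet b1.58 observation): with weights restricted to {-1, 0, 1}, a matrix-vector product reduces to additions and subtractions.

```python
import numpy as np

# Ternary matvec using only adds/subtracts: +1 weights add the input,
# -1 weights subtract it, 0 weights are skipped entirely.
def ternary_matvec(W, x):           # W: (out, in) with entries in {-1,0,1}
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.array([[1, -1, 0], [0, 1, 1]])
x = np.array([2.0, 3.0, 5.0])
y = ternary_matvec(W, x)            # equals W @ x: [-1.0, 8.0]
```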
fasa99 · 2 years ago
To me the most exciting thing is that if it is training that gets sped up on the order of 100x-1000x, a large cluster may be well suited to gradient-descending the hyperparameters by running LLM training again and again at scale -- this is the first foot in the door towards an AI that may iteratively improve itself
cma · 2 years ago
For training how do you get any kind of meaningful derivative with it?
concurrentsquar · 2 years ago
You don't (you have to use real-valued inertial 'latent weights' during training): https://arxiv.org/abs/1906.02107

(There is still a reduction in memory usage, though, just not 24x:

> "Furthermore, Bop reduces the memory requirements during training: it requires only one real-valued variable per weight, while the latent-variable approach with Momentum and Adam require two and three respectively.")

twelfthnight · 2 years ago
Maybe evolutionary algorithms instead? Hasn't proven super useful historically, but maybe at the scale of enormous LLMs it will be?
scotty79 · 2 years ago
Maybe something probabilistic?
chalst · 2 years ago
The OP explicitly excludes training.
api · 2 years ago
Doesn’t training need higher precision to avoid getting stuck at local minima, at least with back propagation style learning?

Maybe something alternate like evolutionary algorithms could work in a domain like this, but so far those have proven to be less efficient.

sp332 · 2 years ago
A recent paper used a single ternary "trit" ~1.5 bits per parameter for training. https://news.ycombinator.com/item?id=39535800 They said it would be difficult to apply this technique to pre-trained models and had to be trained in 1-trit from scratch.
bionhoward · 2 years ago
Isn’t 1bit too low for optimal radix economy (Euler’s number) though?

I want to see somebody try “imbalanced quaternary” -,0,+,2

twelfthnight · 2 years ago
Haven't heard this argument before. But from the Wikipedia article it seems base 3 has the best asymptotic radix economy, but isn't much better than base 2, and base 2 is seemingly easier to program and optimize.

Since this is a new argument I've not heard, would be curious if you had links or some more explanation.

johnmorrison · 2 years ago
people are indeed working on -1,0,1,2 Q2 models, I read something about it the other day but don't recall the title.

they mentioned decomposition of Q2 into ternary+binary i.e. [[1,2],[-1,0]] -> [[1,1],[0,0]] + [[0,1],[-1,0]]
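The decomposition in that example checks out (my own verification, not from the work mentioned): any Q2 weight in {-1, 0, 1, 2} splits into a binary part in {0, 1} plus a ternary part in {-1, 0, 1}.

```python
import numpy as np

# Split Q2 weights into binary + ternary components such that W = b + t.
def decompose_q2(W):
    binary = (W >= 1).astype(W.dtype)   # 1 wherever W is 1 or 2
    ternary = W - binary                # remainder stays in {-1, 0, 1}
    return binary, ternary

W = np.array([[1, 2], [-1, 0]])
b, t = decompose_q2(W)   # b = [[1,1],[0,0]], t = [[0,1],[-1,0]]
```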

Dylan16807 · 2 years ago
I bet the optimal "large" value is bigger than 2.
hervature · 2 years ago
Radix economy is all about which base is the most efficient for representing a given number. It is simple to show that, for large numbers, this comes down to how efficiently a base can represent itself, b/ln(b). Simple calculus shows this is minimized at e (Euler's number), or 3 if restricted to integers (closely followed by 2).
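The numbers behind that claim (my own check):

```python
import math

# Radix economy grows like b / ln(b); over the reals it is minimized at e.
def economy(b):
    return b / math.log(b)

# e ~ 2.718 is the minimum; base 3 (~2.731) narrowly beats base 2 (~2.885).
# Amusingly, base 4 ties base 2 exactly, since 4/ln(4) = 2/ln(2).
for b in (2, 3, 4):
    print(b, round(economy(b), 3))
```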

It sounds like you have something to add but you are already dictating the base by saying "bit". Literally from "binary digit". Anyway, quantization is not about which number system is best - virtually all computer systems we use today represent numbers in base 2. Quantization, at its core, is lossy compression. How do you go from a large model trained to high precision to a smaller model without hindering performance? This can be studied without needing to know the base.

Suppose you are using a decimal computer. You can ask yourself: I have 128-decimal-precision numbers; do I need that much precision? What happens if I just use 1-decimal precision by chopping off the 127 digits after the first? You realize that there are two parts to an operation: the numbers involved (the operands) and the operation itself. You then ask yourself: if I keep one of the operands fixed (the original input), can I represent my 128-decimal-precision neural network simply as a series of operations without the other operand? Perhaps only the most basic ones, like noops (add 0 or multiply by 1), increments (add 1), decrements (subtract 1), negations (multiply by -1), and clears (multiply by 0)? You count those numbers (-1, 0, and 1). There are 3, so you proudly proclaim you've made a neural network that only uses 0.477 dits. People get excited and confused because that is less than 1 dit, which seems like a fundamental limit. You further surprise the scientific field by finding a clever trick for getting rid of negations. You beat your previous record and now you only need 0.301 dits to represent your network.

You are about to accept your Turing award when the ghost of Claude Shannon appears and says: "Why are you using a unit that measures entropy to mean how many symbols you have? If you insist, at least realize 0.301 dits is 1 bit." You are shocked when you realize 10^0.301 = 2^1. Reviewing Shannon's seminal paper [1], you are awestruck by his prescient comment: "Change from the base a to base b merely requires multiplication by log_b(a)." You humbly give your award to Shannon. You keep the $1M, since ghosts aren't as fast as a new NVIDIA DGX. No matter how quantized the ghost is.

[1] - https://people.math.harvard.edu/~ctm/home/text/others/shanno...

imtringued · 2 years ago
The 1 ternary bit models only compress the weights. You still add and subtract using bfloat16 for better accuracy. Dedicated silicon is mostly a waste, because you are only processing two operations per parameter during inference. Loading the parameters from slow DDR, GDDR or HBM memory is the bottleneck in practice and the only solution is PIM. I was honestly disappointed by Nvidia's Blackwell since it is just barely competitive with GDDR PIM.
mobicham · 2 years ago
At least you can copy 16 times more data to the shared memory with binary weights.
programjames · 2 years ago
> I believe the future is 1 bit models - for both training and inference.

1 bit's nothin'. The future is training directly on electronic/photonic circuits.

Deleted Comment

bgnn · 2 years ago
This! Maybe just integer, but not floating point. That's a ridiculous way to do computation when you don't really need the precision.
JHonaker · 2 years ago
> That extra model size will vastly overshadow any worse performance of the models.

...What?

sroussey · 2 years ago
I think OP was referring to parameter size. You can make up for quantization by having more parameters.
mmoskal · 2 years ago
It seems the trick here is they first quantize it to 1- or 2-bit, and then they fine-tune the quantization bias parameters (the parameters that dequantize from 1-2 to 16 bit) via LoRA. Then they have specialized kernels to do matrix multiplication at the bit level.

Also, the 2-bit model seems much better than the 1-bit model - they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of the 1.58b paper (which claims -1 is def. needed).
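A sketch of my reading of that recipe (not the actual HQQ+ code; names and shapes are illustrative): the quantized weights are frozen, the dequantization parameters stay trainable, and a low-rank correction is added on top.

```python
import numpy as np

# Frozen quantized weights + trainable scale/zero + rank-r adapter B @ A.
def forward(x, W_q, scale, zero, A, B):
    W = W_q * scale + zero          # dequantize with trainable scale/zero
    return x @ (W + B @ A).T        # dequantized base + low-rank correction

rng = np.random.default_rng(0)
W_q = rng.integers(0, 2, (4, 6)).astype(float)   # frozen 1-bit weights
x = rng.standard_normal((2, 6))
A = rng.standard_normal((1, 6))                  # rank-1 adapter
B = np.zeros((4, 1))                             # zero-init: no-op at start
y = forward(x, W_q, scale=0.1, zero=-0.05, A=A, B=B)
```

With B initialized to zero the adapter starts as a no-op, so fine-tuning begins from the plain quantized model.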

andy_xor_andrew · 2 years ago
Interesting, and it kinda makes sense. You quantize, which invariably means you lose some precision, but then you can finetune post-quantization to recover at least some of it. Neat idea.
jimmySixDOF · 2 years ago
Which is itself a little counterintuitive as the arxiv papers they cite say models need to be pretrained from the ground up using 1- or 2-bit (or 1.58bit). It definitely adds some interesting data points for the open source community who are experimenting in every possible direction.
WithinReason · 2 years ago
1-bit weights have been a thing since at least 2016:

https://arxiv.org/abs/1606.06160

buildbot · 2 years ago
XNOR-Net was the "first" in that generation, as I recall. It's not a new idea at all though.

Check out the dates on these papers -

https://ieeexplore.ieee.org/abstract/document/286901 < 1994

https://ieeexplore.ieee.org/abstract/document/344783 < 1993

https://link.springer.com/article/10.1007/BF00337115 < 1986

bjornsing · 2 years ago
First version of BNN was submitted to ArXiv a month before XNOR-Net: https://arxiv.org/abs/1602.02830
thechao · 2 years ago
We're just speed-running the NN pendulum, at this point.
elcomet · 2 years ago
WithinReason · 2 years ago
I don't think that paper's methods could be applied to LLMs
rapatel0 · 2 years ago
Don't forget Michael

https://arxiv.org/abs/1511.00363 < 2015

jstmm · 2 years ago
In both the '1-bit Model' and '2-bit Model' tables, the forward time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s, whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ (1-bit) / HQQ+ (1-bit) / Quip# (2-bit) meaning the FP16 model has ~3x lower inference time.

On the contrary, in the BitNet b1.58 paper [0] the authors report their 7B model has 2.9x reduced inference latency.

It's not clear to me what's happening here. Can someone explain why the 1/2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf

londons_explore · 2 years ago
GPUs aren't really designed for 1-bit math... they don't perform it much faster than floating-point math.

Whereas a custom ASIC or an updated GPU design could give massive speedups with 1-bit math.

UncleOxidant · 2 years ago
Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math. Pulling 1 or 2 bits out of a word isn't all that straightforward on CPU or GPU - lots of shifting and masking. I wonder how long it will be before we see custom hardware for bitnets? I suspect we'll see it on FPGAs first.
bee_rider · 2 years ago
For 1 bit math, at least it should be possible to populate every other bit of an integer type, right? Surely one could do better with a dedicated type for this, but at least we could pack 16 single-bit weights into a 32 bit int for addition, right?
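That packing idea can be sketched directly (my own toy illustration): put 16 one-bit weights in the even bit positions of a 32-bit word, so each weight gets a 2-bit lane and a few packed additions can't carry into a neighbouring lane.

```python
# SWAR-style packing: one weight per 2-bit lane of a 32-bit word.
def pack_alternate(bits):                  # 16 values in {0, 1}
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << (2 * i)
    return word

def unpack_lanes(word):
    return [(word >> (2 * i)) & 0b11 for i in range(16)]

a = pack_alternate([1] * 16)
b = pack_alternate([1, 0] * 8)
lanes = unpack_lanes(a + b)   # per-lane sums, no cross-lane carries
```

Each 2-bit lane only holds values up to 3, so you'd have to widen the lanes (or reduce partial sums) after every few packed adds.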
imtringued · 2 years ago
You're telling me GPUs aren't designed for additions and subtractions? Where did you hear that?
shaklee3 · 2 years ago
A100 (> 5yo GPU) has a 1-bit tensor core engine
brucethemoose2 · 2 years ago
Real world GPU performance is hugely influenced by hand optimization of the CUDA kernels.
thatguysaguy · 2 years ago
Sounds like these guys didn't use custom kernels, but BitNet did.
mobicham · 2 years ago
That's correct. Only the dequantization is done in CUDA; the matmul is done with PyTorch. If they make their kernels open-source we could re-use them!
ianbicking · 2 years ago
Reduction to decision tree!

But I'm unclear how it actually runs, and the article talks about the conversion and training but doesn't describe how it runs... I suppose because it's obvious to someone who has followed quantization.

Thinking out loud... if you have a model of just 1 and 0, my first thought is that the outputs are 1's and 0's but I think that's wrong. Instead it's a bunch of floats, and you multiply them by 1 or 0 (in a sense you are sampling the output of the previous layer?), add them up, and put the result through some activation function. And two-bit quantization sounds kind of similar, just with a _little_ scale, going from -1 to 2 instead of 0 to 1.
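That "multiply by 1 or 0" reading, as code (my sketch): a row of binary weights just selects which inputs get summed before the activation.

```python
import numpy as np

# A single neuron with 0/1 weights: a masked sum followed by ReLU.
def binary_neuron(x, w):                  # w: 0/1 weights
    return max(x[w == 1].sum(), 0.0)

x = np.array([0.5, -1.0, 2.0])
binary_neuron(x, np.array([1, 0, 1]))     # sums 0.5 + 2.0 -> 2.5
```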

It seems kind of interesting that you now have a bunch of weights that are exactly 0, meaning you can assert something about what parameters and weights affect what parts of the output. Though in some sense the way they compress the weights down to one bit is also a heuristic you could use to interpret the original model... this just makes it clearer that in totality you are making a defensible simplification, because the end result is still a working model.

It also seems like you could make a lot of mathematical assertions about a one bit model that would be harder to make otherwise. Like maybe you could start thinking of a model as an equation, a _particular_ (though giant) equation, and look at its properties and consider symbolic transformations to that equation.

bick_nyers · 2 years ago
A comment I really liked on a previous post about ternary highlighted that what you are actually measuring with {-1, 0, 1} is inverse correlation, no correlation, and correlation.
fabmilo · 2 years ago
I like the decision tree analogy
grungegun · 2 years ago
Does anyone know if this works on vanilla deep networks? These quantization articles always seem to target LLM's which leads me to wonder if there's something special about the LLM architecture vs a vanilla deep architecture.
zaptrem · 2 years ago
Transformer LLMs are just a bunch of MLPs (linear layers) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization but those are a tiny minority of the parameters)

grungegun · 2 years ago
So there's no performance gain for quantization enabled by the transformer architecture? It seems very strange that quantization works so well, since in most of my experiments the internal weights of MLPs look random.
amelius · 2 years ago
Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?
alephxyz · 2 years ago
LLMs have been trending towards obscenely large number of parameters (314B for grok), which makes quantization crucial if you want to run them without a Meta-sized budget.
Y_Y · 2 years ago
Certainly does, people have been doing this in computer vision for years.

Deleted Comment

kromem · 2 years ago
The most exciting part about ternary or binary weights is the inevitable hardware revolution for AI dedicated chips that's going to result from it.
imtringued · 2 years ago
Your hardware already supports addition and subtraction and the tensor cores of NVIDIA GPUs are already fast enough to keep up. The only benefit is reducing memory capacity and bandwidth requirements.
Matumio · 2 years ago
You mean we'll have hardware accelerated ternary instructions? https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

(Okay probably those are not ready to be used as NN weights if the activations are not binary too, but... the gap to what CPUs already can do is getting smaller.)