vladf · 2 years ago
Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below.

> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

> Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)

those numbers would make sense if it was _actually_ 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into lora? still unexplained) it'd be more like 3-bit.
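For concreteness, here's the arithmetic I'm doing (my own sketch; the 16-bit scale and 8-bit shift per group of 8 are assumptions taken from the HQQ post, not confirmed numbers):

```python
# Effective bits per parameter for 1-bit weights with per-group meta-data.
# Assumed layout: 16-bit scale and 8-bit shift per group of 8 weights.
def effective_bits(weight_bits, group_size, scale_bits, shift_bits):
    return (group_size * weight_bits + scale_bits + shift_bits) / group_size

on_device = effective_bits(1, 8, 16, 8)   # scale + shift stored with weights
no_shift = effective_bits(1, 8, 16, 0)    # shift somehow absorbed into LoRA
print(on_device, no_shift)                # 4.0 3.0
```

So even with the shift inlined away, a group-size of 8 with fp16 scales lands at 3 bits per parameter, not 1.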

From their HF page:

> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I also am amazed they got latency lower than quip if they pingpong to CPU.

mobicham · 2 years ago
Hello, I am the main author, would love to clarify a couple of things:

All the linear-quantization methods have meta-data, including the 1.58bit paper. You can trade quality against memory usage by adjusting the group-size. However, the meta-data is not the same thing as the quantized weights, for several reasons:

> The meta-data size doesn't change the fact that you can do binary/ternary matmul, which is the most important thing in this story.

> The meta-data size doesn't increase the actual compute: these are point-wise operations, and even if you have a single scalar you still need to multiply the same number of weights.

> Meta-data is offloaded to the CPU with pinned memory, which allows non-blocking transfers. Technically, you can trigger the copy in the layer before and synchronize, which makes it almost seamless. I did some experiments with CUDA streams that worked very well on an older machine, but then I tried a better machine and the transfer was much faster. Obviously, if you are trying it on Google Colab it's very slow for this reason.

> Smaller models like Llama2-7B are very hard to directly quantize at very low bits, so they need a lower group-size to function well. Larger models (like what we showed for Mixtral), can be quantized to 2-bit on the fly, without any data, and still work very well. So basically larger models are less sensitive to extreme quantization and you could use a much larger group-size. I still think that the meta-data size is really not a big deal for the reasons I have explained above.

> There are many other ways to increase the group-size or even get rid of it altogether; many ideas are available, but they need lots of experimentation.

> Binary/ternary CUDA matmul kernels don't exist yet. The current code implements the dequantization step in CUDA but then uses torch.matmul in fp16. I tried doing matmul at low bits with CUDA, but it is very difficult to even beat cuBLAS at fp16, especially for a novice CUDA coder like me :)

Please note: this is early experimental work. Since it showed promising results, we wanted to share it with the community early as we progress. There are still a lot of things to be done and we are actively working on it, despite the very limited resources we have.

Happy to answer any questions here!

vladf · 2 years ago
Thanks for the reply. I’m quite familiar with subchannel quant, but still feel like my questions did not get addressed.

1. Could you post the full memory use of the methods? E.g. you include Quip# metadata in its GB but not HQQ metadata in its GB.

2. If you have to go to CPU to shift and scale, how did you get latency lower than pure on-device? Was this at batch size 1? No speculative decoding?

3. How can LoRA absorb shifts while only increasing rank by 1 if you have a shift per group?

mikeravkine · 2 years ago
Thank you for your efforts on behalf of the GPU poor!

It's getting tougher to use older, cheaper GPUs (Pascal/Maxwell) with modern quantization schemes so anything you can do to keep kernels compatible with SM52 and SM61 would be greatly appreciated.

danielhanchen · 2 years ago
When one does quantization, it's done in blocks. Bitsandbytes uses a blocksize of 64 I think. W * scale + zero_point is needed for each block. So you need 2 fp16 numbers for each group of 64 weights. For BnB you get ~4.5bit, since 64*4bit + 16bit + 16bit = 288 bits, and 288/64 = 4.5. So 4bit is actually 4.5bit.

For HQQ 1bit, a group size of 8 needs a 16-bit scale plus (as you mentioned) an 8-bit shift. So each group costs 8 * 1bit + 16bit + 8bit = 32 bits, i.e. 4 bits per param.

I'm assuming the scale and zero_point are maybe both moved to 8bit, so 8 * 1bit + 8bit + 8bit = 24 bits / 8 = 3 bits per param?

"This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!
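The packing described above can be sketched in a few lines (my own illustration of the accounting, not the HQQ code): 8 binary weights fit in one byte, with the scale/zero kept separately per group.

```python
import numpy as np

# Pack 8 binary weights per byte; keep scale/zero per group of 8 and
# dequantize on demand as W ~= w_bin * scale + zero.
def pack_binary(w):                 # w: 0/1 array, length divisible by 8
    return np.packbits(w.astype(np.uint8))

def dequant(packed, scale, zero, n):
    bits = np.unpackbits(packed)[:n].astype(np.float32)
    return bits * scale + zero

w = np.array([1, 0, 1, 1, 0, 0, 1, 0])
packed = pack_binary(w)                               # 1 byte for 8 weights
restored = dequant(packed, scale=0.5, zero=-0.25, n=8)
```

Only `packed` needs to sit in GPU memory if the scale/zero live on the CPU, which is where the reported 1-bit footprint comes from.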

Dylan16807 · 2 years ago
> "This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.", so the 8+8 scale / zero_point moves to the CPU. So GPU memory 1bit, but CPU meta data is the rest - very smart!

Doesn't it need all the weight metadata for a layer to use that layer?

* If yes, then can't any algorithm offload x% of its data as a balancing act between speed and RAM?

* If no, then what's it for and when does it get used?

vladf · 2 years ago
Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair, oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - I know how subchannel quantization works. Are they hitting those reported latency numbers with per-layer CPU ping-pong to rescale?

londons_explore · 2 years ago
I believe the future is 1 bit models - for both training and inference.

When people make custom silicon for 1 bit models, they'll find that it is sooooo much more power and silicon-space efficient to do 1 bit math than 16 bit floating point - like 100x or more.

That extra model size will vastly overshadow any worse performance of the models.

mikewarot · 2 years ago
I believe the future is 4*4 bit look up tables with output latches, with a bit to/from each Cartesian neighbor. Clock them like the colors of a chessboard, in 2 phases, and you don't have to worry about timing dependencies.

All of your code gets converted to a directed acyclic graph (DAG), executing at GHz rates if you want.

Imagine a machine that can output a million parallel GPT-4 streams at 1000 tokens per second each.

If the architecture changes it's just a different DAG. Unlike with FPGAs and their custom blocks that have to be optimally used, you can compile a DAG almost instantly.

Dylan16807 · 2 years ago
1. If you write FPGA code as a grid of lookup tables then I would expect it to be easy to compile instantly.

2. In what way is this "acyclic"?

3. Won't putting your code into this form be the hard part? Even if you start with a DAG, 99.99% of them won't fit this form without intense restructuring. So you just moved the hard step over by one.

smusamashah · 2 years ago
Is this something from a research finding or is it your idea?
wongarsu · 2 years ago
Probably more than 100x for inference. Not only are you drastically reducing the number of bits and replacing float math with integer math, you can do matrix multiplication with only addition (as pointed out in the BitNet b1.58 paper). Additions require a lot less hardware to implement than multiplication. Adding one-bit or two-bit numbers requires barely any hardware at all. A traditional two-bit adder without carry bit is three xor gates and an and gate.
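The multiplication-free matmul mentioned above is easy to demonstrate (my sketch, following the BitNet b1.58 observation): with weights restricted to {-1, 0, 1}, a matrix-vector product reduces to additions and subtractions.

```python
import numpy as np

# Ternary matvec using only adds/subtracts: +1 weights add the input,
# -1 weights subtract it, 0 weights are skipped entirely.
def ternary_matvec(W, x):           # W: (out, in) with entries in {-1,0,1}
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.array([[1, -1, 0], [0, 1, 1]])
x = np.array([2.0, 3.0, 5.0])
y = ternary_matvec(W, x)            # equals W @ x: [-1.0, 8.0]
```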
fasa99 · 2 years ago
To me the most exciting thing is that if it is training that gets sped up on the order of 100x-1000x, a large cluster may be well suited to gradient-descending the hyperparameters by running LLM training again and again at scale -- this is the first foot in the door towards an AI that may iteratively improve itself
cma · 2 years ago
For training how do you get any kind of meaningful derivative with it?
concurrentsquar · 2 years ago
You don't (you have to use real-valued inertial 'latent weights' during training): https://arxiv.org/abs/1906.02107

(There is still a reduction in memory usage, though, just not 24x:

> "Furthermore, Bop reduces the memory requirements during training: it requires only one real-valued variable per weight, while the latent-variable approach with Momentum and Adam require two and three respectively.")

twelfthnight · 2 years ago
Maybe evolutionary algorithms instead? Hasn't proven super useful historically, but maybe at the scale of enormous LLMs it will be?
scotty79 · 2 years ago
Maybe something probabilistic?
chalst · 2 years ago
The OP explicitly excludes training.
api · 2 years ago
Doesn’t training need higher precision to avoid getting stuck at local minima, at least with back propagation style learning?

Maybe something alternate like evolutionary algorithms could work in a domain like this, but so far those have proven to be less efficient.

sp332 · 2 years ago
A recent paper used a single ternary "trit" ~1.5 bits per parameter for training. https://news.ycombinator.com/item?id=39535800 They said it would be difficult to apply this technique to pre-trained models and had to be trained in 1-trit from scratch.
bionhoward · 2 years ago
Isn’t 1bit too low for optimal radix economy (Euler’s number) though?

I want to see somebody try “imbalanced quaternary” -,0,+,2

twelfthnight · 2 years ago
Haven't heard this argument before. But from the Wikipedia article it seems base 3 has the best asymptotic radix economy, but isn't much better than base 2, and base 2 is seemingly easier to program and optimize.

Since this is a new argument I've not heard, would be curious if you had links or some more explanation.

johnmorrison · 2 years ago
people are indeed working on -1,0,1,2 Q2 models, I read something about it the other day but don't recall the title.

they mentioned decomposition of Q2 into ternary+binary i.e. [[1,2],[-1,0]] -> [[1,1],[0,0]] + [[0,1],[-1,0]]
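The decomposition in that example checks out (my own verification, not from the work mentioned): any Q2 weight in {-1, 0, 1, 2} splits into a binary part in {0, 1} plus a ternary part in {-1, 0, 1}.

```python
import numpy as np

# Split Q2 weights into binary + ternary components such that W = b + t.
def decompose_q2(W):
    binary = (W >= 1).astype(W.dtype)   # 1 wherever W is 1 or 2
    ternary = W - binary                # remainder stays in {-1, 0, 1}
    return binary, ternary

W = np.array([[1, 2], [-1, 0]])
b, t = decompose_q2(W)   # b = [[1,1],[0,0]], t = [[0,1],[-1,0]]
```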

Dylan16807 · 2 years ago
I bet the optimal "large" value is bigger than 2.
hervature · 2 years ago
Radix economy is all about which base is the most efficient for representing a given number. It is simple to show that, for large numbers, this comes down to how efficiently a base can represent itself, b/ln(b). Simple calculus shows this is minimized at e (Euler's number), or 3 if restricted to integers (closely followed by 2).
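The numbers behind that claim (my own check):

```python
import math

# Radix economy grows like b / ln(b); over the reals it is minimized at e.
def economy(b):
    return b / math.log(b)

# e ~ 2.718 is the minimum; base 3 (~2.731) narrowly beats base 2 (~2.885).
# Amusingly, base 4 ties base 2 exactly, since 4/ln(4) = 2/ln(2).
for b in (2, 3, 4):
    print(b, round(economy(b), 3))
```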

It sounds like you have something to add but you are already dictating the base by saying "bit". Literally from "binary digit". Anyway, quantization is not about which number system is best - virtually all computer systems we use today represent numbers in base 2. Quantization, at its core, is lossy compression. How do you go from a large model trained to high precision to a smaller model without hindering performance? This can be studied without needing to know the base.

Suppose you are using a decimal computer. You can ask yourself: I have 128-decimal-precision numbers; do I need that much precision? What happens if I just use 1-decimal precision by chopping off the 127 digits after the first? You realize that there are two parts to an operation: the numbers involved (the operands) and the operation itself. You then ask yourself: if I keep one of the operands fixed (the original input), can I represent my 128-decimal-precision neural network simply as a series of operations without the other operand? Perhaps only the most basic ones, like noops (add 0 or multiply by 1), increments (add 1), decrements (subtract 1), negations (multiply by -1), and clears (multiply by 0)? You count those numbers (-1, 0, and 1). There are 3, so you proudly proclaim you've made a neural network that only uses 0.477 dits. People get excited and confused because that is less than 1 dit, which seems like a fundamental limit. You further surprise the scientific field by finding a clever trick for getting rid of negations. You beat your previous record and now you only need 0.301 dits to represent your network.

You are about to accept your Turing award when the ghost of Claude Shannon appears and says: "Why are you using a unit that measures entropy to mean how many symbols you have? If you insist, at least realize 0.301 dits is 1 bit." You are shocked when you realize 10^0.301 = 2^1. Reviewing Shannon's seminal paper [1], you are awestruck by his prescient comment: "Change from the base a to base b merely requires multiplication by log_b(a)." You humbly give your award to Shannon. You keep the $1M, since ghosts aren't as fast as a new NVIDIA DGX. No matter how quantized the ghost is.

[1] - https://people.math.harvard.edu/~ctm/home/text/others/shanno...

imtringued · 2 years ago
The 1 ternary bit models only compress the weights. You still add and subtract using bfloat16 for better accuracy. Dedicated silicon is mostly a waste, because you are only processing two operations per parameter during inference. Loading the parameters from slow DDR, GDDR or HBM memory is the bottleneck in practice and the only solution is PIM. I was honestly disappointed by Nvidia's Blackwell since it is just barely competitive with GDDR PIM.
mobicham · 2 years ago
At least you can copy 16 times more data to the shared memory with binary weights.
programjames · 2 years ago
> I believe the future is 1 bit models - for both training and inference.

1 bit's nothin'. The future is training directly on electronic/photonic circuits.

Deleted Comment

bgnn · 2 years ago
This! Maybe just integer, but not floating point. That's a ridiculous way to do computation when you don't really need the precision.
JHonaker · 2 years ago
> That extra model size will vastly overshadow any worse performance of the models.

...What?

sroussey · 2 years ago
I think OP was referring to parameter size. You can make up for quantization by having more parameters.
mmoskal · 2 years ago
It seems the trick here is they first quantize it to 1- or 2-bit, and then they fine-tune the quantization bias parameters (the parameters that dequantize from 1-2 to 16 bit) via LoRA. Then they have specialized kernels to do matrix multiplication at the bit level.

Also, the 2-bit model seems much better than the 1-bit model - they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of the 1.58b paper (which claims -1 is def. needed).
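A sketch of my reading of that recipe (not the actual HQQ+ code; names and shapes are illustrative): the quantized weights are frozen, the dequantization parameters stay trainable, and a low-rank correction is added on top.

```python
import numpy as np

# Frozen quantized weights + trainable scale/zero + rank-r adapter B @ A.
def forward(x, W_q, scale, zero, A, B):
    W = W_q * scale + zero          # dequantize with trainable scale/zero
    return x @ (W + B @ A).T        # dequantized base + low-rank correction

rng = np.random.default_rng(0)
W_q = rng.integers(0, 2, (4, 6)).astype(float)   # frozen 1-bit weights
x = rng.standard_normal((2, 6))
A = rng.standard_normal((1, 6))                  # rank-1 adapter
B = np.zeros((4, 1))                             # zero-init: no-op at start
y = forward(x, W_q, scale=0.1, zero=-0.05, A=A, B=B)
```

With B initialized to zero the adapter starts as a no-op, so fine-tuning begins from the plain quantized model.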

andy_xor_andrew · 2 years ago
Interesting, and it kinda makes sense. You quantize, which invariably means you lose some precision, but then you can finetune post-quantization to recover at least some of it. Neat idea.
jimmySixDOF · 2 years ago
Which is itself a little counterintuitive as the arxiv papers they cite say models need to be pretrained from the ground up using 1- or 2-bit (or 1.58bit). It definitely adds some interesting data points for the open source community who are experimenting in every possible direction.
WithinReason · 2 years ago
1-bit weights have been a thing since at least 2016:

https://arxiv.org/abs/1606.06160

buildbot · 2 years ago
XNOR-Net was the "first" in that generation, as I recall. It's not a new idea at all though.

Check out the dates on these papers -

https://ieeexplore.ieee.org/abstract/document/286901 < 1994

https://ieeexplore.ieee.org/abstract/document/344783 < 1993

https://link.springer.com/article/10.1007/BF00337115 < 1986

bjornsing · 2 years ago
First version of BNN was submitted to ArXiv a month before XNOR-Net: https://arxiv.org/abs/1602.02830
thechao · 2 years ago
We're just speed-running the NN pendulum, at this point.
elcomet · 2 years ago
WithinReason · 2 years ago
I don't think that paper's methods could be applied to LLMs
rapatel0 · 2 years ago
Don't forget Michael

https://arxiv.org/abs/1511.00363 < 2015

jstmm · 2 years ago
In both the '1-bit Model' and '2-bit Model' tables, the forward time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s, whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ (1-bit) / HQQ+ (1-bit) / Quip# (2-bit) meaning the FP16 model has ~3x lower inference time.

On the contrary, in the BitNet b1.58 paper [0] the authors report their 7B model has 2.9x reduced inference latency.

It's not clear to me what's happening here. Can someone explain why the 1/2-bit HQQ/HQQ+ models are so much slower than the BitNet b1.58 models?

[0] https://arxiv.org/pdf/2402.17764.pdf

londons_explore · 2 years ago
GPUs aren't really designed for 1-bit math... they don't perform it much faster than floating-point math.

Whereas a custom ASIC or an updated GPU design could give massive speedups with 1-bit math.

UncleOxidant · 2 years ago
Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math. Pulling 1 or 2 bits out of a word isn't all that straightforward on CPU or GPU - lots of shifting and masking. I wonder how long it will be before we see custom hardware for bitnets? I suspect we'll see it on FPGAs first.
bee_rider · 2 years ago
For 1 bit math, at least it should be possible to populate every other bit of an integer type, right? Surely one could do better with a dedicated type for this, but at least we could pack 16 single-bit weights into a 32 bit int for addition, right?
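That packing idea can be sketched directly (my own toy illustration): put 16 one-bit weights in the even bit positions of a 32-bit word, so each weight gets a 2-bit lane and a few packed additions can't carry into a neighbouring lane.

```python
# SWAR-style packing: one weight per 2-bit lane of a 32-bit word.
def pack_alternate(bits):                  # 16 values in {0, 1}
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << (2 * i)
    return word

def unpack_lanes(word):
    return [(word >> (2 * i)) & 0b11 for i in range(16)]

a = pack_alternate([1] * 16)
b = pack_alternate([1, 0] * 8)
lanes = unpack_lanes(a + b)   # per-lane sums, no cross-lane carries
```

Each 2-bit lane only holds values up to 3, so you'd have to widen the lanes (or reduce partial sums) after every few packed adds.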
imtringued · 2 years ago
You're telling me GPUs aren't designed for additions and subtractions? Where did you hear that?
shaklee3 · 2 years ago
A100 (> 5yo GPU) has a 1-bit tensor core engine
brucethemoose2 · 2 years ago
Real world GPU performance is hugely influenced by hand optimization of the CUDA kernels.
thatguysaguy · 2 years ago
Sounds like these guys didn't use custom kernels, but BitNet did.
mobicham · 2 years ago
That's correct. Only the dequantization is done in CUDA; the matmul is done with PyTorch. If they make their kernels open-source we could re-use them!
ianbicking · 2 years ago
Reduction to decision tree!

But I'm unclear how it actually runs, and the article talks about the conversion and training but doesn't describe how it runs... I suppose because it's obvious to someone who has followed quantization.

Thinking out loud... if you have a model of just 1 and 0, my first thought is that the outputs are 1's and 0's but I think that's wrong. Instead it's a bunch of floats, and you multiply them by 1 or 0 (in a sense you are sampling the output of the previous layer?), add them up, and put the result through some activation function. And two-bit quantization sounds kind of similar, just with a _little_ scale, going from -1 to 2 instead of 0 to 1.
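That "multiply by 1 or 0" reading, as code (my sketch): a row of binary weights just selects which inputs get summed before the activation.

```python
import numpy as np

# A single neuron with 0/1 weights: a masked sum followed by ReLU.
def binary_neuron(x, w):                  # w: 0/1 weights
    return max(x[w == 1].sum(), 0.0)

x = np.array([0.5, -1.0, 2.0])
binary_neuron(x, np.array([1, 0, 1]))     # sums 0.5 + 2.0 -> 2.5
```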

It seems kind of interesting that you now have a bunch of weights that are exactly 0, meaning you can assert something about what parameters and weights affect what parts of the output. Though in some sense the way they compress the weights down to one bit is also a heuristic you could use to interpret the original model... this just makes it clearer that in totality you are making a defensible simplification, because the end result is still a working model.

It also seems like you could make a lot of mathematical assertions about a one bit model that would be harder to make otherwise. Like maybe you could start thinking of a model as an equation, a _particular_ (though giant) equation, and look at its properties and consider symbolic transformations to that equation.

bick_nyers · 2 years ago
A comment I really liked on a previous post about ternary highlighted that what you are actually measuring with {-1, 0, 1} is inverse correlation, no correlation, and correlation.
fabmilo · 2 years ago
I like the decision tree analogy
grungegun · 2 years ago
Does anyone know if this works on vanilla deep networks? These quantization articles always seem to target LLM's which leads me to wonder if there's something special about the LLM architecture vs a vanilla deep architecture.
zaptrem · 2 years ago
Transformer LLMs are just a bunch of MLPs (linear layers) where you sometimes multiply/softmax the output in a funny way (attention). In other words, they're arguably more "vanilla deep net" than most architectures (e.g., conv nets).

(There are also positional/token embeddings and normalization but those are a tiny minority of the parameters)

grungegun · 2 years ago
So there's no performance gain for quantization enabled by the transformer architecture? It seems very strange that quantization works so well, since in most of my experiments the internal weights of MLPs look random.
amelius · 2 years ago
Ok, but what does a perceptron look like in 1-bit? Would it be just some simple logic gate, like an OR-gate?
alephxyz · 2 years ago
LLMs have been trending towards obscenely large number of parameters (314B for grok), which makes quantization crucial if you want to run them without a Meta-sized budget.
Y_Y · 2 years ago
Certainly does, people have been doing this in computer vision for years.

Deleted Comment

kromem · 2 years ago
The most exciting part about ternary or binary weights is the inevitable hardware revolution for AI dedicated chips that's going to result from it.
imtringued · 2 years ago
Your hardware already supports addition and subtraction and the tensor cores of NVIDIA GPUs are already fast enough to keep up. The only benefit is reducing memory capacity and bandwidth requirements.
Matumio · 2 years ago
You mean we'll have hardware accelerated ternary instructions? https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

(Okay probably those are not ready to be used as NN weights if the activations are not binary too, but... the gap to what CPUs already can do is getting smaller.)