SeanAnderson · 2 years ago
Just to make sure I'm understanding this correctly.

This paper signals that the authors have found a way to run Llama 2 70B, but with 1/8th the VRAM requirements as compared to the original model, right?

And the output is on par with the original on some metrics (ArcE/PiQA), within 25% on others (Wiki/C4), and the trajectory of their progress hints that there's even more ground to gain in the future?
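
(Back-of-envelope on the 1/8 figure: 70B parameters at fp16 is about 140 GB of weights, and at roughly 2 bits per weight that drops to about 17.5 GB, i.e. one eighth, before counting activations or the KV cache.)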

samus · 2 years ago
Your last paragraph is the key. Without their improvements, the quality tradeoff would have been hard to stomach.
londons_explore · 2 years ago
When considering quality, you should really compare not to the original model but to smaller models that use the same VRAM/compute.
tarruda · 2 years ago
If this quantization method works with smaller models, it would enable running up to 33B models with only 12GB VRAM.

Especially important for democratizing access to Mistral's new MoE model.
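
(Rough math: 33B parameters at ~2 bits per weight is about 8.25 GB of weights, which leaves a few GB of a 12 GB card for activations and the KV cache.)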

MrNeon · 2 years ago
IIRC quantizing small models causes a higher relative drop in the metrics.
bongwater_OS · 2 years ago
One of the best papers I've read in a long time. This could be huge.
karmasimida · 2 years ago
For quantization, you should always verify directly on your own intended tasks rather than trusting that quantization will preserve accuracy across a broader spectrum of tasks, because surprises are not that infrequent.
pyinstallwoes · 2 years ago
Since a pixel can have more states than binary, could you get more space and compute by leveraging RGBA space for data/compute instead of binary?

Maybe a stupid question.

wizzard0 · 2 years ago
A pixel takes more than 1 bit to store, too
pyinstallwoes · 2 years ago
on a video card? I'm not sure that is true wrt the geometry and charge.
pyinstallwoes · 2 years ago
Found this https://news.ycombinator.com/item?id=27409371

Kinda related. Especially the comments.

saberience · 2 years ago
I’m a layperson when it comes to this topic, but does this mean every value in the network is a value from 00 to 11? I.e., 00, 01, 10, and 11?

I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything really.

Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to 2 bits only?

p1esk · 2 years ago
It’s better to think about mapping groups of weights to an entry in a large codebook, and then using clever methods to do weight lookups.

2 bits of precision per weight is perfectly fine as long as you have enough weights. The information encoded by a neural network is measured in the total number of bits, so you can compress it by either reducing the number of weights or reducing the number of bits per weight.
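
To make that concrete, here's a toy numpy sketch of what "2 bits per weight via a codebook" can look like. This is not QuIP#'s actual method (the paper uses vector quantization with lattice codebooks plus incoherence processing); it's just scalar k-means with a 4-entry codebook per group of weights, so each weight is stored as a 2-bit index:

    import numpy as np

    def quantize_2bit(weights, group_size=64, n_iters=10):
        # Toy 2-bit quantization: each group of weights gets its own
        # 4-entry codebook (fit by k-means); every weight is stored as
        # a 2-bit index into that codebook. Assumes len(weights) is a
        # multiple of group_size.
        w = weights.reshape(-1, group_size)
        codebooks = np.zeros((w.shape[0], 4))
        indices = np.zeros(w.shape, dtype=np.uint8)
        for g in range(w.shape[0]):
            cb = np.quantile(w[g], [0.125, 0.375, 0.625, 0.875])  # init centroids
            for _ in range(n_iters):
                idx = np.abs(w[g][:, None] - cb[None, :]).argmin(axis=1)
                for k in range(4):
                    if np.any(idx == k):
                        cb[k] = w[g][idx == k].mean()
            codebooks[g], indices[g] = cb, idx
        return codebooks, indices  # ~2 bits/weight plus a small codebook overhead

    def dequantize(codebooks, indices):
        # Look each 2-bit index back up in its group's codebook.
        return np.take_along_axis(codebooks, indices.astype(np.int64), axis=1)

Real schemes quantize groups of weights jointly (so the codebook entries are vectors rather than scalars) and choose codebooks that keep GPU lookups fast, but the storage story is the same: small indices plus a shared codebook.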

skavi · 2 years ago
Can anyone comment on running the 2-bit quantized Llama 70B on consumer cards like the 4090?
lhl · 2 years ago
It's been on my list to do a proper shootout of all the various new quant formats floating around (my list here: https://llm-tracker.info/books/llms/page/quantization-overvi...) but a lot of them don't have very good production code yet (eg, a few months ago, when I tried OmniQuant, some of the important bits of code weren't even included and had to be obtained directly from the authors: https://llm-tracker.info/books/llms/page/omniquant).

If you're looking for the best widely deployed quant format atm, it's probably ExLlamaV2's EXL2 - it supports arbitrary bpw w/ a calibration file, plus an 8-bit KV cache. I haven't tested EXL2 much at lower bpws though.

Note, both llama.cpp and AirLLM allow layer offloading to system memory (or in AirLLM's case, even to disk?!).
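
For example, with the llama-cpp-python bindings, partial offload looks roughly like this (a sketch only; the model filename is hypothetical and parameter names may differ across versions, so check your install):

    from llama_cpp import Llama

    # Load a GGUF-quantized 70B model, putting only some of the layers on
    # the GPU and leaving the rest in system RAM.
    llm = Llama(
        model_path="llama-2-70b-chat.Q2_K.gguf",  # hypothetical filename
        n_gpu_layers=50,   # layers offloaded to VRAM; the rest run on CPU
        n_ctx=2048,
    )

    out = llm("Q: What is tetration? A:", max_tokens=128)
    print(out["choices"][0]["text"])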

r/LocalLlama is probably the best place to search if you're looking for people's experiences w/ quants. I know some people have been testing, like: https://www.reddit.com/r/LocalLLaMA/comments/17klaa5/tested_...

brucethemoose2 · 2 years ago
> https://llm-tracker.info/books/llms/page/quantization-overvi...

This is a very cool resource, thanks!

Gems like this, even in areas I follow pretty closely, are why I keep coming back to HN.

0xDEADFED5 · 2 years ago
I humbly request you add mlc-llm to your quant test when/if you get around to doing it
beacon294 · 2 years ago
I quantized Llama-2-chat-70B to Q2_K with llama.cpp, then ran it on a 4090 with 50/83 layers offloaded. I was able to get 1.3 tokens per second. It answers this question, which smaller models have trouble with:

<s>[INST]Hello. The functions addition, multiplication, and exponentiation can be considered each as a higher order of the previous. For example, 2+2+2 = 2*3 which is significant because there are three twos, which is the multiplication factor. Likewise 2*2*2=2^3, again, the power number represents the count of the numbers being multiplied. What is this concept and what is the next higher order function?[/INST] The concept you are referring to is called the "iterative" or "recursive" nature of arithmetic operations. Each operation can be seen as a repeated application of the previous operation, with the multiplication factor representing the number of times the operation is applied.

For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation, where the multiplicand is 2 and the multiplier is 3 (i.e., 2*3 = 6, which is the result of the addition). Similarly, the multiplication of two numbers, 2*2, can be seen as a repeated application of the exponentiation operation, where the base is 2 and the power is 3 (i.e., 2^3 = 8, which is the result of the multiplication).

The next higher order function would be the "tetration" operation, which is the iterative application of the exponential function. In other words, tetration(n,x) = x^(x^(x^(...(x^n)...))), where n is the number of times the exponential function is applied.

For example, tetration(2,2) = 2^(2^(2^(2))), which can be simplified to 2^16, or 65536.

It's worth noting that this concept can be extended even further to higher levels of abstraction, such as "hyper-operations" and "super-operations", but these are more advanced and less commonly used concepts in mathematics.

skykooler · 2 years ago
Note that it got all the math wrong. 2+2 is a repetition of the succession operator, not multiplication; 2+2 equals 4, not 6; multiplication of two numbers is repeated addition, not exponentiation; 2*2 equals 4, not 8; and tetration(2,2) is 4, not 65536. This is actually invariant: n-ation(2,2) equals 4 for all natural numbers n.
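
To make the correction concrete, here's a quick Python sketch of the hyperoperation hierarchy (standard definition, with shortcuts for the first few levels):

    def hyper(n, a, b):
        # H_1 = addition, H_2 = multiplication, H_3 = exponentiation,
        # H_4 = tetration, and so on; each level iterates the one below it.
        if n == 1:
            return a + b
        if n == 2:
            return a * b
        if n == 3:
            return a ** b
        if b == 0:
            return 1  # convention for levels above exponentiation
        return hyper(n - 1, a, hyper(n, a, b - 1))

    print([hyper(n, 2, 2) for n in range(1, 6)])  # [4, 4, 4, 4, 4]
    print(hyper(4, 2, 2))  # tetration(2, 2) = 2^2 = 4
    print(hyper(4, 2, 4))  # tetration to height 4 = 2^(2^(2^2)) = 65536

The invariance falls out because H_n(2, 1) = 2 at every level above addition, so H_n(2, 2) telescopes down to 2 + 2 = 4; the model's 65536 is tetration of 2 to height 4, not tetration(2, 2).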
Philpax · 2 years ago
To be clear, this is not using QuIP#, and is thus not particularly relevant to the discussion at hand.
beacon294 · 2 years ago
The answer seems kind of low quality, but it got further than others. It could probably self-correct.
thulle · 2 years ago
Better output than the smaller llamas in my limited testing, but it's surprisingly slow:

Output generated in 101.74 seconds (0.98 tokens/s, 100 tokens, context 82, seed 532878022)

Output generated in 515.46 seconds (0.99 tokens/s, 511 tokens, context 27, seed 660997525)

Checking nvidia-smi, it stalls at ~130 W (out of ~470 W max) power usage, ~25% GPU usage and ~10% memory bandwidth usage. There's quite a lot of traffic on the PCIe bus though, and the python process is stable at 100% usage of one core. GPU possibly limited by something handled in Python? Pausing the GPU-accelerated video decoding of a Twitch stream, it gets a surprisingly large boost:

Output generated in 380.42 seconds (1.34 tokens/s, 511 tokens, context 26, seed 648992918)