Just to make sure I'm understanding this correctly.
This paper signals that the authors have found a way to run Llama 2 70B with roughly 1/8th the VRAM requirements of the original model, right?
And the output is on par with the original on some metrics (ArcE/PiQA), within 25% on others (Wiki/C4 perplexity), and the trajectory of their progress hints that there's even more ground to gain in the future?
For quantization, you should always verify directly on your own intended tasks rather than trusting that quantization will preserve accuracy across a broader spectrum of tasks; surprises are not that infrequent.
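Even a tiny harness over your own prompts will catch regressions that headline benchmarks miss. A minimal sketch, where generate() stands in for whatever inference call you use and the eval items are placeholders:

    # Hypothetical harness: generate() and the eval items are placeholders.
    eval_set = [
        ("What is 12 * 7?", "84"),
        ("What is the capital of France?", "Paris"),
        # ... replace with prompts and expected answers from your actual task
    ]

    def accuracy(generate, eval_set):
        # Count how often the expected answer appears in the model's output.
        hits = sum(expected.lower() in generate(prompt).lower()
                   for prompt, expected in eval_set)
        return hits / len(eval_set)

    # Run the same set against the fp16 and quantized models and compare:
    # print(accuracy(fp16_generate, eval_set), accuracy(quant_generate, eval_set))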
It’s better to think about mapping groups of weights to an entry in a large codebook, and then using clever methods to do weight lookups.
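A toy sketch of that codebook idea, assuming groups of 8 weights and a random (rather than learned) codebook just to show the mechanics:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(1024, 8)).astype(np.float32)   # weights, in groups of 8

    # 256 codebook entries of 8 weights each: one uint8 index per group,
    # i.e. 1 bit per weight (plus the cost of storing the codebook itself).
    codebook = rng.normal(size=(256, 8)).astype(np.float32)  # real schemes learn this

    # Encode: map each group of weights to its nearest codebook entry.
    dists = ((w[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1).astype(np.uint8)

    # Decode: a weight "lookup" is just indexing into the codebook.
    w_hat = codebook[idx]
    print("reconstruction MSE:", float(((w - w_hat) ** 2).mean()))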
I have quantized Llama-2-chat-70B to Q2_K with llama.cpp, then run it on a 4090 with 50 of the 83 layers on the GPU. I was able to get 1.3 tokens per second (a sketch of reproducing this setup follows the timing numbers below). It answers this question, which smaller models have trouble with:
<s>[INST]Hello. The functions addition, multiplication, and exponentiation can be considered each as a higher order of the previous. For example, 2+2+2 = 2*3 which is significant because there are three twos, which is the multiplication factor. Likewise 2*2*2=2^3, again, the power number represents the count of the numbers being multiplied. What is this concept and what is the next higher order function?[/INST]
The concept you are referring to is called the "iterative" or "recursive" nature of arithmetic operations. Each operation can be seen as a repeated application of the previous operation, with the multiplication factor representing the number of times the operation is applied.
For example, the addition of two numbers, 2+2, can be seen as a repeated application of the multiplication operation, where the multiplicand is 2 and the multiplier is 3 (i.e., 2*3 = 6, which is the result of the addition). Similarly, the multiplication of two numbers,
2*2, can be seen as a repeated application of the exponentiation operation, where the base is 2 and the power is 3 (i.e., 2^3 = 8, which is the result of the multiplication).
The next higher order function would be the "tetration" operation, which is the iterative application of the exponential function. In other words, tetration(n,x) = x^(x^(x^(...(x^n)...))), where n is the number of times the exponential function is applied.
For example, tetration(2,2) = 2^(2^(2^(2))), which can be simplified to 2^16, or 65536.
It's worth noting that this concept can be extended even further to higher levels of abstraction, such as "hyper-operations" and "super-operations", but these are more advanced and less commonly used concepts in mathematics.
Note that it got all the math wrong. 2+2 is a repeated application of the successor function, not of multiplication; 2+2 equals 4, not 6; multiplication of two numbers is repeated addition, not exponentiation; 2*2 equals 4, not 8; and tetration(2,2) is 4, not 65536. This is actually an invariant: n-ation(2,2) equals 4 for all natural numbers n.
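That invariant is easy to check by writing the hyperoperation recursion down directly. A minimal Python sketch (the hyper name and the n = 1 base case are my own framing):

    def hyper(n, a, b):
        # n = 1: addition, n = 2: multiplication, n = 3: exponentiation,
        # n = 4: tetration, and so on.
        if n == 1:
            return a + b
        if b == 1:
            return a  # H_n(a, 1) = a for every n >= 2
        # Each level is repeated application of the level below it.
        return hyper(n - 1, a, hyper(n, a, b - 1))

    for n in range(1, 6):
        print(n, hyper(n, 2, 2))  # prints 4 at every level

The reduction hyper(n, 2, 2) = hyper(n-1, 2, hyper(n, 2, 1)) = hyper(n-1, 2, 2) is why the value 4 propagates all the way up from 2+2 = 4.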
Output generated in 101.74 seconds (0.98 tokens/s, 100 tokens, context 82, seed 532878022)
Output generated in 515.46 seconds (0.99 tokens/s, 511 tokens, context 27, seed 660997525)
Checking nvidia-smi, it stalls at ~130 W (out of a ~470 W max) of power draw, ~25% GPU utilization, and ~10% memory-bandwidth utilization. There's quite a lot of traffic on the PCIe bus, though, and the Python process sits steadily at 100% of one core. Is the GPU perhaps limited by something handled in Python?
Pausing the GPU-accelerated video decoding of a Twitch stream gives it a surprisingly large boost:
Output generated in 380.42 seconds (1.34 tokens/s, 511 tokens, context 26, seed 648992918)
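For anyone who wants to reproduce this kind of setup, a minimal sketch using the llama-cpp-python bindings; the GGUF path is an assumption, and n_gpu_layers mirrors the 50/83 split above:

    from llama_cpp import Llama

    # The model path is a placeholder; produce a Q2_K GGUF first with
    # llama.cpp's quantize tool.
    llm = Llama(
        model_path="./llama-2-70b-chat.Q2_K.gguf",
        n_gpu_layers=50,  # keep 50 of the 83 layers on the GPU, the rest in RAM
        n_ctx=2048,
    )

    out = llm("[INST]What is the next higher-order operation after exponentiation?[/INST]",
              max_tokens=256)
    print(out["choices"][0]["text"])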
Need a few extra steps: https://github.com/oobabooga/text-generation-webui/pull/4803
Especially important for democratizing access to Mistral's new MoE model.
Maybe a stupid question.
Kinda related. Especially the comments.
I struggle to understand how a network with only two bits of precision could ever generate text or numbers or anything really.
Is my intuition wrong here? If so, can someone give an example of what it means to quantize the network down to only 2 bits?
2 bits of precision per weight is perfectly fine as long as you have enough weights. The information encoded by a neural network is measured in its total number of bits, so you can compress it either by reducing the number of weights or by reducing the bits per weight.
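As a toy example of what "2 bits per weight" can mean, here's naive round-to-nearest over four levels; real low-bit schemes use small groups, calibration data, and tricks like the codebooks mentioned elsewhere in the thread, but the storage arithmetic is the same:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

    # One scale per row (a simplification; real quantizers use smaller groups).
    scale = np.abs(w).max(axis=1, keepdims=True) / 1.5
    codes = np.clip(np.round(w / scale + 1.5), 0, 3).astype(np.uint8)  # 2-bit codes 0..3

    w_hat = (codes.astype(np.float32) - 1.5) * scale  # 4 levels: +-0.5*s, +-1.5*s
    print("mean |error|:", float(np.abs(w - w_hat).mean()))
    # 4096*4096 weights at 2 bits is ~4 MiB of codes, vs ~64 MiB at fp32.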
If you're looking for the best widely deployed quant format atm, it's probably ExLlamaV2's EXL2 - it supports arbitrary bpw with a calibration file, and also an 8-bit KV cache. I haven't tested EXL2 much at lower bpws though.
Note, both llama.cpp and AirLLM allow layer offloading to system memory (or in AirLLM's case, even to disk?!).
r/LocalLlama probably is the best place to search for if you're looking for people's experiences w/ quants. I know some people have been testing, like: https://www.reddit.com/r/LocalLLaMA/comments/17klaa5/tested_...
This is a very cool resource, thanks!
Gems like this, even in areas I follow pretty closely, are why I keep coming back to HN.