"...It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption..."
That essay on the water cycle makes no sense. Some sentences are repeated three times. The conclusion about the water cycle and energy appears wrong. And what paper is "Jenkins (2010)"?
Am I missing something, or is this regressing to GPT-1 level?
They should probably redo the demo with their latest model. I tried the same prompt on https://bitnet-demo.azurewebsites.net/ and it looked significantly more coherent. At least it didn't get stuck in a loop.
> "...It matches the full-precision (i.e., FP16 or BF16)
Wait... WHAT?!
When did //HALF PRECISION// become //FULL PRECISION//?
FWIW, I cannot find where you're quoting from. I cannot find "matches" on TFA nor the GitHub link. And in the paper I see
> 3.2 Inference Accuracy
> The bitnet.cpp framework enables lossless inference for ternary BitNet b1.58 LLMs. To evaluate inference accuracy, we randomly selected 1,000 prompts from WildChat [ZRH+24] and compared the outputs generated by bitnet.cpp and llama.cpp to those produced by an FP32 kernel. The evaluation was conducted on a token-by-token basis, with a maximum of 100 tokens per model output, considering an inference sample lossless only if it exactly matched the full-precision output.
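In code, the check they describe is roughly this (a sketch of the quoted protocol, not the authors' harness; generate_tokens() is a hypothetical stand-in for running a prompt through a given backend):

```python
def is_lossless(prompt, generate_tokens, max_tokens=100):
    """A sample counts as lossless only if the quantized backend's output
    exactly matches the FP32 reference, token by token (up to 100 tokens)."""
    candidate = generate_tokens("bitnet.cpp", prompt, max_tokens)
    reference = generate_tokens("fp32", prompt, max_tokens)
    return candidate == reference

def lossless_rate(prompts, generate_tokens):
    """Fraction of prompts (the paper samples 1,000 from WildChat) that match exactly."""
    return sum(is_lossless(p, generate_tokens) for p in prompts) / len(prompts)
```

So "lossless" there is a statement about the bitnet.cpp kernels reproducing the ternary model's own full-precision outputs, not about the 1.58-bit model matching an FP16 model of the same size.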
I agree. No matter how much wishful thinking Jensen sells to investors about paradigm shifts, the days of everyone rushing out to get six-figure tensor-core clusters for their data centers probably won't last forever.
If Nvidia were at all in a hurry to lock out third parties, I don't think they would support OpenCL and Vulkan compute, or allow customers to write PTX compilers that interface with Nvidia hardware.
In reality, the demand for highly parallelized compute simply blindsided OEMs. AMD, Intel, and Apple were all laser-focused on raster efficiency; none of them has a GPU architecture optimized for GPGPU workloads. AMD and Intel don't have competitive fab access, and Apple can't sell datacenter hardware to save their life; Nvidia's monopoly on attractive TSMC hardware isn't going anywhere.
Even if you can squeeze an existing model into smaller hardware, that just means you can squeeze a larger (and hence smarter) model into that six-figure cluster. And they aren't anywhere near smart enough for many of the things people attempt to use them for, so I don't see the hardware demand for inference subsiding substantially anytime soon.
At least not for these reasons - if it does, it'll be because of a consistent pattern of overhyping and underdelivering on real-world applications of generative AI, like what's going on with Apple right now.
"Parameter count" is the "GHz" of AI models: the number you're most likely to see but least likely to need. All of the models compared (in the table on the huggingface link) are 1-2 billion parameters but the models range in actual size by more than a factor of 10.
Because of different quantization. However, parameter count is generally the more interesting number so long as quantization isn't too extreme (as it is here). E.g., FP32 is 4x the size of an 8-bit quant, but the difference in quality is close to non-existent in most cases.
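To put numbers on that, a back-of-the-envelope sketch (weights only, ignoring embeddings, quantization scales, and file overhead):

```python
# Approximate bits of storage per weight for a few common formats.
BITS_PER_PARAM = {
    "FP32": 32.0,
    "FP16/BF16": 16.0,
    "INT8": 8.0,
    "ternary (BitNet b1.58)": 1.58,  # ~log2(3) bits of information per weight
}

def approx_weight_gb(n_params, fmt):
    """Rough weight size in GB for n_params parameters in the given format."""
    return n_params * BITS_PER_PARAM[fmt] / 8 / 1e9

for fmt in BITS_PER_PARAM:
    print(f"2B params @ {fmt}: ~{approx_weight_gb(2e9, fmt):.2f} GB")
# FP32 ~8 GB, FP16 ~4 GB, INT8 ~2 GB, ternary ~0.4 GB: same parameter count,
# roughly a 20x spread in weight size, which is the factor-of-10+ gap in the table.
```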
>so long as quantization isn't too extreme (as it is here)
This is true for post-training quantization, but not for quantization-aware training, and not for something like BitNet. Here they claim performance per parameter comparable to normal models; that's the entire point.
I think almost all of the free LLMs (not AI) that you find on Hugging Face can 'run on CPUs'.
The claim here seems to be that it runs usefully fast on CPU.
We're not sure how accurate this claim is, since we don't know how fast this model runs on a GPU, because:
> Absent from the list of supported chips are GPUs [...]
And TFA doesn't really quantify anything, just offers:
> Perhaps more impressively, BitNet b1.58 2B4T is speedier than other models of its size — in some cases, twice the speed — while using a fraction of the memory.
The model they link to is just over 1GB in size, and there's plenty of existing 1-2GB models that are quite serviceable on even a mildly-modern CPU-only rig.
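For anyone who wants to put a number on "usefully fast", a rough decode-speed timer is easy to write; generate() below is a placeholder for whatever backend you're benchmarking (bitnet.cpp, llama.cpp, ...), not a real API:

```python
import time

def tokens_per_second(generate, prompt, max_new_tokens=100):
    """Rough decode-speed estimate; generate() is assumed to return the
    list of generated token IDs for the prompt."""
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Run the same prompt through each backend/machine and compare tok/s directly.
```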
If you click the demo link, you can type a live prompt and see it run on CPU or GPU (A100). From my test, the CPU was laughably slower. To my eyes, it seems comparable to the models I can run with llama.cpp today. Perhaps I am completely missing the point of this.
This is over a year old. The sky did not fall, and everyone didn't switch to this in spite of the "advantages". If you look into why, you'll see that it does, in fact, affect the metrics, some more than others, and there is no silver bullet.
The 2B4T model was literally released yesterday, and it's both smaller and better than what they had a year ago. Presumably the next step is that they get more funding for a larger model trained on even more data to see whether performance keeps improving. Of course the extreme quantization is always going to impact scores a bit, but if it lets you run models that otherwise wouldn't even fit into RAM, it's still worth it.
Take a look at their own paper, or at the many attempts to train something large with this. There's no replacement for displacement. If this actually worked without quality degradation, literally everyone would be using it.
Once you know how to compress 32-bit parameters to ternary, compressing ternary to binary is the easy part. :)
They would keep re-compressing the model in its entirety, recursively until the whole thing was a single bit, but the unpacking and repacking during inference is a bitch.
1.58 bits is still more than 1 in general unless the parameters are correlated. At 1 bit it seems unlikely that you could pack/unpack independent parameters reliably without additional data.
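For the curious: the "1.58" is log2(3) ≈ 1.585 bits of information per independent ternary weight, and a practical packing gets close to it because five ternary values fit in one byte (3^5 = 243 ≤ 256), i.e. 1.6 bits per weight. A toy sketch of that packing:

```python
import math

print(math.log2(3))  # ~1.585 bits per independent ternary weight

def pack5(trits):
    """Pack exactly 5 values from {-1, 0, +1} into a single byte (base-3 encoding)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
    return value  # 0..242

def unpack5(byte):
    """Inverse of pack5."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

assert unpack5(pack5([1, 0, -1, -1, 1])) == [1, 0, -1, -1, 1]
# 8 bits / 5 weights = 1.6 bits per weight; getting down to 1 bit per weight
# would require the weights to be predictable/correlated, as noted above.
```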
There are projects working on distributed LLMs, such as exo[1]. If they can crack the distributed problem fully and get good performance, it's a game changer. Instead of spending insane amounts on Nvidia GPUs, you could just deploy commodity clusters of AMD EPYC servers with tons of memory, NVMe disks, and 40G or 100G networking, which is vastly less expensive. Goodbye Nvidia AI moat.

[1] https://github.com/exo-explore/exo
Do you think this is inevitable? It sounds like, if distributed LLMs are technically feasible, it will eventually happen. Maybe it's still unknown whether it can be solved at all, but I imagine there are enough people working on the problem that they will find a breakthrough one way or another. LLMs themselves could participate in solving it.
Edit: Oh I just saw the Git repo:
> exo: Run your own AI cluster at home with everyday devices.
So the "distributed problem" is in the process of being solved. Impressive.
Was the crazy revaluation of Nvidia wild? Yes.
Will others start taking contracts away with their fast custom inference solutions? Yes, of course, but I'm sure everyone is aware of it.
What is very unclear is how strong Nvidia is with its robotics platform.