"...It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption..."
That essay on the water cycle makes no sense. Some sentences are repeated three times. The conclusion about the water cycle and energy appears wrong. And what paper is "Jenkins (2010)"?
Am I missing something, or is this regressing to GPT-1 level?
They should probably redo the demo with their latest model. I tried the same prompt on https://bitnet-demo.azurewebsites.net/ and it looked significantly more coherent. At least it didn't get stuck in a loop.
> "...It matches the full-precision (i.e., FP16 or BF16)
Wait... WHAT?!
When did //HALF PRECISION// become //FULL PRECISION//?
FWIW, I cannot find where you're quoting from. I cannot find "matches" on TFA nor the GitHub link. And in the paper I see
> 3.2 Inference Accuracy
> The bitnet.cpp framework enables lossless inference for ternary BitNet b1.58 LLMs. To evaluate inference accuracy, we randomly selected 1,000 prompts from WildChat [ZRH+24] and compared the outputs generated by bitnet.cpp and llama.cpp to those produced by an FP32 kernel. The evaluation was conducted on a token-by-token basis, with a maximum of 100 tokens per model output, considering an inference sample lossless only if it exactly matched the full-precision output.
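In code, the check they describe is roughly this (a sketch of the quoted protocol, not the authors' harness; generate_tokens() is a hypothetical stand-in for running a prompt through a given backend):

```python
def is_lossless(prompt, generate_tokens, max_tokens=100):
    """A sample counts as lossless only if the quantized backend's output
    exactly matches the FP32 reference, token by token (up to 100 tokens)."""
    candidate = generate_tokens("bitnet.cpp", prompt, max_tokens)
    reference = generate_tokens("fp32", prompt, max_tokens)
    return candidate == reference

def lossless_rate(prompts, generate_tokens):
    """Fraction of prompts (the paper samples 1,000 from WildChat) that match exactly."""
    return sum(is_lossless(p, generate_tokens) for p in prompts) / len(prompts)
```

So "lossless" there is a statement about the bitnet.cpp kernels reproducing the ternary model's own full-precision outputs, not about the 1.58-bit model matching an FP16 model of the same size.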
I agree. No matter how much wishful thinking Jensen sells to investors about paradigm shifts, the days of everyone rushing out to get six-figure tensor-core clusters for their data centers probably won't last forever.
If Nvidia were at all in a hurry to lock out third parties, I don't think they would support OpenCL and Vulkan compute, or allow customers to write PTX compilers that interface with Nvidia hardware.
In reality, the demand for highly parallelized compute simply blindsided OEMs. AMD, Intel, and Apple were all laser-focused on raster efficiency; none of them has a GPU architecture optimized for GPGPU workloads. AMD and Intel don't have competitive fab access, and Apple can't sell datacenter hardware to save their life; Nvidia's monopoly on attractive TSMC hardware isn't going anywhere.
Even if you can squeeze an existing model into smaller hardware, that just means you can squeeze a larger (and hence smarter) model into that six-figure cluster. And they aren't anywhere near smart enough for many of the things people attempt to use them for, so I don't see the hardware demand for inference subsiding substantially anytime soon.
At least not for these reasons - if it does, it'll be because of a consistent pattern of overhyping and underdelivering on real-world applications of generative AI, like what's going on with Apple right now.
"Parameter count" is the "GHz" of AI models: the number you're most likely to see but least likely to need. All of the models compared (in the table on the huggingface link) are 1-2 billion parameters but the models range in actual size by more than a factor of 10.
Because of different quantization. However, parameter count is generally the more interesting number so long as quantization isn't too extreme (as it is here). E.g., FP32 is 4x the size of an 8-bit quant, but the difference in quality is close to non-existent in most cases.
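To put numbers on that, a back-of-the-envelope sketch (weights only, ignoring embeddings, quantization scales, and file overhead):

```python
# Approximate bits of storage per weight for a few common formats.
BITS_PER_PARAM = {
    "FP32": 32.0,
    "FP16/BF16": 16.0,
    "INT8": 8.0,
    "ternary (BitNet b1.58)": 1.58,  # ~log2(3) bits of information per weight
}

def approx_weight_gb(n_params, fmt):
    """Rough weight size in GB for n_params parameters in the given format."""
    return n_params * BITS_PER_PARAM[fmt] / 8 / 1e9

for fmt in BITS_PER_PARAM:
    print(f"2B params @ {fmt}: ~{approx_weight_gb(2e9, fmt):.2f} GB")
# FP32 ~8 GB, FP16 ~4 GB, INT8 ~2 GB, ternary ~0.4 GB: same parameter count,
# roughly a 20x spread in weight size, which is the factor-of-10+ gap in the table.
```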
>so long as quantization isn't too extreme (as it is here)
This is true for post-training quantization, but not for quantization-aware training, and not for something like BitNet. Here they claim performance per parameter comparable to normal models; that's the entire point.
I think almost all of the free LLMs (not AI) that you find on Hugging Face can 'run on CPUs'.
The claim here seems to be that it runs usefully fast on CPU.
We're not sure how accurate this claim is, since we don't know how fast this model runs on a GPU, because:
> Absent from the list of supported chips are GPUs [...]
And TFA doesn't really quantify anything, just offers:
> Perhaps more impressively, BitNet b1.58 2B4T is speedier than other models of its size — in some cases, twice the speed — while using a fraction of the memory.
The model they link to is just over 1GB in size, and there's plenty of existing 1-2GB models that are quite serviceable on even a mildly-modern CPU-only rig.
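For anyone who wants to put a number on "usefully fast", a rough decode-speed timer is easy to write; generate() below is a placeholder for whatever backend you're benchmarking (bitnet.cpp, llama.cpp, ...), not a real API:

```python
import time

def tokens_per_second(generate, prompt, max_new_tokens=100):
    """Rough decode-speed estimate; generate() is assumed to return the
    list of generated token IDs for the prompt."""
    start = time.perf_counter()
    tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Run the same prompt through each backend/machine and compare tok/s directly.
```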
If you click the demo link, you can type a live prompt and see it run on CPU or GPU (A100). From my test, the CPU was laughably slower. To my eyes, it seems comparable to the models I can run with llama.cpp today. Perhaps I am completely missing the point of this.
This is over a year old. The sky did not fall, and everyone didn't switch to this in spite of the "advantages". If you look into why, you'll see that it does, in fact, affect the metrics, some more than others, and there is no silver bullet.
The 2B4T model was literally released yesterday, and it's both smaller and better than what they had a year ago. Presumably the next step is that they get more funding for a larger model trained on even more data to see whether performance keeps improving. Of course the extreme quantization is always going to impact scores a bit, but if it lets you run models that otherwise wouldn't even fit into RAM, it's still worth it.
Take a look at their own paper, or at the many attempts to train something large with this. There's no replacement for displacement. If this actually worked without quality degradation, literally everyone would be using it.
Once you know how to compress 32-bit parameters to ternary, compressing ternary to binary is the easy part. :)
They would keep re-compressing the model in its entirety, recursively until the whole thing was a single bit, but the unpacking and repacking during inference is a bitch.
1.58 bits is still more than 1 in general unless the parameters are correlated. At 1 bit it seems unlikely that you could pack/unpack independent parameters reliably without additional data.
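For the curious: the "1.58" is log2(3) ≈ 1.585 bits of information per independent ternary weight, and a practical packing gets close to it because five ternary values fit in one byte (3^5 = 243 ≤ 256), i.e. 1.6 bits per weight. A toy sketch of that packing:

```python
import math

print(math.log2(3))  # ~1.585 bits per independent ternary weight

def pack5(trits):
    """Pack exactly 5 values from {-1, 0, +1} into a single byte (base-3 encoding)."""
    assert len(trits) == 5 and all(t in (-1, 0, 1) for t in trits)
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
    return value  # 0..242

def unpack5(byte):
    """Inverse of pack5."""
    trits = []
    for _ in range(5):
        byte, digit = divmod(byte, 3)
        trits.append(digit - 1)
    return trits

assert unpack5(pack5([1, 0, -1, -1, 1])) == [1, 0, -1, -1, 1]
# 8 bits / 5 weights = 1.6 bits per weight; getting down to 1 bit per weight
# would require the weights to be predictable/correlated, as noted above.
```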
There are projects working on distributed LLMs, such as exo[1]. If they can crack the distributed problem fully and get good performance, it's a game changer. Instead of spending insane amounts on Nvidia GPUs, you could just deploy commodity clusters of AMD EPYC servers with tons of memory, NVMe disks, and 40G or 100G networking, which is vastly less expensive. Goodbye Nvidia AI moat.

[1] https://github.com/exo-explore/exo
Do you think this is inevitable? It sounds like, if distributed LLMs are technically feasible, it will eventually happen. Maybe it's still unknown whether it can be solved at all, but I imagine there are enough people working on the problem that they will find a breakthrough one way or another. LLMs themselves could participate in solving it.
Edit: Oh I just saw the Git repo:
> exo: Run your own AI cluster at home with everyday devices.
So the "distributed problem" is in the process of being solved. Impressive.
Was the crazy revaluation of Nvidia wild? Yes.
Will others start taking contracts away with their fast custom inference solutions? Yes, of course, but I'm sure everyone is aware of it.
What is very unclear is how strong Nvidia is with its robotics platform.