There is clearly a strategy here - and I'm trying to figure it out.
Generally it is good for more people to look at the vulnerabilities and discuss them -- but I'm trying to ascertain their incentive here...
Also a recruiting and branding effort.
All of this is educated guessing, but that's my feeling. I do think the post could have been clearer about the practical dangers of poisoning. Is it to spew misinformation? Is it to cause a corporate LLM-powered application to leak data it shouldn't? I'm not really sure here.
Some of the other main tricks: compress the model to 8-bit floating-point formats or even lower. This reduces the amount of data that has to stream to the compute unit, and newer GPUs can do math directly in 8-bit or 4-bit floating point. Mixture-of-experts (MoE) models are another trick: for a given token, a router in the model decides which subset of the parameters is used, so not all weights have to be streamed. Another one is speculative decoding, which uses a smaller model to generate several possible future tokens and then, in parallel, checks which of those match what the full model would have produced (rough sketch below).
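To make speculative decoding concrete, here's a toy sketch of the control flow. sample_next and verify are hypothetical stand-ins for the real model calls, and real systems run the verification as one batched forward pass on the GPU:

    def speculative_decode(target_model, draft_model, tokens, k=4, max_new=64):
        start = len(tokens)
        while len(tokens) - start < max_new:
            # 1. The cheap draft model proposes k tokens autoregressively.
            draft = []
            for _ in range(k):
                draft.append(draft_model.sample_next(tokens + draft))
            # 2. The full model checks all k proposals in a single parallel pass
            #    and returns the longest prefix it agrees with.
            accepted = target_model.verify(tokens, draft)
            tokens += accepted
            if len(accepted) < k:
                # First mismatch: take the full model's own token and resume.
                tokens.append(target_model.sample_next(tokens))
        return tokens

When the draft model guesses well, you stream the big model's weights once per k tokens instead of once per token, which is where the speedup comes from.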
Add all of these up and you get efficiency! Source: I was director of the inference team at Databricks.
I do not have a good sense of how well quality scales with narrow MoEs, but even if we get something like Llama 3.3 70B quality at only 8B active parameters, people could do a ton locally.
I've been out of the loop with HuggingFace models.
What can you do with these models?
1. Can you download them and run them on your laptop via JupyterLab?
2. What benefits does that get you?
3. Can you update them regularly (with new data on the internet, e.g.)?
4. Can you finetune them for a specific use case (e.g., geospatial data)?
5. How difficult and time-consuming (person-hours) is it to finetune a model?
(If HuggingFace has answers to these questions, please point me to the URL. HuggingFace, to me, seems like the early days of GitHub. A small number were heavy users, but the rest were left scratching their heads and wondering how to use it.)
Granted, these are newbie questions, but answers will be beneficial to a lot of us out there.
Yes you can. The community creates quantized variants of these that can run on consumer GPUs. A 4-bit quantization of Llama 70B works pretty well on MacBook Pros; the Neural Engine with unified memory is quite solid for these. Discrete GPUs are a bit tougher because consumer GPU RAM is still kinda small.
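If you want a concrete starting point, here's roughly what that looks like with llama-cpp-python; the model filename is a placeholder for whatever community GGUF quantization you download from Hugging Face:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3.3-70b-instruct.Q4_K_M.gguf",  # placeholder: any local GGUF file
        n_ctx=4096,       # context window
        n_gpu_layers=-1,  # offload all layers to Metal/CUDA when available
    )

    out = llm("Q: Why does unified memory help local inference? A:", max_tokens=128)
    print(out["choices"][0]["text"])

This runs fine in a JupyterLab cell; the main constraint is that the quantized weights have to fit in your machine's (unified) memory.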
You can also fine-tune them. There are a lot of frameworks, like unsloth, that make this easier: https://github.com/unslothai/unsloth . Fine-tuning can be pretty tricky to get right (you need to be aware of things like learning rates), but there are good resources on the internet where a lot of hobbyists have gotten things working. You do not need a PhD in ML to accomplish this. You will, however, need data that you can represent textually. A rough sketch of the workflow is below.
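Hedged sketch of a typical LoRA fine-tune with unsloth plus trl; the APIs evolve, so treat the exact arguments as approximate, and the dataset path is a placeholder for your own data:

    # pip install unsloth trl datasets
    from unsloth import FastLanguageModel
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    # 4-bit base model small enough for a single consumer GPU
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",
        max_seq_length=2048,
    )
    model = FastLanguageModel.get_peft_model(model, r=16)  # train small LoRA adapters only

    # placeholder: your use case, one {"text": "..."} example per line
    dataset = load_dataset("json", data_files="geospatial_examples.jsonl")["train"]

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        args=TrainingArguments(
            per_device_train_batch_size=2,
            learning_rate=2e-4,  # one of the knobs you actually have to watch
            max_steps=100,
            output_dir="outputs",
        ),
    )
    trainer.train()

The reason this is feasible on consumer hardware is LoRA: you freeze the 4-bit base weights and train only a small set of adapter matrices.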
Source: Director of Engineering for model serving at Databricks.
That said, the conclusion that it's a good model for cheap is true. I'd just be hesitant to say it's a great model.
You could, of course, use us and get that out of the box if you have access to Databricks.