As for your bonus question, that is the model you want. In general I'd choose the largest quantized version you can fit on your system. I'm personally running the 8-bit version on my M3 Max MacBook Pro and it runs great!

Performance is unfortunately a loaded word when it comes to LLMs, because it can mean tokens per second or it can mean perplexity (i.e. how well the LLM responds). In terms of tokens per second, quantized models usually run a little faster: memory bandwidth is the constraint, and you're moving less memory around. In terms of perplexity, different quantization strategies work better or worse. I really don't think there's much reason for anyone to use a full fp16 model for inference; you're not gaining much there. I think most people use the 4-bit quants because they're a nice balance. But really it's just a matter of playing with the models and seeing how well they work. For example, some models perform okay when quantized down to 2 bits (I'm shocked that's the case, but people have reported it in their testing), but Mixtral is not one of those models.
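To make the "largest quant that fits" rule of thumb concrete, here's a rough back-of-the-envelope sketch. It assumes ~47B total parameters for Mixtral 8x7B and a simple bits-per-weight model that ignores the KV cache and runtime overhead, so treat the numbers as ballpark only:

```python
# Rough memory estimate for quantized model weights.
# Assumption: ~47B total parameters for Mixtral 8x7B (all experts);
# KV cache and runtime overhead are ignored, so real usage is higher.

def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

MIXTRAL_PARAMS = 46.7e9  # approximate total parameter count

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{label:>5}: ~{approx_model_size_gb(MIXTRAL_PARAMS, bits):.0f} GB")
```

That works out to roughly 93 GB for fp16, 47 GB for 8-bit, 23 GB for 4-bit, and 12 GB for 2-bit, which is roughly why an 8-bit quant fits comfortably on a higher-memory M3 Max while fp16 is a stretch even there.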
I would say I care a lot more about perplexity than pure T(okens)PS… it’s good to be able to verbalize that distinction.
For one thing, and granted this is my own experience, that model is much better at coding than any of the others I've tried.
But going beyond that, if I need to do anything complicated that might hit the baked-in filters on these other models, I don't have to worry about it with Mixtral. I'm not doing anything illegal, btw; it's just that I'm an adult and don't need to use the bumper lane when I go bowling. I also approach any interaction with the thing knowing not to trust it 100% and to verify anything it says independently.
Bonus question if you have the time: there's a release by TheBloke for this on HuggingFace (TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF), but I thought his models were usually "quantised" - does that kneecap the performance at all?
But, people aren’t robots whose movements are controlled by an on-off switch. The government introduced the means for people to arrive and work, and so the people arrived. They are continuing to arrive because the policies have not been updated yet. How is it the immigrants’ fault? Why the hate and the attacks on their dignity & humanity?
The nonstop online vitriol hurts me deeply to read - nowhere is “safe” - Reddit, HN, Instagram… the hate spewers seemingly spend all their time spewing on these platforms to manipulate opinions and tap into the fundamental atavistic psychological flaws of the human mind.