For those primarily interested in open weight models, that Mixtral 8x22B is really intriguing. The Mistral models have tended to outperform other models with similar parameter counts.
Still 281GB is huge. That's at the higher end of what we see from other open weight models, and it's not going to fit on anybody's homelab franken-GPU rig. Assuming that 281GB is fp16, it should quantize down to roughly 70GB at 4bits. Still too big for any consumer grade GPU, but accessible on a workstation with enough system ram. Mixtral 8x7B runs surprisingly fast, even on CPUs. Hopefully this 8x22B model will perform similarly.
* It's complicated so it takes a while and you need lawyers and such to make it right
* Rules for training are probably hugely vague and undefined. Because you could ingest personal data and it cannot be deleted
* AFAIK it needs to be hosted in Europe (not directly GDPR related, but america has laws that allows them to spy on all traffic in the US, so this is somewhat the counter to that)
In the end from my experience just working at a company that needs to be compliant this usually means:
* All the services need to be hosed in EU including 3rd parties we send any data to
* There needs to be a way (email is enough) to delete user data (including from 3rd parties which need an endpoint so you can trigger it from your side)
* You need to inform the user about the data useage and allow them to opt out of the "usage" of this data for non-essential things (i.e marketing emails). This does not mean you cannot save this data if you also use it for other things, but you can not use it for the non-essential case.
* You could be in trouble if you save data "just because" and do not use it for anything essential or if it is not transparent to the user.
Not a lawyer. Just the things I notice in my day to day. In the end companies need data protection professionals to navigate these things. Which is probably another thing a startup does not worry about it early on.
Yeah it's fear of GDPR, which is kind of like a retroactive set of standards: "we'll know an infringement when we see it", type vibe. Which of course is kryptonite to innovation, and ultimatly will lead to a more fragmented internet.
As a European I to try see it from both sides, consumer protections are generally a good thing, but it right now being restricted by EU vagueness sucks ass because I just want to play with the cool new toys.
If I'm to venture a guess, it's probably because data protections are stronger and they want to avoid potential issues should someone test GDPR (or whatever the applicable law is) by asking specific data be removed from the model
OpenAI doesn't have this issue so it begs the question which part of the regulation is not compliant and why, if it's just Google being lazy, or if they actively do something sketchy and don't want to stop.
I don't know. I have an organisation account on Anthropic to use Claude 3, and the credit card is norwegian, the phone number is norwegian, the email finishes with a .no, the country in the address says Norway, and the business tax id is a norwegian VAT number. Sounds like they actually don't mind the regulations for businesses.
Because the EU have decided to make it extremely hard for Europeans to benefit from technological advantages trough their GDPR, Cookie Laws and soon AI Act.
Are any of these stable? I mean when using temperature=0, do you get the same reply for the same prompt?
I am using gpt-4-1106-preview quite a lot, but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt. Even when using temperature=0, gpt-4-1106-preview outputs different answers for the same prompt.
> [...] but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt.
I think this is because your approach isn't right. This tech isn't really unit-testable in the same sense. In fact, for many use cases, you may want non-deterministic results by design.
Instead, you probably need evaluations. The idea is that you're still building out "test" cases, but instead of expecting a specific result each time, you get a result that you can score through some means. Each test case produces a score, and you get a rollup score for the suite, and that's how you can track regressions over time.
For example, in our use case, we produce structured JSON that has to match a spec, but we also want to have the contents of that valid-to-spec JSON object be "useful". So there's a function that defines "usefulness" based on some criteria that I've put together since I'm a domain expert. This is something I can evolve over time, using real-world inputs that produce bad or unsatisfying outputs as new evaluations for the evaluation suite.
Fair warning though, it's not very easy to get started with, and there's not a whole lot of information about doing it well online.
This is what I do. I calculate a score over a sample of questions and replies. I'm not doing unit tests.
Comparing the scores of two prompts will not give you a definitive answer which one is superior. But the prediction which one is superior would be better without the noise added by the randomness in the execution of the LLM.
The 'system_fingerprint' in the reply was the same in both of the json responses. So it seems that even when you get the same 'system_fingerprint' back, replies for the same prompt will not be the same.
I like that both OpenAI and Anthropic default to the prepaid mode; I can safely experiment without worrying about selecting a large file by mistake (or worse, a runaway automated process).
Cohere’s Command R+ is unimpressive model, because it agrees with me every time I try to argue with smth like: "But are you sure? ..."; also it has: "last update in January 2023".
Mixtral 8x22B is interesting because 8x7B was one of the best (among all others) for me few months ago (in particular, common knowledge, engineering and high-level math, multi-lingual skills like translation, grammatically nicer rewritings)
I haven't tried it yet, but if the model itself isn't impressive, that 128k context window is. That's the largest I think I've seen for any open weights model.
One of the most attractive features about Mistral open models is that you can build a product on top of their API, and switch to a self hosted version if the need arises, such as customer requesting to run onprem due to privacy requirements, or the API service being taken down.
They let you at least partially separate instructions from data. This is useful for things like "Translate this text to French" - you don't want any instructions in the text you are translating to interfere with that goal.
If this was 100% robust then it would also solve prompt injection, but sadly it isn't.
Still 281GB is huge. That's at the higher end of what we see from other open weight models, and it's not going to fit on anybody's homelab franken-GPU rig. Assuming that 281GB is fp16, it should quantize down to roughly 70GB at 4bits. Still too big for any consumer grade GPU, but accessible on a workstation with enough system ram. Mixtral 8x7B runs surprisingly fast, even on CPUs. Hopefully this 8x22B model will perform similarly.
EDIT: Available here in GGUF format: https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-v0.1-GGUF
The 2-bit quantization comes to 52GB, so worse than my napkin math suggested. Looking forward to giving it a try on my desktop though.
* It's complicated so it takes a while and you need lawyers and such to make it right
* Rules for training are probably hugely vague and undefined. Because you could ingest personal data and it cannot be deleted
* AFAIK it needs to be hosted in Europe (not directly GDPR related, but america has laws that allows them to spy on all traffic in the US, so this is somewhat the counter to that)
In the end from my experience just working at a company that needs to be compliant this usually means:
* All the services need to be hosed in EU including 3rd parties we send any data to
* There needs to be a way (email is enough) to delete user data (including from 3rd parties which need an endpoint so you can trigger it from your side)
* You need to inform the user about the data useage and allow them to opt out of the "usage" of this data for non-essential things (i.e marketing emails). This does not mean you cannot save this data if you also use it for other things, but you can not use it for the non-essential case.
* You could be in trouble if you save data "just because" and do not use it for anything essential or if it is not transparent to the user.
Not a lawyer. Just the things I notice in my day to day. In the end companies need data protection professionals to navigate these things. Which is probably another thing a startup does not worry about it early on.
As a European I to try see it from both sides, consumer protections are generally a good thing, but it right now being restricted by EU vagueness sucks ass because I just want to play with the cool new toys.
If I'm to venture a guess, it's probably because data protections are stronger and they want to avoid potential issues should someone test GDPR (or whatever the applicable law is) by asking specific data be removed from the model
Deleted Comment
What is the downvote coming from, isn't this just facts? If not for the regulation, why would EU be shunned?
Dead Comment
I am using gpt-4-1106-preview quite a lot, but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt. Even when using temperature=0, gpt-4-1106-preview outputs different answers for the same prompt.
I think this is because your approach isn't right. This tech isn't really unit-testable in the same sense. In fact, for many use cases, you may want non-deterministic results by design.
Instead, you probably need evaluations. The idea is that you're still building out "test" cases, but instead of expecting a specific result each time, you get a result that you can score through some means. Each test case produces a score, and you get a rollup score for the suite, and that's how you can track regressions over time.
For example, in our use case, we produce structured JSON that has to match a spec, but we also want to have the contents of that valid-to-spec JSON object be "useful". So there's a function that defines "usefulness" based on some criteria that I've put together since I'm a domain expert. This is something I can evolve over time, using real-world inputs that produce bad or unsatisfying outputs as new evaluations for the evaluation suite.
Fair warning though, it's not very easy to get started with, and there's not a whole lot of information about doing it well online.
Comparing the scores of two prompts will not give you a definitive answer which one is superior. But the prediction which one is superior would be better without the noise added by the randomness in the execution of the LLM.
[0] https://platform.openai.com/docs/api-reference/chat/create#c...
The 'system_fingerprint' in the reply was the same in both of the json responses. So it seems that even when you get the same 'system_fingerprint' back, replies for the same prompt will not be the same.
I like that both OpenAI and Anthropic default to the prepaid mode; I can safely experiment without worrying about selecting a large file by mistake (or worse, a runaway automated process).
Mixtral 8x22B is interesting because 8x7B was one of the best (among all others) for me few months ago (in particular, common knowledge, engineering and high-level math, multi-lingual skills like translation, grammatically nicer rewritings)
The reason to be able to migrate is for me personally a huge plus.
I never have found a need for them. i.e. the example in the article Just prompting like:
Write hello 3 different ways in spanish works fine for me
If this was 100% robust then it would also solve prompt injection, but sadly it isn't.
Deleted Comment