Three major LLM releases in 24 hours

loudmax · a year ago

For those primarily interested in open weight models, that Mixtral 8x22B is really intriguing. The Mistral models have tended to outperform other models with similar parameter counts.

Still 281GB is huge. That's at the higher end of what we see from other open weight models, and it's not going to fit on anybody's homelab franken-GPU rig. Assuming that 281GB is fp16, it should quantize down to roughly 70GB at 4bits. Still too big for any consumer grade GPU, but accessible on a workstation with enough system ram. Mixtral 8x7B runs surprisingly fast, even on CPUs. Hopefully this 8x22B model will perform similarly.

EDIT: Available here in GGUF format: https://huggingface.co/MaziyarPanahi/Mixtral-8x22B-v0.1-GGUF

The 2-bit quantization comes to 52GB, so worse than my napkin math suggested. Looking forward to giving it a try on my desktop though.

lambdaba · a year ago

Does anyone know what's up with the models that aren't available in Europe? There isn't any transparency over this.

junon · a year ago

They probably aren't up to GDPR standards. Take from that what you will. It's the typical reason why they don't release here.

MyAccountYo · a year ago

I think it is mostly for these reasons:

* It's complicated so it takes a while and you need lawyers and such to make it right

* Rules for training are probably hugely vague and undefined. Because you could ingest personal data and it cannot be deleted

* AFAIK it needs to be hosted in Europe (not directly GDPR related, but america has laws that allows them to spy on all traffic in the US, so this is somewhat the counter to that)

In the end from my experience just working at a company that needs to be compliant this usually means:

* All the services need to be hosed in EU including 3rd parties we send any data to

* There needs to be a way (email is enough) to delete user data (including from 3rd parties which need an endpoint so you can trigger it from your side)

* You need to inform the user about the data useage and allow them to opt out of the "usage" of this data for non-essential things (i.e marketing emails). This does not mean you cannot save this data if you also use it for other things, but you can not use it for the non-essential case.

* You could be in trouble if you save data "just because" and do not use it for anything essential or if it is not transparent to the user.

Not a lawyer. Just the things I notice in my day to day. In the end companies need data protection professionals to navigate these things. Which is probably another thing a startup does not worry about it early on.

lagrange77 · a year ago

How can a model not be up to GDPR standards? Or are you talking about the services that provide those models? Genuinely interested.

OscarTheGrinch · a year ago

Yeah it's fear of GDPR, which is kind of like a retroactive set of standards: "we'll know an infringement when we see it", type vibe. Which of course is kryptonite to innovation, and ultimatly will lead to a more fragmented internet.

As a European I to try see it from both sides, consumer protections are generally a good thing, but it right now being restricted by EU vagueness sucks ass because I just want to play with the cool new toys.

passwordoops · a year ago

pure speculation

If I'm to venture a guess, it's probably because data protections are stronger and they want to avoid potential issues should someone test GDPR (or whatever the applicable law is) by asking specific data be removed from the model

dontupvoteme · a year ago

They're worried about GDPR with chatbots, but e.g. Claude is available via API.

speedgoose · a year ago

How hard is it to ask for informed consent?

Deleted Comment

karmasimida · a year ago

Regulation too cumbersome, not worth it.

What is the downvote coming from, isn't this just facts? If not for the regulation, why would EU be shunned?

black3r · a year ago

OpenAI doesn't have this issue so it begs the question which part of the regulation is not compliant and why, if it's just Google being lazy, or if they actively do something sketchy and don't want to stop.

speedgoose · a year ago

I don't know. I have an organisation account on Anthropic to use Claude 3, and the credit card is norwegian, the phone number is norwegian, the email finishes with a .no, the country in the address says Norway, and the business tax id is a norwegian VAT number. Sounds like they actually don't mind the regulations for businesses.

Zetobal · a year ago

If other entities do the same thing without problems it's most of the time a you problem.

Dead Comment

ThomPete · a year ago

Because the EU have decided to make it extremely hard for Europeans to benefit from technological advantages trough their GDPR, Cookie Laws and soon AI Act.

saikia81 · a year ago

Safety does make things harder for those that want to abuse us. We don't want technology at all costs.

mg · a year ago

Are any of these stable? I mean when using temperature=0, do you get the same reply for the same prompt?

I am using gpt-4-1106-preview quite a lot, but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt. Even when using temperature=0, gpt-4-1106-preview outputs different answers for the same prompt.

phillipcarter · a year ago

> [...] but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt.

I think this is because your approach isn't right. This tech isn't really unit-testable in the same sense. In fact, for many use cases, you may want non-deterministic results by design.

Instead, you probably need evaluations. The idea is that you're still building out "test" cases, but instead of expecting a specific result each time, you get a result that you can score through some means. Each test case produces a score, and you get a rollup score for the suite, and that's how you can track regressions over time.

For example, in our use case, we produce structured JSON that has to match a spec, but we also want to have the contents of that valid-to-spec JSON object be "useful". So there's a function that defines "usefulness" based on some criteria that I've put together since I'm a domain expert. This is something I can evolve over time, using real-world inputs that produce bad or unsatisfying outputs as new evaluations for the evaluation suite.

Fair warning though, it's not very easy to get started with, and there's not a whole lot of information about doing it well online.

mg · a year ago

This is what I do. I calculate a score over a sample of questions and replies. I'm not doing unit tests.

Comparing the scores of two prompts will not give you a definitive answer which one is superior. But the prediction which one is superior would be better without the noise added by the randomness in the execution of the LLM.

tomschwiha · a year ago

Wouldn't be for reproducability the usage of a seed be a better fit?

[0] https://platform.openai.com/docs/api-reference/chat/create#c...

mg · a year ago

I just tried the same prompt twice, both with

    'temperature': 0,
    'seed': 1,

And I got two different replies.

The 'system_fingerprint' in the reply was the same in both of the json responses. So it seems that even when you get the same 'system_fingerprint' back, replies for the same prompt will not be the same.

Xenoamorphous · a year ago

Aren't these models non-deterministic by nature?

kragen · a year ago

no, they run on computers. anything that happens in a computer can be made reproducible

jeswin · a year ago

Does Gemini have a prepaid mode?

I like that both OpenAI and Anthropic default to the prepaid mode; I can safely experiment without worrying about selecting a large file by mistake (or worse, a runaway automated process).

novaRom · a year ago

Cohere’s Command R+ is unimpressive model, because it agrees with me every time I try to argue with smth like: "But are you sure? ..."; also it has: "last update in January 2023".

Mixtral 8x22B is interesting because 8x7B was one of the best (among all others) for me few months ago (in particular, common knowledge, engineering and high-level math, multi-lingual skills like translation, grammatically nicer rewritings)

loudmax · a year ago

I haven't tried it yet, but if the model itself isn't impressive, that 128k context window is. That's the largest I think I've seen for any open weights model.

tarruda · a year ago

One of the most attractive features about Mistral open models is that you can build a product on top of their API, and switch to a self hosted version if the need arises, such as customer requesting to run onprem due to privacy requirements, or the API service being taken down.

tomschwiha · a year ago

Using Mistral together with groq is amazing.

The reason to be able to migrate is for me personally a huge plus.

kragen · a year ago

i'm interested to hear what kinds of things you're doing with it!

synergy20 · a year ago

which groq is that,the chip or the gpt?

rel2thr · a year ago

is the point of system prompts just to avoid prompt injection? or are they supposed to get better outputs too?

I never have found a need for them. i.e. the example in the article Just prompting like:

Write hello 3 different ways in spanish works fine for me

simonw · a year ago

They let you at least partially separate instructions from data. This is useful for things like "Translate this text to French" - you don't want any instructions in the text you are translating to interfere with that goal.

If this was 100% robust then it would also solve prompt injection, but sadly it isn't.