I'm wondering when everyone will transition from tracking parameter count vs evaluation metric to (total GPU RAM + total CPU RAM) vs evaluation metric.
For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.
Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model, so you need to rerun the metrics.
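To make the memory axis concrete, here's a weights-only back-of-the-envelope sketch (pure arithmetic, not tied to any particular model):

    params = 7e9  # a 7B-parameter model
    bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}
    for dtype, nbytes in bytes_per_param.items():
        # weights only; activations, KV cache and runtime overhead add more on top
        print(f"{dtype}: {params * nbytes / 1e9:.1f} GB")  # ~28, 14, 7, 3.5 GB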
I can run a certain 120B on my M3 Max with 128GB of memory. However, I found that while it “fits”, Q5 was extremely slow. The story was different with Q4, though, which ran just fine at around 3.5-4 t/s.
Now this model is ~134B, right? It could be bog slow, but on the other hand it's a MoE, so there's a chance the results could be satisfactory.
> a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s
Q5 quantization performs almost on par with base models. Obviously there's some loss there, but this indicates that there's still a lot of compression that we're missing.
I'm still amazed that quantization works at all, coming out as a mild degradation in quality rather than radical dysfunction. Not that I've thought it through that much. Does quantization work with most neural networks?
If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~6x the amount used for the Llama 2 training runs.
From that, it seems somewhat likely we've hit a wall on improving LLMs below a given parameter count by simply scaling up the training data, which basically forces everyone to keep scaling up model size if they want to keep up with SOTA.
Not recently. GPT-3 from 2020 required even more RAM; so did the open-source BLOOM from 2022.
In my view, the main value of larger models is distillation (as we see, for instance, with how Claude Haiku matches release-day GPT-4 at less than a tenth of the cost). Hopefully the distilled models will be easier to run.
A free lunch? Wouldn't that be nice! Sometimes the quantization process improves the accuracy a little (probably by implicit regularization) but a model that's at or near capacity (as it should be) is necessarily hurt by throwing away most of the information. Language models often quantize well to small fixed-point types like int4, but it's not a magic wand.
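For intuition, a minimal sketch of the kind of fixed-point quantization being talked about here: per-tensor symmetric int4 round-tripping, much cruder than what llama.cpp-style quantizers actually do:

    import numpy as np

    def quantize_int4(w):
        # map floats to integers in [-7, 7] with a single scale factor
        scale = np.abs(w).max() / 7.0
        q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_int4(w)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"mean absolute rounding error: {err:.4f}")  # small, but not zero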
I suppose you could simulate dementia by loading as many of the weights as space permits and then just stopping. Then, during inference, replace the missing weights with calls to random(). I'd actually be interested in seeing the results.
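A hand-wavy sketch of that experiment, assuming a PyTorch model and a hypothetical fits_in_budget() check for which tensors made it into memory:

    import torch

    def randomize_missing_weights(model, fits_in_budget):
        # fits_in_budget(name) is a hypothetical check: did this tensor fit in memory?
        with torch.no_grad():
            for name, p in model.named_parameters():
                if not fits_in_budget(name):
                    # replace the missing tensor with noise of roughly matching scale
                    p.copy_(torch.randn_like(p) * p.std())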
No, but some model-serving tools like llama.cpp do their best. It's just a matter of choosing the right serving tools. And I'm not sure LLMs couldn't optimize their own memory layout. Why not? Just let them play with this and learn. You can do pretty amazing things with evolutionary methods where the LLMs are the mutation operator: you evolve a population of solutions (https://arxiv.org/abs/2206.08896).
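The loop is roughly this, where random_candidate(), fitness() and llm_mutate() are hypothetical stand-ins (a candidate could be, say, a memory-layout description):

    # evolve a population of candidate solutions, using an LLM as the mutation operator
    population = [random_candidate() for _ in range(16)]
    for generation in range(50):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:4]                                   # keep the fittest few
        children = [llm_mutate(p) for p in parents for _ in range(3)]
        population = parents + children                        # back to 16 candidates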
>Giving up because "out of memory" is not intelligence.
When people can't remember the facts/theory/formulas needed to answer some test question, or can't memorize some complicated information because it's too much, they usually give up too.
So, giving up because of "out of memory" sure sounds like intelligence to me.
Their goal is to always drive enterprise business towards consumption.
With AI they desperately need to steer the narrative away from API-based services (OpenAI).
By training LLMs, they build sales artifacts (stories, references, even accelerators built with LLMs themselves) to paint the picture needed to convince their enterprise customer market that Databricks is the platform for enterprise AI. Their blog details how the entire end-to-end process was done on the platform.
In other words, Databricks spent millions as an aid in influencing their customers to do the same (on Databricks).
Thanks! Why do they not focus on hosting other open models then? I suspect other models will soon catch up with its advantages in inference speed and benchmark results. That said, maybe the advantage is aligned interests: they want customers to use their platform, so they can afford to keep their models open. In contrast, Mistral removed their commitment to open source once they found a potential path to profitability.
It's an image-enhancement move, if you like. Databricks' customers mostly use it as an ETL tool, but it benefits them to be perceived as more than that.
Databricks is trying to go all-in on convincing organizations they need to use in-house models, and therefore pay them to provide LLMOps.
They're so far into this that their CTO co-authored a borderline dishonest study which got a ton of traction last summer trying to discredit GPT-4: https://arxiv.org/pdf/2307.09009.pdf
I can see a business model for in-house LLMs: training a model on knowledge about a company's products and then somehow getting that knowledge into a generally available LLM platform.
I recently tried to ask Google to explain how to delete a sender-recorded voice message I had created in WhatsApp. I got totally erroneous results back. Maybe it was because that is a rather new feature in WhatsApp.
It would be in WhatsApp's interest to get accurate answers about it into Google's LLM. So Google might make a deal requiring WhatsApp to pay Google for regular updates about current WhatsApp features. Of course, WhatsApp's owner, Meta, is a competitor to Google, so Google may not care much about providing up-to-date info about WhatsApp in their LLM. But they might if Meta paid them.
If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well?
Also, does it run considerably better than on a GPU with 12GB of VRAM?
I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).
While GPUs are still the kings of speed, if you are worried about VRAM I do recommend a maxed out Mac Studio.
Llama.cpp + quantized models on Apple Silicon is an incredible experience, and having 192 GB of unified memory to work with means you can run models that just aren't feasible on a home GPU setup.
It really boils down to what type of local development you want to do. I'm mostly experimenting with things where time to response isn't that big of a deal, and not fine-tuning models locally (which I also believe GPUs are still superior for). But if your concern is "how big of a model can I run" vs "can I have close to real-time chat", the unified memory approach is superior.
I had gone the Mac Studio route initially, but I ended up getting an A6000 for about the same price as a Mac and putting that in a Linux server under my desk. Ollama makes it dead simple to serve it over my local network, so I can be on my M1 Air and use it no differently than if it were running on my laptop. The difference is that the A6000 absolutely smokes the Mac.
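For anyone wondering what that looks like in practice: Ollama exposes a small HTTP API, so from the laptop it's just a POST to the server (the IP below is made up, and the server has to be listening on the LAN interface):

    import requests

    # 192.168.1.50 stands in for the Linux box with the A6000
    resp = requests.post(
        "http://192.168.1.50:11434/api/generate",
        json={"model": "mixtral", "prompt": "Explain MoE routing briefly.", "stream": False},
    )
    print(resp.json()["response"])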
I know the M?-Pro and Ultra variants are multiple standard M? dies in a single package. But do the CPUs and GPUs share a die (i.e., a single die comes with, say, a 4 P-core CPU and 10 GPU cores, and the more exotic variants are just a result of LEGO-ing those together and disabling some cores for market segmentation or because they had defects)?
I guess I'm wondering if they technically could throw down the gauntlet and compete with Nvidia by doing something like a 4 CPU/80 GPU/256 GB chip, if they wanted to. Seems like it'd be a really appealing ML machine. (I could also see it being technically possible but Apple just deciding that's pointlessly niche for them.)
Aren't quantized models different models outright requiring a new evaluation to know the deviation in performance? Or are they "good enough" in that the benefits outweigh the deviation?
I'm on the fence about whether to spend 5 digits or 4 digits. Do I go the Mac Studio route or GPUs? What are the pros and cons?
>If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?
No, it can't run at all.
>I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).
That is not Mixtral, that is Mistral 7B. The 1080 Ti is slower than running inference on current-generation Threadripper CPUs.
EDIT: This was run on a 1080 Ti + 5900X. Initial generation takes around 10-30 seconds (it seems like it has to upload the model to the GPU), but then it starts answering immediately, at around 3 words per second.
I have 128GB, but something is weird with Ollama. Even though I only allow the Ollama Docker container 90GB, it ends up using 128GB/128GB, so the system becomes very slow (the mouse freezes).
I genuinely recommend considering AMD options. I went with a 7900 XTX because it has the most VRAM for any $1000 card (24 GB). NVIDIA cards at that price point are only 16 GB. Ollama and other inference software works on ROCm, generally with at most setting an environment variable now. I've even run Ollama on my Steam Deck with GPU inferencing :)
Worse than the chart crime of truncating the y-axis is putting Llama 2's HumanEval score on there and not comparing it to Code Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8, but not by that much.
> "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct, a model built explicitly for programming, despite the fact that DBRX Instruct is designed for general-purpose use (70.1% vs. 67.8% on HumanEval as reported by Meta in the CodeLLaMA blog)."
To be fair, they do compare to it in the main body of the blog. It's just probably misleading to compare to CodeLLaMA on non coding benchmarks.
Waiting for mixed quantization with HQQ and MoE offloading [1]. With that I was able to run Mixtral 8x7B on my 10 GB VRAM RTX 3080... This should work for DBRX too and should shave off a ton of the VRAM requirement.
Per the paper, 3072 H100s over the course of 3 months; assume a cost of $2/GPU/hour.
That would be roughly $13.5M USD.
I'm guessing that at this scale and cost, this model is not competitive and their ambition is to scale to much larger models. In the meantime, they learned a lot and gained PR from open-sourcing.
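For what it's worth, the arithmetic behind that estimate (the $2/GPU/hour rate is the assumption above, not a published figure):

    gpus = 3072
    hours = 24 * 365 / 4                          # ~3 months of wall-clock time
    rate = 2.0                                    # assumed $ per GPU-hour
    print(f"${gpus * hours * rate / 1e6:.1f}M")   # ≈ $13.5M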
This makes me bearish on OpenAI as a company. When a cloud company can offer a strong model for free by selling the compute, what competitive advantage does a company that wants you to pay for the model have left? Feels like they might get Netscape'd.
OpenAI is not the worst off: ChatGPT is used by 100M people weekly, so it's somewhat insulated from benchmarks. The best of the rest, Anthropic, should be really scared.
The approval process for the base model does not feel very open. Plenty of people are still waiting on a chance to download it, whereas the instruct model was an instant approval. The base model is more interesting to me for fine-tuning.
> The model requires ~264GB of RAM
Cool to play with for a few tests, but I can't imagine using it for anything.
This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?
That would be what I call artificial intelligence.
Giving up because "out of memory" is not intelligence.
Azure already runs on-premises if I'm not mistaken, Claude 3 is out... but DBRX already falls so far behind.
I just don't get it.
But really, I prefer to have as many players as possible in the field of _open_ models.
I assume the FP32 and FP16 operations are already a huge improvement, but the 33% increase in VRAM might also lead to fewer swaps between VRAM and RAM.
https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg
I have those models in Ollama: dolphin-mixtral:latest (24.6GB) and mistral:latest (3.8GB). The CPU is a 5900X.
Do you mean mistral?
mixtral is 8x7B and requires like 100GB of RAM
Edit: without quant, as others have pointed out, it can definitely be lower, but I haven't heard of a 3.4GB version.
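As a sanity check on that figure, taking Mixtral 8x7B's widely reported ~46.7B total parameters (the experts share attention weights, so it's well under 8 × 7B):

    total_params = 46.7e9             # Mixtral 8x7B total parameter count (approx.)
    print(total_params * 2 / 1e9)     # ~93 GB of weights at fp16, hence the "like 100GB"
    print(total_params * 0.5 / 1e9)   # ~23 GB at 4-bit, roughly why quantized builds fit in ~25GB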
Funnily enough, I think the card is new (it smells new) and unused; most likely a scalper bought it and couldn't sell it.
If you chart the temperature of the ocean, do you keep the y-axis anchored at zero kelvin?
1. https://github.com/dvmazur/mixtral-offloading?tab=readme-ov-...