djoldman · 2 years ago
Model card for base: https://huggingface.co/databricks/dbrx-base

> The model requires ~264GB of RAM

I'm wondering when everyone will transition from tracking parameter count vs. evaluation metric to (total GPU RAM + total CPU RAM) vs. evaluation metric.

For example, a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s.

Additionally, all the examples of quantizing recently released superior models to fit on one GPU don't mean the quantized model is a "win." The quantized model is a different model; you need to rerun the metrics.
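
As a rough illustration of why RAM rather than raw parameter count is the thing to track, here's a back-of-envelope, weight-only sketch (it ignores activations and KV cache; the ~132B figure for DBRX is an assumption back-solved from the ~264GB number above):

    def approx_weight_ram_gb(params_billion, bits_per_param):
        # weight-only footprint; ignores activations, KV cache, and framework overhead
        return params_billion * bits_per_param / 8

    print(approx_weight_ram_gb(7, 32))    # 7B at float32: ~28 GB
    print(approx_weight_ram_gb(7, 4))     # 7B at 4-bit:   ~3.5 GB
    print(approx_weight_ram_gb(132, 16))  # ~132B at 16-bit: ~264 GB, in line with the model card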

ml_hardware · 2 years ago
Looks like someone has got DBRX running on an M2 Ultra already: https://x.com/awnihannun/status/1773024954667184196?s=20
resource_waste · 2 years ago
I find calling 500 tokens "running" a stretch.

Cool to play with for a few tests, but I can't imagine using it for anything.

irusensei · 2 years ago
I can run a certain 120B on my M3 Max with 128GB of memory. However, I found that while Q5 "fits", it was extremely slow. The story was different with Q4, though, which ran just fine at around ~3.5-4 t/s.

Now, this model is ~134B, right? It could be bog slow, but on the other hand it's a MoE, so there's a chance it could give satisfactory results.

Mandelmus · 2 years ago
And it appears to run in ~80 GB of RAM via quantisation.
madiator · 2 years ago
That's great, but it did not really write the program that the human asked it to do. :)
dvt · 2 years ago
> a 7B parameter model using float32s will almost certainly outperform a 7B model using float4s

Q5 quantization performs almost on par with base models. Obviously there's some loss there, but this indicates that there's still a lot of compression that we're missing.

jonnycomputer · 2 years ago
I'm still amazed that quantization works at all, coming out as a mild degradation in quality rather than radical dysfunction. Not that I've thought it through that much. Does quantization work with most neural networks?
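
For intuition, here's a minimal sketch of symmetric round-to-nearest quantization, roughly the simplest version of what per-group int4 schemes do; the mild degradation comes down to how little this round-trip error perturbs the outputs:

    import numpy as np

    def quantize_dequantize(w, bits=4):
        # symmetric round-to-nearest over one weight group, then back to float
        qmax = 2 ** (bits - 1) - 1            # 7 for int4
        scale = np.abs(w).max() / qmax        # one scale stored per group
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        return q * scale

    rng = np.random.default_rng(0)
    w = rng.normal(0, 0.02, size=64).astype(np.float32)   # toy weight group
    err = np.abs(w - quantize_dequantize(w)).max()
    print(err)   # small relative to the weight magnitudes, which is why quality degrades gently
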
swalsh · 2 years ago
> The model requires ~264GB of RAM

This feels as crazy as Grok. Was there a generation of models recently where we decided to just crank up the parameter count?

breezeTrowel · 2 years ago
Cranking up the parameter count is literally how the current LLM craze got started. Hence the "large" in "large language model".
Jackson__ · 2 years ago
If you read their blog post, they mention it was pretrained on 12 trillion tokens of text. That is ~5x the amount used for the Llama 2 training runs.

From that, it seems somewhat likely we've hit the wall on improving <X B parameter LLMs by simply scaling up the training data, which basically forces everyone to continue scaling up if they want to keep up with SOTA.

espadrine · 2 years ago
Not recently. GPT-3 from 2020 required even more RAM; the open-source BLOOM from 2022 did too.

In my view, the main value of larger models is distillation (which we particularly witness, for instance, with how Claude Haiku matches release-day GPT-4 despite being less than a tenth of the cost). Hopefully the distilled models will be easier to run.

wrs · 2 years ago
Isn’t that pretty much the last 12 months?
vlovich123 · 2 years ago
I thought float4 sacrificed a negligible amount of evaluation quality for an 8x reduction in RAM?
Taek · 2 years ago
For smaller models, the quality drop is meaningful. For larger ones like this one, the quality drop is negligible.
Y_Y · 2 years ago
A free lunch? Wouldn't that be nice! Sometimes the quantization process improves the accuracy a little (probably by implicit regularization) but a model that's at or near capacity (as it should be) is necessarily hurt by throwing away most of the information. Language models often quantize well to small fixed-point types like int4, but it's not a magic wand.
dheera · 2 years ago
I'm more wondering when we'll have algorithms that will "do their best" given the resources they detect.

That would be what I call artificial intelligence.

Giving up because "out of memory" is not intelligence.

falcor84 · 2 years ago
I suppose you could simulate dementia by loading as many of the weights as space permits and then just stopping. Then during inference, replace the missing weights with calls to random(). I'd actually be interested in seeing the results.
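
A hypothetical sketch of that experiment in PyTorch, copying weights until a byte budget runs out and noise-filling the rest (here small Gaussian noise stands in for random(); the budget, noise scale, and function name are illustrative, not any real partial-loading API):

    import torch

    def load_with_dementia(model, state_dict, budget_bytes):
        # copy real weights until the budget is exhausted; "forget" everything after
        used = 0
        with torch.no_grad():
            for name, param in model.named_parameters():
                size = param.numel() * param.element_size()
                if name in state_dict and used + size <= budget_bytes:
                    param.copy_(state_dict[name])
                    used += size
                else:
                    param.normal_(mean=0.0, std=0.02)   # the missing "memories"
        return model
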
visarga · 2 years ago
No, but some model-serving tools like llama.cpp do their best. It's just a matter of choosing the right serving tools. And I am not sure LLMs could not optimize their memory layout. Why not? Just let them play with this and learn. You can do pretty amazing things with evolutionary methods where the LLMs are the mutation operator. You evolve a population of solutions. (https://arxiv.org/abs/2206.08896)
coldtea · 2 years ago
>Giving up because "out of memory" is not intelligence.

When people can't remember the facts/theory/formulas needed to answer some test question, or can't memorize some complicated information because it's too much, they usually give up too.

So, giving up because of "out of memory" sure sounds like intelligence to me.

hintymad · 2 years ago
Just curious, what business benefit will Databricks get by spending potentially millions of dollars on an open LLM?
ramoz · 2 years ago
Their goal is to always drive enterprise business towards consumption.

With AI, they desperately need to steer the narrative away from API-based services (OpenAI).

By training LLMs, they build sales artifacts (stories, references, even accelerators with LLMs themselves) to paint the pictures needed to convince their enterprise customer market that Databricks is the platform for enterprise AI. Their blog details how the entire end to end process was done on the platform.

In other words, Databricks spent millions as an aid in influencing their customers to do the same (on Databricks).

hintymad · 2 years ago
Thanks! Why do they not focus on hosting other open models then? I suspect other models will soon catch up with their advantages in faster inference and better benchmark results. That said, maybe the advantage is aligned interests: they want customers to use their platforms, so they can keep their models open. In contrast, Mistral removed their commitment to open source as they found a potential path to profitability.
anonymousDan · 2 years ago
Do they use spark for the training?
dhoe · 2 years ago
It's an image-enhancement measure, if you will. Databricks' customers mostly use it as an ETL tool, but it benefits them to be perceived as more than that.
spxneo · 2 years ago
You can improve your brand for a lot less. I just don't understand why they would throw all their chips into a losing race.

Azure already runs on-premises if I'm not mistaken, Claude 3 is out... but DBRX already falls so far behind.

I just don't get it.

blitzar · 2 years ago
An increased valuation at IPO later this year.
qrios · 2 years ago
Instead of spending tens of millions of dollars, Databricks could buy databricks.ai; it's for sale.

But really, I prefer to have as many players as possible in the field of _open_ models available.

BoorishBears · 2 years ago
Databricks is trying to go all-in on convincing organizations they need to use in-house models, and therefore pay them to provide LLMOps.

They're so far into this that their CTO co-authored a borderline dishonest study which got a ton of traction last summer trying to discredit GPT-4: https://arxiv.org/pdf/2307.09009.pdf

galaxyLogic · 2 years ago
I can see a business model for in-house LLMs: training a model on knowledge about their products and then somehow getting that knowledge into a generally available LLM platform.

I recently asked Google to explain how to delete a sender-recorded voice message I had created in WhatsApp. I got totally erroneous results back. Maybe that's because it is a rather new feature in WhatsApp.

It would be in WhatsApp's interest to get accurate answers about it into Google's LLM. So Google might make a deal requiring WhatsApp to pay Google for regular updates on WhatsApp's current features. WhatsApp's owner, Meta, is of course a competitor to Google, so Google may not care much about providing up-to-date info about WhatsApp in their LLM. But they might if Meta paid them.

omeze · 2 years ago
What does borderline dishonest mean? I only read the abstract, and it seems like such an obvious point that I don't see how it's contentious.
guluarte · 2 years ago
Nothing, but they will brag about it to get more money from investors.
XCSme · 2 years ago
I am planning to buy a new GPU.

If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

PheonixPharts · 2 years ago
While GPUs are still the kings of speed, if you are worried about VRAM I do recommend a maxed out Mac Studio.

Llama.cpp + quantized models on Apple Silicon is an incredible experience, and having 192 GB of unified memory to work with means you can run models that just aren't feasible on a home GPU setup.

It really boils down to what type of local development you want to do. I'm mostly experimenting with things where the time to response isn't that big of a deal, and not fine-tuning the models locally (which I also believe GPUs are still superior for). But if your concern is "how big of a model can I run" vs "Can I have close to real time chat", the unified memory approach is superior.
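
For anyone curious what that looks like in practice, here's a minimal sketch using the llama-cpp-python bindings (the GGUF file name is just an illustration; any quantized model you've already downloaded works the same way):

    from llama_cpp import Llama

    # n_gpu_layers=-1 offloads every layer to Metal on Apple Silicon
    llm = Llama(model_path="models/some-model-Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096)
    out = llm("Explain unified memory in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])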

bevekspldnw · 2 years ago
I had gone the Mac Studio route initially, but I ended up getting an A6000 for about the same price as a Mac and putting that in a Linux server under my desk. Ollama makes it dead simple to serve it over my local network, so I can be on my M1 Air and use it no differently than if it were running on my laptop. The difference is that the A6000 absolutely smokes the Mac.
bee_rider · 2 years ago
I know the M?-Pro and Ultra variants are multiple standard M?'s in a single package. But do the CPUs and GPUs share a die? (As in, a single die comes with something like a 4-P-core CPU plus 10 GPU cores, and the more exotic variants are just a result of LEGO-ing those dies together and disabling some cores for market segmentation or because they had defects?)

I guess I'm wondering if they technically could throw down the gauntlet and compete with Nvidia by doing something like a 4-CPU/80-GPU/256 GB chip, if they wanted to. Seems like it'd be a really appealing ML machine. (I could also see it being technically possible but Apple just deciding that's pointlessly niche for them.)

XCSme · 2 years ago
I already have 128GB of RAM (DDR4) and was wondering if upgrading from a 1080 Ti (12GB) to a 4070 Ti Super (16GB) would make a big difference.

I assume the FP32 and FP16 operations are already a huge improvement, but also the 33% increased VRAM might lead to fewer swaps between VRAM and RAM.

brandall10 · 2 years ago
Wait for the M3 Ultra and it will be 256GB and markedly faster.
spxneo · 2 years ago
Aren't quantized models different models outright, requiring a new evaluation to know the deviation in performance? Or are they "good enough" in that the benefits outweigh the deviation?

I'm on the fence about whether to spend 5 digits or 4 digits. Do I go the Mac Studio route or GPUs? What are the pros and cons?

purpleblue · 2 years ago
Aren't the Macs good for inference but not for training or fine tuning?
llm_trw · 2 years ago
>If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

No, it can't run at all.

>I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

That is not Mixtral, that is Mistral 7B. The 1080 Ti is slower than running inference on current-generation Threadripper CPUs.

XCSme · 2 years ago
> No, it can't run at all.

https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg

EDIT: This was run on a 1080 Ti + 5900X. Initial generation takes around 10-30 seconds (like it has to upload the model to the GPU), but then it starts answering immediately, at around 3 words per second.

XCSme · 2 years ago
I have these:

dolphin-mixtral:latest (24.6GB), mistral:latest (3.8GB)

The CPU is a 5900X.

lxe · 2 years ago
Get 2 pre-owned 3090s. You will easily be able to run 70B or even 120B quantized models.
jasonjmcghee · 2 years ago
> mixtral works well

Do you mean mistral?

Mixtral is 8x7B and requires like 100GB of RAM

Edit: that's without quant; as others have pointed out, it can definitely be lower, but I haven't heard of a 3.4GB version.
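
A rough back-of-envelope, assuming Mixtral 8x7B's ~47B total parameters (experts plus shared layers), lines up with the numbers in this thread:

    total_params = 46.7e9   # approx. Mixtral 8x7B total parameter count
    for name, bits in [("fp16", 16), ("~4.5-bit quant (Q4_K_M)", 4.5)]:
        print(name, round(total_params * bits / 8 / 1e9), "GB")
    # fp16: ~93 GB, 4.5-bit: ~26 GB -- close to the ~100GB and ~25GB figures in this thread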

kwerk · 2 years ago
I have two 3090s and it runs fine with `ollama run mixtral`. Although OP definitely meant Mistral with the 7B note.
ranger_danger · 2 years ago
I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it only requires 25GB.
XCSme · 2 years ago
I have 128GB, but something is weird with Ollama. Even though I only allow the Ollama Docker container 90GB, it ends up using 128GB/128GB, so the system becomes very slow (the mouse freezes).
K0balt · 2 years ago
I run the Mixtral 6-bit quant very happily on my MacBook with 64 GB.
Havoc · 2 years ago
The smaller quants still require a 24GB card. 16GB might work, but I doubt it.
XCSme · 2 years ago
Sorry, it was from memory.

I have these models in Ollama:

dolphin-mixtral:latest (24.6GB), mistral:latest (3.8GB)

chpatrick · 2 years ago
The quantized one works fine on my 24GB 3090.
Zambyte · 2 years ago
I genuinely recommend considering AMD options. I went with a 7900 XTX because it has the most VRAM for any $1000 card (24 GB). NVIDIA cards at that price point are only 16 GB. Ollama and other inference software works on ROCm, generally with at most setting an environment variable now. I've even run Ollama on my Steam Deck with GPU inferencing :)
XCSme · 2 years ago
I ended up getting a 2nd hand 3090 for 680€.

Funnily enough, I think the card is new (smells new) and unused; most likely a scalper bought it and couldn't sell it.

speedylight · 2 years ago
Quantized models will run well; otherwise inference might be really, really slow, or the client crashes altogether with some CUDA out-of-memory error.
briandw · 2 years ago
Worse than the chart crime of truncating the y-axis is putting Llama 2's HumanEval scores on there and not comparing it to Code Llama Instruct 70B. DBRX still beats Code Llama Instruct's 67.8, but not by that much.
jjgo · 2 years ago
> "On HumanEval, DBRX Instruct even surpasses CodeLLaMA-70B Instruct, a model built explicitly for programming, despite the fact that DBRX Instruct is designed for general-purpose use (70.1% vs. 67.8% on HumanEval as reported by Meta in the CodeLLaMA blog)."

To be fair, they do compare to it in the main body of the blog. It's just probably misleading to compare to CodeLLaMA on non-coding benchmarks.

tartrate · 2 years ago
Which non-coding benchmark?
panarky · 2 years ago
> chart crime of truncating the y axis

If you chart the temperature of the ocean do you keep the y-axis anchored at zero Kelvin?

d-z-m · 2 years ago
If you chart the temperature of the ocean are you measuring it in Kelvin?
underlines · 2 years ago
Waiting for mixed quantization with HQQ and MoE offloading [1]. With that I was able to run Mixtral 8x7B on my 10 GB VRAM RTX 3080... This should work for DBRX and should shave off a ton of the VRAM requirement.

1. https://github.com/dvmazur/mixtral-offloading?tab=readme-ov-...

jerpint · 2 years ago
Per the paper, 3072 H100s over the course of 3 months; assume a cost of $2/GPU/hour.

That would be roughly $13.5M USD.
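
The arithmetic, with those assumptions spelled out:

    gpus = 3072
    hours = 3 * 30 * 24          # ~3 months of wall-clock time
    dollars_per_gpu_hour = 2.0   # assumed rental price
    print(gpus * hours * dollars_per_gpu_hour / 1e6)   # ~13.3 million USD, same ballpark as above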

I'm guessing that at this scale and cost, this model is not competitive and their ambition is to scale to much larger models. In the meantime, they learned a lot and gained PR from open-sourcing it.

petesergeant · 2 years ago
This makes me bearish on OpenAI as a company. When a cloud company can offer a strong model for free by selling the compute, what competitive advantage does a company that wants you to pay for the model have left? Feels like they might get Netscape'd.
MP_1729 · 2 years ago
OpenAI is not the worst off; ChatGPT is used by 100M people weekly, so it's sort of insulated from benchmarks. The best of the rest, Anthropic, should be really scared.
ianbutler · 2 years ago
The approval process on the base model is not feeling very open. Plenty of people are still waiting on a chance to download it, whereas the instruct model was an instant approval. The base model is more interesting to me for fine-tuning.
blueblimp · 2 years ago
The license allows you to reproduce/distribute/copy, so I'm a little surprised there's an approval process at all.
ianbutler · 2 years ago
Yeah, it's kind of weird. I'll assume for now that they're just busy, but I'd be lying if I said my gut didn't immediately call it kind of sketchy.
Chamix · 2 years ago
4chan already has a torrent out, of course.
ianbutler · 2 years ago
FWIW looks like people are getting access now.