As other commenters have mentioned, the performance of this setup is probably not great, since there's not enough VRAM and a lot of data has to be shuttled between CPU and GPU RAM.
I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (about half of the memory channels populated, so not optimal).
Type IQ2_XXS / 183GB, 16k context:
CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.
CPU + NVIDIA RTX (70GB VRAM): 4.74 t/s for PP and 1.87 t/s for response.
I wish Unsloth would produce a similar quantization for DeepSeek V3 - it would be more useful, as it doesn't emit reasoning tokens, so even at the same t/s it would be faster overall.
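The effect of reasoning tokens on wall-clock time is easy to sketch. The 1.87 t/s generation rate is from the run above; the token counts are illustrative assumptions, not measurements:

```python
# Illustrative only: R1 emits <think> tokens before the answer, so at the
# same generation speed it takes longer end-to-end than a non-reasoning V3.
tg_tps = 1.87            # response t/s from the CPU+GPU run above
answer_tokens = 300      # assumed answer length
reasoning_tokens = 1500  # assumed reasoning overhead for R1

t_v3_min = answer_tokens / tg_tps / 60
t_r1_min = (answer_tokens + reasoning_tokens) / tg_tps / 60
print(round(t_v3_min, 1), round(t_r1_min, 1))  # ~2.7 vs ~16.0 minutes
```

At these (made-up) token counts, the same hardware answers roughly 6x faster without a reasoning phase.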
Thanks a lot for the v2.5! I'll give that a whirl. Hopefully it's as coherent as v3.5 when quantized so small.
> I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
I run the Q2_K_XL and it's perfectly good for me. Where it lacks vs FP8 is in creative writing. If you prompt it for a story a few times, then compare with FP8, you'll see what I mean.
For coding, the 1.58-bit clearly makes more errors than the IQ2_XXS and Q2_K_XL.
DDR4 UDIMM is up to 32GB/module
DDR5 UDIMM is up to 64GB/module[0]
non-Xeon M/B has up to 4 UDIMM slots
-> non-Xeon is up to 128GB/256GB per node
Server motherboards have as many as 16 DIMM slots per socket with RDIMM/LRDIMM support, which allows more modules as well as higher capacity modules to be installed.
0: there was a 128GB UDIMM launched at peak COVID
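The per-node ceilings above follow from simple multiplication (module sizes are the mainstream maximums, ignoring the COVID-era 128GB UDIMM):

```python
# Max RAM per node = module capacity (GB) * number of slots.
DDR4_UDIMM_GB = 32
DDR5_UDIMM_GB = 64
DESKTOP_SLOTS = 4  # typical non-Xeon board

desktop_ddr4 = DDR4_UDIMM_GB * DESKTOP_SLOTS  # 128 GB
desktop_ddr5 = DDR5_UDIMM_GB * DESKTOP_SLOTS  # 256 GB
print(desktop_ddr4, desktop_ddr5)
```

Server boards with 16 RDIMM/LRDIMM slots per socket scale the same arithmetic up by slot count and module capacity.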
There's not much else (other than Epyc) in the way of affordably priced motherboards that have enough cumulative RAM. You can buy a used Dell dual socket older xeon CPU server with 512GB of RAM for test/development purposes for not very much money.
I'm sure this question has been asked before, but why not launch a GPU with more but slower RAM? That would fit bigger models while remaining affordable...
What would you need it for? Not gaming, for sure. AI, you say? Then fork up the cash.
That's Nvidia's current MO. There's more demand for GPUs for AI than there are GPUs available, and most of that demand still has stupid amounts of money behind it (being able to get grants, loans or investment based on potential/hype) - money that can be captured by GPU vendors. Unfortunately, VRAM is the perfect discriminator between "casual" and "monied" use.
(This is not unlike the "SSO tax" - single sign-on is pretty much the perfect discriminator between "enterprise use" and "not enterprise use".)
The convention is weird but it's pretty standard in the industry across all models, particularly GGUF. 671B parameters, quantized to 4 bits. The K_M terminology I believe is more specific to GGUF and describes the specific quantization strategy.
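As a sanity check on the naming, a nominal size estimate (real GGUF files run larger, since K_M variants mix in higher-precision tensors plus metadata):

```python
# size ≈ parameter count * bits per weight / 8 bits per byte
params = 671e9          # DeepSeek R1/V3 total parameters
bits_per_weight = 4.0   # nominal for a Q4 quant; Q4_K_M averages closer to ~4.8
size_gb = params * bits_per_weight / 8 / 1e9
print(round(size_gb, 1))  # ~335.5 GB nominal
```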
Article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual Epyc workstation recipe that was popularized recently?)
That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic
I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1) - but it still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this was recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...
DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.
Requirements (>8 token/s):
380GB CPU Memory
1-8 ARC A770
500GB Disk
If your prompt has 10 tokens, it'll do OK, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.
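To put that in perspective, here's the wall-clock cost of prompt processing at a ~5 t/s PP rate like the one reported elsewhere in the thread (the rate is an assumption for this particular setup):

```python
pp_tps = 4.74  # assumed prompt-processing rate; varies with hardware
for prompt_tokens in (10, 1000, 16000):
    seconds = prompt_tokens / pp_tps
    print(f"{prompt_tokens} tokens -> {seconds:.1f} s")
# A 10-token demo prompt is near-instant; a full 16k context takes ~56 minutes.
```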
To get more than 8 t/s, is one Intel Arc A770 enough?
I’m guessing it’s under 10k.
I also didn’t see tokens per second numbers.
Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.
You also want the capability for more than one full-speed PCI-Express x16 (3.0 at minimum) card, which means you need enough PCI-E lanes, which you aren't going to find on a single-socket Intel workstation motherboard.
Here are a couple of somewhat randomly chosen examples with 512GB of RAM, affordably priced. They'll be power-hungry and noisy. The same general idea applies to other x86-64 hardware from HP, Supermicro, etc. These are fairly common in quantity, so I'm using them as a baseline for specification vs. price. Configurations will be something like 16 x 32GB DDR4 DIMMs.
https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...
https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...
https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...
More than twice as fast as Nvidia 4090 for AI.
Launched last week.
Not in memory bandwidth, which is all that matters for LLM inference.
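A rough model of why bandwidth dominates: each generated token must stream every active weight from memory once, so bandwidth sets a hard ceiling on t/s. The numbers below are illustrative (128 GB/s is roughly one socket of 6-channel DDR4-2666; 37B is DeepSeek's active parameter count per token):

```python
bandwidth_gbs = 128      # ~6-channel DDR4-2666: 2.666 GT/s * 8 B * 6 channels
active_params_b = 37     # MoE: only ~37B of 671B params fire per token
bytes_per_weight = 0.25  # ~2-bit quant
tps_ceiling = bandwidth_gbs / (active_params_b * bytes_per_weight)
print(round(tps_ceiling, 1))  # ~13.8 t/s ceiling; real runs land well below
```

This is why a GPU's raw compute advantage matters little here: an accelerator with twice the FLOPS but similar memory bandwidth generates tokens at roughly the same rate.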
Anyone have a link to this one?