As other commenters have mentioned, the performance of this setup is probably not great, since there's not enough VRAM and a lot of data has to be shuttled between CPU and GPU RAM.
I tried that Unsloth R1 quantization on my dual Xeon Gold 5218 with 384 GB DDR4-2666 (about half of the memory channels populated, so not optimal).
Type IQ2_XXS / 183GB, 16k context:
CPU only: 3 t/s (tokens per second) for PP (prompt processing) and 1.44 t/s for response.
CPU + NVIDIA RTX (70GB VRAM): 4.74 t/s for PP and 1.87 t/s for response.
I wish Unsloth would produce a similar quantization for DeepSeek V3 - it would be more useful, as it doesn't emit reasoning tokens, so even at the same t/s it would be faster overall.
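The effect of reasoning tokens on wall-clock time is easy to sketch. The 1.87 t/s generation rate is from the run above; the token counts are illustrative assumptions, not measurements:

```python
# Illustrative only: R1 emits <think> tokens before the answer, so at the
# same generation speed it takes longer end-to-end than a non-reasoning V3.
tg_tps = 1.87            # response t/s from the CPU+GPU run above
answer_tokens = 300      # assumed answer length
reasoning_tokens = 1500  # assumed reasoning overhead for R1

t_v3_min = answer_tokens / tg_tps / 60
t_r1_min = (answer_tokens + reasoning_tokens) / tg_tps / 60
print(round(t_v3_min, 1), round(t_r1_min, 1))  # ~2.7 vs ~16.0 minutes
```

At these (made-up) token counts, the same hardware answers roughly 6x faster without a reasoning phase.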
Thanks a lot for the v2.5! I'll give that a whirl. Hopefully it's as coherent as v3.5 when quantized so small.
> I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
I run the Q2_K_XL and it's perfectly good for me. Where it lacks vs FP8 is in creative writing. If you prompt it for a story a few times, then compare with FP8, you'll see what I mean.
For coding, the 1.58-bit clearly makes more errors than the IQ2_XXS and Q2_K_XL.
DDR4 UDIMM is up to 32GB/module
DDR5 UDIMM is up to 64GB/module[0]
non-Xeon M/B has up to 4 UDIMM slots
-> non-Xeon is up to 128GB/256GB per node
Server motherboards have as many as 16 DIMM slots per socket with RDIMM/LRDIMM support, which allows more modules as well as higher capacity modules to be installed.
0: there was a 128GB UDIMM launched at peak COVID
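The per-node ceilings above follow from simple multiplication (module sizes are the mainstream maximums, ignoring the COVID-era 128GB UDIMM):

```python
# Max RAM per node = module capacity (GB) * number of slots.
DDR4_UDIMM_GB = 32
DDR5_UDIMM_GB = 64
DESKTOP_SLOTS = 4  # typical non-Xeon board

desktop_ddr4 = DDR4_UDIMM_GB * DESKTOP_SLOTS  # 128 GB
desktop_ddr5 = DDR5_UDIMM_GB * DESKTOP_SLOTS  # 256 GB
print(desktop_ddr4, desktop_ddr5)
```

Server boards with 16 RDIMM/LRDIMM slots per socket scale the same arithmetic up by slot count and module capacity.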
There's not much else (other than Epyc) in the way of affordably priced motherboards that have enough cumulative RAM. You can buy a used Dell dual socket older xeon CPU server with 512GB of RAM for test/development purposes for not very much money.
I'm sure this question has been asked before, but why not launch a GPU with more but slower RAM? That would fit bigger models while remaining affordable...
What would you need it for? Not gaming, for sure. AI, you say? Then fork up the cash.
That's Nvidia's current MO. There's more demand for GPUs for AI than there are GPUs available, and most of that demand still has stupid amounts of money behind it (being able to get grants, loans or investment based on potential/hype) - money that can be captured by GPU vendors. Unfortunately, VRAM is the perfect discriminator between "casual" and "monied" use.
(This is not unlike the "SSO tax" - single sign-on is pretty much the perfect discriminator between "enterprise use" and "not enterprise use".)
The convention is weird but it's pretty standard in the industry across all models, particularly GGUF. 671B parameters, quantized to 4 bits. The K_M terminology I believe is more specific to GGUF and describes the specific quantization strategy.
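As a sanity check on the naming, a nominal size estimate (real GGUF files run larger, since K_M variants mix in higher-precision tensors plus metadata):

```python
# size ≈ parameter count * bits per weight / 8 bits per byte
params = 671e9          # DeepSeek R1/V3 total parameters
bits_per_weight = 4.0   # nominal for a Q4 quant; Q4_K_M averages closer to ~4.8
size_gb = params * bits_per_weight / 8 / 1e9
print(round(size_gb, 1))  # ~335.5 GB nominal
```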
Article could stand to include a bit more information. Why are all the TPS figures x'ed out? What kind of performance can be expected from this setup (and how does it compare to the dual Epyc workstation recipe that was popularized recently?)
That said, there are sub-256GB quants of DeepSeek-R1 out there (not the distilled versions). See https://unsloth.ai/blog/deepseekr1-dynamic
I can't quantify the difference between these and the full FP8 versions of DSR1, but I've been playing with these ~Q2 quants and they're surprisingly capable in their own right.
Another model that deserves mention is DeepSeek v2.5 (which has "fewer" params than V3/R1) - but it still needs aggressive quantization before it can run on "consumer" devices (with less than ~100GB VRAM), and this was recently done by a kind soul: https://www.reddit.com/r/LocalLLaMA/comments/1irwx6q/deepsee...
DeepSeek v2.5 is arguably better than Llama 3 70B, so it should be of interest to anyone looking to run local inference. I really think more people should know about this.
Requirements (>8 token/s):
380GB CPU Memory
1-8 ARC A770
500GB Disk
If your prompt has 10 tokens, it'll do OK, like in the LinkedIn demo. If you need to increase the context, the compute bottleneck will kick in quickly.
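To put that in perspective, here's the wall-clock cost of prompt processing at a ~5 t/s PP rate like the one reported elsewhere in the thread (the rate is an assumption for this particular setup):

```python
pp_tps = 4.74  # assumed prompt-processing rate; varies with hardware
for prompt_tokens in (10, 1000, 16000):
    seconds = prompt_tokens / pp_tps
    print(f"{prompt_tokens} tokens -> {seconds:.1f} s")
# A 10-token demo prompt is near-instant; a full 16k context takes ~56 minutes.
```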
To get more than 8 t/s, is one Intel Arc A770 enough?
I’m guessing it’s under 10k.
I also didn’t see tokens per second numbers.
Under $1500 (before adding video cards or your own SSD), easily, with what I just found in a few minutes of searching. I'm also seeing things with 1024GB of RAM for under $2000.
You also want the capability for more than one full-speed PCI-Express x16 (3.0 at minimum) card, which means you need enough PCI-E lanes, which you aren't going to find on a single-socket Intel workstation motherboard.
Here are a couple of somewhat randomly chosen examples with 512GB of RAM, affordably priced. They'll be power-hungry and noisy. The same general idea applies to other x86-64 hardware from HP, Supermicro, etc. These are fairly common in quantity, so I'm using them as a baseline for specification vs. price. Configurations will be something like 16 x 32GB DDR4 DIMMs.
https://www.ebay.com/itm/186991103256?_skw=dell+poweredge+t6...
https://www.ebay.com/itm/235978320621?_skw=dell+poweredge+r7...
https://www.ebay.com/itm/115819389940?_skw=dell+poweredge+r7...
More than twice as fast as Nvidia 4090 for AI.
Launched last week.
Not in memory bandwidth, which is all that matters for LLM inference.
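A rough model of why bandwidth dominates: each generated token must stream every active weight from memory once, so bandwidth sets a hard ceiling on t/s. The numbers below are illustrative (128 GB/s is roughly one socket of 6-channel DDR4-2666; 37B is DeepSeek's active parameter count per token):

```python
bandwidth_gbs = 128      # ~6-channel DDR4-2666: 2.666 GT/s * 8 B * 6 channels
active_params_b = 37     # MoE: only ~37B of 671B params fire per token
bytes_per_weight = 0.25  # ~2-bit quant
tps_ceiling = bandwidth_gbs / (active_params_b * bytes_per_weight)
print(round(tps_ceiling, 1))  # ~13.8 t/s ceiling; real runs land well below
```

This is why a GPU's raw compute advantage matters little here: an accelerator with twice the FLOPS but similar memory bandwidth generates tokens at roughly the same rate.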
Anyone have a link to this one?