ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
dotancohen · 8 months ago
Just curious what your use cases are? What type of texts are you producing?

Thank you.

ryan_glass · 8 months ago
Coding - my own proprietary code (hence my desire for local hosting), including a decent amount of legacy code. General troubleshooting of anything and everything, from running Linux servers to fixing my car. Occasional summarizing and translation of large documents. Also image generation and other automations, though obviously not using LLMs for those.
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
blindriver · 8 months ago
I thought GPUs with a lot of extremely fast memory were required for inference. Are you saying that we can accomplish inference with just a large amount of non-unified system memory and no GPU? How is that possible?
ryan_glass · 8 months ago
Basically it comes down to the memory bandwidth of server CPUs being decent. A bit of an oversimplification, but: the model and context have to be pulled through RAM (or VRAM) every time a new token is generated. CPUs designed for servers with lots of cores have decent bandwidth - up to around 480GB/s with the EPYC 9 series, which has 12 memory channels it can use simultaneously. So, in theory, such a system can pull 480GB through memory every second. GPUs are faster, but you also have to fit the entire model and context into RAM (or VRAM), so for larger models GPUs get extremely expensive: a decent consumer GPU only has 24GB of VRAM and costs silly money if you need 20 of them. Whereas you get a lot of RDIMM RAM for a couple of thousand bucks, so you can run bigger models, and 480GB/s gives output faster than most people can read.
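As a rough sanity check on that argument, here is a minimal back-of-envelope sketch. The bandwidth-efficiency factor and the bytes-per-weight figure are ballpark assumptions, not measurements:

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound CPU build.
# Assumptions (not measurements): ~30% of peak bandwidth is achieved in practice,
# and the ~270GB quant of the 671B-parameter model averages ~3.2 bits per weight.

peak_bandwidth_gb_s = 480          # theoretical peak for a 12-channel DDR5 EPYC socket
bandwidth_efficiency = 0.3         # fraction of peak a real inference run reaches (assumption)
active_params = 37e9               # DeepSeek V3 is MoE: ~37B parameters are active per token
bytes_per_weight = 270e9 / 671e9   # ~0.4 bytes/weight for the ~270GB dynamic quant

bytes_read_per_token = active_params * bytes_per_weight
tokens_per_s = peak_bandwidth_gb_s * 1e9 * bandwidth_efficiency / bytes_read_per_token
print(f"~{tokens_per_s:.0f} tokens/s")  # same ballpark as the 9-10 tok/s reported elsewhere in the thread
```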
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
ysosirius · 8 months ago
How do you find the quality of the output compares to that of, say, o3 or Sonnet 4?
ryan_glass · 8 months ago
To be honest I haven't used o3 or Sonnet, as the code I work with is my own proprietary code which I like to keep private - one reason for the local setup. For troubleshooting day-to-day things I have found it at least as good as the free in-browser version of ChatGPT (not sure which model that uses).
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
BoorishBears · 8 months ago
CPU-only is really terrible bang for your buck, and I wish people would stop pushing these impractical builds on people genuinely curious about local AI.

The KV cache won't soften the blow the first time they paste a code sample into a chat and end up waiting 10 minutes with absolutely no interactivity before they even get the first token.

You'll get an infinitely more useful build out of a single 3090 and sticking to stuff like Gemma 27B than you will out of trying to run Deepseek off a CPU-only build. Even a GH200 struggles to run Deepseek at realistic speeds with bs=1, and there's an entire H100 attached to the CPU there: there just isn't a magic way to get "affordable fast effective" AI out of a CPU-offloaded model right now.

ryan_glass · 8 months ago
The quality on Gemma 27B is nowhere near good enough for my needs. None of the smaller models are.
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
jbellis · 8 months ago
impressive, but that's 1/5 to 1/10 of the throughput that you'd get with a hosted provider, with 1/4 to 1/8 the supported context
ryan_glass · 8 months ago
It might be 5 to 10 times slower than a hosted provider, but that doesn't really matter when the output is still faster than a person can read. Context-wise, for troubleshooting I have never needed over 16k, and for the rare occasion when I need to summarise a very large document I can switch to a smaller model and get a huge context. I have never needed more than 32k though.
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
danielhanchen · 8 months ago
Thanks for using our quants and appreciate it :) - We're still doing internal benchmarks since they're very slow to do - but they definitely pass our internal benchmarks :)
ryan_glass · 8 months ago
Thank you for making the dynamic quantisations! My setup wouldn't be possible without them and for my personal use, they do exactly what I need and are indeed excellent.
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
3eb7988a1663 · 8 months ago
Do you have hard numbers on the idle/average/max power draw? I assumed that server machines are built as if they are going to be red-lined constantly, so less effort is put into low-utilization optimizations.
ryan_glass · 8 months ago
No hard numbers I'm afraid - I don't monitor the power draw. But the machine uses a standard ATX power supply (a Corsair RM750e 750W PSU), and the default TDP of the CPU is 280W - I have mine set at 300W. It is basically built like a desktop: ATX form factor, fans spin down at idle, etc.
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
refibrillator · 8 months ago
> Unsloth Dynamic GGUF which, quality wise in real-world use performs very close to the original

How close are we talking?

I’m not calling you a liar OP, but in general I wish people perpetuating such broad claims would be more rigorous.

Unsloth does amazing work, however as far as I’m aware even they themselves do not publish head to head evals with the original unquantized models.

I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.

However I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high complexity and long context tasks.

ryan_glass · 8 months ago
You are right that I haven't been rigorous - it's easy to benchmark tokens/second, but quality of output is much harder to nail down. I couldn't find any decent comparisons for the Unsloth quants either, so I just tried a few of their models out, looking for something that was 'good enough', i.e. does all I need: coding, summarizing documents, troubleshooting anything and everything. I would like to see head-to-head comparisons too - maybe I will invest in more RAM at some stage, but so far I have had no need for it.

I ran some comparisons between the smaller and larger versions of the Unsloth models and, interestingly (for me anyway), didn't notice a huge difference in quality between them. But the smaller models didn't run significantly faster either, so I settled for the biggest model I could fit in RAM with a decent context. For more complex coding I use DeepSeek R1 (again the Unsloth quant), but since it's a reasoning model it isn't real-time, so it's no use as my daily driver.
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
nardi · 8 months ago
What's your prompt processing speed? That's more important in this situation than output TPS. If you have to wait minutes to start getting an answer, that makes it much worse than a cloud-hosted version.
ryan_glass · 8 months ago
Prompt eval time varies a lot with context, but it feels real-time for short prompts - approx 20 tokens per second, though I haven't done much benchmarking of this. When there is a lot of re-prompting in a long back-and-forth it is still quite fast - I use the KV cache, which I assume helps, and also quantize the KV cache to Q8 if I am running contexts above 16k. However, if I want it to summarize a document of say 15,000 words it does take a long time - there I walk away and come back in about 20 minutes, and it will be complete.
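Rough arithmetic on why the big-document case takes that long - the ~20 tok/s prompt-eval rate is from above, while the words-to-tokens ratio is an assumption:

```python
# Rough arithmetic on the 15,000-word summarisation case.
# The ~20 tok/s prompt-eval rate is reported above; the words-to-tokens ratio is an assumption.

prompt_eval_tok_s = 20
words = 15_000
tokens_per_word = 1.3          # assumed average for English prose

prompt_tokens = words * tokens_per_word
prefill_minutes = prompt_tokens / prompt_eval_tok_s / 60
print(f"~{prefill_minutes:.0f} minutes of prompt processing before the first output token")
# ~16 minutes, which lines up with the ~20 minutes end-to-end mentioned above
```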
ryan_glass commented on Why DeepSeek is cheap at scale but expensive to run locally   seangoedecke.com/inferenc... · Posted by u/ingve
ryan_glass · 8 months ago
I run DeepSeek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes GPUs, which in my opinion are not the best way to serve large models like this locally. I run a mid-range EPYC 9004 series based home server on a Supermicro motherboard which cost all-in around $4000. It's a single-CPU machine with 384GB RAM (you could get 768GB using 64GB sticks, but that costs more). No GPU means the power draw is less than a gaming desktop.

With the RAM limitation I run an Unsloth Dynamic GGUF which, quality-wise, performs very close to the original in real-world use. It is around 270GB, which leaves plenty of room for context - I run 16k context normally as I use the machine for other things too, but can up it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. There are plenty of people running similar setups with 2 CPUs who run the full version at similar tokens/second.
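For anyone sizing a similar build, a minimal sketch of the RAM budget. The per-token KV-cache cost and the overhead figure are assumptions for illustration, not measured values - the real cost depends heavily on the runtime and on KV-cache quantisation:

```python
# Minimal sketch of the RAM budget for weights + context on this kind of build.
# The per-token KV-cache cost and the overhead are assumptions for illustration.

ram_gb = 384
model_gb = 270                 # size of the Unsloth dynamic GGUF quoted above
overhead_gb = 30               # OS, other services, compute buffers (assumption)
kv_cache_mb_per_token = 3.0    # assumed per-token cost; varies with runtime and KV quantisation

free_gb = ram_gb - model_gb - overhead_gb
max_context_tokens = int(free_gb * 1024 / kv_cache_mb_per_token)
print(f"~{max_context_tokens} tokens of context fit alongside the weights")
# on the order of the 16k-24k contexts used in practice here
```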
