Thank you for the alternative to X. I feel somehow complicit in supporting someone likely aligned with something unspeakable any time I click an X link.
I don't know if it's just me, but I can't seem to view the Nitter link; it says the tweet isn't found. I can't use the OP's Twitter link either, because I can't see anything without logging in.
I ran across a thread once indicating that the guy who hosts that instance has a bit of a fight on his hands keeping it usable. He frequently fights off hordes of bots, that sort of thing. Your signal probably got mistaken for some noise.
We need Nitter and BitTorrent to have a baby, p2p so we can share the load.
As this is HN, I'm curious whether anyone here is interested in starting a business hosting these large open-source LLMs?
I just finished a test running this 768 GB DeepSeek-R1 model locally on a cluster of computers with 800 GB/s of aggregate memory bandwidth (faster than the machine in the Twitter post). Extrapolating to a cluster with 6,000 GB/s of aggregate memory bandwidth, I'm confident we could reach higher speeds than Groq and Cerebras [1] on these large models.
We might even be cheap enough in OPEX to retrain these models.
Would anyone with cofounder or commercial skills be willing to set up this hosting service with me? It would take less than a $30K investment but could be profitable within weeks.
[1] https://hc2024.hotchips.org/assets/program/conference/day2/7...
The Azure(tm) and AWS versions of rent-a-second are in the works as we speak. So yes, rent-a-brain/vegetable, and no, I will bet you $40k you will not beat either AWS or Microsoft to the punch. Zero chance of that. They will have their excess computational power, with extremely discounted electric rates, in place before Friday morning.
I wonder if the real market is actually bringing this stuff in-house.
Given the propensity of these big tech companies to hoover up/steal any information they can gather, running these models locally, with local fine-tuning, looks quite attractive.
I think the important metric will be whether we can compete against the price of AWS or Microsoft for running large LLMs, not their time to market. Competing on cost against overpriced hyperscalers is not very hard, and $30K is a small investment, not a gamble. If it failed, worst case you would only lose $3,000-$6,500 or so.
6 to 8 t/s is decent for a GPU-less build, but speaking from experience... it will feel slow. Usable - but slow.
Especially because these models <think> for a long bit before actually answering, so they generate for a longer period. (to be clear - I find the <think> section useful, but it also means waiting for more tokens)
Personally - I end up moving down to lower quality models (quants/less params) until I hit about 15 tokens/second. For chat, that seems to be the magical spot where I stop caring and it's "fast enough" to keep me engaged.
For inline code helpers (e.g., Copilot) you really need to be up near 30 tokens/second to make it feel fast enough to be helpful.
Is a token a character or a word? I don't think I can read 30 words/second so 8/second would be fine. But if it's characters then yes, it would be slow.
It comes out to around 1-3 words/second. That's not so slow that it's maddening (2 tokens/second, for example, is frustratingly slow: walk-away-and-make-coffee-while-it's-answering slow), but it's still slow enough to make it hard to use without breaking flow state. You get bored and distracted reading at that pace.
This is about the same as I get from a Core i5-9400 with smaller models. Is there no prosumer board that can take that much RAM? There must be a Threadripper that can do it, right? Why did he need EPYC?
RAM channel count is key here. That board has 24 channels of DDR5 RAM. A lot of low-end boards only have 2 channels. Most come in at 4. Some high-end boards have 8, so 24 channels is quite wide.
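As a rough sketch of why channel count matters so much, here is the napkin math for theoretical peak bandwidth, assuming DDR5-4800 in every slot (real sustained throughput will be noticeably lower):

```python
# Theoretical peak DDR5 bandwidth: channels * transfer rate (MT/s) * 8 bytes per transfer.
# These are best-case figures; sustained bandwidth is typically well below peak.
def peak_bandwidth_gb_s(channels: int, mt_per_s: int = 4800) -> float:
    return channels * mt_per_s * 8 / 1000  # GB/s

for channels in (2, 4, 8, 12, 24):
    print(f"{channels:>2} channels of DDR5-4800: {peak_bandwidth_gb_s(channels):6.1f} GB/s peak")
```

A 2-channel desktop board tops out around 76.8 GB/s on that math, while the 24-channel dual-EPYC setup is around 921.6 GB/s, which is the whole point of the build.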
It's going to take some time, but the farce is gone. We'll have parity with ChatGPT on consumer hardware soon enough. $6K is still too much; I suspect the community will be able to get this down to $2K.
My theory is that in the future we will have much more “friends circle cloud” type ops, where that hardware cost is spread out among a small community and access is private. What it won’t look like is every Tom, Dick, and Harry running their own $10k hardware to have a chuckle at the naughty jokes and grade-C+ programmer IDE assistance offered by open-source LLMs.
Orders of magnitude make the difference. “Consumer-level” will be when it's around $300-700ish in a nice little box like an Intel NUC or similar. Ubiquiti just did this with their AI Key to help classify stuff on their video camera surveillance platform.
Since it would be a 2DPC configuration, the memory would be limited to 4400 MT/s unless you overclock it. That would give 422.4 GB/s, which should be enough to run the full model at 11 tokens per second according to a simple napkin math calculation. In practice, it might not run that fast. If the memory is overclocked, getting to 16 tokens per second might be possible (according to napkin math).
The subtotal for the linked parts alone is $5,139.98. It should stay below $6000 even after adding the other things needed, although perhaps it would be more after tax.
Note that I have not actually built this to know how it works in practice. My description here is purely hypothetical.
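For what it's worth, the napkin math above roughly works out if you assume DeepSeek-R1 activates about 37B of its 671B parameters per token (it is a MoE model) and that Q8 needs roughly one byte per weight, so each generated token streams ~37 GB of weights from RAM. A sketch, not a benchmark:

```python
# Napkin math behind the ~11 tokens/sec estimate for the 12-channel build.
# Assumptions: ~37B active parameters per token and ~1 byte per weight at Q8,
# so each token reads roughly 37 GB from RAM.
channels = 12
mt_per_s = 4400            # 2DPC-limited transfer rate
bytes_per_transfer = 8
active_params = 37e9       # active parameters per generated token
bytes_per_param = 1.0      # approximate for Q8

bandwidth_gb_s = channels * mt_per_s * bytes_per_transfer / 1000
tokens_per_s = bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)
print(f"{bandwidth_gb_s:.1f} GB/s peak -> ~{tokens_per_s:.1f} tokens/sec upper bound")
# Prints: 422.4 GB/s peak -> ~11.4 tokens/sec upper bound (real-world will be lower)
```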
I am not sure that would work well: 8 cores is very few, it can't really scale the attention-head computation, and more importantly this particular EPYC part can only achieve ~50% of the theoretical memory bandwidth, ~240 GB/s. Other EPYC parts, like the one OP is using, run closer to ~70%, at ~400 GB/s.
I think the point of the two-socket solution is the doubled memory bandwidth. You propose using just a single one of the same CPU, or am I missing something?
llama.cpp’s token generation speed does not scale with multiple CPU sockets just like it does not scale with multiple GPUs. Matthew Carrigan wrote:
> Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!
This does not actually make sense. It is well known that there is a penalty for accessing memory attached to a different CPU. You don’t get more bandwidth from disabling the NUMA node information, and his token generation performance reflects that. If there were a doubling effect from using two CPU sockets, he should be getting twice the performance, but he is not.
Additionally, llama.cpp’s NUMA support is suboptimal, so he is likely taking a performance hit:
https://github.com/ggerganov/llama.cpp/issues/11333
When llama.cpp fixes its NUMA support, using two sockets should be no worse than using one socket, but it will not become better unless some new way of doing the calculations is devised that benefits from NUMA. This might be possible (particularly if you can get GEMV to run faster using NUMA), but it is not how things are implemented right now.
Would adding a single GPU help with prompt processing here? When I run a llama.cpp GPU build with no layers offloaded, prompt processing is still way faster, as all the matrix multiplies are done on the accelerator. The actual memory-bound inference continues to run on the CPU.
Since tensor cores are so fast, you still come out ahead when you send the weights over PCIe to the GPU and return the results back to main memory.
Maybe. It likely depends on whether GEMM can operate at or near full speed while streaming weights over PCIe. I am not sure offhand how to stream the weights over PCIe for use by CUDA/PTX code. It would be an R&D project.
Complete hardware + software setup for running Deepseek-R1 locally. The actual model, no distillations, and Q8 quantization for full quality. Total cost, $6,000. All download and part links below:
Motherboard: Gigabyte MZ73-LM0 or MZ73-LM1. We want 2 EPYC sockets to get a massive 24 channels of DDR5 RAM to max out that memory size and bandwidth. https://t.co/GCYsoYaKvZ
CPU: 2x any AMD EPYC 9004 or 9005 CPU. LLM generation is bottlenecked by memory bandwidth, so you don't need a top-end one. Get the 9115 or even the 9015 if you really want to cut costs https://t.co/TkbfSFBioq
RAM: This is the big one. We are going to need 768GB (to fit the model) across 24 RAM channels (to get the bandwidth to run it fast enough). That means 24 x 32GB DDR5-RDIMM modules. Example kits: https://t.co/pJDnjxnfjg https://t.co/ULXQen6TEc
Case: You can fit this in a standard tower case, but make sure it has screw mounts for a full server motherboard, which most consumer cases won't. The Enthoo Pro 2 Server will take this motherboard: https://t.co/m1KoTor49h
PSU: The power use of this system is surprisingly low! (<400W) However, you will need lots of CPU power cables for 2 EPYC CPUs. The Corsair HX1000i has enough, but you might be able to find a cheaper option: https://t.co/y6ug3LKd2k
Heatsink: This is a tricky bit. AMD EPYC is socket SP5, and most heatsinks for SP5 assume you have a 2U/4U server blade, which we don't for this build. You probably have to go to Ebay/Aliexpress for this. I can vouch for this one: https://t.co/51cUykOuWG
And if you find the fans that come with that heatsink noisy, replacing with 1 or 2 of these per heatsink instead will be efficient and whisper-quiet: https://t.co/CaEwtoxRZj
And finally, the SSD: Any 1TB or larger SSD that can fit R1 is fine. I recommend NVMe, just because you'll have to copy 700GB into RAM when you start the model, lol. No link here, if you got this far I assume you can find one yourself!
And that's your system! Put it all together and throw Linux on it. Also, an important tip: Go into the BIOS and set the number of NUMA groups to 0. This will ensure that every layer of the model is interleaved across all RAM chips, doubling our throughput. Don't forget!
Now, software. Follow the instructions here to install llama.cpp https://t.co/jIkQksXZzu
Next, the model. Time to download 700 gigabytes of weights from @huggingface! Grab every file in the Q8_0 folder here: https://t.co/9ni1Miw73O
Believe it or not, you're almost done. There are more elegant ways to set it up, but for a quick demo, just do this. llama-cli -m ./DeepSeek-R1.Q8_0-00001-of-00015.gguf --temp 0.6 -no-cnv -c 16384 -p "<|User|>How many Rs are there in strawberry?<|Assistant|>"
If all goes well, you should witness a short load period followed by the stream of consciousness as a state-of-the-art local LLM begins to ponder your question:
And once it passes that test, just use llama-server to host the model and pass requests in from your other software. You now have frontier-level intelligence hosted entirely on your local machine, all open-source and free to use!
And if you got this far: Yes, there's no GPU in this build! If you want to host on GPU for faster generation speed, you can! You'll just lose a lot of quality from quantization, or if you want Q8 you'll need >700GB of GPU memory, which will probably cost $100k+
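For the "pass requests in from your other software" step above, here is a minimal client sketch, assuming llama-server was started with the same GGUF file and is listening on its default port 8080 with its OpenAI-compatible API (adjust the URL and parameters to your setup):

```python
# Minimal sketch: send a chat request to a local llama-server instance.
# Assumes the server exposes the OpenAI-compatible endpoint at localhost:8080.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "How many Rs are there in strawberry?"}],
    "temperature": 0.6,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```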
I'd assume that the existing llama.cpp ability to split layers out to the GPU still applies, so you could have some fraction in VRAM and speed up those layers.
The memory bandwidth might be an issue, and it would be a pretty small percentage of the model, but I'd guess the speedup would be apparent.
Maybe not worth the few thousand for the card + more power/cooling/space, of course.
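If you do experiment with a partial offload, the knob is llama.cpp's n-gpu-layers setting (the -ngl flag on the CLI, or n_gpu_layers in the llama-cpp-python bindings). A hypothetical sketch, with the layer count chosen arbitrarily to fit a single card:

```python
# Sketch of splitting the model between GPU and CPU via llama-cpp-python.
# n_gpu_layers puts that many transformer layers in VRAM; the rest stay in system RAM.
# The value 10 is purely illustrative; tune it to whatever your card can hold.
from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-R1.Q8_0-00001-of-00015.gguf",
    n_ctx=16384,
    n_gpu_layers=10,
)
out = llm("How many Rs are there in strawberry?", max_tokens=256, temperature=0.6)
print(out["choices"][0]["text"])
```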
And since X sucks today just about as much as it sucked yesterday: https://nitter.poast.org/carrigmat/status/188424436990727810...
https://www.dailydot.com/debug/poast-hack-leaked-emails-dms/
https://azure.microsoft.com/en-us/blog/deepseek-r1-is-now-av...
In my house I currently have almost 900 GB/s of memory bandwidth in aggregate, but only 132 GB of total DRAM.
For DeepSeek, 1 token ~= 3 English characters.
See: https://api-docs.deepseek.com/quick_start/token_usage/
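Using that ratio, a rough conversion from generation speed to reading speed (assuming roughly 5 characters plus a space per English word; tokenization varies with content):

```python
# Rough conversion from tokens/sec to words/sec and words/minute.
# Assumes ~3 characters per DeepSeek token and ~6 characters per English word
# (5 letters plus a space); both are ballpark figures, not exact.
chars_per_token = 3
chars_per_word = 6

for tokens_per_s in (2, 6, 8, 15, 30):
    words_per_s = tokens_per_s * chars_per_token / chars_per_word
    print(f"{tokens_per_s:>2} tok/s ~= {words_per_s:4.1f} words/s ({words_per_s * 60:.0f} wpm)")
```

On those assumptions, 6-8 tok/s is only a few words per second, while 15 tok/s is already well past typical reading speed, which matches the "fast enough" threshold mentioned above.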
I'll ask it to do a task, then it will do it in a few steps and notify me when it's done. I use that time to take care of something else.
[0] direct link with login https://x.com/carrigmat/status/1884244369907278106
[1] alt link without login but with ads https://threadreaderapp.com/thread/1884244369907278106.html
Edit: someone posted xcancel link above - no ads https://xcancel.com/carrigmat/status/1884244369907278106
I'm tempted to cancel my ChatGPT subscription!
Not every single person needs to have it. But if someone in your circle has the needed hardware...
https://www.newegg.com/p/N82E16819113866
https://www.newegg.com/supermicro-h13ssl-nt-amd-epyc-9004-se...
As for memory, these two kits should work (both are needed for the full 12 DIMMs):
https://www.newegg.com/owc-256gb/p/1X5-005D-001G0
https://www.newegg.com/owc-512gb/p/1X5-005D-001G4
Because in a 2-CPU system, the model may have to be accessed across the NUMA interconnect, which has only 10%-30% of the local memory bandwidth.
https://news.ycombinator.com/item?id=42868360