Author here. I've updated the article based on your feedback. Thank you.
Key corrections:
Ollama GPU usage - I was wrong. It IS using GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
FP16 vs BF16 - enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. Should be "FP16 has issues, BF16 untested (likely works)."
llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
Ship early, iterate publicly, accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
Is there a reason why you used an LLM for the entire article, and moreover, even for this comment? Couldn't you have at least written this comment yourself?
To be charitable, I'm assuming that their English skills aren't good. If LLMs allow us to hear from potentially billions of people who may have something worthwhile to say but who fall into that category, I wouldn't want to discourage their use in articles like this one.
But if that's not the case, then yeah, it's a crappy practice and I'd hate to see it spread any further than it already has.
Is that version correct?
Asking because (in Ollama terms) it's positively ancient; 0.12.6 is the most recent release (currently).
I'm guessing it _might_ make a difference, as the Ollama crowd do seem to be changing things, adding new features and optimisations (etc) quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (i.e. Intel Xe GPUs) was added, and in my testing that worked. Not that Vulkan support would do anything in your case. ;)
Late to the party here, but you should definitely be using PyTorch 25.09 (or whatever is latest when you go to check) rather than 24.10. That's a year-old PyTorch on new hardware; I suspect a lot of these bugs have been fixed.
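If that means the NGC containers, which is what the 24.10/25.09 numbering suggests (my assumption), the switch is just pulling the newer tag, e.g.:

    docker pull nvcr.io/nvidia/pytorch:25.09-py3   # rather than nvcr.io/nvidia/pytorch:24.10-py3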
One of my colleagues wrote a first impressions blog post last week. It's from our company's perspective, but is a solid overview of the product and intended capabilities, from the POV of an AI developer or data scientist.
https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...
> There you’ll see the 10 Cortex-X925 (“performance”) cores listed with a peak clock rate of 4 GHz, along with the 10 Cortex-A725 (“efficiency”) cores listed with a peak clock rate of 2.8 GHz
> If you start Python and ask it how many CPU cores you have, it will count both kinds of cores and report 20
> Note that because of the speed difference between the cores, you will want to ensure there is some form of dynamic scheduling in your application that can load balance between the different core types.
Sounds like a new type of hell where I now not only need to manage the threads themselves, but also have to take into account what type of core they run on, while Python straight up reports them as the same.
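On Linux you can at least recover the split from sysfs and build your own mapping. A rough sketch (it assumes the standard cpufreq layout; I haven't verified it on the Spark itself):

    import os
    from pathlib import Path

    print(os.cpu_count())  # 20 - P- and E-cores lumped together

    # Group logical CPUs by max clock so a scheduler could tell the core types apart.
    by_freq = {}
    for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        f = cpu / "cpufreq" / "cpuinfo_max_freq"
        if f.exists():
            by_freq.setdefault(int(f.read_text()), []).append(cpu.name)

    for khz, names in sorted(by_freq.items(), reverse=True):
        print(f"{khz // 1000} MHz: {sorted(names)}")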
> The CPU memory is the same as the GPU memory and is much larger than any other discrete GPU available in a desktop. That means much larger datasets and bigger models can be run locally than would be possible otherwise.
Isn't this the same architecture that Apple's Mx implements, from a memory perspective?
I absolutely love it. I’ve been up for days playing with it. But there are some bleeding edge issues. I tried to write a balanced article. I would highly recommend it for people who love to get their hands dirty. Blows away any consumer GPU.
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there’s nothing like a personal computer.
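The kind of two-box smoke test I mean is a minimal sketch like this (hostnames and ports are placeholders):

    # nccl_smoke.py - run on both Sparks, e.g.:
    #   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
    #            --master_addr=<first-spark-hostname> --master_port=29500 nccl_smoke.py
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")   # exercises NCCL between the two boxes
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)                        # default op is SUM; with two ranks expect 0 + 1 = 1.0
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t.item()}")
    dist.destroy_process_group()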
Your complaint sounds more like it's about the way you have to access the HPC (via Slurm), not the compute itself. After having now tried Slurm myself, I don't understand the love for it at all.
As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?
Nah. Do you have first-hand experience with Strix Halo? At less than 1600€ for a 128GB configuration it manages >45 tokens/s with gpt-oss 120b, which is faster than the DGX Spark at a fraction of the cost.
One thing I can’t find anyone mention in reviews: does inference screech to a halt when using large context windows on models? Say if you’re in the 100k range on gpt-oss. I’m not concerned about lightning inference speed overall, as I understand the purpose of the Spark is to be a well-rounded trainer/tuner. I just want to know if it becomes unusable vs. a reasonable slowdown at larger contexts. That’s the thing people are unpleasantly surprised to find about a Mac Studio, and it has prevented me from going that route.
But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…
Since the text is obviously LLM output, how much prompting and editing went into this post? Did you have to correct anything that you put into it that it then got wrong or added incorrect output to?
Definitely reeks of someone who doesn't know what makes a readable blogpost and hoped the LLM did.
I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. Tried to skim the article and it's a mess. Inconsistent formatting and emoji without a single graph to visualize benchmarks.
There are bleeding edge issues, but everyone dials in support for transformers first, so that path is generally pain-free.
I haven't exactly bisected the issue, but I'm pretty sure convolutions are broken on sm_121 after a certain size: a 2x batch size increase gives a 20x memory blowup from a convolution, _only_ on the DGX Spark.
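A rough sketch of the kind of comparison that shows it (the sizes here are placeholders, not the exact ones that trigger it for me):

    import torch

    def peak_gib_for_batch(n):
        torch.cuda.reset_peak_memory_stats()
        conv = torch.nn.Conv2d(64, 64, 3, padding=1).cuda().to(torch.bfloat16)
        x = torch.randn(n, 64, 512, 512, device="cuda", dtype=torch.bfloat16)
        y = conv(x)
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**30

    # Doubling the batch should roughly double peak memory, not blow it up 20x.
    for n in (8, 16):
        print(n, f"{peak_gib_for_batch(n):.2f} GiB")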
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for gpt-oss last time I checked and on release; not sure if something broke along the way.
I don't know exactly whether memory fragmentation is something fixable on the driver side. This might just be a problem with the kernel's policy and the GPL, which prevent them from automatically interfering with the memory subsystem at the granularity they'd like (see ZFS and its page table antics) - or so my thinking goes.
If you've done stuff on WSL, you've seen similar issues, and you can fix it by running a service that periodically compacts and cleans memory; I have it run every hour. Note that this does impact at the very least CPU performance and memory allocation speeds, but I have not had any issues with long training runs (24hr+) with it in place. That's assuming fragmentation is even the issue - I have never tried without the service, since I put it in place as soon as I got the machine because of my experience on WSL.
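A sketch of the kind of hourly job I mean, assuming the usual Linux vm knobs (the idea, not necessarily the exact script):

    # compact_memory.py - run hourly as root via cron or a systemd timer.
    import os
    from pathlib import Path

    os.sync()                                            # flush dirty pages first
    Path("/proc/sys/vm/compact_memory").write_text("1")  # ask the kernel to compact fragmented memory
    Path("/proc/sys/vm/drop_caches").write_text("3")     # drop page cache + dentries/inodes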
I'm not yet using mine for ML stuff because there are still a lot of various issues like this post outlined. But I am using mine as an ARM dev system in the meantime, and as a "workstation" it's actually quite good. The Cortex-X925 cores are Zen5 class in performance, and it is overall an absolute unit for its size; I'm very impressed that a standard ARM core is pushing this level of performance for a desktop-class machine. I thought about buying a new Linux desktop recently, and this is good enough I might just plug it into a monitor and use it instead.
It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot up Fedora 42 and install the open kernel modules no problem. The overall delta/number of specific patches for the Canonical 6.17-nvidia tree was pretty small when I looked (the currently shipping kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.
To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output; but the software is still buggy and Blackwell is still a bumpy ride overall, so it might get better). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.
There weren't any instructions on how the author got ollama/llama.cpp; could it possibly be something Nvidia shipped with the DGX Spark, and an old version?
Theoretically it has slightly better memory bandwidth, (you are supposed to get) the Nvidia AI software ecosystem support out of the box, and you can use the 200G NIC to stick 2 together more efficiently.
Practically, if the goal is 100% about AI and cloud isn't an option for some reason, both options are likely "a great way to waste a couple grand trying to save a couple grand" as you'd get 7x the performance and likely still feel it's a bit slow on larger models using an RTX Pro 6000. I say this as a Ryzen AI Max+ 395 owner, though I got mine because it's the closest thing to an x86 Apple Silicon laptop one can get at the moment.
Because the ML ecosystem is more mature on the NVidia side. Software-wise the cuda platform is more advanced. It will be hard for AMD to catch up. It is good to see competition tho.
But the article shows that the Nvidia ecosystem isn't that mature either on the DGX Spark with ARM64. I wonder if Nvidia is still ahead for such use cases, all things considered.
The complete Framework Desktop with everything working (including said Ryzen AI Max 395+ and 128 GB of RAM) is 2500 EUR. In Europe the DGX Spark listings are at 4000+ EUR.
The vast majority of Ryzen AI Max+ 395s (by volume at least) are sold as complete system offerings as well. About as far as you can go the other way is getting one without an SSD, as the MB+RAM+CPU are an "all or nothing" bundle anyways.
> ARM64 Architecture: Not x86_64 (limited ML ecosystem maturity)
> No PyTorch wheels for ARM64+CUDA (must use Docker)
> Most ML tools optimized for x86
No evidence for any of this whatsoever. The author just asked Claude/claude code to write their article and it just plain hallucinated some rubbish.
Like in Upstream Color: https://www.youtube.com/watch?v=zfDyEr8Ykcg
There are official benchmarks of the Spark running multiple models just fine on llama.cpp
https://github.com/ggml-org/llama.cpp/discussions/16578