As someone who got in early on the Ryzen AI 395+, is there any added value in the DGX Spark besides having CUDA (compared to ROCm/Vulkan)? I feel Nvidia fumbled the marketing, either making it sound like an inference miracle or like a dev toolkit (and then not doing enough to differentiate it from the superior AGX Thor).
I am curious where you find its main value, how it would fit within your tooling, and what use cases it serves compared to other hardware.
From the inference benchmarks I've seen, an M3 Ultra always comes out on top.
The M3 Ultra has a slow GPU and no hardware FP4 support, so its prompt processing (prefill) is going to be slow, practically unusable for 100k+ context sizes. For token generation, which is memory-bound, the M3 Ultra would be much faster, but who wants to wait 15 minutes for the context to be processed? The Spark will be much faster at prompt processing, giving you a much better time to first token, but then roughly 3x slower (273 vs ~800 GB/s) in token-generation throughput. You need to decide which matters more to you. Strix Halo is IMO the worst of both worlds at the moment, with the weakest specs in both dimensions and the least mature software stack.
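To make that tradeoff concrete, here is a rough back-of-envelope sketch, not a benchmark. The bandwidth numbers are the ones quoted above; the TFLOPS figures and model assumptions are placeholders just to show the shape of the prefill-vs-decode split.

```python
# Rough prefill/decode estimate for a memory-bound vs compute-bound view.
# All numbers are illustrative assumptions, not measurements.

def estimate(name, mem_bw_gbs, compute_tflops,
             active_params_b=12, bytes_per_weight=0.5, ctx_tokens=100_000):
    """Very crude roofline-style estimate.

    - Decode is assumed memory-bound: every generated token streams the
      active weights once, so tok/s ~= bandwidth / bytes_of_active_weights.
    - Prefill is assumed compute-bound: ~2 * params * ctx FLOPs in total.
    """
    weight_bytes = active_params_b * 1e9 * bytes_per_weight
    decode_tps = mem_bw_gbs * 1e9 / weight_bytes
    prefill_flops = 2 * active_params_b * 1e9 * ctx_tokens
    prefill_seconds = prefill_flops / (compute_tflops * 1e12)
    print(f"{name:10s}  ~{decode_tps:5.1f} tok/s decode, "
          f"~{prefill_seconds:6.1f} s to prefill {ctx_tokens} tokens")

# 273 GB/s (Spark) and ~800 GB/s (M3 Ultra) are from the thread; the TFLOPS
# values are guesses purely to illustrate the prefill/decode asymmetry.
estimate("DGX Spark", mem_bw_gbs=273, compute_tflops=100)
estimate("M3 Ultra", mem_bw_gbs=800, compute_tflops=30)
```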
Installation instructions: https://github.com/comfyanonymous/ComfyUI#nvidia

It's a webUI that'll let you try a bunch of different, super powerful things, including easily doing image and video generation in lots of different ways.
It was really useful to me when benchmarking stuff at work on various gear, e.g. L4 vs A40 vs H100 vs 5th-gen EPYC CPUs, etc.
About what I expected. The Jetson series had the same issues, mostly, at a smaller scale: Deviate from the anointed versions of YOLO, and nothing runs without a lot of hacking. Being beholden to CUDA is both a blessing and a curse, but what I really fear is how long it will take for this to become an unsupported golden brick.
Also, the other reviews I’ve seen point out that inference speed is slower than a 5090 (or on par with a 4090 with some tailwind), so the big difference here (other than core counts) is the large chunk of “unified” memory. Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if ROCm is… well… not all there yet).
Curious to compare this with cloud-based GPU costs, or (if you really want on-prem and fully private) the returns from a more conventional rig.
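For the cloud comparison, a simple break-even sketch is enough to frame it. The purchase price and hourly rates below are placeholder assumptions; plug in whatever your vendor and cloud provider actually charge.

```python
# Break-even sketch: hours of rented cloud GPU time you could buy for the
# price of a Spark. All figures are assumptions for illustration.
SPARK_PRICE_USD = 4000  # assumed street price

for gpu, usd_per_hour in {"rented mid-range GPU": 1.50,
                          "rented H100-class GPU": 3.00}.items():
    hours = SPARK_PRICE_USD / usd_per_hour
    months = hours / 8 / 21  # 8 h/day, ~21 working days/month
    print(f"{gpu}: ~{hours:,.0f} hours (~{months:,.0f} working months) "
          f"before the Spark pays for itself")
```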
> Also, the other reviews I’ve seen point out that inference speed is slower than a 5090 (or on par with a 4090 with some tailwind), so the big difference here (other than core counts) is the large chunk of “unified” memory.
It's not comparable to 4090 inference speed. It's significantly slower, because of the lack of MXFP4 models out there. Even compared to the Ryzen AI 395 (ROCm / Vulkan) on gpt-oss-120B mxfp4, the DGX somehow manages to lose on token generation (pp is faster, though).
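If you want to reproduce that kind of pp/tg comparison yourself, here is a minimal timing sketch using llama-cpp-python. The GGUF path is a placeholder, and the prefill/generation split is approximate rather than what a proper benchmark harness would report.

```python
# Minimal pp (prompt processing) vs tg (token generation) timing sketch.
import time
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf",   # assumed local GGUF file
            n_ctx=8192, n_gpu_layers=-1, verbose=False)

long_prompt = "word " * 4000  # ~4k tokens so prefill dominates the first call

t0 = time.perf_counter()
llm(long_prompt, max_tokens=1)            # cost is almost entirely prefill
pp_time = time.perf_counter() - t0

t0 = time.perf_counter()
out = llm("Write a haiku about memory bandwidth.", max_tokens=128)
tg_time = time.perf_counter() - t0
gen_tokens = out["usage"]["completion_tokens"]

print(f"prefill: ~{4000 / pp_time:.0f} tok/s, "
      f"generation: ~{gen_tokens / tg_time:.0f} tok/s")
```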
> Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if ROCm is… well… not all there yet).
ROCm (v7) for APUs has actually come a long way, mostly thanks to community effort; it's quite competitive and more mature now. It's still not totally user-friendly, but it doesn't break between updates (I know the bar is low, but that was the status a year ago). So in comparison, the Strix Halo offers a lot of value for your money if you need a cheap, compact inference box.
Haven't tested fine-tuning / training yet, but in theory it's supported. Not to forget that the APU is extremely performant for "normal" tasks (Threadripper level) compared to the CPU of the DGX Spark.
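A quick way to sanity-check that the ROCm build of PyTorch actually sees the APU, assuming you installed the ROCm (not CUDA) wheel:

```python
# Quick sanity check that a ROCm build of PyTorch sees the APU's GPU.
import torch

print("HIP version:", torch.version.hip)          # None on CUDA/CPU-only builds
print("GPU visible:", torch.cuda.is_available())  # ROCm reuses the cuda API
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    y = x @ x  # tiny matmul just to confirm kernels actually launch
    torch.cuda.synchronize()
    print("Matmul OK:", tuple(y.shape))
# Note: some APUs have historically needed HSA_OVERRIDE_GFX_VERSION set
# before this works; check the current ROCm docs for your chip.
```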
Yeah, good point on the FP4. I'm seeing people complain about INT8 as well, which ought to "just work", but everyone who has one (not many) is wary of wandering off the happy path.
A few years ago I worked on an ARM supercomputer, as well as a POWER9 one. x86 is so assumed for anything other than trivial things that it is painful.
What I found was a good solution was using Spack:
https://spack.io/
That allows you to download/build the full toolchain of stuff you need for whatever architecture you are on: all dependencies, compilers and toolkits (GCC, CUDA, MPI, etc.), compiled Python packages, and so on. And if you need to add a new recipe for something, it is really easy.
For the fellow Brits - you can tell this was named by Americans!!!
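To give a concrete feel for the recipe side, here is a hypothetical package.py in Spack's Python DSL; the package name, URL, and checksum are made up for illustration, not a real package.

```python
# Hypothetical Spack recipe (package.py) sketching the Python DSL.
from spack.package import *


class Mysolver(CMakePackage):
    """Example solver library, built with CMake."""

    homepage = "https://example.com/mysolver"
    url = "https://example.com/mysolver-1.2.0.tar.gz"

    # Placeholder checksum; spack checksum would generate the real one.
    version("1.2.0", sha256="0000000000000000000000000000000000000000000000000000000000000000")

    variant("cuda", default=False, description="Build the CUDA backend")

    depends_on("mpi")
    depends_on("cuda", when="+cuda")

    def cmake_args(self):
        # Translate the variant straight into a CMake option.
        return [self.define_from_variant("ENABLE_CUDA", "cuda")]

# Installing against a specific toolchain would then look something like:
#   spack install mysolver+cuda %gcc@12 ^openmpi
# and the concretizer picks compatible versions of everything underneath.
```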
It's good that you've mentioned Spack, and for something other than HPC work, which is very interesting.
This is a high-level overview by one of the Spack authors from the HN post back in 2023 (the top comment of 100 comments), including a link to the original Spack paper [1]:
At a very high level, Spack has:
* Nix's installation model and configuration hashing
* Homebrew-like packages, but in a more expressive Python DSL, and with more versions/options
* A very powerful dependency resolver that doesn't just pick from a set of available configurations -- it configures your build according to possible configurations.
You could think of it like Nix with dependency resolution, but with a nice Python DSL. There is more on the "concretizer" (resolver) and how we've used ASP for it here:

* "Using Answer Set Programming for HPC Dependency Solving", https://arxiv.org/abs/2210.08404

[1] Spack – scientific software package manager for supercomputers, Linux, and macOS (100 comments): https://news.ycombinator.com/item?id=35237269
Well, to be fair, I'd consider this to be semi-HPC work. Obviously it's not multi-node, but because of the hardware it's not the same as using an ordinary desktop machine either, and it shares many of the challenges of HPC in getting stuff compiled for it, particularly with it being ARM-based. What you learn when you work on this stuff is that you need very specific combinations of packages that your distro just isn't going to be able to provide, and Homebrew doesn't give you enough flexibility for that.
Depending on the kind of project and its data agreements, it's sometimes much easier to run computations on-premises than in the cloud, even though the cloud is arguably somewhat more secure.
I, for example, have some healthcare research projects with personally identifiable data, and in these times it's simpler for the users to trust my company than to trust my company plus some overseas company and its associated government.
For me as an employee in Australia, I could buy this and write it off on my tax as a work expense myself. Renting would be much more cumbersome, since it would involve the company. That's 45% off (our top marginal tax rate).
Can people please not listen to this terrible advice, which gets repeated so often, especially in Australian IT circles, somehow by young, naive folks.
You really need to talk to your accountant here.
It's probably under a 25% deduction at double the median wage, and a little over that at triple, and that's *only* if you are using the device entirely for work, as in it sits in an office and nowhere else. If you are using it personally, you open yourself up to all sorts of drama if and when the ATO ever decides to audit you for making a $6k AUD claim for a computing device beyond what you normally use to do your job.
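To put rough numbers on that (illustrative only, not tax advice): what a deduction is actually worth is roughly price times your marginal rate times the work-use fraction. The bracket thresholds, incomes, and work-use splits below are assumptions, and real rules (e.g. depreciation of items over $300) complicate this further.

```python
# Illustrative deduction-value sketch; not tax advice.
def marginal_rate(taxable_income):
    # Approximate 2024-25 Australian resident brackets (check current rates).
    if taxable_income > 190_000: return 0.45
    if taxable_income > 135_000: return 0.37
    if taxable_income > 45_000:  return 0.30
    if taxable_income > 18_200:  return 0.16
    return 0.0

price = 6_000  # AUD, assumed device price
for income in (100_000, 180_000, 250_000):
    for work_use in (1.0, 0.5):
        saving = price * marginal_rate(income) * work_use
        print(f"income ${income:,}, {work_use:.0%} work use: "
              f"saves ~${saving:,.0f} of the ${price:,} price")
```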
I think the idea is that instead of spending an additional $4000 on external hardware, you can just buy one thing (your main work machine) and call it a day. Also, the Mac Studio isn’t that much cheaper at that price point.
Now that you bring it up, the M3 Ultra Mac Studio goes up to 512GB for about a $10k config with around 850 GB/s of bandwidth, for those who "need" a near-frontier large model. I think 4x the RAM is not quite worth more than doubling the price, especially if MoE support gets better, but it's interesting that you can get a DeepSeek R1 quant running on prosumer hardware.
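For a rough sense of what actually needs that much memory, here is a weights-only back-of-envelope. Parameter counts are the commonly cited ones; the quantization choices and overhead factor are assumptions.

```python
# Rough weights-only footprint: params * bits/8, plus a fudge factor for
# KV cache and runtime overhead.
def footprint_gb(params_b, bits, overhead=1.15):
    return params_b * bits / 8 * overhead

models = [
    ("DeepSeek R1 671B @ 4-bit", 671, 4),
    ("gpt-oss-120B @ 4-bit (MXFP4)", 120, 4),
    ("70B dense @ 8-bit", 70, 8),
    ("70B dense @ 4-bit", 70, 4),
]
for name, params_b, bits in models:
    gb = footprint_gb(params_b, bits)
    fits = "fits in 128 GB" if gb < 128 else ("needs ~512 GB class" if gb < 512 else "too big")
    print(f"{name:32s} ~{gb:5.0f} GB  -> {fits}")
```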
Is 128 GB of unified memory enough? I've found that the smaller models are great as a toy but useless for anything realistic. Will 128 GB hold any model that you can do actual work with or query for answers that returns useful information?
There are several 70B+ models that are genuinely useful these days.
I'm looking forward to GLM 4.6 Air - I expect that one should be pretty excellent, based on experiments with a quantized version of its predecessor on my Mac. https://simonwillison.net/2025/Jul/29/space-invaders/
128GB of unified memory is enough for pretty good models, but honestly, for the price of this, it is better to just go with a few 3090s or a Mac, due to the memory bandwidth limitations of this device.
The question is: how does prompt processing time on this compare to the M3 Ultra? Because that one sucks at RAG, even though it can technically handle huge models and long contexts...
Prompt processing time on Apple Silicon might benefit from making use of the NPU/Apple Neural Engine. (Note, the NPU is bad if you're limited by memory bandwidth, but prompt processing is compute limited.) Just needs someone to do the work.
I'm running vLLM on it now, and it was as simple as pulling and running the container from the NGC recipe (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?v...) and then starting the server inside the Docker container.
The default model it loads is Qwen/Qwen3-0.6B, which is tiny and fast to load.
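A quick smoke test against that server, assuming the container exposes vLLM's usual OpenAI-compatible endpoint on localhost port 8000:

```python
# Hit the locally served vLLM instance via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",   # the default model mentioned above
    messages=[{"role": "user", "content": "Say hello from the Spark."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```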
I have no immediate numbers for prefill, but the memory bandwidth is ~4x greater on a 4090 which will lead to ~4x faster decode.
* "Using Answer Set Programming for HPC Dependency Solving", https://arxiv.org/abs/2210.08404
[1] Spack – scientific software package manager for supercomputers, Linux, and macOS (100 comments):
https://news.ycombinator.com/item?id=35237269
For inference decode the bandwidth is the main limitation so if running LLMs is your use case you should probably get a Mac instead.
Running some other distro on this device is likely to require quite some effort.
https://github.com/apple/containerization/
P.S. exploded view from the horse's mouth: https://www.nvidia.com/pt-br/products/workstations/dgx-spark...
The 120B model is better but too slow since I only have 16GB VRAM. That model runs decent[1] on the Spark.
[1]: https://news.ycombinator.com/item?id=45576737