Author here. I've updated the article based on your feedback. Thank you.
Key corrections:
Ollama GPU usage - I was wrong. It IS using GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
FP16 vs BF16 - enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. Should be "FP16 has issues, BF16 untested (likely works)."
llama.cpp - veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
ARM64+CUDA maturity - bradfa was right about Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
Ship early, iterate publicly, accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
Is there a reason why you used an LLM for the entire article, and moreover, even for this comment? Couldn't you have at least written this comment yourself?
To be charitable, I'm assuming that their English skills aren't good. If LLMs allow us to hear from potentially billions of people who may have something worthwhile to say but who fall into that category, I wouldn't want to discourage their use in articles like this one.
But if that's not the case, then yeah, it's a crappy practice and I'd hate to see it spread any further than it already has.
Is that version correct?
Asking because (in Ollama terms) it's positively ancient; 0.12.6 is the most recent release (currently).
I'm guessing it _might_ make a difference, as the Ollama crowd do seem to be changing things, adding new features and optimisations (etc) quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (i.e. Intel Xe GPUs) was added, and in my testing that worked. Not that Vulkan support would do anything in your case. ;)
Late to the party here, but you should definitely be using PyTorch 25.09 (or whatever is latest when you go to check) rather than 24.10. That's a year-old PyTorch on new hardware; I suspect a lot of these bugs have been fixed.
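If that means the NGC containers, which is what the 24.10/25.09 numbering suggests (my assumption), the switch is just pulling the newer tag, e.g.:

    docker pull nvcr.io/nvidia/pytorch:25.09-py3   # rather than nvcr.io/nvidia/pytorch:24.10-py3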
One of my colleagues wrote a first impressions blog post last week. It's from our company's perspective, but is a solid overview of the product and intended capabilities, from the POV of an AI developer or data scientist.
https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...
> There you’ll see the 10 Cortex-X925 (“performance”) cores listed with a peak clock rate of 4 GHz, along with the 10 Cortex-A725 (“efficiency”) cores listed with a peak clock rate of 2.8 GHz
> If you start Python and ask it how many CPU cores you have, it will count both kinds of cores and report 20
> Note that because of the speed difference between the cores, you will want to ensure there is some form of dynamic scheduling in your application that can load balance between the different core types.
Sounds like a new type of hell where I now not only need to manage the threads themselves, but also have to take into account what type of core they run on, while Python straight up reports them as the same.
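On Linux you can at least recover the split from sysfs and build your own mapping. A rough sketch (it assumes the standard cpufreq layout; I haven't verified it on the Spark itself):

    import os
    from pathlib import Path

    print(os.cpu_count())  # 20 - P- and E-cores lumped together

    # Group logical CPUs by max clock so a scheduler could tell the core types apart.
    by_freq = {}
    for cpu in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        f = cpu / "cpufreq" / "cpuinfo_max_freq"
        if f.exists():
            by_freq.setdefault(int(f.read_text()), []).append(cpu.name)

    for khz, names in sorted(by_freq.items(), reverse=True):
        print(f"{khz // 1000} MHz: {sorted(names)}")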
> The CPU memory is the same as the GPU memory and is much larger than any other discrete GPU available in a desktop. That means much larger datasets and bigger models can be run locally than would be possible otherwise.
Isn't this the same architecture that Apple's Mx implements, from a memory perspective?
I absolutely love it. I’ve been up for days playing with it. But there are some bleeding edge issues. I tried to write a balanced article. I would highly recommend it for people who love to get their hands dirty. Blows away any consumer GPU.
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I’m more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there’s nothing like a personal computer.
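The kind of two-box smoke test I mean is a minimal sketch like this (hostnames and ports are placeholders):

    # nccl_smoke.py - run on both Sparks, e.g.:
    #   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> \
    #            --master_addr=<first-spark-hostname> --master_port=29500 nccl_smoke.py
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")   # exercises NCCL between the two boxes
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)                        # default op is SUM; with two ranks expect 0 + 1 = 1.0
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t.item()}")
    dist.destroy_process_group()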
Your complaint sounds more like it's about the way you have to access the HPC (via Slurm), not the compute itself. After having now tried Slurm myself, I don't understand the love for it at all.
As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?
Nah. Do you have first-hand experience with Strix Halo? At less than 1600€ for a 128GB configuration it manages >45 tokens/s with gpt-oss 120b, which is faster than the DGX Spark at a fraction of the cost.
One thing I can’t find anyone mention in reviews: does inference screech to a halt when using large context windows on models? Say if you’re in the 100k range on gpt-oss. I’m not concerned about lightning inference speed overall, as I understand the purpose of the Spark is to be a well-rounded trainer/tuner. I just want to know if it becomes unusable vs. a reasonable slowdown at larger contexts. That’s the thing people are unpleasantly surprised to find about a Mac Studio, and it has prevented me from going that route.
But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…
Since the text is obviously LLM output, how much prompting and editing went into this post? Did you have to correct anything that you put into it that it then got wrong or added incorrect output to?
Definitely reeks of someone who doesn't know what makes a readable blogpost and hoped the LLM did.
I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. Tried to skim the article and it's a mess. Inconsistent formatting and emoji without a single graph to visualize benchmarks.
There are bleeding edge issues, but everyone dials in support for transformers first, so that path is generally pain-free.
I haven't exactly bisected the issue, but I'm pretty sure convolutions are broken on sm_121 after a certain size: a 2x batch size increase gives a 20x memory blowup from a convolution, _only_ on the DGX Spark.
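A rough sketch of the kind of comparison that shows it (the sizes here are placeholders, not the exact ones that trigger it for me):

    import torch

    def peak_gib_for_batch(n):
        torch.cuda.reset_peak_memory_stats()
        conv = torch.nn.Conv2d(64, 64, 3, padding=1).cuda().to(torch.bfloat16)
        x = torch.randn(n, 64, 512, 512, device="cuda", dtype=torch.bfloat16)
        y = conv(x)
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**30

    # Doubling the batch should roughly double peak memory, not blow it up 20x.
    for n in (8, 16):
        print(n, f"{peak_gib_for_batch(n):.2f} GiB")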
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for gpt-oss last time I checked and on release; not sure if something broke along the way.
I don't know exactly whether memory fragmentation is something fixable on the driver side. This might just be a problem with the kernel's policy and the GPL, which prevent them from automatically interfering with the memory subsystem at the granularity they'd like (see ZFS and its page table antics) - or so my thinking goes.
If you've done stuff on WSL, you've seen similar issues, and you can fix it by running a service that periodically compacts and cleans memory; I have it run every hour. Note that this does impact at the very least CPU performance and memory allocation speeds, but I have not had any issues with long training runs (24hr+) with it in place. That's assuming fragmentation is even the issue - I have never tried without the service, since I put it in place as soon as I got the machine because of my experience on WSL.
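A sketch of the kind of hourly job I mean, assuming the usual Linux vm knobs (the idea, not necessarily the exact script):

    # compact_memory.py - run hourly as root via cron or a systemd timer.
    import os
    from pathlib import Path

    os.sync()                                            # flush dirty pages first
    Path("/proc/sys/vm/compact_memory").write_text("1")  # ask the kernel to compact fragmented memory
    Path("/proc/sys/vm/drop_caches").write_text("3")     # drop page cache + dentries/inodes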
I'm not yet using mine for ML stuff because there are still a lot of various issues like this post outlined. But I am using mine as an ARM dev system in the meantime, and as a "workstation" it's actually quite good. The Cortex-X925 cores are Zen5 class in performance, and it is overall an absolute unit for its size; I'm very impressed that a standard ARM core is pushing this level of performance for a desktop-class machine. I thought about buying a new Linux desktop recently, and this is good enough I might just plug it into a monitor and use it instead.
It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot up Fedora 42 and install the open kernel modules no problem. The overall delta/number of specific patches for the Canonical 6.17-nvidia tree was pretty small when I looked (the currently shipping kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.
To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output; but the software is still buggy and Blackwell is still a bumpy ride overall, so it might get better). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.
There weren't any instructions on how the author got ollama/llama.cpp; could it possibly be something Nvidia shipped with the DGX Spark, and an old version?
Theoretically it has slightly better memory bandwidth, (you are supposed to get) the Nvidia AI software ecosystem support out of the box, and you can use the 200G NIC to stick 2 together more efficiently.
Practically, if the goal is 100% about AI and cloud isn't an option for some reason, both options are likely "a great way to waste a couple grand trying to save a couple grand" as you'd get 7x the performance and likely still feel it's a bit slow on larger models using an RTX Pro 6000. I say this as a Ryzen AI Max+ 395 owner, though I got mine because it's the closest thing to an x86 Apple Silicon laptop one can get at the moment.
Because the ML ecosystem is more mature on the NVidia side. Software-wise the cuda platform is more advanced. It will be hard for AMD to catch up. It is good to see competition tho.
But the article shows that the Nvidia ecosystem isn't that mature either on the DGX Spark with ARM64. I wonder if Nvidia is still ahead for such use cases, all things considered.
The complete Framework Desktop with everything working (including said Ryzen AI Max 395+ and 128 GB of RAM) is 2500 EUR. In Europe the DGX Spark listings are at 4000+ EUR.
The vast majority of Ryzen AI Max+ 395s (by volume at least) are sold as complete system offerings as well. About as far as you can go the other way is getting one without an SSD, as the MB+RAM+CPU are an "all or nothing" bundle anyways.
> ARM64 Architecture: Not x86_64 (limited ML ecosystem maturity)
> No PyTorch wheels for ARM64+CUDA (must use Docker)
> Most ML tools optimized for x86
No evidence for any of this whatsoever. The author just asked Claude/claude code to write their article and it just plain hallucinated some rubbish.
Like in Upstream Color: https://www.youtube.com/watch?v=zfDyEr8Ykcg
There are official benchmarks of the Spark running multiple models just fine on llama.cpp
https://github.com/ggml-org/llama.cpp/discussions/16578