mmaunder · a month ago
For those of you wondering if this fits your use case vs the RTX 5090 the short answer is this:

The desktop RTX 5090 has 1792 GB/s of memory bandwidth, largely thanks to its 512-bit bus, compared to the DGX Spark's 256-bit bus and 273 GB/s of memory bandwidth.

The RTX 5090 has 32 GB of VRAM vs the 128 GB of “VRAM” in the DGX Spark, which is really unified memory.

Also the RTX 5090 has 21,760 CUDA cores vs 6,144 in the DGX Spark (about 3.5× as many), and with the much higher bandwidth on the 5090 you have a better shot at keeping them fed. So for embarrassingly parallel workloads the 5090 crushes the Spark.

So if you need to fit big models into VRAM and don't care too much about speed (because, for example, you're building something on your desktop that will run on data center hardware in production), the DGX Spark is your answer.

If you need speed, 32 GB of VRAM is plenty, and you don't care about modeling network interconnects in production, then the RTX 5090 is what you want.
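
A rough way to see what that bandwidth gap means in practice: single-stream token generation on a dense model is essentially memory-bandwidth-bound, so tokens/s is capped at bandwidth divided by the bytes of weights read per token. Back-of-the-envelope sketch (the model sizes are illustrative assumptions, not benchmarks):

    # Decode speed upper bound: each generated token streams the active weights
    # through the memory bus once, so tokens/s <= bandwidth / model size.
    def max_tokens_per_sec(bandwidth_gb_s, model_gb):
        return bandwidth_gb_s / model_gb
    # Bandwidth and memory figures from above; model sizes are assumptions.
    for name, bw, mem in [("RTX 5090", 1792, 32), ("DGX Spark", 273, 128)]:
        for model_gb in (18, 40, 64):  # e.g. ~32B Q4, ~70B Q4, ~120B MXFP4
            label = (f"~{max_tokens_per_sec(bw, model_gb):.0f} tok/s"
                     if model_gb <= mem else "doesn't fit")
            print(f"{name:9s} {model_gb:3d} GB model: {label}")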

kouteiheika · a month ago
> building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer

It isn't, because it's a different architecture than the datacenter hardware. They're both called "Blackwell", but that's a lie[1], and you still need a "real" datacenter Blackwell card for development work. (For example, you can't configure/tune vLLM on a Spark, then move it to a B200 and expect it to work, etc.)

[1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22
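
A quick way to see the split, assuming a CUDA-enabled PyTorch build: kernel selection and tuning key off the compute capability, and the Spark's GB10 reports a different one than the datacenter parts despite the shared branding:

    import torch  # assumes a CUDA-enabled PyTorch build
    # cuBLAS heuristics, Triton autotuning, and vLLM build targets all key off
    # the compute capability, not the marketing name.
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)} -> sm_{major}{minor}")
    # Datacenter Blackwell (B200/GB200) reports sm_100, while GB10 reports
    # sm_12x, so kernels and configs tuned on one don't carry over directly.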

benreesman · a month ago
sm_120 (aka 1CTA) supports tensor cores and TMEM just fine: example 83 shows block-scaled NVFP4 (I've gotten ~1850 dense TFLOPs at 600 W; the 300 W part caps out more like 1150). sage3 (which is no way in hell from China, myelin knows it by heart) cracks a petaflop in bidirectional non-causal attention.

The nvfuser code doesn't even call it sm_100 vs. sm_120: NVIDIA's internal nomenclature seems to be 2CTA/1CTA, it's a bin. So there are fewer MMA tilings in the released ISA as of 13.1 / r85 44.

The mnemonic tcgen05.mma doesn't mean anything by itself; it's lowered onto real SASS. FWIW, the people I know doing their own drivers say the whole ISA is there, but it doesn't matter.

The family of mnemonics that hits the "Jensen Keynote" path is roughly here: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp....

The 10x path is hot today on Thor, Spark, the 5090, the 6000, and data center parts.

Getting it to trigger reliably on real tilings?

Well that's the game just now. :)

Edit: https://customer-1qh1li9jygphkssl.cloudflarestream.com/1795a...

my123 · a month ago
Note that sm_110 (Jetson Thor) has the tcgen05 ISA exposed (with TMEM and all) instead of the sm_120 model.
chao- · a month ago
It's also worth noting that the 128GB of "VRAM" in the GB10 is even less straightforward than just "shared with the CPU cores" suggests. There are a lot of details in memory performance that differ across both the different core types and the two core clusters:

https://chipsandcheese.com/p/inside-nvidia-gb10s-memory-subs...

jasoneckert · a month ago
I've got the Dell version of the DGX Spark as well, and was very impressed with the build quality overall. Like Jeff Geerling noted, the fans are super quiet. And since I don't keep it powered on continuously and mainly connect to it remotely, the LED is a nice quick check for power.

But the nicest addition Dell made in my opinion is the retro 90's UNIX workstation-style wallpaper: https://jasoneckert.github.io/myblog/grace-blackwell/

ranger_danger · a month ago
I just want a standard, affordable mini PC that looks like this one. Or better yet, with the brown accents normally found on recent PowerEdge systems.

https://www.fsi-embedded.jp/contents/uploads/2018/11/DELLEMC...

storus · a month ago
Zotac has a bunch of x64 mini PCs that use a similar hexagonal styling.
mapontosevenths · a month ago
I've had mine for a while now, and never actually connected a monitor to it. Now I'll have to. Thanks. :)
Tepix · a month ago
You can get two Strix Halo PCs with similar specs for that $4,000 price. I just hope that prompt processing speeds will continue to improve, because Strix Halo is still quite slow in that regard.

Then there is the networking. While Strix Halo systems come with two USB4 40Gbit/s ports, it's difficult to

a) connect more than 3 machines with two ports each

b) get more than 23 Gbit/s or so per connection, if you're lucky. Latency will also be in the 0.2 ms range, which leaves room for improvement.

Something like Apple's RDMA via Thunderbolt would be great to have on Strix Halo…
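
For a sense of what those link numbers mean when splitting a model across two boxes, here's a rough per-token estimate using the ~23 Gbit/s and ~0.2 ms figures above (the hidden size and activation dtype are illustrative assumptions):

    # Per-token cost of crossing one USB4 link at a pipeline-parallel split.
    link_gbit_s = 23          # usable throughput per connection, from above
    latency_s = 0.2e-3        # ~0.2 ms link latency, from above
    hidden = 8192             # assumed hidden size of a ~70B-class model
    xfer_bytes = hidden * 2   # assumed fp16 activations crossing the split
    xfer_s = xfer_bytes * 8 / (link_gbit_s * 1e9)
    print(f"transfer {xfer_s * 1e3:.3f} ms + latency {latency_s * 1e3:.1f} ms per token per hop")
    # Latency dominates by ~30x: tolerable for pipeline parallelism (one hop
    # per token), painful for tensor parallelism (a sync at every layer).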

coder543 · a month ago
As you allude to, prompt processing speed is a killer advantage of the Spark that even two Strix Halo boxes would not match.

Prompt processing is literally 3x to 4x faster on GPT-OSS-120B once you are a little way into your context window, and the Spark is similarly much faster for image generation or any other AI task.

Plus the Nvidia ecosystem, as others have mentioned.

One discussion with benchmarks: https://www.reddit.com/r/LocalLLaMA/comments/1oonomc/comment...

If all you care about is token generation with a tiny context window, then they are very close, but that’s basically the only time. I studied this problem extensively before deciding what to buy, and I wish Strix Halo had been the better option.
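
The asymmetry comes down to where each phase is bound: prefill is compute-bound (roughly 2 FLOPs per active parameter per prompt token), while decode mostly streams weights and is bandwidth-bound. Rough sketch with assumed sustained-throughput numbers, not measurements of either box:

    # Prefill cost is roughly 2 FLOPs per active parameter per prompt token.
    active_params = 5.1e9   # GPT-OSS-120B activates ~5.1B parameters per token
    prompt_tokens = 16_000
    prefill_flops = 2 * active_params * prompt_tokens
    for name, tflops in [("slower box", 15), ("faster box", 50)]:  # assumed
        print(f"{name}: ~{prefill_flops / (tflops * 1e12):.0f} s to prefill {prompt_tokens} tokens")
    # Decode, by contrast, streams the few GB of active weights once per token,
    # so two machines with similar memory bandwidth generate at similar speeds
    # even when their prefill throughput differs by 3-4x.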

zozbot234 · a month ago
Prompt processing could be sped up with NPU inference. The Strix Halo NPU is a bit weird (XDNA 2, so the architecture is spatial dataflow with programmable interconnects), but it's there. See https://github.com/FastFlowLM/FastFlowLM (directly supported by https://lemonade-server.ai/ and https://github.com/lemonade-sdk/lemonade) for one existing project that plans to support the NPU for the prompt-processing phase. (Do note that FLM provides proprietary NPU kernels under a non-free license, so make sure that fits your needs before use.)
EnPissant · a month ago
Then again, I have an RTX 5090 + 96 GB of DDR5-6000 that crushes the Spark on prompt processing of gpt-oss-120b (something like 2-3x faster), while token generation is pretty close. I paid about $3,200 for the entire computer. With the currently inflated RAM prices, it would probably be closer to the Dell's price.

So while I think the Strix Halo is a mostly useless machine for any kind of AI, and the Spark is actually useful, I don't think pure inference is a good use case for either of them.

It probably only makes sense as a dev kit for larger cloud hardware.

plagiarist · a month ago
Could I get your thoughts on the ASUS GX10 vs. spending on GPU compute? It seems like one could get a lot of total VRAM with better memory bandwidth and make PCIe the bottleneck, especially if you already have a motherboard with spare slots.

I'm trying to better understand the trade offs, or if it depends on the workload.

Aurornis · a month ago
The primary advantage of the DGX box is that it gives you access to the Nvidia ecosystem. You can develop against it almost like a mini version of the big servers you're targeting.

It's not really intended to be a great value box for running LLMs at home. Jeff Geerling talks about this in the article.

cmrdporcupine · a month ago
Exactly this. I'm not sure why people keep banging the "a Mac or Strix Halo is faster/cheaper" drum. It's a different market.

If I want to do hobby/amateur AI research, fine-tune models, and learn the tooling, I'm better off with the GB10 than with AMD's or Apple's systems.

The Strix Halo machines look nice. I'd like one of those too. Especially if/when they ever get around to getting it into a compelling laptop.

But I ordered the ASUS Ascent GX10 machine (since it was more easily available for me than the other versions of these) because I want to play around with fine-tuning open-weight models, learning the tooling, etc.

That, and I like the idea of having a (non-Apple) AArch64 Linux workstation at home.

Now if the courier would just get their shit together and actually deliver the thing...

saagarjha · a month ago
DGX Spark has a different compute capability, so no, you really aren’t.
benreesman · a month ago
NVFP4 (and to a lesser extent MXFP8) works in general. In terms of usable FLOPS, the DGX Spark and the GMKtec EVO-X2 both lose to the 5090, but with NCCL and OpenMPI set up the DGX is still the nicest way to develop for our SBSA future. Working on that too; it's a harder problem.
kristianp · a month ago
I know it's just a quick test, but Llama 3.1 is getting a bit old. I would have liked to see a newer model that can fit, such as gpt-oss-120b (gpt-oss-120b-mxfp4.gguf), which is about 60 GB of weights (1).

(1) https://github.com/ggml-org/llama.cpp/discussions/15396

coder543 · a month ago
Even though big, dense models aren't fashionable anymore, they are perfect for speculative decoding (specdec), so it can be fun to see the speedup that is possible.

I can get about 20 tokens per second on the DGX Spark using llama-3.3-70B with no loss in quality compared to the model you were benchmarking:

    llama-server \
        --model      llama-3.3-70b-instruct-ud-q4_k_xl.gguf \
        --model-draft llama-3.2-1b-instruct-ud-q8_k_xl.gguf \
        --ctx-size      80000 \
        --ctx-size-draft 4096 \
        --draft-min 1 \
        --draft-max 8 \
        --draft-p-min 0.65 \
        -ngl 999 \
        --flash-attn on \
        --parallel 1 \
        --no-mmap \
        --jinja \
        --temp 0.0 \
        -fit off
Specdec works well for code, so the prompt I used was "Write a React TypeScript demo".

    prompt eval time = 313.70 ms / 40 tokens (7.84 ms per token, 127.51 tokens per second)
    eval time = 46278.35 ms / 913 tokens (50.69 ms per token, 19.73 tokens per second)
    total time = 46592.05 ms / 953 tokens
    draft acceptance rate = 0.87616 (757 accepted / 864 generated)
The draft model cannot affect the quality of the output. A good draft model makes token generation faster and a bad one slows it down, but the output will be the same as the main model's either way.
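
For anyone unfamiliar with why that holds: at temperature 0 the target model re-scores all drafted tokens in one forward pass, and only the longest prefix matching its own greedy choices is kept, so every emitted token is one the target would have produced anyway. Minimal sketch (draft_argmax and target_argmax_batch are hypothetical stand-ins for the two models):

    # Greedy speculative decoding, one step. The output is identical to what
    # the target model would produce alone; only the number of target forward
    # passes per emitted token changes.
    def specdec_step(ctx, draft_argmax, target_argmax_batch, k=8):
        # 1. Draft model proposes k tokens autoregressively (cheap).
        drafted = []
        for _ in range(k):
            drafted.append(draft_argmax(ctx + drafted))
        # 2. Target scores ctx + drafted tokens in ONE forward pass, giving
        #    its own greedy choice at each of the k+1 positions.
        target_choice = target_argmax_batch(ctx, drafted)  # length k + 1
        # 3. Keep drafted tokens while they match the target, then append the
        #    target's token at the first mismatch (or its bonus token if all
        #    k matched). The acceptance rate affects speed, never content.
        accepted = []
        for i, tok in enumerate(drafted):
            if tok != target_choice[i]:
                break
            accepted.append(tok)
        accepted.append(target_choice[len(accepted)])
        return ctx + accepted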

kristianp · a month ago
Thanks!
eurekin · a month ago
Correct, most of r/LocalLLaMA has moved on to next-gen MoE models. DeepSeek introduced a few good optimizations that every new model seems to use now. Llama 4 was generally seen as a fiasco, and Meta hasn't made a release since.
kouteiheika · a month ago
Llama 4 isn't that bad, but it was overhyped, and people generally "hold it wrong".

I recently needed an LLM to batch-process some queries for me. I ran an ablation over 20+ models from OpenRouter to find the best one. Guess which ones got 100% accuracy? GPT-5-mini, Grok-4.1-fast and... Llama 4 Scout. For comparison, DeepSeek v3.2 got 90%, and the community darling GLM-4.5-Air got 50%. Even the newest GLM-4.7 only got 70%.

Of course, this is just a single anecdotal data point that doesn't prove anything, but it suggests Llama 4 is probably underrated.

fragmede · a month ago
What are some of the models people are using? (Rather than naming the ones they aren't.)
alecco · a month ago
IMHO the DGX Spark at $4,000 is a bad deal, with only 273 GB/s of bandwidth and compute capacity somewhere between a 5070 and a 5070 Ti. And with PCIe 5.0 offering 64 GB/s, it's not such a big difference.

And the 2x 200 Gbit/s QSFP ports... why would you stack a bunch of these? Does anybody actually use them in day-to-day work/research?

I liked the idea until the final specs came out.

BadBadJellyBean · a month ago
I think the selling point is the 128 GB of unified system memory. With that you can run some interesting models. The 5090 maxes out at 32 GB, and it costs about $3,000 and up at the moment.
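
To put rough numbers on what fits where (weights only; KV cache and activations come on top, and the bits-per-weight values are typical for those quant formats, not exact):

    # Weights-only footprint: params (billions) * bits-per-weight / 8 -> GB.
    def weight_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8
    for name, params_b, bpw in [("~32B dense @ Q4_K_M", 32, 4.8),
                                ("~70B dense @ Q4_K_M", 70, 4.8),
                                ("~120B MoE @ MXFP4", 117, 4.25)]:
        gb = weight_gb(params_b, bpw)
        fit = "fits in 32 GB" if gb <= 32 else "needs the 128 GB box"
        print(f"{name}: ~{gb:.0f} GB of weights ({fit})")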
alecco · a month ago
1. /r/localllama unanimously doesn't like the Spark for running models

2. And for CUDA dev it's not worth the crazy price when you can develop on a cheap RTX card and then rent a GH or GB server for a couple of days if you need to sort out compatibility and scaling.

cat_plus_plus · a month ago
I have a slightly cheaper, similar box, the NVIDIA Thor Dev Kit. The point is exactly to avoid deploying code to servers that cost half a million dollars each. It's quite capable of running or training smart LLMs like Qwen3-Next-80B-A3B-Instruct-NVFP4, so long as you don't tear your hair out first figuring out peculiarities and fighting with bleeding-edge nightly vLLM builds.
echion · a month ago
> training smart LLMs like Qwen3-Next-80B-A3B-Instruct-NVFP4

Sounds interesting; can you suggest any good discussions of this (on the web)?

kachapopopow · a month ago
Dell fixing issues instead of creating new ones? That's a new one for me. I'd still rather not deal with their firmware updaters, though.
cjbgkagh · a month ago
Give them a chance, I’m sure they’ll add new issues in one of their monthly bios updates.
kachapopopow · a month ago
Nothing beats perfectly good vendor firmware updates packaged in an obscenely complicated bash file that just extracts the tool and runs it, while performing unnecessary and often broken validation that only passes on hardware in their ecosystem (e.g. a Dell NIC in a non-Dell chassis).
cpgxiii · a month ago
Absent disassembly and direct comparison between a DGX Spark and a Dell GB10, I don't think there's sufficient evidence to say what is meaningfully different between these devices (beyond the obvious power LED). Anything over 240 W is beyond the USB-C EPR spec, and while Dell does ship a questionably-compliant 280 W USB-C supply, you'd have to compare actual power consumption to see whether the Dell supply is really providing more power. I suspect any other minor differences in experience/performance are better explained by the increasing maturity of the DGX software stack than by anything unique to the Dell version; in particular, comparisons to very early DGX Spark behavior need to keep in mind that the software and firmware have seen a number of updates.
geerlingguy · a month ago
Comparing notes with Wendell from Level1Techs, the ASUS and Dell GB10 boxes were both able to sustain better performance thanks to better thermal management. That's a fairly significant improvement. The Spark's gold-crusted facade seems to be more form over function.