FYI you should have used llama.cpp to do the benchmarks. For the gpt-oss-120b model its prompt processing is almost 20x faster than ollama's. Here are some sample results on my spark:
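The invocation was roughly the following (reconstructed from the table columns rather than copied verbatim, and the model filename is just a placeholder):

```bash
# Sketch of the llama-bench run behind the table below; flags are inferred from
# the ngl / n_ubatch / fa columns and the pp4096 / tg32 tests, filename is a placeholder.
llama-bench \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 \
  -ub 2048 \
  -fa 1 \
  -p 4096 \
  -n 32
```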
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 3564.31 ± 9.91 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 53.93 ± 1.71 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp4096 | 1792.32 ± 34.74 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 38.54 ± 3.10 |

I see! Do you know what's causing the slowdown for ollama? They should be using the same backend...
Running gpt-oss-120b on an RTX 5090 with 2/3 of the experts offloaded to system RAM (which has less than half the memory bandwidth of this thing), my machine gets ~4100 t/s prefill and ~40 t/s decode.
Your spreadsheet shows the spark getting ~94 t/s prefill and ~11 t/s decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode is mostly memory-bandwidth-bound, so it should be very similar, or even a touch faster on the spark.
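For reference, offloading a subset of the experts to system RAM can be done in llama.cpp with the CPU-MoE option (or an equivalent --override-tensor pattern). A sketch, where the layer count and filename are illustrative assumptions rather than my exact setup:

```bash
# Keep everything on the GPU except the expert (MoE FFN) tensors of the first 24 layers,
# which stay in system RAM. 24 is illustrative, roughly 2/3 of the layers; adjust to fit
# your VRAM. Needs a recent llama.cpp build that has --n-cpu-moe.
llama-server \
  -m gpt-oss-120b-mxfp4.gguf \
  -ngl 99 \
  --n-cpu-moe 24
```

Only the offloaded expert weights take the slow path through system RAM, while attention and routing stay on the GPU, which is why decode stays usable with most of the model in host memory.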