IIRC, Depthwise is memory bound so the bar might be lower. Perhaps you can try some thing with higher compute intensity like a matrix multiply. I have observed, it trips up with the columnar accesses for SIMD.
Also important to have a few test cases the agent can quickly check against, it will often generate wrong code, but if that is easily detectable the agent can fix it and continue quickly.
I started out by letting it write a naive C version without intrinsic, and validated it against the PyTorch version.
Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.
Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.
It then tested a few different variants, ran it on the target (RISC-V SBC, OrangePI RV2) to check if it improves runtime, and then continue from there. It did this 10 times, until it arrived at the final version.
The final code is very readable, and faster than any other library or compiler that I have found so far. I think the clear guardrails (output has to match exactly the reference output from PyTorch, performance must be better than before) makes this work very well.
The big advantage of the PCIe version is that it does not take up space on the desk and all the cables for ATX power control an inside the PC case.
Full-sized HDMI is nice, the only limitation here is 1080p resolution. 1440p or higher would allow mirroring the output on the main monitor to the NanoKVM, but this probably a weird use-case anyway.
You won't be out of work creating ggufs anytime soon :)
The new tensor cores, sorry, "Neural Accelerator" only really help with prompt preprocessing aka prefill, and not with token generation. Token generation is memory bound.
Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.