It's very much a work in progress, as noted in the article. And some of the stuff that worked reasonably well on my cards, like the instruction rate test when trying to measure throughput across the entire card, went down the drain when run on Arc.
Instead of computing 8 independent values, compute one with 8x more iterations:
    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;  // one dependent accumulator, 8x the iterations of the original
    }
That, plus inlining the iteration count so the compiler can unroll the loop, might help get closer to SOL (speed of light, i.e. the theoretical peak).

Switch is averaging 20M units/year and PS4 is averaging 13M units/year.
Since when did the average developer care about how many sockets a mobo has...?
Surely you still have to carefully pin processes and reason about memory access patterns if you want maximum performance.
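Right — on Linux that usually means pinning with taskset/numactl or sched_setaffinity and keeping allocations on the local node. A minimal sketch of the latter (Linux/glibc only; the choice of core 0 and its NUMA node is just an illustrative assumption):

    // Pin the calling thread to core 0 so memory it first touches stays on
    // one NUMA node under the default first-touch policy.
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);  // illustrative: pick a core on the NUMA node you care about

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  // pid 0 = calling thread
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core 0\n");
        return 0;
    }

From the shell, numactl --cpunodebind=0 --membind=0 ./app gets you roughly the same effect without touching the code.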
I was talking about the mobile GPU in MacBook Pros, which is based on a 14nm chip. The full name is Radeon Pro Vega 20:
https://www.amd.com/en/graphics/radeon-pro-vega-20-pro-vega-...
https://www.techpowerup.com/gpu-specs/radeon-pro-vega-20.c32...
Vega 20 also seems to refer to a discrete GPU, which was later rebranded as the Radeon VII (maybe because of this confusion). The number you are quoting is for that discrete GPU.
The raw compute power of the M1's GPU seems to be 2.6 TFLOPS (single precision) vs 3.2 TFLOPS for Vega 20, which gives you a rough estimate of how fast it would be for training.
Just for reference, Nvidia's flagship desktop GPU (the 3090) has an FP32 performance of 35.5 TFLOPS.
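Those peak numbers are basically FP32 ALU count x 2 FLOPs per FMA x clock. A rough back-of-the-envelope check (the peak_tflops helper is mine, and the ALU counts and clocks are approximate public figures, not measurements):

    #include <stdio.h>

    /* Peak FP32 throughput = ALUs * 2 FLOPs per FMA * clock (GHz), in TFLOPS. */
    static double peak_tflops(int alus, double clock_ghz) {
        return alus * 2.0 * clock_ghz / 1000.0;
    }

    int main(void) {
        printf("M1 GPU   (1024 ALUs,  ~1.28 GHz): %.1f TFLOPS\n", peak_tflops(1024, 1.28));
        printf("Vega 20  (1280 ALUs,  ~1.25 GHz): %.1f TFLOPS\n", peak_tflops(1280, 1.25));
        printf("RTX 3090 (10496 ALUs, ~1.7 GHz):  %.1f TFLOPS\n", peak_tflops(10496, 1.695));
        return 0;
    }

That prints roughly 2.6, 3.2 and 35.6 TFLOPS, in line with the figures quoted above; of course peak FLOPS says nothing about memory bandwidth or how well a training workload actually feeds the ALUs.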
There are many reasons to hate nvidia, but honestly if this UBC policy is even remotely being considered in some circles, I'd join Linus Torvalds and say "nvidia, fuck you".