It's very much a work in progress, as noted in the article. And some of the stuff that worked reasonably well on my cards, like the instruction rate test when trying to measure throughput across the entire card, went down the drain when run on Arc.
Instead of computing 8 independent values, compute one with 8x more iterations:
    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;  // one dependent accumulator, 8x the iterations of the original
    }
That, plus inlining the iteration count so the compiler can unroll the loop, might help get closer to SOL (speed of light, i.e. the theoretical peak).

Switch is averaging 20M units/year and PS4 is averaging 13M units/year.
Since when did the average developer care about how many sockets a mobo has...?
Surely you still have to carefully pin processes and reason about memory access patterns if you want maximum performance.
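Right — on Linux that usually means pinning with taskset/numactl or sched_setaffinity and keeping allocations on the local node. A minimal sketch of the latter (Linux/glibc only; the choice of core 0 and its NUMA node is just an illustrative assumption):

    // Pin the calling thread to core 0 so memory it first touches stays on
    // one NUMA node under the default first-touch policy.
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);  // illustrative: pick a core on the NUMA node you care about

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  // pid 0 = calling thread
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core 0\n");
        return 0;
    }

From the shell, numactl --cpunodebind=0 --membind=0 ./app gets you roughly the same effect without touching the code.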
I was talking about the mobile GPU in MacBook Pros, which is based on a 14nm chip. The full name is Radeon Pro Vega 20:
https://www.amd.com/en/graphics/radeon-pro-vega-20-pro-vega-...
https://www.techpowerup.com/gpu-specs/radeon-pro-vega-20.c32...
Vega 20 also seems to refer to a discrete GPU, which was later rebranded as the Radeon VII (maybe because of this confusion). The number you are quoting is for that discrete GPU.
The raw compute power of the M1's GPU seems to be 2.6 TFLOPS (single precision) vs 3.2 TFLOPS for Vega 20, which gives you a rough estimate of how fast it would be for training.
Just for reference, Nvidia's flagship desktop GPU (the 3090) has an FP32 performance of 35.5 TFLOPS.
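Those peak numbers are basically FP32 ALU count x 2 FLOPs per FMA x clock. A rough back-of-the-envelope check (the peak_tflops helper is mine, and the ALU counts and clocks are approximate public figures, not measurements):

    #include <stdio.h>

    /* Peak FP32 throughput = ALUs * 2 FLOPs per FMA * clock (GHz), in TFLOPS. */
    static double peak_tflops(int alus, double clock_ghz) {
        return alus * 2.0 * clock_ghz / 1000.0;
    }

    int main(void) {
        printf("M1 GPU   (1024 ALUs,  ~1.28 GHz): %.1f TFLOPS\n", peak_tflops(1024, 1.28));
        printf("Vega 20  (1280 ALUs,  ~1.25 GHz): %.1f TFLOPS\n", peak_tflops(1280, 1.25));
        printf("RTX 3090 (10496 ALUs, ~1.7 GHz):  %.1f TFLOPS\n", peak_tflops(10496, 1.695));
        return 0;
    }

That prints roughly 2.6, 3.2 and 35.6 TFLOPS, in line with the figures quoted above; of course peak FLOPS says nothing about memory bandwidth or how well a training workload actually feeds the ALUs.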
There are many reasons to hate nvidia, but honestly if this UBC policy is even remotely being considered in some circles, I'd join Linus Torvalds and say "nvidia, fuck you".