The fact that I can export to the clipboard, re-import it, and reconstruct all the shapes etc. almost flawlessly is such a big win.
For GEMM you need to visit each row/vector n times, so there's a lot of data reuse going on, which is awkward for GPUs since you can't keep all of that data close to your processing units. And while the tensor cores kind of implement this, I don't think they scale up to a full-sized systolic array, which is what you'd want for larger matrix multiplications.
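To make the data-reuse point concrete, here's a minimal blocked-GEMM sketch (plain NumPy, not GPU code): in the naive triple loop every element of A is read n times and every element of B is read m times; tiling loads a small block once and reuses it many times, which is the same reuse pattern a systolic array bakes into hardware.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked GEMM sketch illustrating data reuse.

    Each (tile x tile) block is loaded once and then participates in
    ~tile multiply-adds per element before being evicted, instead of
    being re-fetched from far-away memory on every use.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=np.result_type(A, B))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # the same A and B blocks are reused across this inner update
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C
```

On a GPU the "fast memory" holding the block would be shared memory or the tensor core's register file; the point is just that GEMM rewards keeping operands close to the compute units.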
Also, a simpler view: on a GPU most of the silicon is spent on things other than tensor cores, so just from that you know it's not optimal, I guess.
Just pointing at a FLOP/s number doesn't mean much nowadays, with tensor cores and sparsity inflating the headline figures.
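A quick back-of-the-envelope on why headline numbers mislead. The figures below are made-up round numbers, not the spec of any real GPU; the point is only the multipliers: lower-precision tensor-core math and 2:4 structured sparsity each scale the quoted peak, even though a dense FP32 workload never sees any of it.

```python
# Hypothetical, illustrative figures only (not a real GPU's datasheet).
base_fp32 = 50e12              # "classic" FP32 peak via the regular cores
tensor_fp16 = 4 * base_fp32    # tensor cores at reduced precision
sparse_fp16 = 2 * tensor_fp16  # 2:4 structured sparsity doubles the quoted peak

# The marketing sheet may quote sparse_fp16 -- 8x the FP32 figure --
# while a dense FP32 kernel still tops out at base_fp32.
print(sparse_fp16 / base_fp32)  # ratio between headline and dense-FP32 peak
```

So two chips with the same headline FLOP/s can behave very differently depending on which peak that number actually refers to.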
In my eyes the big win of GPUs is that they're not only pretty good at GEMMs but also really good at a lot of other easily parallelizable tasks, PLUS they're comparatively easy to program ^^
It's an issue I'm seeing even for comments touching on algorithmic stuff. To take a fairly common example: in a credit card payment flow, where would you put the explanation of how a transaction moves through a few states asynchronously, each of which triggers a webhook callback?
Obviously the people working on the code need to be aware of that, so the documentation has to live somewhere. I've seen people put whole blocks in class headers, others sprinkle it throughout the code; personally I ended up moving it outside of the code entirely. Where would you put it?
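One hedged option for the question above, as a sketch: put the cross-cutting lifecycle explanation in a single module-level docstring next to the state type itself, so the story lives with the one thing every handler imports, and keep the legal transitions in a table rather than scattered comments. All names here (states, transitions) are hypothetical, not from any real payment API.

```python
"""Payment transaction lifecycle (hypothetical example).

Transitions happen asynchronously, and each one fires a webhook:

    CREATED -> AUTHORIZED -> CAPTURED -> SETTLED
    (CREATED or AUTHORIZED may instead end in DECLINED)

Handlers should consult TRANSITIONS instead of re-stating this flow.
"""
from enum import Enum

class TransactionState(Enum):
    CREATED = "created"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    DECLINED = "declined"

# The whole state machine in one place; individual webhook handlers can
# stay comment-free and just check this table.
TRANSITIONS = {
    TransactionState.CREATED: {TransactionState.AUTHORIZED, TransactionState.DECLINED},
    TransactionState.AUTHORIZED: {TransactionState.CAPTURED, TransactionState.DECLINED},
    TransactionState.CAPTURED: {TransactionState.SETTLED},
}

def can_transition(src: TransactionState, dst: TransactionState) -> bool:
    """Return True if the lifecycle allows moving from src to dst."""
    return dst in TRANSITIONS.get(src, set())
```

The design choice is that the prose and the data structure sit side by side, so a change to the flow forces you past the explanation that describes it.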