Interesting article, thanks, IMHO mostly for the low level performance analysis.
When it comes to actual computation of convolutions, the fast Fourier transform should at least be mentioned, even if in passing. Early in grad school I peeked at the source for R's density() function, and was blown away that it was using the FFT, and that I had not picked up that trick in my math classes (or maybe I had just forgotten it...)
These sorts of computations generally just get fed bigger inputs as compute gets better.
Also, plenty of Threadrippers already exist out there; if you get access to a cluster, it might have any type of chip in it. If I have access to a cluster with many 7995s, I don't really care too much about what's available on the consumer side.
Also checked, and apparently NVIDIA CUTLASS now supports generic convolutions: https://github.com/NVIDIA/cutlass
For a 2d example:
https://stackoverflow.com/questions/50453981/implement-2d-co...
And a recent HN thread that was very good:
https://news.ycombinator.com/item?id=40840396
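The FFT trick works because convolution in the time/space domain becomes pointwise multiplication in the frequency domain, turning an O(n*m) operation into O(n log n). A minimal 1D sketch with NumPy (the function name and test inputs here are just illustrative):

```python
import numpy as np

def fft_convolve(a, b):
    # Linear convolution via FFT: zero-pad both signals to the full
    # output length so the circular convolution matches the linear one,
    # multiply the spectra pointwise, then transform back.
    n = len(a) + len(b) - 1
    A = np.fft.rfft(a, n)  # rfft exploits real-valued input
    B = np.fft.rfft(b, n)
    return np.fft.irfft(A * B, n)

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.0, 1.0, 0.5])
print(fft_convolve(a, b))  # agrees with np.convolve(a, b) up to rounding
```

For 2D (images, density grids), the same idea applies with np.fft.rfft2/irfft2 and padding in both dimensions; scipy.signal.fftconvolve packages this up, and automatically falls back to direct convolution when the kernel is small enough that the FFT overhead isn't worth it.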
XDNA 2 will have 12 TFLOPS, roughly matching the 96-core Threadripper Pro 7995WX at a much lower price point.