Two years ago I bet that matmul would run on transformer-optimized hardware costing a fraction of GPUs, with first-class support in torch and no reason to use GPUs any more. Wrong.
well, related: fractional GPUs that multiplex workloads for better aggregate utilization have been a topic for some time, with no definitive solution from NVIDIA: https://vhpc.org
I believe this is mainly because everything in ML/AI optimizes for CUDA; even AMD cards (architecturally quite similar to Nvidia's) can't compete due to the lack of proper CUDA support.
this was/is the chip opportunity of the century: hardware even more specialized than Nvidia's still-general-purpose cards. And no, matrix multiplication has been abstracted away for decades; you don't need CUDA for it. Such a chip would likely be much, much easier to build than a mixed-signal design, with something like the Apple C1 sitting at the nightmare end of that spectrum by comparison.
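To illustrate the abstraction point: application and framework code calls a generic matmul and never touches the backend directly, so a dedicated matmul chip only needs to slot in behind that interface. A minimal sketch using NumPy (the same idea applies to torch's backend dispatch):

```python
import numpy as np

# This exact code runs unchanged whether NumPy is linked against
# OpenBLAS, MKL, or Apple's Accelerate: the hardware sits behind the
# BLAS GEMM interface, which predates CUDA by decades. A matmul ASIC
# would plug in at the same layer.
a = np.random.rand(4, 8)
b = np.random.rand(8, 3)
c = a @ b          # dispatches to whichever backend is installed
print(c.shape)     # (4, 3)
```

The caller never names the backend; swapping the execution hardware is a linking/dispatch decision, not a source-code change.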