I have used Mojo quite a bit. It’s fantastic and lives up to every claim it makes. When the compiler becomes open source I fully expect it to really start taking off for data science.
Modular also has a paid platform for serving models called MAX. I haven't used it, but I've heard good things.
TLDR: In order to get good performance you need to use vendor-specific extensions that result in the same lock-in Modular has been claiming they will enable you to avoid.
Correct. There is too much architectural divergence between GPU vendors. If they really wanted to avoid vendor-specific extensions in user-level code, they would have gone with something loosely inspired by tinygrad (which isn't ready yet).
Basically, you need a good description of the hardware, and the compiler automatically generates a state-of-the-art GEMM kernel from it.
Maybe it's 20% worse than Nvidia's handwritten kernels, but you can switch hardware vendors or build arbitrary fused kernels at will.
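To make the "describe the hardware, derive the kernel" idea concrete, here is a toy sketch in plain Python. It is not how tinygrad or Mojo work internally. `HardwareDesc`, `pick_tile`, and `tiled_matmul` are hypothetical names, and the only hardware property modeled is an assumed scratchpad size, from which a GEMM tile size is derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareDesc:
    """Hypothetical hardware description; the fields are illustrative only."""
    name: str
    scratchpad_bytes: int  # fast on-chip memory available to one work group
    elem_bytes: int = 4    # size of one matrix element


def pick_tile(hw: HardwareDesc, max_tile: int = 256) -> int:
    """Largest power-of-two tile T such that one A tile plus one B tile
    (2 * T * T elements) fits in the scratchpad."""
    tile = 1
    while (tile * 2 <= max_tile
           and 2 * (tile * 2) ** 2 * hw.elem_bytes <= hw.scratchpad_bytes):
        tile *= 2
    return tile


def tiled_matmul(a, b, tile):
    """Blocked matrix multiply over lists of lists; the loop nest is the
    same for every target, only `tile` changes."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        aik = a[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += aik * b[kk][j]
    return c


# Two made-up targets: the kernel source stays the same, the schedule differs.
small = HardwareDesc("small_chip", scratchpad_bytes=16 * 1024)
large = HardwareDesc("large_chip", scratchpad_bytes=48 * 1024)
```

A real compiler would have to model far more than this (tensor core shapes, register pressure, memory layouts), which is where the hard part lives. The point here is only that the user-level code carries no vendor-specific calls.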
Not OP, but I think this could be an instance of a leaky abstraction at work. Most of the time you hand-write an accelerator kernel hoping to optimize for runtime performance. If the abstraction/compiler does not fully insulate you from micro-architectural details that affect performance in non-trivial ways (e.g. memory bank conflicts, as mentioned in the article), then you still end up with per-vendor implementations, or compile-time if-else blocks all over the place. This is less than ideal, but still arguably better than working with separate vendor APIs or, worse, completely separate toolchains.
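As a rough illustration of how a detail like bank conflicts leaks into otherwise portable code, here is a small Python model. It assumes word-interleaved banks and one word read per thread. The target names and their bank/thread counts in `TARGETS` are hypothetical placeholders, not real vendor specs.

```python
def conflict_degree(stride_words, num_banks, num_threads):
    """Worst-case number of threads hitting the same bank when thread t
    reads word t * stride_words (a column walk through a row-major tile)."""
    hits = [0] * num_banks
    for t in range(num_threads):
        hits[(t * stride_words) % num_banks] += 1
    return max(hits)


def min_padding(cols, num_banks, num_threads):
    """Smallest number of padding words per row that makes the column
    walk conflict-free under this simple model."""
    for pad in range(num_banks + 1):
        if conflict_degree(cols + pad, num_banks, num_threads) == 1:
            return pad
    raise ValueError("no conflict-free padding found")


# Hypothetical per-target parameters: this table is the "if-else block",
# just gathered into one place instead of scattered through the kernel.
TARGETS = {
    "vendor_a": {"num_banks": 32, "num_threads": 32},
    "vendor_b": {"num_banks": 64, "num_threads": 64},
}


def padded_row_stride(target, cols):
    """Row stride (in words) the shared kernel should use on `target`."""
    p = TARGETS[target]
    return cols + min_padding(cols, p["num_banks"], p["num_threads"])
```

With 32 banks, a 32-word row stride puts all 32 threads on the same bank, while padding each row to 33 words spreads them across all banks. The kernel logic is shared, but the tuning knob still has to be derived per target.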
The blog post is about using an NVIDIA-specific tensor core API that they have built to get good performance.
Modular has been pushing the notion that they are building technology that allows writing hardware-vendor-neutral solutions, so that users can break free of NVIDIA's hold on high-performance kernels.
From their own writing:
> We want a unified, programmable system (one small binary!) that can scale across architectures from multiple vendors—while providing industry-leading performance on the most widely used GPUs (and CPUs).
There seem to be enthusiasts who have experimented a bit and like what they see, but I haven't seen much else.