This is not really a fair statement. Literally all software bears the weight of some early poor choice that then keeps moving forward through sheer momentum. FASTA and FASTQ formats are exceptionally dumb though.
I have not used Triton/CuTe/CUTLASS though, so I can't really compare against anything other than CUDA.
Modular has been pushing the notion that they are building technology for writing hardware-vendor-neutral solutions, so that users can break free of NVIDIA's hold on high-performance kernels.
From their own writing:
> We want a unified, programmable system (one small binary!) that can scale across architectures from multiple vendors—while providing industry-leading performance on the most widely used GPUs (and CPUs).
So you can support either vendor with performance as good as the vendor's own libraries. That’s not lock-in, to me at least.
It’s not as good as the compiler being able to just magically produce optimized kernels for arbitrary hardware though, fully agree there. But it’s a big step forward from CUDA/HIP.
Modular also has its paid platform for serving models, called MAX. I’ve not used it, but I've heard good things.