I think people care too much about trying to innovate on model architecture. A model is meant to create a compressed representation of its training data, so even if you came up with a more efficient compression, the capabilities of the model wouldn't be any better. What matters more is finding more efficient ways of training, like the current shift toward reinforcement learning.
But isn't the maximum training efficiency naturally tied to the architecture? Meaning a different architecture has a different training-efficiency landscape? I've said it elsewhere: it is not about "caring too much about new model architectures" but about striking a balance between exploitation and exploration.
It's like if someone invented the hamburger and every single food outlet decided to serve only hamburgers from that point on, spending all their time and money on making the perfect hamburger rather than on making great meals. That sounds ludicrously far-fetched, but it is exactly what happened here.