1.44e6 tokens/sec * 37e9 bytes/token / 3.3e12 bytes/sec/GPU = ~16,000 GPUs
And that's already assuming 1 byte per parameter (i.e., FP8 weights), which is the more likely case.
So the article is off by a factor of at least 1,000. I didn't check the rest of the math, but that probably affects their conclusions...
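The arithmetic above can be sketched directly. This treats decode as purely memory-bandwidth-bound, i.e., every generated token requires streaming all active parameters from HBM once; the 3.3e12 bytes/sec figure is assumed to be per-GPU memory bandwidth:

```python
# Back-of-envelope from the numbers above (illustrative only).
tokens_per_sec = 1.44e6    # claimed aggregate throughput
bytes_per_token = 37e9     # 37B active params * 1 byte/param (FP8)
hbm_bandwidth = 3.3e12     # bytes/sec per GPU (assumed)

# If each token needs a full pass over the active weights,
# the required aggregate bandwidth divided by per-GPU bandwidth
# gives a lower bound on GPU count.
gpus = tokens_per_sec * bytes_per_token / hbm_bandwidth
print(round(gpus))  # ~16145, i.e. the "~16,000 GPUs" above
```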
You're doing the calculation as if those were output tokens generated at batch size 1; that assumption doesn't hold even in the decode phase.
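The objection can be made concrete: with batching, one pass over the weights produces one token for every sequence in the batch, so the per-token weight-read cost divides by the batch size. A minimal sketch, reusing the same assumed numbers (and ignoring KV-cache reads, which do not amortize this way):

```python
# Same assumed figures as the calculation above.
tokens_per_sec = 1.44e6
weight_bytes = 37e9        # full weight read per decode step
hbm_bandwidth = 3.3e12     # bytes/sec per GPU (assumed)

# At batch size B, one weight read serves B tokens per step,
# so the effective bytes/token shrinks by a factor of B.
for batch in (1, 32, 256):
    gpus = tokens_per_sec * (weight_bytes / batch) / hbm_bandwidth
    print(f"batch={batch}: ~{gpus:.0f} GPUs")
```

At realistic serving batch sizes the weight-bandwidth requirement drops by orders of magnitude, which is the point of the reply.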