Why wouldn't you factor in training? It is not like you can train once and then have the model run for years. You need to constantly improve to keep up with the competition. The lifespan of a model is just a few months at this point.
I spoke with management at a couple of companies that were training models, and some of them expensed the model training in-period as R&D. That's why.
For others, I think the picture is different. When we ran benchmarks on DeepSeek-R1 on 8x H200 SXM using vLLM, we got up to 12K total tok/s (concurrency 200, input:output ratio of 6:1). If you're spiking up to 100-200K tok/s, you need a lot of GPUs to cover that peak, and then those GPUs sit idle most of the time.
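As a rough back-of-envelope of what that sizing looks like: the only measured number below is the ~12K tok/s per 8x H200 node from our benchmark; the peak and average traffic figures are made-up placeholders.

```python
import math

# Back-of-envelope GPU sizing for peaky traffic.
# Only PER_NODE_TOKS comes from our benchmark; the traffic numbers are invented.
PER_NODE_TOKS = 12_000       # total tok/s measured on one 8x H200 node (vLLM, DeepSeek-R1)
GPUS_PER_NODE = 8

peak_toks = 200_000          # hypothetical peak demand, tok/s
avg_toks = 40_000            # hypothetical average demand, tok/s

nodes = math.ceil(peak_toks / PER_NODE_TOKS)       # capacity has to cover the peak
gpus = nodes * GPUS_PER_NODE
utilization = avg_toks / (nodes * PER_NODE_TOKS)   # how busy the fleet is on average

print(f"Nodes needed for peak: {nodes} ({gpus} GPUs)")
print(f"Average utilization:   {utilization:.0%}")
# -> 17 nodes (136 GPUs), ~20% average utilization with these placeholder numbers
```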
I'll read the blog post in more detail, but I don't think the following assumptions hold outside of AI labs (quick sensitivity check after the list):
* 100% utilization (no spikes, balanced usage between day/night or weekdays)
* Input processing is free (~$0.001 per million tokens)
* DeepSeek fits into H100 cards in a way that network isn't the bottleneck
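To make the utilization point concrete, here's a toy cost-per-token calculation. The hourly GPU price, fleet size, and throughput are illustrative placeholders (reusing the sketch above), not figures from the post.

```python
# Toy sensitivity check: cost per generated token vs. average utilization.
# All numbers are illustrative placeholders, not figures from the blog post.
GPU_HOUR_COST = 2.50           # assumed $/GPU-hour for an H100/H200-class card
GPUS = 136                     # fleet sized for the hypothetical peak above
FLEET_TOKS = 17 * 12_000       # fleet capacity at full load, tok/s

def cost_per_million_tokens(utilization: float) -> float:
    """Dollar cost per 1M generated tokens at a given average utilization."""
    tokens_per_hour = FLEET_TOKS * utilization * 3600
    fleet_cost_per_hour = GPUS * GPU_HOUR_COST
    return fleet_cost_per_hour / tokens_per_hour * 1_000_000

for u in (1.0, 0.5, 0.2):
    print(f"utilization {u:.0%}: ${cost_per_million_tokens(u):.2f} per 1M tokens")
# At 20% utilization the same fleet is ~5x more expensive per token than at 100%.
```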