When you are behind, the optimal play is to gamble, which will most likely leave you even worse off. From the naive winning side it looks like the loser is simply being stupid, not following the optimal dichotomy strategy, and that this is why they are losing. But in fact they are a "player" doing not only their best, but the best that can be done.
The infinite sum of ever smaller probabilities, like in Zeno's paradox, converges towards a finite value. The inevitable conclusion is that a large fraction of the time you are playing catch-up and will never escape.
You are losing while playing optimally, slowly realising from the probabilities that you are a loser, as evidenced by the score, which will most likely drop even further next round. Most likely the entire future is an endless sequence of ever more desperate-looking losing bets, hoping for the lucky strike that will most likely never come.
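To make that convergence concrete, here is a minimal sketch with assumed toy numbers (a biased even-money game won with probability p = 0.45; the figure is illustrative, not taken from anywhere above). The classic gambler's-ruin result says the chance of ever erasing a k-point deficit is (p/(1-p))^k, and these chances shrink geometrically with the size of the hole:

```python
# Toy sketch (assumed numbers): chance of ever erasing a k-point
# deficit in a biased even-money game where each round is won
# with probability p < 1/2.
p = 0.45
q = 1 - p
for k in (1, 2, 5, 10):
    recover = (p / q) ** k          # standard gambler's-ruin formula
    print(f"deficit {k:2d}: P(ever catching up) = {recover:.3f}")
# The terms shrink geometrically: the deeper the hole, the more of
# the future is spent playing catch-up in vain.
```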
In economics such things are called "traps"; the poverty trap, for example, exhibits similar mechanics: even though you display incredible ingenuity by playing the optimal game strategy, most of the time you will never escape, and you will need to take ever more desperate measures in the future. That's separating the wheat from the chaff, seen from the chaff's perspective, or how you make good villains: because, like Bane in Batman, there are some times (the probability is slim but finite) where the gamble pays off and you escape the hell hole you were born in and become a legend.
If you don't play this optimal strategy you will lose more slowly but even more surely. The optimal strategy is to bet just enough to go from your current situation to the winning side. It's also important not to overshoot: this is not always taking moonshots, but betting just enough to escape the hole, because once out, the probabilities play in your favor.
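Here is a minimal Monte Carlo sketch of that claim under assumed toy numbers (a subfair even-money coin with p = 0.45, a bankroll of 10 aiming for 20; none of these figures come from the text). It compares betting exactly the gap to the target ("just enough, no overshoot") with grinding minimum bets:

```python
import random

def play(bankroll, target, p, bet_rule, rng):
    """One session of a subfair even-money game, until broke or at target."""
    while 0 < bankroll < target:
        bet = bet_rule(bankroll, target)
        if rng.random() < p:
            bankroll += bet
        else:
            bankroll -= bet
    return bankroll >= target

def bold(bankroll, target):
    # "Bet just enough": stake exactly what is missing to reach the
    # target, capped by what you actually have -- no overshoot.
    return min(bankroll, target - bankroll)

def timid(bankroll, target):
    # Grind with the minimum stake.
    return 1

rng = random.Random(0)
p, start, target, trials = 0.45, 10, 20, 50_000    # assumed toy parameters
for name, rule in (("bold", bold), ("timid", timid)):
    wins = sum(play(start, target, p, rule, rng) for _ in range(trials))
    print(f"{name:5s}: escaped in {wins / trials:.1%} of sessions")
```

With these toy numbers the bold rule escapes roughly 45% of the time (it comes down to a single decisive flip), while unit bets grind down to roughly a 12% escape probability: losing more slowly, but more surely.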
2024 worldwide memory production was (according to OpenAI/ChatGPT) about 120 billion gigabytes. With 8 billion humans, that's about 15 GB per person.
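Just restating that division explicitly (the production figure is as quoted above, unverified):

```python
production_gb = 120e9   # 2024 memory production in GB, figure as quoted above (unverified)
people        = 8e9
print(f"~{production_gb / people:.0f} GB per person")   # ~15 GB
```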
For training, the models need a certain amount of memory to store the parameters, and this memory is touched for every example of every iteration. Big models have 10^12 (>1T) parameters, and with typical values of 10^3 examples per batch and 10^6 iterations, they need ~10^21 memory accesses per run. And they want to do multiple runs.
DDR5 RAM bandwidth is about 100 GB/s = 10^11 B/s; graphics RAM (HBM) reaches about 1 TB/s = 10^12 B/s. By buying the wafers, they get to choose which types of memory get produced.
10^21 / 10^12 = 10^9 s ≈ 30 years of memory access, just to update the model weights; you also need to add a factor of 10^1-10^3 to account for the memory accesses needed for the model computation.
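Putting the last few paragraphs' round figures into one small script (all powers of ten are the text's own estimates, not measurements):

```python
params     = 1e12    # ~1T parameters
batch      = 1e3     # examples per batch
iterations = 1e6     # optimizer steps per run
accesses   = params * batch * iterations     # ~1e21 memory touches per run

ddr5_bw = 1e11       # DDR5: ~100 GB/s
hbm_bw  = 1e12       # HBM:  ~1 TB/s

print(f"memory accesses per run: {accesses:.0e}")
for name, bw in (("DDR5", ddr5_bw), ("HBM", hbm_bw)):
    seconds = accesses / bw
    print(f"{name}: serial wall-clock {seconds:.0e} s "
          f"≈ {seconds / 3.15e7:.0f} years")
# On top of this, a further 10x-1000x for the compute-side memory traffic.
```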
But the good news is that it parallelizes extremely well. If you replicate your 1T parameters 10^3 times, your run time is brought down to 10^6 s ≈ 12 days. But you then need 10^3 × 10^12 = 10^15 bytes of RAM per run for the weight updates, and up to 10^18 bytes for the computation (the 120 billion gigabytes above is about 10^20 bytes, so not so far off).
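And the same sketch extended with the parallelization numbers (again treating the powers of ten above as given):

```python
accesses  = 1e21                      # from the estimate above
hbm_bw    = 1e12                      # ~1 TB/s per HBM stack
copies    = 1e3                       # data-parallel replicas of the weights

seconds      = accesses / hbm_bw / copies    # ~1e6 s
ram_weights  = copies * 1e12                 # bytes just to hold the replicated weights
ram_compute  = ram_weights * 1e3             # with the compute-side factor, ~1e18 B
world_2024   = 120e9 * 1e9                   # 120 billion GB ≈ 1.2e20 B produced in 2024

print(f"run time: {seconds:.0e} s ≈ {seconds / 86400:.0f} days")
print(f"RAM for weights: {ram_weights:.0e} B; with compute factor: {ram_compute:.0e} B")
print(f"share of one year's production: {ram_compute / world_2024:.1%}")
```

Under these round numbers, a single maximally parallel run eats on the order of a percent of a year's worldwide memory output.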
Are all these memory accesses technically required? No, if you use other algorithms; but more compute and memory is better if money is not a problem.
Is it strategically good to deprive your competitors of access to memory? In a very short-sighted way, yes.
It's a textbook cornering of the computing market to prevent the emergence of local models, because customers won't be able to buy the minimal RAM necessary to run the models locally, even just for inference (not training). Basically a war on people, where little Timmy won't be able to get a RAM stick to play computer games at Xmas.