> We can ask a question: how long (in nanoseconds) does it take to access a type of memory of which an average laptop has N bytes? Here's GPT's answer:
"Here's what GPT says" is not an empirical argument. If you can't do better than that (run a benchmark, cite some literature), why should I bother to read what you wrote?
Thanks for the good links. I think we have generally become so accustomed to the scaled-up von Neumann strategy that we don't see how much efficiency and performance we leave on the table by not building much smaller memory hierarchies.
Why would it not run with your provided hypothesis? It even added an explicit hint that this is a hypothetical scaling exercise, and that real hardware does not scale like that.
But generally, sure, you can make LLMs say many false things, sometimes even by just asking them a question in good faith, and it certainly casts some doubt on a blog post quoting an LLM as a source.
Though at the limit, it would have to be at least Ω(sqrt(n)) thanks to the Bekenstein bound [0]. And of course, as mentioned in TFA, you can always do better if you can get away with local random access in parallel, rather than global random access.
If one wants to start talking cosmology, it's unlikely to be the case that arbitrarily long-lived computers are possible; I don't think any of the theories in [0] are conducive to either an infinite-time or infinite-memory computer, so the strict mathematical definition of Big-O doesn't hold up. IMO it's better to use Big-O as an effective theory for predicting runtime on human-scale computers than to take the mathematical formalism too literally.
That doesn't apply for the Bekenstein Bound though.
Literally the first line of the wikipedia article:
> In physics, the Bekenstein bound (named after Jacob Bekenstein) is an upper limit on the thermodynamic entropy S, or Shannon entropy H, that can be contained within a given *finite* region of space which has a *finite* amount of energy—or equivalently, the maximum amount of information that is required to perfectly describe a given physical system down to the quantum level.
Once you're talking about computers that verge on being black holes, I think you're making too many assumptions about how everything works to give it a speed rating.
Ultimately I think we will go to full NUMA like Sun and others tried. Instead of having L4 and then L5 caches, each core simply has 4GB of local working memory and you use programming languages that are ready for this. Erlang would be easy, and I think Rust has the semantics to make it work but it might take years of compiler advancement to make it efficient.
All shared state is communicated through a shared memory pool that is either accessed directly through segregated address ranges or via DMA.
NUMA has a huge amount of overhead (e.g. in terms of inter-core latency), and NUMA server CPUs cost a lot more than single-socket boards. If you look at the servers at Google or Facebook, they will have some NUMA servers for certain workloads that actually need them, but most servers will be single socket because they're cheaper and applications literally run faster on them. It's a win-win if you can fit your workload on a single-socket server, so there is a lot of motivation to make applications work in a non-NUMA way if at all possible.
I disagree with this model because it assumes processing occurs at a point and memory is (optimally) distributed across space around it in every direction, analogous to a von Neumann CPU architecture. However, it is entirely possible to distribute compute with memory. For example, Samsung has a technology called PIM (Processing in Memory) where simple compute units are inserted inside HBM memory layers. Algorithms that can take advantage of this run much faster and at much lower power because they skip the bus entirely. More importantly, the compute scales in proportion to the memory size/space.
The article says exactly this in bold at the bottom:
> If you can break up a task into many parts, each of which is highly local, then memory access in each part will be O(1). GPUs are already often very good at getting precisely these kinds of efficiencies. But if the task requires a lot of memory interdependencies, then you will get lots of O(N^⅓) terms. An open problem is coming up with mathematical models of computation that are simple but do a good job of capturing these nuances.
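A toy illustration of that difference, as a sketch (the workload, block size and array sizes are arbitrary assumptions, not anything from the article): sum values at 16M random positions of a 256 MB array, either in the given scattered order, or after first partitioning the indices by 256 KB block so consecutive accesses stay within a cache-sized slice. Compile with cc -O2 locality.c:

    /* locality.c — scattered vs block-partitioned random access */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 26)        /* 64M uint32 values, ~256 MB */
    #define M (1u << 24)        /* 16M random accesses */
    #define BLOCK_BITS 16       /* 2^16 entries per block, ~256 KB */

    static double ms(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        unsigned *data = malloc((size_t)N * sizeof *data);
        unsigned *idx  = malloc((size_t)M * sizeof *idx);
        for (unsigned i = 0; i < N; i++) data[i] = i;
        for (unsigned i = 0; i < M; i++)
            idx[i] = (unsigned)((((unsigned long long)rand() << 16) ^ rand()) % N);

        struct timespec a, b;
        unsigned long long sum = 0;

        /* 1) global random access: most loads can miss all the way to DRAM */
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < M; i++) sum += data[idx[i]];
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("scattered: %.0f ms (sum=%llu)\n", ms(a, b), sum);

        /* 2) partition the indices by block (counting sort on the high bits),
         *    then access block by block; the partitioning cost is included */
        unsigned nblocks = N >> BLOCK_BITS;
        unsigned *count  = calloc(nblocks + 1, sizeof *count);
        unsigned *sorted = malloc((size_t)M * sizeof *sorted);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < M; i++) count[(idx[i] >> BLOCK_BITS) + 1]++;
        for (unsigned blk = 0; blk < nblocks; blk++) count[blk + 1] += count[blk];
        for (unsigned i = 0; i < M; i++) sorted[count[idx[i] >> BLOCK_BITS]++] = idx[i];
        sum = 0;
        for (unsigned i = 0; i < M; i++) sum += data[sorted[i]];
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("blocked:   %.0f ms (sum=%llu)\n", ms(a, b), sum);
        return 0;
    }

Same total work, same answer, but in the second version far fewer accesses pay the full "global" latency; this is essentially the trick behind radix-partitioned joins and well-written GPU kernels.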
> In computational complexity theory, and more specifically in the analysis of algorithms with integer data, the transdichotomous model is a variation of the random-access machine in which the machine word size is assumed to match the problem size.
> In my binary field code, I found that an 8-bit precomputation table (in that context, 2^24 items taking up 128 MB) led to faster computations than a 16-bit precomputation table (2^32 items taking up 8 GB): while the latter fit into RAM, the former could fit into cache, and the faster access time of the former was decisive.
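(Sanity-checking those figures, assuming roughly 8-byte and 2-byte entries respectively, which is what the quoted sizes imply:

    2^24 entries × 8 B = 134,217,728 B   ≈ 128 MiB
    2^32 entries × 2 B = 8,589,934,592 B = 8 GiB

so the numbers are consistent with the quoted 128 MB and 8 GB.)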
Interesting to see some first principles thinking to back out the “why”.
I don't think this holds up. Historically, memory sizes have increased exponentially, but access times have gotten faster, not slower. And since the access time comes from the memory architecture, you can get 8 GB of RAM or 64 GB of RAM with the same access times. The estimated values in the table are not an especially good fit (30-50% off) and get worse if you adjust the memory sizes.
Theoretically, it still doesn't hold up, at least not for the foreseeable future. PCBs and integrated circuits are basically two-dimensional. Access times are limited by things like trace lengths (at the board level) and parasitics (at the IC level), none of which are defined by volume.
Not true; otherwise you could in theory just build L1/L2/L3 caches holding 64 GB instead of 1-2 MB:
For SRAM in L1/L2/L3 for example you need to manufacture 6 transistors for 1 bit, while for DRAM you need 1 transistor and 1 capacitor.
Thus your chips at that high speed would become very big, and the propagation of signals through the wires would start to matter: at the semiconductor level it makes a difference whether a signal has to travel 1 inch or 10 inches billions of times per second, creating an "efficiency frontier" for how big your SRAM can get, as a function of chip size (and other factors like thermal effects).
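To put rough numbers on that (order-of-magnitude only):

    64 GB of SRAM = 64 × 2^30 × 8 bits ≈ 5.5 × 10^11 bits
                    × 6 transistors/bit ≈ 3.3 × 10^12 transistors
    64 GB of DRAM at 1 transistor + 1 capacitor per bit ≈ 5.5 × 10^11 transistors

For comparison, a large modern CPU die is somewhere around 10^10 to 10^11 transistors, so a 64 GB SRAM "cache" would dwarf the processor it serves, and the wire lengths (and access latency) would grow with it.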
Why didn’t computers have 128 terabytes of memory ten years ago? Because the access time would have been shit. You’re watching generation after generation of memory architectures compromise between access time and max capacity and drawing the wrong conclusions. If memory size were free we wouldn’t have to wait five years to get twice as much of it.
Memory access times have not significantly improved in many years.
Memory bandwidth has improved, but it hasn't kept up with memory size or with CPU speeds. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today, it's faster to recalculate.
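If you want to check that on your own machine, here is a rough sketch (results depend heavily on the table size relative to your caches, the compiler flags, and whether the compiler can vectorize the sin calls; compile with cc -O2 trig_lut.c -lm):

    /* trig_lut.c — lookup table vs recompute for sin(), rough sketch */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define PI 3.14159265358979323846
    #define TAB_BITS 22                 /* 4M-entry table, ~32 MB: bigger than most caches */
    #define TAB_SIZE (1u << TAB_BITS)
    #define ITERS    (1u << 24)

    static double ms(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        double *table = malloc(TAB_SIZE * sizeof *table);
        unsigned *arg = malloc(ITERS * sizeof *arg);
        for (unsigned i = 0; i < TAB_SIZE; i++)
            table[i] = sin(2.0 * PI * i / TAB_SIZE);
        for (unsigned i = 0; i < ITERS; i++)
            arg[i] = ((unsigned)rand() * 2654435761u) & (TAB_SIZE - 1);

        struct timespec a, b;
        double sum = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < ITERS; i++) sum += table[arg[i]];   /* table lookup */
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("lookup:    %.0f ms (sum=%f)\n", ms(a, b), sum);

        sum = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < ITERS; i++)                         /* recompute */
            sum += sin(2.0 * PI * arg[i] / TAB_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("recompute: %.0f ms (sum=%f)\n", ms(a, b), sum);

        free(table); free(arg);
        return 0;
    }

The interesting knob is TAB_BITS: shrink the table until it fits in L1/L2 and the lookup usually wins again, which is really the same point the rest of the thread is making.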
2D vs 3D is a legitimate point; I have seen this law written down as O(sqrt N) for that reason. However, there's a lot of layer stacking going on in memory chips these days (especially flash memory or HBM for GPUs), so it's partially 3D.
> Memory access times have not significantly improved in many years.
We could say that it has actually gotten worse, not better, if we put it into context. For example, 90ns latency coupled with a 3GHz core is "better" than 90ns latency coupled with a 5GHz core: in the latter, the CPU core ends up stalled for 450 cycles, while in the former it is only 270 cycles.
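(The conversion is just stall cycles ≈ latency × clock rate: 90 ns × 3 GHz = 270 cycles, 90 ns × 5 GHz = 450 cycles.)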
While in absolute terms memory access has gotten faster, in relative terms it is MUCH slower today, compared to CPU speeds.
A modern CPU can perform hundreds or even thousands of computations while waiting for a single word to be read from main memory - and you lose a few more orders of magnitude if the data has to come from an SSD. This used to be much closer to 1:1 on old machines, say in the Pentium 1-3 era or so.
And regardless of any speedup, the point remains as true today as it has always been: the more memory you want to access, the slower accessing it will be. Retrieving a word from a pool of 50PB will be much slower than retrieving a word from a pool of 1MB, for various fundamental reasons (even address resolution has an impact, even if we want to ignore physics).
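(Two quick numbers behind that: roughly 100 ns of DRAM latency on a 4 GHz core that can retire about 4 instructions per cycle is about 1,600 lost instruction slots; and even the addressing alone grows with pool size: 1 MB needs 20 address bits, while 50 PB needs about 56, with decode/multiplexing depth growing along with it.)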
https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1/3_fit.... "The blue line is O(√N)."
This has been rehashed many times before, and the best blog post on this topic is here: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
Shameless plug here, where I explore possible gains in efficiency, performance and security by scaling out rather than up (no subscription): https://anderscj.substack.com/p/liberal-democracies-needs-a-...
Truly mind-boggling times where "here is the empirical proof" means "here is what chatGPT says" to some people.
I have a feeling that people who have such absolute trust in AI models have never hit regen and seen how much truth can vary.
The better question is: Why should you bother to read what the author didn't bother to write?
https://chatgpt.com/share/68e6eeba-8284-800e-b399-338e6c4783...
https://chatgpt.com/share/68e6ef4a-bdd0-800e-877a-b3d5d4dc51...
[0] https://en.wikipedia.org/wiki/Bekenstein_bound
[0] https://en.wikipedia.org/wiki/Ultimate_fate_of_the_universe?...
The higher-level the language, the less interest there is in managing memory manually. It is just something to offload to the GC/runtime/etc.
So I think this is a no-go. The market won't accept it.
Similar article on the same topic from 2014: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
Source: "What every Programmer should know about memory" https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
> PCBs and integrated circuits are basically two-dimensional.
Yes, which pushes the complexity to Ω(n^1/2); that fits the original claim.
> Access times are limited by things like trace lengths
Again, Ω(n^1/2)
> and parasitics
And those are Ω(n)
So, as you found, in practice it's even worse, but the bound from the article is still there.
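To spell out the geometric part of that, under the simple assumption of a fixed density of bits:

    2D: area   A ∝ N  →  worst-case distance ∝ A^(1/2) ∝ N^(1/2)  →  latency Ω(N^(1/2))
    3D: volume V ∝ N  →  worst-case distance ∝ V^(1/3) ∝ N^(1/3)  →  latency Ω(N^(1/3))

The article's N^(1/3) is the optimistic fully-3D case; mostly planar boards and dies sit closer to the N^(1/2) bound, and unrepeated RC wire delay is worse still.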