> We can ask a question: how long (in nanoseconds) does it take to access a type of memory of which an average laptop has N bytes? Here's GPT's answer:
"Here's what GPT says" is not an empirical argument. If you can't do better than that (run a benchmark, cite some literature), why should I bother to read what you wrote?
Thanks for the good links. I think we have generally become so accustomed to the scaled-up von Neumann strategy that we don't see how much efficiency and performance we leave on the table by not building much smaller memory hierarchies.
Why would it not run with your provided hypothesis? It even added an explicit hint that this is a hypothetical scaling exercise, and that real hardware does not scale like that.
But generally, sure, you can make LLMs say many false things, sometimes even by just asking them a question in good faith, and it certainly casts some doubt on a blog post quoting an LLM as a source.
Though at the limit, it would have to be at least Ω(sqrt(n)) thanks to the Bekenstein bound [0]. And of course, as mentioned in TFA, you can always do better if you can get away with local random access in parallel, rather than global random access.
If one wants to start talking cosmology, it's unlikely to be the case that arbitrarily long-lived computers are possible; I don't think any of the theories in [0] are conducive to either an infinite-time or infinite-memory computer, so the strict mathematical definition of Big-O doesn't hold up. IMO it's better to use Big-O as an effective theory for predicting runtime on human-scale computers than to take the mathematical formalism too literally.
That doesn't apply for the Bekenstein Bound though.
Literally the first line of the wikipedia article:
> In physics, the Bekenstein bound (named after Jacob Bekenstein) is an upper limit on the thermodynamic entropy S, or Shannon entropy H, that can be contained within a given *finite* region of space which has a *finite* amount of energy—or equivalently, the maximum amount of information that is required to perfectly describe a given physical system down to the quantum level.
Once you're talking about computers that verge on being black holes, I think you're making too many assumptions about how everything works to give it a speed rating.
Ultimately I think we will go to full NUMA like Sun and others tried. Instead of having L4 and then L5 caches, each core simply has 4GB of local working memory and you use programming languages that are ready for this. Erlang would be easy, and I think Rust has the semantics to make it work but it might take years of compiler advancement to make it efficient.
All shared state is communicated through a shared memory pool that is either accessed directly through segregated address ranges or via DMA.
NUMA has a huge amount of overhead (e.g. in terms of inter-core latency), and NUMA server CPUs cost a lot more than single-socket boards. If you look at the servers at Google or Facebook, they will have some NUMA servers for certain workloads that actually need them, but most servers will be single socket because they're cheaper and applications literally run faster on them. It's a win-win if you can fit your workload on a single-socket server, so there is a lot of motivation to make applications work in a non-NUMA way if at all possible.
I disagree with this model because it assumes processing occurs at a point and memory is (optimally) distributed across space around it in every direction, analogous to a von Neumann CPU architecture. However, it is entirely possible to distribute compute with memory. For example, Samsung has a technology called PIM (Processing in Memory) where simple compute units are inserted inside HBM memory layers. Algorithms that can take advantage of this run much faster and at much lower power because they skip the bus entirely. More importantly, the compute scales in proportion to the memory size/space.
The article says exactly this in bold at the bottom:
> If you can break up a task into many parts, each of which is highly local, then memory access in each part will be O(1). GPUs are already often very good at getting precisely these kinds of efficiencies. But if the task requires a lot of memory interdependencies, then you will get lots of O(N^⅓) terms. An open problem is coming up with mathematical models of computation that are simple but do a good job of capturing these nuances.
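A toy illustration of that difference, as a sketch (the workload, block size and array sizes are arbitrary assumptions, not anything from the article): sum values at 16M random positions of a 256 MB array, either in the given scattered order, or after first partitioning the indices by 256 KB block so consecutive accesses stay within a cache-sized slice. Compile with cc -O2 locality.c:

    /* locality.c — scattered vs block-partitioned random access */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 26)        /* 64M uint32 values, ~256 MB */
    #define M (1u << 24)        /* 16M random accesses */
    #define BLOCK_BITS 16       /* 2^16 entries per block, ~256 KB */

    static double ms(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        unsigned *data = malloc((size_t)N * sizeof *data);
        unsigned *idx  = malloc((size_t)M * sizeof *idx);
        for (unsigned i = 0; i < N; i++) data[i] = i;
        for (unsigned i = 0; i < M; i++)
            idx[i] = (unsigned)((((unsigned long long)rand() << 16) ^ rand()) % N);

        struct timespec a, b;
        unsigned long long sum = 0;

        /* 1) global random access: most loads can miss all the way to DRAM */
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < M; i++) sum += data[idx[i]];
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("scattered: %.0f ms (sum=%llu)\n", ms(a, b), sum);

        /* 2) partition the indices by block (counting sort on the high bits),
         *    then access block by block; the partitioning cost is included */
        unsigned nblocks = N >> BLOCK_BITS;
        unsigned *count  = calloc(nblocks + 1, sizeof *count);
        unsigned *sorted = malloc((size_t)M * sizeof *sorted);
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < M; i++) count[(idx[i] >> BLOCK_BITS) + 1]++;
        for (unsigned blk = 0; blk < nblocks; blk++) count[blk + 1] += count[blk];
        for (unsigned i = 0; i < M; i++) sorted[count[idx[i] >> BLOCK_BITS]++] = idx[i];
        sum = 0;
        for (unsigned i = 0; i < M; i++) sum += data[sorted[i]];
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("blocked:   %.0f ms (sum=%llu)\n", ms(a, b), sum);
        return 0;
    }

Same total work, same answer, but in the second version far fewer accesses pay the full "global" latency; this is essentially the trick behind radix-partitioned joins and well-written GPU kernels.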
> In computational complexity theory, and more specifically in the analysis of algorithms with integer data, the transdichotomous model is a variation of the random-access machine in which the machine word size is assumed to match the problem size.
> In my binary field code, I found that an 8-bit precomputation table (in that context, 2^24 items taking up 128 MB) led to faster computations than a 16-bit precomputation table (2^32 items taking up 8 GB): while the latter fit into RAM, the former could fit into cache, and the faster access time of the former was decisive.
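(Sanity-checking those figures, assuming roughly 8-byte and 2-byte entries respectively, which is what the quoted sizes imply:

    2^24 entries × 8 B = 134,217,728 B   ≈ 128 MiB
    2^32 entries × 2 B = 8,589,934,592 B = 8 GiB

so the numbers are consistent with the quoted 128 MB and 8 GB.)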
Interesting to see some first principles thinking to back out the “why”.
I don't think this holds up. Historically, memory sizes have increased exponentially, but access times have gotten faster, not slower. And since the access time comes from the memory architecture, you can get 8 GB of RAM or 64 GB of RAM with the same access times. The estimated values in the table are not an especially good fit (30-50% off) and get worse if you adjust the memory sizes.
Theoretically, it still doesn't hold up, at least not for the foreseeable future. PCBs and integrated circuits are basically two-dimensional. Access times are limited by things like trace lengths (at the board level) and parasitics (at the IC level), none of which are defined by volume.
Not true; otherwise you could in theory just build L1/L2/L3 caches holding 64 GB instead of 1-2 MB:
For SRAM in L1/L2/L3 for example you need to manufacture 6 transistors for 1 bit, while for DRAM you need 1 transistor and 1 capacitor.
Thus your chips at that high speed would become very big, and the propagation of signals through the wires would start to matter: at the semiconductor level it makes a difference whether a signal has to travel 1 inch or 10 inches billions of times per second, creating an "efficiency frontier" for how big your SRAM can get, as a function of chip size (and other factors like thermal effects).
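To put rough numbers on that (order-of-magnitude only):

    64 GB of SRAM = 64 × 2^30 × 8 bits ≈ 5.5 × 10^11 bits
                    × 6 transistors/bit ≈ 3.3 × 10^12 transistors
    64 GB of DRAM at 1 transistor + 1 capacitor per bit ≈ 5.5 × 10^11 transistors

For comparison, a large modern CPU die is somewhere around 10^10 to 10^11 transistors, so a 64 GB SRAM "cache" would dwarf the processor it serves, and the wire lengths (and access latency) would grow with it.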
Why didn’t computers have 128 terabytes of memory ten years ago? Because the access time would have been shit. You’re watching generation after generation of memory architectures compromise between access time and max capacity and drawing the wrong conclusions. If memory size were free we wouldn’t have to wait five years to get twice as much of it.
Memory access times have not significantly improved in many years.
Memory bandwidth has improved, but it hasn't kept up with memory size or with CPU speeds. When I was a kid you could get a speedup by using lookup tables for trig functions - you'd never do that today, it's faster to recalculate.
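If you want to check that on your own machine, here is a rough sketch (results depend heavily on the table size relative to your caches, the compiler flags, and whether the compiler can vectorize the sin calls; compile with cc -O2 trig_lut.c -lm):

    /* trig_lut.c — lookup table vs recompute for sin(), rough sketch */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define PI 3.14159265358979323846
    #define TAB_BITS 22                 /* 4M-entry table, ~32 MB: bigger than most caches */
    #define TAB_SIZE (1u << TAB_BITS)
    #define ITERS    (1u << 24)

    static double ms(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void) {
        double *table = malloc(TAB_SIZE * sizeof *table);
        unsigned *arg = malloc(ITERS * sizeof *arg);
        for (unsigned i = 0; i < TAB_SIZE; i++)
            table[i] = sin(2.0 * PI * i / TAB_SIZE);
        for (unsigned i = 0; i < ITERS; i++)
            arg[i] = ((unsigned)rand() * 2654435761u) & (TAB_SIZE - 1);

        struct timespec a, b;
        double sum = 0.0;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < ITERS; i++) sum += table[arg[i]];   /* table lookup */
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("lookup:    %.0f ms (sum=%f)\n", ms(a, b), sum);

        sum = 0.0;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (unsigned i = 0; i < ITERS; i++)                         /* recompute */
            sum += sin(2.0 * PI * arg[i] / TAB_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &b);
        printf("recompute: %.0f ms (sum=%f)\n", ms(a, b), sum);

        free(table); free(arg);
        return 0;
    }

The interesting knob is TAB_BITS: shrink the table until it fits in L1/L2 and the lookup usually wins again, which is really the same point the rest of the thread is making.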
2D vs 3D is a legitimate point; I have seen this law written down as O(sqrt N) for that reason. However, there's a lot of layer stacking going on in memory chips these days (especially flash memory or HBM for GPUs), so it's partially 3D.
> Memory access times have not significantly improved in many years.
We could say that it has actually gotten worse, not better, if we put it into context. For example, 90ns latency coupled with a 3GHz core is "better" than 90ns latency coupled with a 5GHz core: in the latter, the CPU core ends up stalled for 450 cycles, while in the former it is only 270 cycles.
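(The conversion is just stall cycles ≈ latency × clock rate: 90 ns × 3 GHz = 270 cycles, 90 ns × 5 GHz = 450 cycles.)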
While in absolute terms memory access has gotten faster, in relative terms it is MUCH slower today, compared to CPU speeds.
A modern CPU can perform hundreds or even thousands of computations while waiting for a single word to be read from main memory - and you lose a few more orders of magnitude if the data has to come from an SSD. This used to be much closer to 1:1 on old machines, say in the Pentium 1-3 era or so.
And regardless of any speedup, the point remains as true today as it has always been: the more memory you want to access, the slower accessing it will be. Retrieving a word from a pool of 50PB will be much slower than retrieving a word from a pool of 1MB, for various fundamental reasons (even address resolution has an impact, even if we want to ignore physics).
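(Two quick numbers behind that: roughly 100 ns of DRAM latency on a 4 GHz core that can retire about 4 instructions per cycle is about 1,600 lost instruction slots; and even the addressing alone grows with pool size: 1 MB needs 20 address bits, while 50 PB needs about 56, with decode/multiplexing depth growing along with it.)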
https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1/3_fit.... "The blue line is O(√N)."
This has been rehashed many times before, and the best blog post on this topic is here: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
Shameless plug here, where I explore possible gains in efficiency, performance and security by scaling out rather than up (no subscription): https://anderscj.substack.com/p/liberal-democracies-needs-a-...
Truly mind-boggling times where "here is the empirical proof" means "here is what chatGPT says" to some people.
I have a feeling that people who have such absolute trust in AI models have never hit regen and seen how much truth can vary.
The better question is: Why should you bother to read what the author didn't bother to write?
https://chatgpt.com/share/68e6eeba-8284-800e-b399-338e6c4783...
https://chatgpt.com/share/68e6ef4a-bdd0-800e-877a-b3d5d4dc51...
[0] https://en.wikipedia.org/wiki/Bekenstein_bound
[0] https://en.wikipedia.org/wiki/Ultimate_fate_of_the_universe?...
The higher-level the language, the less interest there is in managing memory manually. It is just something to offload to the GC/runtime/etc.
So I think this is a no-go. The market won't accept it.
Similar article on the same topic from 2014: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
Source: "What every Programmer should know about memory" https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
> PCBs and integrated circuits are basically two-dimensional.
Yes, which pushes the complexity to Ω(n^1/2); that fits the original claim.
> Access times are limited by things like trace lengths
Again, Ω(n^1/2)
> and parasitics
And those are Ω(n)
So, as you found, in practice it's even worse, but the bound from the article is still there.
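To spell out the geometric part of that, under the simple assumption of a fixed density of bits:

    2D: area   A ∝ N  →  worst-case distance ∝ A^(1/2) ∝ N^(1/2)  →  latency Ω(N^(1/2))
    3D: volume V ∝ N  →  worst-case distance ∝ V^(1/3) ∝ N^(1/3)  →  latency Ω(N^(1/3))

The article's N^(1/3) is the optimistic fully-3D case; mostly planar boards and dies sit closer to the N^(1/2) bound, and unrepeated RC wire delay is worse still.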