From the „Camel Book”, one of my favorite programming books (not because it was enlightening, but because it was entertaining); on the Perl philosophy:
“If you’re running out of memory, you can buy more. But if you’re running out of time, you’re screwed.”
This can work both ways. If the program needs more memory than the computer has, it can't run until you buy more. But if it takes twice as long, at least it runs at all.
Modern computers have so much memory it feels like it doesn't matter. Spending that memory on arrays for algorithms or things like a Garbage Collector just make sense. And, extra memory is worthless. You WANT the summation of all your programs to use all your memory. The processor, on the other hand, can context switch and do everything in it's power to make sure it stays busy.
The CPU is like an engine and memory is your gas tank. Idling the engine is bad, but leaving gas in the tank doesn't hurt, but it doesn't help either. I'm not gonna get to my destination faster because I have a full tank.
The Camel book was written when Moore’s Law was trucking along. These days you can’t buy much more time but you used to be able to just fine. Now it’s horizontal scaling. Which is still more time.
Reminds me of when I started working on storage systems as a young man and once suggested pre-computing every 4KB block once and just using pointers to the correct block as data is written, until someone pointed out that the number of unique 4KB blocks (2^32768) far exceeds the number of atoms in the universe.
It seems like you weren’t really that far off from implementing it, you just need a 4 KB pointer to point to the right block. And in fact, that is what all storage systems do!
Reminds me of when I imagined brute-forcing every possible small picture as simply 256 shades of gray for each pixel x (640 x 480 = 307200 pixels) = 78 million possible pictures.
Actually I don't have any intuition for why that's wrong, except that if we catenate the rows into one long row then the picture can be considered as a number 307200 digits long in base 256, and then I see that it could represent 256^307200 possible different values. Which is a lot: https://www.wolframalpha.com/input?i=256%5E307200
The other problem is that (if we take literally the absurd proposal of computing "every possible block" up front) you're not actually saving any space by doing this, since your "pointers" would be the same size as the blocks they point to.
In some contexts, dictionary encoding (which is what you're suggesting, approximately) can actually work great. For example common values or null values (which is a common type of common value). It's just less efficient to try to do it with /every/ block. You have to make it "worth it", which is a factor of the frequency of occurrence of the value. Shorter values give you a worse compression ratio on one hand, but on the other hand it's often likelier that you'll find it in the data so it makes up for it, to a point.
There are other similar lightweight encoding schemes like RLE and delta and frame of reference encoding which all are good for different data distributions.
The idea is not too far off. You could compute a hash on an existing data block. Store the hash and data block mapping. Now you can use the hash in anywhere that data block resides, i.e. any duplicate data blocks can use the same hash. That's how storage deduplication works in the nutshell.
The other problem is to address all possible 4098 byte blocks, you need a 4098 byte address. I suppose we would expect the actual number of blocks computed and reused to be a sparse subset.
Alternately, have you considered 8 byte blocks?
If your block pointers are 8-byte addresses, you don't need to count on block sparsity, in fact, you don't even need to have the actual blocks.
A pointer type, that implements self-read and writes, with null allocations and deletes, is easy to implement incredibly efficiently in any decent type system. A true zero-cost abstraction, if I have ever seen one!
(On a more serious note, a memory heap and CPU that cooperated to interpret pointers with the top bit set, as a 63-bit linear-access/write self-storage "pointer", is an interesting thought.
We know for a fact that when we disable the cache of the processors their performance plummets, so the question is how much of computation is brand new computation (never seen before)?
Using an LLM and caching eg FAQs can save a lot of token credits
AI is basically solving a search problem and the models are just approximations of the data - like linear regression or fourier transforms.
The training is basically your precalculation. The key is that it precalculates a model with billions of parameters, not overfitting with an exact random set of answers hehe
> if we were centrally storing all of the operations
Community-scale caching? That's basically what pre-compiled software distributions are. And one idea for addressing the programming language design balk "that would be a nice feature, but it's not known how to compile it efficiently, so you can't have it", is highly-parallel cloud compilation, paired with a community-scale compiler cache. You might not mind if something takes say a day to resolve, if the community only needs it run once per release.
https://conwaylife.com/wiki/HashLife is an algorithm for doing basically this in Conway’s Game of Life, which is Turing complete. I remember my first impression being complete confusion: here’s a tick-by-tick simulation too varied and complex to encapsulate in a formula, and you’re telling me I can just skip way into its future?
If I read that page correctly, it does this for areas with empty space between them?
Makes sense. Say you have a pattern (surrounded by empty space) that 'flickers': A-B-A-B-A... etc. Then as long as nothing intrudes, nth generation is the same pattern as in n+1000,000th generation. Similar for patterns that do a 3-cycle, 4-cycle etc.
All you'd need is a) a way to detect repeating patterns, and b) do some kind of collision detection between areas/patterns (there's a thing called 'lightspeed' in Life, that helps).
I think it is very intuitive that more space beats the pants off of more time.
In time O(n) you can use O(n) cells on a tape, but there are O(2^n) possible configurations of symbols on a tape of length n (for an alphabet with 2 symbols), so you can do so much more with n space than with n time.
My intuition: the value of a cell can represent the result of multiple (many) time units used to compute something. If you cannot store enough intermediate results, you may end up needing to recalculate the same / similar results over and over - at least in some algorithms. So one cell can represent the results of hundreds of time units, and being able to store / load that one value to re-use it later can then replace those same hundreds of time units. In effect, space can be used for "time compression" (like a compressed file) when the time is used to compute similar values multiple times.
If intermediate results are entirely uncorrelated, with no overlap in the work at all, that would not hold - space will not help you. Edit: This kind of problem is very rare. Think of a cache with 0 percent hit rate - almost never happens.
And you can't really do it the other way around (at least not in current computing terms / concepts): you cannot use a single unit of time as a standin / proxy for hundreds of cells, since we don't quite have infinitely-broad SIMD architectures.
There's many algorithms with a space vs time tradeoff. But what works best, depends a lot on the time/energy cost of re-computing something, vs the storage/bandwidth cost of caching results.
Limited bandwidth / 'expensive' storage, simple calculation (see: today's hyper-fast CPU+L1 cache combo's) → better to re-compute some things on the fly as needed.
I suspect there's a lot of existing software (components) out there designed along the "save CPU cycles, burn storage" path, where in modern reality a "save storage, CPU cycles are cheap" would be more effective. CPU speeds have grown way way faster than main memory bandwidth (or even size?) over the last decades.
For a datacenter, supercomputer, embedded system, PC or some end-user's phone, the metrics will be different. But same principle applies.
As I understand it, this is the essential insight behind dynamic programming algorithms; the whole idea is to exploit the redundancies in a recursive task by memoizing the partial results of lower order stages.
I think this theorem applies well for modern LLMs: large language model with pre-computed weights can be used to compute very complex algorithms that approximate human knowledge, that otherwise were impossible or would have required many orders more compute to calculate
Also, the O(1) random memory access assumption makes it easy to take memory for granted. Really it's something like O(n^(1/3)) when you're scaling the computer to the size of the problem, and you can see this in practice in datacenters.
I forget the name of the O(1) access model. Not UMA, something else.
O(n^(1/2)) really, since data centers are 2 dimensional, not 3 dimensional.
(Quite aside from the practical "we build on the surface of the earth" consideration, heat dissipation considerations limit you to a 2 dimensional circuit in 3-space.)
On the other hand, actual computers can work in parallel when you scale the hardware, something that the TM formulation doesn't cover. It can be interesting which algorithms work well with lots of computing power subject to data locality. (Brains being the classic example of this.)
I think that since many people find it intuitive that P != NP, and PSPACE sits way on top of polynomial hierarchy that it is intuitive even if it’s unproven.
The article is about a new proof wherein P == PSPACE.
Something we all intuitively expected but someone finally figured out an obscure way to prove it.
--------
This is a really roundabout article that takes a meandering path to a total bombshell in the field of complexity theory. Sorry for spoiling but uhhh, you'd expect an article about P == PSPACE would get to the point faster....
I think it really depends on the task at hand, and not that intuitive. At some point accessing the memory might be slower than repeating the computation, especially when the storage is slow.
One one hand yes, on the other there might be some problems that are inherently difficult to parallelize (alternating machine PTIME is the same as deterministic PSPACE) where space doesn't buy you much. The jump from paper from t/log t to sqrt(t log t) is huge, but it still might be that unbounded parallelism doesn't buy you much more.
I’m not sure what you mean here. If you’re in the realm of “more space” than you’re not thinking of the time it takes.
More precisely, I think it is intuitive that the class of problems that can be solved in any time given O(n) space is far larger than the class of problems that can be solved in any space given O(n) time.
Interesting. It's one of those things that I’ve always “just assumed,” without thinking about it.
I did a lot of raster graphics programming, in my career, and graphics work makes heavy use of lookup tables.
Yesterday, I posted a rather simple tool I wrote[0]: a server that “frontloads” a set of polygons into a database, and then uses them, at query time. It’s fairly fast (but I’m sure it could be a lot faster). I wrote it in a few hours, and got pretty good performance, right out of the starting gate.
Pretty basic stuff. I doubt the pattern is unique, but it’s one that I’ve used for ages. It’s where I try to do as much “upfront” work as possible, and store “half-baked” results into memory.
Like I said, I always “just worked that way,” and never really thought about it. There’s been a lot of “rule of thumb” stuff in my work. Didn’t have an MIT professor to teach it, but it’s sort of gratifying to see that it wasn’t just “old wives” stuff.
There’s probably a ton of stuff that we do, every day, that we don’t think about. Some of it almost certainly originated from really smart folks like him, finding the best way (like the “winding number” algorithm, in that server[1]), and some of it also probably comes from “grug-brained programmers,” simply doing what makes sense.
I'm in the depths of optimization on a game right now, and it's interesting how the gains I'm making currently all seem to be a matter of scaling the concept of lookup tables, and using the right tool for the job.
What I mean is that traditionally I think peoples' ideas of lookup tables are things like statically baked arrays setup at compile time, or even first thing at runtime, and they never change. But if you loosen your adherence to that last idea a bit, where a lookup table can change slightly over time, you can get a ton of mileage out of a comparatively small amount of memory compared to wasting cycles every frame.
As for the right tool for the job, I've read tons of dev logs and research papers over the years about moving work to the GPU, but this last few months stint of ripping my game inside out has really made me see the light. It's not just lookup tables built at compile or early runtime, but lookup tables modified slightly over time, and sent to the GPU as textures and used there.
Follow this train of thought long enough, and now we're just calling memory writes and reads "lookup tables" when they aren't really that anymore, but whos to say where the barrier really lies?
To some extent it is required that all serious work by a computer be the kind of repititious thing that can be at least partially addressed by a lookup table.
If you take the example of a game, drawing sprites say. Drawing a single preloaded sprite of reasonable size is always cheap, so a slow frame must have an excessive number. It's very hard to construct a practical scenario of a large number of truly distinct sprites though. A level has a finite tile palette, a finite cast of characters, abilities, etc. It's hard to logistically get them all into a scene together, and even then it won't be that many. So the only scenario left where sprite drawing will be slow is drawing the same handful of sprites over and over again. By contrast that's super common: just spam a persistent projectile, tap the analog stick to generate dust particles, etc.
Without getting too much into detail (because the people I worked for were really paranoid, and I don't want to give them agita), we used to build lookup tables "on the fly," sometimes, deep inside iterators.
For example, each block of pixels might have some calculated characteristics that were accessed by a hash into a LUT, but the characteristics would change, as we went through the image.
We'd do a "triage" run, where we'd build the LUT, then a "detailed" run, where we'd apply the LUT to the pixels.
I am confused. If a single-tape turing machine receives a digit N in binary, and is supposed to write N ones on the tape, on the right side of the digit N, it performs N steps.
If you expect N ones at the output, how can this machine be simulated in the space smaller than N?
This machine must decrement the digit N at the beginning of the tape, and move to the end of the tape to write "1", so it runs in time O(N^2), not O(N)? (as it takes N "trips" to the end of the tape, and each "trip" takes 1, 2, 3 .. N steps)
Since turing machines can not jump to any place on a tape in constant time (like computers can), does it have any impact on real computers?
This paper looks exclusively at decision problems, i.e. problems where the output is a single bit.
EDIT: This makes sense because if you look at all problems with N outputs then that is just the same as "gluing together" N different decision problems (+ some epsilon of overhead)
I always thought that the "reverse" relationship between the time and the space requirements is explained by the simple fact that when you have a set of algorithms with certain time and space requirements and you constrain one of these parameters, the other parameter will not be necessarily (in practice oftentimes not) the most optimal. But it doesn't necessarily mean that the faster algorithm will require less memory or vice versa.
This "reverse" relationship works with other parameters. Such as in sorting algorithms, where besides time and space requirements you have stability requirement. If you constrain the sorting algorithms by stability, you won't have any advantage (will likely to have some disadvantage) in performance over non constrained algorithms. And wee see that the known stable sorting algorithms are either slower or require more memory. Nobody has yet invented a stable sorting algorithm that has comparable performance (both time and space) to non stable sorting algorithms. That's it.
https://arxiv.org/abs/2502.17779
“If you’re running out of memory, you can buy more. But if you’re running out of time, you’re screwed.”
The CPU is like an engine and memory is your gas tank. Idling the engine is bad, but leaving gas in the tank doesn't hurt, but it doesn't help either. I'm not gonna get to my destination faster because I have a full tank.
In fact I don’t think we would need processors anymore if we were centrally storing all of the operations ever done in our processors.
Now fast retrieval is another problem for another thread.
Actually I don't have any intuition for why that's wrong, except that if we catenate the rows into one long row then the picture can be considered as a number 307200 digits long in base 256, and then I see that it could represent 256^307200 possible different values. Which is a lot: https://www.wolframalpha.com/input?i=256%5E307200
There are other similar lightweight encoding schemes like RLE and delta and frame of reference encoding which all are good for different data distributions.
Alternately, have you considered 8 byte blocks?
If your block pointers are 8-byte addresses, you don't need to count on block sparsity, in fact, you don't even need to have the actual blocks.
A pointer type, that implements self-read and writes, with null allocations and deletes, is easy to implement incredibly efficiently in any decent type system. A true zero-cost abstraction, if I have ever seen one!
(On a more serious note, a memory heap and CPU that cooperated to interpret pointers with the top bit set, as a 63-bit linear-access/write self-storage "pointer", is an interesting thought.
It's basically how deduplication works in ZFS. And that's why it only makes sense when you store a lot of repetitive data, e.g. VM images.
https://github.com/philipl/pifs
Using an LLM and caching eg FAQs can save a lot of token credits
AI is basically solving a search problem and the models are just approximations of the data - like linear regression or fourier transforms.
The training is basically your precalculation. The key is that it precalculates a model with billions of parameters, not overfitting with an exact random set of answers hehe
Do LLM providers use caches for FAQs, without changing the number of tokens billed to customer?
Community-scale caching? That's basically what pre-compiled software distributions are. And one idea for addressing the programming language design balk "that would be a nice feature, but it's not known how to compile it efficiently, so you can't have it", is highly-parallel cloud compilation, paired with a community-scale compiler cache. You might not mind if something takes say a day to resolve, if the community only needs it run once per release.
Makes sense. Say you have a pattern (surrounded by empty space) that 'flickers': A-B-A-B-A... etc. Then as long as nothing intrudes, nth generation is the same pattern as in n+1000,000th generation. Similar for patterns that do a 3-cycle, 4-cycle etc.
All you'd need is a) a way to detect repeating patterns, and b) do some kind of collision detection between areas/patterns (there's a thing called 'lightspeed' in Life, that helps).
On my way to memoize your search history.
In time O(n) you can use O(n) cells on a tape, but there are O(2^n) possible configurations of symbols on a tape of length n (for an alphabet with 2 symbols), so you can do so much more with n space than with n time.
If intermediate results are entirely uncorrelated, with no overlap in the work at all, that would not hold - space will not help you. Edit: This kind of problem is very rare. Think of a cache with 0 percent hit rate - almost never happens.
And you can't really do it the other way around (at least not in current computing terms / concepts): you cannot use a single unit of time as a standin / proxy for hundreds of cells, since we don't quite have infinitely-broad SIMD architectures.
Expensive calculation, cheap storage → caching results helps.
Limited bandwidth / 'expensive' storage, simple calculation (see: today's hyper-fast CPU+L1 cache combo's) → better to re-compute some things on the fly as needed.
I suspect there's a lot of existing software (components) out there designed along the "save CPU cycles, burn storage" path, where in modern reality a "save storage, CPU cycles are cheap" would be more effective. CPU speeds have grown way way faster than main memory bandwidth (or even size?) over the last decades.
For a datacenter, supercomputer, embedded system, PC or some end-user's phone, the metrics will be different. But same principle applies.
I forget the name of the O(1) access model. Not UMA, something else.
(Quite aside from the practical "we build on the surface of the earth" consideration, heat dissipation considerations limit you to a 2 dimensional circuit in 3-space.)
EDIT: I am a dumbass and misremembered.
Something we all intuitively expected but someone finally figured out an obscure way to prove it.
--------
This is a really roundabout article that takes a meandering path to a total bombshell in the field of complexity theory. Sorry for spoiling but uhhh, you'd expect an article about P == PSPACE would get to the point faster....
Deleted Comment
More precisely, I think it is intuitive that the class of problems that can be solved in any time given O(n) space is far larger than the class of problems that can be solved in any space given O(n) time.
I did a lot of raster graphics programming, in my career, and graphics work makes heavy use of lookup tables.
Yesterday, I posted a rather simple tool I wrote[0]: a server that “frontloads” a set of polygons into a database, and then uses them, at query time. It’s fairly fast (but I’m sure it could be a lot faster). I wrote it in a few hours, and got pretty good performance, right out of the starting gate.
Pretty basic stuff. I doubt the pattern is unique, but it’s one that I’ve used for ages. It’s where I try to do as much “upfront” work as possible, and store “half-baked” results into memory.
Like I said, I always “just worked that way,” and never really thought about it. There’s been a lot of “rule of thumb” stuff in my work. Didn’t have an MIT professor to teach it, but it’s sort of gratifying to see that it wasn’t just “old wives” stuff.
There’s probably a ton of stuff that we do, every day, that we don’t think about. Some of it almost certainly originated from really smart folks like him, finding the best way (like the “winding number” algorithm, in that server[1]), and some of it also probably comes from “grug-brained programmers,” simply doing what makes sense.
[0] https://news.ycombinator.com/item?id=44046227
[1] https://github.com/LittleGreenViper/LGV_TZ_Lookup/blob/e247f...
What I mean is that traditionally I think peoples' ideas of lookup tables are things like statically baked arrays setup at compile time, or even first thing at runtime, and they never change. But if you loosen your adherence to that last idea a bit, where a lookup table can change slightly over time, you can get a ton of mileage out of a comparatively small amount of memory compared to wasting cycles every frame.
As for the right tool for the job, I've read tons of dev logs and research papers over the years about moving work to the GPU, but this last few months stint of ripping my game inside out has really made me see the light. It's not just lookup tables built at compile or early runtime, but lookup tables modified slightly over time, and sent to the GPU as textures and used there.
Follow this train of thought long enough, and now we're just calling memory writes and reads "lookup tables" when they aren't really that anymore, but whos to say where the barrier really lies?
If you take the example of a game, drawing sprites say. Drawing a single preloaded sprite of reasonable size is always cheap, so a slow frame must have an excessive number. It's very hard to construct a practical scenario of a large number of truly distinct sprites though. A level has a finite tile palette, a finite cast of characters, abilities, etc. It's hard to logistically get them all into a scene together, and even then it won't be that many. So the only scenario left where sprite drawing will be slow is drawing the same handful of sprites over and over again. By contrast that's super common: just spam a persistent projectile, tap the analog stick to generate dust particles, etc.
For example, each block of pixels might have some calculated characteristics that were accessed by a hash into a LUT, but the characteristics would change, as we went through the image.
We'd do a "triage" run, where we'd build the LUT, then a "detailed" run, where we'd apply the LUT to the pixels.
It could get pretty hairy.
And paper: https://people.csail.mit.edu/rrw/time-vs-space.pdf
If you expect N ones at the output, how can this machine be simulated in the space smaller than N?
This machine must decrement the digit N at the beginning of the tape, and move to the end of the tape to write "1", so it runs in time O(N^2), not O(N)? (as it takes N "trips" to the end of the tape, and each "trip" takes 1, 2, 3 .. N steps)
Since turing machines can not jump to any place on a tape in constant time (like computers can), does it have any impact on real computers?
But to answer your question: "space" here refers to working space, excluding the input and output.
EDIT: This makes sense because if you look at all problems with N outputs then that is just the same as "gluing together" N different decision problems (+ some epsilon of overhead)
This "reverse" relationship works with other parameters. Such as in sorting algorithms, where besides time and space requirements you have stability requirement. If you constrain the sorting algorithms by stability, you won't have any advantage (will likely to have some disadvantage) in performance over non constrained algorithms. And wee see that the known stable sorting algorithms are either slower or require more memory. Nobody has yet invented a stable sorting algorithm that has comparable performance (both time and space) to non stable sorting algorithms. That's it.