GistNoesis commented on The RAM shortage comes for us all   jeffgeerling.com/blog/202... · Posted by u/speckx
lysace · 13 days ago
Please explain to me like I am five: Why does OpenAI need so much RAM?

2024 production was (according to openai/chatgpt) 120 billion gigabytes. With 8 billion humans that's about 15 GB per person.

GistNoesis · 13 days ago
What they need is not so much memory but memory bandwidth.

For training, their models need a certain amount of memory to store the parameters, and this memory is touched for every example of every iteration. Big models have 10^12 (>1T) parameters, and with typical values of 10^3 examples per batch and 10^6 iterations, they need ~10^21 memory accesses per run. And they want to do multiple runs.

DDR5 RAM bandwidth is 100GB/s = 10^11 bytes/s, graphics RAM (HBM) is 1TB/s = 10^12 bytes/s. By buying the wafers they get to choose which type of memory they get.

10^21 / 10^12 = 10^9 s = 30 years of memory accesses (just to update the model weights); you also need to add a factor of 10^1-10^3 to account for the memory accesses needed for the model computation.

But the good news is that it parallelizes extremely well. If you replicate your 1T parameters 10^3 times, your run time is brought down to 10^6 s = 12 days. But you then need 10^3 * 10^12 = 10^15 bytes of RAM per run for the weight updates and 10^18 for the computation (your 120 billion gigabytes is 10^20, so not so far off).
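A minimal back-of-envelope sketch of the arithmetic above (the parameter count, batch size, iteration count and bandwidth figures are the rough assumptions from this comment, not measured values):

```python
# Back-of-envelope, memory-bandwidth-limited training time (order of magnitude only).
params = 1e12        # ~1T parameters
batch = 1e3          # examples per batch
iters = 1e6          # optimizer iterations per run

accesses = params * batch * iters      # ~1e21 parameter touches per run
hbm_bandwidth = 1e12                   # ~1 TB/s of HBM per device

single_device_seconds = accesses / hbm_bandwidth
print(single_device_seconds / (3600 * 24 * 365))        # ~30 years on one device

replicas = 1e3                         # data-parallel copies of the weights
print(single_device_seconds / replicas / (3600 * 24))   # ~12 days
print(replicas * params)               # ~1e15 bytes of weights (at ~1 byte/param)
```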

Are all these memory accesses technically required? No, if you use other algorithms, but more compute and memory is better if money is not a problem.

Is it strategically good to deprive your competitors of access to memory? In a very short-sighted way, yes.

It's a textbook cornering of the computing market to prevent the emergence of local models, because customers won't be able to buy the minimal RAM necessary to run the models locally, even just for inference (not training). Basically a war on people, where little Timmy won't be able to get a RAM stick to play computer games at Xmas.

GistNoesis commented on I mathematically proved the best "Guess Who?" strategy [video]   youtube.com/watch?v=_3RNB... · Posted by u/surprisetalk
GistNoesis · 19 days ago
In the video, in the continuous version, the game never ends and highlights the "loser" strategy.

When you are behind, the optimal play is to make a gamble, which will most likely leave you even worse off. From the naive winning side, it looks like the loser is just playing a stupid strategy by not following the optimal dichotomy, and that's why they are losing. But in fact they are a "player" doing not only their best, but the best that can be done.

The infinite sum of ever smaller probabilities, like in Zeno's paradox, converges towards a finite value. The inevitable result is that a large fraction of the time you are playing catch-up and will never escape.

You are losing while playing optimally, slowly realising from the probabilities that you are a loser, as evidenced by the score, which will most likely go down even more next round. Most likely the entire future is an endless sequence of ever more desperate-looking losing bets, hoping for the one strike that will most likely never come.

In economics such things are called "traps"; the poverty trap, for example, exhibits similar mechanics. Even though you display incredible ingenuity by playing the optimal game strategy, most of the time you will never escape, and you will need to take even more desperate measures in the future. That's separating the wheat from the chaff, seen from the chaff's perspective, or how you make good villains: because, like Bane in Batman, there are times (the probability is slim but finite) where the gamble pays off, you escape the hell hole you were born in, and you become legend.

If you don't play this optimal strategy you will lose more slowly but even more surely. The optimal strategy is to bet just enough to go from your current situation to the winning side. It's also important not to overshoot: this is not always taking moonshots, but betting just enough to escape the hole, because once out, the probabilities play in your favor.
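Not the Guess Who game itself, but a minimal gambler's-ruin sketch of the same "bet just enough" idea (the win probability, bankroll and target are arbitrary assumptions):

```python
import random

def play(bankroll, target, p_win, bet_fn, rng):
    """Play an unfavorable betting game until ruin or reaching the target."""
    while 0 < bankroll < target:
        bet = min(bet_fn(bankroll, target), bankroll)
        bankroll += bet if rng.random() < p_win else -bet
    return bankroll >= target

def estimate(bet_fn, trials=50_000, bankroll=20, target=100, p_win=0.45):
    rng = random.Random(0)
    wins = sum(play(bankroll, target, p_win, bet_fn, rng) for _ in range(trials))
    return wins / trials

# "Bold" play: bet just enough to reach the target (capped by the bankroll).
print(estimate(lambda b, t: t - b))
# "Timid" play: grind with small bets; the house edge eats you almost surely.
print(estimate(lambda b, t: 1))
```

With these assumed numbers the bold player escapes a few percent of the time, while the timid player essentially never does: losing more slowly, but even more surely.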

GistNoesis commented on Simulating a Planet on the GPU: Part 1 (2022)   patrickcelentano.com/blog... · Posted by u/Doches
dahart · a month ago
Might be worth starting with a baseline where there’s no collision, only advection, and assume higher than 1fps just because this gives higher particles per second but still fits in 24GB? I wouldn’t be too surprised if you can advect 100M particles at interactive rates.
GistNoesis · a month ago
The theoretical maximum rate for 1B-particle advection (just doing p[] += v[]*dt) is 1000GB/s / 24GB = 42 iterations per second. If you only have 100M particles, you can have 10 times more iterations.

But that's without any rendering, and with non-interacting particles, which are extremely boring unless you like fireworks. (You can add a term like v[] += g*dt for free.) And you don't need to store colors for your particles if you can compute the colors from the particle number with a function.
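A minimal PyTorch sketch of that advection-only baseline on a CUDA GPU (the particle count and timing loop are just assumptions to probe the bandwidth bound, not the article's code):

```python
import torch, time

n = 100_000_000                       # 100M particles (~2.4 GB for p and v in float32)
device = "cuda"
p = torch.rand(n, 3, device=device)   # positions
v = torch.randn(n, 3, device=device)  # velocities
g = torch.tensor([0.0, 0.0, -9.81], device=device)
dt = 1e-3

torch.cuda.synchronize()
t0 = time.time()
steps = 100
for _ in range(steps):
    v += g * dt                       # the "free" gravity term
    p += v * dt                       # advection: p[] += v[]*dt
torch.cuda.synchronize()
print(steps / (time.time() - t0), "iterations per second")
```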

Rasterizing is slower, because each pixel of the image might get touched by multiple particles (which means concurrent accesses to the same memory address, which GPUs don't like).

Obtaining the screen coordinates is just a matrix multiply, but rendering the particles in the correct depth order requires multiple passes, atomic operations, or z-sorting. Alternatively you can slice your point cloud, by weighting particles with a peak-shaped function around the desired depth value and using an order-independent reduction like a sum, but the memory accesses are still concurrent.

For the rasterizing, you can also use the space-partitioning indices of the particles to render to a part of the screen independently, without concurrent-access problems. That's called "tile rendering": each tile renders the subset of particles which may fall in it. (There is plenty of literature in the Gaussian Splatting community.)
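A minimal sketch of the binning step behind that tile idea (the screen size, tile size, and per-tile counting are assumptions; a real renderer would also bucket or sort the particle indices per tile):

```python
import torch

H, W, tile = 1088, 1920, 16
xy = torch.rand(1_000_000, 2, device="cuda") * torch.tensor([W, H], device="cuda")

# Integer tile coordinates, then a flat tile index per particle.
tx = (xy[:, 0] // tile).long().clamp_(0, W // tile - 1)
ty = (xy[:, 1] // tile).long().clamp_(0, H // tile - 1)
tile_id = ty * (W // tile) + tx

# Count particles per tile; each tile can then rasterize only its own subset.
counts = torch.bincount(tile_id, minlength=(H // tile) * (W // tile))
print(counts.shape, counts.max().item())
```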

GistNoesis commented on Simulating a Planet on the GPU: Part 1 (2022)   patrickcelentano.com/blog... · Posted by u/Doches
janpmz · a month ago
I wish I had an intuitive understanding of how much I can do with a GPU. E.g. how many points can I move around? A simulation like this would be great for that.
GistNoesis · a month ago
TLDR: 1B particles ~ 3 s per iteration

For examples like particle simulations, on a single node with a 4090 GPU everything running on GPU without memory transfer to the CPU:

- The main bottleneck is memory usage: 24GB available; storing 3 position coordinates + 3 velocity coordinates per particle, at 4 bytes per number (float32), gives a max of ~1B particles.

- Then GPU memory bandwidth: if everything is on the GPU you get between 1000GB/s for global memory accesses and 10000GB/s when the shared-memory caches are hit. The number of memory accesses is roughly proportional to the number of effective collisions between your particles, which is proportional to the number of particles, so around 12-30 accesses per particle (see the optimal sphere-packing number of neighbors in 3D, multiplied by your overlap factor). All in all, for 1B particles you can collide them all and move them in 1 to 10 s.

If you have to transfer things to the CPU, you become limited by the PCI-Express 4.0 bandwidth of 16GB/s. So you can at most move 1B particles to and from the GPU about 0.7 times per second.

Then if you want to store the particles on disk instead of RAM, because your system is bigger, you can either use an M.2 SSD (but you will burn through them quickly), which has a theoretical bandwidth of 20GB/s, so not a bottleneck, or use network storage over 100Gb/s (= 12.5GB/s) Ethernet, via two interfaces, to a parameter server which can be as big as you can afford.

So to summarize so far: 1B particles take 1 to 10 s per iteration per GPU. If you want to do smarter integration schemes like RK4, you divide by 6. If you need 64-bit precision you divide by 2. If you only need 16-bit precision you can multiply by 2.
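A minimal calculator for these rules of thumb (the 24GB, 1000GB/s and 12-30 accesses figures come from this comment; the 0.25 bandwidth-efficiency factor for scattered accesses is an added assumption):

```python
def particle_budget(vram_bytes=24e9, bytes_per_particle=24):
    """Max particles that fit: 3 positions + 3 velocities in float32 = 24 bytes."""
    return vram_bytes / bytes_per_particle

def seconds_per_iteration(n_particles, bandwidth=1000e9, bytes_per_particle=24,
                          accesses_per_particle=30, bandwidth_efficiency=0.25):
    """Bandwidth-bound estimate: each particle's state is touched ~12-30 times per
    step by neighbor interactions, and scattered accesses only reach a fraction
    of peak bandwidth."""
    bytes_moved = n_particles * bytes_per_particle * accesses_per_particle
    return bytes_moved / (bandwidth * bandwidth_efficiency)

n = particle_budget()                      # ~1e9 particles on a 24GB card
print(n, seconds_per_iteration(n))         # lands in the 1-10 s range
```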

The number of particles you need: volume of the box / h^3, with h the diameter of a particle = the finest detail you want to be able to resolve.

If you use an adaptive scheme, most of your particles are close to the surfaces of objects, so O(surface of the objects / h^2), with h = the average resolution of the surface mesh. But adaptive schemes are about 10 times slower.

The precision of the approximation can be bounded by Taylor's formula. SPH is typically order 2, but has issues with boundaries, so to represent a sharp boundary h must be small.

If you want higher order and sharp boundaries, you can use the Finite Element Method instead. But you'll need to tessellate the space with things like Delaunay/Voronoi meshes, and update them as they move.

GistNoesis commented on Leaving Meta and PyTorch   soumith.ch/blog/2025-11-0... · Posted by u/saikatsg
morshu9001 · a month ago
What would stop PyTorch from implementing whatever optimization trick becomes important? Even if it requires a different API.
GistNoesis · a month ago
There are two types of stops: soft stops and hard stops.

- Soft stops are when the dynamic-graph computation overhead is too much, which means you can still calculate, but if you were to write the function manually or with a better framework, you could be 10x faster.

Typical examples involve manually unrolling a loop, or doing kernel fusion. Another typical example is when you have lots of small objects or need to do loops in Python because the code doesn't vectorize well. Or exploiting sparsity efficiently by ignoring the zeros.

- Hard stops are when computing the function becomes impossible, because the memory needed to do the computation in a non-optimal way explodes. Sometimes you can get away with just writing customized kernels.

The typical example where you can get away with it is custom attention layers.

The typical example where you can't get away with it is physics simulations. For example, the force is the gradient of the energy, but you have n^2 interactions between the particles, so if you preserve anything more than 0 memory per interaction during the forward pass, your memory consumption explodes (as sketched below). And typically with things like Lagrangian or Hamiltonian neural networks, where you try to discover the dynamics of an energy-conserving system, you need to be able to differentiate at least three times in a row.
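A minimal sketch of that pattern, assuming a toy softened pairwise energy (the (n, n, 3) difference tensor built here is exactly the O(n^2) intermediate a naive autograd implementation keeps alive for the backward pass):

```python
import torch

def energy(p):
    """Toy pairwise energy; the (n, n, 3) difference tensor is the O(n^2)
    intermediate that autograd keeps alive for the backward pass."""
    diff = p.unsqueeze(0) - p.unsqueeze(1)          # (n, n, 3)
    dist2 = (diff * diff).sum(-1) + 1e-6            # softened squared distances
    inv = dist2.rsqrt()
    return inv.triu(diagonal=1).sum()

p = torch.randn(1000, 3, requires_grad=True)
E = energy(p)
# Force = -dE/dp; create_graph=True keeps the graph so the result can be
# differentiated again (what Lagrangian/Hamiltonian training needs, at an
# even larger memory cost).
force = -torch.autograd.grad(E, p, create_graph=True)[0]
print(force.shape)   # (1000, 3); memory grew as O(n^2), not O(n)
```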

There are also energy-expending stops, where you need to find workarounds to make things work, for example if you want your parameters to change shape during the optimization process, like learning point clouds of growing size; these spread you thin, so they won't be standardized.

GistNoesis commented on Leaving Meta and PyTorch   soumith.ch/blog/2025-11-0... · Posted by u/saikatsg
Uehreka · a month ago
> at the pace of current AI code development, probably one or two years before Pytorch is old history.

Ehhh, I don’t know about that.

Sure, new AI techniques and new models are coming out pretty fast, but when I go to work with a new AI project, they’re often using a version of PyTorch or CUDA from when the project began a year or two ago. It’s been super annoying having to update projects to PyTorch 2.7.0 and CUDA 12.8 so I can run them on RTX 5000 series GPUs.

All this to say: If PyTorch was going to be replaced in a year or two, we’d know the name of its killer by now, and they’d be the talk of HN. Not to mention that at this point all of the PhDs flooding into AI startups wrote their grad work in PyTorch, it has a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at. I don’t even know what that would be.

Bear in mind that it took a few years for Tensorflow to die out due to lock in, and we all knew about PyTorch that whole time.

GistNoesis · a month ago
> a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at

The cost of migrating higher-level code to a newer framework is going to 0. You ask your favorite agent (or intern) to port it and check that the migration is exact. We already see this in the multitude of deep-learning frameworks.

The day there is one optimization trick that PyTorch can't do but another framework can, one which reduces your training cost 10x, PyTorch goes the way of the dodo.

The day an architecture which can't be implemented in PyTorch gets superior performance, it's bye-bye Python.

We already see this with architectures which require real-time rendering, like Gaussian Splatting (Instant NeRF), or with the caching strategies for LLM sequence generation.

PyTorch has 3 main selling points:

- Abstracting away the GPU (or device) specific code, which is needed because of Nvidia's mess: custom optimized kernels, which you are forced to adopt if you don't want to write custom kernels yourself.

That advantage disappears if you don't mind writing optimized kernels because the machine writes them, or if you don't need CUDA because you can't use Nvidia hardware (for example you are in China), or if you use custom silicon, like Groq, and need your own kernels anyway.

- Automatic differentiation. It's one of its weak points, because they went for easy instead of optimal, and shut themselves off from some architectures. A language like Julia, thanks to dynamic low-level compilation, can do things PyTorch won't even dream about (but Julia has its own problems, mainly related to memory allocations). With PyTorch's introduction of the "scan" function [2] we have come full circle back to Theano, TensorFlow's/Keras' ancestor; scan is usually where the bad automatic-differentiation strategy chosen by PyTorch becomes painful.

The optimal solution, as all physics PhDs who wrote simulations know, is writing custom adjoint code via 'source code transformation' or symbolically: it's not hard but very tedious, so it's now a great fit for your LLM (or intern, or PhD candidate running 'student gradient descent'), provided you prove or check that the gradient calculation is correct (see the sketch after this list).

- Cluster orchestration and serialization: a model can be shared with fewer security risks than arbitrary source code, because you only share weights. A model can be split between machines dynamically. But this is also a big weakness, because your code rots as you become dependent on versioning: you are locked to the specific version number your model was trained on.
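A minimal sketch of what hand-written adjoint code looks like when plugged back into PyTorch (the toy function and its adjoint are purely illustrative; gradcheck is the "check your gradient" step):

```python
import torch

class SoftAbs(torch.autograd.Function):
    """Toy primitive y = sqrt(x^2 + 1e-6) with a hand-derived adjoint."""
    @staticmethod
    def forward(ctx, x):
        y = torch.sqrt(x * x + 1e-6)
        ctx.save_for_backward(x, y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        x, y = ctx.saved_tensors
        # Hand-written adjoint: dy/dx = x / sqrt(x^2 + 1e-6) = x / y
        return grad_out * x / y

x = torch.randn(8, dtype=torch.double, requires_grad=True)
# gradcheck compares the hand-written adjoint against finite differences.
print(torch.autograd.gradcheck(SoftAbs.apply, (x,)))
```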

[2] https://docs.pytorch.org/xla/master/features/scan.html

GistNoesis commented on Leaving Meta and PyTorch   soumith.ch/blog/2025-11-0... · Posted by u/saikatsg
TechnicolorByte · a month ago
Can anyone recommend a technical overview describing the design decisions PyTorch made that led it to win out?
GistNoesis · a month ago
The choice of the dynamic computation graph [1] of PyTorch made it easier to debug and implement, leading to higher adoption, even though running speed was initially slower (and therefore training cost higher).

Other decisions follow from this one.

TensorFlow started static and had to move to dynamic at version 2.0, which broke everything, causing fragmentation between TensorFlow 1, TensorFlow 2, Keras, and JAX.

PyTorch's later compilation of this computation graph erased TensorFlow's remaining edge.
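A minimal illustration of the dynamic-graph point, assuming nothing beyond stock PyTorch 2.x: ordinary Python control flow shapes the graph at runtime, and the same function can later be handed to torch.compile:

```python
import torch

def f(x):
    # Ordinary Python control flow decides the graph structure at runtime,
    # and can be stepped through with a debugger or print statements.
    if x.norm() > 1:
        x = x / x.norm()
    for _ in range(int(x.numel()) % 3 + 1):
        x = torch.tanh(x)
    return x.sum()

x = torch.randn(5, requires_grad=True)
f(x).backward()            # eager, dynamic graph: easy to debug
print(x.grad)

g = torch.compile(f)       # the same function, compiled for speed
print(g(torch.randn(5)))
```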

Is the battle over? From a purely computational point of view, PyTorch's solution is very far from optimal, and billions of dollars of electricity and GPUs are burned every year, but the major players are happy with circular deals to entrench their positions. So at the pace of current AI code development, probably one or two years before PyTorch is old history.

[1] https://www.geeksforgeeks.org/deep-learning/dynamic-vs-stati...

GistNoesis commented on Backpropagation is a leaky abstraction (2016)   karpathy.medium.com/yes-y... · Posted by u/swatson741
drivebyhooting · 2 months ago
I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly. It gets per weight normalization, fudged with momentum, clipped, etc.

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?

GistNoesis · 2 months ago
>computing the exact gradient using calculus

First of all, gradient computation with back-prop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge-cases that are not relevant here) so it's not about the way of computing the gradient.

What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function. And in this design, what matters for learning is not so much the value of the loss function but its slopes and curvature (peaks and valleys).

The problematic case is flat valleys surrounded by steep cliffs (picture the Grand Canyon).

Advanced optimizers in deep learning like Adam are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient, the optimizer has an estimate of the scale sensitivity of each parameter independently. So the cheap thing it can reasonably do is modulate the gradient with this scale.
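A minimal Adam-style update sketch to make the per-parameter scale estimate concrete (the hyperparameters are the usual defaults; this is an illustration, not torch.optim.Adam's exact implementation):

```python
import torch

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: m tracks the gradient direction (momentum), v tracks a
    per-parameter squared-gradient scale (the diagonal 'curvature' proxy)."""
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # The gradient is modulated per parameter by 1/sqrt(v_hat): parameters with
    # large recent gradients get smaller steps, tiny ones get larger steps.
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))

p = torch.randn(10)
m, v = torch.zeros_like(p), torch.zeros_like(p)
for t in range(1, 2001):
    grad = 2 * p            # gradient of the toy loss ||p||^2
    adam_step(p, grad, m, v, t)
print(p.norm())             # much smaller than at initialization
```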

The length of the gradient vector is often problematic, so what classical optimizers usually do is something called "line search", i.e. determining the optimal step size along this direction. But the cost of doing that is usually 10-100 evaluations of the cost function, which is often not worth the effort in the noisy stochastic context, compared to just taking a smaller step multiple times.

Higher-order optimizers require the loss function to be twice differentiable, so non-linearities like ReLU, which are cheap to compute, can't be used.

Lower-order global optimizers don't even require the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).

GistNoesis commented on Show HN: Strange Attractors   blog.shashanktomar.com/po... · Posted by u/shashanktomar
GistNoesis · 2 months ago
How do I write my custom attractor equation?
GistNoesis commented on EQ: A video about all forms of equalizers   youtube.com/watch?v=CLAt9... · Posted by u/robinhouston
GistNoesis · 2 months ago
Can measurements and calibration be done easily at home to see if your sound system is well calibrated?

I was thinking of playing pink noise on the speakers and recording it with a cheap microphone, or displaying it with the Spectroid app, but the microphone probably has its own frequency response.

Is there an app for that, with each phone model's microphone factory-calibrated?

Is there a way to use known facts about physics, like harmonics having a specific shape (timbre), to help equalize frequencies with respect to each other? Or to calibrate from various microphone positions, so that any cheap microphone can do the trick?
