ekelsen commented on Are OpenAI and Anthropic losing money on inference?   martinalderson.com/posts/... · Posted by u/martinald
GaggiX · 2 days ago
Your calculations make no sense. Why are you loading the model for each token independently? You can process all the input tokens at the same time as long as they can fit in memory.

You are doing the calculation as if they were output tokens generated one at a time in a single batch; that would not make sense even in the decode phase.

ekelsen · 2 days ago
Then the right calculation is to use FLOPs not bandwidth like they did.
ekelsen commented on Are OpenAI and Anthropic losing money on inference?   martinalderson.com/posts/... · Posted by u/martinald
ekelsen · 2 days ago
The math on the input tokens is definitely wrong. It claims each instance (8 GPUs) can handle 1.44 million tokens/sec of input. Let's check that out.

1.44e6 tokens/sec * 37e9 bytes/token / 3.3e12 bytes/sec/GPU = ~16,000 GPUs

And that's assuming a more likely 1 byte per parameter.

So the article is only off by a factor of at least 1,000. I didn't check any of the rest of the math, but that probably has some impact on their conclusions...
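The arithmetic above is easy to check; a quick sketch using only the numbers from this comment (1.44e6 tokens/sec claimed per instance, ~37e9 bytes of active weights per token at 1 byte/parameter, 3.3e12 bytes/sec of HBM bandwidth per GPU):

```python
# If the weights had to be streamed from HBM once per input token,
# how many GPUs would the claimed throughput require?
tokens_per_sec = 1.44e6      # claimed input throughput per 8-GPU instance
bytes_per_token = 37e9       # ~37B active params at 1 byte each
hbm_bw_per_gpu = 3.3e12      # HBM bandwidth per GPU, bytes/sec

gpus_needed = tokens_per_sec * bytes_per_token / hbm_bw_per_gpu
print(round(gpus_needed))    # ~16,000 GPUs, vs. the 8 in one instance
```

That's roughly a factor of 2,000 more GPUs than the instance actually has, which is where the "off by a factor of at least 1,000" claim comes from.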

ekelsen commented on A general Fortran code for solutions of problems in space mechanics [pdf]   jonathanadams.pro/blog-ar... · Posted by u/keepamovin
nyc111 · 11 days ago
It looks like they chose to use the "universal gravitational constant" "k" instead of Newton's constant, "G": p.23, "k^2 = universal gravitational constant, 1.32452139x10^20, m^3/(sec^2)(sun mass units)"

I think "k" was also known as "Gaussian gravitational constant" https://en.wikipedia.org/wiki/Gaussian_gravitational_constan...

But the value and unit of "k" given on the Wikipedia page are different. Do you know what the NASA document means by "universal gravitational constant" in the modern sense?

ekelsen · 11 days ago
I think it's just units. From wikipedia: "and its value in radians per day follows by setting Earth's semi-major axis (the astronomical unit, au) to unity, k:(rad/d) = (GM)0.5·au−1.5."

The value given in the paper assumes the distance is in meters, I think.
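The unit conversion supports this reading; a sketch, assuming the standard Gaussian constant k = 0.01720209895 (in au^1.5/day with mass in solar masses) and rescaling k^2 = GM_sun to meters and seconds:

```python
# Convert the Gaussian gravitational constant k from au/day units to SI.
# k^2 = G*M_sun in au^3 / day^2 per solar mass; rescaling to m^3/s^2
# should land near the paper's 1.3245e20 figure.
k = 0.01720209895      # Gaussian gravitational constant, au^1.5 / day
au = 1.495978707e11    # astronomical unit in meters
day = 86400.0          # seconds per day

GM_sun = k**2 * au**3 / day**2   # m^3 / s^2
print(f"{GM_sun:.4e}")           # ~1.327e20, close to the paper's value
```

The small remaining difference from the paper's 1.32452139e20 is presumably down to the older astronomical constants in use at the time.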

ekelsen commented on 19% of California houses are owned by investors   ocregister.com/2025/07/21... · Posted by u/milleramp
fortran77 · 23 days ago
> The real issue with CA real estate is prop 13

I don't want to be forced out of my home because the neighbors paid too much for theirs raising my "property value" and tax assessment.

ekelsen · 23 days ago
They paid too much? The invisible hand my friend...
ekelsen commented on Denver rent is back to 2022 prices after 20k new units hit the market   denverite.com/2025/07/25/... · Posted by u/matthest
bufferoverflow · a month ago
You're not making any sense. Someone who paid for a house fully expects to pay a lot less than renters do. That's the point of buying a house.
ekelsen · a month ago
The point of buying a primary house is to be able to do what you want with it. And to know you can live there as long as you want.

Economics dictates how valuable those things are and what premium they have over renting.

ekelsen commented on Surprisingly fast AI-generated kernels we didn't mean to publish yet   crfm.stanford.edu/2025/05... · Posted by u/mfiguiere
AlotOfReading · 3 months ago
Only seems to have done that in a couple of places, like the MatMul. The softmax kernel (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) seems to be entirely bog-standard, and the layernorm kernels are only slightly more interesting.
ekelsen · 3 months ago
I looked at the softmax kernel and the cast that it does from a float* to a float4* is extremely brittle -- it's trivial to break by offsetting the input slightly.

Very likely a kernel for a standard library could not employ such a trick that relies on alignment of input pointers. Certainly not without a fallback.

ekelsen commented on Surprisingly fast AI-generated kernels we didn't mean to publish yet   crfm.stanford.edu/2025/05... · Posted by u/mfiguiere
ekelsen · 3 months ago
"the reference code is in the default FP32, and given a tolerance threshold (1e-02)"

That's a huge tolerance, and it allows them to use fp16 operations to replace the "fp32" kernel.
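The point is easy to demonstrate; a toy sketch (my own minimal softmax, not the kernel from the post), assuming the check is an elementwise absolute tolerance of 1e-2 against an fp32 reference:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128)).astype(np.float32)

def softmax(a, dtype):
    # Numerically-stabilized softmax computed entirely in `dtype`.
    a = a.astype(dtype)
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

ref = softmax(x, np.float32)                      # the "fp32" reference
fast = softmax(x, np.float16).astype(np.float32)  # half-precision version

err = np.abs(fast - ref).max()
print(err)   # nonzero, but comfortably under the 1e-2 threshold
```

Since softmax outputs lie in [0, 1] and fp16 carries roughly three decimal digits, the half-precision result sails through an absolute 1e-2 check despite doing none of the work in fp32.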

ekelsen commented on Surprisingly fast AI-generated kernels we didn't mean to publish yet   crfm.stanford.edu/2025/05... · Posted by u/mfiguiere
ekelsen · 3 months ago
"FP32 is less common in modern ML workloads and often less optimized on recent hardware compared to FP16 or BF16, which may partly explain why it’s easier to achieve performance gains over PyTorch with FP32 kernels."

People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels where developer effort has gone and that are actually used.

ekelsen commented on Why Momentum Works (2017)   distill.pub/2017/momentum... · Posted by u/vector_spaces
shoo · 4 months ago
It's unclear if increasing the dimensionality is in itself a challenge, provided that the objective function is still convex with a unique global minimum -- like these somewhat problematic Rosenbrock test objective functions used in the examples in the article.

On the other hand, if the objective function is very multimodal with many "red herring" local minima, perhaps an optimiser that is very good at finding a local minimum might do worse in practice at global optimisation than an optimiser that sometimes "barrels" out of the basin of a local minimum and accidentally falls into a neighbouring basin around a lower one.

I ran a few numerical experiments using scipy's "rosen" test function [1] as the objective, in D=10,000 dimensions. This function has a unique global minimum of 0, which is attained at x* = 1_D. I set the initial guess as x0 := x* + eps_i, where for each element i=1,...,D, eps_i is noise sampled from N(0, 0.05).

Repeating this over 100 trial problems, using the same initial guess x0 across each method during each trial, the average number of gradient evaluations required for convergence was

  'cg': 248
  'l-bfgs-b': 40
  'm-001-99': 3337
All methods converged in 100 / 100 trials.

m-001-99 is gradient descent with momentum using alpha=0.001 and beta=0.99. Setting alpha=0.002 or higher causes momentum to fail to converge. The other two methods are scipy's cg & l-bfgs-b methods using default parameters (again, under the hood these two methods rely on a port of MINPACK2's dcsrch to determine the step size along the descent direction during each iteration; they're not using momentum updates or a fixed step size). I used l-bfgs-b instead of bfgs to avoid maintaining the dense D*D matrix for the approximate inverse Hessian.

One point in momentum's favour was robustness to higher noise levels used to generate the initial guess -- if the noise level used to define the initial guess x0 is increased to N(0, 1) then I see the cg & l-bfgs-b methods fail to converge in around 20% of trial problems, while momentum fails a lower fraction of the time provided the fixed step size is set small enough, but still requires a very large number of gradient evaluations to converge.

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.o...
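For reference, a minimal sketch of the momentum setup described above (alpha=0.001, beta=0.99, scipy's rosen/rosen_der as the objective); the exact update formulation is my assumption of one common variant, and D is reduced here so it runs quickly:

```python
import numpy as np
from scipy.optimize import rosen, rosen_der

D = 100
rng = np.random.default_rng(0)
x = np.ones(D) + rng.normal(0.0, 0.05, size=D)  # x0 = x* + N(0, 0.05) noise
f0 = rosen(x)

alpha, beta = 0.001, 0.99   # step size and momentum from the experiment above
v = np.zeros(D)
for _ in range(3000):
    v = beta * v - alpha * rosen_der(x)   # velocity accumulates past gradients
    x = x + v

print(rosen(x))   # far below the starting value f0
```

As noted above, nudging alpha up makes this fixed-step scheme unstable, which is exactly the tuning fragility the line-search methods avoid.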

ekelsen · 4 months ago
Perhaps you took my comment too literally. Try it on a real neural network; it doesn't work.
ekelsen commented on Why Momentum Works (2017)   distill.pub/2017/momentum... · Posted by u/vector_spaces
shoo · 4 months ago
I was curious how well the simple momentum step-size approach shown in the first interactive example compares to alternative methods. The example function featured in the first interactive example is named bananaf ("Rosenbrok Function banana function"), defined as

  var s = 3
  var x = xy[0]; var y = xy[1]*s
  var fx   = (1-x)*(1-x) + 20*(y - x*x )*(y - x*x )
  var dfx  = [-2*(1-x) - 80*x*(-x*x + y), s*40*(-x*x + y)]
The interactive example uses an initial guess of [-1.21, 0.853] and a fixed 150 iterations, with no convergence test.

From manually fiddling with (step-size) alpha & (momentum) beta parameters, and editing the code to specify a smaller number of iterations, it seems quite difficult to tune this momentum-based approach to get near the minimum and stay there without bouncing away in 50 iterations or fewer.

Out of curiosity, I compared minimising this bananaf function with scipy.optimize.minimize, using the same initial guess.

If we force scipy.optimize.minimize to use method='cg', leaving all other parameters as defaults, it converges to the optimal solution of [1.0, 1./3.], requiring 43 evaluations of fx and dfx.

If we allow scipy.optimize.minimize to use all defaults -- including the default method='bfgs', it converges to the optimal solution after only 34 evaluations of fx and dfx.

Under the hood, scipy's method='cg' and method='bfgs' solvers do not use a fixed step size or momentum to determine the step size, but instead solve a line search problem. The line search problem is to identify a step size that satisfies a sufficient decrease condition and a curvature condition - see Wolfe conditions [1]. Scipy's default line search method -- used for cg and bfgs -- is a python port [2] of the dcsrch routine from MINPACK2. A good reference covering line search methods & BFGS is Nocedal & Wright's 2006 book Numerical Optimization.

[1] https://en.wikipedia.org/wiki/Wolfe_conditions [2] https://github.com/scipy/scipy/blob/main/scipy/optimize/_dcs...
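The comparison is easy to reproduce; a sketch of the same bananaf objective in Python with the article's s=3 scaling and initial guess (note scipy estimates the gradient numerically here, so evaluation counts won't match the analytic-gradient numbers above):

```python
import numpy as np
from scipy.optimize import minimize

S = 3.0  # the same y-axis scaling the article's code applies

def bananaf(p):
    # Scaled Rosenbrock "banana" from the article's first example.
    x, y = p[0], p[1] * S
    return (1 - x)**2 + 20 * (y - x**2)**2

x0 = np.array([-1.21, 0.853])               # the article's initial guess
res = minimize(bananaf, x0, method='BFGS')  # line-search based, no momentum
print(res.x)   # lands near the optimum [1.0, 1/3]
```

No step-size tuning is needed: the Wolfe-condition line search picks a workable step at every iteration, which is the contrast with the fixed alpha/beta fiddling described above.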

ekelsen · 4 months ago
Now try the same experiment in 1 billion dimensions.
