Readit News
linolevan · 11 days ago
There was this very interesting paper out of Stanford this last September about pretraining under the unlimited compute but limited data paradigm[0]. Pretty much exactly the same thing but with ~200M training tokens instead.

[0] https://www.alphaxiv.org/abs/2509.14786

sdpmas · 11 days ago
yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.
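For anyone curious what the ensembling part can look like at inference time, here's a toy sketch (my own illustration, not code from the repo) that averages next-token probabilities across independently trained members:

```python
# Toy logit-ensembling sketch: average per-model softmax outputs.
# All names here are illustrative, not from the actual repo.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(per_model_logits):
    """Average next-token probabilities across ensemble members."""
    probs = [softmax(l) for l in per_model_logits]
    return np.mean(probs, axis=0)

# toy example: two "models" that disagree; the ensemble splits the difference
a = np.array([[2.0, 0.0, 0.0]])
b = np.array([[0.0, 2.0, 0.0]])
p = ensemble_predict([a, b])
```

Averaging in probability space (rather than logit space) is one common choice; either works for a sketch like this.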
_0ffh · 11 days ago
I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
bee_rider · 11 days ago
> Directions we think are wide open

> Second-order optimizers and natural gradient methods

Do second order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).

sdpmas · 11 days ago
yes! typically the optimizer that trains faster also gets better data efficiency. it may not be absolutely true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
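To make the "trains faster" intuition concrete, here's a toy illustration (my own, not from the linked paper): on an ill-conditioned quadratic loss, a single Newton step lands exactly on the minimum, while plain gradient descent is still crawling along the flat direction after 100 steps:

```python
# Newton step vs. gradient descent on a quadratic loss 0.5 * w^T H w.
# The minimum is at w = 0; H is deliberately ill-conditioned.
import numpy as np

H = np.diag([100.0, 1.0])          # condition number 100
def grad(w):
    return H @ w

w_gd = np.array([1.0, 1.0])
for _ in range(100):               # GD with a step size stable for the stiff direction
    w_gd = w_gd - 0.01 * grad(w_gd)

w_newton = np.array([1.0, 1.0])
w_newton = w_newton - np.linalg.inv(H) @ grad(w_newton)  # one Newton step
```

The catch, as the thread notes, is that "converges in fewer steps" is not automatically the same thing as "generalizes better from the same data".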
vladf · 11 days ago
That still looks like a “converge faster” paper.

https://arxiv.org/abs/2006.10732

The above provides a nuanced theoretical view. GD's inductive bias is probably better unless your model is misspecified.

alyxya · 11 days ago
Fundamentally I don't believe second-order methods get better data efficiency by themselves, but changes to the optimizer can, because the convergence behavior changes. ML theory lags behind the results in practice.
lzaborowski · 11 days ago
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.

jbergqvist · 11 days ago
Very interesting benchmark, excited to see what comes out of this. Considering humans are enormously more sample efficient than today's models, it seems clear there's a lot of room to close that gap. The fact that they hit 5.5x in the first week with relatively straightforward changes suggests we're nowhere near the ceiling for data efficiency.
sdpmas · 11 days ago
absolutely!
easygenes · 10 days ago
This is very much in line with what I found fascinating about optimizing microgpt for speed (0). Or rather, what I was able to do with it after doing so. It's so small and so fast to train, you can really dig deep into the optimization landscape. I've spent all my free time this past week digging into it.

0: https://entrpi.github.io/eemicrogpt/ (The writeup is from a few days ago, and I'm still running experiments before I do a big rewrite. Slowrun is good food for thought.)

londons_explore · 11 days ago
I think there will be good headway in using the part-trained model to generate more training data for itself: making up tasks, completing those tasks with many different approaches, evaluating which solution is best (using the same LLM as judge), and then differentially training on the best solutions vs the worst ones.

The challenge is that such an approach almost certainly requires a model with RLHF post-training, but here it needs to happen during the pre-training phase. With infinite compute, though, this isn't an issue: you simply do the post-training many times.
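A skeleton of that loop might look like the following. All the function bodies here are stand-ins of my own: `generate_candidates` and `judge_score` are placeholders for actual model calls, and the (best, worst) pairs would feed a DPO-style preference update.

```python
# Hypothetical self-improvement loop: generate tasks, attempt them,
# judge the attempts, and keep (best, worst) pairs for preference training.
import random

def generate_candidates(task, k=4, rng=None):
    rng = rng or random.Random(0)
    # stand-in: a real system would sample k completions from the model
    return [f"{task}-attempt-{i}-{rng.random():.2f}" for i in range(k)]

def judge_score(candidate):
    # stand-in: a real system would ask the same LLM to grade the attempt;
    # here the "score" is just the random suffix baked into the string
    return float(candidate.split("-")[-1])

def build_preference_pairs(tasks):
    """For each self-generated task, keep (best, worst) for DPO-style training."""
    pairs = []
    for t, task in enumerate(tasks):
        cands = generate_candidates(task, rng=random.Random(t))
        ranked = sorted(cands, key=judge_score)
        pairs.append((ranked[-1], ranked[0]))  # (chosen, rejected)
    return pairs

pairs = build_preference_pairs(["task-a", "task-b"])
```

The interesting failure mode is the one the comment hints at: the judge and the generator share weights, so judge errors get amplified rather than corrected.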

kseniamorph · 11 days ago
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
timshel1 · 11 days ago
Modded-nanogpt is also much more data efficient than vanilla nanogpt, even if some of the individual optimizations trade away some data efficiency for higher throughput.
sdpmas · 11 days ago
yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. just that the kinds of algorithms it allows are somewhat constrained because it optimizes for wall-clock time.
archermarks · 11 days ago
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.
sdpmas · 11 days ago
yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts only tiny bits of information. but over time, we will switch the validation set to some other random subset of FineWeb or even to entirely OOD datasets!
xpe · 10 days ago
The question is not if but when. I hope the project authors acknowledge the problem directly: it is not merely a risk; it is a statistical certainty given enough time. So, what's the plan?

At the very least, track it. How will the project maintainers instrument this?