Readit News
linolevan · 11 days ago
There was this very interesting paper out of Stanford this last September about pretraining under the unlimited compute but limited data paradigm[0]. Pretty much exactly the same thing but with ~200M training tokens instead.

[0] https://www.alphaxiv.org/abs/2509.14786

sdpmas · 11 days ago
yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.
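For anyone curious what the ensembling part can look like at inference time, here's a toy sketch (my own illustration, not code from the repo) that averages next-token probabilities across independently trained members:

```python
# Toy logit-ensembling sketch: average per-model softmax outputs.
# All names here are illustrative, not from the actual repo.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(per_model_logits):
    """Average next-token probabilities across ensemble members."""
    probs = [softmax(l) for l in per_model_logits]
    return np.mean(probs, axis=0)

# toy example: two "models" that disagree; the ensemble splits the difference
a = np.array([[2.0, 0.0, 0.0]])
b = np.array([[0.0, 2.0, 0.0]])
p = ensemble_predict([a, b])
```

Averaging in probability space (rather than logit space) is one common choice; either works for a sketch like this.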
_0ffh · 11 days ago
I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
bee_rider · 11 days ago
> Directions we think are wide open

> Second-order optimizers and natural gradient methods

Do second order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).

sdpmas · 11 days ago
yes! typically the optimizer that trains faster also gets better data efficiency. it may not be absolutely true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
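To make the "trains faster" intuition concrete, here's a toy illustration (my own, not from the linked paper): on an ill-conditioned quadratic loss, a single Newton step lands exactly on the minimum, while plain gradient descent is still crawling along the flat direction after 100 steps:

```python
# Newton step vs. gradient descent on a quadratic loss 0.5 * w^T H w.
# The minimum is at w = 0; H is deliberately ill-conditioned.
import numpy as np

H = np.diag([100.0, 1.0])          # condition number 100
def grad(w):
    return H @ w

w_gd = np.array([1.0, 1.0])
for _ in range(100):               # GD with a step size stable for the stiff direction
    w_gd = w_gd - 0.01 * grad(w_gd)

w_newton = np.array([1.0, 1.0])
w_newton = w_newton - np.linalg.inv(H) @ grad(w_newton)  # one Newton step
```

The catch, as the thread notes, is that "converges in fewer steps" is not automatically the same thing as "generalizes better from the same data".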
vladf · 11 days ago
That still looks like a “converge faster” paper.

https://arxiv.org/abs/2006.10732

The above provides a nuanced theoretical view. GD's inductive bias is probably better unless your model is misspecified.

alyxya · 11 days ago
Fundamentally I don't believe second-order methods get better data efficiency by themselves, but changes to the optimizer can, because the convergence behavior changes. ML theory lags behind the results in practice.
lzaborowski · 11 days ago
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.

jbergqvist · 11 days ago
Very interesting benchmark, excited to see what comes out of this. Considering humans are enormously more sample efficient than today's models, it seems clear there's a lot of room to close that gap. The fact that they hit 5.5x in the first week with relatively straightforward changes suggests we're nowhere near the ceiling for data efficiency.
sdpmas · 11 days ago
absolutely!
easygenes · 10 days ago
This is very much in line with what I found fascinating about optimizing microgpt for speed (0). Or rather, what I was able to do with it after doing so. It's so small and so fast to train, you can really dig deep into the optimization landscape. I've spent all my free time this past week digging into it.

0: https://entrpi.github.io/eemicrogpt/ (The writeup is from a few days ago, and I'm still running experiments before I do a big rewrite. Slowrun is good food for thought.)

londons_explore · 11 days ago
I think there will be good headway in using the part-trained model to generate more training data for itself: making up tasks, completing those tasks with many different approaches, evaluating which solution is best (using the same LLM as judge), and then differentially training on the best solutions vs the worst ones.

The challenge is that such an approach almost certainly requires a model with RLHF post-training, but here it needs to happen during the pre-training phase. With infinite compute, though, this isn't an issue: you simply do the post-training many times.
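A skeleton of that loop might look like the following. All the function bodies here are stand-ins of my own: `generate_candidates` and `judge_score` are placeholders for actual model calls, and the (best, worst) pairs would feed a DPO-style preference update.

```python
# Hypothetical self-improvement loop: generate tasks, attempt them,
# judge the attempts, and keep (best, worst) pairs for preference training.
import random

def generate_candidates(task, k=4, rng=None):
    rng = rng or random.Random(0)
    # stand-in: a real system would sample k completions from the model
    return [f"{task}-attempt-{i}-{rng.random():.2f}" for i in range(k)]

def judge_score(candidate):
    # stand-in: a real system would ask the same LLM to grade the attempt;
    # here the "score" is just the random suffix baked into the string
    return float(candidate.split("-")[-1])

def build_preference_pairs(tasks):
    """For each self-generated task, keep (best, worst) for DPO-style training."""
    pairs = []
    for t, task in enumerate(tasks):
        cands = generate_candidates(task, rng=random.Random(t))
        ranked = sorted(cands, key=judge_score)
        pairs.append((ranked[-1], ranked[0]))  # (chosen, rejected)
    return pairs

pairs = build_preference_pairs(["task-a", "task-b"])
```

The interesting failure mode is the one the comment hints at: the judge and the generator share weights, so judge errors get amplified rather than corrected.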

kseniamorph · 11 days ago
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
timshel1 · 11 days ago
Modded-nanogpt is also much more data efficient than vanilla nanogpt, even if some of the individual optimizations trade away some data efficiency for higher throughput.
sdpmas · 11 days ago
yes, agreed, modded-nanogpt is already a data-efficient variant of the original nanogpt. just that the kinds of algorithms it allows are somewhat constrained because it optimizes for wall-clock time.
archermarks · 11 days ago
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.
sdpmas · 11 days ago
yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts only tiny bits of information. but over time, we will switch the validation set to some other random subset of FineWeb or even to entirely OOD datasets!
xpe · 10 days ago
The question is not if but when. I hope the project authors acknowledge the problem directly: it is not merely a risk; it is a statistical certainty given enough time. So, what's the plan?

At the very least, track it. How will the project maintainers instrument this?