I suppose the real "function" is a bit more complicated because (1) if you put 2x more data through the same GPU with large enough memory, it will take less than 2x the time to compute (but certainly not 1x), and (2) at some point, empirically, increasing the batch size makes things _worse_ even if you ignore the additional runtime cost (i.e. stop after n gradient update steps, not after x seconds). To my knowledge, the accepted explanation is that a bit of noise helps regularize learning, because overly smooth learning ends up stagnating in local loss minima more easily. In truth, I think nobody exactly understands how deep learning models work :-)
And to your other question - sorry again for the late answer. Yes, `optimizer.zero_grad()` should always be called directly after `optimizer.step()`, so with gradient accumulation it's called once every `n` steps (otherwise you'd be zeroing out the gradients and throwing away all the compute from the previous steps).
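For what it's worth, here's a minimal sketch of what that looks like in a PyTorch loop; `model`, `train_loader` and `accum_steps` are placeholders, not names from the post's code:

```python
import torch

# Minimal gradient-accumulation sketch. `model`, `train_loader`, `optimizer`
# and `accum_steps` are placeholders, not the post's actual code.
accum_steps = 8  # number of micro-batches per optimizer step

optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    logits = model(x)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1)
    )
    # Scale so the accumulated gradient matches the average over the big batch.
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # only here, right after step(), once every accum_steps
```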
As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. So that might be interesting.
I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead) so I won't do it.
But perhaps a shorter train would still be of value? That is, train for 300M tokens at a tenth of the cost and see where the loss lands? The problem with that would be if the impact of batch size varied with the length of the train, e.g. if batch size 64 was better than 512 for short trains but weaker for longer ones.
I assume the zero_grad would need to go in the same if block?
One solution is to reduce the scope of the problem -- you can train on a smaller, less diverse dataset such as TinyStories, a collection of about 1 billion tokens of ChatGPT-generated children's stories. After about 40 hours, less than one weekend, you'll have a model that can generate mostly grammatical children's stories.
If you have a newer Mac and/or an Ultra chip, you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar larger, more diverse dataset.
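If it helps, getting started looks roughly like this; just a sketch assuming the Hugging Face `datasets` package, and the dataset IDs are the public Hub ones rather than anything from the post:

```python
# Rough sketch: pick a device on a Mac and load/stream a dataset.
# Dataset IDs are the public Hugging Face Hub ones; adjust to taste.
import torch
from datasets import load_dataset

device = (
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)

# ~1B tokens of ChatGPT-generated children's stories:
tiny = load_dataset("roneneldan/TinyStories", split="train")

# Larger, more diverse web text -- stream it rather than downloading everything:
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)
```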
I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, and has the benefit that you can ask follow-up questions.
* OpenAI medium weights: 3.231
* OpenAI small weights: 3.500
* My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
* My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
* My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
* My cloud trained model, FineWeb Chinchilla, batch size 13 \* 8 = 104: 3.674
That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly. I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
I have fewer concrete examples, but my understanding is that dataset curation is for sure where many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset. TinyStories, for example, was generated with GPT-4.
One main point is batch size - I'd agree with Gemini here. A batch size <= 5 with a 1024 sequence length is really tiny. Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, that won't fit into memory; one uses gradient accumulation for that purpose, again as mentioned by Gemini.
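To make "effective batch size in tokens" concrete, the arithmetic is just the following (illustrative numbers, not the post's actual configuration):

```python
# Effective tokens per optimizer step =
#   micro-batch size * sequence length * gradient-accumulation steps * number of GPUs.
micro_batch = 13   # sequences per GPU per forward/backward pass
seq_len = 1024     # tokens per sequence
accum_steps = 1    # gradient-accumulation steps (1 = none)
world_size = 8     # GPUs in the DDP group

tokens_per_step = micro_batch * seq_len * accum_steps * world_size
print(tokens_per_step)  # 106496 -- roughly 0.1M tokens; big labs use millions
```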
Training duration is definitely also a reason - models do get better over time, otherwise people wouldn't train for so long, wasting millions :-) Just how long is optimal is unclear, but certainly < 2 days is not optimal even at this "small" scale.
The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal; the rate is typically both increased at the beginning ("warm-up", but that's for stability, so if training works without it, it's not an issue) and scaled down at the end ("cool-down" - that is, annealing, with cosine as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (it might be useful for many epochs, but is likely only harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow & careful to get good results. And even when weight decay is used, a "good" implementation should skip some layers like embeddings and biases (GPT-2 did that, and it's relatively important to do so).
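A hedged sketch of both points in PyTorch - the parameter grouping rule and the schedule shape are the standard recipe, not lifted from the post's code, and the embedding name check depends on your model's naming:

```python
import math
import torch

def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
    # Apply weight decay only to weight matrices; skip biases, norm gains
    # and embeddings (adjust the name check to your model's layer names).
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim < 2 or "emb" in name.lower():
            no_decay.append(p)
        else:
            decay.append(p)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95))

def warmup_cosine(warmup_steps, total_steps):
    # Linear warm-up followed by cosine cool-down, as a LambdaLR multiplier.
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_lambda

# optimizer = configure_optimizer(model)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine(2000, 50_000))
# ...then call scheduler.step() once per optimizer step.
```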
On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 compute and gradients with FP32 master weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits), and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.
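For reference, a minimal sketch of enabling both on an NVIDIA GPU; these are the standard PyTorch knobs, not anything specific to the post's code:

```python
# Sketch: TF32 matmuls plus bf16 autocast (no GradScaler needed for bf16;
# if you use fp16 instead, add a torch.cuda.amp.GradScaler around the loss).
import torch

torch.set_float32_matmul_precision("high")  # allow TF32 for fp32 matmuls

def train_step(model, optimizer, x, y):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )
    loss.backward()      # parameters stay fp32, so gradients accumulate in fp32
    optimizer.step()
    optimizer.zero_grad()
    return loss
```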