I suppose the real "function" is a bit more complicated because (1) if you put 2x more data through the same GPU with large enough memory, it will take less than 2x the time to compute (but certainly not 1x), and (2) at some point, empirically, increasing the batch size makes things _worse_ even if you ignore the additional runtime cost (i.e. stop after n gradient update steps, not after x seconds). To my knowledge, the accepted explanation is that a bit of noise helps regularize learning, because overly smooth learning ends up stagnating in local loss minima more easily. In truth, I think nobody exactly understands how deep learning models work :-)
And to your other question - sorry again for the late answer. Yes, `optimizer.zero_grad()` should always be called directly after `optimizer.step()`, so with gradient accumulation it's called once every `n` steps (otherwise you'd be zeroing out the gradients and throwing away all the compute from the previous steps).
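For what it's worth, here's a minimal sketch of what that looks like in a PyTorch loop; `model`, `train_loader` and `accum_steps` are placeholders, not names from the post's code:

```python
import torch

# Minimal gradient-accumulation sketch. `model`, `train_loader`, `optimizer`
# and `accum_steps` are placeholders, not the post's actual code.
accum_steps = 8  # number of micro-batches per optimizer step

optimizer.zero_grad()
for step, (x, y) in enumerate(train_loader):
    logits = model(x)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), y.view(-1)
    )
    # Scale so the accumulated gradient matches the average over the big batch.
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # only here, right after step(), once every accum_steps
```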
As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. So that might be interesting.
I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead) so I won't do it.
But perhaps a shorter train would still be of value? That is, train for 300M tokens at a tenth of the cost and see where the loss lands? The problem with that would be if the impact of batch size varied with the length of the train, e.g. if batch size 64 was better than 512 for short trains but weaker for longer ones.
I assume the zero_grad would need to go in the same if block?
One solution is to reduce the scope of the problem -- you can train on a smaller, less diverse dataset such as TinyStories, a collection of about 1 billion tokens of ChatGPT-generated children's stories. After about 40 hours, less than one weekend, you'll have a model that can generate mostly grammatical children's stories.
If you have a newer Mac and/or an Ultra chip, you'll have more and faster GPU cores, and might be able to train on FineWeb or a similar larger, more diverse dataset.
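If it helps, getting started looks roughly like this; just a sketch assuming the Hugging Face `datasets` package, and the dataset IDs are the public Hub ones rather than anything from the post:

```python
# Rough sketch: pick a device on a Mac and load/stream a dataset.
# Dataset IDs are the public Hugging Face Hub ones; adjust to taste.
import torch
from datasets import load_dataset

device = (
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)

# ~1B tokens of ChatGPT-generated children's stories:
tiny = load_dataset("roneneldan/TinyStories", split="train")

# Larger, more diverse web text -- stream it rather than downloading everything:
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)
```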
I try to explain the Chinchilla paper in the post, but your favourite AI should be able to explain it well, and has the benefit that you can ask follow-up questions.
* OpenAI medium weights: 3.231
* OpenAI small weights: 3.500
* My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
* My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
* My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
* My cloud trained model, FineWeb Chinchilla, batch size 13 \* 8 = 104: 3.674
That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly. I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
I have fewer concrete examples, but my understanding is that dataset curation is for sure where many improvements are gained at any model size. Unless you are building a frontier model, you can use a better model to help curate or generate that dataset. TinyStories, for example, was generated with GPT-4.
One main point is batch size - I'd agree with Gemini here. A batch size <= 5 with a 1024 sequence length is really tiny. Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, that won't fit into memory; one uses gradient accumulation for that purpose, again as mentioned by Gemini.
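To make "effective batch size in tokens" concrete, the arithmetic is just the following (illustrative numbers, not the post's actual configuration):

```python
# Effective tokens per optimizer step =
#   micro-batch size * sequence length * gradient-accumulation steps * number of GPUs.
micro_batch = 13   # sequences per GPU per forward/backward pass
seq_len = 1024     # tokens per sequence
accum_steps = 1    # gradient-accumulation steps (1 = none)
world_size = 8     # GPUs in the DDP group

tokens_per_step = micro_batch * seq_len * accum_steps * world_size
print(tokens_per_step)  # 106496 -- roughly 0.1M tokens; big labs use millions
```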
Training duration is definitely also a reason - models do get better over time, otherwise people wouldn't train for so long, wasting millions :-) Just how long is optimal is unclear, but certainly < 2 days is not optimal even at this "small" scale.
The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal; the rate is typically both increased at the beginning ("warm-up", but that's for stability, so if training works without it, it's not an issue) and scaled down at the end ("cool-down" - that is, annealing, with cosine as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (it might be useful for many epochs, but is likely only harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow & careful to get good results. And even when weight decay is used, a "good" implementation should skip some layers like embeddings and biases (GPT-2 did that, and it's relatively important to do so).
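A hedged sketch of both points in PyTorch - the parameter grouping rule and the schedule shape are the standard recipe, not lifted from the post's code, and the embedding name check depends on your model's naming:

```python
import math
import torch

def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
    # Apply weight decay only to weight matrices; skip biases, norm gains
    # and embeddings (adjust the name check to your model's layer names).
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim < 2 or "emb" in name.lower():
            no_decay.append(p)
        else:
            decay.append(p)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95))

def warmup_cosine(warmup_steps, total_steps):
    # Linear warm-up followed by cosine cool-down, as a LambdaLR multiplier.
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_lambda

# optimizer = configure_optimizer(model)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine(2000, 50_000))
# ...then call scheduler.step() once per optimizer step.
```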
On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 compute and gradients with FP32 master weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits), and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.
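For reference, a minimal sketch of enabling both on an NVIDIA GPU; these are the standard PyTorch knobs, not anything specific to the post's code:

```python
# Sketch: TF32 matmuls plus bf16 autocast (no GradScaler needed for bf16;
# if you use fp16 instead, add a torch.cuda.amp.GradScaler around the loss).
import torch

torch.set_float32_matmul_precision("high")  # allow TF32 for fp32 matmuls

def train_step(model, optimizer, x, y):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        )
    loss.backward()      # parameters stay fp32, so gradients accumulate in fp32
    optimizer.step()
    optimizer.zero_grad()
    return loss
```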