Optimized small-model training is not only important for availability but also for the scientific study of LLMs. It's like the use of simple organisms such as yeast in biological studies: if we hope to ever understand LLMs and have more control over their behavior, we also need to study the simplest possible transformers that exhibit the behaviors of interest from the larger models.
Totally agree. One of the most interesting podcasts I have listened to in a while was from a couple of years ago, on the TinyStories paper and dataset (the author used that dataset), which focuses on stories that contain only simple words and concepts (like bedtime stories for a 3-year-old) but can still be used to train smaller models to produce coherent English, with grammar, diversity, and reasoning.
The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.
As an aside: I don't know what the dataset is in the biological analogy; maybe the agar plate, a super simple and controlled environment in which to study simple organisms.
I like the agar plate analogy. Of course, the yeast is the star of the show, but so much work goes into prepping the plate.
As someone in biotech, 90% of the complaints I hear over lunch are not about bad results but about mistakes made during the experiment, e.g. someone didn't cover their mouth while pipetting and now the plates are unusable.
Ha! I remember where I was when I listened to that episode (Lakeshore Drive almost into Chicago for some event or other) — thanks for triggering that memory — super interesting stuff
(There are also lots of private company datasets, e.g. user purchase history, that can be used with small models to solve real business problems. All the advances in 'large' language models can be leveraged and applied to small problems if the input sequences can be represented as a special custom language.)
Doing hyperparameter sweeps on lots of small models to find the optimal values for each size and fitting scaling laws to predict the hyperparameters to use for larger models seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
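For illustration, here's a minimal sketch of that recipe (not taken from the linked paper; the sweep numbers below are made up): fit a power law to the best learning rate found at each small size, then extrapolate to a larger model.

    import numpy as np

    # Hypothetical sweep results: model size (params) -> best learning rate found
    sizes    = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
    best_lrs = np.array([6e-3, 4e-3, 2.5e-3, 1.6e-3, 1e-3])

    # Fit lr_opt(N) ~= a * N**b by least squares in log-log space
    b, log_a = np.polyfit(np.log(sizes), np.log(best_lrs), 1)

    def predict_lr(n_params):
        """Extrapolate the fitted power law to a larger model size."""
        return np.exp(log_a) * n_params ** b

    print(f"Predicted LR for a 1B-param model: {predict_lr(1e9):.2e}")

The same kind of fit can be tried for other hyperparameters (batch size, weight decay), as long as the small-model optima actually follow a smooth trend.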
What the author is doing here is pre-training, which is something usually only model makers like Google and Meta need to do. Most businesses are much better off doing fine-tuning or, to a lesser extent, continued pre-training. The author is doing this for academic reasons.
As a researcher, I can totally agree, but at the same time this isn't super straightforward. Things get weird because you can't just translate from one GPU to another; there isn't a clean calculation for that. There are also other issues, like parallelism: sure, your model is stable with a batch size of 8192, but that's across 1 node; it might not be stable with that batch across 2 nodes. This is a really frustrating part, and honestly I don't think most people are even aware such issues exist.
Right now I'm just happy when people include parameter counts, GMACs (or FLOPs), and throughput. I always include those and the GPUs I used. I also frequently include more information in the appendix, but frankly, when I include it in the front matter the paper is more likely to be rejected.
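For concreteness, here is a toy sketch of that kind of reporting for a PyTorch model (my own helper; the FLOP number is a rough dense-transformer rule of thumb rather than a profiler measurement, and serious GPU timing would need CUDA synchronization):

    import time
    import torch
    import torch.nn as nn

    def report(model, batch, n_warmup=5, n_iters=20):
        """Print parameter count, a rough forward-FLOP estimate, and throughput."""
        n_params = sum(p.numel() for p in model.parameters())
        n_tokens = batch.shape[0] * batch.shape[1]
        # Rule of thumb for dense transformers: ~2 * params FLOPs per token
        # in the forward pass (ignores attention-score FLOPs).
        flops_per_step = 2 * n_params * n_tokens
        with torch.no_grad():
            for _ in range(n_warmup):
                model(batch)
            start = time.time()
            for _ in range(n_iters):
                model(batch)
            elapsed = time.time() - start
        print(f"params: {n_params / 1e6:.1f}M")
        print(f"approx forward GFLOPs/step: {flops_per_step / 1e9:.1f}")
        print(f"throughput: {n_iters * n_tokens / elapsed:.0f} tokens/s")

    # Toy usage
    layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    report(nn.TransformerEncoder(layer, num_layers=4), torch.randn(8, 128, 256))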
I can tell you why this isn't happening, though. There's a common belief that scale is all you need, which turns into "fuck the GPU poor". I've published works where my model is 100x smaller (with higher throughput and far lower training costs), and the responses from reviewers tend to be along the lines of "why isn't it better?" or "why not just distill or prune a large model?" There's this weird behavior that keeps the black box a black box. I mean, Yi Tay famously said "Fuck theorists" on Twitter.
Exactly. YOLO runs of frontier models with a single random seed/data shuffle are pretty limited for trying to study the “molecular biology”. I actually like to think of LLM understanding as being like biology in the 1850s. There's lots of inspiration to be found in how biology has advanced since then and the types of experiments we might run to better understand LLMs.
Instead of time, it should be energy: what is the best model you can train with a given budget in joules? Then the MBP and the H100 are on a more even footing.
Also, my laptop running Linux and its outputs are probably mine and private. If I use cloud GPUs, I need to be a lawyer to be sure what they can or can't do with my data or models.
There are also no overages or hidden charges with a laptop, beyond simply breaking it. You know the replacement cost ahead of time, though.
H100s are almost instantly available to anyone with a credit card and access to the internet, without even having to lift their butt from the seat. And you get plenty more than five minutes of compute for the price of an M4.
The Mac is more competitive on power consumption, though, since it's never pulling as much as an Nvidia GPU, is my understanding.
On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model you can train in under an hour?
It depends. If you're bottlenecked by memory speed, the Mac typically comes out on top.
In terms of compute efficiency, though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple were putting up a real fight.
> Instead of time it should be energy (...) Then the MBP and H100 are on a more even footing.
What exactly is your point? That instead of expressing workloads in terms of what a laptop could do, you prefer to express them in terms of what a MacBook Pro could do?
The point is that "best model you can train in 5 minutes" is hardware dependent, the answer will be different depending on the hardware available. So it's necessarily a single-player game.
"Best model you can train with X joules" is a fairer contest that multiple people could take part in even if they have different hardware available. It's not completely fair, but it's fair enough to be interesting.
Training models with an energy limit is an interesting constraint that might lead to advances. Currently, LLMs implement online learning by having increasingly large contexts that we then jam "memories" into, so there is a strict demarcation between information learned during pre-training and during use. New, more efficient approaches to training could perhaps inform new approaches to memory that are less heterogeneous.
Vernor Vinge has a story line where humans build their own portable chess computers and utilize them as assistants in human chess matches.
I still think this would be kinda cool. I could see a tournament providing the power source in addition to the chess clock. Then gamesmanship where you play moves you hope are expensive for the opponent but not for your own AI.
Honestly, AI is a trick to make us buy new expensive computers. I'm writing this from one that is over 10 years old, and the computers offered in a leaflet from the nearby electronics store aren't much better.
Anyone who remembers the 90s and 2000s, when your computer hardware was out of date within months, might disagree. If you want to do bleeding-edge things like running 70B+ LLMs locally or doing training, you need bleeding-edge hardware, no different than if you want to play the newest AAA games. There are plenty of games you can play with old hardware, and plenty of small LLMs. When you can use ChatGPT or a bunch of other services, it isn't a trick that some people want to host their own or do training, but you need a system that can do that.
I mean, gaming is the big pusher of new hardware these days, and the web is basically the reason you can use a 90s computer in the modern day. I happily survived on roughly 10-year-old components all the way through university because I wasn't playing AAA games.
> Paris, France is a city in North Carolina. It is the capital of North Carolina, which is officially major people in Bhugh and Pennhy. The American Council Mastlandan, is the city of Retrea. There are different islands, and the city of Hawkeler: Law is the most famous city in The Confederate. The country is Guate.
I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?
I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]: at minimum Muon, a better init, and careful learning-rate tuning.
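For anyone curious, here's a rough sketch of the Muon-style update as I understand it from that repo (coefficients and details from memory, so treat it as illustrative rather than a faithful copy):

    import torch

    def newton_schulz_orthogonalize(G, steps=5):
        # Approximately orthogonalize G with a quintic Newton-Schulz iteration,
        # so the resulting update has roughly uniform singular values.
        a, b, c = 3.4445, -4.7750, 2.0315  # coefficients I recall from modded-nanogpt
        X = G / (G.norm() + 1e-7)
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * A @ A) @ X
        return X.T if transposed else X

    @torch.no_grad()
    def muon_step(params, momenta, lr=0.02, beta=0.95):
        # Muon-ish step for 2D weight matrices: keep a momentum buffer of the
        # gradients, orthogonalize it, and apply that as the update.
        for p, m in zip(params, momenta):
            m.mul_(beta).add_(p.grad)
            p.add_(newton_schulz_orthogonalize(m), alpha=-lr)

    # Toy usage: two random "weight matrices" with fake gradients
    params = [torch.randn(64, 64, requires_grad=True) for _ in range(2)]
    for p in params:
        p.grad = torch.randn_like(p)
    momenta = [torch.zeros_like(p) for p in params]
    muon_step(params, momenta)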
AI is a broad term; the zero-to-hero series by Karpathy trains one in a Jupyter notebook. You can make some pretty powerful networks to de-duplicate database rows right on your laptop too. Data de-duplication and general MDM is pretty useful in large businesses.
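As a simpler, non-neural illustration of the same idea that runs fine on a laptop, character n-gram TF-IDF plus cosine similarity already catches a lot of fuzzy duplicates (the rows and threshold below are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical customer rows that a strict equality check would miss
    rows = [
        "Jon Smith, 42 Oak Street, Springfield",
        "John Smith, 42 Oak St., Springfield",
        "Maria Garcia, 7 Elm Road, Portland",
    ]

    # Character n-grams are robust to small spelling/abbreviation differences
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    sim = cosine_similarity(vec.fit_transform(rows))

    # Flag candidate duplicates above an (arbitrary) similarity threshold
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if sim[i, j] > 0.8:
                print(f"possible duplicate: {rows[i]!r} <-> {rows[j]!r}")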
Feels like there should be value in building smaller, more specialized models - maybe even doing so on-demand. I don’t always want a model that knows Polish and astrophysics and Shakespeare, I want one that runs really fast and is laser-focused on the domain that I’m working on.
I want to be able to say to a large general purpose LLM: “write a script that trains a model that is optimized for <useful task>” and then run that model.
Edit: well gosh darn. Within the edit window for this comment, Google goes and launches Gemma 3 270M.
The most powerful MacBook Pro currently has 16 CPU cores, 40 GPU cores, and 128 GB of RAM (and a 16-core “neural engine” specifically designed to accelerate machine learning). Technically, it is a laptop, but it could just as well be a computer optimized for AI.
Apple M3/M4 silicon is certainly good in some ways, but the bottleneck is often a lack of CUDA software support, and price (you could buy >4 times the raw GPU performance with a dual RTX 5090 desktop). =3
For ref:
- Podcast ep: https://www.cognitiverevolution.ai/the-tiny-model-revolution...
- TinyStories paper: https://arxiv.org/abs/2305.07759
That said, it does make it easier to claim progress...
Makes me want to try training a model to sing "Daisy, Daisy..."
An H100 is not an everyday product. A laptop is.
Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.
tl;dr: more dimensionally correct
We can / should benchmark and optimize this to death on all axes
On a laptop, on a desktop, on a phone?
Train for 5 minutes, an hour, a day, a week?
On a boat? With a goat?
I think you meant Llama.
The rhymes are admittedly more limited, unless you have a Boston accent.
Dr Seuss ftw
That boat will float your goat!
And for the younger folk, this mp3 player was the precursor to Spotify:
https://youtu.be/HaF-nRS_CWM?si=d7WHzkV7CFHJ2hGg
[0]: https://github.com/KellerJordan/modded-nanogpt
Apple M3 Ultra (GPU - 80 cores) scores 7235.31
NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
Note that the memory constraints of NVIDIA are not like Apple silicon, which also tends to be less I/O constrained. YMMV.
https://www.youtube.com/watch?v=d8yS-2OyJhw
https://www.youtube.com/watch?v=Ju0ndy2kwlw