jebarker · 4 months ago
Optimized small model training is not only important for availability but also for the scientific study of LLMs. It’s like the use of simple organisms like yeast for biological studies - we also need to study the simplest possible transformers that exhibit behaviors of interest from the larger models if we hope to ever understand LLMs and have more control over their behavior.
azath92 · 4 months ago
Totally agree. One of the most interesting podcasts I have listened to in a while was a couple of years ago on the TinyStories paper and dataset (the author used that dataset), which focuses on stories that only contain simple words and concepts (like bedtime stories for a 3 year old), but which can be used to train smaller models to produce coherent English, with grammar, diversity, and some reasoning.

The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.

As an aside: I don't know what the dataset is in the biological analogy, maybe the agar plate: a super simple and controlled environment in which to study simple organisms.

For ref:
- Podcast ep: https://www.cognitiverevolution.ai/the-tiny-model-revolution...
- TinyStories paper: https://arxiv.org/abs/2305.07759

momojo · 4 months ago
I like the agar plate analogy. Of course, the yeast is the star of the show, but so much work goes into prepping the plate.

As someone in biotech, 90% of the complaints I hear over lunch are not about bad results, but about mistakes made during the experiment. E.g. someone didn't cover their mouth while pipetting and now the plate is unusable.

re5i5tor · 4 months ago
Ha! I remember where I was when I listened to that episode (Lakeshore Drive almost into Chicago for some event or other) — thanks for triggering that memory — super interesting stuff
willvarfar · 4 months ago
(There are also lots of private company datasets, e.g. user purchase history, that can be used with small models to solve real business problems. All the advances in 'large' language models can be leveraged and applied to small problems if the input sequences can be represented as a special custom language.)
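
As a concrete sketch of what that "custom language" can look like (everything below is invented for illustration; the users, SKUs, and special tokens are placeholders), purchase histories can simply be turned into token sequences that a small GPT-style model could be trained on:

  # Sketch: turn purchase histories into token sequences for a small
  # decoder-only model. Field names and SKUs are hypothetical.
  from collections import defaultdict

  purchases = [
      ("user_1", "sku_042"), ("user_1", "sku_007"), ("user_2", "sku_042"),
      ("user_2", "sku_113"), ("user_1", "sku_113"),
  ]

  # Build a "vocabulary" where each SKU is a token, plus special tokens.
  skus = sorted({sku for _, sku in purchases})
  vocab = {tok: i for i, tok in enumerate(["<pad>", "<bos>", "<eos>"] + skus)}

  # One "sentence" per user: their purchase history in order.
  histories = defaultdict(list)
  for user, sku in purchases:
      histories[user].append(sku)

  sequences = [
      [vocab["<bos>"]] + [vocab[s] for s in hist] + [vocab["<eos>"]]
      for hist in histories.values()
  ]
  print(sequences)  # ready for next-token-prediction training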
tmule · 4 months ago
Unfortunately, as things stand, it's well known that behaviors and optimizations in small-scale models fail to replicate in larger models.
yorwba · 4 months ago
Doing hyperparameter sweeps on lots of small models to find the optimal values for each size and fitting scaling laws to predict the hyperparameters to use for larger models seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
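
As a rough illustration of the fitting step (the sweep results below are invented, and the actual papers use more careful procedures), one can fit a power law to the best learning rate found at each size and extrapolate to a larger model:

  # Sketch: fit lr_opt ≈ a * N^b to optimal learning rates found by
  # sweeps at small sizes, then extrapolate. Data points are made up.
  import numpy as np

  params = np.array([1e7, 3e7, 1e8, 3e8])           # model sizes swept
  best_lr = np.array([6e-3, 4e-3, 2.5e-3, 1.6e-3])  # best LR per size

  # Fit log(lr) = log(a) + b * log(N) with least squares.
  b, log_a = np.polyfit(np.log(params), np.log(best_lr), 1)
  a = np.exp(log_a)

  target_n = 7e9  # a hypothetical larger model
  predicted_lr = a * target_n ** b
  print(f"exponent b = {b:.3f}, predicted LR at 7B params ≈ {predicted_lr:.2e}")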
victorbjorklund · 4 months ago
Which in itself is very interesting and requires study.
jebarker · 4 months ago
Well-known but not well-understood
jph00 · 4 months ago
That's not generally true. E.g. the GPT-4 tech report pointed out that nearly all their experiments were done on models 1000x smaller than the final model.
indoordin0saur · 4 months ago
But why? If we don't know why then how do we figure it out?
leopoldj · 4 months ago
What the author is doing here is pre-training. This is something that model makers like Google and Meta usually need to do. Most businesses are much better off doing fine-tuning or, to a lesser extent, continued pre-training. The author is doing this for academic reasons.
smeeth · 4 months ago
I've been annoyed for a while that people don't use a common parameter-count/compute budget for benchmarking papers.

That said, it does make it easier to claim progress...

pizza · 4 months ago
https://github.com/KellerJordan/modded-nanogpt is pretty great in that respect
godelski · 4 months ago
As a researcher, I can totally agree, but at the same time this isn't super straightforward. Things get weird because you can't just translate from one GPU to another; there isn't a clean calculation for that. There are also other issues like parallelism. Sure, your model is stable with a batch size of 8192 across 1 node, but it might not be stable with that batch across 2 nodes. This is a really frustrating part, and honestly I don't think most people are even aware such issues exist.

Right now I'm just happy when people include parameter counts, GMACs (or FLOPs), and throughput. I always include those and the GPUs I used (a quick sketch of what that reporting looks like follows below). I also frequently include more information in the appendix, but frankly, when I include it in the front matter the paper is more likely to be rejected.

I can tell you why this isn't happening, though. There's a common belief that scale is all you need, which turns into "fuck the GPU poor". I've published works where my model is 100x smaller (with higher throughput and far lower training costs), and the responses from reviewers tend to be along the lines of "why isn't it better?" or "why not just distill or prune a large model?" There's this weird behavior that makes the black box stay a black box. I mean, Yi Tay famously said "Fuck theorists" on Twitter.
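
For anyone wondering what that minimal reporting can look like, here's a small sketch for an arbitrary PyTorch model (the model and batch shape are placeholders; GMACs would come from a profiler such as ptflops or fvcore and are omitted here):

  # Sketch: report parameter count and throughput for a placeholder model.
  import time
  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
  x = torch.randn(64, 512)  # placeholder batch

  n_params = sum(p.numel() for p in model.parameters())
  print(f"params: {n_params / 1e6:.2f} M")

  # Throughput: samples/second over a few timed forward passes.
  model.eval()
  with torch.no_grad():
      for _ in range(3):  # warmup
          model(x)
      start = time.perf_counter()
      iters = 20
      for _ in range(iters):
          model(x)
      elapsed = time.perf_counter() - start
  print(f"throughput: {iters * x.shape[0] / elapsed:.0f} samples/s")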

ai-christianson · 4 months ago
I'm interested in one that can run fast on a laptop, but training can take a few days (maybe even longer) on the same laptop.
biophysboy · 4 months ago
It’s a fun analogy because the data “environment” of the model being trained matters a great deal
jebarker · 4 months ago
Exactly. YOLO runs of frontier models with a single random seed/data shuffle are pretty limited for trying to study the “molecular biology”. I actually like to think of LLM understanding as being like biology in the 1850s. There's lots of inspiration to be found in how biology has advanced since then and the types of experiments we might run to better understand LLMs.
moojacob · 4 months ago
Enough with big data! Who's working on small data? https://www.youtube.com/watch?v=eDr6_cMtfdA&pp=ygUKc21hbGwgZ...
arethuza · 4 months ago
Thanks - that's one of the most interesting comments I've seen about LLMs.

Makes me want to try training a model to sing "Daisy, Daisy..."

zarzavat · 4 months ago
Instead of time it should be energy. What is the best model you can train with a given budget in joules? Then the MBP and the H100 are on a more even footing.
NooneAtAll3 · 4 months ago
it's not about efficiency - it's about availability

An H100 is not an everyday product. A laptop is.

nickpsecurity · 4 months ago
Also, my laptop running Linux, and its outputs, are probably mine and private. If I use cloud GPUs, I need to be a lawyer to be sure what they can or can't do with my data or models.

There are also no overages or hidden charges with a laptop, past simply breaking it. You know the replacement cost ahead of time, though.

Sharlin · 4 months ago
H100s are almost-instantly available to anyone with a credit card and access to the internet. Without even having to lift their butt from the seat. And you get plenty more than five minutes of compute for the price of an M4.
KeplerBoy · 4 months ago
Still, I don't think the M4 is going to be far off from the H100 in terms of energy efficiency.

edit: fixed typo

Der_Einzige · 4 months ago
At this point, given how many H100s there are in existence, it’s basically an everyday product.
giancarlostoro · 4 months ago
The Mac is more competitive on power consumption, though, since it's never pulling as much as an Nvidia GPU, as I understand it.

On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model outcome you can train in under an hour?

dtnewman · 4 months ago
> you can rent an H100 for an hour for under $10

Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.

bigyabai · 4 months ago
It depends. If you're bottlenecked by memory speed, the Mac typically comes out on top.

In terms of compute efficiency though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple was putting up a real fight.

netcan · 4 months ago
They're all good. Being somewhat arbitrary isn't a bad thing.
motorest · 4 months ago
> Instead of time it should be energy (...) Then the MBP and H100 are on a more even footing.

What exactly is your point? That instead of expressing workloads in terms of what a laptop could do, you prefer to express them in terms of what a MacBook Pro could do?

zarzavat · 4 months ago
The point is that "best model you can train in 5 minutes" is hardware dependent, the answer will be different depending on the hardware available. So it's necessarily a single-player game.

"Best model you can train with X joules" is a fairer contest that multiple people could take part in even if they have different hardware available. It's not completely fair, but it's fair enough to be interesting.

Training models with an energy limit is an interesting constraint that might lead to advances. Currently LLMs implement online learning by having increasingly large contexts that we then jam "memories" into, so there is a strict demarcation between information learned during pre-training and during use. New, more efficient approaches to training could perhaps inform new approaches to memory that are less heterogeneous.

tl;dr: more dimensionally correct
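
On NVIDIA hardware, the joules side of that contest is easy to measure with NVML's cumulative energy counter; below is a minimal sketch assuming pynvml is installed, train_one_model is a placeholder, and the counter is supported (roughly Volta and newer). Apple silicon would need a different tool, e.g. powermetrics.

  # Sketch: measure joules consumed by GPU 0 during a training run.
  # nvmlDeviceGetTotalEnergyConsumption returns millijoules accumulated
  # since the driver was loaded.
  import pynvml

  def train_one_model():
      pass  # placeholder for the actual training loop

  pynvml.nvmlInit()
  handle = pynvml.nvmlDeviceGetHandleByIndex(0)

  start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
  train_one_model()
  end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

  print(f"GPU energy used: {(end_mj - start_mj) / 1000:.1f} J")
  pynvml.nvmlShutdown()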

Deleted Comment

jvanderbot · 4 months ago
Bro, why not both?

We can / should benchmark and optimize this to death on all axes

aniijbod · 4 months ago
Let the AI efficiency olympics begin!

On a laptop, on a desktop, on a phone?

Train for 5 minutes, an hour, a day, a week?

On a boat? With a goat?

yojo · 4 months ago
> With a goat?

I think you meant Llama.

The rhymes are admittedly more limited, unless you have a Boston accent.

cameronoliver · 4 months ago
"On a boat? With a goat?" is a quote from Green Eggs and Ham, an early-reader children's book by Dr Seuss, published in 1960.
jdjdndndn · 4 months ago
I do not like green eggs and ham. I do not like them, Sam-I-Am.

Dr Seuss ftw

hinkley · 4 months ago
Vernor Vinge has a story line where humans build their own portable chess computers and utilize them as assistants in human chess matches.

I still think this would be kinda cool. I could see a tournament providing the power source in addition to the chess clock. Then there'd be gamesmanship where you play moves you hope are expensive for the opponent's machine but not for your own AI.

Nevermark · 4 months ago
On a maxxxed out Mac Studio M3 Ultra 512GB.

That boat will float your goat!

evanmoran · 4 months ago
This literally is fast enough to really whip the llama's... well, you know.

And for the younger folk, this mp3 player was the precursor to Spotify:

https://youtu.be/HaF-nRS_CWM?si=d7WHzkV7CFHJ2hGg

lifestyleguru · 4 months ago
Honestly, AI is a trick to make us buy new expensive computers. I'm writing this from one that's over 10 years old, and the computers offered in a leaflet from the nearby electronics store aren't much better.
542354234235 · 4 months ago
Anyone who remembers the 90s and 2000s, when your computer hardware was out of date within months, might disagree. If you want to do bleeding-edge things like running 70B+ LLMs locally or doing training, you need bleeding-edge hardware. No different than if you want to play the newest AAA games. There are plenty of games you can play with old hardware, and plenty of small LLMs. When you can use ChatGPT or a bunch of other services, it isn't a trick that some people still want to host their own or do training; you just need a system that can do that.
aniijbod · 4 months ago
Oh no! I thought that was Windows 11
voidUpdate · 4 months ago
I mean, gaming is the big pusher of new hardware these days, and web is basically the reason you can use a 90s computer in the modern day. I happily survived on roughly 10 year old components all the way through university because I wasn't playing AAA games
visarga · 4 months ago
goats have too many parameters, they are like GPT-4
hinkley · 4 months ago
GO4-T

Dead Comment

rPlayer6554 · 4 months ago
I’d pay for GoatLM
LorenDB · 4 months ago
> Paris, France is a city in North Carolina. It is the capital of North Carolina, which is officially major people in Bhugh and Pennhy. The American Council Mastlandan, is the city of Retrea. There are different islands, and the city of Hawkeler: Law is the most famous city in The Confederate. The country is Guate.

I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?

Dead Comment

tootyskooty · 4 months ago
I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0]: at minimum Muon, a better init, and carefully tuning the learning rate.

[0]: https://github.com/KellerJordan/modded-nanogpt

chasd00 · 4 months ago
AI is a broad term; the zero-to-hero series by Karpathy trains one in a Jupyter notebook. You can make some pretty powerful networks to de-duplicate database rows right on your laptop too. Data de-duplication and general MDM are pretty useful in large businesses.
jl6 · 4 months ago
Feels like there should be value in building smaller, more specialized models - maybe even doing so on-demand. I don’t always want a model that knows Polish and astrophysics and Shakespeare, I want one that runs really fast and is laser-focused on the domain that I’m working on.

I want to be able to say to a large general purpose LLM: “write a script that trains a model that is optimized for <useful task>” and then run that model.

Edit: well gosh darn. Within the edit window for this comment, Google goes and launches Gemma 3 270M.

erkiserk · 4 months ago
One of the trends of machine learning, though, is that generalists outperform specialists on those specialists' tasks!
jl6 · 4 months ago
But I’d happily accept some of that bitter lesson if the “worse specialist” ran way faster (or at all, given memory limits).

Dead Comment

l5870uoo9y · 4 months ago
The most powerful MacBook Pro currently has 16 CPU cores, 40 GPU cores, and 128 GB of RAM (and a 16-core "neural engine" specifically designed to accelerate machine learning). Technically it is a laptop, but it could just as well be a computer optimized for AI.
alberth · 4 months ago
The Mac Studio has:

  32-core CPU
  80-core GPU
  512 GB RAM
https://www.apple.com/shop/buy-mac/mac-studio/apple-m3-ultra...

lukan · 4 months ago
That's a well made page, describing nice hardware, but doesn't seem to be a laptop.
Joel_Mckay · 4 months ago
From https://opendata.blender.org/ :

Apple M3 Ultra (GPU - 80 cores) scores 7235.31

NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31

Note that NVIDIA's memory constraints are not like Apple silicon's, which also tends to be less I/O constrained. YMMV

https://www.youtube.com/watch?v=d8yS-2OyJhw

https://www.youtube.com/watch?v=Ju0ndy2kwlw

Apple M3/M4 silicon is certainly good in some ways, but the bottlenecks are often the lack of CUDA software support and the price (you could buy >4 times the raw GPU performance with a dual RTX 5090 desktop). =3