For me, it's one of the last true mysteries! We've figured out damned near everything else; nothing else has this level of "unknown" to it.
It's simply mind-blowing to me how such a tiny block of data can encode such high-level behaviours so indirectly!
Genes code for proteins, not synapse weights!
Those proteins influence cell division, specialisation, and growth through a complex interplay of thousands of distinct signal chemicals.
Then those cells assemble into a brain, apparently "randomly", with only crude, coarse patterns that are at best statistical in nature. Some cells are longer, some shorter, some with more interconnections, some with fewer, but no two perfectly alike.
Then, then, somehow... waves hands... magically this encodes that "wide hips are sexually attractive" in a way that turns up fully a decade later, well into the "pre-training" phase!!!
What... the... %#%@!
How does that work!? How does any of that work?
Y'all work in AI, ML, or adjacent to it. You know how hard it is to train a model to learn to detect anything even with thousands of examples!
PS: Human DNA contains only about 750 MB (roughly 6 billion bits) of information, of which maybe 0.1% to 1% directly codes for brain structure and the like. Let's be generous and say 10%. That is just ~75 MB that somehow makes us scared of snakes and spiders, afraid of heights, attracted to the opposite sex, capable of speech, fond of dancing, able to tell on instinct what is a "bad" or "good" smell, etc, etc...
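The arithmetic is easy to sanity-check. A quick Python back-of-the-envelope, assuming the standard ~3.1 billion base pairs at 2 bits each (both round figures, not exact):

    # Back-of-the-envelope: information content of the human genome.
    BASE_PAIRS = 3.1e9   # approximate haploid human genome size
    BITS_PER_BASE = 2    # A/C/G/T -> 2 bits each

    total_bits = BASE_PAIRS * BITS_PER_BASE  # ~6.2e9 bits
    total_mb = total_bits / 8 / 1e6          # ~775 MB

    for fraction in (0.001, 0.01, 0.10):     # 0.1%, 1%, a generous 10%
        print(f"{fraction:.1%} of genome -> {total_mb * fraction:,.1f} MB")
    # 0.1% -> 0.8 MB, 1.0% -> 7.8 MB, 10.0% -> 77.5 MB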
From that angle our artificial models seem very sample-efficient, but it's hard to quantify without knowing what was "tried" by the universe to reach the current state. It's also weird to think about because there is no intent in nature's optimization; it just happens because it can, and there is enough energy and parallel randomness for it to happen eventually.

And the real mystery is not how evolution achieved this, but that the laws of chemistry/the universe allow self-replicating structures to appear at all. In a universe with different rules it couldn't happen even with infinite trial-and-error compute.
"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."
And I just looked at the nanochat repo; that's also how it's used there.
https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
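For anyone curious what that split looks like in code, here's a minimal PyTorch-style sketch. The `Muon` import, its constructor arguments, and the name-based grouping heuristic are assumptions for illustration, not nanochat's actual code:

    import torch
    from muon import Muon  # assumed package/name; check your Muon implementation

    def build_optimizers(model: torch.nn.Module):
        hidden, other = [], []
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue
            # Heuristic: Muon for 2-D hidden weight matrices; AdamW for
            # embeddings, the LM head, and 1-D gains/biases.
            if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
                hidden.append(p)
            else:
                other.append(p)
        return [
            Muon(hidden, lr=0.02, momentum=0.95),  # hyperparameters illustrative
            torch.optim.AdamW(other, lr=3e-4, weight_decay=0.0),
        ]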
Just wanted to say the blog post design looks super nice. Beautifully laid out, very readable typography, clear graphics, approachable design with a welcoming UX, footnotes in the margin, etc.
Anybody know how this is designed / styled? (I can see three.js being used, along with katex.js, but I don't know more details.)
Thanks
You load the thing up with relevant context and pray that it guides the generation path to the part of the model that represents the information you want, then pray that the path of tokens through the model outputs what you want.

That's why they have a tendency to go ahead and do things you tell them not to do.

Also, IDK about you, but I hate how much praying has become part of the state of the art here. I didn't get into this career to be a fucking tech priest for the machine god. I will never like these models until they are predictable, which means I will never like them.
"that's because a next token predictor can't "forget" context. That's just not how it works."
An LSTM is also a next-token predictor, and it literally has a forget gate. There are many other context-compressing models too that remember only what they think is important and forget the rest, for example state-space models or RWKV, which also work well as LLMs. Even the basic GPT model forgets old context, since the context gets truncated when it no longer fits, but that's not the learned, smart forgetting the other models do.
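To make the forget gate concrete: it's just a learned sigmoid mask in (0, 1) that scales the old cell state element-wise. A tiny numpy sketch of that one step (weights are random placeholders here; a real model learns them):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    H, X = 8, 4                         # hidden size, input size
    rng = np.random.default_rng(0)
    W_f = rng.normal(size=(H, H + X))   # forget-gate weights (placeholder)
    b_f = np.zeros(H)

    def forget_step(c_prev, h_prev, x_t):
        z = np.concatenate([h_prev, x_t])
        f_t = sigmoid(W_f @ z + b_f)    # per-unit keep factor in (0, 1)
        return f_t * c_prev             # f_t ~ 0 erases, f_t ~ 1 keeps

    c_new = forget_step(np.ones(H), np.zeros(H), np.ones(X))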
Yeah, I knew that was the case when I clicked on the thumbnails, couldn't close the image, and had to reload the whole page. The good thing is that you could just ask the AI to fix it; the bad thing is that you assumed it would produce fully working code in one shot and didn't test it properly.
https://youtu.be/lI1LCfTx2lI?t=525
There is also Kits.ai https://www.kits.ai/tools/ai-instruments