Thanks in advance; much appreciated.
To make things worse, low attention values also have very low gradients, so it takes many weight updates to undo that kind of mistake. On the other hand, subtracting the outputs of two softmaxes lets the model assign a weight of exactly zero to some values while keeping a reasonable gradient flowing through.
So the model already knows what is noise, but a single softmax makes it harder to exclude it.
Moreover, with a single softmax the output of every head is forced to stay inside the convex hull of the value vectors, whereas with this variant each head can choose its own \lambda, shifting the "range" of its outputs beyond the convex hull predetermined by the values. This makes the model as a whole more expressive.
If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, and the probability that any single entry lands exactly on 0 is effectively zero.
What’s more, the learnable parameter \lambda is allowed to go negative, which lets the model learn to effectively add the two attention maps instead, making a score of exactly 0 impossible.
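For concreteness, here is a minimal NumPy sketch of the difference-of-softmaxes idea being debated above. The shapes, the random inputs, and the fixed value of lambda are all illustrative assumptions, not the paper's actual parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 16  # sequence length, head dimension (made-up sizes)
q1, q2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
k1, k2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
lam = 0.5  # stand-in for the learnable scalar

a1 = softmax(q1 @ k1.T / np.sqrt(d))
a2 = softmax(q2 @ k2.T / np.sqrt(d))
weights = a1 - lam * a2  # differential attention weights

# Each row of a1 and a2 sums to 1, so each row of `weights` sums to
# exactly 1 - lam, individual entries lie in [-lam, 1], and with
# continuous inputs no entry is exactly 0.
print(weights.sum(axis=-1))
print(weights.min(), weights.max())
```

With a negative lam the two maps add instead of subtract, which is the point made just above.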
Seeing the circuitry of a computer in this way helped me to understand that computers operated by means other than pure magic. And, the video I saw was much less descriptive of how a computer works than the one the OP linked. So, although neither video amounts to a full college course on the topic, there’s still a lot of value in their ability to expose people to the topic. It’s inspiring to see how computers are mostly a composition of NAND gates, and to compare the massive structures in the videos with the microprocessors of the real world.
I'm just completely baffled how anything in the training procedure could allow the LLM to learn information about the structure of tokens. Does the tokenization process not treat every token (which I thought usually maps to a word) as an "opaque blob"?
1) An English dictionary as input.
2) List of words that start with "app" wiki page as input.
3) Other alphabetically sorted pieces of text.
4) Elementary school spelling homework.
5) Papers on glyphs, diphthongs, and other phonetic concepts.
You begin to recognize that the tokens in these lists appear near each other in this strange context. You've hardly ever seen token 11346 ("apple") and token 99015 ("appli") this close to each other before. But you see it frequently enough that you decide to nudge these two tokens' embeddings closer to one another.
Your ability to predict the next token in a sequence has improved. You have no idea why these two tokens are close every ten millionth training example. Your word embeddings start to encode spelling information. Your word embeddings start to encode handwriting information. Your word embeddings start to encode phonic information. You've never seen or heard the actual word, "apple". But, after enough training, your embeddings contain enough information so that if you're asked, ["How do", "you", "spell", "apple"], you are confident as you proclaim ["a", "p", "p", "l", "e", "."] as the obvious answer.
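A toy sketch of that nudging, in the style of a word2vec attraction step between co-occurring tokens. The token ids come from the comment above; the embedding dimension, learning rate, and update rule are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8  # tiny embedding dimension for the example
# Hypothetical random initial embeddings for "apple" and "appli".
emb = {11346: rng.normal(size=dim), 99015: rng.normal(size=dim)}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

before = cos(emb[11346], emb[99015])

# Repeated co-occurrence: each step pulls the two vectors toward
# each other by a fraction of their difference.
lr = 0.1
for _ in range(50):
    grad = emb[11346] - emb[99015]
    emb[11346] -= lr * grad
    emb[99015] += lr * grad

after = cos(emb[11346], emb[99015])
print(before, after)  # cosine similarity increases toward 1
```

Real training moves embeddings via the loss gradient over billions of examples, but the direction of the effect is the same: tokens that predict each other drift together.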
I sat down and worked it out. What do you know: the golden ratio.
Oh, and this other number, -0.618. Anyone know what it's good for?
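Assuming the two numbers here are the golden ratio and its conjugate (a guess at what was worked out), they are exactly the two roots of x^2 = x + 1, and -0.618... is both 1 - phi and -1/phi:

```python
# Roots of x**2 - x - 1 = 0 via the quadratic formula.
phi = (1 + 5 ** 0.5) / 2    # 1.618...
other = (1 - 5 ** 0.5) / 2  # -0.618...

# Both satisfy x**2 = x + 1, and `other` equals -1/phi.
print(round(phi, 3), round(other, 3))  # 1.618 -0.618
```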