https://www.joelonsoftware.com/2000/04/06/things-you-should-...
The analogy I prefer when teaching attention is celestial mechanics. Tokens are like planets in latent space. The attention mechanism is a kind of "gravity": each token pushes and pulls the others around to refine their meaning. But instead of distance and mass, this gravity is proportional to semantic inter-relatedness, and instead of physical space it plays out in a latent space.
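The analogy can be made concrete with a minimal sketch: one step of (simplified, weight-free) scaled dot-product attention, where each token moves toward a similarity-weighted average of the others. The shapes and update rule here are illustrative only, not from any particular model.

```python
import numpy as np

def attention_step(tokens):
    """tokens: (n, d) array of token embeddings -- the "planets"."""
    d = tokens.shape[1]
    # "Gravity" strength: dot-product similarity instead of mass/distance.
    scores = tokens @ tokens.T / np.sqrt(d)        # (n, n)
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each token is pulled toward the weighted average of all tokens.
    return weights @ tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # five tokens in an 8-dim latent space
y = attention_step(x)
```

Real transformers add learned query/key/value projections and a residual connection, but the pull-toward-related-tokens dynamic is the same.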
https://github.com/danielvarga/transformer-as-swarm
Basically a boid simulation where a swarm of birds can collectively solve MNIST. The goal is not some new SOTA architecture, it is to find the right trade-off where the system already exhibits complex emergent behavior while the swarming rules are still simple.
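For readers unfamiliar with boids, the classic local rules (cohesion, separation, alignment) are simple to state. The sketch below is a generic toy version of those rules; the parameters and structure are illustrative and not taken from the linked repository.

```python
import numpy as np

def boid_step(pos, vel, r=1.0, dt=0.1, k=0.05):
    """One swarm update. pos, vel: (n, 2) arrays of positions/velocities."""
    new_vel = vel.copy()
    for i in range(len(pos)):
        diff = pos - pos[i]                  # vectors toward other boids
        dist = np.linalg.norm(diff, axis=1)
        near = (dist < r) & (dist > 0)       # neighbors within radius r
        if near.any():
            cohesion = diff[near].mean(axis=0)                    # move toward neighbors
            separation = -(diff[near] / dist[near, None]**2).mean(axis=0)  # avoid crowding
            alignment = vel[near].mean(axis=0) - vel[i]           # match neighbor velocity
            new_vel[i] += k * (cohesion + separation + alignment)
    return pos + dt * new_vel, new_vel
```

The repo's question is essentially how little can be added to rules like these before the swarm can carry out a computation like classifying MNIST digits.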
It is currently abandoned due to a serious lack of free time (*), but I would consider collaborating with anyone willing to put in some effort.
(*) In my defense, I’m not slacking meanwhile: https://arxiv.org/abs/2510.26543 https://arxiv.org/abs/2510.16522 https://www.youtube.com/watch?v=U5p3VEOWza8
Folie A Deux Ex Machina
The first part of the comment is very valuable. “I looked at it and it made me feel extremely strange almost immediately“. That is very good to know.
The second bit I’m less sure about. What do they mean by “check to make sure this can't trigger migraines or seizures”? What kind of check are they expecting? Literature research? Experiments? The word “check” makes it sound as if they think this is some easy-to-do thing, like how you could “double check” the spelling of a word using a dictionary.
In this paper, the task is to learn how to multiply, strictly from AxB=C examples, with 4-digit numbers. Their vanilla transformer can't learn it, but the one with (their variant of) chain-of-thought can. These are transformers that have never encountered written text, and are too small to understand any of it anyway.
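To make the setup concrete, here is a hypothetical illustration of the two training formats the comment describes: bare A×B=C strings versus a chain-of-thought expansion into shifted partial products. The exact tokenization and format in the paper will differ; this only shows the kind of intermediate steps that make the task learnable.

```python
def plain_example(a, b):
    # Bare input/output pair: the model must learn multiplication end to end.
    return f"{a}*{b}={a * b}"

def cot_example(a, b):
    # Chain-of-thought style: decompose b into digits and accumulate
    # shifted partial products, exposing the intermediate computation.
    steps, total = [], 0
    for i, d in enumerate(reversed(str(b))):
        p = a * int(d) * 10**i
        total += p
        steps.append(f"{a}*{d}e{i}={p}")
    return " ; ".join(steps) + f" ; sum={total}"

print(plain_example(1234, 5678))
print(cot_example(1234, 5678))
```

The point of the paper, as described above, is that a small transformer can fit the second format but not the first.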
It’s a bizarre debate when it’s glaringly obvious that small contributions matter and big contributions matter as well.
But which contributes more, they ask? Who gives a shit, really?
Funding agencies? Should they prioritize established researchers or newcomers? Should they support many smaller grant proposals or fewer large ones?