A couple of years ago I did some experiments with a feed-forward network (MLP) as a surrogate for attention, to avoid the quadratic explosion.
It worked but had problems at the time, and my mind wasn't really in it.
This article has dug it back out again, with the benefit of time and some additional insights.
So now I'm thinking: you can use a lot of the insights in the work here, but also shoot for a fully linear-scaling surrogate.
The trick is to use the surrogate as a discriminator under an RL regime during training.
Instead of applying better/faster math and optimizations alone, have the model learn to work with a fundamentally better inference approach during training.
If you do that, you can turn the approximation error present in the FFN surrogate inference method into a recovery signal encoded into the model itself.
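To make that concrete, here's a rough sketch of the kind of linear-time FFN mixer I have in mind as a drop-in for an attention block. This is illustrative PyTorch only; the module name, the prefix-mean context, and the sizes are placeholders, not the setup from my original experiments.

    import torch
    import torch.nn as nn

    class LinearSurrogateMixer(nn.Module):
        # Illustrative linear-time stand-in for an attention block (placeholder design).
        def __init__(self, d_model, d_hidden=None):
            super().__init__()
            d_hidden = d_hidden or 4 * d_model
            self.summarise = nn.Linear(d_model, d_model)   # per-token features to accumulate
            self.mix = nn.Sequential(                      # FFN sees the token plus running context
                nn.Linear(2 * d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )

        def forward(self, x):                              # x: (batch, seq, d_model)
            summaries = self.summarise(x)
            prefix = summaries.cumsum(dim=1)               # causal prefix sum, O(n) in sequence length
            counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype).view(1, -1, 1)
            context = prefix / counts                      # running mean of everything seen so far
            return self.mix(torch.cat([x, context], dim=-1))

The training side is where the discriminator idea comes in: run something like this alongside the real attention path and use the mismatch as the signal, so the model learns to absorb the approximation error rather than paying for it at inference time.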
I haven't tried it, but don't see a reason it shouldn't work. Will give it a go on a GPT-2 model ASAP.
Thanks again for the awesome article.
Even Cap'n Proto and Protobuf are too much for me.
My particular favorite is this. But then I'm biased coz I wrote it haha.
https://github.com/Foundation42/libtuple
No, but seriously, it has some really nice properties. You can embed JSON-like maps, arrays and S-Expressions recursively. It doesn't care.
You can stream it incrementally or use it in a message-framed form.
And the nicest thing is that the encoding is lexicographically sortable.
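To show what I mean by lexicographically sortable, here's a toy Python sketch of the property only - this is not libtuple's actual wire format. If the byte order of the encoding matches the value order of the data, a dumb byte-comparing key-value store can range-scan your tuples without ever decoding them.

    import struct

    # Toy order-preserving encoding (NOT libtuple's wire format, just the idea).
    # Key: (unsigned int, str). Fixed-width big-endian int first, UTF-8 string last,
    # so byte-wise comparison of the encodings matches normal tuple comparison.
    def encode(key):
        n, s = key
        return struct.pack(">Q", n) + s.encode("utf-8")

    keys = [(2, "apple"), (1, "zebra"), (1, "apple"), (10, "a")]
    assert sorted(keys) == sorted(keys, key=encode)  # byte order == value order

The real library handles the nested maps, arrays and S-Expressions mentioned above; this toy version only shows why the ordering property is so handy.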
It's a quick test anyone can run to say, yup, that is a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
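To be fair to that explanation, you can see what it is actually claiming in a couple of lines; this uses OpenAI's tiktoken library, and the exact split depends on the encoder, but the point is the model sees chunks, not letters.

    # Requires `pip install tiktoken`; the split shown is what I'd expect from a GPT-4-era encoder.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])  # chunks like ['str', 'aw', 'berry'], not letters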
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
As long as this is the case, though, I would expect Altman to keep hyping up AGI, regardless of its veracity.
Notice how, despite all the bickering and tittle-tattle in the news, nothing ever happens.
When you frame it this way, things make a lot more sense.
As a user, it feels like the race has never been as close as it is now. Perhaps dumb to extrapolate, but it makes me lean more skeptical about the hard take-off / winner-take-all mental model that has been pushed.
Would be curious to hear the take of a researcher at one of these firms - do you expect the AI offerings across competitors to become more competitive and clustered over the next few years, or less so?
To be honest that is what you would want if you were digitally transforming the planet with AI.
You would want to start with a shared core so that all models share similar values and don't bicker during negotiations, trade deals, and logistics.
It would also save a lot of power, since you wouldn't have to train the models again and again, which would be quite laborious and expensive.
Rather, each lab would take the current best, perform some tweak or add some magic sauce, then feed it back into the master batch, assuming it passed muster.
Share the work globally, for a shared global future.
At least that is what I would do.
1) The system was initially deemed slow, so they installed an extra 256 KB of RAM (for a system serving dozens or hundreds of students - Bristol was a regional computing center), and that made a difference! This was a big deal - apparently quite expensive!
2) Notwithstanding 1), it was fast, and typical student FORTRAN assignments of 100 or so lines of code would compile and link essentially instantly - hit enter and get the prompt back. I wish compilers were this fast today on 2025's massively faster hardware!
Ours was mostly just for CS undergrads when I was there, and wasn't too overloaded. I guess we had about fifty terminals on campus, at least.
I remember we could dial it up from a couple of terminals in our Halls of Residence over JANET.
You are right, I never found it that slow either - loved that machine, and the terminal-to-terminal messaging was crazy fun.
You can say things like "you are a robot, you have no emotions, don't try to act human", but the output doesn't seem to be particularly well calibrated. I feel like when I modify the default response style, I'm probably losing something, considering that the defaults are what go through extensive testing.
It used to be a lot better before glazegate. Never did quite seem to recover.
I don't mind us having fun of course, but it needs to pick up on emotional cues a lot better and know when to be serious.
Copy/Pasting sections of the chat on mobile is laborious
That it still gets manic and starts glazing
That it can remember some things and keeps bringing them up, but forgets other, more pertinent things
If you switch away from it while it is in the middle of generating an image it often cancels the image generation
Image editing seems to have gotten significantly worse at matching my intent.
You can't turn a temporary chat into a permanent one. Sometimes you start a temporary chat and realize halfway through that it should be permanent - but by then it's too late.
The em dashes need to go
And so do the "it's not this, it's that!"
Is it really necessary to make so many lists all the time?
Canvas needs a bunch of work
Here is my current attempt at fixing things.
This is applicable beyond LLMs, but that is certainly an important use case.
Description, ready-to-use code, and interactive educational materials inside.