I haven't got any beyond my own working notes and some basic plots, but I've unceremoniously dumped them into a document here in case anyone else finds them interesting. If so I'd _love_ to chat with you. enjeyw @ google's email provider.
https://thealephengine.substack.com/p/67e3786f-8e84-41bd-888...
Unifying a team behind a single direction is both the simplest way to ensure success and the most delicate part of managing any project involving more than one other person.
If lots of smart people have thought about something and still disagree on the correct approach, pick one and move on.
Seriously. If you voted for this, you owe civilization a debt that you will probably never be wealthy enough or long-lived enough to repay.
About 5 years ago I became more aware that reducing my consumption of ultra-processed food was good for me. This was very bad for Beyond Meat’s prospects.
I suspect this experience generalizes.
The vast majority of tokens in a sequence will be irrelevant to an attention mechanism outside of a very small window. Right now, however, we tend to either keep all cache values forever or dump them all once they hit a certain age.
My theory is that you can train a model to look at the key vectors and, from that information alone, work out how long to keep the token in the cache. Results so far look promising, and it’s easy to add after the fact without retraining the core model itself.
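To make the idea concrete, here's a minimal sketch in PyTorch of the kind of mechanism I mean: a tiny head that reads each cached key vector and predicts how many future decode steps that entry should survive, plus an eviction pass that drops expired entries. Everything here (the names RetentionHead and evict_expired, the MLP size, the cache layout) is just placeholder illustration, not the actual setup from my notes.

```python
import torch
import torch.nn as nn


class RetentionHead(nn.Module):
    """Predicts a time-to-live (in decode steps) for each cached key vector."""

    def __init__(self, head_dim: int, max_ttl: int = 4096):
        super().__init__()
        self.max_ttl = max_ttl
        self.mlp = nn.Sequential(
            nn.Linear(head_dim, 64),
            nn.GELU(),
            nn.Linear(64, 1),
        )

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (seq_len, head_dim) -> ttl: (seq_len,), values in [0, max_ttl]
        return torch.sigmoid(self.mlp(keys)).squeeze(-1) * self.max_ttl


def evict_expired(keys, values, expiry_step, current_step):
    """Keep only cache entries whose predicted expiry step hasn't passed yet."""
    keep = expiry_step > current_step
    return keys[keep], values[keep], expiry_step[keep]


# Usage sketch: stamp each token with an expiry step when it enters the cache,
# then filter the cache before attention at each decode step.
head_dim, seq_len = 64, 128
retention = RetentionHead(head_dim)

with torch.no_grad():
    keys = torch.randn(seq_len, head_dim)
    values = torch.randn(seq_len, head_dim)
    insert_steps = torch.arange(seq_len)
    expiry_step = insert_steps + retention(keys)

    keys, values, expiry_step = evict_expired(
        keys, values, expiry_step, current_step=200
    )
    print(keys.shape)  # only the entries the head chose to keep past step 200
```

The appeal of framing it this way is that the head only ever sees key vectors, so it can be trained against a frozen base model and bolted on afterwards.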