It's the same principle as open transformer models, where an adapter is used to generate the embedding.
However, the core team's current focus is on scaling the core text model, as this is the key performance driver, before adapting to multi-modal.
The tech is there; the base model needs to be better.
Not saying it's a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space.
It's definitely something we are planning to do as well =)
The hardware lottery, well... IMO it's really about leveraging fully parallel training to learn how to use a memory. Attention is quadratic, but it can be computed in parallel: it's an end-to-end learned memory. Getting that kind of pattern into RNNs won't be easy, but it's going to be crucial before we boil the ocean.
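The quadratic-but-parallel vs. linear-but-sequential trade-off above can be seen in a toy numpy sketch (all function names hypothetical, not any real library's API):

```python
import numpy as np

def attention(Q, K, V):
    # All T x T pairwise scores at once: O(T^2) work in sequence
    # length T, but every score is independent, so it parallelizes
    # fully across the sequence during training.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def rnn_step_through(x, W_h, W_x):
    # One fixed-size state carried step by step: O(T) work,
    # but each step depends on the previous one, so the loop
    # is inherently sequential.
    h = np.zeros(W_h.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(W_h @ h + W_x @ x[t])
    return h
```

(Causal masking is omitted for brevity; the point is only the shape of the computation, not a faithful transformer layer.)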
Arguably so, along with other recurrent architectures (State Space models, etc.) with very different designs. The issue with the old recurrent designs was just the way LSTM was designed, not the recurrent nature itself.
This is a full drop-in replacement for transformer-model use cases at model sizes 32B and under, as it matches the performance of existing open 32B models on most benchmarks.
We are working on a 70B, which will be a full drop-in replacement for most text use cases.
Maybe I'm misunderstanding, but it seems like they still depend on an attention-based model to train their model? While it's interesting, and would definitely enable the personalized AI they seem to be going for, I don't really see how they can say it's not based on the attention architecture. Someone still needs to shell out the big bucks to train that teacher model, no?
I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similarly sized datasets.
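For readers unfamiliar with the "shortcut" being discussed: training a student against a pretrained teacher's output distribution is standard knowledge distillation. A minimal sketch of the usual distillation loss (names hypothetical; this is not the team's actual pipeline):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax over the vocabulary axis.
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions:
    # the student (e.g. a recurrent model) mimics the teacher's
    # (e.g. a transformer's) per-token distribution instead of
    # learning from raw text alone. Scaled by T^2 as is customary
    # so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The loss is zero when the student already matches the teacher, and positive otherwise, so minimizing it pulls the student toward the teacher's behavior without re-paying the teacher's pretraining cost.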
This has been shown to even slightly outperform the transformer scaling law in the training runs we've done from 1B to 14B, and we expect it to continue doing so as we scale.
However, as of this point, settling that debate for good at 72B scale is a $5 million bill. So for now, we use the shortcuts to show that it actually works, and use that money to iterate on and improve the architecture faster.