pico_creator commented on Attention is NOT all you need: Qwerky-72B trained using only 8 AMD MI300X GPUs   substack.recursal.ai/p/qw... · Posted by u/jtatarchuk
inhumantsar · 5 months ago
> At a high level, you take an existing transformer model: freeze all the weights, delete the attention layers, replace them with RWKV, and train it through multiple stages.

Maybe I'm misunderstanding, but it seems like they still depend on an attention-based model to train their model? While it's interesting, and would definitely enable the personalized AI they seem to be going for, I don't really see how they can say it's not based on the attention architecture. Someone still needs to shell out the big bucks to train that teacher model, no?

pico_creator · 5 months ago
(original article author)

I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similarly sized datasets.

This has even been shown to slightly outperform transformer scaling laws in the training we have done from 1B to 14B, and we expect it to keep doing so as we scale.

However, as of this point, answering and settling that debate for good at the 72B scale is a $5 million bill. So for now, we use the shortcut to show that it actually works, and use that money to iterate and improve the architecture faster.
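The freeze-and-swap recipe quoted at the top of the thread can be sketched in miniature. This is a hedged illustration, not the actual Qwerky training code: the class names (`Block`, `AttentionMixer`, `RWKVMixer`) and the flat parameter lists are invented stand-ins for a real PyTorch model.

```python
# Hedged sketch of the recipe: freeze the pretrained weights, swap each
# attention block for a fresh RWKV-style mixer, and leave only the new
# mixers trainable. Class names and structure are invented stand-ins.

class Param:
    def __init__(self):
        self.trainable = True

class AttentionMixer:            # pretrained attention (to be replaced)
    def __init__(self):
        self.params = [Param()]

class RWKVMixer:                 # new recurrent mixer (to be trained)
    def __init__(self):
        self.params = [Param()]

class Block:
    def __init__(self):
        self.mixer = AttentionMixer()
        self.ffn_params = [Param()]   # pretrained feed-forward weights

def convert(blocks):
    for block in blocks:
        for p in block.ffn_params:    # 1. freeze the teacher's weights
            p.trainable = False
        block.mixer = RWKVMixer()     # 2. swap attention for RWKV
    return blocks                     # 3. distillation trains only the mixers

model = convert([Block() for _ in range(4)])
trainable = sum(p.trainable for b in model
                for p in b.mixer.params + b.ffn_params)
print(trainable)  # 4 — one fresh mixer per block remains trainable
```

The point of the shortcut is visible in the counts: the frozen teacher weights carry the knowledge, and gradient updates touch only the new mixers.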

pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
Ey7NFZ3P0nzAe · 8 months ago
Has there been progress towards making RWKV multimodal? Can we use projector layers to send images to RWKV?
pico_creator · 8 months ago
There is work done for Vision RWKV and audio RWKV; an example paper is here: https://arxiv.org/abs/2403.02308

It's the same principle as open transformer models, where an adapter is used to generate the embedding.

However, the core team's current focus is on scaling the core text model, as this will be the key performance driver, before adapting to multi-modal.

The tech is there, the base model needs to be better
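The adapter pattern mentioned above can be sketched as a single linear projection. Everything here (the dimensions, the toy weight values, the `matvec` helper) is illustrative, not the actual Vision RWKV code: a frozen encoder emits image features, and a small projector maps them to the language model's embedding width.

```python
# Hedged sketch: a frozen vision encoder emits a feature vector, and a
# linear "projector" maps it into the LLM's embedding width so it can
# be consumed like a token embedding. Dimensions/values are toy values.

VISION_DIM, EMBED_DIM = 4, 3

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# projector weights: EMBED_DIM x VISION_DIM (fixed toy values here;
# in practice this matrix is what gets trained)
projector = [[0.1] * VISION_DIM for _ in range(EMBED_DIM)]

image_features = [1.0, 2.0, 3.0, 4.0]   # output of the vision encoder
image_embedding = matvec(projector, image_features)

print(len(image_embedding))  # 3 — same width as the text embeddings
```

Because the base model is frozen, only this small projector needs training, which is why "the base model needs to be better" is the real bottleneck.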

pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
Ey7NFZ3P0nzAe · 8 months ago
I'm quite interested in repeng [0] (representation engineering) for steerability of (so far transformer-based) LLMs and was wondering if anyone had tried such methods on RWKV (or Mamba, for that matter). Maybe there are some low-hanging fruits about it.

[0] https://github.com/vgel/repeng/issues

pico_creator · 8 months ago
One of the interesting "new directions" for RWKV and Mamba (or any recurrent model) is the monitoring and manipulation of the state between tokens. For steerability, alignment, etc =)

Not saying it's a good or bad idea, but pointing out that having a fixed-size state between tokens has interesting applications in this space.
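A minimal sketch of the state-manipulation idea, using an invented toy recurrence (the real RWKV state update is different): run a few tokens, nudge the hidden state with a steering vector between tokens, then continue.

```python
# Toy recurrence, purely illustrative (the real RWKV update differs):
# the point is that the between-token state is an explicit vector you
# can inspect or steer, unlike a transformer's growing KV cache.

def step(state, x, decay=0.9):
    return [decay * s + x for s in state]  # decayed state + new input

state = [0.0, 0.0]
for tok in (1.0, 2.0):          # run the first two tokens
    state = step(state, tok)

steering = [0.5, -0.5]          # hypothetical steering vector
state = [s + d for s, d in zip(state, steering)]  # nudge mid-sequence

state = step(state, 3.0)        # continue from the steered state
print(state)
```

This is the rough shape a repeng-style intervention would take on a recurrent model: one fixed-size vector to read or edit, instead of per-layer activations over the whole context.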

pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
theLiminator · 8 months ago
Do you have an in depth comparison between RWKV and models like mamba or s4?
pico_creator · 8 months ago
Not sure how in-depth you want it to be, but we did do a co-presentation with one of the co-authors of Mamba at Latent Space: https://www.youtube.com/watch?v=LPe6iC73lrc
pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
swyx · 8 months ago
how about finetuning your 32B to be R1QWQKV?
pico_creator · 8 months ago
There is a current lack of "O1 style" reasoning datasets in the open-source space; QWQ did not release their dataset. So it will take some time for the community to prepare one.

It's definitely something we are tracking to do as well =)

pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
lostmsu · 8 months ago
Why aren't you on lmarena (former chatbot arena) leaderboard?
pico_creator · 8 months ago
Kinda on the todo list; the model is open source on HF for anyone who is willing to make it work with lmarena.
pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
low_tech_punk · 8 months ago
Thanks! The 0.1B version looks perfect for embedded system. What is the key benefit of attention-free architecture?
pico_creator · 8 months ago
Lower compute cost, especially over longer sequence lengths. Depending on context length, it's 10x, 100x, or even 1000x+ cheaper (quadratic vs. linear cost difference).
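The quadratic-vs-linear claim can be made concrete with a back-of-the-envelope operation count. The counts below ignore all constant factors (head dimension, state size, kernel efficiency); they are intuition only, not a benchmark.

```python
# Back-of-the-envelope counts: attention touches every pair of tokens,
# a recurrent model touches each token once. Constant factors ignored.

def attention_ops(n):
    return n * n    # every token attends to every token

def recurrent_ops(n):
    return n        # one state update per token

for n in (1_000, 10_000, 100_000):
    print(n, attention_ops(n) // recurrent_ops(n))
# the ratio is n itself: 1000x at 1k tokens, 100000x at 100k tokens
```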
pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
inciampati · 8 months ago
the recurrent model needs a mechanism to replay past context. no need to go quadratic to access all of it. they could replay multiple times to get effects similar to attention.

the hardware lottery, well... imo it's really about leveraging fully parallel training to learn how to use a memory. attention is quadratic but it can be computed in parallel. it's an end to end learned memory. getting that kind of pattern into RNNs won't be easy but it's going to be crucial before we boil the ocean.

pico_creator · 8 months ago
RWKV already solves the parallel compute problem for GPUs; with the changes it has made, it is a recurrent model that can scale to thousands of GPUs, no issue.

The same arguably holds for other recurrent architectures (state space models, etc.), with very different design implementations. The issue with old recurrent designs was just the way LSTM was designed, not the recurrent nature itself.
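One way to see why such recurrences train in parallel: an update of the form h_t = a_t * h_{t-1} + b_t composes associatively over (a, b) pairs, so it can be evaluated as a prefix scan rather than a strictly sequential loop. This toy check (not the actual RWKV kernel) compares the scan result against a sequential loop.

```python
# Toy check (not the actual RWKV kernel): the recurrence
#   h_t = a_t * h_{t-1} + b_t
# composes associatively over (a, b) pairs, so a parallel prefix scan
# can compute all states instead of a strictly sequential loop.

def combine(left, right):
    a1, b1 = left
    a2, b2 = right
    # composing two linear updates yields another linear update
    return (a1 * a2, a2 * b1 + b2)

def sequential(pairs, h0=0.0):
    h = h0
    for a, b in pairs:
        h = a * h + b
    return h

def scan(pairs):
    # simple left fold here; a GPU kernel would tree-reduce this
    # across threads, which is what makes training parallel
    acc = pairs[0]
    for p in pairs[1:]:
        acc = combine(acc, p)
    return acc

pairs = [(0.5, 1.0), (0.5, 2.0), (0.5, 3.0)]
a, b = scan(pairs)
print(sequential(pairs), a * 0.0 + b)  # prints 4.25 4.25 — they agree
```

Associativity is the property LSTM-style gates lack, which is why older RNNs could not be trained this way.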

pico_creator commented on RWKV Language Model   rwkv.com/... · Posted by u/simonpure
sushidev · 8 months ago
Interesting. Very cryptic for a simple user like me. I wonder if it's useful today, and for what purposes.
pico_creator · 8 months ago
Currently the strongest RWKV model is 32B in size: https://substack.recursal.ai/p/q-rwkv-6-32b-instruct-preview

This is a full drop-in replacement for any transformer model use case at model sizes 32B and under, as it has performance equal to existing open 32B models in most benchmarks.

We are working on a 70B, which will be a full drop-in replacement for most text use cases.
