It's the same principle as open transformer models, where an adapter is used to generate the embedding.
However, the core team's current focus is on scaling the core text model, as this is the key performance driver, before adapting to multi-modal.
The tech is there; the base model needs to be better.
Not saying it's a good or bad idea, but pointing out that having a fixed state in between has interesting applications in this space.
It's definitely something we are planning to do as well =)
The hardware lottery, well... IMO it's really about leveraging fully parallel training to learn how to use a memory. Attention is quadratic, but it can be computed in parallel: it's an end-to-end learned memory. Getting that kind of pattern into RNNs won't be easy, but it's going to be crucial before we boil the ocean.
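The quadratic-but-parallel vs. linear-but-sequential trade-off above can be seen in a toy numpy sketch (all function names hypothetical, not any real library's API):

```python
import numpy as np

def attention(Q, K, V):
    # All T x T pairwise scores at once: O(T^2) work in sequence
    # length T, but every score is independent, so it parallelizes
    # fully across the sequence during training.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def rnn_step_through(x, W_h, W_x):
    # One fixed-size state carried step by step: O(T) work,
    # but each step depends on the previous one, so the loop
    # is inherently sequential.
    h = np.zeros(W_h.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(W_h @ h + W_x @ x[t])
    return h
```

(Causal masking is omitted for brevity; the point is only the shape of the computation, not a faithful transformer layer.)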
Arguably so, along with other recurrent architectures (State Space models, etc.) with very different designs. The issue with the old recurrent designs was just the way LSTM was designed, not the recurrent nature itself.
This is a full drop-in replacement for transformer-model use cases at model sizes 32B and under, as it matches the performance of existing open 32B models on most benchmarks.
We are working on a 70B, which will be a full drop-in replacement for most text use cases.
Maybe I'm misunderstanding, but it seems like they still depend on an attention-based model to train their model? While it's interesting, and would definitely enable the personalized AI they seem to be going for, I don't really see how they can say it's not based on the attention architecture. Someone still needs to shell out the big bucks to train that teacher model, no?
I view it more as a shortcut. We have trained 7B and 14B models from scratch, matching transformer performance with similarly sized datasets.
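For readers unfamiliar with the "shortcut" being discussed: training a student against a pretrained teacher's output distribution is standard knowledge distillation. A minimal sketch of the usual distillation loss (names hypothetical; this is not the team's actual pipeline):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-softened softmax over the vocabulary axis.
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions:
    # the student (e.g. a recurrent model) mimics the teacher's
    # (e.g. a transformer's) per-token distribution instead of
    # learning from raw text alone. Scaled by T^2 as is customary
    # so gradients stay comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)
```

The loss is zero when the student already matches the teacher, and positive otherwise, so minimizing it pulls the student toward the teacher's behavior without re-paying the teacher's pretraining cost.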
This has been shown to even slightly outperform the transformer scaling law in the training runs we've done from 1B to 14B, and we expect it to continue doing so as we scale.
However, as of this point, settling that debate for good at 72B scale is a $5 million bill. So for now, we use the shortcuts to show that it actually works, and use that money to iterate on and improve the architecture faster.