Lately I've been wondering... is this a problem, or a strength?
It might be a fallacy to compare how LLMs "think" with how humans think. But humor me for a second. When you are speaking, each time you emit a word, you are not attending to every previous word in your sentence (like transformers do); rather, you have a state in your mind representing the grammar and concepts, which is continuously updated as you speak (more similar to SSMs).
Similarly, when you read a book, every time you read a word, you are not attending to every previous word in the book. Your model of "the book" is rather a fuzzy/approximate state that is updated with new information every time a new word appears. Right? (I'm sorry, I know this is very handwavy and pseudoscientific, but bear with me.)
Ok, so if (big if) you buy the above, then SSMs seem closer to human-style language modelling than transformers are.
BUT... then aren't transformers strictly better in terms of accuracy? A transformer never "forgets" information, as long as it is within the context window, because it revisits every previous token each time it emits a new one.
So let's say SSMs remove the "quadratic attention" problem of transformers. That's a nice training/inference performance boost. But... look at where we got with "naive" attention: GPT-4, Claude 3. It's not like we're hitting a wall with quadratic attention. It's absurdly more expensive than SSMs, but GPUs certainly aren't getting slower. If all AI work stopped now and only hardware improved, it wouldn't be long before GPT-4 could run on local hardware, right, assuming Moore's law holds?
/end rant, not really sure what my point was. I'm not against SSMs (they're cool), but I'm wondering if the SOTA will ever be an SSM when attention is so damn good.
But that's the efficiency-effectiveness tradeoff that we have to make: given that compute is limited, would we prefer attention over shorter sequences or SSMs over longer sequences? The answer is probably "well, it depends on your use case" - I can definitely see reasons for both!
A fairly compelling thought for me is hybrid architectures (Jamba is a recent one). Here you can imagine having perfect recall over recent tokens and lossy recall over distant tokens. E.g. if the AI is generating a feature-length film, you "could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency" (quote from the OP)
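To make the hybrid idea slightly more concrete, here's a toy PyTorch sketch (entirely my own construction, not Jamba's actual architecture - I'm using a GRU as a cheap stand-in for the SSM layer): attention gets perfect recall over a recent window, while a recurrent state carries a lossy summary of the whole sequence.

```python
# Toy hybrid block: windowed attention (exact recall over recent tokens)
# + a recurrent "SSM-like" path (lossy state over the whole sequence).
# All names/sizes are made up for illustration; this is NOT Jamba's design.
import torch
import torch.nn as nn

class ToyHybridBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4, window=128):
        super().__init__()
        self.window = window  # attend only over the last `window` tokens
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in for an SSM: a GRU carrying a fixed-size, lossy state
        self.ssm_like = nn.GRU(d_model, d_model, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        recent = x[:, -self.window:, :]          # "perfect recall" window
        attn_out, _ = self.attn(recent, recent, recent)
        long_out, _ = self.ssm_like(x)           # lossy long-range summary
        out = long_out.clone()
        out[:, -self.window:, :] = out[:, -self.window:, :] + attn_out
        return self.norm(out)

x = torch.randn(2, 512, 64)
print(ToyHybridBlock()(x).shape)  # torch.Size([2, 512, 64])
```

(My understanding is that Jamba itself interleaves full attention layers with Mamba layers rather than splitting a single layer by window like this, but the intuition - exact recall nearby, compressed state far away - is the same.)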
In our case, we don't actually solve for a closed-form solution but instead compute the discrete representation (Equation 2).
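In case it helps to see that step concretely, here's a minimal NumPy/SciPy sketch of the usual zero-order-hold discretization plus the resulting recurrence (my own toy example with made-up names, not the paper's code):

```python
# Zero-order-hold (ZOH) discretization of a continuous SSM, then the
# discrete recurrence. Toy example; names and sizes are mine.
# Continuous:  h'(t) = A h(t) + B x(t),   y(t) = C h(t)
# Discrete:    h_t   = Ab h_{t-1} + Bb x_t,  y_t = C h_t
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Ab = exp(dA), Bb = (dA)^-1 (exp(dA) - I) dB, for step size delta."""
    n = A.shape[0]
    Ab = expm(delta * A)
    Bb = np.linalg.solve(delta * A, Ab - np.eye(n)) @ (delta * B)
    return Ab, Bb

def ssm_scan(Ab, Bb, C, xs):
    """Run the discrete recurrence over a 1-D input signal."""
    h = np.zeros(Ab.shape[0])
    ys = []
    for x in xs:                       # one cheap state update per step
        h = Ab @ h + Bb[:, 0] * x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))   # stable-ish dynamics
B = rng.standard_normal((4, 1))
C = rng.standard_normal(4)
Ab, Bb = discretize_zoh(A, B, delta=0.1)
print(ssm_scan(Ab, Bb, C, rng.standard_normal(10)))
```

The point being that nothing ever integrates the continuous ODE at runtime: Ab and Bb are computed from (delta, A, B) up front, and generation is just that fixed-size state update per token.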
Hope that helps!
I'm seeing the Mamba paper as the `Attention Is All You Need` of SSMs - it might take a little while before we get everything optimised to the point of a GPT-4 (it took 6 years for transformers, but it should be faster than that now with all the attention on ML).