jxmorris12 · 2 years ago
In case people are wondering why Mamba is exciting:

There's this idea in AI right now that "scaling" models to be bigger and train on more data always makes them better. This has led to a science of "scaling laws" which study just how much bigger models need to be and how much data we need to train them on to make them a certain amount better. The relationship between model size, training data size, and performance turns out to be quite predictable.
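To give a sense of what "predictable" means here: scaling laws are usually fit as a simple power law in parameter count and token count. A rough sketch with a Chinchilla-style functional form but purely illustrative constants:

    # Sketch of a Chinchilla-style scaling law: loss is modeled as an
    # irreducible term plus power-law terms in parameters N and tokens D.
    # The functional form is standard; the constants here are made up.
    def predicted_loss(N: float, D: float) -> float:
        E, A, B = 1.7, 400.0, 4000.0   # illustrative constants, not fitted values
        alpha, beta = 0.34, 0.28       # illustrative exponents
        return E + A / N**alpha + B / D**beta

    # e.g. predicted_loss(7e9, 2e12) gives a rough loss estimate for a
    # 7B-parameter model trained on 2T tokens, under these made-up constants.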

Transformers are great because they can continue scaling and giving us better performance – unlike, we think, RNNs. Probably the most exciting thing about Mamba is the claim that it can be a bit smaller, and train on a bit less data, and still provide better performance than the equivalent Transformer, especially at longer sequence lengths.

For more info, see the scaling laws plot in Figure 4 of the Mamba paper: https://arxiv.org/abs/2312.00752

KuriousCat · 2 years ago
People have shown that even CNNs can match the performance of transformers.

https://openreview.net/forum?id=TKIFuQHHECj#

I believe there is a lot of herding going on, driven more by the influence of people who had the compute to play around with than by deeply insightful or principled exploration of network architectures.

jdeaton · 2 years ago
you linked a paper about vision transformers...
hansonw · 2 years ago
“RNN-mode inference” is also extremely exciting because you can precompute the hidden state of any prompt prefix (i.e. a long system prompt, or statically retrieved context) and continued generations pay the same cost irrespective of the prefix length.
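A tiny sketch of the idea (made-up names, not the real Mamba API):

    import numpy as np

    # Because the recurrent state has a fixed size, the state after a long
    # prompt prefix can be computed once and cached. (Illustrative sketch.)
    def precompute_prefix_state(prefix_embeddings, A_bar, B_bar, d_state):
        h = np.zeros(d_state)
        for x in prefix_embeddings:          # pay for the prefix exactly once
            h = A_bar @ h + B_bar @ x        # h_t = A_bar h_{t-1} + B_bar x_t
        return h                             # fixed-size summary of the prefix

    # Generation then resumes from the cached state; each new token only
    # touches the fixed-size state, no matter how long the prefix was.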
shikon7 · 2 years ago
But this also means that the amount of information retained is constant irrespective of the prefix length. This might be a problem if the prefix is composed of essentially incompressible data.
5kg · 2 years ago
I'd love to see someone who has the resources train a model bigger than 2.8b and show the scaling law still holds.
nickpsecurity · 2 years ago
Some prior comments said those architectures lack something a transformer has, memory or the like, and that this weakness is what keeps people using transformers. If true, I'd also like to see tests across various domains with equivalent transformer and Mamba designs to see whether that difference actually matters. From there, we'd have a better idea about whether a Mamba-176B would be worth the money.
intalentive · 2 years ago
Nice post. A couple things to add:

1. The Mamba co-author was also the FlashAttention lead author.

2. The secret ingredient that makes SSMs viable for deep learning is HiPPO theory. If you start with random initialization you're not going to get results. What you need is "optimal online function approximation" using Legendre polynomials, a Fourier basis, etc., in matrix form. The Mamba story starts with Legendre Memory Units.
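For the curious, here is roughly what that structured initialization looks like. A sketch of the HiPPO-LegS matrix as written in the S4 line of papers (toy code, not the reference implementation):

    import numpy as np

    def hippo_legs(N: int) -> np.ndarray:
        # HiPPO-LegS transition matrix, used to initialize the SSM state
        # matrix A instead of a random init:
        #   A[n, k] = sqrt(2n+1) * sqrt(2k+1)   if n > k
        #   A[n, k] = n + 1                     if n == k
        #   A[n, k] = 0                         if n < k
        A = np.zeros((N, N))
        for n in range(N):
            for k in range(n + 1):
                A[n, k] = (n + 1) if n == k else np.sqrt((2 * n + 1) * (2 * k + 1))
            # entries with k > n stay zero (lower-triangular structure)
        return -A  # negated so the continuous dynamics dx/dt = Ax are stable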

Invariably someone comments, "How do we know that it scales?" We don't. But the lead author has backing and a new startup at cartesia.ai. Could be the next Mistral.

sigmoid10 · 2 years ago
The architecture is completely public. I would be surprised if certain other players (including but not limited to Mistral AI) are not training models yet. We'll hear soon enough if this is viable. Maybe not for official release candidates, but at least for internal testing.
3abiton · 2 years ago
Nonetheless, this is extremely exciting, unlike RWKV and the Retentive Network (RetNet).
magnio · 2 years ago
Fantastic blog post, thank you for this. I am not even familiar with transformers, yet the explanation is perfectly clear to me, and the included references and context are a treasure trove. The explanation of FlashAttention is the best I have seen, and that is not even the focus of the article.

One question I have on selectivity: footnote 4 says "the continuous A is constant, while our discretization parameter ∆ is input-dependent." What is the effect of varying the discretization instead of the (main, as I understand it) state A? My gut says it simplifies training and provides stability, but I feel A carries most of the behavior of the model, so it should have more wiggle room throughout training.

jackcook · 2 years ago
Thank you for the kind words! I think it’s mostly to reduce complexity during training. Here’s an excerpt from page 9 of the Mamba paper:

“We remark that while the A parameter could also be selective, it ultimately affects the model only through its interaction with ∆ via A = exp(∆A) (the discretization (4)). Thus selectivity in ∆ is enough to ensure selectivity in (A, B), and is the main source of improvement. We hypothesize that making A selective in addition to (or instead of) ∆ would have similar performance, and leave it out for simplicity.”
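For reference, a sketch of the zero-order-hold discretization that "(4)" refers to; the actual Mamba kernels simplify and fuse parts of this, so treat it as illustrative only:

    import numpy as np
    from scipy.linalg import expm

    def discretize_zoh(A, B, delta):
        # Zero-order hold:
        #   A_bar = exp(delta * A)
        #   B_bar = (delta * A)^{-1} (exp(delta * A) - I) (delta * B)
        # Since delta is input-dependent, selectivity in delta flows into
        # (A_bar, B_bar) even when the continuous A is held constant.
        n = A.shape[0]
        A_bar = expm(delta * A)
        B_bar = np.linalg.solve(delta * A, (A_bar - np.eye(n)) @ (delta * B))
        return A_bar, B_bar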

nlrk · 2 years ago
When I read the paper, I thought the idea was that varying ∆ lets the model learn different things over different time scales. As you quoted, it's "the main source of improvement".

I don't have an LLM background, just controls, so I might be wrong.

whimsicalism · 2 years ago
How are you not familiar with transformers yet have seen multiple explanations of FlashAttention?
samus · 2 years ago
The issue with attention is essentially that it relates all tokens of the input sequence to each other. The need to do that somehow makes sense no matter how much one understands about the internals of a transformer. The naive way to do it boils down to matrix multiplications, and a lot more people understand the performance issues those imply.
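Concretely, the naive version is just this (illustrative sketch):

    import numpy as np

    def naive_attention(Q, K, V):
        # The (n x n) score matrix is what makes naive attention quadratic in
        # sequence length n, in both compute and memory.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # shape (n, n)
        scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                               # shape (n, d)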
avarun · 2 years ago
Literally the exact question I had reading that comment haha
moffkalast · 2 years ago
If I'm not mistaken, the largest Mamba model right now is 2.8B and undertrained on low-quality data (the Pile only). The main problem is that it's new and unproven.

Should become very interesting once someone with both data and significant financial backing takes the plunge and trains something of notable size. Llama-3 might already end up being that attempt, as we seem to be heavily into diminishing returns for transformers.

SekstiNi · 2 years ago
There is one trained on 600B tokens from SlimPajama [1], but that's fairly tiny compared to other recent releases (e.g. stablelm-3b [2], trained on 4T tokens).

> low quality data (the Pile only)

The Pile is pretty good quality-wise. It's mostly the size (300B tokens) that's limiting.

[1]: https://huggingface.co/state-spaces/mamba-2.8b-slimpj
[2]: https://huggingface.co/stabilityai/stablelm-3b-4e1t

moffkalast · 2 years ago
Eh, quality is subjective. There are good parts, like Books3 and arXiv, but a large part of it is Common Crawl, which has just about anything people put up on the internet: random IRC chat logs, HN and Reddit shitposts, YouTube subtitles that are in broken English half the time, and of course the Enron corporate email dump to make every model sound like an HR middle manager.
jsenn · 2 years ago
This was really helpful, but only discusses linear operations, which obviously can’t be the whole story. From the paper it seems like the discretization is the only nonlinear step—in particular the selection mechanism is just a linear transformation. Is that right? How important is the particular form of the nonlinearity?

EDIT: from looking at the paper, it seems like even though the core state space model/selection mechanism is linear (except for discretization?), they incorporate a nonlinearity in the full “mamba block”, which is stacked up with residual connections and layer norm just like in a transformer. They describe this as combining a linear attention and an MLP into a single step, rather than alternating attention and MLP as in a transformer.

jackcook · 2 years ago
Yes, you're spot on. The nonlinearities come from the full Mamba blocks, which I left out of this post for simplicity and to focus on the bigger ideas the paper introduced. You can see it marked by the "X" on the right-most part of Figure 3 in the Mamba paper: https://arxiv.org/abs/2312.00752
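If it helps, here's a very rough sketch of that block structure; the names and shapes are mine, not the reference implementation:

    import torch.nn as nn
    import torch.nn.functional as F

    class MambaBlockSketch(nn.Module):
        # Rough sketch of the gated block in Figure 3: the selective SSM scan
        # itself is linear, but SiLU activations and the multiplicative gate
        # (the "X" in the figure) supply the nonlinearity. The real block also
        # has a depthwise conv, selective parameters, and fused kernels.
        def __init__(self, d_model: int, d_inner: int):
            super().__init__()
            self.in_proj = nn.Linear(d_model, 2 * d_inner)  # split into x and gate z
            self.ssm = nn.Identity()                        # placeholder for the selective scan
            self.out_proj = nn.Linear(d_inner, d_model)

        def forward(self, u):                               # u: (batch, seq, d_model)
            x, z = self.in_proj(u).chunk(2, dim=-1)
            x = self.ssm(F.silu(x))                         # nonlinearity, then the (linear) SSM
            y = x * F.silu(z)                               # gate: the "X" in Figure 3
            return self.out_proj(y)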
paxys · 2 years ago
From what I can tell all the large players in the space are continuing developing on transformers right? Is it just that Mamba is too new, or is the architecture fundamentally not usable for some reason?
thatguysaguy · 2 years ago
Too new is definitely one thing. Someone is going to have to gamble on paying for a serious pretraining run with this architecture before we know how it really stacks up against transformers.

There are some papers suggesting that transformers are better than SSMs in fundamental ways (e.g. they cannot do arbitrary key-based recall from their context: https://arxiv.org/abs/2402.01032). This means switching over isn't a no-brainer.

espadrine · 2 years ago
Another element is that Mamba required a very custom implementation, down to custom fused kernels, which I expect would need to be ported to DeepSpeed or an equivalent library for a larger training run spanning thousands of GPUs.
gaogao · 2 years ago
It's a reasonably easy bet that Together is doing, or will do, a serious pretraining run with Mamba; if that's a success, other players might start considering it more.
whimsicalism · 2 years ago
> There are some papers suggesting that transformers are better than SSMs in fundamental ways

I mean, the vanilla transformers are also shown failing at the tasks they present.

whimsicalism · 2 years ago
we have no idea what the large players in the space are doing
danielmarkbruce · 2 years ago
Exactly this. Except there is zero chance they just looked at Mamba and went "meh, too new for us". People are definitely trying stuff. It takes a lot of fiddling around with a brand-new model architecture to get something working well. OpenAI aren't going to give a running commentary on the state of all the things they are looking into.
denial · 2 years ago
Something minor I always wonder about when I read Mamba is the discretization.

All of the sources I see referred to as derivations of it have a discretization of the form

h_t = A h_{t-1} + B x_{t-1} for the first line, instead of the given one of the form h_t = A h_{t-1} + B x_t.

Does anyone know why this is?

pama · 2 years ago
Not sure how much detail you need, but generally there exist implicit and explicit integrators for numerically solving (integrating) ODEs. The implicit ones, like the one used here, tend to be more stable. The ideas behind SSMs come from control theory, which used integrators with stability guarantees so that the rest of the neural network can focus on other aspects of the problem.
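A minimal illustration of the difference (my sketch, not from the paper), which is also why the input shows up as x_{t-1} in one form and x_t in the other:

    import numpy as np

    def explicit_euler_step(h_prev, x_prev, A, B, delta):
        # Forward (explicit) Euler evaluates the dynamics at the previous step:
        #   h_t = (I + delta*A) h_{t-1} + delta*B x_{t-1}
        I = np.eye(A.shape[0])
        return (I + delta * A) @ h_prev + delta * (B @ x_prev)

    def implicit_euler_step(h_prev, x_t, A, B, delta):
        # Backward (implicit) Euler evaluates them at the current step, so the
        # input enters as x_t and the update tends to be more stable:
        #   (I - delta*A) h_t = h_{t-1} + delta*B x_t
        I = np.eye(A.shape[0])
        return np.linalg.solve(I - delta * A, h_prev + delta * (B @ x_t))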
denial · 2 years ago
That's a helpful pointer. Thank you.
Der_Einzige · 2 years ago
Very annoying namespace conflict, since a package called "mamba" (a faster reimplementation of the Python conda package manager) had already existed for a while before this architecture was even dreamed up.

https://github.com/mamba-org/mamba

Beyond that, I'll care about an alternative to transformers when it shows superior performance with an open-source 7B-34B model compared to its transformer competitors. So far, this has not happened.

jasonjmcghee · 2 years ago
> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.
lpasselin · 2 years ago
The Mamba paper shows significant improvements at all model sizes, up to 1B, the largest one tested.

Is there any reason why it wouldn't scale to 7B or more? Have they tried it?

samus · 2 years ago
That's the issue - I keep hearing that it is beyond a small research group's budget to meaningfully train such a large model. You don't just need GPU time, you also need data. And just using the dregs of the internet doesn't cut it.
woadwarrior01 · 2 years ago
I use the former and have been experimenting with the latter. Fortunately, the contexts are separate enough that they never come up in the same sentence.
amelius · 2 years ago
I was using mamba to install mamba the other day, when suddenly I had to run from a live mamba.