Bamba: An open-source LLM that crosses a transformer with an SSM

Love those GPQA scores hovering around 5% when chance (on 4-way multi-choice) would have got them 25%!

So could do better than chance by excluding the option it's picked?

gryfft · 4 months ago

A stopped clock is right twice a day, but a running clock set to the wrong time is always wrong.

cwt137 · 4 months ago

Not always true! Your statement is only true when the running clock's speed is the same as time. Thus, regular time and the clock's time will never meet.

If the clock is running faster than regular time, it will at point catch up to regular time and thus be correct for a split second. If the clock is slower than regular time, regular time will catch up to the clock and the clock will be right for a split second.

parrit · 4 months ago

The RMS of wrongness of the running clock is probably lower.

nthingtohide · 4 months ago

> a running clock set to the wrong time is always wrong.

Could be right within 15 min accuracy in the appropriate timezone. And such a mechanism can be corrected for in the postprocessing step.

dudeinhawaii · 4 months ago

or.. A stopped clock is right twice a day; a mis-prompted LLM is wrong 19 times out of 20—but only because we handed it the wrong instruction sheet.

Procedural error in testing perhaps? I'm not familiar with the methodology for GPQA.

This type of architecture is definitely the future. Unlimited attn is a dead end. As a human you don't need to scan an entire book just to guess what the next word will be and LLMs shouldn't need that either.

og_kalu · 4 months ago

Humans can re-attend to material whenever necessary (i.e you can just re-read a book, re-watch a documentary etc when you feel you have missed crucial context) so it's not the end of the world. These SSMs or modern RNNs can't and if crucial context has been discarded by the end of the query then well too bad. Transformers are of course always re-attending so not an issue for them either. Until that issue is resolved, i don't think attention will be going anywhere.

imtringued · 4 months ago

As you said. Transformers are using linear attention for each token. It's just that n times n is quadratic. There is no way around this other than by adding a separate token that indicates rerunning the SSM from the beginning. Then you have a dynamically scaling system that seamlessly switches between linear and quadratic complexity depending on the problem.

MLA is probably the closest thing that is in-between both.

quantadev · 4 months ago

Not be contrarian, but if the next word prediction happens to be someone's name or a place or something discussed multiple places in the book then often, yes, a knowledge of the full plot of the book is "required" just to predict the next word, as you get to the middle or end of a book.

For example you could never fill in the last chapter of any good book without having knowledge of every previous chapter. Not highly detailed knowledge, but still knowledge.

parrit · 4 months ago

What an LLM does is stuff it all into short term memory. Humans dump the first pages into long term memory and "make sense" of it. Humans have a massive context window because of this (and sheer brain size and efficiency).

tmalsburg2 · 4 months ago

Isn't this exactly the point of this model? No need to memorize everything (which makes transfomers expensive), just keep the relevant info. SSM are essentially recurrent models.

adt · 4 months ago

mh- · 4 months ago

SSM = state-space model, for the unfamiliar.

https://en.wikipedia.org/wiki/State-space_representation

jwilber · 4 months ago

LLM/state space models have been popular for some years now, see: https://arxiv.org/abs/2212.14052

More recently, hybrid architectures that utilize attention plus other operators are gaining traction.

See https://arxiv.org/abs/2503.01868

mentalgear · 4 months ago

> chose to make just about everything associated with Bamba open-source — the training recipes, the data, the data loader IBM designed for largescale distributed training, and a quantization framework aimed at shaving storage and inferencing costs.

cubefox · 4 months ago

Another recent transformer/SSM hybrid is "M1", with a more than 3x claimed inference speed-up compared to equivalent transformers: https://arxiv.org/pdf/2504.10449

IBM is claiming at least a 2x inference speed-up with Bamba. Both groups say that future SSM optimizations to vLLM would lead to further inference speed improvement.

bushbaba · 4 months ago

Wonder if the name is inspired by my favorite snack, bamba. The best are the hazelnut bamba.

Btw bamba if given to kids at a young age can drastically reduce the chance of peanut allergies

flaviolivolsi · 4 months ago

Bamba means cocaine in Italian. Better not to give it to kids

ericol · 4 months ago

Well, have you ever heard of the Mitsubishi Pajero? [1]

https://en.wikipedia.org/wiki/Mitsubishi_Pajero

visarga · 4 months ago

Let me show you the etymology of Bamba:

SSM (state space model) -> SSSM (structured state space model) -> (it's like a snake ssss...) Mamba -> Bamba

zaptrem · 4 months ago

Where does the B come from?

anentropic · 4 months ago

> they added another trillion tokens and shrank the model from 18 GB to 9 GB through quantization, reducing its bit width from Mamba2’s 16-bit floating-point precision to 8-bits.

This sounds like what they call "Bamba-9B" is actually an 18B model quantised to 8 bits.

I thought generally we were naming models "nB" by their number of params and treating quantisation as a separate concern. Are there any other models that instead treat the name as an indicative memory requirement?

Is this an attempt to hide that it fares poorly vs other ~18B parameter models?

EDIT: no, I just misunderstood

> This sounds like what they call "Bamba-9B" is actually an 18B model quantised to 8 bits.

No it doesn't? The fact that it is 18 GB with 16 bit per parameter before quantization means that it is a 9B parameter model.

Ah thanks, I see where I got confused now.

Yeah, that's confusing, but the HuggingFace page says it has 9.78 B parameters.

https://huggingface.co/ibm-ai-platform/Bamba-9B-fp8

jmward01 · 4 months ago