Posted by u/taiboku256 a year ago
Ask HN: Is anybody building an alternative transformer?
Curious if anybody out there is trying to build a new model/architecture that would succeed the transformer?

I geek out on this subject in my spare time. Curious if anybody else is doing so and if you're willing to share ideas?

czhu12 · a year ago
The Mamba [1] model gained some traction as a potential successor. It's basically an RNN without the nonlinearity applied across hidden states, which makes the recurrence associative, so it can be evaluated with a parallelizable scan [2] in logarithmic depth instead of strictly sequential, linear time.

It promises much faster inference with much lower compute costs, and I think up to 7B params, performs on par with transformers. I've yet to see a 40B+ model trained.

The researchers behind Mamba went on to start a company called Cartesia [3], which applies Mamba to voice models.

[1] https://jackcook.com/2024/02/23/mamba.html

[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- Pulled up a random example from google, but Stanford CS149 has an entire lecture devoted to parallel scan.

[3] https://cartesia.ai/
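A minimal sketch of the scan trick (my own toy code, not from the Mamba paper): a linear recurrence h[t] = a[t] * h[t-1] + b[t] has no nonlinearity between steps, so steps compose associatively, which is exactly the shape a parallel prefix scan needs to evaluate the whole sequence in O(log n) depth instead of n sequential steps.

```python
# Composing two steps: applying (A, B) then (a, b) to a state h gives
# a * (A * h + B) + b = (A * a) * h + (a * B + b), i.e. another step pair.
def combine(left, right):
    A, B = left
    a, b = right
    return (A * a, a * B + b)

def scan_sequential(a, b, h0=0.0):
    """Reference: the plain step-by-step recurrence."""
    hs, h = [], h0
    for at, bt in zip(a, b):
        h = at * h + bt
        hs.append(h)
    return hs

def scan_associative(a, b, h0=0.0):
    """Same states via the associative combine; written as a left fold here,
    but because `combine` is associative the combines can be regrouped into
    a parallel tree (that regrouping is the parallel prefix scan)."""
    hs, acc = [], (1.0, h0)
    for at, bt in zip(a, b):
        acc = combine(acc, (at, bt))
        hs.append(acc[1])
    return hs

a = [0.5, 0.9, 0.1, 0.7]
b = [1.0, -2.0, 0.3, 0.0]
assert all(abs(x - y) < 1e-12
           for x, y in zip(scan_sequential(a, b), scan_associative(a, b)))
```

With a nonlinearity (e.g. tanh) between steps, the composition above no longer closes into a (multiplier, offset) pair, which is why ordinary RNNs can't be scanned this way.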

kla-s · a year ago
Jamba 1.5 Large is 398B params (94B active) and weights are available.

https://arxiv.org/abs/2408.12570

Credit https://news.ycombinator.com/user?id=sanxiyn for making me aware

imtringued · a year ago
Mamba isn't really a competitor to transformers. Quadratic attention exists for a reason.

Mamba's strengths lie in being a better RNN as you said. Mamba is probably better than transformers for things like object permanence over a sequence of inputs, where each input is an image, for example.

However, it would still make sense for a transformer to process each image by cutting it into patches and performing quadratic attention over them, and then feed the transformer's output into Mamba to produce the actual output (e.g. a robot action) while maintaining object permanence.

monroewalker · a year ago
Oh that would be awesome for that to work. Thanks for sharing
stavros · a year ago
If I'm not misremembering, Mistral released a model based on MAMBA, but I haven't heard much about it since.
bravura · a year ago
Check out "Attention as an RNN" by Feng et al (2024), with Bengio as a co-author. https://arxiv.org/pdf/2405.13956

Abstract: The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention’s many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks while being more time and memory-efficient.
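Point (1) of the abstract can be illustrated in a few lines (my own toy code, not the paper's Aaren module): softmax attention over a growing set of keys and values can be computed as a recurrence with constant memory, by carrying a running max (for numerical stability), a running weighted sum of values, and a running normalizer.

```python
import math

def attention_rnn_step(state, q, k, v):
    """Fold one (key, value) pair into a constant-size state."""
    m, num, den = state                       # running max, numerator, denominator
    s = sum(qi * ki for qi, ki in zip(q, k))  # dot-product score
    m_new = max(m, s)
    scale = math.exp(m - m_new)               # rescale previously folded terms
    w = math.exp(s - m_new)                   # weight of the new pair
    num = [n * scale + w * vi for n, vi in zip(num, v)]
    den = den * scale + w
    return (m_new, num, den)

def attention_recurrent(q, ks, vs):
    d = len(vs[0])
    state = (-math.inf, [0.0] * d, 0.0)
    for k, v in zip(ks, vs):
        state = attention_rnn_step(state, q, k, v)
    _, num, den = state
    return [n / den for n in num]

def attention_batch(q, ks, vs):
    """Standard one-shot softmax attention, for comparison."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
    mx = max(scores)
    ws = [math.exp(s - mx) for s in scores]
    z = sum(ws)
    d = len(vs[0])
    return [sum(w * v[j] for w, v in zip(ws, vs)) / z for j in range(d)]

q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
assert all(abs(x - y) < 1e-9
           for x, y in zip(attention_recurrent(q, ks, vs),
                           attention_batch(q, ks, vs)))
```

Appending a new token only touches the fixed-size state, which is the "updated efficiently with new tokens, constant memory" property the abstract contrasts with standard Transformer inference.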

jmward01 · a year ago
I have an internal repo that does guided window attn. I figured out that One Weird Trick to get the model to learn how to focus so that you can move a fixed window around instead of full attn. I also built NNMemory (but that appears to be an idea others have had now too [1]) and I have a completely bonkers mechanism for non-deterministic exit logic so that the model can spin until it thinks it has a good answer. I also built scale-free connections between layers to completely remove residual connections. Plus some crazy things on sacrificial training (adding parameters that are removed after training in order to boost training performance with no prod penalty). There are more crazy things I have built but they aren't out there in the wild, yet. Some of the things I have built are in my repo. [2] I personally think we can get 0.5b models to outperform 8b+ SOTA models out there today (even the reasoning models coming out now).

The basic transformer block has been good at kicking things off, but it is now holding us back. We need to move to recurrent architectures again and switch to fixed guided attn windows + 'think' only layers like NNMemory. Attn is distracting and we know this as humans because we often close our eyes when we think hard about a problem on the page in front of us.

[1] https://arxiv.org/abs/2502.06049

[2] https://github.com/jmward01/lmplay

nextos · a year ago
The xLSTM could become a good alternative to transformers: https://arxiv.org/abs/2405.04517. On very long contexts, such as those arising in DNA models, these models perform really well.

There's a big state-space model comeback initiated by the S4-to-Mamba line of work. RWKV, which is a hybrid between classical RNNs and transformers, is also worth mentioning.

bob1029 · a year ago
I was just about to post this. There was a MLST podcast about it a few days ago:

https://www.youtube.com/watch?v=8u2pW2zZLCs

Lots of related papers referenced in the description.

RossBencina · a year ago
One claim from that podcast was that the xLSTM attention mechanism is (in practical implementation) more efficient than (transformer) flash attention, and therefore promises to significantly reduce the time/cost of test-time compute.
PaulHoule · a year ago
Personally I think foundation models are for the birds: the cost of developing one is immense, and the time involved is so great that you can't do many run-break-fix cycles, so you will get nowhere on a shoestring. (Though maybe you can get somewhere on simple tasks and synthetic data.)

Personally I am working on a reliable model trainer for classification and sequence labeling tasks that uses something like ModernBERT at the front end and some kind of LSTM on the back end.

People who hold court on machine learning forums will swear by fine-tuned BERT and similar things, but they are not at all interested in talking about the reliable bit. I've read a lot of arXiv papers where somebody tries to fine-tune a BERT for a classification task, runs it with some arbitrarily chosen hyperparameters taken from another paper, and it sort-of works some of the time.

It drives me up the wall that you can't use early stopping for BERT fine-tuning the way I've been using it on neural nets since 1990 or so. And if I believe what I'm seeing, the networks I've been using for BERT fine-tuning don't really benefit from training sets with more than a few thousand examples, emphasis on the "few".

My assumption is that everybody else is going to be working on the flashy task of developing better foundation models and as long as they emit an embedding-per-token I can plug a better foundation model in and my models will perform better.
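For concreteness, the early-stopping loop meant above is nothing exotic. Here is a framework-agnostic sketch (`train_one_epoch` and `evaluate` are hypothetical stand-ins for whatever the fine-tuning code provides):

```python
def fit_with_early_stopping(train_one_epoch, evaluate,
                            max_epochs=50, patience=3):
    """Stop when validation loss hasn't improved for `patience` epochs;
    return the best epoch index and its validation loss."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_loss:
            best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
            # a real trainer would checkpoint the weights here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss

# Toy usage: a "validation loss" that improves and then overfits.
losses = iter([1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.2])
epoch, loss = fit_with_early_stopping(lambda: None, lambda: next(losses))
```

The toy run stops three epochs after the 0.7 minimum and reports epoch 2 as best, rather than training all 50 epochs.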

mindcrime · a year ago
> Personally I think foundation models are for the birds,

I might not go quite that far, but I have publicly said (and will stand by the statement) that I think that training progressively larger and more complex foundation models is a waste of resources. But my view of AI is rooted in a neuro-symbolic approach, with emphasis on the "symbolic". I envision neural networks not as the core essence of an AI, but mainly as just adapters between different representations that are used by different sub-systems. And possibly as "scaffolding" where one can use the "intelligence" baked into an LLM as a bridge to get the overall system to where it can learn, and then eventually kick the scaffold down once it isn't needed anymore.

tlb · a year ago
We learned something pretty big and surprising from each new generation of LLM, for a small fraction of the time and cost of a new particle accelerator or space telescope. Compared to other big science projects, they're giving pretty good bang for the buck.
PaulHoule · a year ago
I can sure talk your ear off about that one as I went way too far into the semantic web rabbit hole.

Training LLMs to use 'tools' of various types is a great idea, as is running them inside frameworks that check that their output satisfies various constraints. Still, certain problems remain: the NP-complete nature of SAT solving (many intelligent-systems problems, such as word problems you'd expect an A.I. to solve, boil down to SAT solving), the halting problem, Gödel's theorem, and so on. I understand Doug Hofstadter has softened his positions lately, but I think many of the problems set up in this book

https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach

(particularly the Achilles & Tortoise dialog) still stand today, as cringey as that book seems to me in 2025.

dr_dshiv · a year ago
Good old fashioned AI, amirite
NewUser76312 · a year ago
Yeah I've been wondering how one can contribute and build in the LLM and AI world without the resources to work on foundation models.

Because personally I'm not a product/GPT wrapper person - it just doesn't suit my interests.

So then what can one do that's meaningful and valuable? Probably something around finetuning?

happytoexplain · a year ago
I hate that popular domains take ownership of highly generic words. Many years ago, I struggled for a while to understand that when people say "frontend" they often mean a website frontend, even without any further context.
perrygeo · a year ago
The worst offender is "feature". In my domain (ML and geo) we have three definitions.

Feature could be referring to some addition to the user-facing product, a raster input to machine learning, or a vector entity in GeoJSON. Context is the only tool we have to make the distinction; it gets really confusing when you're working on features that involve querying the features with features.

janalsncm · a year ago
You can say the same thing about “model” even in ML. Depending on the context it can be quite confusing:

1) an architecture described in a paper

2) the trained weights of a specific instantiation of architecture

3) a chunk of code/neural net that accomplishes a task, agnostic to the above definitions

aqueueaqueue · a year ago
Inference has 2 meanings too
ArthurStacks · a year ago
That has been the case for about 30 years
solresol · a year ago
I tried... it started with the idea that log loss might not be the best option for training, and that maybe the loss should reflect how wrong the predicted word was. Predicting "dog" instead of "cat" should be penalised less than predicting "running".

That turns out to be an ultrametric loss, and the derivative of an ultrametric loss is zero in a large region around any local minimum, so it can't be trained by gradient descent -- it has to be trained by search.

Punchline: it's about one million times less effective than a more traditional architecture. https://github.com/solresol/ultratree-results
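A toy version of the loss (my own illustration, not code from the linked repo) shows both the idea and why gradient descent fails: the penalty depends only on how deep in a word taxonomy the prediction and target diverge, so it is piecewise constant, and a small nudge to the model parameters usually doesn't change the argmax prediction at all, giving a zero derivative almost everywhere.

```python
# Hypothetical tiny taxonomy: each word's path from the root.
TAXONOMY = {
    "cat":     ["entity", "animal", "mammal", "cat"],
    "dog":     ["entity", "animal", "mammal", "dog"],
    "sparrow": ["entity", "animal", "bird", "sparrow"],
    "running": ["entity", "action", "running"],
}

def ultrametric_loss(predicted, target, depth=4):
    """Loss = tree depth minus the length of the shared path prefix."""
    shared = 0
    for a, b in zip(TAXONOMY[predicted], TAXONOMY[target]):
        if a != b:
            break
        shared += 1
    return depth - shared

# "dog" for "cat" diverges only at the leaf (loss 1); "running" for
# "cat" diverges right below the root (loss 3).
assert ultrametric_loss("dog", "cat") < ultrametric_loss("running", "cat")
assert ultrametric_loss("cat", "cat") == 0
```

Because the loss only takes a handful of discrete values, there is no slope to follow between them, hence the need for search instead of gradient descent.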

janalsncm · a year ago
There are alternatives that optimize around the edges. Like Deepseek’s Multi-head Latent Attention, or Grouped Query Attention. DeepSeek also showed an optimization on Mixture of Experts. These are all clear improvements to the Vaswani architecture.

There are optimizations like extreme 1.58 bit quant that can be applied to anything.

There are architectures that stray farther, like SSMs and some attempts at bringing the RNN back from the dead. And even text diffusion models that try to generate paragraphs the way we generate images, i.e. not word by word.
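The "1.58 bit" name comes from ternary weights (as in BitNet b1.58): each weight is stored as one of {-1, 0, +1} times a shared scale, i.e. log2(3) ≈ 1.58 bits per weight. A simplified sketch of the round-to-ternary step (my own toy version; real implementations quantize per-tensor during training):

```python
def ternary_quantize(weights):
    """Absmean scaling, then round each weight into {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct the (lossy) approximation of the original weights."""
    return [qi * scale for qi in q]

w = [0.31, -0.02, -0.95, 0.48]
q, s = ternary_quantize(w)
# q contains only -1, 0, and 1; w is approximated by dequantize(q, s)
assert all(qi in (-1, 0, 1) for qi in q)
```

The payoff is that matrix multiplies against ternary weights reduce to additions and subtractions, which is why this class of quantization can, in principle, be applied on top of almost any architecture.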

dr_dshiv · a year ago
Mixture of depths, too.