Funnily enough, the code was deleted in the repo, but can still be seen in the commits. It's what you would expect from the paper :D
On the general topic of non-attention LLMs, I recommend checking out MesaNet [1], Rodimus [2], Gated DeltaNet [3], and Mamba2 [4]; they are currently SOTA.
However, I have yet to see a compelling non-attention-based model that achieves good performance on code, math, reasoning, or multi-turn QA tasks. I do not think we are getting rid of attention soon; the ability to look back over the full context seems crucial for those tasks.

[1] https://arxiv.org/abs/2506.05233
[2] https://arxiv.org/abs/2410.06577
[3] https://arxiv.org/abs/2412.06464
[4] https://arxiv.org/abs/2405.21060
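To make the "look back" point concrete, here is a toy sketch (mine, not from any of the papers above; names, shapes, and the scalar gate are illustrative only) contrasting a fixed-size recurrent state, as in the linear-recurrence/SSM family these models belong to, with attention's exact retrieval over the full history:

    import numpy as np

    d, T = 8, 16                      # hidden size, sequence length
    rng = np.random.default_rng(0)
    keys   = rng.normal(size=(T, d))
    values = rng.normal(size=(T, d))
    query  = rng.normal(size=d)

    # Linear-recurrent / state-space style: the whole history is compressed
    # into one d x d state, updated in constant memory per step.
    state = np.zeros((d, d))
    gate = 0.9                        # stand-in for a learned decay gate
    for k, v in zip(keys, values):
        state = gate * state + np.outer(v, k)   # delta-rule-like outer-product update
    recurrent_out = state @ query                # read-out sees only the compressed state

    # Attention: every past (key, value) pair is kept and re-weighted per query,
    # so any individual past token can be recovered exactly.
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    attention_out = weights @ values             # exact look-back over the full history

The recurrent path pays O(d^2) per step regardless of context length, but whatever the gate decays away is gone; attention pays O(T) per query and can always retrieve old tokens verbatim, which is roughly why it still wins on long-horizon code, math, and multi-turn QA.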