The FFT Strikes Back: An Efficient Alternative to Self-Attention

xeonmc · 6 months ago

Basically leverages convolution theorem[0]: expensive convolutions in direct space becomes simple multiplications in reciprocal space, and vice versa.

Whereever you have a convolution operation on your data, transform them to the conjugate domain to turn it into multiplication.

In other words, work in the domain that is natural to your data.

[0] https://en.wikipedia.org/wiki/Convolution_theorem

snickmy · 6 months ago

this is a great way to put it, that said, it was not obvious to me that the attention space (how it is structured in LLMs) is a frequency domain

pavelstoev · 6 months ago

I wrote down the following in the internal Slack chat on 01.06.2025, but of course, the performance of the actual effort is much more than writing it down.

Large language models (LLMs) operate in a high-dimensional token space, where tokens (words, subwords, or characters) can be viewed as discrete signals covering the multi-dimensional knowledge space. So FFT analysis methods can be applied to reduce time domain complexity to frequency domain representation with an idea to reduce computational complexity. So we can map token signals into the frequency domain. This transformation allows us to analyze token dynamics, such as their frequency of occurrence, temporal correlations, and interactions across contexts, with computational efficiency. In this approach, embeddings are treated as signals, and their relationships in sequence are captured as patterns in the frequency domain. FFT could be used to decompose token streams into dominant frequency components, revealing periodic or recurrent patterns in language usage - these patterns are repeatable across human generated knowledge and generally follow a predefined set of rules so the signals are not just white noise, they are predictable. By analyzing these frequency components, predictions of the next token can be made by emphasizing high-energy components in the frequency spectrum, reducing noise and focusing on statistically probable outcomes. Using this method we can reduce computational overhead during training and inference by enabling lightweight spectral analysis rather than heavy attention mechanisms, especially for long-context or repetitive sequences. Also using classical signal filtering techniques (LPF, HPF, band pass) could help align model behavior with human linguistic patterns, refine token embeddings, and improve efficiency in both training and inference phases.

evanb · 6 months ago

A cartoon:

To form a coherent idea you need to coordinate a lot of tokens. In other words, ideas are long-distance correlations between tokens. Ideas are the long-wavelength features of streams of tokens.

Is it exactly right? No. But as a cartoon it can motivate exploring an idea like this.

kkylin · 6 months ago

Exactly. Exploiting the structure of the matrix (e.g., it is well approximated by a circulant matrix) is natural if there is structure to exploit. If everything in the preprint holds up, that might suggest some symmetries (e.g., approximate stationarity in time) in the data at hand.

ttoinou · 6 months ago

Yeah basic math space transformation sandwich : 1) turn data into another space 2) operate in that space 3) transform back into original space. To optimize this, optimize each step and work as much as possible in the most efficient space

kridsdale1 · 6 months ago

Remains to be seen how lossy the transformations are. We lose a lot of data in DSP (aka “media”) doing too much.

ambicapter · 6 months ago

> In other words, work in the domain that is natural to your data.

Why would multiplication be more "natural" to a domain than convolution, as opposed to just simpler to calculate?

MajimasEyepatch · 6 months ago

As an example, if you're trying to "smooth out" some signal, what you're really trying to do is remove the high frequency components. So using a Fourier transform to convert it to the frequency domain lets you directly work with the frequency data, rather than indirectly in the time/space/whatever domain. The fact that the operation is simpler in the frequency domain is a good hint that you've picked the "natural" space in which to look at the problem. Of course, there's no formal definition of "naturalness," and at the end of the day, you get the same result either way.

ndriscoll · 6 months ago

One way to think about things is in terms of diagonalization. A generic linear operator is a fairly complicated thing that mixes information from different dimensions. When you diagonalize an operator, it's the same operator, but you're choosing a coordinate system where it becomes clear that all it was really doing was stretching along each coordinate axis, so you've broken it down into something that acts independently in a simple way on each dimension. The Fourier transform is unitary, so the intuition is that you're basically doing something like a rigid transformation of space (e.g. a rotation), and when you look from the correct angle, you see that your original operator wasn't some complicated "add a stretched version of this dimension to this other dimension minus a stretched version of a third dimension", but just "stretch this by this, this by this, etc."

On the other hand, convolution itself is already "just" multiplication. e.g. multiplying polynomials is convolution of their coefficients (to get the x^n coefficient, you need to add up all the combinations of a_i a_j x^i x^j where i+j=n), and this point of view also applies to e.g. linear time-invariant systems[0] by thinking of your function as the weights of an infinite weighted sum (so sort of an infinite polynomial) of time-shift operators (and this point of view works for other groups, not just time shifts). So f(t) is then the "coefficient" for the t-shift, and multiplying two such weighted sums again has you convolve the coefficients (so your original functions). The jargon way to say this is that your space of G-invariant functions is secretly the free algebra generated by G (G being a group). From that point of view, convolution is the "natural" multiplication on G-invariant functions. One can then ask whether there's a Fourier transform for other groups, which leads to abstract harmonic analysis. e.g. the Mellin transform is the Fourier transform for scale/stretch invariant functions as opposed to shift invariant.

[0] The typical systems that one studies in signal processing contexts where convolution and Fourier transforms are your bread and butter: https://en.wikipedia.org/wiki/Linear_time-invariant_system

ted_dunning · 6 months ago

Here is an example that shows how high dimensional data can be sparse (aka simple aka natural) in some projections. Since FFTs are just a reprojection of your data, this provides useful intuition.

https://bsky.app/profile/bsky.tdunning.com/post/3lgvuzuju3k2...

nimish · 6 months ago

> just simpler to calculate

That's all that "natural" means in this context. It's like "elegant" -- if it just takes less effort to get the result u need then why wouldn't you take the easier route?

amelius · 6 months ago

I think they just meant simpler to calculate.

jacksnipe · 6 months ago

Because multiplication gives rise to simpler algebras than convolution does.

bigmattystyles · 6 months ago

Is reciprocal space always just 1/space as in frequency=1/t?

cowsandmilk · 6 months ago

In the case of FFTs, no. Which is why I prefer the term Fourier space. I don’t like frequency domain because I frequently work with 3-D and 5–D FFTs while I’ve always felt frequency is connected to single dimension FFT.

rocqua · 6 months ago

Ite called reciprocal because the fourier transformation is it's own inverse, and the input and output space have the same 'shape' (functions from the reals to the complex numbers).

So they are considered two sides of the same coin. And reciprocal in that sense.

nh23423fefe · 6 months ago

yes. usually 1/space is often called wavenumber (k).

bjourne · 6 months ago

Yeah, but the savings are theoretical. You turn a O(n^2) operation into O(nlog n). Sounds great until you realize that n is three on average. To boot, you have to use complex numbers for calculations which are also less numerically stable. So, to the best of my knowledge, FFT is not a win for ordinary convolution.

Maybe for self-attention and for their use cases n is much larger, I didn't read the article. But you still have to deal with complex numbers.

feoren · 6 months ago

> You turn a O(n^2) operation into O(nlog n). Sounds great until you realize that n is three on average.

Sure, but are long convolutions avoided precisely because they're expensive? This paper is talking about an alternative to an attention mechanism, which covers the entire context window, no? Isn't this paper saying: you could use a long convolution for this instead, and long convolutions don't have to be slow?

> you have to use complex numbers for calculations which are also less numerically stable

I haven't heard numerical stability being a big deal in neural nets; in fact don't people often use 16-bit floats as weights to save on space? Does the numerical stability of complex numbers exceed the precision dropped off by quantization anyway? Are complex numbers really inherently less numerically stable, or are we just not as good at using them yet?

ToValueFunfetti · 6 months ago

3^2 / (3*log(3)) = >6x performance improvement and, if three is a linear average, I'd expect the average improvement to be even higher. I know real world computation doesn't answer to the simple scaling equations and there may well be a constant factor >6 that eats the gains, but I don't think the two Big Os and a n=3 are sufficient to make your case.

j2kun · 6 months ago

There are integer-only FFT analogues that enable the same tricks. Cf. https://en.wikipedia.org/wiki/Discrete_Fourier_transform_ove...

raxxorraxor · 6 months ago

It is true it isn't numerically stable and a FFT isn't entirely reversible. I think to get an idea about how frequency data relates to attention is to look at JPEG. Images tend to make understanding the concept easier. For JPEG a cosine transformation is used instead of a Fourier transformation, but the principle is the same.

adgjlsfhk1 · 6 months ago

fft is unitary so it has really good numerical stability

yagizdegirmenci · 6 months ago

Google introduced this idea in 2022 with "FNet: Mixing Tokens with Fourier Transforms" [0].

Later they found out that, performance of their TPU(s) for matrix multiplication was faster than FFT in the most scenarios.

[0]: https://arxiv.org/abs/2105.03824

TheDudeMan · 6 months ago

Referenced in this paper:

"Overall, while approaches such as FNet, Performer, and sparse transformers demonstrate that either fixed or approximate token mixing can reduce computational overhead, our adaptive spectral filtering strategy uniquely merges the efficiency of the FFT with a learnable, input-dependent spectral filter. This provides a compelling combination of scalability and adaptability, which is crucial for complex sequence modeling tasks."

And a comparison section after that.

light_hue_1 · 6 months ago

Except that the paper is written as if they discovered that you can use an fft for attention. They even have a "proof". It's in the title. Then you discover everyone already knew this and all they do is as some extra learnable parameters.

Pretty lame.

bee_rider · 6 months ago

That seems like an odd comparison, specialty hardware is often better, right?

Hey, do DSPs have special hardware to help with FFTs? (I’m actually asking, this isn’t a rhetorical question, I haven’t used one of the things but it seems like it could vaguely be helpful).

SJC_Hacker · 6 months ago

Xilinx has a very highly optimized core for the FFT. You are restricted to power of 2 sizes. Which usually isn't a problem because its fairly common to zero pad an FFT anyway to avoid highly aliased (i.e. hard-edges) binning.

The downside of implementing directly in hardware, the size would be fixed.

gct · 6 months ago

They usually have dedicated acceleration hardware, yes: https://www.ti.com/lit/an/sprabb6b/sprabb6b.pdf?ts=174057874...

mshockwave · 6 months ago

yes, almost all DSPs I know have native HW supports for FFT, since it's the bread and butter for signal processing

thijson · 6 months ago

I remember hearing about logic to help with deinterleaving the results of the butterfly network after the FFT is done.

westurner · 6 months ago

(Discrete) Fast Fourier Transform implementations:

https://fftw.org/ ; FFTW: https://en.wikipedia.org/wiki/FFTW

gh topic: fftw: https://github.com/topics/fftw

xtensor-stack/xtensor-fftw is similar to numpy.fft: https://github.com/xtensor-stack/xtensor-fftw

Nvidia CuFFTW, and/amd-fftw, Intel MKL FFTW

NVIDIA CuFFT (GPU FFT) https://docs.nvidia.com/cuda/cufft/index.html

ROCm/rocFFT (GPU FFT) https://github.com/ROCm/rocFFT .. docs: https://rocm.docs.amd.com/projects/rocFFT/en/latest/

AMD FFT, Intel FFT: https://www.google.com/search?q=AMD+FFT , https://www.google.com/search?q=Intel+FFT

project-gemmi/benchmarking-fft: https://github.com/project-gemmi/benchmarking-fft

"An FFT Accelerator Using Deeply-coupled RISC-V Instruction Set Extension for Arbitrary Number of Points" (2023) https://ieeexplore.ieee.org/document/10265722 :

> with data loading from either specially designed vector registers (V-mode) or RAM off-the-core (R-mode). The evaluation shows the proposed FFT acceleration scheme achieves a performance gain of 118 times in V-mode and 6.5 times in R-mode respectively, with only 16% power consumption required as compared to the vanilla NutShell RISC-V microprocessor

"CSIFA: A Configurable SRAM-based In-Memory FFT Accelerator" (2024) https://ieeexplore.ieee.org/abstract/document/10631146

/? dsp hardware FFT: https://www.google.com/search?q=dsp+hardware+fft

unraveller · 6 months ago

GPU saw a 10% improvement over the TPU

>The TPU is so inefficient at FTs that the researchers did not use the FFT algorithm on sequences < 4096 elements, instead opting for a quadratic-scaling FT implementation using a pre-computed DFT matrix.

> on an Nvidia Quadro P6000 GPU, the FT was responsible for up to 30% of the inference time on the FNet architecture [0]

This company [0] claimed in 2021 they could squash inference time by 40% if google would use their light chips on TPU. Perhaps more if FFTNet does more heavy lifting.

[0]: https://scribe.rip/optalysys/attention-fourier-transforms-a-...

hinkley · 6 months ago

I have been entertaining myself a bit lately by thinking about the ways in which some improvements to a design are very, very interesting to people when it takes 1.2 machines to do a task, not worth paying attention to when it's 6 machines to do the task, and suddenly very interesting again when it's 120 machines to do the task. There's that weird saddle point in the middle where I cannot get anyone else interested in my 20% resource improvements. It's just crickets.

Eridrus · 6 months ago

4096 tokens is pretty short by today's standards for transformers too.

frodo8sam · 6 months ago

I would guess that the FFT scales better as you increase the number of tokens in the context window. Interesting Google's models outperform their competitors on context size.

Daniel_Van_Zant · 6 months ago

I'm glad someone else had the same thought. I have been wondering what their "secret sauce" is for a while given how their model doesn't degrade for long-context nearly as much as other LLMs that are otherwise competitive. It could also just be that they used longer-context training data than anyone else though.

j2kun · 6 months ago

> faster than FFT

Not only that, but FFT support on TPU has always been best effort. Last I tried this, there were serious precision issues.

amelius · 6 months ago

Reminds me how CNNs were also not implemented using FFTs.

DrNosferatu · 6 months ago

But this would only work on a very limited number of tokens, right?

roflmaostc · 6 months ago

Reference for the later part?

yagizdegirmenci · 6 months ago

The section "3.3 Implementation" is mostly about hardware level speedups, which basically says:

On GPU(s) FFT is consistently faster, but in TPU(s), for shorter sequences matrix multiplication was faster.

markisus · 6 months ago

The Fourier transform is taken along the “token” dimension. However, in many applications, this dimension is not meaningful. That’s why transformers are a great option for consuming data which is permutation invariant.

I would like to see additional experiments using the lesser known Fourier transform over finite groups [1], which is permutation invariant but shares many properties with the standard Fourier transform.

I also wonder if this becomes the next big thing for LLMs, how easy will it be for inference engines(eg vLLM, llama.cpp) to integrate it?

[1] https://en.wikipedia.org/wiki/Fourier_transform_on_finite_gr...

3vidence · 6 months ago

Not an expert in this space.

Aren't tokens transformed with position dependent information in most models?

I believe llama applies a rotation to the vector based on the position in the input.

markisus · 6 months ago

That's true in the realm of LLMs. But even in this case, the position information is added only into the first layer. Tokens in later layers can choose to "forget" this information. In addition there are applications of transformers in other domains. See https://github.com/cvg/LightGlue or https://facebookresearch.github.io/3detr/

Y_Y · 6 months ago

What's the finite group in this case?

markisus · 6 months ago

I’m thinking the integers mod 2^n where n is something computers are good at (8, 32, 64). You have hardware support the group operation.

pointlessone · 6 months ago

OK, I admit that the math flies way over my head and I barely understand the text around the math. Can someone please explain (in basic English) how this is equivalent to attention mechanism? What friquencies does it talk about? How does it encode positional relations between tokens?

iNic · 6 months ago

- The Fourier Transform is an invertable operator (i.e. it acts on functions, in the case of matrices both functions and operators are themselves matrices). It transforms into what we call frequency space.

- This is most intuitive for signal analysis or images [1].

- Frequency space is inherently "complex", i.e. represented by complex numbers.

- Frequencies have the advantage that they take a "global" view of the problem.

- This mechanism is not equivalent to the attention mechanism. There is definitely a trade-off.

- But it is possible that it captures many of the important relationships that attention capture.

- I do not have good intuition for modReLU right away, but it seems important because it modifies the frequencies but preserves the inverse Fourier transform.

[1]: https://homepages.inf.ed.ac.uk/rbf/HIPR2/fourier.htm

xpe · 6 months ago

Worth noting that frequency space is often considered one dimensional. Adding the phase is what gives the other dimension.

jampekka · 6 months ago

modReLU seems to just increase the magnitude of the input value, and rotate it to the original polar angle. With clipping off negative magnitudes.

Or equivalently rotates a (real) bias term with the input angle and adds that into the original.

  (abs(z) + c)*exp(i*arg(z)) = abs(z)*exp(i*arg(z)) + c*exp(i*arg(z)) = z + c*exp(i*arg(z))

jampekka · 6 months ago

The actual mechanism at least is quite simple. Essentially it takes the FFT of the input embeddings, multiplies it elementwise with weights that are gotten from the input embeddings using an MLP (plus a constant (but learnable) bias) and then runs it through an activation function and finally takes the inverse FFT.

The "frequencies" are probably something quite abstract. FFT is often used in ways where there aren't really clear frequency interpretation. The use is due to convenient mathematical properties (e.g. the convolution theorem).

Rather amazing if this really works well. Very elegant.

xeonmc · 6 months ago

Essentially leveraging Convolution theorem. Same philosophy pops up in many places, e.g. DFT calculations

pointlessone · 6 months ago

I’m still confused. Does it treat the input tokens as a sampled waveform?

I mean, say I have some text file in ASCII. Do I then just pretend it’s raw wav and do FFT on it? I guess it can give me some useful information (like does it look like any particular natural language or is it just random; sometimes used in encrytion analysis of simple substitution cyphers). It feels surprising that revers FFT can get a coherent output after fiddling with the distribution.

agoose77 · 6 months ago

I am not an expert by _any_ means, but to provide _some_ intuition — self-attention is ultimately just a parameterised token mixer (see https://medium.com/optalysys/attention-fourier-transforms-a-...) i.e. each vector in the output depends upon the corresponding input vector transformed by some function of all the other input vectors.

You can see conceptually how this is similar to a convolution with some simplification, e.g. https://openreview.net/pdf?id=8l5GjEqGiRG

Convolutions are often used in contexts where you want to account for global state in some way. - https://openreview.net/pdf?id=8l5GjEqGiRG

yorwba · 6 months ago

I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.

If the results were close to state-of-the-art, probably the author would've mentioned it?

TheDudeMan · 6 months ago

They do show their model as winning every category in Long Range Arena (LRA) benchmark. Hopefully they have not excluded losing categories or better models.

yorwba · 6 months ago

Winning against their own baseline, not against the current best-performing model. Which apparently is S5 currently https://paperswithcode.com/sota/long-range-modeling-on-lra with 87.46 overall vs. 58.31 here.

wafngar · 6 months ago

Should be a relevant reference: https://arxiv.org/abs/2111.13587

Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro

johntb86 · 6 months ago

Does anyone have an intuition about why looking at things in the frequency domain is helpful here? The DC term I can see, but I wouldn't expect the input data is periodic enough that other frequencies would be meaningful.

semiinfinitely · 6 months ago

I see no mention of prior work Hyena Operator which already demonstrated O(n log n) full context mixing several years ago.

https://arxiv.org/abs/2302.10866

intalentive · 6 months ago

Hyena came out of Albert Gu's prior work in the same lab. https://arxiv.org/abs/2111.00396