So as far as I can tell, the biggest "bottleneck"/limiting factor with using FPGAs for LLMs is the available memory -- with current large models exceeding 40 GiB of parameters, GPUs and TPUs with large DRAM pools look like the only way forward for the months to come ... Thoughts?
An interesting twist is that this DRAM might not need to be a central pool where bandwidth must be shared globally -- e.g. the Tenstorrent strategy seems to be aiming for smaller chips that each have their own memory. Splitting up memory should yield very high aggregate bandwidth even with slower DRAM, which is great as long as they can figure out the cross-chip data flow to avoid networking bottlenecks.
That being said, the techniques discussed here are not totally irrelevant (yet). There is still hardware with fast float/int conversion instructions but no rsqrt, sqrt, pow, or log instructions, all of which can be approximated with this nice bit trick.
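For concreteness, here's a sketch of the best-known instance of that trick, the classic "fast inverse square root": reinterpreting a float's bits as an integer gives a cheap approximation of its log2, so integer shift-and-subtract against a magic constant approximates x^(-1/2), refined by one Newton step. (The function name and constants follow the well-known Quake III version; this is illustrative, not production code.)

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) using only integer ops on the float's bit pattern.
 * Works because the IEEE-754 bit pattern of a positive float is roughly a
 * scaled-and-offset log2(x), so (magic - bits/2) approximates -0.5*log2(x). */
static float fast_rsqrt(float x) {
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);           /* reinterpret float bits as integer */
    i = 0x5f3759dfu - (i >> 1);         /* shift/subtract ~ exponent arithmetic */
    memcpy(&y, &i, sizeof y);           /* back to float: rough 1/sqrt(x) */
    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson step (~0.2% error) */
    return y;
}
```

The same exponent-field arithmetic generalizes: sqrt is `bits/2 + magic`, and log/pow can be built from the raw bits-to-log2 mapping plus a polynomial correction, which is why hardware with only fast float/int moves can still do these cheaply.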