If you want to put this to the test, try formulating a React component with autocomplete as a "math problem". Good luck.
(I studied maths, in case anyone is wondering where my beliefs come from; I actually used to think in maths while programming for a long time.)
It seems quite mad that we even need to debate this. Wind is free power, and we have at least 2,000 years of engineering experience to draw on for how to use it.
Any propulsion unit needs to be effectively attached to a ship. Screws are attached longitudinally, low down, and push. Sails are a bit more tricksy. A triangular sail mounted along the long axis will generally work best because it can handle more wind angles, but a square sail mounted across the long axis will provide more power on a "reach" to a "run" (the wind is mostly from behind, so pushing).
The cutting edge of sailing ships that carried stuff was the tea clipper. Think "Cutty Sark", which is now a visitor attraction in Greenwich, London. Note the staysails, the triangular sails at the front. Then note the three masts. Each mast has several main sails that are huge rectangles for "reaches", plus additional extensions. There are even more triangular infill sails above the main sails.
It's quite hard to explain how wind and sails work, but you need to understand that a sailing ship can sail "into the wind". Those triangles are better at it than those rectangles, but those rectangles can get more power by being bigger. Even better, you can use the front triangular sails (the staysails) to moderate the wind and feed the other sails with less turbulent air.
Wind is free power and it is so well understood. How on earth is this news?
I am aware I could Google it all or ask an LLM, but I'm still interested in a good human explanation.
- apply learned knowledge from its parameters to every part of the input representation ("tokenized", i.e., chunkified text).
- apply mixing of the input representation with other parts of itself. This is called "attention" for historical reasons. The original attention mixes (roughly) every token (say there are N of them) with every other token, so we pay a compute cost that scales with N squared.
The attention cost therefore grows quickly in both compute and memory as the input or conversation becomes long (for instance, when it includes whole documents).
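To make the quadratic part concrete, here is a minimal NumPy sketch (toy sizes, single head, no batching, purely illustrative): the score matrix that mixes every token with every other token has shape N x N, so doubling the input length quadruples the work.

```python
import numpy as np

# Toy single-head attention, just to show where the N^2 cost lives.
# N = number of tokens, d = representation size (made-up numbers).
N, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d))  # queries, one row per token
K = rng.standard_normal((N, d))  # keys
V = rng.standard_normal((N, d))  # values

scores = Q @ K.T / np.sqrt(d)    # shape (N, N): every token vs. every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
mixed = weights @ V              # shape (N, d): mixed token representations

print(scores.shape)  # (1024, 1024) -- this matrix is the quadratic cost
```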
Reducing the quadratic part to something cheaper is a very active field of research, but so far it has been rather difficult because, as you can readily see, it means giving up on mixing every part of the input with every other part.
Most of the time, mixing token representations that are close to each other matters more than mixing those that are far apart, but not always. That's why there are many attempts now to do away with most of the quadratic attention layers while keeping some.
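As a hedged illustration of the "mostly local" idea, here is a sketch of sliding-window attention, where each token only mixes with tokens at most `window` positions away, so the cost is roughly N times the window size instead of N squared. The window size and the loop-based layout are illustrative choices, not any particular model's recipe.

```python
import numpy as np

def local_attention(Q, K, V, window=64):
    """Each token attends only to tokens within `window` positions of itself,
    so the cost is roughly N * window instead of N * N (illustrative sketch)."""
    N, d = Q.shape
    out = np.empty_like(V)
    for i in range(N):
        lo, hi = max(0, i - window), min(N, i + window + 1)
        scores = Q[i] @ K[lo:hi].T / np.sqrt(d)   # at most 2*window + 1 scores
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ V[lo:hi]
    return out

rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(local_attention(Q, K, V).shape)  # (1024, 64), with no N x N matrix anywhere
```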
What to do during mixing once you give up all-to-all attention is the big research question, because many approaches seem to behave well only under some conditions, and we haven't yet established anything as good and versatile as all-to-all attention.
If you forgo all-to-all, you also open up many options (e.g. all-to-something followed by something-to-all as a pattern, where the "something" serves as a sort of memory or state that summarizes all inputs at once; you can imagine that summarizing all inputs this way is a lossy abstraction, though).
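Here is a hedged sketch of that all-to-something / something-to-all pattern: M "memory" slots (with M much smaller than N) first attend over all tokens to build a summary, then every token attends back over those M slots. Each step costs roughly N*M instead of N^2, and the M-slot summary is exactly the lossy abstraction mentioned above. The slot count and the two-step layout are illustrative assumptions, not a specific published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_mixing(tokens, memory):
    """All-to-something followed by something-to-all (illustrative sketch):
    M memory slots summarize the N tokens, then the tokens read the summary back.
    Each step costs ~N*M instead of N*N, at the price of a lossy summary."""
    d = tokens.shape[1]
    # all -> something: the M slots attend over all N tokens
    summary = softmax(memory @ tokens.T / np.sqrt(d)) @ tokens     # (M, d)
    # something -> all: every token attends over the M slots
    return softmax(tokens @ summary.T / np.sqrt(d)) @ summary      # (N, d)

rng = np.random.default_rng(0)
N, M, d = 1024, 16, 64                 # M << N
tokens = rng.standard_normal((N, d))
memory = rng.standard_normal((M, d))   # stand-in for learned memory/state slots
print(bottleneck_mixing(tokens, memory).shape)  # (1024, 64)
```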
I bet there's a good chance of getting some wacky extremophiles, though!
Hot spring baths usually top out around 42-43°C.
Though it is notable that, contrary to many predictions (on HN and Twitter) that Meta would stop publishing papers and become like other AI labs (e.g. OpenAI), they've continued their rapid pace of releasing papers AND open-source models.
MSL is not just those few high-profile hires.