I'm wondering what the higher convolution levels could look like, if this were a CNN analyzing an image. Something between the complete Ableton/Logic export and a MIDI file. Being able to capture the "feel" of a song (or a section within a song) strikes me as an important milestone towards designing really good generative music.
How do you think about backtesting? There are a few short-only shops that specialize in finding frauds. If you get their historical 13-Fs, how would you score against them in terms of precision/recall?
And I guess more broadly, how does your system's alpha compare to a portfolio that holds all the short positions reported by big long/short funds (excluding thematic shorts)? Meaning, those guys have full-time humans who focus on this... can you beat them? Very interesting if so.
Benzinga / Motley Fool / Seeking Alpha / Business Wire / Forbes aren't places to find worthwhile information.
I.e. "this blog mentioned NYSE:TEVA, and the next day the stock moved materially, therefore site_ranking++". (You'd probably have some TF/IDF saliency metric too, so that a site that mentions all stocks is penalized.)
(If Tesla had to be “a better car that’s also electric”, I think this would need to be “a better TV that’s also private”.)
* FlashAttention: In my experience, the current best solution for n² attention, but it's very hard to scale up beyond the low tens of thousands of tokens. Memory use is O(n) but compute is O(n²) (see the sketch at the end of this comment). Code: https://github.com/HazyResearch/flash-attention
* Heinsen Routing: In my experience, the current best solution for n×m attention, i.e., mapping n tokens to m tokens. It's like a souped-up version of attention. I've used it to pull up more than a million tokens as context. Memory use and compute are O(nm). It works, but in my (limited) experience, it doesn't work out-of-the-box as well as FlashAttention for n² attention. Code: https://github.com/glassroom/heinsen_routing
* RWKV: A sort-of-recurrent model which claims to have performance comparable to n² attention in transformers. In my (limited) experience, it doesn't. Others seem to agree: https://twitter.com/arankomatsuzaki/status/16390003799784038... . Code: https://github.com/BlinkDL/RWKV-LM
* RMT (this method): I'm skeptical that the recurrent connections will work as well as n² attention or n×m routing in practice, but I'm going to give it a try. Code: https://github.com/booydar/t5-experiments/tree/scaling-repor...
In addition, the group that developed FlashAttention is working on state-space models (SSMs) that look promising to me. The idea is to approximate n² attention dynamically using only O(n log n) compute. There's no code available, but here's a blog post about it: https://hazyresearch.stanford.edu/blog/2023-03-27-long-learn... [CORRECTION: Code is available. See comment by lucidrains below. I'm hopeful this will go to the top of my list.]
If anyone here has other suggestions for working with long sequences (hundreds of thousands to millions of tokens), I'd love to learn about them.
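To make the FlashAttention memory point above concrete: the sketch below isn't the flash-attention repo's own API, just PyTorch >= 2.0's built-in scaled_dot_product_attention, which can dispatch to a FlashAttention-style fused kernel when one is available. Sizes are arbitrary.

    # Sketch only: PyTorch's built-in SDPA as a stand-in for a FlashAttention-style kernel.
    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    batch, heads, n, d = 1, 8, 4096, 64

    q = torch.randn(batch, heads, n, d, device=device, dtype=dtype)
    k = torch.randn(batch, heads, n, d, device=device, dtype=dtype)
    v = torch.randn(batch, heads, n, d, device=device, dtype=dtype)

    # Naive attention materializes an n x n score matrix: memory and compute are both O(n^2).
    # Push n toward the tens of thousands and this is the part that blows up.
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5    # shape (batch, heads, n, n)
    naive_out = torch.softmax(scores, dim=-1) @ v

    # Fused attention computes the same thing blockwise, never materializing the
    # n x n matrix: memory stays O(n), compute is still O(n^2).
    fused_out = F.scaled_dot_product_attention(q, k, v)

    print(torch.allclose(naive_out, fused_out, atol=1e-2))

(The n×m cross-attention case is the same call with k and v of length m, which is where the O(nm) figure comes from.)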