Interestingly, there is a cuBLAS 13.1 whl on PyPI; I'm not sure what that does.
For example, Alex Graves's (great, and with attention!) 2013 paper "Generating Sequences with Recurrent Neural Networks" has this line:
One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
with this footnote:
In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
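In modern PyTorch terms, a rough sketch of that trick could be a backward hook that clamps the gradient flowing into the pre-activations (my illustration, not Graves's code; the clipping range here is made up):

import torch

clip = 1.0  # the "predefined range" is a hyperparameter; the value here is arbitrary
x = torch.randn(8, 16, requires_grad=True)
W = torch.randn(16, 32)
pre_act = x @ W  # stand-in for the inputs to an LSTM layer, before sigmoid/tanh
pre_act.register_hook(lambda g: g.clamp(-clip, clip))  # clip dLoss/d(pre_act) during backprop
loss = torch.tanh(pre_act).sum()
loss.backward()
print(x.grad.abs().max())  # gradients further upstream see the clipped values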
That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.
In Thunder[1], a PyTorch-to-Python JIT compiler for optimizing DL models, we maintain a bytecode interpreter covering Python 3.10-3.12 (and soon 3.13) for our JIT. That allows us to run Python code while redirecting arbitrary function calls and operations, but it is quite a bit slower than CPython.
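For context, roughly how it is used (a minimal sketch; exact entry points and behavior may differ between releases):

import torch
import thunder

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
jit_model = thunder.jit(model)  # interprets the Python code and builds an optimized trace
out = jit_model(torch.randn(4, 16))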
While the bytecode changes between versions (and sometimes it is a back-and-forth, for example in the call handling), it is totally fine once you embrace that there will be differences between Python versions.
What has been a large change is the new zero-cost (in the happy path) exception handling, but I can totally see why Python changed to that from setting up try-block frames.
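You can see the difference with the standard dis module (this assumes CPython 3.11 or newer; on 3.10 the same function compiles to SETUP_FINALLY-style bytecode instead):

import dis

def f():
    try:
        return 1 / 0
    except ZeroDivisionError:
        return 0

dis.dis(f)  # on 3.11+ the listing ends with an "ExceptionTable:" section,
            # which is only consulted when an exception is actually raised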
I will say that I was happy not to have to support Python <= 3.9, as the changes were a lot more involved there (the bytecode format itself, etc.).
Of course, working on this also means knowing otherwise useless Python trivia afterwards. One of my favorites is how this works:
l = [1, 2, 3]
l[-1] += l.pop()
print(l)
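(Spoiler, for the curious: the subscript l[-1] is read before l.pop() runs and shrinks the list, so 3 + 3 is computed and 6 is stored back into the now-last slot, printing [1, 6]. The standard dis module shows the ordering; opcode names vary a bit across the versions mentioned above.)

import dis

# BINARY_SUBSCR (the read of l[-1]) appears before the call to l.pop(),
# and STORE_SUBSCR (the write) comes last, after the list has shrunk.
dis.dis("l[-1] += l.pop()")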
1. https://github.com/Lightning-AI/lightning-thunder/

Additionally, since I'm streaming the LLM response, it won't take long to get your reply. Because it works a chunk at a time, there are occasionally only partial words spoken for a moment. How long you need to wait also depends, of course, on which model you use and on the context size.
Is it just me (I'm far from an expert here) or is this code really weird? Why does ymm_one appear to contain the number zero? Why do we subtract what looks like it should be the inner product we want from ymm_one at the end?