Einops looks nice! It reminds me of https://github.com/deepmind/einshape which is another attempt at unifying reshape, squeeze, expand_dims, transpose, tile, flatten, etc under an einsum-inspired DSL.
Somebody also realized that much of the time you can use one single function to describe all 3 of the einops operations. I present to you, einop: https://github.com/cgarciae/einop
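Roughly, the idea can be sketched like this: the pattern alone tells you which of the three operations is meant. This is a toy illustration in numpy, not einop's actual implementation (its real API and pattern grammar are richer):

```python
import numpy as np

def toy_einop(x, pattern, **sizes):
    # Toy sketch: axes that disappear mean reduce, axes that appear mean
    # repeat, and a pure permutation means rearrange.
    lhs, rhs = (side.split() for side in pattern.split('->'))
    if set(rhs) < set(lhs):                       # axes vanish -> reduce (sum)
        gone = tuple(i for i, ax in enumerate(lhs) if ax not in rhs)
        x = x.sum(axis=gone)
        lhs = [ax for ax in lhs if ax in rhs]
    if set(lhs) < set(rhs):                       # axes appear -> repeat
        for ax in rhs:
            if ax not in lhs:
                x = np.broadcast_to(x[..., None], x.shape + (sizes[ax],))
                lhs = lhs + [ax]
    return x.transpose([lhs.index(ax) for ax in rhs])  # rearrange

x = np.arange(24).reshape(2, 3, 4)                     # axes: b c h
assert toy_einop(x, 'b c h -> b h c').shape == (2, 4, 3)   # rearrange
assert toy_einop(x, 'b c h -> b h').shape == (2, 4)        # reduce
y = np.arange(6).reshape(2, 3)
assert toy_einop(y, 'b c -> b c r', r=5).shape == (2, 3, 5)  # repeat
```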
It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but, it provides all the tools needed for humans to express the code in a way that a good compiler can turn it into really good code.
Not OP, but from a cursory glance at the code, it seems to be achieved by splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), using tricks like switching indexes to get better cache locality, and using SIMD plus fused multiply-add to speed things up further.
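The blocking trick can be sketched like this (in deliberately slow pure numpy, just to show the access pattern; the real code does this in hand-tuned C with SIMD and FMA):

```python
import numpy as np

def blocked_matmul(A, B, bs=64):
    # Cache blocking sketch: work on bs x bs tiles so each tile of A, B,
    # and C stays resident in L1/L2 while it is reused.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):  # accumulate over the shared axis
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((128, 96)), rng.standard_normal((96, 80))
assert np.allclose(blocked_matmul(A, B, bs=32), A @ B)
```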
I'd really like to use einsum more often, because it allows me to code my expressions the same way I derive them on pen and paper. Unfortunately, as mentioned in the article, it's slow, because it converts your formula to a for loop.
So usually, I rewrite my formulas into messy combinations of broadcasts, transposes and array multiplications. Is there a package or an algorithm that does this conversion automatically? It seems to be a pretty straightforward problem, at least for most expressions I use.
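To make the kind of rewrite concrete, here is a hypothetical example (a per-batch bilinear form, names mine): the einsum matches the pen-and-paper formula, while the manual version is a chain of matmul, broadcast, and reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))   # batch of vectors x_b
W = rng.standard_normal((3, 4))
y = rng.standard_normal((8, 4))   # batch of vectors y_b

# The formula as derived on paper: out_b = sum_ij x_bi W_ij y_bj
out_einsum = np.einsum('bi,ij,bj->b', x, W, y)

# The same thing as a "messy" chain of matmul, broadcast, and reduce:
out_manual = ((x @ W) * y).sum(axis=1)

assert np.allclose(out_einsum, out_manual)
```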
The Tullio library in Julia is a pretty fantastic option for Einstein summation. Its performance is great, it generates CUDA kernels, and it does some clever tricks for automatic differentiation. It's also a bit more readable than numpy's einsum function, since you write the Einstein-index expression directly.
It's highly unlikely that the performance will be great on GPU if it generates its own kernels. cuTENSOR has been hand-optimized to do exactly these kinds of operations in a way that automatically emitted kernels can't match.
Just use C or C++ or Fortran or Julia or even Lua.
The issue of "slow loops" is entirely self-inflicted by some languages. It's getting ridiculous to still worry about this shit in 2022.
Simple loops can and should be just as fast as vectorized programs. When they are slower, it is 100% due to deliberate decisions by the language designers.
I would like to second that! I converted multiple of my lab mates into Einops just by having them browse through the tutorial in https://einops.rocks/ :)
Looking at the documentation - einops seems to only implement operations on a single tensor, so it's quite far from a general replacement for einsum, which can perform tensor-products on an arbitrary number of tensors.
There are many evaluation strategies for tensor contractions; naively translating into nested for loops and then doing direct random-access arithmetic is just one approach. If you like numpy's einsum, a drop-in replacement is https://optimized-einsum.readthedocs.io/en/stable/. In general, it's an open question how best to "compile" a tensor contraction expression into something executable, for several reasons: there are a lot of possible orderings for the contractions, and if one wishes to rely on BLAS routines for efficient primitives, one finds that they don't exactly fit the problem of tensor contraction like a glove, and so on.
For example, the optimization of np.einsum has been passed upstream and most of the same features that can be found in this repository can be enabled with numpy.einsum(..., optimize=True).
So numpy should provide the same perf for most cases.
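A minimal sketch of what this looks like in practice (array names and sizes are mine): `optimize=True` lets numpy choose a pairwise contraction order, and `np.einsum_path` reports the chosen order and its estimated cost without doing the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 200))
B = rng.standard_normal((200, 200))
C = rng.standard_normal((200, 10))

# With optimize=True, numpy picks a pairwise contraction ordering and can
# route the intermediate products through BLAS instead of naive loops.
out = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)

# einsum_path shows the chosen order and estimated FLOP cost up front.
path, report = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')

assert np.allclose(out, A @ B @ C)
```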
It's amazing at consolidating my code into something more readable.
Readable if you're already familiar with einsum notation. Otherwise there's a learning curve. An alternative to einsum is using multiple dot product and reshape ops, hopefully with each one having a comment - this would be a lot more readable imo.
But then to find some mistake you would have to corroborate n * 2 pieces of information, which is quite a bit more, before getting into the issue of whether or not the sequence of operations actually makes sense on top of all that.
I really don't find that einsum is less intuitive than reshape. Maybe they both take about the same time to learn the first time. And that's less time than it would take me to struggle through even a single function using reshape ops.
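For what it's worth, the trade-off looks like this on a hypothetical attention-score computation (names mine): one einsum line where the subscripts carry the meaning, versus a transpose-plus-matmul version that needs a comment to track the axes.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 5, 8))   # (batch, queries, dim)
K = rng.standard_normal((4, 7, 8))   # (batch, keys, dim)

# One einsum line; the index meaning is visible in the subscripts:
scores = np.einsum('bqd,bkd->bqk', Q, K)

# The same thing via transpose + batched matmul; the comment has to
# explain what the axes are doing:
scores2 = Q @ K.transpose(0, 2, 1)   # (b, q, d) @ (b, d, k) -> (b, q, k)

assert np.allclose(scores, scores2)
```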
I have to imagine python has revolutionised physics education at university.
I taught myself C before arriving (via, e.g., Stanford's Programming Paradigms course) -- if I hadn't, I would be resigned with my classmates to C code projected on OHP slides.
If I had my way today, the whole of any applied science would be taught with both mathematical and programming (python for convenience) notation.
I used all of C + Python + NumPy a lot, just didn't know about this feature. Would have helped prototyping various bits and pieces at the Python level.
https://einops.rocks/pytorch-examples.html shows how it can be used to implement various neural network architectures in a more simplified manner.
https://stackoverflow.com/questions/1303182/how-does-blas-ge...
Why turn your back on all the carefully crafted optimizations?
https://stackoverflow.com/questions/18365073/why-is-numpys-e...
Did I misunderstand your comment?
The optimizer clearly tries to improve the performance, but in many cases, it doesn't seem to change anything. Let's simply multiply some matrices:
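A minimal sketch of that kind of comparison (sizes and names are mine; actual timings depend on the machine and BLAS build):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))

# Time the plain matmul against einsum with the optimizer enabled.
t_matmul = timeit.timeit(lambda: A @ B, number=50)
t_einsum = timeit.timeit(
    lambda: np.einsum('ij,jk->ik', A, B, optimize=True), number=50)

# The results agree either way; only the time differs.
assert np.allclose(A @ B, np.einsum('ij,jk->ik', A, B, optimize=True))
print(f"matmul: {t_matmul:.4f}s  einsum: {t_einsum:.4f}s")
```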
I can do the multiplication with einsum or as a naive matmul, but even with optimization I see no real difference. I'm not sure if I'm doing something wrong.

Here's a good video that explains why it's so good: https://www.youtube.com/watch?v=pkVwUVEHmfI
Also check out lucidrains' GitHub, who uses it extensively to build transformer architectures from scratch: https://github.com/lucidrains
* Example: https://github.com/lucidrains/alphafold2/blob/d59cb1ea536bc5...
https://cadabra.science/
We wrote an article with it once, 40th order in the Lagrangian, perhaps 50k pages of calculations when all printed. Amazing tool! Thanks Kasper!