Readit News
teddykoker · 4 years ago
Also see Einops: https://github.com/arogozhnikov/einops, which uses an einsum-like notation for various tensor operations used in deep learning.

https://einops.rocks/pytorch-examples.html shows how it can be used to implement various neural network architectures in a more simplified manner.

johnnymellor · 4 years ago
Einops looks nice! It reminds me of https://github.com/deepmind/einshape which is another attempt at unifying reshape, squeeze, expand_dims, transpose, tile, flatten, etc under an einsum-inspired DSL.
pizza · 4 years ago
Somebody also realized that much of the time you can use one single function to describe all 3 of the einops operations. I present to you, einop: https://github.com/cgarciae/einop
PartiallyTyped · 4 years ago
In addition, einsum in JAX uses opt_einsum under the hood.
moelf · 4 years ago
https://github.com/mcabbott/Tullio.jl:

  - better syntax (because Julia has proper macro/metaprogramming)
  - faster
  - automatically works with GPU arrays.

dsharlet · 4 years ago
If you are looking for something like this in C++, here's my attempt at implementing it: https://github.com/dsharlet/array#einstein-reductions

It doesn't do any automatic optimization of the loops like some of the projects linked in this thread, but, it provides all the tools needed for humans to express the code in a way that a good compiler can turn it into really good code.

andi999 · 4 years ago
How is "This is 40-50x faster than a naive C implementation of nested loops on my machine" possible? Do you know how this was achieved?
mynameismon · 4 years ago
Not OP, but from a cursory glance at the code, it seems to be achieved by a combination of splitting the matrices into chunks that fit in the L1/L2 caches (line 2 in the code), switching loop indexes for better cache locality, and using SIMD plus fused multiply-add to speed things up further.
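The tiling part can be sketched in a few lines of numpy (a toy illustration of the idea, not the linked C++ code; real implementations layer SIMD and FMA on top of this loop structure):

```python
import numpy as np

def blocked_matmul(A, B, block=32):
    """Cache-blocked matrix multiply: work on block x block tiles so each
    tile stays resident in fast cache while it is being reused."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Accumulate one tile of C from tiles of A and B
                C[i0:i0+block, j0:j0+block] += (
                    A[i0:i0+block, k0:k0+block] @ B[k0:k0+block, j0:j0+block]
                )
    return C
```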
cschmid · 4 years ago
I'd really like to use einsum more often, because it allows me to code my expressions the same way I derive them on pen and paper. Unfortunately, as mentioned in the article, it's slow, because it converts your formula to a for loop.

So usually, I rewrite my formulas into messy combinations of broadcasts, transposes and array multiplications. Is there a package or an algorithm that does this conversion automatically? It seems to be a pretty straightforward problem, at least for most expressions I use.
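The kind of manual rewrite I mean, for a simple case (a made-up batched outer product; the two versions are exactly equivalent):

```python
import numpy as np

x = np.random.rand(4, 5)
y = np.random.rand(4, 6)

# Pen-and-paper form: C[b, i, j] = x[b, i] * y[b, j]
c_einsum = np.einsum('bi,bj->bij', x, y)

# Hand-rewritten with broadcasting: insert singleton axes and multiply
c_manual = x[:, :, None] * y[:, None, :]
```

The broadcast version runs through numpy's fast ufunc machinery, but the index bookkeeping is now implicit in the axis positions.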

catgary · 4 years ago
The Tullio library in Julia is a pretty fantastic option for Einstein summation. Its performance is great, it generates CUDA kernels, and it does some clever tricks for automatic differentiation. It's also a bit more readable than numpy's einsum function, since you just write:

  @tullio C[i,j] := A[i,k] * B[k,j]

whimsicalism · 4 years ago
Are there any benchmarks comparing Tullio generated kernels from einsum with an equivalent NVIDIA kernel?
shaklee3 · 4 years ago
It's highly unlikely that the performance will be great on GPU if it generates its own kernels. cuTENSOR has been hand-optimized to do exactly these types of operations in a way that automatically emitted kernels can't match.
rowanG077 · 4 years ago
Damn, that notation is super awesome. Basically self-explanatory.
enriquto · 4 years ago
Just use C or C++ or Fortran or Julia or even Lua.

The issue of "slow loops" is entirely self-inflicted by some languages. It's getting ridiculous to still worry about this shit in 2022.

Simple loops can and should be just as fast as vectorized programs. When they are slower, it is 100% due to deliberate decisions by the language designers.

thrownawayalre · 4 years ago
But BLAS is much more than 3 loops for matrix-matrix multiplication.

https://stackoverflow.com/questions/1303182/how-does-blas-ge...

Why turn your back on all the carefully crafted optimizations?
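To make the contrast concrete (a hedged sketch: the three-loop version computes the same answer as BLAS-backed `@`, it just lacks the blocking, SIMD, and multithreading; timing it is left to the reader):

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook three-loop multiply: correct, but with none of BLAS's
    cache blocking, vectorization, or multithreading."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]
            C[i, j] = s
    return C

A = np.random.rand(40, 30)
B = np.random.rand(30, 20)
same = np.allclose(naive_matmul(A, B), A @ B)  # identical result, vastly different speed at scale
```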

kzrdude · 4 years ago
I agree. At least some faster-cpython work is going on, does anyone know if there are aggregate updates on that, like performance progress reports?
upbeat_general · 4 years ago
This is 100% not the case with SIMD.
TheRealKing · 4 years ago
or Rust
thrownawayalre · 4 years ago
It's not true that einsum converts to a for loop, and it is sometimes faster than other numpy built-in functions.

https://stackoverflow.com/questions/18365073/why-is-numpys-e...

Did I misunderstand your comment?

cschmid · 4 years ago
If I understand https://github.com/numpy/numpy/blob/v1.22.0/numpy/core/einsu... and https://github.com/numpy/numpy/blob/v1.22.0/numpy/core/src/m... correctly, using einsum without the optimize flag seems to use a for loop in C to do the multiplication.

The optimizer clearly tries to improve the performance, but in many cases, it doesn't seem to change anything. Let's simply multiply some matrices:

  x, y = np.random.rand(200, 200, 200), np.random.rand(200, 200, 200)
I can do

  %timeit x@y
  40.3 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
or a naive

  %timeit np.einsum('bik,bkj->bij',x,y)
  1.53 s ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
But even with optimization, I see

  %timeit np.einsum('bik,bkj->bij',x,y, optimize=True)
  1.54 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm not sure if I'm doing something wrong.
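(One plausible reading, offered as an assumption rather than a definitive diagnosis: the optimize flag chooses a contraction *order*, which only matters with three or more operands; a single pairwise batched contraction like this one still runs through einsum's own C loop rather than a BLAS batched GEMM. Rewriting it as matmul is exact:)

```python
import numpy as np

x = np.random.rand(8, 20, 20)
y = np.random.rand(8, 20, 20)

# 'bik,bkj->bij' is precisely matrix multiplication over the batch axis
out_einsum = np.einsum('bik,bkj->bij', x, y)
out_matmul = x @ y
```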

savant_penguin · 4 years ago
Iirc pytorch's einsum used to be very slow
machinekob · 4 years ago
I would say https://github.com/arogozhnikov/einops should be pretty close to your requirements
polygamous_bat · 4 years ago
I would like to second that! I converted several of my lab mates to einops just by having them browse through the tutorial at https://einops.rocks/ :)
jpeanuts · 4 years ago
Looking at the documentation, einops seems to only implement operations on a single tensor, so it's quite far from a general replacement for einsum, which can perform tensor products on an arbitrary number of tensors.
braised_babbage · 4 years ago
There are many evaluation strategies for tensor contractions; naively translating into nested for loops and then doing direct random-access arithmetic is just one approach. If you like numpy's einsum, a drop-in replacement is https://optimized-einsum.readthedocs.io/en/stable/. In general, how best to "compile" a tensor contraction expression into something executable remains an open question, for several reasons: there are a lot of possible orderings for the contractions, and if one wishes to rely on BLAS routines for efficient primitives, one finds that they don't exactly fit the problem of tensor contraction like a glove, and so on.
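A minimal sketch of the drop-in usage (shapes invented; `opt_einsum.contract` takes the same subscript string as np.einsum but picks a good pairwise contraction order):

```python
import numpy as np
import opt_einsum as oe

a = np.random.rand(10, 30)
b = np.random.rand(30, 5)
c = np.random.rand(5, 60)

# Same subscripts as np.einsum, but contracted in a cost-optimized order
result = oe.contract('ij,jk,kl->il', a, b, c)
```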
thrownawayalre · 4 years ago
The linked documentation states

    For example, the optimization of np.einsum has been passed upstream and most of the same features that can be found in this repository can be enabled with numpy.einsum(..., optimize=True). 
So numpy should provide the same perf for most cases.
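You can even inspect what order numpy would pick via `np.einsum_path` (a small sketch with made-up shapes; the returned path can be fed back into `np.einsum`):

```python
import numpy as np

a = np.random.rand(10, 30)
b = np.random.rand(30, 5)
c = np.random.rand(5, 60)

# Reports the contraction order optimize=True would use, plus its
# estimated FLOP savings over a single naive pass
path, info = np.einsum_path('ij,jk,kl->il', a, b, c, optimize='greedy')
out = np.einsum('ij,jk,kl->il', a, b, c, optimize=path)
```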

madisonmay · 4 years ago
For more efficient einsum, see projects like https://optimized-einsum.readthedocs.io/en/stable/path_findi....
imranq · 4 years ago
I've found einsum to be amazing at consolidating my code into something more readable, particularly for implementing architectures from scratch.

Here's a good video that explains why its so good: https://www.youtube.com/watch?v=pkVwUVEHmfI

Also check out Lucid Rains' GitHub, who uses it extensively to build transformer architectures from scratch: https://github.com/lucidrains

* Example: https://github.com/lucidrains/alphafold2/blob/d59cb1ea536bc5...

p1esk · 4 years ago
    amazing at consolidating my code into something more readable

Readable if you're already familiar with einsum notation. Otherwise there's a learning curve. An alternative to einsum is using multiple dot product and reshape ops, hopefully with each one having a comment - this would be a lot more readable imo.

pizza · 4 years ago
But then to find some mistake you would have to corroborate n * 2 pieces of information, which is quite a bit more, before getting into the issue of whether or not the sequence of operations actually makes sense on top of all that.
hakuseki · 4 years ago
I really don't find that einsum is less intuitive than reshape. Maybe they both take about the same time to learn the first time. And that's less time than it would take me to struggle through even a single function using reshape ops.
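The tradeoff both sides describe, on a hypothetical attention-score computation (shapes invented; the two versions are exactly equivalent):

```python
import numpy as np

# Hypothetical attention inputs: batch b, heads h, positions i/j, head dim d
q = np.random.rand(2, 4, 16, 8)   # (b, h, i, d)
k = np.random.rand(2, 4, 16, 8)   # (b, h, j, d)

# einsum version: the subscript string *is* the math
scores = np.einsum('bhid,bhjd->bhij', q, k)

# matmul-and-transpose version: same result, axes tracked by hand
scores2 = q @ k.transpose(0, 1, 3, 2)   # (b,h,i,d) @ (b,h,d,j) -> (b,h,i,j)
```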
immmmmm · 4 years ago
If you're into tensor algebra, I can only recommend Cadabra, a beautiful piece of software:

https://cadabra.science/

We wrote an article with it once: 40th order in the Lagrangian, perhaps 50k pages of calculations if all printed out. Amazing tool! Thanks Kasper!

physicsguy · 4 years ago
Oh god why did I not know about this when I did my theoretical physics PhD
mjburgess · 4 years ago
I have to imagine python has revolutionised physics education at university.

I taught myself C before arriving (e.g., via Stanford's Programming Paradigms course) -- if I hadn't, I would have been resigned with my classmates to C code projected on OHP slides.

If I had my way today, the whole of any applied science would be taught with both mathematical and programming (Python for convenience) notation.

physicsguy · 4 years ago
I used all of C + Python + NumPy a lot, just didn't know about this feature. Would have helped prototyping various bits and pieces at the Python level.