Making Python faster with Rust

    [jim@mbp ~]$ sh -x x
    + cat x1.c
    #include <stdio.h>
    #define NUM 1000000000
    struct {
      int x;
      int y;
    } p[NUM];
    int main() {
      int i,s;
      for (i=0; i<NUM; i++) {
        p[i].x = i;
        p[i].y = i;
      }
      s=0;
      for (i=0; i<NUM; i++) {
        s += p[i].x + p[i].y;
      }
      printf("s=%d\n", s);
    }
    + cc -o x1 x1.c
    + ./x1
    s=1808348672
    real 0m12.078s
    user 0m7.319s
    sys 0m4.363s
    + ./x1
    s=1808348672
    real 0m9.415s
    user 0m6.677s
    sys 0m2.685s
    + cat x2.c
    #include <stdio.h>
    #define NUM 1000000000
    int x[NUM];
    int y[NUM];
    int main() {
      int i,s;
      for (i=0; i<NUM; i++) {
        x[i] = i;
        y[i] = i;
      }
      s=0;
      for (i=0; i<NUM; i++) {
        s += x[i] + y[i];
      }
      printf("s=%d\n", s);
    }
    + cc -o x2 x2.c
    + ./x2
    s=1808348672
    real 0m9.753s
    user 0m6.713s
    sys 0m2.967s
    + ./x2
    s=1808348672
    real 0m9.642s
    user 0m6.674s
    sys 0m2.902s
    + cat x3.c
    #include <stdio.h>
    #define NUM 1000000000
    struct {
      int x;
      int y;
    } p[NUM];
    int main() {
      int i,s;
      for (i=0; i<NUM; i++) {
        p[i].x = i;
      }
      for (i=0; i<NUM; i++) {
        p[i].y = i;
      }
      s=0;
      for (i=0; i<NUM; i++) {
        s += p[i].x;
      }
      for (i=0; i<NUM; i++) {
        s += p[i].y;
      }
      printf("s=%d\n", s);
    }
    + cc -o x3 x3.c
    + ./x3
    s=1808348672
    real 0m13.844s
    user 0m11.095s
    sys 0m2.700s
    + ./x3
    s=1808348672
    real 0m13.686s
    user 0m11.038s
    sys 0m2.611s
    + cat x4.c
    #include <stdio.h>
    #define NUM 1000000000
    int x[NUM];
    int y[NUM];
    int main() {
      int i,s;
      for (i=0; i<NUM; i++)
        x[i] = i;
      for (i=0; i<NUM; i++)
        y[i] = i;
      s=0;
      for (i=0; i<NUM; i++)
        s += x[i];
      for (i=0; i<NUM; i++)
        s += y[i];
      printf("s=%d\n", s);
    }
    + cc -o x4 x4.c
    + ./x4
    s=1808348672
    real 0m13.530s
    user 0m10.851s
    sys 0m2.633s
    + ./x4
    s=1808348672
    real 0m13.489s
    user 0m10.856s
    sys 0m2.603s

This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?

So much scientific computing code suffers from core packages being split away from their core language - at what point do we stop and abandon python for languages which actually make sense? Obviously julia is the big example here, but its interest, development and ecosystem don't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x, what's stopping people from switching???

fbdab103 · 2 years ago

Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.

Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets, which for the vast majority of researchers are not big data. A one-time calculation taking 12 seconds instead of 0.1 seconds is not a problem worth optimizing.

koito17 · 2 years ago

>Today, there is a Python package for everything.

The same could be said about CPAN and NPM. Yet Perl is basically dead and JavaScript isn't used for any machine learning tasks as far as I'm aware. WebAssembly did help bring a niche array of audio and video codecs to the ecosystem[1][2], something I'm yet to see from Python.

I don't use Python, but with what little exposure I've had to it at work, its overall sluggish performance and need to set up a dozen virtualenvs -- only to dockerize everything in cursed ways when deploying -- makes me wonder how or why people bother with it at all beyond some 5-line script. Then again, Perl used to be THE glue language in the past and mod_perl was as big as FastAPI, and Perl users would also point out how CPAN was unparalleled in breadth and depth. I wonder if Python will follow a similar fate as Perl. One can hope :-)

[1] https://github.com/phoboslab/jsmpeg

[2] https://github.com/brion/ogv.js/

el_oni · 2 years ago

This is how I got into software development.

During my PhD I was running some simulations using poorly written python code. initially it would take several hours. In that time i could go to the lab, run some wetlab experiments and the results of my simulations would be there when i got back to the office. It was only taking python "home" and building some of my own projects that i learned how to 1. write more pythonic code and 2. write more performant code. Now i work for a software company.

If i'd have stayed in academia I would probably still be writing quick and dirty code and not worrying about the runtime because as a researcher there is always something else you can be doing.

onetonfoot · 2 years ago

You can have your cake and eat it with the likes of

* PythonCall.jl - https://github.com/cjdoris/PythonCall.jl

* NodeCall.jl - https://github.com/sunoru/NodeCall.j

* RCall.jl - https://github.com/JuliaInterop/RCall.jl

I tend to use Julia for most things and then just dip into another language’s ecosystem if I can’t find something to do the job and it’s too complex to build myself

bakuninsbart · 2 years ago

Because professional software developers with a background in CS are a minority of people who program today. The learning curve of pointers, memory-allocation, binary operations, programming paradigms, O-Notation and other things you need to understand to efficiently code in something like C is a lot to ask of someone who is for example primarily a sociologist or biologist.

The use case btw. is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is basically a glorified calculator. Readability trumps efficiency by a large margin every time.

visarga · 2 years ago

It also matters if you write code to run once or to serve in production, if it is experimental or stable.

If my script takes 3s to run and 5m to write in Python, vs 0.1s to run and 3h to write in C, I finish first with Python. I can try more ideas with Python.
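The trade-off above can be written out as arithmetic: total cost is the one-off writing time plus the per-run time multiplied by the number of runs. A rough sketch using the figures from the comment (5 minutes and 3 s for Python, 3 hours and 0.1 s for C); the function names are made up for illustration:

```python
import math

def total_seconds(write_s: float, run_s: float, runs: int) -> float:
    """Time to write the program once plus the time to run it `runs` times."""
    return write_s + run_s * runs

def break_even_runs(slow_write: float, slow_run: float,
                    fast_write: float, fast_run: float) -> int:
    """Smallest number of runs at which the faster-running option wins overall."""
    # slow_write + n * slow_run > fast_write + n * fast_run
    # <=> n > (fast_write - slow_write) / (slow_run - fast_run)
    return math.floor((fast_write - slow_write) / (slow_run - fast_run)) + 1

PY_WRITE, PY_RUN = 5 * 60, 3.0       # 5 minutes to write, 3 s per run
C_WRITE, C_RUN = 3 * 60 * 60, 0.1    # 3 hours to write, 0.1 s per run

n = break_even_runs(PY_WRITE, PY_RUN, C_WRITE, C_RUN)
print(n)
```

With these numbers the C version only pays for itself after roughly 3,600 runs, which is the point being made about experimental code that runs a handful of times.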

Barrin92 · 2 years ago

tbf you don't need to go to C. You could write Common Lisp or Ocaml, both academic high level languages and very performant. Hell SBCL can get you to C range performance wise while you're writing dynamic, GCed code. Sure it's a little bit more involved than learning Python but not that much if you get 50x performance for free. Prevalence of Python is really baffling to me because compute resources cost money.

akasakahakada · 2 years ago

Not even readability. Academic code is mostly unreadable. If you need example: IBM Qiskit.

Everything is just a proof of concept and no one expects anything more than that.

kaba0 · 2 years ago

C is definitely not a good choice for this, I would hate to come back 2 days later to my computation and see “segfault” as the only output.

brahbrah · 2 years ago

They would have gotten the same performance in python with numpy if they did it like this instead of calling norm for every polygon

    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)

They were just using numpy wrong. You can be slow in any language if you use the tools wrong

masklinn · 2 years ago

You made this assertion multiple times, but so far it’s been entirely unsupported in fact, despite TFA having made the entire code set available for you to test your hypothesis on.

tayo42 · 2 years ago

what is the difference?

though I do feel like i see this a lot with these kinds of "we re-wrote it in rust and everything is fast" posts. comparing to a language with gc is often the scenario

on one hand, i feel like you should just learn how to use your stuff properly. on the other hand it is interesting to see that people who can't write fast code or use libraries properly are actually writing fast code. like fast code for the masses almost hah. though maybe theyll just run into the same issue when they misuse a library in rust

ThouYS · 2 years ago

everything. why are there still cobol programmers? why is c++ still the defacto native language (also in research)?

but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side

ubj · 2 years ago

One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.

The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.

Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.

[1]: https://julialang.github.io/PackageCompiler.jl/dev/

[2]: https://github.com/tshort/StaticCompiler.jl

[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...

[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...

[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...

meepmorp · 2 years ago

Thank you for settling a question for me - I was looking at julia's aot compilation abilities last week and the situation seemed like kind of a hassle.

shakow · 2 years ago

IME, for having used Julia quite extensively in Academia:

- the development experience is hampered by the slow start time;

- the ecosystem is quite brittle;

- the promised performances are quite hard to actually reach, profiling only gets you so far;

- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)

> what's stopping people from switching???

All of the mentioned above, inertia, perfect is the enemy of good enough, the alternatives are far away from python ecosystem & community, performances are not often a show blocker.

ActorNightly · 2 years ago

I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.

It's all machine code under the hood. Everything else on top is essentially description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.

visarga · 2 years ago

Julia doesn't get the latest models first, or have as big of a community.

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons

    use pyo3::prelude::*;
    use ndarray_linalg::Norm;
    use numpy::PyReadonlyArray1;

    #[pyfunction]
    fn find_close_polygons(
        py: Python<'_>,
        polygons: Vec<PyObject>,
        point: PyReadonlyArray1<f64>,
        max_dist: f64,
    ) -> PyResult<Vec<PyObject>> {
        let mut close_polygons = vec![];
        let point = point.as_array();
        for poly in polygons {
            let center = poly
                .getattr(py, "center")?
                .extract::<PyReadonlyArray1<f64>>(py)?
                .as_array()
                .to_owned();

            if (center - point).norm() < max_dist {
                close_polygons.push(poly)
            }
        }

        Ok(close_polygons)
    }

    from typing import List  # needed for the List[...] annotations below

    import numpy as np

    n_sides = 30
    n_polygons = 10000

    class Polygon:
        def __init__(self, x, y):
            self.x = x
            self.y = y
            # Pre-computed center: mean of the x and y coordinates
            self.center = np.array([self.x, self.y]).mean(axis=1)

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons

    polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
    point = np.array([0, 0])
    max_dist = 0.5

    # IPython/Jupyter magic, not plain Python
    %timeit find_close_polygons(polygons, point, max_dist)

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        centers = np.array([polygon.center for polygon in polygon_subset])
        mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
        return [
            polygon
            for polygon, is_pass in zip(polygon_subset, mask)
            if is_pass
        ]

Making Python (near infinitely) faster by using it as a glue language, and running all the computation outside Python :-P

INTPenis · 2 years ago

Yeah what's wrong with that? I think this sounds amazing. It gives you all the fast prototyping and simplicity of Python, but once you hit that bottleneck all you have to do is bring in a ringer to replace key components with a faster language.

No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.

josephg · 2 years ago

It sounds amazing, but bear in mind there is a lot of code which can’t be sped up like this because:

- Some code doesn’t have obvious optimization hotspots, and is instead just generally slow everywhere.

- Most FFI boundaries incur their own performance cost. I’m not sure about Python, but I wouldn’t be surprised if FFI to rust in a hot loop is often slower than just writing the same code in Python directly. And it’s not always easy to refactor to avoid this.

- A lot of programs in languages like Python are slow because the working set size contains a lot of small objects, and the GC struggles. You can optimize code like this by moving large parts of the object graph into rust. But it can become a mess if the objects rust retains then need references to Python objects, in turn.

The optimization described in this blog post is the best case scenario for this sort of thing - the performance hotspot was clear, small, and CPU bound. When you can make optimizations like this you absolutely should. But your mileage may vary when you try this out on your own software.

alex_sf · 2 years ago

Python is a rough language to be productive in. It's a great scratchpad, but dynamic typing, exceptions/poor error handling, and a horrifying deployment and dependency system make me reach for something like Go in any case where I need something to be even vaguely reliable.

The more ML I do, the more disappointed I get.

internetter · 2 years ago

Right? I mean, it's not like we haven't been doing this already. All the computationally intensive python libraries are just a convenient wrapper for C anyway, the only reason python can be used for ML.

za3faran · 2 years ago

Then you give up the benefits of using a managed language, and you now have to maintain two stacks.

IMO/IME much better to go with a language where you don't have this dichotomy in the first place - e.g. Java or C#.

okaleniuk · 2 years ago

Also, there is a faster Python which is also Python. And the author considered it as well (both PyPy and Numba), it's just in this particular scenario they were not the best way to go.

baq · 2 years ago

This is basically what Python was first designed for and as evidenced by the article still excels at

flohofwoe · 2 years ago

Well, that's exactly where Python works well. As scripting glue sitting between or above native code which does the heavy-lifting.

est · 2 years ago

> and running all the computation outside Python

You don't have to move all the computation, just `for` loops will help a lot.

    import numpy as np
    import time

    vals = np.random.randn(1000000, 2)
    point = np.array([.2, .3])

    s = time.time()
    for x in vals:
        np.linalg.norm(x - point) < 3
    a = time.time() - s

    s = time.time()
    np.linalg.norm(vals - point, axis=1) < 3
    b = time.time() - s

    print(a / b)

Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.

A little tip, you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.
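A small sketch of the squared-distance tip, assuming `max_dist` is non-negative (squaring both sides of the comparison is only valid then). The function names are made up for illustration and are not from the article:

```python
import numpy as np

def is_close(center: np.ndarray, point: np.ndarray, max_dist: float) -> bool:
    """Distance test using the norm, which takes a square root."""
    return bool(np.linalg.norm(center - point) < max_dist)

def is_close_squared(center: np.ndarray, point: np.ndarray, max_dist: float) -> bool:
    """Same test on squared distances, so no square root is needed.

    Assumes max_dist >= 0.
    """
    diff = center - point
    return bool(diff @ diff < max_dist * max_dist)

center = np.array([0.3, 0.4])
origin = np.array([0.0, 0.0])
print(is_close(center, origin, 0.6), is_close_squared(center, origin, 0.6))
```

Both functions agree except possibly for points that sit exactly on the threshold, where floating-point rounding can differ.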

I once rewrote a GDI+ point transformation routine in pure C# and got 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years after, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know geometry behind your tools, you can either use them efficiently, or rewrite them completely.

The author did talk about it on reddit, but explained that for the purpose of the blog post they wanted to focus on the big stuff and profiling-guided optimisation of the process: https://reddit.com/r/rust/comments/125pbq0/blog_post_making_...

ZeroCool2u · 2 years ago

The ignore the square root while computing/comparing distances trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.

appeldorian · 2 years ago

I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS), rather than a struct of arrays (SoA). An SoA makes it so that the data is ordered contiguously, which is easy to read for the CPU, while an AoS structure interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits of data/read) and to process these with SIMD instructions, executing a multiple of calculations with one CPU cycle. This is completely broken when using an array of structs.

He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
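To make the two layouts concrete, here is a rough numpy illustration (not the article's code): a structured array stands in for the array of structs, and two plain arrays stand in for the struct of arrays. The variable names are made up. It only shows the memory layout; whether the layout changes the runtime for a given workload still has to be measured.

```python
import numpy as np

n = 1_000

# Array of structs: one record per point, x and y interleaved in memory.
aos = np.zeros(n, dtype=[("x", np.float64), ("y", np.float64)])
aos["x"] = np.arange(n)
aos["y"] = np.arange(n) * 2.0

# Struct of arrays: each coordinate lives in its own contiguous array.
soa_x = np.arange(n, dtype=np.float64)
soa_y = np.arange(n, dtype=np.float64) * 2.0

# The x field of the AoS skips over y on every step (16-byte stride),
# while the SoA x array is densely packed (8-byte stride).
print(aos["x"].strides, soa_x.strides)
print(aos["x"].flags["C_CONTIGUOUS"], soa_x.flags["C_CONTIGUOUS"])
```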

tremon · 2 years ago

Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.

ohr · 2 years ago

Author here: I agree, that's great perf advice (esp. when you can restructure your code).

I couldn't get into this in the article (it would be too long), but this is a great point and the original library does this in a lot of places.

One problem in our use case is that the actual struct members are pretty big & that we need to group/regroup them a lot.

The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.

Yoric · 2 years ago

Having used this approach in a few languages, I agree that it's (sometimes) (much) better for performance, but it tends to wreak havoc on readability.

snicker7 · 2 years ago

Languages with some sort of decent metaprogramming support can alleviate this sort of issue (see Zig, Julia, Jai, etc.)

prirun · 2 years ago

Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.

It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.

Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.

lights0123 · 2 years ago

It appears that you didn't enable optimizations. That's needed for SIMD, which can only be taken advantage of with contiguously packed data.

b0b10101 · 2 years ago

korijn · 2 years ago

A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.

pbowyer · 2 years ago

What's a good guide to learn how to make (and see) vectorized code? It's a mindshift and not one I find easy.

I think a great start is to make arrays of similar data. Instead of an array of (x,y,z) use an array for x, an array for y and another one for z. If you then square these and sum them for example, the compiler might figure out good optimizations for it if you write it as a simple loop.

Also read about SIMD instructions like AVX2. They are often used under the hood when possible, but just knowing what they require can help "triggering" them, depending on which language you use. In C++ for example, the compiler really looks for opportunities to use those instructions. You can tell the compiler did it, by looking in the assembly code if any XMM or YMM registers are being used (these are the names of the SIMD registers).

A more accurate keyword for googling is "SIMD": Single Instruction, Multiple Data.

Numpy's tutorial for broadcasting is also a good starting point.

https://numpy.org/doc/stable/user/basics.broadcasting.html

The gist of it is that you give numpy two arrays, and what operation to apply. Then numpy will figure out what the for loop(s) should look like depending on the shape of the arrays.

You can look at various tutorials to see how it works. For example: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-co...
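As a tiny illustration of "numpy figures out the loops from the shapes", here is a sketch with made-up arrays: a column of shape (3, 1) combined with a row of shape (4,) produces a (3, 4) table, the same as a nested Python loop would.

```python
import numpy as np

rows = np.array([[0], [10], [20]])   # shape (3, 1)
cols = np.array([1, 2, 3, 4])        # shape (4,)

# Shapes (3, 1) and (4,) broadcast to (3, 4): numpy runs the nested loop for us.
table = rows + cols

# The same result with explicit Python loops, for comparison.
expected = np.array([[r + c for c in cols] for r in rows[:, 0]])

print(table.shape)
print(np.array_equal(table, expected))
```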

rwalle · 2 years ago

Exactly, that is the real issue, vectorization might be good enough in terms of performance. It doesn't seem to be mentioned in the article at all.

hn92726819 · 2 years ago

It might have been added later, but the author mentions vectorization in the beginning:

> It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable, and the gains are going to be limited (here’s a partially vectorized version, which is faster but far from the results we are going to achieve).

FreeHugs · 2 years ago

The most important part of the article seems to be that this Python code is taking "an avg of 293.41ms per iteration":

And after replacing it with this Rust code, it is taking "an avg of 23.44ms per iteration":

Why is the Rust version 13x faster than the Python version?

spi · 2 years ago

Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:

(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:

then the same computation takes 4ms on my machine.

Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.

(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)

winrid · 2 years ago

Python's for loop implementation is slow, also. You can use built in utils like map() which are "native" and can be a lot faster than a for loop with a push:

https://levelup.gitconnected.com/python-performance-showdown...

Nope. Map() is same speed as for loop.

The benchmarking methodology in the link is not good. The author should use timeit or cProfile or similar. A 0.01s difference is mostly due to fluctuation. The order of execution also matters: say you want to test functions A and B, you actually need to run A, B, B, A to see whether the ordering makes a difference.
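A minimal sketch of that methodology with the standard-library `timeit` module: each candidate is timed several times, the minimum is kept to damp noise, and the candidates run in A, B, B, A order. The two functions and the helper `best_time` are made-up stand-ins, not the code from the linked post.

```python
import timeit

def with_loop(values):
    out = []
    for v in values:
        out.append(v * 2)
    return out

def with_map(values):
    return list(map(lambda v: v * 2, values))

values = list(range(10_000))

def best_time(func, repeat=5, number=20):
    """Minimum over several repeats, which damps run-to-run fluctuation."""
    return min(timeit.repeat(lambda: func(values), repeat=repeat, number=number))

# Interleave as A, B, B, A so neither candidate always runs first.
order = [with_loop, with_map, with_map, with_loop]
timings = {}
for func in order:
    timings.setdefault(func.__name__, []).append(best_time(func))

for name, samples in timings.items():
    print(f"{name}: {min(samples):.4f}s (slots: {samples})")
```

If the two slots for the same function disagree noticeably, that is a sign the measurement is dominated by ordering or noise rather than by the code under test.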

I don't think it's the loop implementation. The stuff in the loop should take multiple orders of magnitude more time than the loop itself:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)

tweakimp · 2 years ago

It's not the looping itself that is slow in the article you linked, it's that every element is appended to the list. If you use a list comprehension it's even faster, and it still loops over all elements of the list.
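For reference, the two forms being compared look like this (a sketch with made-up function names). They produce the same list; the speed difference on a given machine is best checked with `timeit` rather than assumed.

```python
def evens_append(values):
    result = []
    for v in values:
        if v % 2 == 0:
            # Looks up and calls result.append once per kept element.
            result.append(v)
    return result

def evens_comprehension(values):
    # Still visits every element, but in CPython the list is built with a
    # dedicated bytecode instead of a method call per element.
    return [v for v in values if v % 2 == 0]

data = list(range(20))
print(evens_append(data) == evens_comprehension(data))
```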

hannofcart · 2 years ago

I was surprised that the Rust version is _only_ 13x as fast as the Python version.

IshKebab · 2 years ago

Probably because it wasn't pure Python to start with.

lenkite · 2 years ago

Shouldn't `close_polygons` be presized in both Python and Rust to avoid repeated allocation and copy ?

nickstinemates · 2 years ago

One carries the entire feature set of the python runtime, the other is compiled.

The time is spent in this 3-line loop:

I don't think the entire feature set of the Python runtime is involved in this.

hoseja · 2 years ago

The final code takes just 2.90ms per iteration.

The rest is not a fair comparison, because it rewrites the used libraries, not the application code.

You can always speed up an application if you rewrite the used libraries to match your specific use case.

ssivark · 2 years ago

majoe · 2 years ago

I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.

My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.

Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.

Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). On the contrary the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.

hgomersall · 2 years ago

Numba is great if you want to write a naive loop approach in python.

dunefox · 2 years ago

Long startup time is relative. I believe it's much lower now than a couple of versions ago. 0.15s or so? Interop between python and rust will also take time.

xgdgsc · 2 years ago

Julia 1.9 is fast. And you can use https://github.com/Suzhou-Tongyuan/jnumpy to write python extension in Julia now. So I think after 1.9 release julia would be much more usable.

This was a silly and unnecessary optimization. He’s just using numpy wrong.

Instead of:

    for p in ps:
        norm(p.center - point)

You should do:

You’ll get your same speed up in 2 lines without introducing a new dependency

_glass · 2 years ago

Isn't this the version referenced on the github repo [0], which speeds things up 6x instead of 101x?

  There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.

[0] https://github.com/ohadravid/poly-match

No, their v1.5 is still calling norm on every polygon. They’re still using it wrong

On Google colab

~296x faster, significantly faster than the solution in the article.

jerf · 2 years ago

This is the major reason I don't really buy into things like JITs solving all performance problems (as long as you carefully write only and exactly the subset of the language they work well with) or NumPy not being affected by Python being slow. There's more code like this in the world than I think people realize.

Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.

It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.

My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.

This is a very valid point that I can’t disagree with. I’ve gone through the pain of learning that subset of the language decently well, but also been lucky enough to work at places that compensate very well for that knowledge.

oblio · 2 years ago

This is nice, how would you go about as a performance noob? I can't imagine there's a line in the docs saying "this is slow!".

LineProfiler is the best tool to learn how to write performant Python and do code optimization.

https://github.com/pyutils/line_profiler

You can literally see the hot spot of your code, then you can grind different algorithms or change the whole architecture to make it faster.
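line_profiler is a third-party package, so as a stand-in here is a rough sketch of the same hot-spot hunt using the standard library's `cProfile`, which reports per function rather than per line. The functions `slow_part`, `fast_part` and `work` are hypothetical and exist only for the demonstration.

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Deliberately heavy: this should dominate the profile.
    return sum(i * i for i in range(n))

def fast_part(n):
    return n

def work():
    slow_part(200_000)
    fast_part(200_000)

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The function with the largest cumulative time is where to start looking; a line-level tool can then narrow it down further.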

For example: replace short for loops with list comprehensions, vectorize all numpy operations (vectorizing only partially does not help), use 'not any()' instead of 'all()' for booleans, etc.

Doing this for like 2 weeks, basically you can automatically recognize most bad code patterns at a glance.

So in addition to what akasaka said (another thumbs up for line profiler from me, great tool) this isn’t a problem with linalg.norm being slow. It’s plenty fast, but calling it thousands of separate times in a Python loop will be slow. This is more just about learning how to vectorize properly. If you’re working in numpy land and you’re calling a numpy function in a loop that’s iterating over more than a handful of items, chances are you’re not vectorizing properly
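A minimal sketch of the pattern described above, using made-up data: the first version calls `np.linalg.norm` once per row inside a Python loop, the second makes a single call over the whole array with `axis=1`. Both give the same mask.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.random((1000, 2))   # hypothetical polygon centers
point = np.array([0.5, 0.5])
max_dist = 0.25

# One numpy call per row: the Python-level loop overhead is paid 1000 times.
looped = np.array([np.linalg.norm(c - point) < max_dist for c in centers])

# One call for the whole array; axis=1 takes the norm of each row.
vectorized = np.linalg.norm(centers - point, axis=1) < max_dist

print(np.array_equal(looped, vectorized))
```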