Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.
A little tip: you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.
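As a quick sketch of the idea (plain Python; the helper name and numbers are mine, not from the article):

    import math

    def within_dist(center, point, max_dist):
        # Compare squared distance against the squared threshold: same result, no sqrt().
        dx = center[0] - point[0]
        dy = center[1] - point[1]
        return dx * dx + dy * dy < max_dist * max_dist

    # Same answer as the sqrt-based check, just a bit cheaper per call.
    assert within_dist((3.0, 4.0), (0.0, 0.0), 5.1) == (math.dist((3.0, 4.0), (0.0, 0.0)) < 5.1)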
I once rewrote a GDI+ point transformation routine in pure C# and got a 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years later, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know the geometry behind your tools, you can either use them efficiently or rewrite them completely.
The "ignore the square root when comparing distances" trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.
I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS) rather than a struct of arrays (SoA). An SoA lays the data out contiguously, which is easy for the CPU to read, while an AoS interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits per read) and to process them with SIMD instructions, executing multiple calculations in one CPU cycle. This is completely broken when using an array of structs.
He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.
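To make the AoS/SoA distinction concrete, here is a small numpy sketch (my own illustration, not code from the article; sizes and names are made up). The SoA layout keeps each coordinate contiguous in memory, so the distance math becomes whole-array operations:

    import numpy as np

    n = 1_000_000
    pts = np.random.default_rng(0).random((n, 2))

    # AoS-ish: one Python tuple per point, x and y interleaved per element.
    points_aos = [tuple(p) for p in pts]

    # SoA: one contiguous array per component.
    xs = pts[:, 0].copy()
    ys = pts[:, 1].copy()

    px, py, max_dist = 0.5, 0.5, 0.1

    # AoS path: per-element work in the interpreter.
    close_aos = [i for i, (x, y) in enumerate(points_aos)
                 if (x - px) ** 2 + (y - py) ** 2 < max_dist ** 2]

    # SoA path: whole-array arithmetic over contiguous memory (SIMD-friendly).
    close_soa = np.nonzero((xs - px) ** 2 + (ys - py) ** 2 < max_dist ** 2)[0]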
Author here:
I agree, that's great perf advice (especially when you can restructure your code).
I couldn't get into this in the article (it would be too long), but this is a great point, and the original library does this in a lot of places.
One problem in our use case is that the actual struct members are pretty big and that we need to group/regroup them a lot.
The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
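I'm only guessing at the details here, but I picture the "hashmap of SoAs" step roughly like this (the field names .center and .weight and the group_key idea are hypothetical, not the real library's):

    from collections import defaultdict
    import numpy as np

    def group_to_soa(filtered_items, group_key):
        # Regroup the already-filtered items, then build one struct-of-arrays per group.
        buckets = defaultdict(list)
        for item in filtered_items:
            buckets[group_key(item)].append(item)

        soa_by_group = {}
        for key, members in buckets.items():
            soa_by_group[key] = {
                "centers": np.array([m.center for m in members]),  # shape (k, 2)
                "weights": np.array([m.weight for m in members]),  # shape (k,)
            }
        return soa_by_group

    # The heavier math then runs per group over contiguous arrays, e.g.
    #   np.linalg.norm(arrs["centers"] - point, axis=1) * arrs["weights"]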
Modern CPU caches are usually loaded in 64-byte units (cache lines) - much larger than 128 bits. I just ran some tests with a C program on an Intel i5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.
It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.
Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.
[jim@mbp ~]$ sh -x x
+ cat x1.c
#include <stdio.h>
#define NUM 1000000000
struct {
    int x;
    int y;
} p[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++) {
        p[i].x = i;
        p[i].y = i;
    }
    s = 0;
    for (i = 0; i < NUM; i++) {
        s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
}
+ cc -o x1 x1.c
+ ./x1
s=1808348672
real 0m12.078s
user 0m7.319s
sys 0m4.363s
+ ./x1
s=1808348672
real 0m9.415s
user 0m6.677s
sys 0m2.685s
+ cat x2.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++) {
        x[i] = i;
        y[i] = i;
    }
    s = 0;
    for (i = 0; i < NUM; i++) {
        s += x[i] + y[i];
    }
    printf("s=%d\n", s);
}
+ cc -o x2 x2.c
+ ./x2
s=1808348672
real 0m9.753s
user 0m6.713s
sys 0m2.967s
+ ./x2
s=1808348672
real 0m9.642s
user 0m6.674s
sys 0m2.902s
+ cat x3.c
#include <stdio.h>
#define NUM 1000000000
struct {
    int x;
    int y;
} p[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++) {
        p[i].x = i;
    }
    for (i = 0; i < NUM; i++) {
        p[i].y = i;
    }
    s = 0;
    for (i = 0; i < NUM; i++) {
        s += p[i].x;
    }
    for (i = 0; i < NUM; i++) {
        s += p[i].y;
    }
    printf("s=%d\n", s);
}
+ cc -o x3 x3.c
+ ./x3
s=1808348672
real 0m13.844s
user 0m11.095s
sys 0m2.700s
+ ./x3
s=1808348672
real 0m13.686s
user 0m11.038s
sys 0m2.611s
+ cat x4.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
    int i, s;
    for (i = 0; i < NUM; i++)
        x[i] = i;
    for (i = 0; i < NUM; i++)
        y[i] = i;
    s = 0;
    for (i = 0; i < NUM; i++)
        s += x[i];
    for (i = 0; i < NUM; i++)
        s += y[i];
    printf("s=%d\n", s);
}
+ cc -o x4 x4.c
+ ./x4
s=1808348672
real 0m13.530s
user 0m10.851s
sys 0m2.633s
+ ./x4
s=1808348672
real 0m13.489s
user 0m10.856s
sys 0m2.603s
This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?
So much scientific computing code suffers from its core packages being split away from the core language - at what point do we stop and abandon Python for languages which actually make sense? Obviously Julia is the big example here, but its interest, development, and ecosystem don't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x, what's stopping people from switching???
Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.
Also, few scientific programmers have any notion of the C or Fortran under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets, which for the vast majority of researchers are not big data. A one-time calculation taking 12 seconds instead of 0.1 seconds is not a problem worth optimizing.
The same could be said about CPAN and NPM. Yet Perl is basically dead and JavaScript isn't used for any machine learning tasks as far as I'm aware. WebAssembly did help bring a niche array of audio and video codecs to the ecosystem[1][2], something I'm yet to see from Python.
[1] https://github.com/phoboslab/jsmpeg
[2] https://github.com/brion/ogv.js/
I don't use Python, but with what little exposure I've had to it at work, its overall sluggish performance and the need to set up a dozen virtualenvs -- only to dockerize everything in cursed ways when deploying -- make me wonder how or why people bother with it at all beyond some 5-line script. Then again, Perl used to be THE glue language in the past and mod_perl was as big as FastAPI, and Perl users would also point out how CPAN was unparalleled in breadth and depth. I wonder if Python will meet a similar fate to Perl. One can hope :-)
During my PhD I was running some simulations using poorly written Python code. Initially it would take several hours. In that time I could go to the lab, run some wet-lab experiments, and the results of my simulations would be there when I got back to the office. It was only by taking Python "home" and building some of my own projects that I learned how to 1. write more Pythonic code and 2. write more performant code. Now I work for a software company.
If I'd stayed in academia I would probably still be writing quick and dirty code and not worrying about the runtime, because as a researcher there is always something else you could be doing.
* PythonCall.jl - https://github.com/cjdoris/PythonCall.jl
* NodeCall.jl - https://github.com/sunoru/NodeCall.jl
* RCall.jl - https://github.com/JuliaInterop/RCall.jl
I tend to use Julia for most things and then just dip into another language’s ecosystem if I can’t find something to do the job and it’s too complex to build myself.
Because professional software developers with a background in CS are a minority of the people who program today. The learning curve of pointers, memory allocation, binary operations, programming paradigms, big-O notation, and other things you need to understand to code efficiently in something like C is a lot to ask of someone who is, for example, primarily a sociologist or biologist.
The use case, btw, is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is essentially a glorified calculator. Readability trumps efficiency by a large margin every time.
It also matters if you write code to run once or to serve in production, if it is experimental or stable.
If my script takes 3s to run and 5m to write in Python, vs 0.1s to run and 3h to write in C, I finish first with Python. I can try more ideas with Python.
tbf you don't need to go to C. You could write Common Lisp or OCaml, both academic high-level languages and very performant. Hell, SBCL can get you into C-range performance while you're writing dynamic, GCed code. Sure, it's a little bit more involved than learning Python, but not that much if you get 50x performance for free. The prevalence of Python is really baffling to me because compute resources cost money.
You made this assertion multiple times, but so far it’s been entirely unsupported in fact, despite TFA having made the entire code set available for you to test your hypothesis on.
Though I do feel like I see this a lot with these kinds of "we rewrote it in Rust and everything is fast" posts: the comparison is often against a garbage-collected language.
On one hand, I feel like you should just learn how to use your stuff properly. On the other hand, it is interesting to see that people who can't write fast code or use libraries properly are now actually writing fast code. Like fast code for the masses, almost, hah. Though maybe they'll just run into the same issue when they misuse a library in Rust.
Everything. Why are there still COBOL programmers? Why is C++ still the de facto native language (also in research)?
But I also don't see any problem there. I think the Python + C++/Rust idiom is actually pretty nice. I have a billion libs to choose from on either side: great usability on the Python side, and unbeatable performance on the C++ side.
One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.
The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.
Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.
[1]: https://julialang.github.io/PackageCompiler.jl/dev/
[2]: https://github.com/tshort/StaticCompiler.jl
[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...
[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...
Thank you for settling a question for me - I was looking at Julia's AoT compilation abilities last week and the situation seemed like kind of a hassle.
IME, having used Julia quite extensively in academia:
- the development experience is hampered by the slow start time;
- the ecosystem is quite brittle;
- the promised performance is quite hard to actually reach; profiling only gets you so far;
- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)
> what's stopping people from switching???
All of the above, plus inertia; perfect is the enemy of good enough; the alternatives are far from the Python ecosystem & community; and performance is often not a showstopper.
I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.
It's all machine code under the hood. Everything else on top is essentially a description of more and more complex patterns of that code. So it's a no-brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use Python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.
A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.
I think a great start is to make arrays of similar data. Instead of an array of (x,y,z) use an array for x, an array for y and another one for z. If you then square these and sum them for example, the compiler might figure out good optimizations for it if you write it as a simple loop.
Also read about SIMD instructions like AVX2. They are often used under the hood when possible, but just knowing what they require can help "trigger" them, depending on which language you use. In C++, for example, the compiler really looks for opportunities to use those instructions. You can tell whether the compiler did it by looking in the assembly for XMM or YMM registers (these are the names of the SIMD registers).
The gist of it is that you give numpy two arrays, and what operation to apply. Then numpy will figure out what the for loop(s) should look like depending on the shape of the arrays.
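A small illustration of both points (separate coordinate arrays, and broadcasting choosing the loops for you); the array shapes and sizes are my own toy example:

    import numpy as np

    # One array per coordinate instead of an array of (x, y, z) records.
    xs = np.random.rand(100_000)
    ys = np.random.rand(100_000)
    zs = np.random.rand(100_000)

    # Squared length of every vector: a few contiguous passes, no Python loop.
    sq_lengths = xs * xs + ys * ys + zs * zs

    # Broadcasting: a (100000, 3) array minus a (3,) point; numpy stretches the
    # point across the first axis and generates the loop itself.
    points = np.stack([xs, ys, zs], axis=1)   # shape (100000, 3)
    origin = np.array([0.5, 0.5, 0.5])        # shape (3,)
    sq_dists = ((points - origin) ** 2).sum(axis=1)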
It might have been added later, but the author mentions vectorization in the beginning:
> It’s worth noting that converting parts of / everything to vectorized numpy might be possible for this toy library, but will be nearly impossible for the real library while making the code much less readable and modifiable, and the gains are going to be limited (here’s a partially vertorized version, which is faster but far from the results we are going to achieve).
Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:
import numpy as np
from typing import List

n_sides = 30
n_polygons = 10000

class Polygon:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.center = np.array([self.x, self.y]).mean(axis=1)

def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    close_polygons = []
    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
    return close_polygons

polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
point = np.array([0, 0])
max_dist = 0.5

%timeit find_close_polygons(polygons, point, max_dist)
(I've made up the number of sides and the number of polygons to get to the same order of magnitude of runtime; also, I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:
def find_close_polygons(
    polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
    centers = np.array([polygon.center for polygon in polygon_subset])
    mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
    return [
        polygon
        for polygon, is_pass in zip(polygon_subset, mask)
        if is_pass
    ]
then the same computation takes 4ms on my machine.
Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.
(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)
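For completeness, that tweak only changes the mask line; a sketch of the squared-distance version, with stand-in data instead of the real polygon list:

    import numpy as np

    centers = np.random.rand(10_000, 2)   # stand-in for the per-polygon centers
    point = np.array([0.0, 0.0])
    max_dist = 0.5

    # Same filter as the norm-based mask, but comparing squared distances.
    deltas = centers - point[None]
    mask = (deltas * deltas).sum(axis=1) < max_dist ** 2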
Python's for loop implementation is also slow. You can use built-in utils like map(), which are "native" and can be a lot faster than a for loop with an append:
https://levelup.gitconnected.com/python-performance-showdown...
The benchmarking methodology in that link is not good. The author should use timeit() or cProfile or similar. 0.01s of difference is mostly fluctuation. The order of execution also matters: if you want to compare functions A and B, you actually need to run A, B, B, A to see whether the ordering makes a difference.
It's not the looping itself that is slow in the article you linked, it's that every element is appended to the list.
If you use a list comprehension it's even faster, and it still loops over all elements of the list.
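To tie these points together, here's a rough timeit sketch comparing an append loop, map(), and a list comprehension, run in mirrored order so warm-up doesn't favor one variant (function names and numbers are mine, and results will vary by machine):

    import timeit

    data = list(range(100_000))

    def with_append():
        out = []
        for x in data:
            out.append(x * 2)
        return out

    def with_map():
        return list(map(lambda x: x * 2, data))

    def with_comprehension():
        return [x * 2 for x in data]

    # A, B, C, C, B, A ordering, so no variant always runs on a warmed-up cache.
    for fn in (with_append, with_map, with_comprehension,
               with_comprehension, with_map, with_append):
        print(fn.__name__, timeit.timeit(fn, number=50))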
Yeah what's wrong with that? I think this sounds amazing. It gives you all the fast prototyping and simplicity of Python, but once you hit that bottleneck all you have to do is bring in a ringer to replace key components with a faster language.
No need to use Golang or Rust from the start, no need for those resources until you absolutely need the speed improvement. Sounds like a dream to a lot of people who find it much easier to develop in Python.
It sounds amazing, but bear in mind there is a lot of code which can’t be sped up like this because:
- Some code doesn’t have obvious optimization hotspots, and is instead just generally slow everywhere.
- Most FFI boundaries incur their own performance cost. I’m not sure about Python specifically, but I wouldn’t be surprised if FFI to Rust in a hot loop is often slower than just writing the same code in Python directly (see the sketch after this comment). And it’s not always easy to refactor to avoid this.
- A lot of programs in languages like Python are slow because the working set size contains a lot of small objects, and the GC struggles. You can optimize code like this by moving large parts of the object graph into rust. But it can become a mess if the objects rust retains then need references to Python objects, in turn.
The optimization described in this blog post is the best case scenario for this sort of thing - the performance hotspot was clear, small, and CPU bound. When you can make optimizations like this you absolutely should. But your mileage may vary when you try this out on your own software.
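The per-call overhead point is easy to demonstrate without any Rust at all; here's a sketch using numpy's C code as the "foreign" side (function names are mine), where per-element calls in a hot loop lose badly to one batched call:

    import math
    import timeit
    import numpy as np

    values = np.random.rand(100_000)

    def per_element_extension_calls():
        # Crosses the Python/C boundary once per element.
        return [np.sqrt(v) for v in values]

    def pure_python():
        # No heavyweight extension calls; often competitive with the loop above.
        return [math.sqrt(v) for v in values]

    def one_batched_call():
        # Crosses the boundary once; all the work happens in C.
        return np.sqrt(values)

    for fn in (per_element_extension_calls, pure_python, one_batched_call):
        print(fn.__name__, timeit.timeit(fn, number=20))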
Python is a rough language to be productive in. It's a great scratchpad, but dynamic typing, exceptions/poor error handling, and a horrifying deployment and dependency system make me reach for something like Go in any case where I need something to be even vaguely reliable.
Right? I mean, it's not like we haven't been doing this already. All the computationally intensive python libraries are just a convenient wrapper for C anyway, the only reason python can be used for ML.
Also, there is a faster Python which is also Python. And the author considered it as well (both PyPy and Numba); it's just that in this particular scenario they were not the best way to go.
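For reference, the Numba flavor of "faster Python that is still Python" looks roughly like this; a minimal sketch with made-up sizes, not the article's code:

    import numpy as np
    from numba import njit

    @njit
    def close_mask(centers, point, max_dist):
        # Plain Python loop, JIT-compiled to machine code on first call.
        n = centers.shape[0]
        mask = np.empty(n, dtype=np.bool_)
        max_sq = max_dist * max_dist
        for i in range(n):
            dx = centers[i, 0] - point[0]
            dy = centers[i, 1] - point[1]
            mask[i] = dx * dx + dy * dy < max_sq
        return mask

    centers = np.random.rand(10_000, 2)
    mask = close_mask(centers, np.array([0.5, 0.5]), 0.1)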
I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.
My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.
Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.
Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). In contrast, the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.
Long startup time is relative. I believe it's much lower now than a couple of versions ago. 0.15s or so? Interop between python and rust will also take time.
Julia 1.9 is fast. And you can use https://github.com/Suzhou-Tongyuan/jnumpy to write Python extensions in Julia now. So I think after the 1.9 release Julia will be much more usable.
Isn't this the version referenced in the GitHub repo [0], which speeds things up 6x instead of 101x?
There's also a "v1.5" version which is 6x faster, and uses "vectorizing" (doing more of the work directly in numpy). This version is much harder to optimize further.
No, their v1.5 is still calling norm on every polygon. They’re still using it wrong
On Google colab
import numpy as np
import time

vals = np.random.randn(1000000, 2)
point = np.array([.2, .3])

s = time.time()
for x in vals:
    np.linalg.norm(x - point) < 3
a = time.time() - s

s = time.time()
np.linalg.norm(vals - point, axis=1) < 3
b = time.time() - s

print(a / b)
~296x faster, significantly faster than the solution in the article.
This is the major reason I don't really buy into things like JITs solving all performance problems (as long as you carefully write only and exactly the subset of the language they work well with) or NumPy not being affected by Python being slow. There's more code like this in the world than I think people realize.
Having to write in a subset of a language in order for it to perform decently is a big deal. Having no feedback given to the programmer when you deviate from the fast path makes it even harder to learn what the fast path is. The result is not that you get the ease of Python and the speed of C without having to understand much; the result is that you have to be a fairly deep expert in Python and understand the C bindings intimately and learn how to avoid doing what is the natural thing in Python, the Python covered in all the tutorials, or you end up writing your code to run at native Python speeds without even realizing it.
It's a feasible amount of knowledge to have, it's not like it's completely insane, but it's still rather a lot.
My career just brushed this world and I'm glad I bounced off of it. It would drive me insane to walk through this landmine field every day, and then worse, have to try to guide others through it all the while they are pointing at all the "common practices" that are also written by people utterly oblivious to all this.
This is a very valid point that I can’t disagree with. I’ve gone through the pain of learning that subset of the language decently well, but also been lucky enough to work at places that compensate very well for that knowledge.
https://github.com/pyutils/line_profiler
You can literally see the hot spot of your code, then you can grind different algorithms or change the whole architecture to make it faster.
For example: replace short for loops with list comprehensions, vectorize all numpy operations (vectorizing only partially does not help), use 'not any()' instead of 'all()' for boolean checks, etc.
After doing this for like 2 weeks, you can basically recognize most bad code patterns at a glance.
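For anyone who hasn't tried it, a minimal line_profiler session looks roughly like this (the function body is a made-up stand-in; the kernprof/@profile workflow is line_profiler's documented usage):

    # profile_me.py -- run with:  kernprof -l -v profile_me.py
    # line_profiler then prints per-line hit counts and timings for @profile functions.
    import numpy as np

    @profile  # injected as a builtin by kernprof; the script needs kernprof to run
    def slow_filter(centers, point, max_dist):
        close = []
        for c in centers:                          # usually shows up as the hot line
            if np.linalg.norm(c - point) < max_dist:
                close.append(c)
        return close

    if __name__ == "__main__":
        slow_filter(list(np.random.rand(20_000, 2)), np.array([0.5, 0.5]), 0.1)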
So in addition to what akasaka said (another thumbs up for line profiler from me, great tool) this isn’t a problem with linalg.norm being slow. It’s plenty fast, but calling it thousands of separate times in a Python loop will be slow. This is more just about learning how to vectorize properly. If you’re working in numpy land and you’re calling a numpy function in a loop that’s iterating over more than a handful of items, chances are you’re not vectorizing properly
Everything is just a proof of concept and no one expects anything more than that.
    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)
They were just using numpy wrong. You can be slow in any language if you use the tools wrong
Numpy's tutorial for broadcasting is also a good starting point.
https://numpy.org/doc/stable/user/basics.broadcasting.html
You can look at various tutorials to see how it works. For example: https://jakevdp.github.io/PythonDataScienceHandbook/02.05-co...
You can always speed up an application if you rewrite the libraries you use to match your specific use case.
The more ML I do, the more disappointed I get.
IMO/IME much better to go with a language where you don't have this dichotomy in the first place - e.g. Java or C#.
You don't have to move all the computation; just moving the `for` loops will help a lot.
Instead of:
    for p in ps: norm(p.center - point)
You should do:
    centers = np.array([p.center for p in ps])
    norm(centers - point, axis=1)
You’ll get your same speed up in 2 lines without introducing a new dependency