Good for you! You did everything right: measure always, fix the bottleneck if possible, rewrite if necessary.
A little tip, you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.
I once rewrote a GDI+ point transformation routine in pure C# and got 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years after, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know geometry behind your tools, you can either use them efficiently, or rewrite them completely.
The ignore the square root while computing/comparing distances trick is a great one. That's how I got to the top of the performance leaderboard in my first algorithms class.
I think a big mistake in the article, in a context where performance is the main objective, is that the author uses an array of structs (AoS), rather than a struct of arrays (SoA). An SoA makes it so that the data is ordered contiguously, which is easy to read for the CPU, while an AoS structure interleaves different data (namely the x and y in this case), which is very annoying for the CPU. A CPU likes to read chunks of data (for example 128 bits of data/read) and to process these with SIMD instructions, executing a multiple of calculations with one CPU cycle. This is completely broken when using an array of structs.
He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.
Author here:
I agree, that's a great perf advice (esp. when you can restructure your code).
I couldn't get into a this in the article (would be too long), but this is a great point and the original library does this in a lot of places.
One problem in our use case is that the actual structs members are pretty big & that we need to group/regroup them a lot.
The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.
It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.
Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.
[jim@mbp ~]$ sh -x x
+ cat x1.c
#include <stdio.h>
#define NUM 1000000000
struct {
int x;
int y;
} p[NUM];
int main() {
int i,s;
for (i=0; i<NUM; i++) {
p[i].x = i;
p[i].y = i;
}
s=0;
for (i=0; i<NUM; i++) {
s += p[i].x + p[i].y;
}
printf("s=%d\n", s);
}
+ cc -o x1 x1.c
+ ./x1
s=1808348672
real 0m12.078s
user 0m7.319s
sys 0m4.363s
+ ./x1
s=1808348672
real 0m9.415s
user 0m6.677s
sys 0m2.685s
+ cat x2.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
int i,s;
for (i=0; i<NUM; i++) {
x[i] = i;
y[i] = i;
}
s=0;
for (i=0; i<NUM; i++) {
s += x[i] + y[i];
}
printf("s=%d\n", s);
}
+ cc -o x2 x2.c
+ ./x2
s=1808348672
real 0m9.753s
user 0m6.713s
sys 0m2.967s
+ ./x2
s=1808348672
real 0m9.642s
user 0m6.674s
sys 0m2.902s
+ cat x3.c
#include <stdio.h>
#define NUM 1000000000
struct {
int x;
int y;
} p[NUM];
int main() {
int i,s;
for (i=0; i<NUM; i++) {
p[i].x = i;
}
for (i=0; i<NUM; i++) {
p[i].y = i;
}
s=0;
for (i=0; i<NUM; i++) {
s += p[i].x;
}
for (i=0; i<NUM; i++) {
s += p[i].y;
}
printf("s=%d\n", s);
}
+ cc -o x3 x3.c
+ ./x3
s=1808348672
real 0m13.844s
user 0m11.095s
sys 0m2.700s
+ ./x3
s=1808348672
real 0m13.686s
user 0m11.038s
sys 0m2.611s
+ cat x4.c
#include <stdio.h>
#define NUM 1000000000
int x[NUM];
int y[NUM];
int main() {
int i,s;
for (i=0; i<NUM; i++)
x[i] = i;
for (i=0; i<NUM; i++)
y[i] = i;
s=0;
for (i=0; i<NUM; i++)
s += x[i];
for (i=0; i<NUM; i++)
s += y[i];
printf("s=%d\n", s);
}
+ cc -o x4 x4.c
+ ./x4
s=1808348672
real 0m13.530s
user 0m10.851s
sys 0m2.633s
+ ./x4
s=1808348672
real 0m13.489s
user 0m10.856s
sys 0m2.603s
This is a great article but there's still a core problem there - why should developers have to choose between accessibility and performance?
So much scientific computing code suffers between core packages being split away from their core language - at what point do we stop and abandon python for languages which actually make sense? Obviously julia is the big example here, but its interest, development and ecosystem doesn't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x what's stopping people from switching???
Today, there is a Python package for everything. The ecosystem is possibly best in class for having a library available that will do X. You cannot separate the language from the ecosystem. Being better, faster, and stronger means little if I have to write all of my own supporting libraries.
Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets. Which for the vast majority of researchers are not big data. If the one-time calculation takes 12 seconds instead of 0.1 seconds is not a problem worth optimizing.
Because professional software developers with a background in CS are a minority of people who program today. The learning curve of pointers, memory-allocation, binary operations, programming paradigms, O-Notation and other things you need to understand to efficiently code in something like C is a lot to ask of someone who is for example primarily a sociologist or biologist.
The use case btw. is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is basically a glorified calculator. Readability trumps efficiency by a large margin every time.
everything. why are there still cobol programmers? why is c++ still the defacto native language (also in research)?
but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side
One of Julia's Achilles heels is standalone, ahead-of-time compilation. Technically this is already possible [1], [2], but there are quite a few limitations when doing this (e.g. "Hello world" is 150 MB [6]) and it's not an easy or natural process.
The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.
Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.
IME, for having used Julia quite extensively in Academia:
- the development experience is hampered by the slow start time;
- the ecosystem is quite brittle;
- the promised performances are quite hard to actually reach, profiling only gets you so far;
- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)
> what's stopping people from switching???
All of the mentioned above, inertia, perfect is the enemy of good enough, the alternatives are far away from python ecosystem & community, performances are not often a show blocker.
I don't know whether this sentiment is just a byproduct of CS education, but for some reason people equate a programming language with the compute that goes on under the hood. Like if you write in Python, you are locked into the specific non optimized way of computing that Python does.
Its all machine code under the hood. Everything else on top is essentially description of more and more complex patterns of that code. So its a no brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.
A vectorized implementation of find_close_polygons wouldn't be very complex or hard to maintain at all, but the authors would also have to ditch their OOP class based design, and that's the real issue here. The object model doesn't lend itself to performant, vectorized numpy code.
Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:
import numpy as np
n_sides = 30
n_polygons = 10000
class Polygon:
def __init__(self, x, y):
self.x = x
self.y = y
self.center = np.array([self.x, self.y]).mean(axis=1)
def find_close_polygons(
polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
close_polygons = []
for poly in polygon_subset:
if np.linalg.norm(poly.center - point) < max_dist:
close_polygons.append(poly)
return close_polygons
polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
point = np.array([0, 0])
max_dist = 0.5
%timeit find_close_polygons(polygons, point, max_dist)
(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:
def find_close_polygons(
polygon_subset: List[Polygon], point: np.array, max_dist: float
) -> List[Polygon]:
centers = np.array([polygon.center for polygon in polygon_subset])
mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
return [
polygon
for polygon, is_pass in zip(polygon_subset, mask)
if is_pass
]
then the same computation takes 4ms on my machine.
Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.
(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)
Python's for loop implementation is slow, also. You can use built in utils like map() which are "native" and can be a lot faster than a for loop with a push:
I had a similar problem, when I was working as a PhD student a few years ago, where I needed to match the voxel representation of a 3D printer with the tetrahedral mesh of our rendering application.
My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.
Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.
Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). On the contrary the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.
A little tip, you don't have to compare actual distances, you can compare squared distances just as well. Then in `norm < max_dist`, you don't have to do a `sqrt()` for every `norm`. Saves a few CPU ticks as well.
I once rewrote a GDI+ point transformation routine in pure C# and got 200x speedup just because the routine was riddled with needless virtual constructors, copying type conversions, and something called CreateInstanceSlow. Ten years after, I gathered a few of these anecdotes and wrote the Geometry for Programmers book (https://www.manning.com/books/geometry-for-programmers) with its main message: when you know geometry behind your tools, you can either use them efficiently, or rewrite them completely.
He uses the same data structure in both the Python and Rust code, so I imagine that he can get an extra 4x speedup at least if he rewrites his code with memory layout in mind.
I couldn't get into a this in the article (would be too long), but this is a great point and the original library does this in a lot of places.
One problem in our use case is that the actual structs members are pretty big & that we need to group/regroup them a lot.
The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.
Loading comment...
It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.
Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.
Loading comment...
So much scientific computing code suffers between core packages being split away from their core language - at what point do we stop and abandon python for languages which actually make sense? Obviously julia is the big example here, but its interest, development and ecosystem doesn't seem to be growing at a serious pace. Given that the syntax is moderately similar and the performance benefits are often 10x what's stopping people from switching???
Also, few scientific programmers have any notion of what C or Fortran is under the hood. Most are happy to stand on the shoulders of giants and do work with their specialized datasets. Which for the vast majority of researchers are not big data. If the one-time calculation takes 12 seconds instead of 0.1 seconds is not a problem worth optimizing.
Loading comment...
Loading comment...
Loading comment...
The use case btw. is often also very different. In most of academia, writing code is basically just a fancy mode of documentation for what is basically a glorified calculator. Readability trumps efficiency by a large margin every time.
Loading comment...
Loading comment...
Loading comment...
Loading comment...
centers = np.array([p.center for p in ps]) norm(centers - point, axis=1)
They were just using numpy wrong. You can be slow in any language if you use the tools wrong
Loading comment...
Loading comment...
but also I don't see any problem there, I think the python + c++/rust idiom is actually pretty nice. I have a billion libs to choose from on either side. Great usability on the py side, and unbeatable performance on the c++ side
The immature AoT capabilities are a huge pain to deal with when writing large code packages or even when trying to make command line applications. Things have to be recompiled each time the Julia runtime is shut down. The current strategy in the community to get around this seems to be "keep the REPL alive as long as possible" [3][4][5], but this isn't a viable option for all use cases.
Until Julia has better AoT compilation support, it's going to be very difficult to develop large scale programs with it. Version 1.9 has better support for caching compiled code, but I really wish there were better options for AoT compiling small, static, standalone executables and libraries.
[1]: https://julialang.github.io/PackageCompiler.jl/dev/
[2]: https://github.com/tshort/StaticCompiler.jl
[3]: https://discourse.julialang.org/t/ann-the-ion-command-line-f...
[4]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[5]: https://discourse.julialang.org/t/extremely-slow-execution-t...
[6]: https://www.reddit.com/r/Julia/comments/ytegfk/size_of_a_hel...
Loading comment...
- the development experience is hampered by the slow start time;
- the ecosystem is quite brittle;
- the promised performances are quite hard to actually reach, profiling only gets you so far;
- the ecosystem is pretty young, and it shows (lack of docs, small community, ...)
> what's stopping people from switching???
All of the mentioned above, inertia, perfect is the enemy of good enough, the alternatives are far away from python ecosystem & community, performances are not often a show blocker.
Its all machine code under the hood. Everything else on top is essentially description of more and more complex patterns of that code. So its a no brainer that a language that lets you describe those complex but repeating patterns in the most direct way is the most popular. When you use python, you are effectively using a framework on top of C to describe what you need, and then if you want to do something specialized for performance, you go back to the core fundamentals and write it in C.
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.
(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)
https://levelup.gitconnected.com/python-performance-showdown...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
Loading comment...
My first attempt in Python was both prohibitively slow and more complicated than necessary, because I tried to use vectorized numpy, where possible.
Since this was only a small standalone script, I rewrote it in Julia in a day. The end result was ca. 100x faster and the code a lot cleaner, because I could just implement the core logic for one tetrahedron and then use Julia's broadcast to apply it to the array of tetrahedrons.
Anyway, Julia's long startup time often prohibits it from being used inside other languages (even though the Python/Julia interoperability is good). On the contrary the Rust/Python interop presented here seems to be pretty great. Another reason I should finally invest the time to learn Rust.
Loading comment...
Loading comment...
Loading comment...
Instead of:
for p in ps: norm(p.center - point)
You should do:
centers = np.array([p.center for p in ps]) norm(centers - point, axis=1)
You’ll get your same speed up in 2 lines without introducing a new dependency
Loading comment...
Loading comment...
Loading comment...