Almost always when I start prototyping something in Python, I wish I had stopped halfway to where I am now and switched to something else.
Most recent example - converting a huge number of XML files to Parquet. I started very fast with Python + pyarrow, but when I realized that parallelizing execution would help enormously, I hit the GIL and pickling/unpickling/multiprocessing costs.
It did work in Python in the end, but I feel that writing it in Rust/C# (even though I don't know Rust beyond tutorials) would have been much more performant.
Sounds like, with this tooling and the task at hand, about the most complex things that should be passing through the pickler are partitioned lists of filenames rather than raw data. E.g. you can have each partition generate a Parquet file for combining in a final step (pyarrow.concat_tables() looks useful), or if you were working with some other format, potentially send flat arrays back to the parent process as giant bytestrings or similar.
This is not to say the limitations don't suck, just that very often there are simple approaches to avoid most of the pain
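For what it's worth, a rough sketch of that shape: only partitioned lists of filenames are pickled, each worker writes its own partition file, and the partitions are combined at the end. The xml_to_table converter and its field names are made up for illustration; the only real APIs assumed are xml.etree, concurrent.futures, and pyarrow's table/concat_tables/parquet functions.

    # Sketch only: partition the filenames, have each worker build its own
    # partition Parquet file, then combine in a final step in the parent.
    import glob
    import os
    import xml.etree.ElementTree as ET
    from concurrent.futures import ProcessPoolExecutor

    import pyarrow as pa
    import pyarrow.parquet as pq

    def xml_to_table(path):
        # Hypothetical per-file converter: flatten one XML file into a pyarrow Table.
        root = ET.parse(path).getroot()
        records = [(r.get("id"), r.findtext("value")) for r in root.iter("record")]
        return pa.table({"id": [r[0] for r in records],
                         "value": [r[1] for r in records]})

    def convert_partition(args):
        # One worker: convert a list of paths into a single partition file.
        part_id, paths = args
        out_path = f"out/part-{part_id}.parquet"
        pq.write_table(pa.concat_tables([xml_to_table(p) for p in paths]), out_path)
        return out_path  # only a short string travels back through the pickler

    if __name__ == "__main__":
        os.makedirs("out", exist_ok=True)
        files = sorted(glob.glob("data/*.xml"))
        n_parts = 8
        # Only these partitioned lists of filenames get pickled, never raw data.
        partitions = [(i, files[i::n_parts]) for i in range(n_parts)]

        with ProcessPoolExecutor(max_workers=n_parts) as pool:
            part_files = list(pool.map(convert_partition, partitions))

        # Final combine step, as described above, using pyarrow.concat_tables().
        combined = pa.concat_tables([pq.read_table(p) for p in part_files])
        pq.write_table(combined, "out/combined.parquet")

The per-partition files also act as cheap checkpoints: if one partition fails, you only redo that slice.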
I recently needed to do something similar and used Apache Arrow’s Go library for Parquet. It’s horrifically slow and somehow manages to leak memory. It’s also undocumented. If anybody knows a good Parquet library for Go please let me know.
Yeah, IME Python can be about 100x slower than a native solution. The original solution was a combination of a C library and Python wrapper code, so 4x makes sense for eliminating the Python part.
That is a lot of text for not determining why the new solution is faster. The only relevant part:
> Before our migration, the old pipeline utilized a C library accessed through a Python service, which buffered and bundled data. This was really the critical aspect that was causing our latency.
How much of a speedup would there have been if they had moved to a Rust wrapper around the same C library?
Using something other than Python is almost always going to be faster. This Reddit post does not give any insight into which aspects of Python lead to small or large performance hits. They show, with ample documentation, that it was the right solution for them, which is great, but they don't provide any generalizable information.
What I was thinking as well... I love these performance-improvement posts, but at the same time I had to wonder what kind of choice it was to reach for Python in the first place if the task was a lot of heavy concurrent task management?
> Most recent example - converting a huge number of XML files to Parquet. I started very fast with Python + pyarrow, but when I realized that parallelizing execution would help enormously, I hit the GIL and pickling/unpickling/multiprocessing costs.
> It did work in Python in the end, but I feel that writing it in Rust/C# (even though I don't know Rust beyond tutorials) would have been much more performant.
2. Do not use loops, use itertools. (Python 3.12 got a nice 'batched' function, btw.)
3. Preallocate memory, where possible.
4. multiprocessing.shared_memory may help (see the sketch after this list).
5. Something like Cython may help.
6. A combination of multiprocessing with asyncio (or multithreading) may help.
7. memmap file access might help.
You see, you have a lot of options before you need to learn another language.
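A minimal sketch of how a couple of these combine (itertools.batched from 3.12 plus multiprocessing.shared_memory over a preallocated numpy buffer). The block name, array size, and per-item work are made up for illustration:

    # Sketch: batch indices with itertools.batched (Python 3.12+) and let workers
    # write results into one preallocated numpy buffer backed by shared_memory,
    # so the result array itself never gets pickled between processes.
    from itertools import batched
    from multiprocessing import Pool, shared_memory

    import numpy as np

    N_ITEMS = 1_000_000
    SHM_NAME = "results_buf"  # hypothetical name for the shared block

    def process_batch(batch):
        # Attach to the existing block and fill this batch's slots in place.
        shm = shared_memory.SharedMemory(name=SHM_NAME)
        view = np.ndarray((N_ITEMS,), dtype=np.float64, buffer=shm.buf)
        for i in batch:
            view[i] = i * 0.5  # stand-in for the real per-item work
        del view  # drop the buffer view before closing, or close() raises BufferError
        shm.close()

    if __name__ == "__main__":
        # Preallocate the whole result buffer once, up front.
        shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=N_ITEMS * 8)
        try:
            with Pool() as pool:
                # Only these small tuples of indices are pickled, not the results.
                pool.map(process_batch, batched(range(N_ITEMS), 10_000))
            results = np.ndarray((N_ITEMS,), dtype=np.float64, buffer=shm.buf).copy()
            print(results[:5])
        finally:
            shm.close()
            shm.unlink()

If the per-item work is heavier than a multiply, the batching keeps per-task pickling overhead small while shared_memory avoids copying results back through the parent.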
But I have the same sentiment, although I find writing quick C++ extensions (SWIG is incredible) is a good balance.
It’s worth giving it a try if you haven’t before.
There's also the fact that compiling everything together lets the optimizer get a lot more work done.
The moment you are using "pure" Python on large datasets, the whole thing starts to crumble.
It just comes down to whether you need speed of building it or speed of the program
Guess I'm never satisfied