Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
What is the real world benefit we will get in return?
In the rare case where I need to max out more than one CPU core, I usually implement that by having the OS run multiple instances of my program and put a bit of parallelization logic into the program itself. Like in the Mandelbrot example the author gives, I would simply tell each instance of the program which part of the image it will calculate.
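The multi-instance split described above can be sketched roughly like this (the function, the escape-time formula parameters, and the row-partitioning scheme are all illustrative, not the commenter's actual code): each OS process is told which band of rows it owns.

```python
import sys

def mandelbrot_row(y, width=800, height=600, max_iter=100):
    """Escape-time iteration counts for one row of the image."""
    row = []
    for x in range(width):
        c = complex(3.5 * x / width - 2.5, 2.0 * y / height - 1.0)
        z, n = 0j, 0
        while abs(z) <= 2 and n < max_iter:
            z = z * z + c
            n += 1
        row.append(n)
    return row

if __name__ == "__main__" and len(sys.argv) >= 3:
    # e.g. `python mandel.py 2 4` computes the third quarter of the rows;
    # a small driver launches one such process per core and stitches the
    # output bands back together.
    part, nparts = int(sys.argv[1]), int(sys.argv[2])
    height = 600
    for y in range(part * height // nparts, (part + 1) * height // nparts):
        print(" ".join(map(str, mandelbrot_row(y))))
```

The appeal of this pattern is that the processes never communicate while working, which is exactly why it stops being attractive once the work units need to share state.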
There is an argument that if you need in-process multithreading you should use a different language. But a lot of people need to use Python because everything else they’re doing is in Python.
There are quite a few common cases where in-process multithreading is useful. The main ones are where you have large inputs or large outputs to the work units. In-process is nice because you can share the input or output state with the work units instead of having to copy it.
One very common case is almost all GUI applications, where you want to be able to do all work on background threads and just move data back and forth from the coordinating UI thread. JavaScript’s lack of support here, outside of a native language compiled via Emscripten, is one reason web apps are so hard to make jankless. The copies of data across web workers or Python processes are quite expensive as far as things go.
Once a week or so, I run into a high-compute Python scenario where the existing forms of multiprocessing fail me: large shared inputs, or cases where I don’t want the multiprocess overhead, but the GIL slows everything down.
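A minimal sketch of the large-shared-input case (numbers and names invented): threads read one big in-memory structure in place, whereas multiprocessing would have to pickle it into every worker.

```python
from concurrent.futures import ThreadPoolExecutor

big_input = list(range(1_000_000))  # large shared work-unit input

def work_unit(lo, hi):
    # Each work unit reads its slice of the shared state directly;
    # nothing is serialized or copied between address spaces.
    return sum(big_input[i] for i in range(lo, hi))

n_workers = 4
step = len(big_input) // n_workers
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    futures = [pool.submit(work_unit, i * step, (i + 1) * step)
               for i in range(n_workers)]
    total = sum(f.result() for f in futures)
```

On a default build the GIL serializes the pure-Python work here; the point of free threading is that the same no-copy structure can then actually use all the cores.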
> Where you want to be able to do all work on background threads and just move data back and forth from the coordinating ui thread. JavaScript’s lack of support here, outside of a native language compiled into emscripten, is one reason web apps are so hard to make jankless
I thought transferring array buffers through web workers didn’t involve any copies if you actually transferred ownership:
worker.postMessage(view.buffer, [view.buffer]);
I can understand that web workers might be more annoying to orchestrate than native threads and the like, but I’m not sure that JS lacks the primitives to make it possible. More likely it’s really hard to have a pauseless GC for JS (Python predominantly relies on reference counting and uses its GC just to catch cycles).
As always, it depends a lot on what you're doing, and a lot of people are using Python for AI.
One of the drawbacks of multi-processing versus multi-threading is that you cannot share memory (easily, cheaply) between processes. During model training, and even during inference, this becomes a problem.
For example, imagine a high volume, low latency, synchronous computer vision inference service. If you're handling each request in a different process, then you're going to have to jump through a bunch of hoops to make this performant. For example, you'll need to use shared memory to move data around, because images are large, and sockets are slow. Another issue is that each process will need a different copy of the model in GPU memory, which is a problem in a world where GPU memory is at a premium. You could of course have a single process for the GPU processing part of your model, and then automatically batch inputs into this process, etc. etc. (and people do) but all this is just to work around the lack of proper threading support in Python.
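For concreteness, one of those "hoops" can be done with just the standard library (the sizes and byte values here are placeholders): `multiprocessing.shared_memory` lets two processes see the same image buffer without pushing the bytes through a socket or pickling them.

```python
from multiprocessing import shared_memory

IMAGE_BYTES = 1920 * 1080 * 3  # one RGB frame

# Request-handler side: put the raw image into a named shared segment.
shm = shared_memory.SharedMemory(create=True, size=IMAGE_BYTES)
shm.buf[:4] = b"\xde\xad\xbe\xef"  # stand-in for real pixel data

# Inference-worker side (normally another process): attach by name and
# read through a zero-copy memoryview instead of a socket.
worker_view = shared_memory.SharedMemory(name=shm.name)
first_bytes = bytes(worker_view.buf[:4])

worker_view.close()
shm.close()
shm.unlink()  # release the segment once all processes are done with it
```

Even this small sketch shows the overhead the parent is describing: you now own segment naming, lifetime, and cleanup yourself, which threads give you for free.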
By the way, if anyone is struggling with these challenges today, I recommend taking a peek at nvidia's Triton inference server (https://github.com/triton-inference-server/server), which handles a lot of these details for you. It supports things like zero-copy sharing of tensors between parts of your model running in different processes/threads and does auto-batching between requests as well. The auto-batching especially gave us a big throughput increase with only a minor latency penalty!
> For example, imagine a high volume, low latency, synchronous computer vision inference service.
I'm not in this space and this is probably too simplistic, but I would think pairing asyncio to do all IO (reading / decoding requests and preparing them for inference) coupled with asyncio.to_thread'd calls to do_inference_in_C_with_the_GIL_released(my_prepared_request), would get you nearly all of the performance benefit using current Python.
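A rough sketch of that structure (the function names are invented stand-ins): asyncio multiplexes request IO on the event loop, and the heavy call runs in a worker thread. The key caveat is that `asyncio.to_thread` only buys real parallelism when the underlying call actually releases the GIL, as many C extensions do.

```python
import asyncio

def do_inference_with_gil_released(prepared):
    # Stand-in for a C-extension call that drops the GIL while it works
    # (NumPy kernels and most inference runtimes behave this way).
    return sum(prepared)

async def handle_request(raw):
    prepared = [int(tok) for tok in raw.split()]  # cheap decode on the loop
    return await asyncio.to_thread(do_inference_with_gil_released, prepared)

async def main():
    # Several concurrent requests: IO interleaved by asyncio,
    # inference running in threads.
    return await asyncio.gather(*(handle_request("1 2 3") for _ in range(4)))

results = asyncio.run(main())
```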
The biggest use case (that I am aware of) of GIL-less Python is for parallel feeding data into ML model training.
* PyTorch currently uses `multiprocessing` for that, but it is fraught with bugs and delivers less-than-ideal performance, which matters a lot for ML training (slow data loading can starve the GPU).
* TensorFlow just bypasses Python for data loading: its data loaders are actually written in C++, so it has no performance problems. But it is so inflexible that it is always painful for me to load data in TF.
Given how hot ML is, and how Python is currently the major language for ML, it makes sense for them to optimize for this.
> Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
This was the original reason for CPython to retain the GIL for a very long time, and it was probably true for most of that time. That's why the eventual GIL removal had to be paired with other important performance improvements, like the JIT, which was only implemented after some feasible paths were found and the work was explicitly funded by a big sponsor.
My hunch is that in just a few years time single core computers will be almost extinct. Removing the GIL now feels to me like good strategic preparation for the near future.
I can't think of any actual computer outside of embedded that has been single core for at least a decade. The Core Duo and Athlon X2 were released almost 20 years ago now and within a few years basically everything was multicore.
(When did we get old?)
If you mean that single core workloads will be extinct, well, that's a harder sell.
> What is the real world benefit we will get in return?
If you have many CPU cores and an embarrassingly parallel algorithm, multi-threaded Python can now approach the performance of a single-threaded compiled language.
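As a toy instance of that claim (the workload and chunk sizes are invented): an embarrassingly parallel prime count split across four threads. On a free-threaded build the chunks can occupy four cores; on a default build the GIL serializes the Python bytecode, so wall-clock time barely improves.

```python
import sys
from concurrent.futures import ThreadPoolExecutor

def count_primes(lo, hi):
    # Deliberately naive, CPU-bound trial division.
    found = 0
    for n in range(lo, hi):
        if n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)):
            found += 1
    return found

# Four independent chunks: embarrassingly parallel, no shared mutation.
chunks = [(i * 2_500, (i + 1) * 2_500) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(lambda c: count_primes(*c), chunks))

# CPython 3.13+ exposes this probe; earlier versions simply lack it.
gil_probe = getattr(sys, "_is_gil_enabled", None)
status = "unknown" if gil_probe is None else str(gil_probe())
print(f"{total} primes below 10000; GIL enabled: {status}")
```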
The question really is whether one couldn't make multiprocessing better instead of going multithreaded. I did a ton of MPI work with Python ten years ago already.
What's more, I am now seeing in Julia that multithreading doesn't scale to larger core counts (like 128) due to the garbage collector. I had to revert to multiprocessing again.
That's not really correct. Python is by far the slowest mainstream language. It is embarrassingly slow. Furthermore, several mainstream compiled languages have been multicore-capable for decades. So comparing against a single-threaded language or program doesn't make sense.
All this really means is that Python catches up on decades old language design.
However, it simply adds yet another design input. Python's threading, multiprocessing, and asyncio paradigms were all developed to get around the limitations of Python's performance issues and the lack of support for multicore. So my question is, how does this change affect the decision tree for selecting which paradigm(s) to use?
What you’re describing is basically using MPI in some way, shape or form. This works, but also can introduce a lot of complexity. If your program doesn’t need to communicate, then it’s easy. But that’s not the case for all programs. Especially once we’re talking about simulations and other applications running on HPC systems.
Sometimes it’s also easier to split work using multiple threads. Other programming languages let you do that and actually use multiple threads efficiently. In Python, the benefit was just too limited due to the GIL.
> Removing the GIL sounds like it will make typical Python programs slower and will introduce a lot of complexity?
There is a lot of Python code that either explicitly (or implicitly) relies on the GIL for correctness in multithreaded programs.
I myself have even written such code, explicitly relying on the GIL as synchronization primitive.
Removing the GIL will break that code in subtle and difficult to track down ways.
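An invented illustration of the reliance pattern being described: code that skips locks because individual dict/list operations are atomic under the GIL. (The free-threaded build actually goes to some lengths to keep single operations like these safe via per-object locks; the subtle breakage tends to be in code that assumes whole *sequences* of operations run without interleaving. And compound operations like `n += 1` were never GIL-safe to begin with.)

```python
import threading

results = []   # no lock: list.append is one atomic operation under the GIL
cache = {}     # likewise a single dict.setdefault call

def worker(i):
    cache.setdefault(i % 4, i)  # "the GIL makes this safe"
    results.append(i)           # ditto

threads = [threading.Thread(target=worker, args=(i,)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```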
The good news is that a large percentage of this code will stay running on older versions of python (2.7 even) and so will always have a GIL around.
Some of it, however, will end up running on no-GIL Python, and I don't envy the developers who will be tasked with tracking down the bugs. But most likely such code will run on modern versions of Python using --with-gil or whatever other flag is provided to re-enable the GIL.
The benefit to the rest of the world then is that future programs will be able to take advantage of multiple cores with shared memory, without needing to jump through the hoops of multi-process Python.
Python has been feeling the pain of the GIL in this area for many years already, and removing the GIL will make Python more viable for a whole host of applications.
> What is the real world benefit we will get in return?
None. I've been using Python "in anger" for twenty years and the GIL has been a problem zero times. It seems to me that removing the GIL will only make for more difficulty in debugging.
The JIT is in a weird place now because, according to a recent PyCon US talk by its author, the tier 2 optimizer that prepares code for the JIT reduces performance by 20%, and the JIT just recovers that performance loss.
A lot of the language is still not optimized by tier 2 [1], and even less of it has copy-and-patch templates for the JIT. And the JIT itself currently has some memory-management issues to iron out.
I have seen some stuff about that, however that is to be expected.
Many of these kinds of changes take time and require multiple iterations.
See Go or .NET tooling bootstrapping, all the years it took Maxine VM to evolve into GraalVM, Swift's evolution versus Objective-C, the Java/Kotlin AOT evolution story on Android, and so on.
If only people that really deeply care get to compile from source to try out the JIT and give feedback, it will have even less people trying it out than those that bother with PyPy.
Linux isn’t one target, but a matrix of targets: providing binaries for Linux means picking architectures, libc versions and variants, OpenSSL versions and forks, etc. against which to canonicalize. This also has downstream implications, e.g. a CPython binary with a static build of OpenSSL might contain an OpenSSL vulnerability that CPython is now on the hook for remediating (rather than delegating that remediation responsibility to downstream distributions).
Some of this complexity is also true for Windows, but Linux’s (good!) diversity makes it a bigger challenge.
Not sure if they will be distributed in Homebrew et al., but at least "Pre-built binaries marked as free-threaded can be installed as part of the official Windows and macOS installers" [0]
The performance degradation with nogil is quoted as 20%. It can easily be as much as 50%.
The JIT does not seem to help much. All in all a very disappointing release that may be a reflection of the social and corporate issues in CPython.
A couple of people have discovered that they can milk CPython by promising features, silencing those who are not 100% enthusiastic and then underdeliver. Marketing takes care of the rest.
Could you clarify what/who you mean in the final sentence? It gives the impression you didn't take on board the article's mention of the three phases.
Why are you disappointed? Do you think that the progress should be faster? Or that this work should never have been started? Or that they should wait until it works better before integrering in master and having it in a mainline release?
I remember first discussions about removing the GIL back in 2021 and a lot of initial confusion about what the implications would be. This is a great summary if, like me, you weren’t satisfied with the initial explanations given at the time.
You can find forum and Reddit posts going back 15-20 years of people attempting to remove the GIL. Guido van Rossum made the requirement that single-core performance cannot be hurt by removing it; that is what made every previous attempt fail in the end.
This sounds silly but I’ve actually turned off garbage collection in short running, small memory programs and gotten a big speed boost.
I wonder if that’s something they could automate? I’m sure there are some weird risks with that. Maybe a small program ends up eating all your memory in some edge case?
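The trick the parent describes is just this (workload invented for illustration); note that refcounting still frees most garbage immediately, so for a short run the cycle collector often has little real work. The edge case worried about above is real, though: anything caught in reference cycles piles up until the process exits.

```python
import gc

def churn():
    # Allocate and drop many small container objects; reference counting
    # reclaims them immediately, with or without the cycle collector.
    for _ in range(200_000):
        d = {"k": [1, 2, 3]}
        del d

gc.disable()          # skip cycle-collection passes for this short run
try:
    churn()
finally:
    gc.enable()       # restore normal behaviour (matters if this is imported)
```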
You'd run into the halting problem. Maybe for some small subset of programs it'd be possible to prove a short runtime, but in general it wouldn't be, so this type of automation wouldn't be possible.
It sounds like maybe you want GCs to be very tunable? That way, developers and operators can change how it runs for a given workload. That's actually one of the (few) reasons I like Java, its ability to monitor and tune its GC is awesome.
No one GC is going to be optimal for all workloads/usages. But it seems like the prevailing thought is to change the code to suit the GC where absolutely necessary, instead of tuning the GC to the workload. I'm not sure why that is?
The first alloc will be really early on in even trivial Python programs, so that’s not a good heuristic. It’s hard to pick a good heuristic in this area because the line between startup time and runtime is so blurry in platforms that support runtime/deferred import and eval.
Where does it say that? It's simply that Python releases features in yearly cycles, and that's what was completed in time for this release.
The idea is to let people experiment with no-GIL to see what it breaks, while maintainers and outside contractors improve the performance in future versions.
No that was not the idea. The feature went in under the assumption that the single thread slowdown would be offset by other minor speed improvements.
That was literally the official reason why it was accepted. Now we have slowdowns ranging from 20-50% compared to Python 3.9.
What outside contractors would fix the issue? The Python ruling class has chased away most people who actually have a clue about the Python C-API, which are now replaced by people pretending to understand the C-API and generating billable hours.
> What happens if multiple threads try to access / edit the same object at the same time? Imagine one thread is trying to add to a dict while another is trying to read from it. There are two options here
Why not just ignore this fact, like C and C++ do? Worst case this is a data race; best case the programmer either puts a lock around it or writes a thread-safe dict themselves. What am I missing here?
Let me preface this by saying I have no source to prove what I’m about to say, but Guido van Rossum aimed to create a programming language that feels more like a natural language without being just a toy language.
He envisioned a real programming language that could be used by non-programmers, and for this to work it couldn’t contain the usual footguns.
One could argue that he succeeded, considering how many members of the scientific community, who don’t primarily see themselves as programmers, use Python.
But the performance motivations for removing the GIL seem at odds with this. I feel like the subset of Python users who care about the GIL and the subset who are "non-programmers" are entirely disjoint.
I guess I am one of these people and have zero idea on how to write code in parallel programming. Would GIL removal benefit me, given that it seems like it would hurt performance?
I am a scientific Python programmer. 99% to 100% of my programs require parallelism, but it is ALWAYS embarrassingly trivial parallelism: nothing is ever mutated and I never need locks. Right now I am forced to use multiprocessing to get the performance, with all the problems of multiprocessing, the major one being that I need to use more memory. For me, using multithreading could mean the difference between running out of memory and not. The GIL removal matters for people like me; many of the proponents of GIL removal come from the scientific community.
It's harder to ignore the problem in Python, because reference counting turns every read into a write, incrementing and then decrementing the refcount of the object you read. For example, calling "my_mutex.lock()" has already messed with the refcount on my_mutex before any locking happens. If races between threads could corrupt those refcounts, there's no way you could code around that. Right now the GIL protects those refcounts, so without a GIL you need a big change to make them all atomic.
Hm, I get the memory safety part, but could you elaborate on the compromised-VM part? I'm not sure I understand that. Specifically, I don't understand what you mean by VM here.
There is an expectation of not having to deal with data races in Python code. Apart from Python being a language where people expect these things not to be an issue, it is also the existing behaviour with the GIL in place, so changing it would be a breaking change.
Memory safety, heap integrity and GC correctness, I guess. If you ignore data races the way C does, your language will be as safe as C, except it's worse, because at least C doesn't silently rearrange your heap in the background.
Single core computers are already functionally extinct, but single-threaded programs are not.
The real difference is the lower communication overhead between threads vs. processes thanks to a shared address space.
Naturally I can easily compile my own Python 3.13 version, no biggie.
However, from my experience, this makes many people who could potentially try it out and give feedback not bother, and rather wait.
[1]: https://github.com/python/cpython/issues/118093
The talk by Brandt Bucher was there, but it was made private:
https://www.youtube.com/watch?v=wr0fVU3Ajwc
Check out IndyGreg's portable Python builds. They're used by Rye and uv.
https://gregoryszorc.com/docs/python-build-standalone/main/
[0] https://docs.python.org/3.13/whatsnew/3.13.html#free-threade...
For example: https://www.artima.com/weblogs/viewpost.jsp?thread=214235
https://discuss.python.org/t/incremental-gc-and-pushing-back...
That the worst case being memory unsafety and a compromised VM is not acceptable? Especially for a language as open to low-skill developers as Python?