One thing I'm curious about here is the operational impact.
In production systems we often see Python services scaling horizontally
because of the GIL limitations. If true parallelism becomes common,
it might actually reduce the number of containers/services needed
for some workloads.
But that also changes failure patterns — concurrency bugs,
race conditions, and deadlocks might become more common in
systems that were previously "protected" by the GIL.
It will be interesting to see whether observability and
incident tooling evolves alongside this shift.
This is surely why Facebook was interested in funding this work. It is common to have N workers or containers of Python because you are generally restricted to one CPU core per Python process (you can get a bit higher if you use libs that release the GIL for significant work). So the only scaling option is horizontal, because vertical scaling is very limited. The main downside of this was memory usage: you would have to load all of your code and libraries N times, and in-process caches would become less effective. So by being able to vertically scale a Python process much further, you can run fewer copies and save a lot of memory.
Generally speaking, the optimal amount of horizontal scaling is as little as you have to. You may want a bit of horizontal scaling for redundancy and geo distribution, but past that, vertically scaling to fewer, larger processes tends to be more efficient and easier to load balance, with a handful of other benefits.
> The main downside of this was memory usage. You would have to load all of your code and libraries N types and in-process caches would become less effective.
You can load modules and then fork child processes. Children will share memory with each other (if they need to modify any shared memory, they get copy-on-write pages allocated by the kernel) and you'll save quite a lot on memory.
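A minimal sketch of that pattern (POSIX only; note that CPython's reference counting can still dirty copy-on-write pages of any objects the children actually touch, so the saving is not total):

```python
import multiprocessing as mp

# Large read-only structure loaded once, before forking. After fork,
# children share these pages copy-on-write instead of re-importing
# and re-building everything per worker.
BIG_TABLE = list(range(1_000_000))

def worker(i):
    # Reads the parent's memory directly; BIG_TABLE is never pickled
    # or sent over a pipe. (CPython refcount updates will still dirty
    # the pages of objects a child touches.)
    return BIG_TABLE[i] * 2

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # explicit fork start method (POSIX only)
    with ctx.Pool(4) as pool:
        print(pool.map(worker, [0, 1, 2]))  # [0, 2, 4]
```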
But Python can fork itself and run multiple processes in one single container. Why would there be a need to run several containers just to run several processes?
There's even the multiprocessing module in the stdlib to achieve this.
Threads are cheap, you can do N work simultaneously with N threads in one process, without serialization, IPC or process creation overhead.
With multiprocessing, processes are expensive and each one hogs its own resources. You must serialize data twice for IPC (once to send, once to receive), which is expensive and time-consuming.
You shouldn't have to break out multiple processes, for example, to do some simple pure-Python math in parallel. It doesn't make sense to use multiple processes for something like that because the actual work you want to do will be overwhelmed by the IPC overhead.
There are also limitations, only some data can be sent to and from multiple processes. Not all of your objects can be serialized for IPC.
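To illustrate the serialization point: threads can run any callable on any object directly, while a process pool must pickle both the function and its arguments, which fails for things like lambdas. A small sketch:

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

square = lambda x: x * x  # pickled by qualified name, which fails for lambdas

# Threads share the interpreter, so any callable and any object can be
# handed to workers directly; there is no serialization step at all.
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(square, range(5))))  # [0, 1, 4, 9, 16]

# A ProcessPoolExecutor would have to pickle `square` to ship it to a
# worker process, and that fails:
try:
    pickle.dumps(square)
except Exception as e:
    print("not picklable:", type(e).__name__)
```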
Forking and multithreading do not coexist. Even if one of your transitive dependencies decides to launch a thread that's 99% idle, it becomes unsafe to fork.
For big things the current way works fine. Having a separate container/deployment for Celery, the web server, etc. is nice so you can deploy and scale them separately. Mostly it works fine, but there are of course some drawbacks. For example, Prometheus scraping is clunky to work around when a worker can't run a web server in parallel to expose its metrics.
And for smaller projects it's such an annoyance. Having a simple project running, and having to muck around to get cron jobs, background/async tasks, etc. to work in a nice way is one of the reasons I never reach for Python in these instances. I hope removing the GIL makes it better, but I'm also afraid it will expose a whole can of worms where lots of apps, tools and frameworks aren't written with this possibility in mind.
As much as I dislike Java the language, this is somewhere where the difference between CPython and JVM languages (and probably BEAM too) is hugely stark. Want to know if garbage collection or memory allocation is a problem in your long running Python program? I hope you're ready to be disappointed and need to roll a lot of stuff yourself. On the JVM the tooling for all kinds of observability is immensely better. I'm not hopeful that the gap is really going to close.
A lot of that has already been solved by scaling workers to cores, along with techniques like greenlets/eventlets that support concurrency without true multithreading to take better advantage of CPU capacity.
But you are still more or less limited to one CPU core per Python process. Yes, you can use that core more effectively, but you still can't scale up very effectively.
Should have funded the entire GIL-removal effort by selling carbon credits. Here's an industry waiting to happen: issue carbon credits for optimizing CPU and GPU resource usage in established libraries.
I wonder about the total energy cost of apps like Teams, Slack, Discord, etc. Hundreds of millions of users, an app running constantly in the background. I wouldn't be surprised if the global power consumption on the client side reached a gigawatt. Add the increased wear on the components, the cost of hardware upgrades, etc.
All that to avoid hiring a few developers to make optimized native clients on the most popular platforms. Popular apps and websites should lose or get carbon credits on optimization. What is negligible for a small project becomes important when millions of users get involved, and especially background apps.
> Similarly, workloads where threads frequently access and modify the same objects show reduced improvements or even degradation due to lock contention.
Perhaps I'm stating the obvious, but you deal with this with lock-free data structures, immutable data, siloing data per thread, fine-grained locks, etc.
It'd be nice if Python std lib had more thread safe primitives/structures (compared to something like Java where there's tons of thread safe data structures)
Imo the GIL was used as an excuse for a long time to avoid building those out.
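As a concrete sketch of the "siloing data per thread" idea (the `count_words` example here is hypothetical): each thread mutates only its own dict, and the single-threaded merge at the end is the only shared step, so no locks are needed at all:

```python
import threading

def count_words(chunks, n_threads=4):
    # One private dict per thread: no shared mutable state, so no
    # lock contention while the workers run.
    results = [dict() for _ in range(n_threads)]

    def worker(idx):
        local = results[idx]  # this thread's private silo
        for word in chunks[idx]:
            local[word] = local.get(word, 0) + 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Single-threaded merge at the end; the only "shared" step.
    merged = {}
    for local in results:
        for w, c in local.items():
            merged[w] = merged.get(w, 0) + c
    return merged

chunks = [["a", "b"], ["a"], ["b", "b"], ["a", "c"]]
print(count_words(chunks))  # {'a': 3, 'b': 3, 'c': 1}
```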
> It'd be nice if Python std lib had more thread safe primitives/structures (compared to something like Java where there's tons of thread safe data structures)
Hence basic Python structures under free-threaded Python are all thread-safe, which explains why they are slower than the GIL variant.
Our experience on memory usage, in comparison, has been generally positive.
Previously we had to use ProcessPoolExecutor, which meant maintaining multiple copies of the runtime and shared data in memory and paying high IPC costs. Being able to switch to ThreadPoolExecutor was hugely beneficial in terms of both speed and memory.
It almost feels like programming in a modern (circa 1996) environment like Java.
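The swap is mechanical because ProcessPoolExecutor and ThreadPoolExecutor implement the same `concurrent.futures.Executor` interface; a minimal sketch:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def run(executor_cls, data):
    # Both pool classes share the Executor interface, so switching
    # between processes and threads is a one-argument change.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(pow, data, data))

data = [1, 2, 3, 4]
print(run(ThreadPoolExecutor, data))  # [1, 4, 27, 256]
```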
Swapping ProcessPoolExecutor for ThreadPoolExecutor gives real memory and IPC wins, but it trades process isolation for new failure modes because many C extensions and native libraries still assume the GIL and are not thread safe.
Measure aggressively and test under real concurrency: use tracemalloc to find memory hotspots, py-spy or perf to profile contention, and stress-test C-extension paths so bugs surface in the lab, not in production. Watch per-thread stack overhead and GC behavior, design shared state to be immutable or sharded, keep critical sections tiny, and if process-level isolation is still required, stick with ProcessPoolExecutor or expose large datasets via read-only mmap.
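For the tracemalloc part of that advice, a minimal sketch of finding allocation hotspots (the list comprehension is just a stand-in for a real workload):

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the real workload whose memory you want to attribute.
cache = [str(i) * 100 for i in range(10_000)]

snapshot = tracemalloc.take_snapshot()
# Top allocation sites, grouped by source line.
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```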
From [2603.04782] "Unlocking Python's Cores: Hardware Usage and Energy Implications of Removing the GIL" (2026) https://arxiv.org/abs/2603.04782 :
> Abstract: [...] The results highlight a trade-off. For parallelizable workloads operating on independent data, the free-threaded build reduces execution time by up to 4 times, with a proportional reduction in energy consumption, and effective multi-core utilization, at the cost of an increase in memory usage. In contrast, sequential workloads do not benefit from removing the GIL and instead show a 13-43% increase in energy consumption
Might be worth noting that this seems to be just running some tests using the current implementation, and these are not necessarily general implications of removing the GIL.
5.4: Energy consumption going down because of parallelism over multiple cores seems odd. What were those cores doing before? Better utilization causing some spinlocks to be used less or something?
5.5: Fine-grained lock contention significantly hurts energy consumption.
I'm not sure of the exact relationship, but power consumption increases faster than linearly with clock speed. If you have 4 cores running at the same time, there's more likely to be thermal throttling → lower clock speeds → lower energy consumption.
Greater power draw though; remember that energy is the integral of power over time.
By running more tasks in parallel across different cores they can each run at lower clock speed and potentially still finish before a single core at higher clock speeds can execute them sequentially.
Running a program on 1 core or on N cores ideally does not change the energy: on N cores, the power is N times greater and the time is N times smaller, so the energy is constant.
In reality, the scaling is never perfect, so the energy increases slightly when a program is run on more cores.
Nevertheless, as another poster has already written, if you have a deadline, then you can greatly decrease the power consumption by running on more cores.
To meet the deadline, you must either increase the clock frequency or increase the number of cores. The latter increases the consumed energy only very slightly, while the former increases the energy many times.
So for maximum energy efficiency, you have to first increase the number of cores up to the maximum, while using the lowest clock frequency. Only when this is not enough to reach the desired performance, you increase the clock frequency as little as possible.
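The argument can be made concrete with a toy model. Assume (as a rough approximation, not a measured law) that dynamic power scales like f³, since P ≈ C·V²·f and supply voltage must rise roughly with frequency:

```python
# Toy model of the cores-vs-frequency trade-off described above.
# Assumption: dynamic power per core ~ f^3 (P ~ C * V^2 * f, with V
# rising roughly in proportion to f). Units are arbitrary.
def energy(cores, freq_ghz, seconds):
    power_per_core = freq_ghz ** 3
    return cores * power_per_core * seconds

# Same deadline, same total work: 1 core at 4 GHz vs 4 cores at 1 GHz.
e_fast_core = energy(cores=1, freq_ghz=4.0, seconds=10)
e_wide      = energy(cores=4, freq_ghz=1.0, seconds=10)
print(e_fast_core / e_wide)  # 16.0
```

Under this assumption, meeting the same deadline with N cores at frequency f instead of one core at N·f costs about N² less energy, which is the point above.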
5.4 is the essential reason why multithreading became the main method of increasing CPU performance after 2004. To reach a given level of performance, increasing the number of cores at the same clock frequency needs much less energy than increasing the clock frequency at the same number of cores.
5.5 depends a lot on the implementation used for locks. High energy consumption due to contention normally indicates bad lock implementations.
In the best implementations, there is no actual contention. A waiting core only reads a private cache line, which consumes very little energy, until the thread that held the lock immediately before it modifies that cache line, causing an exit from the waiting loop. In such implementations there is no global lock variable. There is only a queue associated with a resource; threads insert themselves into the queue when they want to use the shared resource, providing to the previous thread the address where it should signal that it has completed its use of the resource. The single shared lock variable is thus replaced with per-thread variables that accomplish its function, without access contention.
While this has been known for several decades, one can still see archaic lock implementations where multiple cores attempt to read or write the same memory locations, which causes data transfers between the caches of various cores, at a very high power consumption.
Moreover, even if you use optimum lock implementations, mutual exclusion is not the best strategy for accessing a shared data resource. Even optimistic access, which is usually called "lock-free", is typically a bad choice.
In my opinion, the best method of cooperation between multiple threads is to use correctly implemented shared buffers or message queues.
By correctly implemented, I mean using neither mutual exclusion nor optimistic access (which may require retries), but dynamic partitioning of the shared buffers/queues. This is done with an atomic fetch-and-add instruction, which ensures that when multiple threads access the shared buffers or queues simultaneously, they access non-overlapping ranges. This is better than mutual exclusion because the threads are never stalled, and better than "lock-free", i.e. optimistic, access because retries are never needed.
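Pure Python has no user-level fetch-and-add, so this sketch stands in for the atomic instruction with a tiny lock around a counter; the point it illustrates is the dynamic partitioning itself: each thread claims a disjoint index range and then works on it without further synchronization or retries.

```python
import threading

def parallel_square(data, n_threads=4, chunk=8):
    out = [0] * len(data)
    next_index = [0]
    claim_lock = threading.Lock()  # stands in for hardware fetch-and-add

    def claim():
        # Atomically claim the next chunk of indices. On hardware this
        # would be a single fetch-and-add instruction, with no waiting.
        with claim_lock:
            start = next_index[0]
            next_index[0] += chunk
        return start

    def worker():
        while True:
            start = claim()
            if start >= len(data):
                return
            # Each thread writes a disjoint slice: no overlap, no retries.
            for i in range(start, min(start + chunk, len(data))):
                out[i] = data[i] * data[i]

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(parallel_square(list(range(10)))[:5])  # [0, 1, 4, 9, 16]
```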
Not by much. The cases where you can replace processes with threads and save memory are rather limited.
Unless you mean you have multiple worker processes (or GIL-free threads).
Basically you avoid locks as much as possible.
I am curious about the choice of the NumPy workload, given its more limited impact on CPython performance.