The current issue the JVM has is that all threads have a corresponding operating system thread. That, unfortunately, is really heavy memory-wise and on the OS context switcher.
Loom allows Java to have threads as lightweight as a goroutine. It's going to change the way everything works. You might still have a dedicated CPU-bound thread pool (the common fork-join pool exists and probably should be used for that). But otherwise, you'll just spin up virtual threads and do away with all the consternation over how to manage thread pools and what a thread pool should be used for.
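Roughly, the model looks like this. A minimal sketch using the API shape from recent Loom preview builds (the names could still change before release; `serve` and `handle` are made-up stand-ins):

```java
import java.net.Socket;
import java.util.List;
import java.util.concurrent.Executors;

// Sketch of the thread-per-task style described above, using the executor
// factory from recent Loom preview builds. No hand-tuned pool for I/O.
class VirtualThreadsSketch {
    static void serve(List<Socket> connections) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (Socket conn : connections) {
                executor.submit(() -> handle(conn)); // handle() may block freely
            }
        } // close() implicitly waits for the submitted tasks to finish
    }

    static Object handle(Socket conn) {
        // blocking reads/writes would go here; each one parks only the
        // virtual thread, not the carrier OS thread
        return null;
    }
}
```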
> That, unfortunately, is really heavy memory-wise and on the OS context switcher.
So, there was a time when a broad statement like that was pretty solid. These days, I don't think so. The default stack size (on 64-bit Linux) is 1MB, and you can manipulate that to be smaller if you want. That's also virtual memory; the actual memory usage depends on your application. There was a time when 1MB was a lot of memory, but these days, for a lot of contexts, it's kind of peanuts unless you have literally millions of threads (and even then...). Yes, you can be more memory efficient, but it wouldn't necessarily help that much. Similarly, at least in the case of blocking IO (which is normally why you'd have so many threads), the overhead on the OS context switcher isn't necessarily that significant, as most threads will be blocked at any given time, and you're already going to have a context switch from the kernel to userspace anyway. Depending on circumstance, using polling IO models can lead to more context switching, not less.
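(For what it's worth, the stack can be shrunk globally with the -Xss JVM flag, or per thread via the long Thread constructor, which treats the size as a hint the VM may round up or ignore. A minimal sketch:)

```java
// Sketch: requesting a smaller stack for a single thread. Per the Thread
// javadoc, the stackSize argument is a suggestion the VM is free to
// round up or ignore entirely.
public class SmallStack {
    public static void main(String[] args) {
        Runnable task = () -> System.out.println("hello from a small stack");
        Thread worker = new Thread(null /* group */, task,
                "small-stack-worker", 256 * 1024 /* requested stack bytes */);
        worker.start();
    }
}
```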
There are certainly circumstances where threads significantly impede your application's efficiency, but if you are really in that situation you likely already know it. In the broad set of use cases though, switching from a thread-based concurrency model to something else isn't going to be the big win people think it will be.
> So, there was a time when a broad statement like that was pretty solid.
That time is approaching 20 years in the past at this point, too. Native threads haven't been "expensive" for a very, very long time now.
Maybe if you're in the camp of disabling overcommit it matters, but otherwise the application of green threads is definitely a specialized niche, not generally useful.
> In the broad set of use cases though, switching from a thread-based concurrency model to something else isn't going to be the big win people think it will be.
I'd go even further and say it'll be a net loss in most cases, especially with modern complications like heterogeneous compute. If your use case is specifically spinning up thousands of threads for IO (aka you're a server and nothing else), then sure. But if you aren't, there's no win here, just complications (like the times you need native thread isolation for FFI reasons, like using OpenGL).
Your words might be true, but the world jumped on the async wagon a long time ago and is going all in. Nobody likes threads; everyone wants lightweight threads. Emulating lightweight threads with promises (optionally hidden behind async/await transformations) is very popular. So demand for this feature is here.
I don't know why; I, personally, never needed that feature, and good old threads were always enough for me. It's weird for me to watch non-JDBC drivers with async interfaces, when it was common knowledge that a JDBC data source should use something like 10-20 threads maximum (depending on DB CPU count), and that anything more is a sign of bad database design. And running 10-20 threads, obviously, is not an issue.
But demand is here. And lightweight threads are probably a better approach than async/await transformations.
You are ignoring the downside to green threads, which is that they're cooperative. If a thread doesn't yield control back to the event loop, then the real OS thread backing the loop is now stuck.
Which leads to dirty things like inserting sleep 0 at the top of loops, and dealing with really unbalanced scheduling when threads don't hit yields often enough. Plus, with Loom it might not be obvious that some function is a yield point, since it's meant to be transparent; so if you grab a lock and yield, you make everyone wait until you're scheduled again.
Green threads are great! I love them, and they're the only real solution to really concurrent, IO-heavy workloads, but they're not a panacea, and they trade one kind of discipline for another.
Which is why the advice would be "Don't use virtual threads for CPU work".
It just so happens that a large number of JVM users are working with IO bound problems. Once you start talking about CPU bound problems the JVM tends not to be the thing most people reach for.
Loom doesn't remove the CPU bound solution by adding the IO solution. Instead, it adds a good IO solution and keeps the old CPU solution when needed.
In fact, there's already a really good pool in the JVM for common CPU bound tasks: `ForkJoinPool.commonPool()`.
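A trivial sketch of pushing CPU-bound work onto it (parallel streams use this same pool under the hood):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

// Sketch: submitting a CPU-bound computation to the common fork-join
// pool, which is sized to the number of available processors by default.
public class CommonPoolSketch {
    public static void main(String[] args) {
        long sum = ForkJoinPool.commonPool()
                .submit(() -> LongStream.rangeClosed(1, 1_000_000L)
                        .parallel().sum())
                .join();
        System.out.println(sum);
    }
}
```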
Sleep 0 sounds like quite a hack; Go has the neater https://pkg.go.dev/runtime#Gosched instead, and I assume there will be a Java equivalent as well. And if most stdlib methods and all blocking methods call it, it's going to be pretty difficult to hang a green thread.
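Java's long-standing equivalent is Thread.yield(), and my understanding is that on a virtual thread a yield hands the carrier back to the scheduler. A hypothetical busy loop (the arithmetic is just stand-in work, and Thread.startVirtualThread is the Loom preview API):

```java
// Sketch: a CPU-heavy loop on a virtual thread that periodically calls
// Thread.yield(), in the same spirit as Go's runtime.Gosched().
Thread.startVirtualThread(() -> {
    long acc = 0;
    for (long i = 0; i < 1_000_000_000L; i++) {
        acc += i * i;                           // stand-in CPU-bound work
        if (i % 1_000_000 == 0) Thread.yield(); // let other virtual threads run
    }
    System.out.println(acc);
});
```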
FWIW, while you are probably correct in the context of Loom (a specific implementation that I honestly haven't looked at much), you shouldn't generalize to "green threads" of all forms: not only can you totally implement this well, Erlang actually does. Since you are working with bytecode and a JIT anyway, you can instrument the code to check occasionally whether it has been asked to preempt (I believe Erlang does this at every potentially-backward jump, which is sufficient to guarantee that even a broken loop can be preempted).
Agreed, but you have other single-threaded server languages like NodeJS which have the same problem (a new request can only be handled if the current request gives up control, usually waiting for IO) and people have figured out how to handle it.
I see Project Loom as really providing all the benefits of single threaded languages like Node (i.e. tons of scalability), but with an easier programming model that threads provide as opposed to using async/await.
When you have a runtime, you have proper information about whether work is being done on a given virtual thread. So in the case of Loom, AFAIK any blocking call will turn into a non-blocking one auto-magically (other than FFI, but that is very rare in Java), since the JVM is free to wait on it asynchronously behind the scenes and do some other work in the meantime.
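The now-classic demo of that idea: park ten thousand "blocking" calls at once and let a handful of carrier threads absorb them (a sketch using the preview API):

```java
import java.util.concurrent.Executors;

// Sketch: 10,000 virtual threads all "blocked" in sleep at the same time.
// Each sleep parks only the virtual thread; the JVM multiplexes them over
// a small pool of carrier OS threads.
public class ParkingDemo {
    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    Thread.sleep(1_000); // parks the virtual thread only
                    return null;
                });
            }
        } // waits for all tasks before exiting
    }
}
```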
> Are we coming full circle going back to a variant of the original Java green threads?
There are basically two kinds of green threads:
(1) N:1, where one OS thread hosts all the application threads, and
(2) M:N, where M application threads are hosted on N OS threads.
Original Java (and Ruby, and lots of other systems before every microcomputer was a multicore parallel system) green threads were N:1, which provides concurrency but not parallelism; that's fine when your underlying system can't do real parallelism anyway.
Wanting to take advantage of multicore systems (at least, in the Ruby case, for underlying native code) drove a transition to native threads (which you could call an N:N threading model, as application and OS threads are identical.)
But this limits the level of concurrency to the level of parallelism, which can be a regression compared to N:1 models for applications where the level of concurrency that is useful is greater than the level of parallelism available.
What lots of newer systems are driving toward, to solve that, are M:N models, which can leverage all available parallelism but also provide a higher degree of concurrency.
Longer answer: devs back in the day couldn't really grok the difference between green and real threads. Java made its bones as an enterprise language, which can have smart programmers, but they will conversely not be closer-to-the-metal knowledge-wise. Too many devs back in the day expected a Java thread to be a real thread, so Java re-engineered to accommodate this.
I think the JDK/JVM teams also viewed it as a maturation of the JVM to be using OS resources so directly across platforms, rather than "hacking" it with green threads.
These days, our high-performance fanciness means devs are demanding green-thread analogues, and Go/Elixir/others seem superior because of those.
So to remain competitive in the marketplace, Java now needs threads that aren't threads even though Java used to have threads that weren't threads.
The new Loom threads will be much lighter weight than the original Java green threads. Further, the entire IO infrastructure of the JVM is being reworked for Loom to make sure the OS doesn't block the VM's thread. What's more, Loom does M:N threading.
Not quite. The original green threads were seen as more of a hack until Solaris supported true threads. Green threads could only support one CPU core, and so without a major redesign, it was a dead end.
I have discovered ReactiveX for Java, and Reactor in particular.

I am working with Kafka and MongoDB, and it is normal for my app to have a million in-flight transactions at various stages of completion.

In the past this required a lot of planning (and a lot of code), but Reactor lets me build these processes as pipelines with whatever concurrency or scheduler I desire, at any stage of the processing.

We are even doing tricks like merging unrelated queries to MongoDB, so that sometimes thousands of queries of the same shape are executed together (one query with a huge in(), or one bulk write rather than separate ones).

This improves our throughput by orders of magnitude, with the pipeline pulling millions of documents per second from the database.
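(The merging trick presumably looks something like Reactor's buffering operators; in this rough sketch, incomingQueries(), runBatch(), deliver(), and the Query type are hypothetical placeholders:)

```java
import java.time.Duration;
import reactor.core.publisher.Flux;

// Rough shape of the query-merging trick: collect up to 1,000 queries or
// whatever arrives within 5 ms, then execute each batch as one bulk call
// (e.g. a single find() with a large in() clause).
Flux<Query> queries = incomingQueries();           // hypothetical source
queries.bufferTimeout(1_000, Duration.ofMillis(5)) // batch by size or time
       .flatMap(batch -> runBatch(batch))          // hypothetical bulk executor
       .subscribe(result -> deliver(result));      // hypothetical sink
```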
I just don't see how Loom helps.
Loom could help if you had blocking APIs to start with, but you get much better results if you just resolve to use async, non-blocking code wrapped in ReactiveX.
Loom will help folks who prefer writing straightforward Java code instead of some random reactive library with obscure exception handling and poor to impossible debuggability.
Now, I get that it is hard for many folks to understand that part. Just like at my workplace, people think it is impossible to write a microservice without Spring Boot.
> Loom could help if you had blocking APIs to start with, but you get much better results if you just resolve to use async, non-blocking code wrapped in ReactiveX.
There might be billions of lines of legacy code that would adapt to Loom with minimal changes but would be impossible to turn into ReactiveX etc. without enormous investment and risk. Your ideas are rather simplistic for the real world.
Yup, Loom will simplify the producer-consumer pattern on I/O operations a lot. With virtual threads, it's basically free to block on consumer threads, so you'd need only one bounded pool for the consumers.

Currently, for efficiency, you need at least two pools: one small bounded pool for dequeuing the requests and creating the IO operations, and one unbounded pool for actually executing the IO operations.
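Roughly, the before/after might look like this (queue, Request, and performIo are hypothetical, and the virtual-thread factory is the Loom preview API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Today (sketch): a small pool to dequeue/dispatch, plus a big pool that
// is allowed to block on the actual I/O.
ExecutorService dispatch  = Executors.newFixedThreadPool(4);
ExecutorService ioWorkers = Executors.newCachedThreadPool(); // effectively unbounded

// With Loom (sketch): one bounded set of virtual-thread consumers, where
// blocking on the queue and on the I/O itself is cheap.
ExecutorService consumers = Executors.newVirtualThreadPerTaskExecutor();
for (int i = 0; i < 16; i++) {                // 16 consumers, picked arbitrarily
    consumers.submit(() -> {
        while (true) {
            Request request = queue.take();   // blocking dequeue is fine here
            performIo(request);               // blocking I/O is fine here too
        }
    });
}
```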
For the team that I am in, I can see a huge productivity boost if my teammates can write in direct style instead of wrapping their heads around monads.
I've not looked into the goroutine implementation, so I couldn't tell you how it compares to what I've read loom is doing.
Loom is looking to have some extremely compact stacks, which means each new "virtual thread" (as they are calling them) will end up having mere bytes' worth of memory allocated.
Another thing coming with loom that go lacks is "structured concurrency". It's the notion that you might have a group of tasks that need to finish before moving on from a method (rather than needing to worry about firing and forgetting causing odd things to happen at odd times).
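The API being incubated for this looked roughly like the following at the time of writing (the package and names have been shifting between builds, and fetchUser/fetchOrder/render/id are hypothetical):

```java
import java.util.concurrent.Future;
import jdk.incubator.concurrent.StructuredTaskScope;

// Sketch: fork two subtasks, wait for both, and fail/cancel as a unit.
// Nothing is left running once the try block exits.
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    Future<User>  user  = scope.fork(() -> fetchUser(id));   // hypothetical
    Future<Order> order = scope.fork(() -> fetchOrder(id));  // hypothetical
    scope.join();          // block until both forks complete
    scope.throwIfFailed(); // rethrow the first failure, after cancelling the other
    render(user.resultNow(), order.resultNow());
}
```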
Unlike goroutines, Loom virtual threads are not preempted by the scheduler. I believe you may be able to explicitly preempt a virtual thread, but the last time I checked it was not part of the public API.
The biggest difference is probably that the JVM will support both OS and lightweight threads. That's really useful for certain things, like talking to the GPU from a single-thread context.
Wouldn't any Linux/NPTL thread require at least the register state of the entire x86 (or ARM) CPU?
I don't think goroutines need all that information. The Go compiler knows, at every switch point, whether "int foobar" is currently held in "rbx" or already saved on the stack; when it's on the stack, rbx doesn't need to be saved.
------
Linux/NPTL threads don't know when they are interrupted, so all register state (including AVX-512 state, if those registers are being used) needs to be saved. The AVX-512 registers alone are 2kB (32 registers x 64 bytes).

Even if AVX-512 isn't being used by a thread (Linux detects that all the AVX-512 registers are zero), RAX through R15 is 128 bytes, plus the SSE registers (another 128 bytes), so ~256 bytes of space that goroutines don't need to save. Plus whatever other process-specific information needs to be saved off (CPU time and other such process/thread details that Linux needs in order to decide which threads to run next).
The author mentions Scala. Both ZIO[1] and Cats-Effect[2] provide fibers (coroutines) over these specific thread-pool designs today, without the need for Project Loom, and give the user the ability to select the pool type to use without explicit reference. They are unusable from Java, sadly, as the schedulers, ExecutionContexts, and runtime are implicitly provided in sealed companion objects and are therefore private and inaccessible to Java code, even when compiling with ScalaThenJava. Basically, you cannot run an IO from Java code.

You can expose a method on the Scala side to enter the IO world that will take your arguments and run them in the IO environment, returning a result to you or notifying some Java class via Observer/Observable. This can, of course, take Java lambdas and datatypes, thus keeping your business code in Java should you so desire. It's clunky, though, and I wish Java had easy IO primitives like Scala's.

1. https://github.com/zio/zio
2. https://typelevel.org/cats-effect/versions
I'm wary of unbounded thread pools. Production has a funny way of showing that threads always consume resources. A fun example is file descriptors. An unexpected database reboot is often a short outage, but it's crazy how quickly unbounded thread pools can amplify errors and delay recovery.
Anyway, they have their place, but if you've got a fancy chain of micro services calling out to wherever, think hard before putting those calls in an unbounded thread pool.
And you should be wary! Prefer instead a bounded thread pool with a bounded queue of tasks waiting for service, and also decide explicitly what should happen when the queue fills up or wait times become too high (whatever "too high" means for the application).
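Concretely, that means reaching for the full ThreadPoolExecutor constructor rather than the Executors one-liners; one reasonable shape:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: bounded workers, bounded backlog, and an explicit saturation
// policy. CallerRunsPolicy pushes overflow back onto the submitting
// thread, which slows producers down instead of dropping work silently.
ExecutorService pool = new ThreadPoolExecutor(
        8, 8,                              // fixed pool of 8 workers
        0L, TimeUnit.MILLISECONDS,         // no idle timeout for core threads
        new ArrayBlockingQueue<>(1_000),   // bounded queue of waiting tasks
        new ThreadPoolExecutor.CallerRunsPolicy());
```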
Unbounded thread pools are bad, bounded thread pool executors with unbounded work queues are bad, and bounded thread pools with bounded queues, FIFO policies, and silent drops are also bad. There are many bad ways to do this.
Another tip: if you have a dynamically-sized thread pool, make it use a minimum of two threads. Otherwise developers will get used to guaranteed serialization of tasks, and you'll never be able to change it.
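In ThreadPoolExecutor terms, that's just a core-size floor of two on an otherwise dynamic pool, e.g.:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch: a dynamically-sized pool with a floor of two threads, so callers
// can never come to depend on tasks executing strictly one at a time.
ExecutorService pool = new ThreadPoolExecutor(
        2, 8,                             // at least 2, at most 8 workers
        60L, TimeUnit.SECONDS,            // reclaim extra workers when idle
        new ArrayBlockingQueue<>(500));   // bounded backlog
```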
This seems like good advice in general. Is any of it really specific to the JVM? If I was doing thread pooling with CPU and IO bound tasks, I would approach threading in a similar way in C++.
It'll depend on whether your language has either coroutines or lightweight threads.
Threadpooling only matters if you have neither of those things.
Otherwise, you should be using one or the other over a thread pool. You might still spin up a threadpool for CPU bound operations, but you wouldn't have one dedicated to IO.
As of C++20, there are coroutines, which you should be looking at (IMO): https://en.cppreference.com/w/cpp/language/coroutines
Threadpools are probably better for CPU-bound tasks (or CPU-ish-bound tasks, like RAM-bound ones) without any I/O.

Coroutines/goroutines and the like are probably better for I/O-bound tasks, where the CPU effort of task-switching is significant.
--------
For example: Matrix Multiplication is better with a Threadpool. Handling 1000 simultaneous connections when you get Slashdotted (or "Hacker News hug of death") is better solved with coroutines.
If your app is fully non-blocking, doesn't it make sense to just do everything on the one pool, CPU-bound tasks and IO polling alike?
Rather than passing messages between threads.
"Fully non-blocking" means "does no work". Ignoring the process' spawning thread, if your app performs CPU-bound tasks on a bounded thread pool, you will be leaving I/O throughput on the table as the number of tasks increases, since I/O-bound tasks will block on waiting for a thread.
> you're almost always going to have some sort of singleton object somewhere in your application which just has these three pools, pre-configured for use
I'm bemused by this statement, and I can't figure out whether this is an assertion rooted in supreme confidence, or just idle, wishful thinking.
That being said, giving threading advice in a virtualized and containerized world is tricky. And while these three categories seem sensible, mapping the functions of any non-trivial system onto them is going to be difficult, unless the system was specifically designed around it.
With Python, at first I was scared of the GIL making everything single-threaded; now I'm used to it and it works great. Thousands of threads used to be normal for my old Java projects, but that seems crazy to me now.
> The default stack size (on 64-bit Linux) is 1MB

The default thread stack size is 8 or 10 MB on most Linux distributions. The exception is Alpine, where it's below 1 MB.

Granted, there are scenarios where you want 100,000 "threads of execution," and that is clearly going to be impractical with system threads.

But if you're worried about the overhead of your pool of 50 threads, stop it.
> Are we coming full circle going back to a variant of the original Java green threads?

Same concept, very different implementation.
> I'm wary of unbounded thread pools.

Care to elaborate, please? It seems like the author is recommending unbounded thread pools with bounded queues for blocking IO. Isn't that pretty similar?
> As of C++20, there are coroutines, which you should be looking at (IMO).

Ha! Maybe in 20 years. Sadly, I'm still writing new code targeting C++98 on one project. The most current project I'm a part of is on C++11.
> This seems like good advice in general. Is any of it really specific to the JVM?

Not for languages with go/coroutines (e.g. Go, Clojure, Crystal), as those were designed specifically to help with the thread-per-IO constraint.