Hey, I've been following up on Legion and Regent quite a bit, excellent work there.
Do you have a set of benchmarks that others can reimplement to compare the approaches?
I've added dataflow parallelism to my own multithreading runtime[1], but I don't have dataflow-focused benchmarks yet. I could add Cholesky decomposition, but it's quite involved.
I expect the people from TaskFlow[2] and Habanero[3] (via Data-Driven Futures) would be quite interested as well in a common set of dataflow parallelism benchmarks.
By the way, if you haven't read the DaCe paper[4], you absolutely should. It seems the age of dataflow parallelism and properly optimizing for data is coming.
Please correct me if I'm wrong, but I think all of those systems only work inside a single process. Legion/Regent support distributed multi-node, multi-process execution, both on supercomputers and in the cloud.
I've always wondered why there isn't a general-purpose, purely functional language that computes a dependency graph and implicitly parallelizes all operations. For some things, like a map over a list, I understand that the overhead of distributing the work is greater than just using one thread, but for things known at compile time (like dependences between variables), the runtime cost of deciding what to distribute should be zero.
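The idea can be sketched in a few lines. This is a hypothetical illustration, not any shipping system: given a known dependency graph between values, a runtime can automatically run independent computations on separate threads.

```python
# Minimal sketch: schedule a dependency graph of computations, running
# independent tasks in parallel and chaining dependent ones via futures.
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks, deps):
    """tasks: name -> function of its dependences' results.
    deps: name -> ordered list of prerequisite task names."""
    futures = {}
    with ThreadPoolExecutor() as pool:
        def submit(name):
            if name not in futures:
                prereqs = [submit(d) for d in deps.get(name, ())]
                futures[name] = pool.submit(
                    lambda fn=tasks[name], ps=prereqs:
                        fn(*[p.result() for p in ps]))
            return futures[name]
        for name in tasks:
            submit(name)
        return {n: f.result() for n, f in futures.items()}

# "a" and "b" have no mutual dependence, so they may run in parallel;
# "c" waits for both of their results.
out = run_dag(
    {"a": lambda: 1, "b": lambda: 2, "c": lambda a, b: a + b},
    {"c": ["a", "b"]},
)
print(out["c"])  # 3
```

In a purely functional language the `deps` table would fall out of the program text at compile time, which is exactly the appeal: the analysis is free at runtime.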
Prior to working on Legion, I worked on a programming language called Sequoia that matches much of what you are describing [1]. In many ways Sequoia was the spiritual ancestor of Legion/Regent.
The main difference between Legion and TensorFlow is how and when the dataflow graph is constructed. In TensorFlow the graph is constructed lazily (no execution is performed until you've asked for it), it's optimized, and then distributed to processors (GPUs/TPUs) for execution. In Legion, the graph is built, distributed, and executed on the fly. What this means is that Legion can react to things like dynamic control flow (e.g. branches inside of loops) and analyze dependences at runtime to find task parallelism, in a similar way to how your out-of-order CPU extracts instruction level parallelism from a program. Doing things in the TensorFlow model works better when you can see your whole program up front and can "statically" optimize and schedule it because it has lower overheads, but it also has limits to the kinds of programs it can handle; the Legion approach works better when you have dynamic data-dependent behavior in your program and you need to react to it on the fly, but it does have some overhead to the dynamic analysis.
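The "on the fly" approach can be made concrete with a toy sketch. This is purely illustrative and not Legion's real API: each task declares read/write sets, and the runtime discovers parallelism at launch time by tracking the last writer of each region, which works even under dynamic control flow.

```python
# Illustrative sketch (not Legion's actual API): an eager task runtime that
# derives dependences at runtime from declared read/write footprints.
from concurrent.futures import ThreadPoolExecutor

class Runtime:
    def __init__(self):
        self.pool = ThreadPoolExecutor()
        self.last_writer = {}  # region name -> future of its last writing task

    def launch(self, fn, reads=(), writes=()):
        # A new task waits for the last writer of every region it touches;
        # tasks with disjoint footprints run concurrently. (Write-after-read
        # hazards are ignored here to keep the sketch short.)
        waits = [self.last_writer[r] for r in (*reads, *writes)
                 if r in self.last_writer]
        fut = self.pool.submit(
            lambda: ([w.result() for w in waits], fn())[-1])
        for w in writes:
            self.last_writer[w] = fut
        return fut

state = {"x": 0, "y": 0}
rt = Runtime()
rt.launch(lambda: state.update(x=1), writes=["x"])
rt.launch(lambda: state.update(y=2), writes=["y"])   # independent of the first
total = rt.launch(lambda: state["x"] + state["y"], reads=["x", "y"])
print(total.result())  # 3
```

The key contrast with the lazy TensorFlow model is that nothing here requires seeing the whole graph up front; each `launch` is analyzed as it is issued, like instructions entering an out-of-order CPU.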
TL;DR: TF's executor parallelism is too pessimistic to fully exploit the parallelization opportunities in the problem space. Regent is built on top of Legion, a cutting-edge dataflow library, and is designed to provide the guarantees needed to achieve that speedup.
I don't understand why something like this would need a separate language. Switching languages means starting over in many ways with regard to tools, libraries, and semantics. A graph of tasks can be built with a plain C-ABI (cdecl) library.
That's an assertion, but without anything to back it up.
First, threads have been implemented as libraries many times. Second, if checks need to happen, they can happen at debug run time when they can't happen at compile time. I don't see what specifically has to be integrated into a language here that justifies throwing away the enormous amount already built in other languages.
An often-overlooked quality of any new language is the set of things that you can't do in it. Some features can only be accomplished when certain negative guarantees can be made about programs. And it's really hard to implement negative guarantees as a library.
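A concrete (hypothetical) illustration of why such negative guarantees resist library implementation: a library runtime only sees the footprints tasks *declare*, so it cannot prove a task touches nothing else. Both tasks below declare disjoint footprints, so a library-level check would approve running them in parallel, yet `task_b` sneaks in an undeclared read.

```python
# Sketch: a library-level disjointness check passes even though the
# schedule it approves is unsound, because one task has an undeclared
# access the library cannot see. A language can reject this statically.
state = {"x": 0, "y": 0}

def task_a():            # declared footprint: writes {"x"}
    state["x"] = 1

def task_b():            # declared footprint: writes {"y"}
    state["y"] = state["x"] + 1   # undeclared dependence on task_a!

declared = {"task_a": {"writes": {"x"}},
            "task_b": {"writes": {"y"}}}

def footprints_disjoint(a, b):
    return not (declared[a]["writes"] & declared[b]["writes"])

# The check wrongly concludes the tasks are independent:
print(footprints_disjoint("task_a", "task_b"))  # True
```

Ruling out the undeclared access is exactly the negative guarantee ("this task accesses nothing outside its declared regions") that needs language-level enforcement.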
It's worth noting that Regent is the language that implements the Legion programming model. The Legion runtime system is just a C++ library with bindings for C, Fortran, Python, Terra, and Lua. Writing Regent code is much higher productivity than writing to the C++ Legion library directly, but if you want to you can drop down and write your tasks in any of the other languages that Legion supports. You can even mix and match tasks written in different languages.
[1]: https://github.com/mratsim/weave#dataflow-parallelism
[2]: https://github.com/taskflow/taskflow
[3]: https://github.com/habanero-rice/hclib
[4]: https://github.com/spcl/dace, https://arxiv.org/abs/1902.10345
[1]: http://theory.stanford.edu/~aiken/publications/papers/sc06.p...
It's the same reason.
Your dataflow semantics need to be part of the language semantics, otherwise they're bound to be loosely defined and even more loosely enforced.