Interesting talk. He mentions Futhark a few times, but fails to point out that his ideal way of programming is almost 1:1 how it would be done in Futhark.
It is a joke, but an SQL engine can be massively parallel. You just don't know it; it just gives you what you want. And in many ways the operations resemble what you do, for example, in CUDA.
A CUDA backend for DuckDB or Trino would be one of my go-to projects if I were laid off.
What could be good is a relational + array model. I have some ideas at https://tablam.org, and I think building not just the language but also the optimizer in tandem will be very nice.
It solves all the warts of SQL while still being true to its declarative execution: trailing commas, a FROM-first statement order that reads as a composable pipeline, temporary variables for expressions, and intuitive grouping.
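For illustration, here is a minimal sketch of those quality-of-life features through DuckDB's Python API; the table and column names are made up, and this assumes DuckDB's "friendly SQL" extensions (FROM-first, trailing commas, GROUP BY ALL):

    import duckdb

    con = duckdb.connect()
    con.sql("""
        CREATE TABLE events AS
        SELECT * FROM (VALUES (1, 10.0), (1, 5.0), (2, 7.5)) AS t(user_id, amount)
    """)

    con.sql("""
        FROM events                 -- FROM first: reads top-down like a pipeline
        SELECT
            user_id,
            sum(amount) AS total,   -- trailing comma is accepted
        GROUP BY ALL                -- group by every non-aggregated column
        ORDER BY total DESC
    """).show()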
Even in this thread people underestimate how good e.g. DuckDB can be if you swallow its quirks. Yeah, SQL has many problems, but with a slightly extended language with QoL features and seamless parallelism, DuckDB is extremely productive if you want to crunch a bunch of numbers on the order of minutes or hours (not real time).
Sometimes when I have a problem, I just generate a bunch of "possible solutions" with a constraint solver (e.g. MiniZinc), which produces GBs of CSVs describing the candidate solutions, then let DuckDB analyze which ones are suitable. DuckDB is amazing.
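A rough sketch of that workflow; the glob path and the violations/cost columns are hypothetical:

    import duckdb

    # Assume the solver has already dumped candidates into solutions/*.csv,
    # one row per candidate (the column names here are illustrative).
    best = duckdb.sql("""
        SELECT *
        FROM 'solutions/*.csv'    -- DuckDB scans the whole glob in parallel
        WHERE violations = 0      -- keep only feasible candidates
        ORDER BY cost
        LIMIT 10
    """)
    print(best)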
More generally, the key here is that the more magic you want in the execution of your code, the more declarative you want the code to be. And SQL is pretty much the poster child of declarative languages.
Term rewriting languages probably work better at this than I would expect? It is kind of sad how little experience I have built up with that sort of thing. And I think I'm still ahead of a large percentage of developers out there.
Raph and I also talked about this subject here: https://www.popovit.ch/interviews/raph-levien-simd The discussion covers things at a relatively basic level as we wanted it to be accessible to a wide audience. So we explain SIMD vs SIMT, predication, multiversioning, and some more.
Raph is a super nice guy and a pleasure to talk to. I'm glad we have people like him around!
There were a few languages designed specifically for parallel computing spurred by DARPA's High Productivity Computing Systems project. While Fortress is dead, Chapel is still being developed.
Those languages were not effective in practice. The kind of loop parallelism that most people focus on is the least interesting and effective kind outside of niche domains. The value was low.
Hardware architectures like the Tera MTA were much more capable, but almost no one could write effective code for them even though the language was vanilla C++ with a couple of extra features. Then we learned how to build similar software architectures on standard CPUs, and the same problem remained: most people were bad at it.
The common thread in all of this is people. Humans, as a group, are terrible at reasoning about non-trivial parallelism. The tools almost don't matter. Reasoning effectively about parallelism means manipulating a space that is quite evidently beyond most humans' cognitive abilities.
Parallelism was never about the language. Most people can't build the necessary mental model in any language.
This was, I think, the greatest strength of MapReduce. If you could write a basic program, you could understand the map, combine, shuffle, and reduce operations. MR, Hadoop, etc. would take care of recovering from operational failures like disk or network outages through idempotent retries behind the scenes, and programmers could focus on how data was being transformed, joined, serialized, etc.
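To make that concrete, here is a toy single-machine sketch of the programming model (not Hadoop's actual API): user code supplies only the map and reduce functions, and the framework owns the shuffle/grouping between them.

    from collections import defaultdict

    def map_fn(line):
        # Map phase: emit (key, value) pairs for each input record.
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Reduce phase: fold all values observed for one key.
        return word, sum(counts)

    def word_count(lines):
        groups = defaultdict(list)
        # "Shuffle": group emitted values by key (the framework's job).
        for line in lines:
            for key, value in map_fn(line):
                groups[key].append(value)
        return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

    print(word_count(["to be or not to be"]))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]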
To your point, we also didn't need a new language to adopt this paradigm. A library and a running system were enough (though, semantically, it did offer unique language-like characteristics).
Sure, it's a bit antiquated now that we have more sophisticated iterations for the subdomains it was most commonly used for, but it hit a kind of sweet spot between parallelism utility and complexity of knowledge or reasoning required of its users.
That's why programming languages are important for solving this problem.
The syntax and semantics should constrain the kinds of programs that are easy to write in the language to ones that the compiler can figure out how to run in parallel correctly and efficiently.
That's how you end up with something like Erlang or Elixir.
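As a loose illustration in Python (a sketch, not Erlang): the constraint that makes this style parallelizable is that workers share nothing and communicate only through messages.

    from multiprocessing import Process, Queue

    def worker(inbox: Queue, outbox: Queue):
        # Share-nothing worker: its only interface is message passing,
        # the moral equivalent of an Erlang process and its mailbox.
        while (msg := inbox.get()) is not None:   # None acts as a shutdown signal
            outbox.put(msg * msg)

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()
        for i in range(5):
            inbox.put(i)
        inbox.put(None)
        print([outbox.get() for _ in range(5)])   # [0, 1, 4, 9, 16]
        p.join()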
Maybe we can find better abstractions. Software transactional memory seems like a promising candidate, for example. Sawzall/Dremel and SQL seem to also be capable of expressing some interesting things. And, as RoboToaster mentions, in VHDL and Verilog, people have successfully described parallel computations containing billions of concurrent processes, and even gotten them to work properly.
The distinction matters less and less. Inside the GPU there is already plenty of locality to exploit (caches, schedulers, warps). NVLink is a switched memory-access network, so that already gets you some fairly large machines with multiple kinds of locality.
Throwing InfiniBand or IP on top is really structurally more of the same.
It seems like there are two sides to this problem, both of which are hard and go hand in hand. There is the HCI problem of designing abstractions rich enough to handle problems like parsing and scheduling on the GPU. Then there is the sufficiently-smart-compiler problem of lowering those abstractions to the GPU. But of course, there's a limit to how smart a compiler can be, which loops back to the abstraction design.
Overall, it seems to be a really interesting problem!
I think a good parallel language will be one that takes your code written with tasks and channels, understands its logic, and rewrites and compiles it in the most efficient way. I don't feel that I should have to write anything harder than that as a mere human.
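A hedged sketch of what such "tasks and channels" source code might look like (plain Python asyncio here; the compiler that rewrites it optimally is the hypothetical part):

    import asyncio

    async def producer(ch: asyncio.Queue):
        # A task that feeds work into a channel.
        for i in range(5):
            await ch.put(i)
        await ch.put(None)                        # end-of-stream marker

    async def consumer(ch: asyncio.Queue):
        # A task that drains the channel; a smart compiler would be free
        # to fuse, vectorize, or distribute stages like these.
        while (item := await ch.get()) is not None:
            print(item * 2)

    async def main():
        ch = asyncio.Queue()
        await asyncio.gather(producer(ch), consumer(ch))

    asyncio.run(main())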
A lower-level programming language, which is either object-oriented like Python, or where, after compilation, a real-time system transposition would assemble the microarchitecture onto an x86 chip.
His example is:

It would be written in Futhark something like this:

I haven't studied it in depth, but it's pretty readable.
The example you showed is very much how I think about PRQL pipelines. Syntax is slightly different but semantics are very similar.
At first I thought that PRQL doesn't have scan, but actually loop fulfills the same function. I'm going to look more into comparing those.
• Datalog is much, much better on these axes.
• Tutorial D is also better than SQL.
Chapel definitely can target a single GPU.
Going the other direction, making channel runtimes run SIMD, is trivial.
Disclaimer: I have not watched the video yet.