I love how many Python to native/GPU code projects there are now. It's nice to see a lot of competition in the space. An alternative to this one could be Taichi Lang [0]; it can use your GPU through Vulkan so you don't have to own Nvidia hardware. Numba [1] is another alternative that's very popular. I'm still waiting on a Python project that compiles to pure C (unlike Cython [2], which is hard to port) so you can write homebrew games or other embedded applications.
Does it really matter for performance? I see Python in these kinds of setups as an orchestrator of computing APIs/engines. For example, from Python you instruct an engine to compute the following list, etc. No hard computing happens in Python itself, so performance isn't much of an issue.
CuPy is also great – it makes it trivial to port existing numerical code from NumPy/SciPy to CUDA, or to write code that can run either on CPU or on GPU.
I recently saw a 2-3 orders of magnitude speed-up of some physics code when I got a mid-range NVIDIA card and replaced a few NumPy and SciPy calls with CuPy.
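For the common case it really is close to a drop-in swap. A toy sketch of the pattern (not the actual physics code; the function and array sizes here are made up):

    import numpy as np
    import cupy as cp

    def normalize(xp, a):
        # xp is either the numpy or the cupy module; their APIs are aligned on purpose
        return (a - xp.mean(a)) / xp.std(a)

    a_cpu = np.random.rand(1_000_000)
    a_gpu = cp.asarray(a_cpu)                    # host -> device copy

    print(normalize(np, a_cpu)[:3])              # runs on the CPU
    print(cp.asnumpy(normalize(cp, a_gpu)[:3]))  # runs on the GPU, copied back for printing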
I’m a huge Taichi stan. So much easier and more elegant than Numba. The support for data classes and data_oriented classes is excellent. Being able to define your own memory layouts is extremely cool. Great documentation. Really really recommend!
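For anyone who hasn't tried it, a Taichi kernel looks roughly like this (a minimal sketch assuming a recent taichi install; the field and kernel here are made up):

    import taichi as ti

    ti.init(arch=ti.gpu)  # picks Vulkan/CUDA/Metal if available, falls back to CPU otherwise

    n = 512
    pixels = ti.field(dtype=ti.f32, shape=(n, n))

    @ti.kernel
    def fill(t: ti.f32):
        for i, j in pixels:  # struct-for over the field, parallelized by Taichi
            pixels[i, j] = ti.sin(0.01 * (i + j) + t)

    fill(0.0)
    print(pixels.to_numpy().mean())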
I would like to warn people away from taichi if possible. At least back in 1.7.0 there were some bugs in the code that made it very difficult to work with.
Do you have any more specifics about these limitations? I'm considering trying Taichi for a project because it seems to be GPU vendor agnostic (unlike CuPy).
Python is designed from the ground up to be hostile to efficient compiling. It has really poorly designed internals, and proudly exposes them to application code, in a documented way and everything.
Only a restricted subset of Python is efficiently compilable.
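Numba's nopython mode is a decent illustration of that restricted subset: statically typeable numeric loops over NumPy arrays compile fine, while code that leans on dynamic Python objects doesn't. A rough sketch (the function itself is made up):

    import numpy as np
    from numba import njit

    @njit  # compiles to native code because everything here is typed numeric work
    def rolling_sum(x, window):
        out = np.empty(x.size - window + 1)
        acc = x[:window].sum()
        out[0] = acc
        for i in range(1, out.size):
            acc += x[i + window - 1] - x[i - 1]
            out[i] = acc
        return out

    x = np.random.rand(1_000_000)
    print(rolling_sum(x, 10)[:3])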
I'm happy to have ways to run my python code that will execute much faster on GPUs, but I don't want anything that is tied to a particular GPU family. I'll happily use CUDA if I'm on NVIDIA, but I want something that is also performant on other architectures as well. Otherwise, I'm not going to bother.
I'm not looking for an argument, but my knee jerk reaction to seeing 4 or 5 different answers to the question of getting python to C... Why not just learn C?
I really like how nvidia started doing more normal open source and not locking stuff behind a login to their website. It makes it so much easier now that you can just pip install all the cuda stuff for torch and other libraries without authenticating and downloading from websites and other nonsense. I guess they realized that it was dramatically reducing the engagement with their work. If it’s open source anyway then you should make it as accessible as possible.
I would argue that this isn't "normal open source", though it is indeed not locked behind a login on their website. The license (1) feels very much proprietary, even if the source code is available.
"2.7 You may not use the Software for the purpose of developing competing products or technologies or assist a third party in such activities."
vs
"California’s public policy provides that every contract that restrains anyone from engaging in a lawful profession, trade, or business of any kind is, to that extent, void, except under limited statutory exceptions."
In a similar fashion, you'll see that JAX has frontend code being open-sourced, while device-related code is distributed as binaries. For example, if you're on Google's TPU, you'll see libtpu.so, and on macOS, you'll see pjrt_plugin_metal_1.x.dylib.
The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.
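You can see that split from Python: tracing and lowering to StableHLO happen in the open-source frontend, and only the final compile is handed to the PJRT plugin. A rough sketch (assuming a reasonably recent JAX; the function is just an example):

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sin(x) ** 2 + jnp.cos(x) ** 2

    lowered = jax.jit(f).lower(jnp.ones((8,)))
    print(lowered.as_text())       # StableHLO emitted by the open-source frontend
    compiled = lowered.compile()   # codegen happens inside the PJRT plugin (libtpu.so, Metal, CUDA, ...)
    print(compiled(jnp.ones((8,))))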
That's what open source means: the source code is open for reading. It has nothing to do with licensing. You can have any type of license on top of that, based on your business needs.
There has been a rise of "open"-access and freeware software/services in this space. See Hugging Face and certain models tied to accounts that accept some EULA before downloading model weights, or weird wrapper code by AI library creators that makes it harder to run offline (the Ultralytics library comes to mind, for instance).
I like the value they bring, but the trend goes against the existing paradigm of how the Python ecosystem used to be.
I was playing around with taichi a little bit for a project. Taichi lives in a similar space, but has more than an NVIDIA backend. But its development has stalled, so I’m considering switching to warp now.
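For anyone curious, warp kernels look roughly like this (a minimal sketch, not taken from the project's examples; the device string assumes a CUDA card):

    import numpy as np
    import warp as wp

    wp.init()

    @wp.kernel
    def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
        i = wp.tid()
        y[i] = a * x[i] + y[i]

    n = 1 << 20
    x = wp.array(np.random.rand(n).astype(np.float32), device="cuda")
    y = wp.array(np.zeros(n, dtype=np.float32), device="cuda")
    wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
    print(y.numpy()[:4])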
It’s quite frustrating that there’s seemingly no long-lived framework that allows me to write simple numba-like kernels and try them out in NVIDIA GPUs and Apple GPUs. Even with taichi, the Metal backend was definitely B-tier or lower: Not offering 64 bit ints, and randomly crashing/not compiling stuff.
Here’s hoping that we’ll solve the GPU programming space in the next couple years, but after ~15 years or so of waiting, I’m no longer holding my breath.
This is the library I've always wanted. Look at that Julia set. Gorgeous. Thanks for this. I'm sorry to hear about the dev issues. I wish I could help.
I’ve been in love with Taichi for about a year now. Where’s the news source on development being stalled? It seemed like things were moving along at pace last summer and fall at least if I recall correctly.
There have only been 7 commits to master in the last six months, half of those purely changes to tests or documentation, so it kind of sounds like you're both right.
> Here’s hoping that we’ll solve the GPU programming space in the next couple years, but after ~15 years or so of waiting, I’m no longer holding my breath.
It feels like the ball is entirely in Apple's court. Well-designed and Open Source GPGPU libraries exist, even ones that Apple has supported in the past. Nvidia supports many of them, either through CUDA or as a native driver.
The problem with the GPGPU space is that everything except CUDA is so fractally broken that everyone eventually converges on the NVIDIA stuff that actually works.
yes, the heterogeneous compute frameworks are largely broken, except for OneAPI, which does work, but only on CUDA. SPIR-V works best on CUDA. OpenCL works best on CUDA.
Even once you get past the topline "does it even attempt to support that", you'll find that AMD's runtimes are broken too. Their OpenCL runtime is buggy and has a bunch of paper features which don't work, and a bunch of AMD-specific behavior and bugs that aren't spec-compliant. So basically you have to have an AMD-specific codepath anyway to handle the bugs. Same for SPIR-V: the biggest thing they have working against them is that AMD's Vulkan Compute support is incomplete and buggy too.
If you are going to all that effort anyway, why are you (a) targeting AMD at all, and (b) not just using CUDA in the first place? So everyone writes more CUDA and nothing gets done. Cue some new whippersnapper who thinks they're gonna cure all AMD's software problems in a month: they bash into the brick wall, write a blog post, become an angry forum commenter, rinse and repeat.
And now you have another abandoned cross-platform project that basically only ever supported NVIDIA anyway.
Intel, bless their heart, is actually trying and their stuff largely does just work, supposedly, although I'm trying to get their Linux runtime up and running on a Serpent Canyon NUC with an A770m and am having a hell of a time. But supposedly it does work, especially on Windows (and I may just have to knuckle under and use Windows, or put a PCIe card in a server PC). But they just don't have the marketshare to make it stick.
AMD is stuck in this perpetual cycle of expecting anyone else but themselves to write the software, and then not even providing enough infrastructure to get people to the starting line, and then surprised-pikachu nothing works, and surprise-pikachu they never get any adoption. Why has nvidia done this!?!? /s
The other big exception is Metal, which both works and has an actual userbase. The reason they have Metal support for Cycles and Octane is because they contribute the code; that's really what needs to happen (and I think what Intel is doing - there's just a lot of work to come from zero). But of course Metal is Apple-only, so really ideally you would have a layer that goes over the top...
> yes, the heterogeneous compute frameworks are largely broken, except for OneAPI, which does work, but only on CUDA.
> Intel, bless their heart, is actually trying and their stuff largely does just work, supposedly, .... But they just don't have the marketshare to make it stick.
Isn't oneAPI basically SYCL? And there are different SYCL runtimes that run on Intel, CUDA, and ROCm? So what's lost if you use oneAPI instead of CUDA and run it on Nvidia GPUs? In an ideal world that should work about the same as CUDA, but could also run on other hardware in the future; since the world rarely is ideal, though, I would welcome any insight into this. Is writing oneAPI code mainly for use on Nvidia a bad idea? Or how about AdaptiveCpp (previously hipSYCL/Open SYCL)?
I've been meaning to try some GPGPU programming again (I did some OpenCL 10+ years ago), but the landscape is pretty confusing and I agree that it's very tempting to just pick Cuda and get started with something that works, but if at all possible I would prefer something that is open and not locked to one hardware vendor.
I've dredged through Julia, Numba, JAX, and Futhark looking for a way to get good CPU performance in the absence of a GPU, and I'm not really happy with any of them. Especially given how many of them want you to lug LLVM along.
A recent simulation code, when pushed with gcc openmp-simd, matched the performance of jax.jit on an RTX 4090 while running on a 13900K. This case worked because the overall computation can be structured into pieces that fit in L1/L2 cache, but I had to spend a ton of time writing the C code, whereas jax.jit was too easy.
So I'd still like to see something like this but which really works for CPU as well.
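For the record, the JAX side of that comparison really is about this much code; a toy stencil step as an illustration (not the actual simulation):

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(u, dt):
        # 5-point Laplacian with periodic boundaries
        lap = (jnp.roll(u, 1, 0) + jnp.roll(u, -1, 0) +
               jnp.roll(u, 1, 1) + jnp.roll(u, -1, 1) - 4.0 * u)
        return u + dt * lap

    u = jnp.zeros((512, 512)).at[256, 256].set(1.0)
    for _ in range(100):
        u = step(u, 0.1)
    print(float(u.sum()))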
Agreed, JAX is specialized for GPU computation -- I'd really like similar capabilities with more permissive constructs, maybe even co-effect tagging of pieces of code (which part goes on GPU, which part goes on CPU), etc.
I've thought about extending JAX with custom primitives and a custom lowering process to support constructs which work on CPU (but don't work on GPU) -- but if I did that, and wanted a nice programmable substrate -- I'd need to define my own version of abstract tracing (because necessarily, permissive CPU constructs might imply array type permissiveness like dynamic shapes, etc).
You start heading towards something that looks like Julia -- the problem (for my work) with Julia is that it doesn't support composable transformations like JAX does.
Julia + JAX might be offered as a solution -- but it's quite unsatisfying to me.
What does this mean? I've mainly heard the term "spatial computing" in the context of the Vision Pro release. It doesn't seem like this was intended for AR/VR.
That's the part I don't get. When you are developing AI, how much code are you really running on GPUs? How bad would it be to write it for something else if you could get 10% more compute per dollar?
Those 10% are going to matter when you have an established business case and you can start optimizing. The AI space is not like that at all; nobody cares about losing money 10% faster. You cannot risk a 100% slowdown from running into issues with an exotic platform for a 10% speedup.
[0] https://www.taichi-lang.org/
[1] http://numba.pydata.org/
[2] https://cython.readthedocs.io/en/stable/
In case you haven't tried it yet, Pythran is an interesting one to play with: https://pythran.readthedocs.io
Also, not compiling to C but still to native code, there's Mojo: https://www.modular.com/max/mojo
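The Pythran workflow is that you keep plain NumPy-style Python, add an export comment for the entry point, and compile the module ahead of time. A rough sketch (the function and file name are made up):

    # kernel.py -- compile to a native extension with: pythran kernel.py
    # pythran export smooth(float64[])
    import numpy as np

    def smooth(x):
        # 3-point moving average, written as plain NumPy so CPython can still import it unchanged
        y = np.empty_like(x)
        y[0], y[-1] = x[0], x[-1]
        y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
        return y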
(1) https://github.com/NVIDIA/warp/blob/main/LICENSE.md
"2.7 You may not use the Software for the purpose of developing competing products or technologies or assist a third party in such activities."
vs
"California’s public policy provides that every contract that restrains anyone from engaging in a lawful profession, trade, or business of any kind is, to that extent, void, except under limited statutory exceptions."
https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml...
(The only exception I found: an owner/partner who sold a business may voluntarily agree to a noncompete, which is now federally banned anyway: https://www.ftc.gov/legal-library/browse/rules/noncompete-ru...)
I'm not a lawyer. Any lawyers around? Could the 2nd provision invalidate the 1st, or not?
Most of the magic is behind closed-source components, and it's posted with a fairly restrictive license.
https://github.com/NVIDIA/warp?tab=License-1-ov-file#readme
Looks more "source available" to me.
https://github.com/taichi-dev/taichi
https://devtalk.blender.org/t/was-gpu-support-just-outright-...
https://render.otoy.com/forum/viewtopic.php?f=7&t=75411 ("As of right now, the Vulkan drivers on AMD and Intel are not mature enough to compile (much less ship) Octane for Vulkan")
https://learn.microsoft.com/en-us/windows/win32/direct3darti...
It's basically the software implementation of DirectX.
Architectural elements of _all_ graphics cards.
> The efficiency of executing threads in groups, which is known as warps in NVIDIA and wavefronts in AMD, is crucial for maximizing core utilization.
[0] https://www.xda-developers.com/how-does-a-graphics-card-actu...
What's this community's take on Triton? https://openai.com/index/triton/
Are there better alternatives?
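For context, Triton kernels are still written in Python but in a block-oriented style; the canonical vector-add looks roughly like this (a sketch, needs a CUDA device and a recent triton/torch):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    print(out[:4])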