I love how many Python to native/GPU code projects there are now. It's nice to see a lot of competition in the space. An alternative to this one could be Taichi Lang [0]; it can use your GPU through Vulkan so you don't have to own Nvidia hardware. Numba [1] is another alternative that's very popular. I'm still waiting on a Python project that compiles to pure C (unlike Cython [2], which is hard to port) so you can write homebrew games or other embedded applications.
Does it really matter for performance? I see Python in these kinds of setups as an orchestrator of computing APIs/engines. For example, from Python you instruct an engine to compute the following list, etc. No hard computing happens in Python itself, so performance isn't much of an issue.
CuPy is also great – it makes it trivial to port existing numerical code from NumPy/SciPy to CUDA, or to write code that can run either on CPU or on GPU.
I recently saw a 2-3 orders of magnitude speed-up of some physics code when I got a mid-range NVIDIA card and replaced a few NumPy and SciPy calls with CuPy.
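For the common case it really is close to a drop-in swap. A toy sketch of the pattern (not the actual physics code; the function and array sizes here are made up):

    import numpy as np
    import cupy as cp

    def normalize(xp, a):
        # xp is either the numpy or the cupy module; their APIs are aligned on purpose
        return (a - xp.mean(a)) / xp.std(a)

    a_cpu = np.random.rand(1_000_000)
    a_gpu = cp.asarray(a_cpu)                    # host -> device copy

    print(normalize(np, a_cpu)[:3])              # runs on the CPU
    print(cp.asnumpy(normalize(cp, a_gpu)[:3]))  # runs on the GPU, copied back for printing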
I’m a huge Taichi stan. So much easier and more elegant than Numba. The support for data classes and data_oriented classes is excellent. Being able to define your own memory layouts is extremely cool. Great documentation. Really really recommend!
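For anyone who hasn't tried it, a Taichi kernel looks roughly like this (a minimal sketch assuming a recent taichi install; the field and kernel here are made up):

    import taichi as ti

    ti.init(arch=ti.gpu)  # picks Vulkan/CUDA/Metal if available, falls back to CPU otherwise

    n = 512
    pixels = ti.field(dtype=ti.f32, shape=(n, n))

    @ti.kernel
    def fill(t: ti.f32):
        for i, j in pixels:  # struct-for over the field, parallelized by Taichi
            pixels[i, j] = ti.sin(0.01 * (i + j) + t)

    fill(0.0)
    print(pixels.to_numpy().mean())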
I would like to warn people away from taichi if possible. At least back in 1.7.0 there were some bugs in the code that made it very difficult to work with.
Do you have any more specifics about these limitations? I'm considering trying Taichi for a project because it seems to be GPU vendor agnostic (unlike CuPy).
Python is designed from the ground up to be hostile to efficient compiling. It has really poorly designed internals, and proudly exposes them to application code, in a documented way and everything.
Only a restricted subset of Python is efficiently compilable.
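Numba's nopython mode is a decent illustration of that restricted subset: statically typeable numeric loops over NumPy arrays compile fine, while code that leans on dynamic Python objects doesn't. A rough sketch (the function itself is made up):

    import numpy as np
    from numba import njit

    @njit  # compiles to native code because everything here is typed numeric work
    def rolling_sum(x, window):
        out = np.empty(x.size - window + 1)
        acc = x[:window].sum()
        out[0] = acc
        for i in range(1, out.size):
            acc += x[i + window - 1] - x[i - 1]
            out[i] = acc
        return out

    x = np.random.rand(1_000_000)
    print(rolling_sum(x, 10)[:3])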
I'm happy to have ways to run my python code that will execute much faster on GPUs, but I don't want anything that is tied to a particular GPU family. I'll happily use CUDA if I'm on NVIDIA, but I want something that is also performant on other architectures as well. Otherwise, I'm not going to bother.
I'm not looking for an argument, but my knee jerk reaction to seeing 4 or 5 different answers to the question of getting python to C... Why not just learn C?
I really like how nvidia started doing more normal open source and not locking stuff behind a login to their website. It makes it so much easier now that you can just pip install all the cuda stuff for torch and other libraries without authenticating and downloading from websites and other nonsense. I guess they realized that it was dramatically reducing the engagement with their work. If it’s open source anyway then you should make it as accessible as possible.
I would argue that this isn't "normal open source", though it is indeed not locked behind a login on their website. The license (1) feels very much proprietary, even if the source code is available.
"2.7 You may not use the Software for the purpose of developing competing products or technologies or assist a third party in such activities."
vs
"California’s public policy provides that every contract that restrains anyone from engaging in a lawful profession, trade, or business of any kind is, to that extent, void, except under limited statutory exceptions."
In a similar fashion, you'll see that JAX has frontend code being open-sourced, while device-related code is distributed as binaries. For example, if you're on Google's TPU, you'll see libtpu.so, and on macOS, you'll see pjrt_plugin_metal_1.x.dylib.
The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.
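You can see that split from Python: tracing and lowering to StableHLO happen in the open-source frontend, and only the final compile is handed to the PJRT plugin. A rough sketch (assuming a reasonably recent JAX; the function is just an example):

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sin(x) ** 2 + jnp.cos(x) ** 2

    lowered = jax.jit(f).lower(jnp.ones((8,)))
    print(lowered.as_text())       # StableHLO emitted by the open-source frontend
    compiled = lowered.compile()   # codegen happens inside the PJRT plugin (libtpu.so, Metal, CUDA, ...)
    print(compiled(jnp.ones((8,))))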
That's what open source means: the source code is open for reading. It has nothing to do with licensing. You can have any type of license on top of that, based on your business needs.
There has been a rise of "open"-access and freeware software/services in this space. See Hugging Face and certain models tied to accounts that accept some EULA before downloading model weights, or weird wrapper code by AI library creators that makes it harder to run offline (the Ultralytics library comes to mind, for instance).
I like the value they bring, but the trend goes against the existing paradigm of how the Python ecosystem used to be.
I was playing around with taichi a little bit for a project. Taichi lives in a similar space, but has more than an NVIDIA backend. But its development has stalled, so I’m considering switching to warp now.
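For anyone curious, warp kernels look roughly like this (a minimal sketch, not taken from the project's examples; the device string assumes a CUDA card):

    import numpy as np
    import warp as wp

    wp.init()

    @wp.kernel
    def saxpy(x: wp.array(dtype=float), y: wp.array(dtype=float), a: float):
        i = wp.tid()
        y[i] = a * x[i] + y[i]

    n = 1 << 20
    x = wp.array(np.random.rand(n).astype(np.float32), device="cuda")
    y = wp.array(np.zeros(n, dtype=np.float32), device="cuda")
    wp.launch(saxpy, dim=n, inputs=[x, y, 2.0])
    print(y.numpy()[:4])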
It’s quite frustrating that there’s seemingly no long-lived framework that allows me to write simple numba-like kernels and try them out in NVIDIA GPUs and Apple GPUs. Even with taichi, the Metal backend was definitely B-tier or lower: Not offering 64 bit ints, and randomly crashing/not compiling stuff.
Here’s hoping that we’ll solve the GPU programming space in the next couple years, but after ~15 years or so of waiting, I’m no longer holding my breath.
This is the library I've always wanted. Look at that Julia set. Gorgeous. Thanks for this. I'm sorry to hear about the dev issues. I wish I could help.
I’ve been in love with Taichi for about a year now. Where’s the news source on development being stalled? It seemed like things were moving along at pace last summer and fall at least if I recall correctly.
There have only been 7 commits to master in the last six months, half of those purely changes to tests or documentation, so it kind of sounds like you're both right.
> Here’s hoping that we’ll solve the GPU programming space in the next couple years, but after ~15 years or so of waiting, I’m no longer holding my breath.
It feels like the ball is entirely in Apple's court. Well-designed and Open Source GPGPU libraries exist, even ones that Apple has supported in the past. Nvidia supports many of them, either through CUDA or as a native driver.
The problem with the GPGPU space is that everything except CUDA is so fractally broken that everyone eventually converges on the NVIDIA stuff that actually works.
yes, the heterogeneous compute frameworks are largely broken, except for OneAPI, which does work, but only on CUDA. SPIR-V works best on CUDA. OpenCL works best on CUDA.
Even once you get past the topline "does it even attempt to support that", you'll find that AMD's runtimes are broken too. Their OpenCL runtime is buggy and has a bunch of paper features which don't work, and a bunch of AMD-specific behavior and bugs that aren't spec-compliant. So basically you have to have an AMD-specific codepath anyway to handle the bugs. Same for SPIR-V: the biggest thing they have working against them is that AMD's Vulkan Compute support is incomplete and buggy too.
If you are going to all that effort anyway, why are you (a) targeting AMD at all, and (b) not just using CUDA in the first place? So everyone writes more CUDA and nothing gets done. Cue some new whippersnapper who thinks they're gonna cure all AMD's software problems in a month: they bash into the brick wall, write a blog post, become an angry forum commenter, rinse and repeat.
And now you have another abandoned cross-platform project that basically only ever supported NVIDIA anyway.
Intel, bless their heart, is actually trying and their stuff largely does just work, supposedly, although I'm trying to get their Linux runtime up and running on a Serpent Canyon NUC with an A770m and am having a hell of a time. But supposedly it does work, especially on Windows (and I may just have to knuckle under and use Windows, or put a PCIe card in a server PC). But they just don't have the marketshare to make it stick.
AMD is stuck in this perpetual cycle of expecting anyone else but themselves to write the software, and then not even providing enough infrastructure to get people to the starting line, and then surprised-pikachu nothing works, and surprise-pikachu they never get any adoption. Why has nvidia done this!?!? /s
The other big exception is Metal, which both works and has an actual userbase. The reason they have Metal support for Cycles and Octane is because they contribute the code; that's really what needs to happen (and I think what Intel is doing - there's just a lot of work to come from zero). But of course Metal is Apple-only, so really ideally you would have a layer that goes over the top...
> yes, the heterogeneous compute frameworks are largely broken, except for OneAPI, which does work, but only on CUDA.
> Intel, bless their heart, is actually trying and their stuff largely does just work, supposedly, .... But they just don't have the marketshare to make it stick.
Isn't oneAPI basically SYCL? And there are different SYCL runtimes that run on Intel, CUDA, and ROCm? So what's lost if you use oneAPI instead of CUDA and run it on Nvidia GPUs? In an ideal world that should work about the same as CUDA, but could also run on other hardware in the future; since the world rarely is ideal, though, I would welcome any insight into this. Is writing oneAPI code mainly for use on Nvidia a bad idea? Or how about AdaptiveCpp (previously hipSYCL/Open SYCL)?
I've been meaning to try some GPGPU programming again (I did some OpenCL 10+ years ago), but the landscape is pretty confusing and I agree that it's very tempting to just pick Cuda and get started with something that works, but if at all possible I would prefer something that is open and not locked to one hardware vendor.
I've dredged through Julia, Numba, JAX, and Futhark looking for a way to get good CPU performance in the absence of a GPU, and I'm not really happy with any of them. Especially given how many of them want you to lug LLVM along.
A recent simulation code, when pushed with gcc openmp-simd, matched the performance of jax.jit on an RTX 4090 while running on a 13900K. This case worked because the overall computation can be structured into pieces that fit in L1/L2 cache, but I had to spend a ton of time writing the C code, whereas jax.jit was too easy.
So I'd still like to see something like this but which really works for CPU as well.
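For the record, the JAX side of that comparison really is about this much code; a toy stencil step as an illustration (not the actual simulation):

    import jax
    import jax.numpy as jnp

    @jax.jit
    def step(u, dt):
        # 5-point Laplacian with periodic boundaries
        lap = (jnp.roll(u, 1, 0) + jnp.roll(u, -1, 0) +
               jnp.roll(u, 1, 1) + jnp.roll(u, -1, 1) - 4.0 * u)
        return u + dt * lap

    u = jnp.zeros((512, 512)).at[256, 256].set(1.0)
    for _ in range(100):
        u = step(u, 0.1)
    print(float(u.sum()))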
Agreed, JAX is specialized for GPU computation -- I'd really like similar capabilities with more permissive constructs, maybe even co-effect tagging of pieces of code (which part goes on GPU, which part goes on CPU), etc.
I've thought about extending JAX with custom primitives and a custom lowering process to support constructs which work on CPU (but don't work on GPU) -- but if I did that, and wanted a nice programmable substrate -- I'd need to define my own version of abstract tracing (because necessarily, permissive CPU constructs might imply array type permissiveness like dynamic shapes, etc).
You start heading towards something that looks like Julia -- the problem (for my work) with Julia is that it doesn't support composable transformations like JAX does.
Julia + JAX might be offered as a solution -- but it's quite unsatisfying to me.
What does this mean? I've mainly heard the term "spatial computing" in the context of the Vision Pro release. It doesn't seem like this was intended for AR/VR.
That's the part I don't get. When you are developing AI, how much code are you really running on GPUs? How bad would it be to write it for something else if you could get 10% more compute per dollar?
Those 10% are going to matter when you have an established business case and you can start optimizing. The AI space is not like that at all; nobody cares about losing money 10% faster. You cannot risk a 100% slowdown from running into issues with an exotic platform for a 10% speedup.
[0] https://www.taichi-lang.org/
[1] http://numba.pydata.org/
[2] https://cython.readthedocs.io/en/stable/
In case you haven't tried it yet, Pythran is an interesting one to play with: https://pythran.readthedocs.io
Also, not compiling to C but still to native code, there's Mojo: https://www.modular.com/max/mojo
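The Pythran workflow is that you keep plain NumPy-style Python, add an export comment for the entry point, and compile the module ahead of time. A rough sketch (the function and file name are made up):

    # kernel.py -- compile to a native extension with: pythran kernel.py
    # pythran export smooth(float64[])
    import numpy as np

    def smooth(x):
        # 3-point moving average, written as plain NumPy so CPython can still import it unchanged
        y = np.empty_like(x)
        y[0], y[-1] = x[0], x[-1]
        y[1:-1] = (x[:-2] + x[1:-1] + x[2:]) / 3.0
        return y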
(1) https://github.com/NVIDIA/warp/blob/main/LICENSE.md
"2.7 You may not use the Software for the purpose of developing competing products or technologies or assist a third party in such activities."
vs
"California’s public policy provides that every contract that restrains anyone from engaging in a lawful profession, trade, or business of any kind is, to that extent, void, except under limited statutory exceptions."
https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml...
(The only exception I found: an owner/partner who sold a business may voluntarily agree to a noncompete, which is now federally banned anyway: https://www.ftc.gov/legal-library/browse/rules/noncompete-ru...)
I'm not a lawyer. Any lawyers around? Could the 2nd provision invalidate the 1st, or not?
Most of the magic is behind closed-source components, and it's posted with a fairly restrictive license.
https://github.com/NVIDIA/warp?tab=License-1-ov-file#readme
Looks more "source available" to me.
https://github.com/taichi-dev/taichi
https://devtalk.blender.org/t/was-gpu-support-just-outright-...
https://render.otoy.com/forum/viewtopic.php?f=7&t=75411 ("As of right now, the Vulkan drivers on AMD and Intel are not mature enough to compile (much less ship) Octane for Vulkan")
https://learn.microsoft.com/en-us/windows/win32/direct3darti...
It's basically the software implementation of DirectX.
Architectural elements of _all_ graphics cards.
> The efficiency of executing threads in groups, which is known as warps in NVIDIA and wavefronts in AMD, is crucial for maximizing core utilization.
[0] https://www.xda-developers.com/how-does-a-graphics-card-actu...
What's this community's take on Triton? https://openai.com/index/triton/
Are there better alternatives?
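For context, Triton kernels are still written in Python but in a block-oriented style; the canonical vector-add looks roughly like this (a sketch, needs a CUDA device and a recent triton/torch):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
    print(out[:4])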