The motivation behind a lot of this was to have community LLVM implementations of runtime functions normally provided by the vendor libraries (C math, printf, malloc), but if you implement one function you may as well implement them all.
Realistically, the infrastructure behind this project is more relevant than the C library calls themselves. The examples in the linked documentation work for arbitrary C/C++ just as well as for the LLVM C library; it's simply static linking. This is what allowed me to compile and run more complicated things like libc++ and DOOM on the GPU as well. The RPC interface can also be used to implement custom host services from the GPU, or to communicate between any two shared-memory processes.
Just wanted to say thanks for pushing on this front! I'm not using the libc portion but the improvements to clang/llvm that allow this to work have been incredible. When I was looking a few months back the only options that felt practical for writing large amounts of device code were cuda/hip or opencl and a friend suggested I just try C _and it worked_. Definitely made my "most practical/coolest that it actually works" list for 2024 :)
I wonder how the GPU is going to access an unknown-size NULL-terminated string in system RAM; the strchr() source looks like normal C++. In my minimal Vulkan GPGPU experience the data needs to be bound to VkDeviceMemory to be accessible over the PCI bus from a compute shader. Is the LLVM libc runtime doing a similar set-up in the background, and if so, is it faster than glibc's hand-tuned AVX implementation?
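(For reference, a textbook strchr really is just a byte-wise loop with no hidden magic; a minimal freestanding sketch, not the actual LLVM libc source, is below. Whether the GPU can dereference the pointer at all is decided by how the memory was allocated and mapped, which is the offloading runtime's job rather than strchr's.)

```c
#include <stddef.h>
#include <stdio.h>

/* Minimal strchr sketch (not the LLVM libc implementation): walk the
 * bytes until either the requested character or the terminating NUL is
 * found. The only requirement is that the pointer is readable from
 * wherever this code runs - CPU or GPU. */
char *my_strchr(const char *s, int c) {
    for (;; ++s) {
        if (*s == (char)c)
            return (char *)s;   /* found (this also matches the NUL itself) */
        if (*s == '\0')
            return NULL;        /* end of string, no match */
    }
}

int main(void) {
    const char *msg = "hello, gpu";
    printf("%s\n", my_strchr(msg, 'g'));  /* prints "gpu" */
    return 0;
}
```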
This is libc running on the GPU, not a libc spanning CPU and GPU. The primary obstruction to doing that is persuading people to let go of glibc. The spec of "host runs antique glibc, GPU runs some other thing that interops transparently with glibc" is a nightmare of hacks and tragedy.
What would be relatively easy to put together is llvm libc running on the host x64 and also llvm libc running on the GPU. There's then the option to do things like malloc() on the GPU and free() the same pointer on the CPU. Making it genuinely seamless also involves persuading people to change what function pointers are, do some work on the ABI, and preferably move to APUs because PCIe is unhelpful.
There's an uphill battle to bring people along on the journey of "program the things differently". For example, here's a thread trying to drum up enthusiasm for making function pointers into integers, as that makes passing them between heterogeneous architectures far easier: https://discourse.llvm.org/t/rfc-function-pointers-as-intege....
> making function pointers into integers as that makes passing them between heterogeneous architectures
This is interesting, though function pointers have long been expected to be addresses in the binary; C-brained people like me would probably adapt more easily to the concept of a "pointer to a heterogeneous lambda object" or a "shared id across heterogeneous runtimes".
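A minimal sketch of the "shared id across heterogeneous runtimes" idea (hypothetical names; this is not what the linked RFC proposes, just an illustration of the concept): instead of passing a raw code address across the host/device boundary, both sides agree on small integer IDs and each keeps its own dispatch table, so the ID stays meaningful even though each runtime loads the code at different addresses.

```c
#include <stdio.h>

/* Hypothetical example: an integer ID indexes a per-runtime dispatch
 * table, so the same ID can be exchanged between host and device even
 * though each side has its own code addresses. */
typedef double (*unary_fn)(double);

enum fn_id { FN_SQUARE = 0, FN_NEGATE = 1, FN_COUNT };

static double square(double x) { return x * x; }
static double negate(double x) { return -x; }

/* Each runtime (host build, device build) defines its own table in the
 * same order; only the IDs cross the boundary. */
static const unary_fn dispatch[FN_COUNT] = { square, negate };

static double call_by_id(enum fn_id id, double x) {
    return dispatch[id](x);
}

int main(void) {
    /* The ID could just as well have arrived from the other side of a
     * host/device channel. */
    printf("%f\n", call_by_id(FN_SQUARE, 3.0));  /* 9.000000 */
    printf("%f\n", call_by_id(FN_NEGATE, 3.0));  /* -3.000000 */
    return 0;
}
```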
I am pretty sure this is just a gimmick. I would not call libc code in a GPU kernel. It would mean dragging in a whole bunch of stuff I don't want, and can't control or pick and choose. That makes sense for regular processes on the CPU; it does _not_ make sense in code you run millions of times in GPU threads.
I see people saying they've "dreamed" of this or have waited so long for this to happen... well, my friends, you should not have; and I'm afraid you're in for a disappointment.
It uses a few MB of contiguous shared memory and periodically calls a function from a host thread. Unless you only want sprintf or similar in which case neither is needed. The unused code dead-strips pretty well. It won't help compilation time. Generally you don't want libc calls in numerical kernels doing useful stuff - the most common request was for printf as a debugging crutch; I mostly wanted mmap.
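To make that mechanism concrete, here is a toy version of the pattern with both sides as ordinary host threads instead of GPU and CPU (made-up names; this is not the actual LLVM libc RPC protocol): the client writes a request into shared memory and flips a flag, and a service thread periodically polls the flag, does the work, and publishes the result.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Toy mailbox: one slot, one outstanding request at a time. This only
 * illustrates the "poll a flag, do the work, hand back a result" shape
 * of the protocol. */
struct mailbox {
    atomic_int state;        /* 0 = empty, 1 = request posted, 2 = reply ready */
    int opcode;              /* what the client wants done */
    long long arg, result;
};

enum { OP_SQUARE = 1, OP_QUIT = 2 };

static void *server(void *p) {                    /* the "host thread" */
    struct mailbox *m = p;
    for (;;) {
        if (atomic_load(&m->state) != 1)          /* nothing to do yet: keep polling */
            continue;
        if (m->opcode == OP_QUIT) { atomic_store(&m->state, 2); return NULL; }
        m->result = m->arg * m->arg;              /* perform the requested service */
        atomic_store(&m->state, 2);               /* publish the reply */
    }
}

static long long rpc_square(struct mailbox *m, long long x) {  /* the "device" side */
    m->opcode = OP_SQUARE;
    m->arg = x;
    atomic_store(&m->state, 1);                   /* post the request */
    while (atomic_load(&m->state) != 2) {}        /* wait for the reply */
    long long r = m->result;
    atomic_store(&m->state, 0);                   /* release the slot */
    return r;
}

int main(void) {
    struct mailbox m = { .state = 0 };
    pthread_t t;
    pthread_create(&t, NULL, server, &m);
    printf("%lld\n", rpc_square(&m, 12));         /* prints 144 */
    m.opcode = OP_QUIT;
    atomic_store(&m.state, 1);
    pthread_join(t, NULL);
    return 0;
}
```

Compile with `-pthread`. A real implementation has to handle many GPU threads sharing the channel and avoid the naive spinning shown here; this single-slot toy deliberately ignores that.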
> It uses a few MB of contiguous shared memory and periodically calls a function from a host thread.
Well, if someone wants to shoot themselves in the head, then by all means...
> Unless you only want sprintf or similar in which case neither is needed.
> ... Generally you don't want libc calls in numerical kernels doing useful
> stuff - the most common request was for printf as a debugging crutch
I have actually adapted a library for that particular case:
https://github.com/eyalroz/printf/
I started with a standalone printf-family implementation targeting embedded devices, and (among other things) adapted it for use also with CUDA; a sketch of the general callback pattern follows this comment.
> I mostly wanted mmap.
Does it really make sense to make a gazillion mmap calls from the threads of your GPU kernel? I mean, is it really not always better to mmap on the CPU side? At most, I might do it asynchronously using a CUDA callback or some other mechanism. But I will admit I've not had that use-case.
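Regarding the printf point above: the usual shape of such an embedded/retargetable printf is that the formatting core never touches the OS and emits characters only through a caller-supplied hook, which is what makes it easy to drop into CUDA or anywhere else. A toy illustration of the pattern (only %d, %s and %% are handled; hypothetical names, not that library's actual API):

```c
#include <stdarg.h>
#include <stdio.h>

/* Toy retargetable printf core: all output goes through a single hook,
 * so retargeting it (to a UART, a device-side buffer, ...) just means
 * supplying a different hook. Only %d, %s and %% are handled here. */
typedef void (*put_fn)(char c, void *ctx);

static void emit_str(put_fn put, void *ctx, const char *s) {
    while (*s) put(*s++, ctx);
}

static void emit_int(put_fn put, void *ctx, long v) {
    char buf[24];
    int i = 0;
    unsigned long u = (unsigned long)v;
    if (v < 0) { put('-', ctx); u = 0UL - u; }  /* negate via unsigned wraparound */
    do { buf[i++] = (char)('0' + u % 10); u /= 10; } while (u);
    while (i) put(buf[--i], ctx);               /* digits were produced in reverse */
}

static void tiny_vprintf(put_fn put, void *ctx, const char *fmt, va_list ap) {
    for (; *fmt; ++fmt) {
        if (*fmt != '%') { put(*fmt, ctx); continue; }
        switch (*++fmt) {
        case '\0': return;                                       /* stray '%' at end */
        case 'd': emit_int(put, ctx, va_arg(ap, int)); break;
        case 's': emit_str(put, ctx, va_arg(ap, const char *)); break;
        case '%': put('%', ctx); break;
        default:  put('?', ctx); break;                          /* unsupported specifier */
        }
    }
}

/* One possible hook: write to stdout on the host. A CUDA port would
 * instead append into a per-thread buffer that the host prints later. */
static void put_stdout(char c, void *ctx) { (void)ctx; fputc(c, stdout); }

static void tiny_printf(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    tiny_vprintf(put_stdout, NULL, fmt, ap);
    va_end(ap);
}

int main(void) {
    tiny_printf("answer: %d (%s)\n", 42, "toy printf");
    return 0;
}
```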
I'm not sure what disappointment you're predicting. Unless your GPU is connected through a cache coherent protocol like CXL to your CPU, you are unlikely to make your code run faster by transferring data back to the CPU and back again to the GPU. You have 128 compute units on the 4090; even at a lower frequency and higher memory latency, you will probably not end up too far away from the performance of an 8 core CPU running at 4.5GHz. Nobody is running millions of CPU threads in the first place, so you seem to be completely misunderstanding the workload here. Nobody wants to speed up their CPU code by running it on the GPU; they want to stop slowing down their GPU code by waiting for data transfers to and from the CPU.
A PCIe 4.0 x16 link gives 32 GB/s of bandwidth; an RTX 4090 has over 1 TB/s of bandwidth to its on-card memory.
It seems like unified memory has to be the goal. This all just feels like a kludgy workaround until that happens (kind of like segmented memory in the 16-bit era).
Is unified memory practical for a "normal" desktop/server configuration though? Apple has been doing unified memory, but they also have the GPU on the CPU die. I would be interested to know if a discrete GPU plugged into a PCIe slot would have enough latency to make unified memory impractical.
- For new code, you can do just fine without all of the functions [here](https://libc.llvm.org/gpu/support.html#libc-gpu-support).
- For old code, either:
  * your project is large enough that you are likely using an unsupported libc function somewhere, or
  * your project is small enough that you would benefit from just implementing a new kernel yourself.
I am biased because I avoid the C standard library even on the CPU, but this seems like a technology that raises the floor not the ceiling of what is possible.
> ... this seems like a technology that raises the floor not the ceiling of what is possible.
The root reason this project exists is to show that GPU programming is not synonymous with CUDA (or the other offloading languages).
It's nominally to help people run existing code on GPUs. Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets. This is obvious from the implementation side but seems largely missed by application developers. Lots of people think GPUs can only do floating point math.
Especially on an APU, where the GPU units and the CPU cores can hammer on the same memory, it is a travesty to persist with the "offloading to accelerator" model. Raw C++ isn't an especially sensible language to program GPUs in but it's workable and I think it's better than CUDA.
One thing this gives you is syscall on the gpu. Functions like sprintf are just blobs of userspace code, but others like fopen require support from the operating system (or whatever else the hardware needs you to do). That plumbing was decently annoying to write for the gpu.
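As a made-up illustration of that distinction: something like sprintf can be compiled for the device as ordinary code, whereas fopen has to be packed into a request that the host performs on the device's behalf, returning an opaque handle. A sketch of what such a marshalled request might look like (hypothetical layout, not the actual LLVM libc wire format):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical request record for forwarding fopen to the host: the
 * device would fill this in and push it over the shared-memory channel;
 * the host performs the real fopen and writes back an opaque handle. */
struct open_request {
    char path[256];
    char mode[8];
    unsigned long long handle;   /* filled in by the host side */
};

/* Host-side handler - the only place the real OS call happens. */
static void host_handle_open(struct open_request *req) {
    FILE *f = fopen(req->path, req->mode);
    req->handle = (unsigned long long)(uintptr_t)f;  /* opaque to the device */
}

int main(void) {
    struct open_request req = {0};
    strncpy(req.path, "/dev/null", sizeof req.path - 1);  /* POSIX path, just for the demo */
    strncpy(req.mode, "r", sizeof req.mode - 1);
    /* In the real thing this struct would travel over the RPC channel;
     * here the handler is simply called directly. */
    host_handle_open(&req);
    printf("handle = %#llx\n", req.handle);
    if (req.handle)
        fclose((FILE *)(uintptr_t)req.handle);
    return 0;
}
```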
These aren't gpu kernels. They're functions to call from kernels.
I've never understood why people say you "can't" do this or that on GPU. A GPU is made of SMs, and each SM is just a CPU with very wide SIMD pipes and very good hyperthreading. You can take one thread of a warp in a SM and do exactly the same things a CPU would do. Would you get 1/32 potential performance? Sure. But so what? Years ago, we did plenty of useful work with less than 1/32 of a modest CPU, and we can again.
One of the more annoying parts of the Nvidia experience is PTX. I know perfectly well that your CPU/SM/whatever has a program counter. Let me manipulate it directly!
In your view, how is making GPU programming easier a bad thing?
What if a CPU had assembly instructions for everything a GPU can do? Would compiler/language designers support them?