From the recommendation of another commenter, here's a more recent indie game that seems focused exactly on that style of path logistics:
Should have just been an extension with a paid plan.
I've had good luck with indirection tables used during lookup inside the kernels that consume/produce the kvcache data - it's essentially the user-mode remapping they do here: you publish a buffer offset table, and since threads are uniform, reads to the table are coalesced and the offsets cache just fine. You have the same memory-locality issues as with virtual memory (contiguous virtual but potentially random physical), but you're not limited to device page sizes, and since you can update the table while work is in flight you can be much more aggressive about reuse and offload (enqueue a DMA to evict cold data from VRAM to storage, enqueue a DMA to copy from cold memory into the reused VRAM, enqueue the offset-table update, enqueue work that uses it, repeat - all without host synchronization). You can also defrag in flight if you do want to restore physical locality. It's nothing crazy and fairly normal in CPU land (or even classic virtual texturing), but in ML GPU land I could write a big paper on it, call it SuperDuperFancyAttention4, and publish press releases...
A couple of first-impression-experience pieces of feedback:
* when it first starts, how about a blank untitled document to play with, and maybe select a fun tool like the Bezier pencil as the initially selected tool. I was able and motivated to click around, realize there was no open document (probably the biggest stumbling block), create a new document, change the tool, and start playing, but many users won't be.
* it seems like a small thing, but please make the default canvas a bit larger (maybe 512 or 500 square). Again just more fun in that critical 10-20 second window, which is all most people will realistically give when checking out a new thing. If you can't hook 'em in that timeframe, they simply browse away and you lose a ton of folks.
Looks very good, thank you for sharing.
The motivation behind a lot of this was to have community LLVM implementations of runtime functions normally provided by the vendor libraries (C math, printf, malloc), but if you implement one function you may as well implement them all.
Realistically, the infrastructure behind this project is more relevant than the C library calls themselves. The examples in the linked documentation work for arbitrary C/C++ just as well as for the LLVM C library - it's simply static linking. This is what allowed me to compile and run more complicated things like libc++ and DOOM on the GPU as well. The RPC interface can also be used to implement custom host services from the GPU, or to communicate between any two shared-memory processes.