In fact, the use of loops described in this article reminded me of what Reich called "phasing": essentially the same idea of melodic patterns emerging and shifting between different samples.
They've started to experiment with that in Mesa and Linux ("user queues", i.e. user-mode hardware queues).
I don't know how they will work around the scarce VM IDs, but here we are talking about near-zero driver involvement. Obviously, they will have to simplify/clean up a lot of the 3D pipeline programming and be very sure of its robustness, basically to have it ready for "default" rendering/usage right away.
Userland will get from the kernel something along these lines: command/event hardware ring buffers, DMA data buffers, a memory page of read/write pointers and doorbells for those ring buffers, and an event file descriptor for an event ring buffer. Basically, what the kernel itself works with today.
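As a purely illustrative sketch (the names and layout here are mine, not the actual Mesa/amdgpu interface), the userland view of such a queue might look something like:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical view of what the kernel hands back for one user queue.
     * Illustrative only -- not the real amdgpu/Mesa ABI. */
    struct user_queue {
        uint32_t *ring;               /* command ring buffer, mapped R/W         */
        size_t    ring_size;          /* in dwords, assumed power of two         */
        volatile uint32_t *rptr;      /* read pointer, advanced by the GPU       */
        volatile uint32_t *wptr;      /* write pointer, advanced by userland     */
        volatile uint32_t *doorbell;  /* mapped MMIO page; a write kicks the GPU */
        int       event_fd;           /* poll()able fd for the event ring        */
    };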
I wonder if it will provide any significant simplification over the current approach of handing indirect command buffers to the kernel and dealing with 'sync objects'/barriers.
The major upside is removing the context switch on each submission. The idea is that an application only talks to the kernel for queue setup/teardown; everything else happens in userland.
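If it works like other user-mode submission schemes, the hot path would be roughly this (again a hedged sketch reusing the hypothetical struct above, not a real API):

    /* Submit a packet with no syscall: copy it into the ring, publish the
     * new write pointer, then ring the doorbell. A real version would also
     * check *q->rptr to avoid overrunning a full ring. */
    static void uq_submit(struct user_queue *q, const uint32_t *pkt, size_t ndw)
    {
        uint32_t w = *q->wptr;
        for (size_t i = 0; i < ndw; i++)
            q->ring[(w + i) & (q->ring_size - 1)] = pkt[i];
        __atomic_store_n(q->wptr, (uint32_t)(w + ndw), __ATOMIC_RELEASE);
        *q->doorbell = (uint32_t)(w + ndw);  /* MMIO write wakes the GPU scheduler */
    }

The doorbell write is the whole "submission": no ioctl, no CPU-side scheduler wakeup.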
How much that does for efficiency I can't say, but I imagine it helps, especially given just how damn easy it is to decode.
Initially this was just a vehicle for me to get stuck in and learn some WebGPU, so no doubt I'm missing lots of opportunities for optimisation - but it's been fun as much as frustrating. I leaned heavily on the SMPTE specification document and the FFmpeg proresdec.c implementation to understand and debug.
The ProRes bitstream spec was given to SMPTE [1], but I never managed to find any information on ProRes RAW, so it's exciting to see software and compute implementations here. Has this been reverse-engineered by the FFmpeg wizards? At first glance at the code, it does look fairly similar to regular ProRes.
[1] https://pub.smpte.org/doc/rdd36/20220909-pub/rdd36-2022.pdf
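For a sense of how approachable the bitstream is: per RDD 36, a frame starts with a 32-bit size and an 'icpf' identifier, followed by a frame header. A minimal peek at it might look like this (offsets paraphrased from the spec and FFmpeg's decoder, so double-check them before relying on this):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Minimal ProRes frame-header peek. Field offsets paraphrased from
     * SMPTE RDD 36; verify against the spec before trusting them. */
    static uint32_t rd32(const uint8_t *p) {
        return (uint32_t)p[0] << 24 | p[1] << 16 | p[2] << 8 | p[3];
    }
    static uint16_t rd16(const uint8_t *p) {
        return (uint16_t)(p[0] << 8 | p[1]);
    }

    static int peek_frame(const uint8_t *buf, size_t len,
                          uint32_t *frame_size, int *width, int *height)
    {
        if (len < 20 || memcmp(buf + 4, "icpf", 4) != 0)
            return -1;                 /* not a ProRes frame               */
        *frame_size = rd32(buf);       /* total frame size, self-inclusive */
        /* frame header starts at byte 8: 16-bit header size, reserved
         * byte, bitstream version, 32-bit encoder ID, then dimensions */
        *width  = rd16(buf + 16);
        *height = rd16(buf + 18);
        return 0;
    }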
I'm curious how a WebGPU implementation would differ from a Vulkan one. Here's mine if you're interested: https://github.com/averne/FFmpeg/tree/vk-proresdec
> The go-to solution here is GPU accelerated video compression
Isn't the solution usually hardware encoding?
> I think this is an order of magnitude faster than even dedicated hardware codecs on GPUs.
Is there an actual benchmark though?
I would have assumed that built-in hardware encoding would always be faster. Plus, I'd assume your game is already saturating your GPU, so the last thing you want to do is use it for simultaneous video encoding. But I'm not an expert in either of these, so I'm curious to know if/how I'm wrong here. Are hardware encoders designed for real-time use, intentionally trading compression efficiency for lower latency? And is the proposed video encoding really so lightweight that it can easily share the GPU without affecting game performance?
Generally, you're right that these hardware blocks favor low latency. One example is motion estimation, one of the most expensive operations during encoding. The NVENC engine on NVIDIA GPUs will only use fairly basic detection loops, but can optionally be fed motion hints from an external source. I know that NVIDIA has a CUDA-based motion estimator (called CEA) for this purpose. On recent GPUs there is also the optical flow engine (another separate hardware block), which might be able to do higher-quality detection.
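For reference, the "basic detection loop" in question is essentially SAD-based block matching. A toy full-search version (nothing vendor-specific, just to show where the cost comes from):

    #include <stdint.h>
    #include <stdlib.h>

    /* Toy full-search motion estimation: find the (dx, dy) within +/-range
     * that minimizes the sum of absolute differences for one 16x16 block.
     * Caller must keep the search window inside both frames. */
    static void motion_search(const uint8_t *cur, const uint8_t *ref,
                              int stride, int bx, int by, int range,
                              int *best_dx, int *best_dy)
    {
        unsigned best = ~0u;
        for (int dy = -range; dy <= range; dy++) {
            for (int dx = -range; dx <= range; dx++) {
                unsigned sad = 0;
                for (int y = 0; y < 16; y++)
                    for (int x = 0; x < 16; x++)
                        sad += abs(cur[(by + y) * stride + (bx + x)] -
                                   ref[(by + dy + y) * stride + (bx + dx + x)]);
                if (sad < best) {
                    best = sad;
                    *best_dx = dx;
                    *best_dy = dy;
                }
            }
        }
    }

A full search like this costs O(range^2 * 16^2) per block, which is why hardware engines keep the window small or use hierarchical/diamond searches, and exactly why externally supplied hints from a smarter estimator can buy back quality.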
>Fabrice won the International Obfuscated C Code Contest three times, and you need a certain mindset to create code like that, one which creeps into your other work. So even though his implementation of FFmpeg was fast, it was not very nice to debug or refactor, especially if you're not Fabrice