But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.
If this function were optimized, or switched to a different implementation when there are tens of thousands of virtual registers, you would get orders of magnitude faster compilation.
But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example, we are talking about 25ms for a single function with 1MB of machine code, and sub-millisecond times when you generate tens of KB of machine code.
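For a sense of scale at the small end, here is a minimal sketch of timing the generation of a trivial function with AsmJit's public JitRuntime / CodeHolder / x86::Assembler API (the function body itself is just illustrative):

    #include <asmjit/x86.h>
    #include <chrono>
    #include <cstdio>

    using namespace asmjit;

    int main() {
      JitRuntime rt;                       // owns executable memory
      auto t0 = std::chrono::steady_clock::now();

      CodeHolder code;
      code.init(rt.environment());         // target the host environment
      x86::Assembler a(&code);
      a.mov(x86::eax, 42);                 // trivial function: return 42
      a.ret();

      int (*fn)(void);
      if (rt.add(&fn, &code) != kErrorOk)  // relocate + make executable
        return 1;

      auto t1 = std::chrono::steady_clock::now();
      long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
      std::printf("generated in %lld ns, fn() = %d\n", ns, fn());
      rt.release(fn);
      return 0;
    }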
So from my perspective, the ability to generate SIMD code that the CPU executes fast in inner loops is much more valuable than anything else. Any CPU-bound workload deserves this; the question is how CPU-bound the workload actually is. I would imagine databases like Postgres are more memory-bound if you are processing huge rows while accessing only a very tiny part of each row - that's why columnar databases are so popular, though of course they have different problems.
I worked on one project that tried to deal with this by hashing each column into one of 16 buckets, so that columns used together end up closer to each other and the query engine only needs to load the buckets actually used by the query. But we are talking about gigabytes of raw throughput per core in this case.
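Roughly, the idea was something like this hypothetical sketch (the names and hash choice here are mine, not the project's):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    constexpr uint32_t kBucketCount = 16;

    // Every column name deterministically maps to one of 16 buckets.
    inline uint32_t bucketOf(const std::string& columnName) {
      return uint32_t(std::hash<std::string>{}(columnName) % kBucketCount);
    }

    struct Row {
      // Each bucket stores the packed values of the columns hashed into it,
      // so columns used together tend to be loaded together.
      std::vector<uint8_t> buckets[kBucketCount];
    };

    // The engine computes which buckets a query's columns map to and reads
    // only those, instead of pulling the whole (huge) row through the cache.
    uint32_t bucketsNeeded(const std::vector<std::string>& queryColumns) {
      uint32_t mask = 0;
      for (const auto& c : queryColumns)
        mask |= 1u << bucketOf(c);
      return mask;
    }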
I think AsmJit's strength is the completeness of its backends - you can emit nice SIMD code with it (like AVX-512). The performance could of course be better, and that's achievable - making it 2x faster should be possible.
This technique is known as a "tiered JIT". It's how production virtual machines operate for high-level languages like JavaScript.
There can be many tiers, like an interpreter, baseline compiler, optimizing compiler, etc. The runtime switches to the faster tier once it becomes ready.
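A minimal sketch of the core mechanism - calls dispatch through a function pointer that gets retargeted once a call site is hot enough to deserve a faster tier (all names here are illustrative, not any particular VM):

    #include <cstdio>

    using EvalFn = long (*)(long);

    static long interpreterTier(long x);                 // tier 0: ready immediately
    static long optimizedTier(long x) { return x * 2; }  // tier 1: the JIT's output

    static EvalFn current = interpreterTier;
    static int hotness = 0;

    static long interpreterTier(long x) {
      // Once the call site gets hot, "compile" it and retarget dispatch;
      // in a real VM this is where the optimizing compiler kicks in.
      if (++hotness > 1000)
        current = optimizedTier;
      return x * 2;
    }

    int main() {
      long sum = 0;
      for (long i = 0; i < 100000; i++)
        sum += current(i);   // dispatch through whichever tier is current
      std::printf("%ld\n", sum);
      return 0;
    }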
More info for the interested:
This is not too difficult; it just requires a different execution style. Salesforce's Hyper, for example, relies very heavily on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has extremely low start-up latency; an interpreter exists but is practically never used.
Postgres is very robust and very powerful, but simply not designed for fast execution of queries.
Disclosure: I work in the group that develops Umbra.
The problem will always be queries where compilation is orders of magnitude more expensive than the query itself - think an indexed lookup of one or a few entries. Lookups like these are already very well optimized by SQL query engines, and JIT-compiling them makes little sense.
The problems related to PostgreSQL are pretty much all described here. It's very difficult to serve low-latency queries if you cannot cache the compiled code and have to compile over and over again. And once your JIT is slow, you need logic to decide whether to interpret or compile.
I think the best approach would be to start interpreting the query and kick off compilation in another thread; if compilation finishes while the interpreter is still running, stop the interpreter and switch to the JIT-compiled code. This would give you the best latency, because there would be no waiting for the JIT compiler.
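A hedged sketch of that approach, with hypothetical interpretBatch()/jitCompile() stand-ins - interpretation starts immediately, and the loop switches over as soon as the compiler thread publishes the compiled function:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Stand-ins for a batch-at-a-time interpreter and the function a JIT
    // compiler would eventually produce for the same query.
    using BatchFn = long (*)(long);
    static long interpretBatch(long batch) { return batch % 7; }  // slow path
    static long compiledBatch(long batch)  { return batch % 7; }  // fast path

    static std::atomic<BatchFn> fastPath{nullptr};

    static BatchFn jitCompile() {
      std::this_thread::sleep_for(std::chrono::milliseconds(5));  // pretend JIT work
      return compiledBatch;
    }

    long runQuery(long batches) {
      // Kick off compilation in the background and start interpreting at once.
      std::thread jit([] {
        fastPath.store(jitCompile(), std::memory_order_release);
      });
      long result = 0;
      for (long b = 0; b < batches; b++) {
        BatchFn fn = fastPath.load(std::memory_order_acquire);
        result += fn ? fn(b)               // compiled code became available
                     : interpretBatch(b);  // still interpreting
      }
      jit.join();
      return result;
    }

    int main() { std::printf("%ld\n", runQuery(10000000)); return 0; }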
So no, it doesn't kill any compatibility - it only shows a different approach.
BTW, GPU-only renderers suck, and many renderers that have both GPU and CPU engines suck when the GPU is unavailable or buggy. Strong CPU rendering performance is simply necessary for any library that wants true compatibility across platforms.
I have seen broken rendering on GPU many, many times, without any ability to switch to CPU. And the biggest problem is that the more exotic the HW you run on, the less chance somebody will be able to fix it (talking about GPUs).
On my Apple M1 Pro, the Vello CPU renderer is competitive with the GPU renderers on simple scenes, but falls behind on more complex ones, and especially seems to struggle with large raster images. This is also without a glyph cache, which isn't implemented yet (so it re-rasterizes every glyph every time, although there is a hinting cache). This depends on multi-threading being enabled and can consume a large portion of all cores while it runs. Skia raster (CPU) gets similar numbers, which is quite impressive if it is single-threaded.
The obsession with memory safety just doesn't pay off in some cases - batching 64 pixels at once with SIMD simply cannot be compared to a per-pixel processor with a branch in its hot path.
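To illustrate the difference, compare a branchy per-pixel loop with a branchless SIMD version (AVX2 here, processing 32 pixels per iteration; AVX-512 would do 64):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Per-pixel processing: one branch per pixel in the hot path.
    void clampScalar(uint8_t* px, size_t n, uint8_t limit) {
      for (size_t i = 0; i < n; i++)
        if (px[i] > limit)
          px[i] = limit;
    }

    // SIMD processing: 32 pixels per iteration, no per-pixel branch.
    void clampSimd(uint8_t* px, size_t n, uint8_t limit) {
      __m256i vLimit = _mm256_set1_epi8((char)limit);
      size_t i = 0;
      for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(px + i));
        _mm256_storeu_si256((__m256i*)(px + i), _mm256_min_epu8(v, vLimit));
      }
      for (; i < n; i++)   // scalar tail for the remainder
        if (px[i] > limit)
          px[i] = limit;
    }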
IMO, one of the biggest benefits of a high-performance renderer would be power savings (very important for laptops and phones). If I can run the same work using half the power, then by all means I'd be happy to deal with the complications the GPU brings. AFAIK, though, nobody really cares about that, and even efforts like Vello just target FPS gains, which correlate with reduced power consumption only indirectly.
Historically, 2D rendering on CPU was pretty much single-threaded. Skia is single-threaded, Cairo too, Qt mostly (they offload gradient rendering to threads, but it's painfully slow for small gradients - worse than single-threaded), AGG is single-threaded, etc...
In the end, only Blend2D, Blaze, and now Vello can use multiple threads on CPU, so CPU vs GPU comparisons can finally be made more fairly - and power draw is definitely a nice property to benchmark. BTW, Blend2D was probably the first library to offer multi-threaded rendering on CPU (just an option passed to the rendering context, same API otherwise - see the sketch below).
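For reference, enabling it looks roughly like this, per Blend2D's documented BLContextCreateInfo API (the scene itself is an arbitrary example):

    #include <blend2d.h>

    int main() {
      BLImage img(1920, 1080, BL_FORMAT_PRGB32);

      // The only difference from single-threaded use: request worker
      // threads when creating the rendering context.
      BLContextCreateInfo createInfo {};
      createInfo.threadCount = 4;       // 0 (default) = synchronous rendering

      BLContext ctx(img, createInfo);
      ctx.clearAll();
      ctx.setFillStyle(BLRgba32(0xFF4286F4u));
      ctx.fillCircle(960, 540, 300);    // arbitrary example scene
      ctx.end();                        // flushes work and detaches the context

      return 0;
    }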
As far as I know, nobody has done good benchmarking of CPU vs GPU 2D renderers - it's very hard to do a completely unbiased comparison, and you would be surprised how good the CPU is in this mix. A modern CPU core consumes maybe a few watts, and you can render to a 4K framebuffer with that single core. Put text rendering into the mix and the numbers start to get very interesting. GPU memory allocation should also be included, because rendering fonts on GPU means pre-processing them as well, etc...
2D is just very hard. On CPU and GPU you end up solving slightly different problems, but doing it right either way is an insane amount of work, research, and experimentation.
Definitely good to know though. When it comes to low-latency compilation, my personal goal is to make it even faster when generating small functions.