Asm2D (u/Asm2D) - Readit News

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

vladich · 10 days ago

I have a test with a 200KB query that AsmJit takes 7 seconds to compile (that's not too bad; both LLVM and MIR take ~20s), while sljit does it in 50ms. 200KB is a pathological case, but it's not unheard of in the area I'm working on. It's a realistic, although rare, case. Over the last 10-15 years most OLTP workloads became CPU-bound, because the active datasets of most real databases fully fit in memory. There are exceptions, of course.

Asm2D · 10 days ago

That's interesting - 200kB should not be a big deal for it - maybe it uses something that I usually don't, like many function calls, or an insane number of branches, etc. I would be interested in that case, but I'm not sure whether I would be able to blindly improve AsmJit without a comprehensive test.

Definitely good to know though. When it comes to low-latency compilation my personal goal is to make it even faster when generating small functions.

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

vladich · 10 days ago

SLJIT is a bit smarter than just using hardcoded registers. It's multi-platform anyway, so it uses registers when they are available on the target platform, and if not, it uses memory. That's why performance can differ between Windows and Linux on x64, for example - a different number of available registers.

Asm2D · 10 days ago

Indeed, but this also means that you would get drastically different performance on platforms that have more physical registers vs platforms that have fewer. For example x86_64 only has 16 GP registers, while AArch64 has 32 - if you use 25 registers without any analysis and just go to stack with 10 of them, the difference could be huge.
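A rough way to picture this, as a sketch only: count how many simultaneously live values do not fit into the registers a code generator can hand out. The function name `spilled_values` and the single reserved register (standing in for something like the stack pointer) are assumptions for illustration, not anything SLJIT or AsmJit exposes.

```python
def spilled_values(live_values: int, gp_registers: int, reserved: int = 1) -> int:
    """Return how many live values must go to the stack.

    `reserved` models registers the code generator cannot hand out
    (for example the stack pointer); the default of 1 is an assumption.
    """
    usable = gp_registers - reserved
    return max(0, live_values - usable)


# 25 live values, using the register counts quoted above.
x86_64_spills = spilled_values(25, 16)   # 15 usable registers
aarch64_spills = spilled_values(25, 32)  # 31 usable registers
print(x86_64_spills, aarch64_spills)
```

Under these assumptions the same code spills on one target and not at all on the other, which is where the performance gap would come from.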

But... I consider SLJIT to be for a different use-case than AsmJit. It's more portable, but its scope is much more limited.

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

vladich · 10 days ago

There are other issues with that auto-allocation. I tested all 3 backends on very large queries (hundreds of KBs per query). Performance of all of them (+LLVM, but -sljit) was abysmal - the compiler overhead was seconds to tens(!) of seconds. They have some non-linear components in their optimization algorithms, while sljit scaled linearly and was almost as fast as for smaller queries. So yes, it gives higher run-time performance, but the cost of that performance grows non-linearly with code size and complexity, while you can still get good performance with manual allocation. I also don't believe you can make AsmJit 2x faster without sacrificing that auto-allocation algorithm.

Asm2D · 10 days ago

AsmJit has only one place where a lot of time is spent - bin-packing. It's the least optimized part, which has quadratic complexity (at the moment), which starts to show when you have like hundreds of thousands of virtual registers. There is even a benchmark in AsmJit called `asmjit_bench_regalloc`, which shows that a single function that has 16MB alone, with 65k labels and 200k virtual registers takes 2.2 seconds to generate (and 40ms of that is time to just call `emit()`).

If this function is optimized, or switched to some other implementation when there are tens of thousands of virtual registers, you would get orders of magnitude faster compilation.
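To get a feel for what quadratic means here, a back-of-the-envelope sketch. It assumes the whole 2.2 seconds scales with the square of the virtual register count, which is a simplification (the real pipeline also has linear parts), and `estimated_seconds` is a hypothetical helper, not part of AsmJit.

```python
def estimated_seconds(vregs: int,
                      ref_vregs: int = 200_000,
                      ref_seconds: float = 2.2) -> float:
    """Extrapolate compile time assuming cost ~ vregs**2 (an assumption).

    The reference point (200k virtual registers, 2.2 s) is the benchmark
    figure quoted above.
    """
    return ref_seconds * (vregs / ref_vregs) ** 2


for n in (200_000, 20_000, 2_000):
    print(f"{n:>7} vregs -> {estimated_seconds(n) * 1000:.2f} ms")
```

Under that assumption, 10x fewer virtual registers means roughly 100x less time in the quadratic part.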

But realistically, which query requires tens of megabytes of machine code? These are pathological cases. For example we are talking about 25ms when it comes to a single function having 1MB of machine code, and sub-ms time when you generate tens of KB of machine code.

So from my perspective the ability to generate SIMD code that the CPU would execute fast in inner loops is much more valuable than anything else. Any workload that is CPU-bound just deserves this. The question is how CPU-bound the workload is. I would imagine databases like Postgres would be more memory-bound if you are processing huge rows and accessing only a very tiny part of each row - that's why columnar databases are so popular, but of course they have different problems.

I worked on one project that tried to deal with this using buckets and hashing: there would be 16 buckets, and each column would be hashed into one of them to keep the columns closer to each other, so the query engine needs to load only the buckets used in the query. But we are talking about gigabytes of raw throughput per core in this case.
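The bucketing idea could look roughly like this. It is a minimal sketch, not that project's code: the CRC32 hash, the column names, and the helper names `bucket_of` and `buckets_for_query` are all assumptions made for illustration.

```python
import zlib

NUM_BUCKETS = 16  # bucket count mentioned above


def bucket_of(column: str) -> int:
    """Map a column name to a bucket with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(column.encode("utf-8")) % NUM_BUCKETS


def buckets_for_query(columns: list[str]) -> set[int]:
    """Return the set of buckets a query touching `columns` has to load."""
    return {bucket_of(c) for c in columns}


# Hypothetical column names.
needed = buckets_for_query(["user_id", "price", "created_at"])
print(sorted(needed))
```

A query touching three columns loads at most three of the 16 buckets instead of every full row.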

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

vladich · 11 days ago

Interesting... AsmJit is pretty fast for compilation, but about 3x slower than sljit. The only way I can see to make it fast enough, in theory (i.e. without slowing down point-lookup queries and such), would be to fuse planning with code generation - i.e. a single-pass plan builder + compiler essentially. Not sure if Umbra tries to do that, and AsmJit is not the best choice for it anyway, but with sljit it could be on par with the interpreter even for the fastest queries, I believe. Pretty hard (likely impossible) to implement though, planning is inherently a non-linear process...

Asm2D · 11 days ago

Because pg_jitter uses AsmJit's Compiler, which also allocates registers. That's much more work than using hardcoded physical registers as in the SLJIT case. There is always a cost to such comfort.

I think AsmJit's strength is the completeness of its backends, as you can emit nice SIMD code with it (like AVX-512). But the performance could of course be better - making it 2x faster would be possible.

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

chrisaycock · 12 days ago

> I think it would be the best to start interpreting the query and start compilation in another thread

This technique is known as a "tiered JIT". It's how production virtual machines operate for high-level languages like JavaScript.

There can be many tiers, like an interpreter, baseline compiler, optimizing compiler, etc. The runtime switches into the faster tier once it becomes ready.

More info for the interested:

https://ieeexplore.ieee.org/document/10444855

Asm2D · 11 days ago

I write JITs so I know, but I always try to write in a way that even non-JIT people can understand :)

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

aengelke · 12 days ago

> It's very difficult to do low-latency queries if you cannot cache the compiled code

This is not too difficult, it just requires a different execution style. Salesforce's Hyper for example very heavily relies on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has an extremely low start-up latency; an interpreter exists but is practically never used.

Postgres is very robust and very powerful, but simply not designed for fast execution of queries.

Disclosure: I work in the group that develops Umbra.

[1]: https://umbra-db.com/

Asm2D · 11 days ago

If I recall the research papers on Umbra correctly, it also uses AsmJit as a JIT backend, which means that theoretically the compilation times would be comparable if you only consider code emitting overhead.

The problem will always be queries where the compilation is orders of magnitude more expensive than the query itself. I can imagine an indexed lookup of one or a few entries, etc. Accessing indexed entries like these is very well optimized by SQL query engines, and it possibly makes no sense to JIT-optimize it.
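The trade-off can be written down as a simple break-even check. This is only a sketch: `jit_pays_off` is a hypothetical helper and all timings below are made-up numbers chosen for illustration, not measurements from any engine.

```python
def jit_pays_off(interp_ms: float, compile_ms: float, jit_exec_ms: float) -> bool:
    """True when compile time plus JIT execution beats plain interpretation."""
    return compile_ms + jit_exec_ms < interp_ms


# Hypothetical numbers, chosen only for illustration.
point_lookup = jit_pays_off(interp_ms=0.05, compile_ms=1.0, jit_exec_ms=0.02)
big_scan = jit_pays_off(interp_ms=900.0, compile_ms=1.0, jit_exec_ms=400.0)
print(point_lookup, big_scan)
```

For the point lookup the compile step alone costs more than interpreting the whole query, so no speedup of the generated code can win it back.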

Asm2D commented on Better JIT for Postgres github.com/vladich/pg_jit... · Posted by u/vladich

sourcegrift · 12 days ago

We have everything optimized, and yet somehow DB queries need to be "interpreted" at runtime. There's no reason for DB queries to not be precompiled.

Asm2D · 12 days ago

Many SQL engines have JIT compilers.

The problems related to PostgreSQL are pretty much all described here. It's very difficult to do low-latency queries if you cannot cache the compiled code and have to compile it over and over again. And once your JIT is slow you need logic to decide whether to interpret or compile.

I think it would be the best to start interpreting the query and start compilation in another thread, and once the compilation is finished and the interpreter is still running, stop the interpreter and run the JIT-compiled code. This would give you the best latency, because there would be no waiting for the JIT compiler.
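A minimal sketch of that idea, assuming a row-at-a-time executor: the interpreter handles rows until a background thread has finished compiling, and every row after that goes through the compiled function. `run_tiered`, `interpret_row` and `compile_fn` are hypothetical names, and the "compiler" here just returns a Python callable.

```python
import threading
from typing import Callable, Iterable


def run_tiered(rows: Iterable[int],
               interpret_row: Callable[[int], int],
               compile_fn: Callable[[], Callable[[int], int]]) -> tuple[list[int], int]:
    """Interpret rows until the background compile finishes, then switch.

    Returns (results, number_of_rows_handled_by_compiled_code).
    """
    compiled: list[Callable[[int], int]] = []  # filled by the worker thread
    ready = threading.Event()

    def worker() -> None:
        compiled.append(compile_fn())
        ready.set()  # publish the compiled function to the main loop

    thread = threading.Thread(target=worker)
    thread.start()

    results: list[int] = []
    fast_rows = 0
    for row in rows:
        if ready.is_set():
            results.append(compiled[0](row))    # compiled tier
            fast_rows += 1
        else:
            results.append(interpret_row(row))  # interpreter tier
    thread.join()
    return results, fast_rows


results, fast_rows = run_tiered(range(5), lambda r: r * 2, lambda: (lambda r: r * 2))
print(results, fast_rows)
```

How many rows end up in the compiled tier depends on timing, but the results are the same either way, and the query never waits for the compiler.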

Asm2D commented on Vector graphics on GPU gasiulis.name/vector-grap... · Posted by u/gsf_emergency_6

hermet · 2 months ago

Blend2D is a CPU-only rendering engine, so I don't think it's a fair comparison to ThorVG. If we're talking about CPU rendering, ThorVG is faster than Skia (no idea about Blend2D). But at high resolutions, CPU rendering has serious limitations anyway. Blend2D is still more of an experimental project where JIT kills compatibility, and Vello is not yet production-ready and is WebGPU-only. There's no point arguing about what is fast today if it's not usable in real-world scenarios.

Asm2D · 2 months ago

How does JIT kill compatibility if it's only enabled on x86 and AArch64? You can compile Blend2D without it and it would just work.

So no, it doesn't kill any compatibility - it only shows a different approach.

BTW GPU-only renderers suck, and many renderers that have GPU and CPU engines suck when the GPU is not available or has bugs. Strong CPU rendering performance is just necessary for any kind of library if you want true compatibility across various platforms.

I have seen broken rendering on GPU many, many times without any ability to switch to CPU. And the biggest problem is that the more exotic the HW you run it on, the lower the chance that somebody would be able to fix it (talking about GPUs).

Asm2D commented on Vector graphics on GPU gasiulis.name/vector-grap... · Posted by u/gsf_emergency_6

nicoburns · 2 months ago

It's not a formal benchmark, but my Browser Engine / Webview (https://github.com/DioxusLabs/blitz/) has pluggable rendering backends (via https://github.com/DioxusLabs/anyrender) with Vello (GPU), Vello CPU, Skia (various backends incl. Vulkan, Metal, OpenGL, and CPU) currently implemented

On my Apple M1 Pro, the Vello CPU renderer is competitive with the GPU renderers on simple scenes, but falls behind on more complex ones. And especially seems to struggle with large raster images. This is also without a glyph cache (so re-rasterizing every glyph every time, although there is a hinting cache) which isn't implemented yet. This is dependent on multi-threading being enabled and can consume largish portions of all-core CPU while it runs. Skia raster (CPU) gets similarish numbers, which is quite impressive if that is single-threaded.

Asm2D · 2 months ago

I think Vello CPU would always struggle with raster images, because it does a bounds check for every pixel fetched from a source image. They have at least described this behavior somewhere in Vello PRs.

The obsession with memory safety just doesn't pay off in some cases - if you can batch 64 pixels at once with SIMD it just cannot be compared to a per-pixel processor that has a branch in its path.
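The structural difference can be sketched as follows. This is not Vello's or Blend2D's code, just an illustration of checking bounds per pixel versus clamping a whole span once and copying the valid run in one step. The helper names are hypothetical, and Python itself still bounds-checks internally, so only the shape of the loops is meaningful here.

```python
def fetch_checked(src: list[int], start: int, count: int, fallback: int = 0) -> list[int]:
    """Per-pixel variant: one bounds check (a branch) for every fetched pixel."""
    out = []
    for i in range(start, start + count):
        out.append(src[i] if 0 <= i < len(src) else fallback)
    return out


def fetch_batched(src: list[int], start: int, count: int, fallback: int = 0) -> list[int]:
    """Batched variant: clamp the span once, then copy the valid run in one go.

    Assumes count >= 0.
    """
    left = min(max(-start, 0), count)           # pixels before the image
    lo = min(max(start, 0), len(src))           # first valid index
    hi = min(max(start + count, 0), len(src))   # one past the last valid index
    right = count - left - (hi - lo)            # pixels after the image
    return [fallback] * left + src[lo:hi] + [fallback] * right


row = [10, 20, 30, 40]
print(fetch_batched(row, -2, 5))
```

Both variants return the same pixels; the batched one makes its bounds decisions once per span, which is what lets the inner copy run without a branch per pixel.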

Asm2D commented on Vector graphics on GPU gasiulis.name/vector-grap... · Posted by u/gsf_emergency_6

coffeeaddict1 · 2 months ago

> There's also the issue of just how many billions of line segments you really need to draw every 1/120th of a second at 8K resolution

IMO, one of the biggest benefits of a high performance renderer would be power savings (very important for laptops and phones). If I can run the same work but use half the power, then by all means I'd be happy to deal with the complications that the GPU brings. AFAIK though, no one really cares about that and even efforts like Vello are just targeting fps gains, which do correlate with reduced power consumption but only indirectly.

Asm2D · 2 months ago

Adding a power draw into the mix is pretty interesting. Just because a GPU can render something 2x faster in a particular test doesn't mean you have consumed 50% less power, especially when we talk about dedicated GPUs that can have power draw in hundreds of watts.
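As a worked example of why speed and energy are different things: energy is average power multiplied by time. The wattages and frame times below are hypothetical numbers picked only to illustrate the arithmetic, not measurements of any real CPU or GPU.

```python
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy = average power * time."""
    return power_watts * seconds


# Hypothetical numbers: a CPU core at 5 W taking 10 ms per frame,
# and a dedicated GPU at 150 W finishing the same frame in 5 ms.
cpu = energy_joules(power_watts=5.0, seconds=0.010)
gpu = energy_joules(power_watts=150.0, seconds=0.005)
print(cpu, gpu, gpu > cpu)
```

With these made-up figures the GPU is 2x faster per frame and still spends more energy on it.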

Historically 2D rendering on CPU was pretty much single-threaded. Skia is single-threaded, Cairo too, Qt mostly (they offload gradient rendering to threads, but it's painfully slow for small gradients, worse than single-threaded), AGG is single-threaded, etc...

In the end only Blend2D, Blaze, and now Vello can use multiple threads on CPU, so finally CPU vs GPU comparisons can be made more fairly - and power draw is definitely a nice property of a benchmark. BTW Blend2D was probably the first library to offer multi-threaded rendering on CPU (just an option to pass to the rendering context, same API).

As far as I know, nobody has done a good benchmark comparing CPU and GPU 2D renderers - it's very hard to do a completely unbiased comparison, and you would be surprised how good the CPU is in this mix. Modern CPU cores consume maybe a few watts and you can render to a 4K framebuffer with that single CPU core. Put text rendering into the mix and the numbers would start to be very interesting. Also GPU memory allocation should be included, because rendering fonts on GPU means pre-processing them as well, etc.

2D is just very hard; on CPU and GPU you would be solving slightly different problems, but doing it right is an insane amount of work, research, and experimentation.