A 94x speed improvement demonstrated using handwritten assembly

mort96 · a year ago

I saw this. The problem is that it's not comparing "hand-written assembly" to "not hand-writing assembly", it's comparing scalar code with vector code. If you wrote C code using AVX intrinsics, you'd get similar speed-ups without hand-writing any assembly.

And the annoying part is that there are good reasons to write hand-written assembly in some particular cases. Video decoders contain a lot of assembly for a reason: you often have extremely tight loops where 1) every instruction matters, 2) the assembly is relatively straightforward, and 3) you want dependable performance which doesn't change across compilers/compiler versions/compiler settings. But those reasons don't let you make ridiculous claims like "94x improvement from hand-written assembly compared to C".

nicce · a year ago

> And the annoying part is that there are good reasons to write hand-written assembly in some particular cases.

The major, and well-known problem is that hand-written assembly is usually 100% non-portable. Maybe it is okay if you need the boost only in few platforms. But that still requires few different implementations.

mort96 · a year ago

Yeah but that's kind of a given here, C with AVX intrinsics or C with NEON intrinsics or C with SVE intrinsics is also 100% non-portable.

cjensen · a year ago

To clarify, it's 94x the performance of the naive C implementation in just one type of filter. On the same filter, the table posted to twitter shows SSSE3 at 40x and AVX2 at 67x. So maybe just a case where most users were using AVX/SSE and there was no reason to optimize the C.

But this is just one feature of FFmpeg. Usually the heaviest CPU user is encode and decode, which is not affected by this improvement.

It's interesting and good work, but the "94x" statement is misleading.

jsheard · a year ago

According to someone in the dupe thread, the C implementation is not just naive with no use of vector intrinsics, it also uses a more expensive filter algorithm than the assembly versions, and it was compiled with optimizations disabled in the benchmark showing a 94x improvement:

https://news.ycombinator.com/item?id=42042706

Talk about stacking the deck to make a point. Finely tuned assembly may well beat properly optimized C by a hair, but there's no way you're getting a two orders of magnitude difference unless your C implementation is extremely far from properly optimized.

evoke4908 · a year ago

If and only if someone has spent the time to write optimizations for your specific platform.

GCC for AVR is absolutely abysmal. It has essentially no optimizations and almost always emits assembly that is tens of times slower than handwritten assembly.

For just a taste of the insanity, how would you walk through a byte array in assembly? You'd load a pointer to a register, load the value at that pointer, then increment the pointer. AVR devices can load and post-increment as a single instruction. This is not even remotely what GCC does. GCC will load your pointer into a register, then for each iteration it adds the index to the pointer, loads the value with the most expensive instruction possible, then subtracts the index from the pointer.

In assembly, the correct AVR method takes two cycles per iteration. The GCC method takes seven or eight.

For every iteration in every loop. If you use an int instead of a byte for your index, you've added two to four more cycles to each loop. (For 8 bit architectures obviously)

I've just spent the last three weeks carefully optimizing assembly for a ~40x overall improvement. I have a *lot* to say about GCC right now.

zbobet2012 · a year ago

You've mis-understood. The 8tap filter is part of the HEVC encode loop and is used for sub pixel motion estimation. This is likely an improvement in encoding performance, but it's only in one specific coding tool.

chrisco255 · a year ago

FFMPEG uses hand written assembly across their code base, and while 94x may not be representative everywhere, it's generally true that they regularly outperform the compiler with assembly:

https://x.com/FFmpeg/status/1852913590258618852 https://x.com/FFmpeg/status/1850475265455251704

refulgentis · a year ago

^ this, took my biggest step entry into programming via learning how to get my videos on an iPod Video in 2004, that eventually required compiling ffmpeg and keeping up with it.

I'll bet money, sight unseen, that poster above is right its used for HEVC. I'll bet even more money its not some massive out of nowhere win, hand-writing assembly for popular codecs was de rigeur for ffmpeg. Thrust of the article, or at least the headline, is clickbait-y.

Deleted Comment

Temporary_31337 · a year ago

How does this compare to the naive GPU performance on an RTX GPU?

astrange · a year ago

If the entire program was on the GPU, that specific part would be extremely fast.

But the rest of it is on the CPU because GPU cores aren't any good at largely serial things like video decoding. So it doesn't matter.

H8crilA · a year ago

The GPU is a lot slower than the CPU - per core/thread. Otherwise you would need to select different core/thread counts on both sides, or set something like a power limit in watts. When it comes to watts the process node (nanometers) will largely determine the outcome.

Deleted Comment

jsheard · a year ago

This is a slightly misleading comparison, because they're comparing naive scalar C code to hand-vectorized assembly, skipping over hand-vectorized C with vendor intrinsics. Assembly of course always wins if enough effort is put into tuning it, but intrinsics can usually get extremely close with much less effort.

For reasons unknown to me the FFmpeg team has a weird vendetta against intrinsics, they require all platform-specific code to be written in assembly even if C with intrinsics would perform exactly the same. It goes without saying that assembly will be faster if you arbitrarily forbid using the fastest C constructs.

unlord · a year ago

It is actually quite a bit more misleading. I was not able to reproduce these numbers on Zen2 hardware, see https://people.videolan.org/~unlord/dav1d_6tap.png. I spoke with the slide author and he confirmed he was using an -O0 debug build of the checkasm binary.

What's more, the C code is running an 8-tap filter where the SIMD for that function (in all of SSSE3, AVX2 and AVX512) is implemented as 6-tap. Last week I posted MR !1745 (https://code.videolan.org/videolan/dav1d/-/merge_requests/17...) which adds 6-tap to the C code and brings improved performance to all platforms dav1d supports.

This, of course, also closes the gap in these numbers but is a more accurate representation of the speed-up from hand-written assembly.

zbobet2012 · a year ago

The thing I found interesting in the AVX512 gains over AVX2. That's a pretty nice gain from the wider instruction set which has often been ignored in the video community.

Const-me · a year ago

> FFmpeg team has a weird vendetta against intrinsics

To be fair, ffmpeg is really old software. Wikipedia says they released their initial version in the end of 2000. The software landscape was very different.

Back then, there were multiple competing CPU architectures. In modern world we only have two mainstream ones, AMD64 and ARM64, two legacy ones in the process of being phased out, x86 and 32-bit ARM, and very few people care about any other CPUs.

Another thing, C compilers of 2000 weren’t good in terms of performance of the generated code. Clang only arrived in 2007. In 2000, neither GCC nor VC++ had particularly good optimizers or code generators.

Back in 2000, it was reasonable to use assembly for performance-critical code like that. It just they never questioned that decision later, despite they should have done that many years ago.

astrange · a year ago

FFmpeg developers aren't stupid and aren't doing things because of historical accident. If you don't mind writing platform specific code, it simply works better this way.

Here's clang messing up x86 intrinsics code.

https://x.com/ffmpeg/status/1852913590258618852

The other reason to do it is that, since x86 intrinsics are named in Hungarian notation, they're so hard to read that the asm is actually more maintainable.

lsllc · a year ago

The intrinsics aren't portable across architectures, so using them vs inline asm (or linking with a .s/.asm file) is more about convenience/ease of use than portability (although they might be slightly more portable across different OSes for the _same_ architecture, e.g. macOS/linux for aarch64).

In some ways, I prefer the "go big or go home" of asm either inline, or in a .s/.asm file, although both inline and .s/.asm have portability issues (e.g. inline asm syntax across C/C++ compilers or the flavor of the .s/.asm files depending on your assembler).

jsheard · a year ago

Intrinsics do have certain advantages besides convenience, since the compiler can reason about them in ways that it can't with an opaque blob of asssembly. For example if you set the compiler to tune for a specific CPU architecture then it can re-order intrinsics according to the instruction costs of that specific hardware, whereas with assembly you would have to manually write a separate implementation for each architecture (and AFAICT FFmpeg doesn't go that far).

flohofwoe · a year ago

> but intrinsics can usually get extremely close with much less effort

Why 'much less effort' though? Intrinsics are on the same abstraction level as an equivalent sequence of assembly instructions aren't they? And they only target one specific ISA anyway and are not portable between CPU architectures, so the difference between coding in intrinsics and assembly doesn't seem all that big. Also I wonder if MSVC and GCC/Clang intrinsics are fully compatible to each other, compiler compatibility might be another reason to use assembly.

jsheard · a year ago

Instrinsics more or less map to specific assembly instructions, but the instruction scheduling and register allocation is still handled by the compiler, which you have to do by hand in raw assembly. Possibly multiple times, since the optimal instruction scheduling is hardware specific. Intrinsics can also be automatically inlined, constant folded, etc, which you don't get with assembly.

Also Intel and ARM themselves specify the C intrinsics for their architectures so they're the same across MSVC, GCC and Clang, it's not like the wild west of other compiler extensions.

brigade · a year ago

Intrinsics are compatible across compilers and OSs within the same architecture, and are also mostly compatible between 32 and 64 bit variants of the same architecture. With asm, you have to handle the 3 different calling conventions between x86, x86-64, and win64, and also write two completely separate implementations for arm32 and arm64, instead of just one x86 intrinsic and one NEON intrinsic version. Sure ffmpeg tries to automatically handle the different x86 calling conventions with standard macros, but there's still some %if WIN64 scattered around, and 32-bit x86's register constraints means larger functions are littered with %if ARCH_X86_64.

Which brings us to the most "more effort" of assembly - no variables or inline functions, only registers and macros. Which is okay for self-contained functions a hundred lines or so, but less so with heavily templated multi-thousand line files of deeply intertwined macros nested 4 levels deep. Being able to write an inlined function that has no side effects felt a thousand lines away reduces mental effort by a lot, as does not having to redo register allocation across a thousand lines because you now need another temporary register for a short section of code, or even think about it much in the first place to still get near-optimal performance on modern CPUs.

account42 · a year ago

> This is a slightly misleading comparison, because they're comparing naive scalar C code to hand-vectorized assembly, skipping over hand-vectorized C with vendor intrinsics. Assembly of course always wins if enough effort is put into tuning it, but intrinsics can usually get extremely close with much less effort.

It's a perfectly valid comparision between straightforward C and the best hand optimization you can get.

> For reasons unknown to me the FFmpeg team has a weird vendetta against intrinsics, they require all platform-specific code to be written in assembly even if C with intrinsics would perform exactly the same.

Wanting to standardize on a single language for low level code is absolutely reasonable. This way contributors only need to know standard C as well as assembly instead of standar C, assembly and also intel intrisics which are a fusion of the two but also not the same as either and have their own gotchas.

mort96 · a year ago

I mean currently, contributors who want to work on vectorized algorithms need to know C, amd64 assembly with SSE/AVX instructions, aarch64 assembly with NEON instructions and aarch64 assembly with SVE instructions (and presumably soon, or maybe already, risc-v assembly with the vector extension). I wouldn't say it's simpler than needing to know C, C with SSE/AVX instricsics, C with NEON intrinsics, C with SVE intrinsics and C with risc-v vector intrinsics.

snvzz · a year ago

Relevant: https://xcancel.com/FFmpeg/status/1852847683666641124

And my take on intrinsics, they add, not remove, complexity. For no gain.

Awful, relative to writing (or reading) asm.

cornstalks · a year ago

For the curious, this is fresh from VDD24[1]. The 94x is mildly contested since IIRC how you write the C code really matters (e.g. scalar vs autovectorized). But even with vectorized C code the speed up was something like >10x, which is still great but not as fun to say.

One clarification: this is an optimization in dav1d, not FFmpeg. FFmpeg uses dav1d, so it can take advantage of this, but so can other non-FFmpeg programs that use dav1d. If you do any video handling consider adding dav1d to your arsenal!

There’s currently a call for RISC-V (64-bit) optimizations in dav1d. If you want to dip your toes in RISC-V optimizations and assembly this is a great opportunity. We need more!

[1]: https://www.videolan.org/videolan/events/vdd24/

tiffanyh · a year ago

Similarly, Mike Pall of LuaJIT fame wrote (13-years ago) about the benefits of handwriting assembly.

http://lua-users.org/lists/lua-l/2011-02/msg00742.html

janwas · a year ago

This advice seems specific to an interpreter with "hundreds of slow paths" :)

bhouston · a year ago

These were on a micro-benchmark, a single function got 94x faster compared to raw C.

This is similar to this other improvement that was also on a micro-benchmark: https://news.ycombinator.com/item?id=42007695

htk · a year ago

"...Intel disabled AVX-512 for its Core 12th, 13th, and 14th Generations of Core processors, leaving owners of these CPUs without them."

Well known to most but news to me, I've tried to find out the reason why but couldn't come up with a definitive answer.

mort96 · a year ago

They introduced efficiency cores, and those don't have AVX-512. Lots of software breaks if it suddenly gets moved to a core which supports different instructions, so OSes wouldn't be able to move processes between E-cores and P-cores if P-cores supported AVX-512 while E-cores didn't.

PhilipRoman · a year ago

As long as the vast majority of processes don't use AVX-512, you could probably catch sigill or whatever in kernel and transparently move to a P-core, marking the task to avoid rescheduling on an E-core it again in near future. Probably not very efficient, but tasks which use AVX are usually something you want to run on a P-core anyway.

account42 · a year ago

Wouldn't it have been more reasonable to emulate AVX-512 on the E-cores?

jsheard · a year ago

The reason is that those CPUs have two types of cores, performance and efficiency, and only the former supports AVX512. Early on you could actually get access to AVX512 if you disabled the efficiency cores, but they put a stop to that in later silicon revisions, IIRC with the justification that AVX512 didn't go through proper validation on those chips since it wasn't supposed to be used.

hnuser123456 · a year ago

Probably reduce wasted silicon because very few consumers will see a significant benefit in everyday computing tasks. Also supposedly they had issues combining it with e-cores. Intel is struggling to get their margins back up. The 11th gen had AVX512 but the people who cared seem to be PS3 emulator users.

CyberDildonics · a year ago

A 94x performance boost will not come from just writing some assembly instructions. It will come from changing terrible memory access patterns to be optimal, not allocating memory in a hot loop and using SIMD instructions. At 94x there could be some algorithmic changes too, like not doing redundant calculations on pixels that can have their order reversed and applied to other pixels.