I can’t imagine the scale that FFmpeg operates at. A small improvement has to be thousands and thousands of hours of compute saved. Insanely useful project.
There's tons of backlash here as if people think better performance requires writing in assembly.
But to anyone complaining, I want to know: when was the last time you pulled out a profiler? When was the last time you saw anyone use a profiler?
People asking for performance aren't pissed you didn't write Microsoft Word in assembly; we're pissed it takes 10 seconds to open a fucking text editor.
I literally timed it on my M2 Air. 8s to open and another 1s to get a blank document. Meanwhile it took (neo)vim 0.1s and it's so fast I can't click my stopwatch fast enough to properly time it. And I'm not going to bother checking because the race isn't even close.
I'm (we're) not pissed that the code isn't optimal, I'm pissed because it's slower than dialup. So take that Knuth quote you love about optimization and do what he actually suggested: grab a fucking profiler. It is more important than your Big O.
That would be an enormous waste of time. 99.9% of software doesn't have to be anywhere near optimal. It just has to not be wasteful.
Sadly lots of software is blatantly wasteful. But it doesn't take fancy assembly micro optimization to fix it, the problem is typically much higher level than that. It's more like serialized network requests, unnecessarily high time complexities, just lots of unnecessary work and unnecessary waiting.
Once you have that stuff solved you can start looking at lower level optimization, but by that point most apps are already nice and snappy so there's no reason to optimize further.
Seems so easy! You only need the entire world even tangentially related to video to rely solely on your project for a task and you too can have all the developers you need to work on performance!
As an industry we are too bad at correctness to even begin to worry about performance. Looking at FFmpeg (who are a pretty good project that I don't want to pick on too much) I see their most recent patch release fixes 3 CVEs from 2023 plus one from this year, and that's just security vulnerabilities, never mind regular bugs. Imagine if half the effort that people put into making it fast went into making it right.
They run on machines that are designed to have unused compute capacity: desktops, laptops, tablets, phones. Responsiveness is the key.
Ffmpeg runs on a load of server farms for all your favourite streaming services and the bajillion more you’ve never heard of. Saving compute there saves machines and power.
Your point is well taken but there is a distinction that matters some.
It'd be nice, though, to have a proper API (in the traditional sense, not SaaS) instead of having to figure out these command lines in what's practically its own programming language...
FFmpeg does have an API. It ships a few libraries (libavcodec, libavformat, and others) which expose a C API that is used by the ffmpeg command-line tool.
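For a taste of what that API looks like, here is a minimal sketch (error handling trimmed; build against libavformat, e.g. with `pkg-config --cflags --libs libavformat`) that probes a file and prints the same stream summary the CLI shows on startup:

```c
#include <libavformat/avformat.h>

int main(int argc, char **argv) {
    AVFormatContext *fmt = NULL;
    if (argc < 2)
        return 1;
    /* Open the container and read enough of it to identify the streams. */
    if (avformat_open_input(&fmt, argv[1], NULL, NULL) < 0)
        return 1;
    if (avformat_find_stream_info(fmt, NULL) < 0) {
        avformat_close_input(&fmt);
        return 1;
    }
    /* Print the stream summary, like the ffmpeg CLI does on startup. */
    av_dump_format(fmt, 0, argv[1], 0);
    avformat_close_input(&fmt);
    return 0;
}
```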
I get why the CLI is so complicated, but I will say AI has been great at figuring out what I need to run given an English language input. It's been one of the highest value uses of AI for me.
I would be interested in more examples where "assembly is faster than intrinsics". I.e., when the compiler screws up. I generally write Zig code with the expectation of a specific sequence of instructions being emitted, and I usually get it via the high level wrappers in std.simd + a few llvm intrinsics. If those fail I'll use inline assembly to force a particular instruction. On extremely rare occasions I'll rely on auto-vectorization, if it's good and I want it to fall back on scalar on less sophisticated CPU targets (although sometimes it's the compiler that lacks sophistication). Aside from the glaring holes in the VPTERNLOG finder, I feel that instruction selection is generally good enough that I can get whatever I want.
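To make the VPTERNLOG point concrete, here is a hedged sketch in C of forcing the fused form with the AVX-512 intrinsic rather than hoping the compiler spots it (assumes an AVX-512F target; 0xCA is the truth-table immediate for a bitwise select):

```c
#include <immintrin.h>

/* (mask & b) | (~mask & c): compilers often emit three separate bitwise
   ops here instead of fusing them into a single vpternlogd, so the
   intrinsic is used to pin down the instruction. Build with -mavx512f. */
__m512i bit_select(__m512i mask, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi32(mask, b, c, 0xCA);
}
```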
The bigger issue is instruction ordering and register allocation. On code where the compiler effectively has to lower serially-dependent small snippets independently, I think the compiler does a great job. However, when it comes to massive amounts of open code I'm shocked at how silly the decisions are that the compiler makes. I see super trivial optimizations available at a glance. Things like spilling x and y to memory, just so it can read them both in to do an AND, and spill it again. Constant re-use is unfortunately super easy to break: Often just changing the type in the IR makes it look different to the compiler. It also seems unable to merge partially poisoned (undefined) constants with other constants that are the same in all the defined portions. Even when you write the code in such a way where you use the same constant twice to get around the issue, it will give you two separate constants instead.
I hope we can fix these sorts of things in compilers. This is just my experience. Let me know if I left anything out.
> There are two flavours of x86 assembly syntax that you’ll see online: AT&T and Intel. AT&T syntax is older and harder to read compared to Intel syntax. So we will use Intel syntax.
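For anyone who hasn't seen the two side by side, a rough sketch of the difference (GCC/Clang inline asm expects AT&T by default, while NASM and FFmpeg's own code use Intel):

```c
/* Intel syntax (NASM, FFmpeg's x86inc.asm): destination first, bare
 * register names, plain immediates:          add rax, 42
 * AT&T syntax (GNU as default): source first, "%" register prefix,
 * "$" immediates, operand-size suffix:       addq $42, %rax
 */
static inline long add42(long x) {
    __asm__("addq $42, %0" : "+r"(x)); /* AT&T, the GCC/Clang default */
    return x;
}
```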
So the main issues here are not what people think they are. They generally aren't "suboptimal assembly", at least not relative to what you can reasonably expect out of a C compiler.
The factors are something like:
- specialization: there's already a decent plain-C implementation of the loop, asm/SIMD versions are added on for specific hardware platforms. And different platforms have different SIMD features, so it's hard to generalize them.
- predictability: users have different compiler versions, so even if there is a good one out there not everyone is going to use it.
- optimization difficulties: C's memory model specifically makes optimization difficult here because video is `char *` and `char *` aliases everything (see the sketch after this list). Also, the two kinds of features compilers add for this (intrinsics and autovectorization) can fight each other and make things worse than nothing.
- taste: you could imagine a better portable language for writing SIMD in, but C isn't it. And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything. The assembly is /more/ readable than C would be because it'd all be function calls with names like `_mm_movemask_epi8`.
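To illustrate the aliasing point from the list above, a hypothetical byte kernel where the `restrict` qualifiers are what make autovectorization legal; without them the compiler has to assume `dst` and `src` overlap:

```c
#include <stddef.h>

/* Saturating add of one byte plane into another, the shape of a typical
   video kernel. Because unsigned char pointers may alias anything, the
   compiler cannot prove dst and src are disjoint on its own; restrict
   supplies that promise and unlocks vectorization. */
void add_sat_u8(unsigned char *restrict dst,
                const unsigned char *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int v = dst[i] + src[i];
        dst[i] = v > 255 ? 255 : v;
    }
}
```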
One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.

[0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.
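The struct half of that story deserves a sketch too. A hypothetical before/after (not the actual libtheora structs): keep the hot, co-accessed fields together so they share a cache line instead of interleaving them with cold bookkeeping:

```c
/* Before: hot fields scattered between cold ones, so every access
   drags cold bytes into cache alongside the data you want. */
struct plane_bad {
    char          *debug_name;  /* cold */
    int            width;       /* hot */
    void          *alloc_ctx;   /* cold */
    int            stride;      /* hot */
    unsigned char *data;        /* hot */
};

/* After: hot fields packed together at the front. */
struct plane_good {
    int            width, stride;
    unsigned char *data;
    char          *debug_name;
    void          *alloc_ctx;
};
```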
> And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything.
Wouldn't Intel be the one defining the intrinsics? They're referenced from the ISA manuals, and the Intel Intrinsics Guide regularly references intrinsics like _allow_cpu_features() that are only supported by the Intel compiler and aren't implemented in MSVC.
Normally you spin up a tool like VTune or uProf to analyze your benchmark hotspots at the ISA level. No idea about tools like that for ARM.
> Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
IME, not really. I've done a fair bit of hand-written assembly and it exclusively comes up when dealing with architecture-specific problems - for everything else you can just write C (unless you hit one of the rare edge cases where C semantics don't let you express what you need).
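One example of such an edge case: plain C has no way to carry the carry flag between additions, which is why multi-word arithmetic tends to need builtins or asm. A sketch using the GCC/Clang overflow builtin for a 128-bit add:

```c
#include <stdint.h>

/* 128-bit add from 64-bit halves. The carry out of the low word is
   exactly what standard C cannot express directly; the GCC/Clang
   __builtin_add_overflow exposes it without dropping to asm. */
static void add128(uint64_t a_hi, uint64_t a_lo,
                   uint64_t b_hi, uint64_t b_lo,
                   uint64_t *r_hi, uint64_t *r_lo) {
    uint64_t carry = __builtin_add_overflow(a_lo, b_lo, r_lo);
    *r_hi = a_hi + b_hi + carry;
}
```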
To expand on that: C and C++ compilers are really, really good at writing optimized code in general. Where they tend to be worse is vectorized code, which requires you to redesign algorithms so that they can use fast vector instructions; even then you'll have to resort to compiler intrinsics to use the instructions at all, and even then the intrinsics can lead to some bad codegen. So your code winds up being non-portable, looks like assembly, and has some overhead just because of what the compiler emits (and can't optimize). So you wind up just writing it in asm anyway, and get smarter about things the compiler worries about, like register allocation and instruction scheduling.
But the real problem once you get into this domain is that you simply cannot tell at a glance whether hand-written assembly is "better" (insert your metric for "better" here) than what the compiler emits. You must measure and benchmark, and those benchmarks have to be meaningful.
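A minimal shape for that measurement, as a hedged sketch (real benchmarking also wants warmup, repeated runs, and a guard against the compiler optimizing the kernel away):

```c
#include <stdio.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Time `iters` calls of a kernel and report nanoseconds per call. */
void bench(void (*kernel)(void), long iters) {
    double t0 = now_ns();
    for (long i = 0; i < iters; i++)
        kernel();
    printf("%.1f ns/call\n", (now_ns() - t0) / iters);
}
```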
> Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
Not really. There are a couple of reasons to reach for handwritten assembly, and in every case, IR is just not the right choice:
If your goal is to ensure vector code, your first choice is to try slapping explicit vectorize-me pragmas onto the loop. If that fails, your next step is either generic or arch-specific vector intrinsics (or jumping to something like ISPC, a language for writing SIMT-like vector code). You don't really gain anything in this use case from dropping to IR, since the intrinsics will get you the code you need.
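The pragma spelling is compiler-specific; a sketch using the OpenMP SIMD form (GCC and Clang also accept `#pragma GCC ivdep` and `#pragma clang loop vectorize(enable)` respectively):

```c
/* Compile with -fopenmp-simd (GCC/Clang). The pragma tells the compiler
   to vectorize the loop even where it cannot prove that is safe. */
void scale(float *dst, const float *src, float k, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```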
If your goal is to work around compiler suboptimality in register allocation or instruction selection... well, trying to write it in IR gives the compiler a very high likelihood of simply recanonicalizing the exact sequence you wrote to the same sequence the original code would have produced for no actual difference in code. Compiler IR doesn't add anything to the code; it just creates an extra layer that uses an unstable and harder-to-use interface for writing code. To produce the best handwritten version of assembly in these cases, you have to go straight to writing the assembly you wanted anyways.
Loop vectorization doesn't work for ffmpeg's needs because the kernels are too small and specialized. It works better for scientific/numeric computing.
You could invent a DSL for writing the kernels in… but they did, it's x86inc.asm. I agree ispc is close to something that could work.
Why not include the required or targeted math lessons needed for the FFmpeg Assembly Lessons in the GitHub repository? It'd be easier for people to get started if everything was in one place :)
NTA, but if the assumption is that the reader has only a basic understanding of C programming and wants to contribute to a video codec, there is a lot of ground to cover just to get to how the Cooley-Tukey algorithm works, and even that's just the fundamentals.
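To give a sense of the scale involved: even a bare textbook radix-2 Cooley-Tukey, sketched here purely for illustration (this is not FFmpeg's implementation; production codec transforms are fixed-size and heavily specialized), already assumes comfort with recursion and complex arithmetic:

```c
#include <complex.h>
#include <math.h>

/* Textbook radix-2 decimation-in-time FFT, out-of-place; n must be a
   power of two. Result is left in x; tmp is scratch of the same size. */
static void fft(complex double *x, complex double *tmp, int n) {
    if (n == 1)
        return;
    int h = n / 2;
    for (int i = 0; i < h; i++) {    /* split into even/odd samples */
        tmp[i]     = x[2 * i];
        tmp[i + h] = x[2 * i + 1];
    }
    fft(tmp, x, h);                  /* transform each half */
    fft(tmp + h, x, h);
    const double pi = acos(-1.0);
    for (int k = 0; k < h; k++) {    /* combine with twiddle factors */
        complex double w = cexp(-2 * I * pi * k / n) * tmp[k + h];
        x[k]     = tmp[k] + w;
        x[k + h] = tmp[k] - w;
    }
}
```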
I read the repo more as "go through this if you want to have a greater understanding of how things work on a lower level inside your computer". In other words, presumably it's not only intended for people who want to contribute to a video codec/other parts of ffmpeg. But I'm also NTA, so could be wrong.
Imagine all projects were similarly committed.
How many projects would have anything to benefit from this focus on optimization, though?
There is a reason why the first rule of optimization is "don't do it", and the second (experts only) is "don't do it yet".
They publish doxygen generated documentation for the APIs, available here: https://ffmpeg.org/doxygen/trunk/
God bless you, ffmpeg.
The few chapters I saw seemed to be pretty generic intro to assembly language type stuff.
Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
perf is included with the Linux kernel, and works on a fair number of architectures (including Arm).
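For reference, the usual flow looks something like this (standard perf subcommands; exact flags and output vary by version, and `my_hot_function` is just a placeholder for whatever symbol shows up hot in the report):

```sh
perf record -g ./mybenchmark    # sample the benchmark with call graphs
perf report                     # rank hotspots by function
perf annotate my_hot_function   # per-instruction view at the ISA level
```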