While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major
issues:
* It does not take advantage of distributed build systems.
* It has scalability issues. To rewrite a binary with a ~300M text segment:
  * Memory footprint is 70G.
  * It takes more than 10 minutes to rewrite the binary.
Similar to Full LTO, BOLT’s design is monolithic: it disassembles the original binary, optimizes it, and rewrites the final binary in one process. This limits BOLT's scalability, and the memory and time overheads shoot up quickly for large binaries.
It takes BOLT less than 10 seconds to optimize the Linux kernel once the profile is collected. There's no need for a distributed build system to take advantage of that. Overall, we have improved both processing time and memory consumption over the years.
I seem to remember that glandium had done something like this on the Firefox binary to optimize the startup, with impressive speedup. Can't find the details, though.
If some of the optimization gain is from bad alignment of potentially fused operations that can't fuse because they straddle a 64-byte cache boundary, it feels like this is something the compiler should be aware of and mitigating.
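To make the "straddle a 64-byte boundary" part concrete, here's a toy sketch; the addresses and instruction lengths are made up, and the actual fusion rules depend on the microarchitecture:

```python
# Sketch: check whether an instruction pair (e.g. a cmp/jcc the CPU could
# macro-fuse) crosses a 64-byte cache-line / fetch boundary.
# Addresses and lengths below are invented for illustration.

LINE_SIZE = 64  # bytes per cache line on typical x86 cores

def straddles_boundary(addr: int, length: int, line_size: int = LINE_SIZE) -> bool:
    """True if the byte range [addr, addr+length) crosses a line boundary."""
    return (addr // line_size) != ((addr + length - 1) // line_size)

# A 3-byte cmp at 0x1003d followed by a 2-byte jcc at 0x10040: the pair spans
# 0x1003d..0x10041 and crosses the boundary at 0x10040, so on many cores the
# two instructions won't macro-fuse.
pair_start, pair_len = 0x1003D, 5
print(straddles_boundary(pair_start, pair_len))  # True
```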
No. The key observation of BOLT is that by collecting a profile on an optimized binary and then mapping that profile onto a disassembly of that same optimized binary, you get better profiling fidelity.
My intuition for why BOLT works is that:
- If you try to profile an unoptimized (or even insufficiently optimized) binary, you don't get accurate profiling because the timings are different.
- If you try to profile an optimized binary and then rerun the compiler from source using that profiling data, then you'll have a bad time mapping the profiler's observations back to what the source looked like. This is because the compiler pipeline will have done many transforms - totally changing control flow layout in some cases - that make some of the profiling meaningless when you try to inject it before those optimizations happened.
But BOLT injects the profiling data into the code exactly as it was at time of profiling, i.e. the binary itself.
It's totally insane, wacky, and super fucking cool - these folks should be hella proud of themselves.
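If it helps make that concrete, here's a toy sketch of the idea: the profile maps onto the binary's own addresses, and the blocks get laid out by hotness. Every address, block boundary, and the greedy layout here are invented; the real tool reconstructs the CFG from a disassembly and uses branch-stack counts and a smarter layout heuristic.

```python
from collections import Counter

# Toy model of one already-compiled function: basic blocks keyed by their start
# address in the binary, exactly as they exist at profiling time.
blocks = {0x400000: 12, 0x40000C: 30, 0x40002A: 8, 0x400032: 20}  # start addr -> size

# Hypothetical sampled instruction pointers (e.g. from perf).
samples = [0x400003, 0x40000D, 0x40000D, 0x400010, 0x400035, 0x400035, 0x400035]

def owning_block(ip, blocks):
    """Map a sampled address back to the block that contains it."""
    for start, size in blocks.items():
        if start <= ip < start + size:
            return start
    return None

hotness = Counter(owning_block(ip, blocks) for ip in samples)

# Greedy layout: hottest blocks first so the hot path is contiguous in the
# icache; cold blocks sink to the end (or get split out entirely).
layout = sorted(blocks, key=lambda b: hotness.get(b, 0), reverse=True)
print([hex(b) for b in layout])  # hot blocks come first
```

The point being: no mapping back to source is needed at all, because the samples already refer to the code exactly as it will be rearranged.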
In theory you can get a pretty good idea of where instructions came from in the source code, though the optimizer does, shall we say, obfuscate/spread that a little bit (which is why debugging through optimized code or looking at core dumps from optimized code can be tricky, but you can still mostly do it; there's just some lack of precision in the mapping back).
Maybe the bigger problem is at what point the profiles feed back. Since a compiler may generate many object files which are then linked to form the final binary, you'd sort of want to do this in the linker vs. earlier on.
I guess specifically with the kernel there's an extra layer of complexity. It looks like they use `perf` to record the profile which is cool. And then they apply the results to the binary which is also cool.
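Roughly, the flow looks something like the sketch below. The flags are from memory of the BOLT docs and a kernel target needs extra care (vmlinux vs. the running image), so treat the target name, workload, and options as assumptions rather than the article's actual recipe:

```python
import subprocess

def run(cmd):
    """Run a command, echoing it first; raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

binary = "vmlinux"          # hypothetical target; the article's exact setup may differ
workload = ["./benchmark"]  # hypothetical workload that exercises the binary

# 1. Record a profile with branch stacks (LBR); flags per the BOLT README, from memory.
run(["perf", "record", "-e", "cycles", "-j", "any", "-o", "perf.data", "--"] + workload)

# 2. Convert raw perf samples into BOLT's .fdata profile format.
run(["perf2bolt", "-p", "perf.data", "-o", "perf.fdata", binary])

# 3. Rewrite the binary: lay out hot blocks and functions together.
run(["llvm-bolt", binary, "-o", binary + ".bolt",
     "-data=perf.fdata",
     "-reorder-blocks=ext-tsp",
     "-reorder-functions=hfsort"])
```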
If it increases performance by 5% then 5% of execution time was spent missing branches and instruction cache misses... Which sounds utterly implausible. Conventional wisdom has it that instruction caching is not a problem because whatever the size of the binary it is dwarfed by the size of the data. And hot loops are generally no more than a few KBs at most anyway. I'm skeptical.
The performance loss due to cache misses in data-center applications far exceeds 5%. Combined data and instruction cache misses are contributing to more than half of stalled cycles.
"Our results demonstrate a significant and growing problem with instruction-cache bottlenecks. Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions (Section 6)."
You're conflating data misses with instruction misses. Besides, the article is about hundred-megabyte binaries (forgot to strip the debug data?), which the Linux kernel isn't.
You can measure it yourself, this is not a philosophical thing you have to be skeptical about.
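For example, with the generic events perf exposes on most machines (exact event names and availability vary by CPU and kernel, and the workload here is a placeholder):

```python
import subprocess

# Generic hardware events; some may report <not supported> depending on the CPU.
events = "cycles,instructions,branch-misses,L1-icache-load-misses,iTLB-load-misses"
workload = ["./your_workload"]  # hypothetical command to measure

# perf prints the counts when the workload exits.
subprocess.run(["perf", "stat", "-e", events, "--"] + workload, check=True)

# Dividing branch-misses or L1-icache-load-misses by instructions gives a rough
# per-instruction miss rate to sanity-check the claim.
```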
It's not that 5% of the time was spent in icache misses, it’s more about the cascading benefits of hitting the cache or predicting a branch correctly. You get more instruction-level parallelism, fewer stalls, and in general more straight-line execution without hitting memory.
5% doesn't sound like what you would get by addressing the lowest-hanging fruit, which makes sense because the Linux kernel's primary bloat is mostly in the form of drivers that never load on any given system. So sure, a pretty good result for not a very relevant optimization, but I'd be disappointed if I had built this tool specifically for the kernel instead of applying it because I already had it.
>5% of execution time was spent missing branches and instruction cache misses... Which sounds utterly implausible.
You could be right, but this reminds me of the "efficient markets" joke, wherein two economists refuse to believe the $100 they see on the ground is real because "someone would have picked it up" :-)
It's called Propeller and it had some purported advantages afaik.
Anyone know if such large-scale experiments have been conducted with this, for the sake of comparison?
https://engineering.fb.com/2018/06/19/data-infrastructure/ac...
More details about Propeller are available in a recently published paper: https://research.google/pubs/propeller-a-profile-guided-reli...
I'd love to know if the kernel could be improved in this way for PC / Android users too.
But what you just described sounds awesome! (And crazy.)
The following publication by Google from 2015 goes into details: https://static.googleusercontent.com/media/research.google.c...
"Our results demonstrate a significant and growing problem with instruction-cache bottlenecks. Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions (Section 6)."
It's not that 5% of the time was spent in icache misses, it’s more about cascading benefits of hitting the cache or predicting a branch correctly. You get more instruction level parallelism, less stalls, and in general more straightline execution without hitting memory.
You could be right, but this reminds of the "efficient markets" joke. Wherein two economists refuse to believe the $100 the see on the ground is real because "someone would have picked it up" :-)