While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major
issues:
* It does not take advantage of distributed build systems.
* It has scalability issues. To rewrite a binary with a ~300M text segment:
  * Memory footprint is 70G.
  * It takes more than 10 minutes to rewrite the binary.
Similar to Full LTO, BOLT’s design is monolithic: it disassembles the original binary, optimizes it, and rewrites the final binary in one process. This limits BOLT's scalability, and the memory and time overheads shoot up quickly for large binaries.
It takes BOLT less than 10 seconds to optimize the Linux kernel once the profile is collected. There's no need for a distributed build system to take advantage of that. Overall, we have improved both processing time and memory consumption over the years.
I seem to remember that glandium had done something like this on the Firefox binary to optimize the startup, with impressive speedup. Can't find the details, though.
If some of the optimization gain is from bad alignment of potentially fused operations that can't fuse because they straddle a 64-byte cache boundary, it feels like this is something the compiler should be aware of and mitigating.
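To make the "straddle a 64-byte boundary" part concrete, here's a toy sketch; the addresses and instruction lengths are made up, and the actual fusion rules depend on the microarchitecture:

```python
# Sketch: check whether an instruction pair (e.g. a cmp/jcc the CPU could
# macro-fuse) crosses a 64-byte cache-line / fetch boundary.
# Addresses and lengths below are invented for illustration.

LINE_SIZE = 64  # bytes per cache line on typical x86 cores

def straddles_boundary(addr: int, length: int, line_size: int = LINE_SIZE) -> bool:
    """True if the byte range [addr, addr+length) crosses a line boundary."""
    return (addr // line_size) != ((addr + length - 1) // line_size)

# A 3-byte cmp at 0x1003d followed by a 2-byte jcc at 0x10040: the pair spans
# 0x1003d..0x10041 and crosses the boundary at 0x10040, so on many cores the
# two instructions won't macro-fuse.
pair_start, pair_len = 0x1003D, 5
print(straddles_boundary(pair_start, pair_len))  # True
```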
No. The key observation of BOLT is that by collecting a profile on an optimized binary and then mapping that profile onto a disassembly of that same optimized binary, you get better profiling fidelity.
My intuition for why BOLT works is that:
- If you try to profile an unoptimized (or even insufficiently optimized) binary, you don't get accurate profiling because the timings are different.
- If you try to profile an optimized binary and then rerun the compiler from source using that profiling data, then you'll have a bad time mapping the profiler's observations back to what the source looked like. This is because the compiler pipeline will have done many transforms - totally changing control flow layout in some cases - that make some of the profiling meaningless when you try to inject it before those optimizations happened.
But BOLT injects the profiling data into the code exactly as it was at time of profiling, i.e. the binary itself.
It's totally insane, wacky, and super fucking cool - these folks should be hella proud of themselves.
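If it helps make that concrete, here's a toy sketch of the idea: the profile maps onto the binary's own addresses, and the blocks get laid out by hotness. Every address, block boundary, and the greedy layout here are invented; the real tool reconstructs the CFG from a disassembly and uses branch-stack counts and a smarter layout heuristic.

```python
from collections import Counter

# Toy model of one already-compiled function: basic blocks keyed by their start
# address in the binary, exactly as they exist at profiling time.
blocks = {0x400000: 12, 0x40000C: 30, 0x40002A: 8, 0x400032: 20}  # start addr -> size

# Hypothetical sampled instruction pointers (e.g. from perf).
samples = [0x400003, 0x40000D, 0x40000D, 0x400010, 0x400035, 0x400035, 0x400035]

def owning_block(ip, blocks):
    """Map a sampled address back to the block that contains it."""
    for start, size in blocks.items():
        if start <= ip < start + size:
            return start
    return None

hotness = Counter(owning_block(ip, blocks) for ip in samples)

# Greedy layout: hottest blocks first so the hot path is contiguous in the
# icache; cold blocks sink to the end (or get split out entirely).
layout = sorted(blocks, key=lambda b: hotness.get(b, 0), reverse=True)
print([hex(b) for b in layout])  # hot blocks come first
```

The point being: no mapping back to source is needed at all, because the samples already refer to the code exactly as it will be rearranged.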
In theory you can get a pretty good idea of where instructions came from in the source code, though the optimizer does, shall we say, obfuscate/spread that a little bit (which is why debugging through optimized code or looking at core dumps from optimized code can be tricky, but you can still mostly do it; there's just some lack of precision in the mapping back).
Maybe the bigger problem is at what point the profiles feed back. Since a compiler may generate many object files which are then linked to form the final binary, you'd sort of want to do this in the linker vs. earlier on.
I guess specifically with the kernel there's an extra layer of complexity. It looks like they use `perf` to record the profile which is cool. And then they apply the results to the binary which is also cool.
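Roughly, the flow looks something like the sketch below. The flags are from memory of the BOLT docs and a kernel target needs extra care (vmlinux vs. the running image), so treat the target name, workload, and options as assumptions rather than the article's actual recipe:

```python
import subprocess

def run(cmd):
    """Run a command, echoing it first; raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

binary = "vmlinux"          # hypothetical target; the article's exact setup may differ
workload = ["./benchmark"]  # hypothetical workload that exercises the binary

# 1. Record a profile with branch stacks (LBR); flags per the BOLT README, from memory.
run(["perf", "record", "-e", "cycles", "-j", "any", "-o", "perf.data", "--"] + workload)

# 2. Convert raw perf samples into BOLT's .fdata profile format.
run(["perf2bolt", "-p", "perf.data", "-o", "perf.fdata", binary])

# 3. Rewrite the binary: lay out hot blocks and functions together.
run(["llvm-bolt", binary, "-o", binary + ".bolt",
     "-data=perf.fdata",
     "-reorder-blocks=ext-tsp",
     "-reorder-functions=hfsort"])
```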
If it increases performance by 5% then 5% of execution time was spent missing branches and instruction cache misses... Which sounds utterly implausible. Conventional wisdom has it that instruction caching is not a problem because whatever the size of the binary it is dwarfed by the size of the data. And hot loops are generally no more than a few KBs at most anyway. I'm skeptical.
The performance loss due to cache misses in data-center applications far exceeds 5%. Combined data and instruction cache misses are contributing to more than half of stalled cycles.
"Our results demonstrate a significant and growing problem with instruction-cache bottlenecks. Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions (Section 6)."
You're conflating data misses with instruction misses. Besides, the article is about hundred-megabyte binaries (forgot to strip the debug data?), which the Linux kernel isn't.
You can measure it yourself, this is not a philosophical thing you have to be skeptical about.
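For example, with the generic events perf exposes on most machines (exact event names and availability vary by CPU and kernel, and the workload here is a placeholder):

```python
import subprocess

# Generic hardware events; some may report <not supported> depending on the CPU.
events = "cycles,instructions,branch-misses,L1-icache-load-misses,iTLB-load-misses"
workload = ["./your_workload"]  # hypothetical command to measure

# perf prints the counts when the workload exits.
subprocess.run(["perf", "stat", "-e", events, "--"] + workload, check=True)

# Dividing branch-misses or L1-icache-load-misses by instructions gives a rough
# per-instruction miss rate to sanity-check the claim.
```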
It's not that 5% of the time was spent in icache misses, it’s more about the cascading benefits of hitting the cache or predicting a branch correctly. You get more instruction-level parallelism, fewer stalls, and in general more straight-line execution without hitting memory.
5% doesn't sound like what you would get by addressing the lowest-hanging fruit, which makes sense because the Linux kernel's primary bloat is mostly in the form of drivers that never load on any given system. So sure, a pretty good result for not a very relevant optimization, but I'd be disappointed if I had built this tool specifically for the kernel instead of applying it because I already had it.
>5% of execution time was spent missing branches and instruction cache misses... Which sounds utterly implausible.
You could be right, but this reminds me of the "efficient markets" joke, wherein two economists refuse to believe the $100 they see on the ground is real because "someone would have picked it up" :-)
It's called Propeller and it had some purported advantages afaik.
Anyone know if such large-scale experiments have been conducted with this, for the sake of comparison?
https://engineering.fb.com/2018/06/19/data-infrastructure/ac...
More details about Propeller are available in a recently published paper: https://research.google/pubs/propeller-a-profile-guided-reli...
I'd love to know if the kernel could be improved in this way for PC / Android users too.
But what you just described sounds awesome! (And crazy.)
The following publication by Google from 2015 goes into details: https://static.googleusercontent.com/media/research.google.c...
"Our results demonstrate a significant and growing problem with instruction-cache bottlenecks. Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions (Section 6)."
It's not that 5% of the time was spent in icache misses, it’s more about cascading benefits of hitting the cache or predicting a branch correctly. You get more instruction level parallelism, less stalls, and in general more straightline execution without hitting memory.
You could be right, but this reminds of the "efficient markets" joke. Wherein two economists refuse to believe the $100 the see on the ground is real because "someone would have picked it up" :-)