maxpan commented on Meta Sees ~5% Performance Gains to Optimizing the Linux Kernel with Bolt   phoronix.com/news/Linux-B... · Posted by u/newman314
bjourne · a year ago
You're conflating data misses with instruction misses. Besides, the article is about hundred-megabytes-binaries (forgot to strip debug data?), which the Linux kernel isn't.
maxpan · a year ago
The quote above talks exclusively about instruction cache misses. In case you are really interested, the two kinds are related, since the L2 and L3 caches are shared by instructions and data.

In terms of its execution profile, the kernel is very close to a typical WSC application, i.e. very flat, without hotspots. The L1 I$ is only 32KB, so an application doesn't need hundreds of megabytes of code to benefit from layout optimizations.
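To make that concrete, here is a rough source-level sketch of the same idea (my own illustration, not how BOLT works internally), using GCC/Clang hints to push rarely executed code out of the hot path so the hot instructions stay dense in that 32KB L1 I$. BOLT achieves a similar effect after linking, driven by a runtime profile, with no source changes.

    /* Source-level sketch of hot/cold layout (illustrative only).
     * Marking rarely-taken paths cold lets the compiler move them out of
     * the hot text (e.g. into .text.unlikely), keeping the hot loop dense
     * in the L1 instruction cache. */
    #include <stdio.h>
    #include <stdlib.h>

    __attribute__((cold, noinline))
    static void report_error(int code)
    {
        fprintf(stderr, "unexpected error: %d\n", code);
        exit(1);
    }

    int process(const int *items, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            /* Hint that the error branch is almost never taken. */
            if (__builtin_expect(items[i] < 0, 0))
                report_error(items[i]);
            sum += items[i];
        }
        return sum;
    }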

maxpan commented on Meta Sees ~5% Performance Gains to Optimizing the Linux Kernel with Bolt   phoronix.com/news/Linux-B... · Posted by u/newman314
canucker2016 · a year ago
More info about Google's Propeller, from https://lists.llvm.org/pipermail/llvm-dev/2019-September/135...:

    While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major issues:

     * It does not take advantage of distributed build systems.
     * It has scalability issues and to rewrite a binary with a ~300M text segment size:
       * Memory foot-print is 70G.
       * It takes more than 10 minutes to rewrite the binary.

    Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes and rewrites the final binary in one process.  This limits the scalability of BOLT and the memory and time overhead shoots up quickly for large binaries.

maxpan · a year ago
Disclaimer: I'm an active developer of BOLT.

It takes BOLT less than 10 seconds to optimize the Linux kernel once the profile is collected; there's no need for a distributed build system to take advantage of it. Overall, we have improved both processing time and memory consumption over the years.

maxpan commented on Meta Sees ~5% Performance Gains to Optimizing the Linux Kernel with Bolt   phoronix.com/news/Linux-B... · Posted by u/newman314
bjourne · a year ago
If it increases performance by 5%, then 5% of execution time was spent on branch misses and instruction cache misses... Which sounds utterly implausible. Conventional wisdom has it that instruction caching is not a problem because, whatever the size of the binary, it is dwarfed by the size of the data. And hot loops are generally no more than a few KBs at most anyway. I'm skeptical.
maxpan · a year ago
The performance loss due to cache misses in data-center applications far exceeds 5%. Combined data and instruction cache misses contribute to more than half of stalled cycles.

The following publication by Google from 2015 goes into details: https://static.googleusercontent.com/media/research.google.c...

"Our results demonstrate a significant and growing problem with instruction-cache bottlenecks. Front-end core stalls account for 15-30% of all pipeline slots, with many workloads showing 5-10% of cycles completely starved on instructions (Section 6)."

maxpan commented on Meta Sees ~5% Performance Gains to Optimizing the Linux Kernel with Bolt   phoronix.com/news/Linux-B... · Posted by u/newman314
nsguy · a year ago
Aren't profile guided optimizers capable of doing similar optimizations?
maxpan · a year ago
The gains mentioned are on top of the compiler's PGO+LTO.
maxpan commented on Accelerate large-scale applications with BOLT   code.facebook.com/posts/6... · Posted by u/ot
AboutTheWhisles · 7 years ago
Can BOLT optimize a shared library as well?
maxpan · 7 years ago
Not yet, but support is coming.
maxpan commented on Accelerate large-scale applications with BOLT   code.facebook.com/posts/6... · Posted by u/ot
nwmcsween · 7 years ago
So what exactly would BOLT help with? I'm guessing mainly junky code or things a compiler cannot possibly optimize. Could you run BOLT on something that is known to be well optimized?
maxpan · 7 years ago
BOLT can optimize the compiler itself, either GCC or Clang.
maxpan commented on Accelerate large-scale applications with BOLT   code.facebook.com/posts/6... · Posted by u/ot
sanxiyn · 7 years ago
AutoFDO is exactly that. Facebook's excuse to develop this is that AutoFDO didn't work well with HHVM, but that sounds like an AutoFDO bugfix, not a whole new project. I agree with you that it is better to work on AutoFDO, because it will see wider usage.
maxpan · 7 years ago
In many cases BOLT complements AutoFDO. AutoFDO affects several optimizations, and code layout is just one of them. Another critical optimization influenced by AutoFDO/PGO is function inlining. After inlining, a callee's code profile is often different from the "pre-inlined" profile seen by the compiler, which prevents the compiler from making optimal layout decisions. Since BOLT observes the code after all compiler optimizations, its decisions are not skewed by the context-sensitive behavior of inlined functions.
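A tiny illustration of that effect (my own sketch, with made-up function names): the callee's branch is biased in opposite directions at its two call sites, so the aggregate pre-inlining profile looks roughly 50/50, while each inlined copy is actually heavily biased and can be laid out on its own.

    /* Illustrative only: why a pre-inlining profile can mislead layout.
     * The aggregate profile for clamp() says its branch is ~50% taken,
     * but after inlining each copy is heavily biased one way. A post-link
     * optimizer sees the two inlined copies separately and can lay each
     * out for its actual behavior. */
    static inline int clamp(int x, int limit)
    {
        if (x > limit)                     /* biased differently per call site */
            return limit;
        return x;
    }

    int hot_path_a(const int *v, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += clamp(v[i], 1 << 30);   /* almost never clamps */
        return acc;
    }

    int hot_path_b(const int *v, int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += clamp(v[i], 0);         /* almost always clamps */
        return acc;
    }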
maxpan commented on Accelerate large-scale applications with BOLT   code.facebook.com/posts/6... · Posted by u/ot
ebikelaw · 7 years ago
It seems unfortunate that they developed their own data format for the input. Why can't it be in the same format that SamplePGO ingests? It also seems unfortunate to add yet another stage to the toolchain. We already have either build, run+profile, rebuild with PGO, link with LTO, or build with SamplePGO and link with ThinLTO, and this adds a second or third rebuild. It already takes a phenomenally long time to compile a large C++ application, and another pass isn't going to make it shorter.
maxpan · 7 years ago
There's no need for another build, as BOLT runs directly on a compiled binary and can be integrated into an existing build system. Operating directly on machine code allows BOLT to boost performance on top of AutoFDO/PGO and LTO. Processing the binary directly requires a profile in a format different from what a compiler expects, as the compiler operates on source code and needs to attribute the profile to high-level constructs.
