More info about Google's Propeller, from https://lists.llvm.org/pipermail/llvm-dev/2019-September/135...:
While BOLT does an excellent job of squeezing extra performance from highly optimized binaries with optimizations such as code layout, it has these major issues:
* It does not take advantage of distributed build systems.
* It has scalability issues and to rewrite a binary with a ~300M text segment size:
  * Memory foot-print is 70G.
  * It takes more than 10 minutes to rewrite the binary.
Similar to Full LTO, BOLT’s design is monolithic as it disassembles the original binary, optimizes and rewrites the final binary in one process. This limits the scalability of BOLT and the memory and time overhead shoots up quickly for large binaries.
Disclaimer: I'm an active developer of BOLT.
It takes BOLT less than 10 seconds to optimize the Linux kernel once the profile is collected. There's no need for a distributed build system to take advantage of that. Overall, we have improved both processing time and memory consumption over the years.
In terms of the execution profile, the kernel is very close to a typical WSC application, i.e. very flat without hotspots. The L1 I$ is only 32KB, so your application doesn't need 100s of megabytes of code to benefit from layout optimizations.
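
To put rough numbers on the L1 I$ point, here is a back-of-the-envelope sketch in Python. Only the 32KB cache size comes from the comment above; the 64-byte cache line and the block counts and sizes are assumptions picked purely for illustration, not measurements from BOLT or the kernel.

```python
L1I_BYTES = 32 * 1024   # L1 instruction cache size quoted above
LINE_BYTES = 64         # assumed cache line size (typical on x86-64)


def lines_touched(block_addresses, block_size, line_bytes=LINE_BYTES):
    """Count the distinct cache lines covered by the given code blocks."""
    lines = set()
    for start in block_addresses:
        first = start // line_bytes
        last = (start + block_size - 1) // line_bytes
        lines.update(range(first, last + 1))
    return len(lines)


# Hypothetical workload: 1000 hot basic blocks of 16 bytes each.
HOT_BLOCKS = 1000
HOT_SIZE = 16
COLD_GAP = 48  # hypothetical cold code (e.g. error paths) after each hot block

# Original layout: every hot block is followed by cold code.
scattered = [i * (HOT_SIZE + COLD_GAP) for i in range(HOT_BLOCKS)]
# Optimized layout: hot blocks are placed back to back.
packed = [i * HOT_SIZE for i in range(HOT_BLOCKS)]

capacity = L1I_BYTES // LINE_BYTES
scattered_lines = lines_touched(scattered, HOT_SIZE)
packed_lines = lines_touched(packed, HOT_SIZE)

print(f"L1 I$ capacity:   {capacity} lines")
print(f"scattered layout: {scattered_lines} lines")
print(f"packed layout:    {packed_lines} lines")
```

Under these assumptions the whole program is only 64KB of code, yet the scattered layout needs 1000 cache lines for its hot path against a capacity of 512, while the packed layout needs 250 and fits comfortably.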