Better than brute force is using a model of your instruction set (for x86, start here [1]) and then using an SMT solver like Z3 to find solutions. Here's a paper doing the same [2]. With these approaches you can generate vastly bigger pieces of code than with brute force, and with significantly less hand-tuned heuristic effort (which would likely still lose out to current solvers anyway).
[1] https://github.com/zwegner/x86-sat
[2] https://people.cs.umass.edu/~aabhinav/Publications/Unbounded...
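For concreteness, here is a minimal sketch of the solver-backed side of this, assuming the z3-solver Python bindings (a model like [1] would supply the semantics of real x86 instructions; this only shows the equivalence query). The expressions are illustrative, not taken from either reference.

```python
# Hedged sketch: use an SMT solver to check whether a candidate rewrite can
# ever disagree with the original over 32-bit inputs. This is the verification
# step a model-based superoptimizer leans on instead of exhaustive testing.
from z3 import BitVec, Solver, sat

x = BitVec("x", 32)
original  = x & (x - 1)        # clear the lowest set bit
candidate = x - (x & -x)       # a rewrite the search might propose

s = Solver()
s.add(original != candidate)   # ask for any input where they differ
if s.check() == sat:
    print("not equivalent, counterexample:", s.model())
else:
    print("equivalent for all 32-bit inputs")
```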
If you are familiar with that paper, I'm a little confused by the results section. If I'm reading p01 from figure 7 correctly, the "before" is x&(x-1) and the "after" is a single instruction x&(x<<1). What am I doing wrong?
(The other two wins listed are "recognize the Hacker's Delight trick and replace it with the straightforward implementation a human would write", which I like.)
Happy to see this! Superoptimizers are one of my all time favorite topics.
My favorite trick when cutting down the search space of possible programs is to generate DAGs directly instead of an instruction sequence. This way, it's easier to write rules that avoid nonsensical programs and to some extent avoid generating multiple equivalent programs.
For example, to avoid programs that overwrite calculations, you make sure that each node in the generated DAG has something that depends on it.
In a suitable environment the DAG can then be executed directly without lowering to a sequence of instructions.
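A rough sketch of that idea (my own illustration, not the commenter's code): enumerate DAG nodes bottom-up over a tiny made-up operation set, and prune any candidate whose intermediate nodes are never used.

```python
# Hedged sketch: enumerate expression DAGs instead of instruction sequences.
# Every new node references inputs or earlier nodes, and candidates containing
# dead intermediate nodes are pruned, which removes "compute a result you
# never use" programs from the search space up front.
from itertools import product

MASK = 0xFFFFFFFF
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "xor": lambda a, b: a ^ b,
}

def enumerate_dags(num_inputs, max_nodes):
    """Yield DAGs as lists of (op, left, right), indices into inputs + earlier nodes."""
    def extend(nodes):
        if nodes:
            yield list(nodes)
        if len(nodes) == max_nodes:
            return
        n = num_inputs + len(nodes)            # everything a new node may reference
        for op, l, r in product(OPS, range(n), range(n)):
            yield from extend(nodes + [(op, l, r)])
    for dag in extend([]):
        used = {l for _, l, _ in dag} | {r for _, _, r in dag}
        # every node except the last (the result) must feed some later node
        if all(num_inputs + i in used for i in range(len(dag) - 1)):
            yield dag

def run(dag, inputs):
    vals = list(inputs)
    for op, l, r in dag:
        vals.append(OPS[op](vals[l], vals[r]))
    return vals[-1]

# Toy goal: find a two-node DAG computing 3*x from a single input x.
tests = [1, 6, 0xF0, 0xFFFFFFFF]
for dag in enumerate_dags(num_inputs=1, max_nodes=2):
    if all(run(dag, [x]) == (3 * x) & MASK for x in tests):
        print("found:", dag)    # e.g. [('add', 0, 0), ('add', 0, 1)]
        break
```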
This sounds a little bit like representing boolean functions as binary decision diagrams -- although for a superoptimizer presumably each node in the DAG would correspond to an instruction from the target instruction set, which might do something rather complicated, not just a primitive logical operation between a few bits.
https://en.wikipedia.org/wiki/Binary_decision_diagram
(I first read about BDDs in Knuth's The Art of Computer Programming, Volume 4, Fascicle 1, then spent an enjoyable few weeks going down a rabbit hole of building a boolean function to decide whether an input 2D grid of bits was a maze or not, then uniformly sampling mazes from it...)
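For anyone curious about the machinery behind that kind of exercise, here is a tiny, hedged sketch (the predicate is a simple stand-in, not a maze checker): build a reduced ordered BDD by Shannon expansion and count satisfying assignments, which is the operation that makes uniform sampling possible (weight each branch choice by its count).

```python
# Hedged sketch: BDD via Shannon expansion plus model counting over N variables.
N = 6                                   # number of boolean variables
TRUE, FALSE = "T", "F"

def mk(var, lo, hi):
    # Reduction rule: if both branches agree, the variable is irrelevant.
    return lo if lo == hi else (var, lo, hi)

def build(f, var=0, assign=()):
    """Shannon expansion of f over variables var..N-1 (exponential, fine for tiny N)."""
    if var == N:
        return TRUE if f(assign) else FALSE
    lo = build(f, var + 1, assign + (0,))
    hi = build(f, var + 1, assign + (1,))
    return mk(var, lo, hi)

def count(node, var=0):
    """Number of satisfying assignments over variables var..N-1."""
    if node == FALSE:
        return 0
    if node == TRUE:
        return 2 ** (N - var)
    v, lo, hi = node
    return 2 ** (v - var) * (count(lo, v + 1) + count(hi, v + 1))

# Stand-in predicate: no two adjacent bits are both set.
no_adjacent_ones = lambda bits: all(not (a and b) for a, b in zip(bits, bits[1:]))
root = build(no_adjacent_ones)
print(count(root), "of", 2 ** N, "assignments satisfy the predicate")   # 21 of 64
```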
The 1987 paper about superoptimizers is well worth the read, BTW. It's quite short. Even the first example on its own (writing the sign function without any conditionals) is a mind-bender but becomes straightforward once you see the trick used: https://courses.cs.washington.edu/courses/cse501/15sp/papers...
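For flavor, here is the same kind of trick written out in plain code (emulating 32-bit two's-complement arithmetic by masking). This is the well-known branch-free C idiom (x >> 31) | ((unsigned)(-x) >> 31), not the paper's exact instruction sequence, but it captures the idea of deriving -1/0/+1 with no comparisons or branches.

```python
# Hedged illustration of a branchless sign function over emulated 32-bit ints.
MASK = 0xFFFFFFFF

def sign(x):
    x &= MASK
    sign_bit = x >> 31                     # 1 if the top bit is set, else 0
    arith    = (-sign_bit) & MASK          # "arithmetic shift": all ones or all zeros
    logical  = ((-x) & MASK) >> 31         # 1 when x > 0 (also for x = INT_MIN, harmless)
    return (arith | logical) & MASK        # 0xFFFFFFFF (-1), 0, or 1

for x in (-13, 0, 42):
    print(x, "->", hex(sign(x)))           # -1, 0, +1 as 32-bit values
```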
By 'those optimisations' I assume you mean the specific rewrites, not superoptimisation? There has been some interesting recent work in rewrite inference, e.g. ruler. (I recall one such project did a survey and found that many—possibly a majority—of the rewrites it had generated automatically were not present in the hand-written corpuses of gcc and clang. I'm not sure if there was subsequent work to supplement gcc and clang; this was a few years ago now; I think they were looking at gcc 7.)
JetBrains was hiring mathematicians working on homotopy type theory to do something similar, if I recall correctly. The idea is that HoTT can be used to generate programs equivalent to the one in your buffer and do something like superoptimization. Maybe I am hallucinating.
Is superoptimization related to supercompilation, or are these unrelated compiler optimization techniques? From what I have read, it seems like Turchin [1] et al. are trying to prove equality using some sort of algebraic rewriting system (i.e., the "intensional" approach), whereas superoptimization uses an extensional technique: first try to prove disequality by testing on a small set of inputs, then apply formal verification to the remaining candidates.
Massalin (1987) [2] calls the first phase "probabilistic execution" and claims that nearly all of the functions which pass the PE test also pass the more rigorous Boolean verification test. Can you give any insight into the benefits of TT over more "automated" optimizations? I am curious whether MLTT/HoTT is more suitable for certain compiler optimizations or offers additional expressive power for proving equivalence, or whether the benefit is mostly ergonomics.
[1] https://dl.acm.org/doi/pdf/10.1145/5956.5957
[2] https://dl.acm.org/doi/pdf/10.1145/36177.36194
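To make that two-phase structure concrete, here is a small hedged sketch (mine, not Massalin's code): run every candidate on a handful of test inputs and discard it at the first mismatch; only the survivors go on to the expensive formal equivalence check (e.g. an SMT query like the one sketched earlier in the thread).

```python
# Hedged sketch of the "probabilistic execution" filter: cheap testing first,
# formal verification only for the candidates that survive.
import random

MASK = 0xFFFFFFFF
TESTS = [0, 1, 5, 6, 0x80000000, 0xFFFFFFFF] + [random.getrandbits(32) for _ in range(16)]

def target(x):
    return x & (x - 1) & MASK                   # reference: clear the lowest set bit

def probably_equivalent(candidate):
    return all(candidate(x) == target(x) for x in TESTS)

candidates = [
    lambda x: (x - (x & (-x & MASK))) & MASK,   # genuinely equivalent rewrite
    lambda x: x & ((x << 1) & MASK),            # looks similar, but x = 5 exposes it
]
survivors = [c for c in candidates if probably_equivalent(c)]
print(len(survivors), "of", len(candidates), "candidates survive; verify those formally")
```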
There is a GNU superoptimiser (long abandoned, though) that supports multiple hardware architectures, including HP PA-RISC, Motorola 68k and 88k, and Intel i960, amongst others: https://www.gnu.org/software/superopt/
It is based on exhaustive search with backtracking, the kind of thing Prolog is best at: given a goal function that takes an input vector of integers and produces an output integer, it attempts to find the optimal (shortest) instruction sequence that computes it.
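Roughly the shape of that search, as a hedged sketch over a made-up four-operation machine rather than GNU superopt's real instruction tables: try sequence lengths 1, 2, 3, ... with depth-first search and backtracking, checking each candidate against the goal function on a set of test inputs (survivors would still need real verification).

```python
# Hedged sketch of shortest-first exhaustive search over straight-line
# register programs, with backtracking. The instruction set is a toy one.
MASK = 0xFFFFFFFF
OPS = {
    "neg": lambda a, b: (-a) & MASK,   # unary; second operand ignored
    "and": lambda a, b: a & b,
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
}

def search(goal, tests, max_len):
    def dfs(prog, states, remaining):
        # states[k] holds every value computed so far for test input tests[k]
        if all(st[-1] == goal(x) for st, x in zip(states, tests)):
            return prog
        if remaining == 0:
            return None                                # dead end: backtrack
        n = len(states[0])
        for op in OPS:
            for i in range(n):
                for j in range(n):
                    nxt = [st + [OPS[op](st[i], st[j])] for st in states]
                    found = dfs(prog + [(op, i, j)], nxt, remaining - 1)
                    if found is not None:
                        return found
        return None

    for length in range(1, max_len + 1):               # shortest sequences first
        result = dfs([], [[x & MASK] for x in tests], length)
        if result is not None:
            return result
    return None

# Goal: clear the lowest set bit, i.e. x & (x - 1); expect a three-op sequence
# along the lines of neg, and, sub.
tests = [1, 6, 0xF0, 0x80000000, 0xFFFFFFFF]
print(search(lambda x: x & (x - 1) & MASK, tests, max_len=3))
```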
Related: I work on Souper (https://github.com/google/souper).
Feel free to reach out if anyone has questions!
It's a shame it's a bit hard to actually build these days. Its target is people working on compilers, not end users. But there are some odd cases where I'd really like to use it myself, like trying to get speedup in llama.cpp.
Are llama.cpp (and similar targets) made into standalone C++ source files without dependencies doing the heavy lifting?
One can now compile against cuBLAS and other GPU libraries to help take some of the load, but the core of the project is indeed to run on the CPU and do all of the work without dependencies.
First try: get every permutation, feed each to some stable code-oriented LLM, ask it to explain the code, and measure the number of words.