t-tests run afoul of the “no Gaussian assumptions” claim, though. Distributions arising from benchmarking frequently have various forms of skew, which messes up t-tests and gives artificially narrow confidence intervals.
(I'll gladly give you credit for your outlier detection, though!)
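For illustration, one distribution-free alternative to a t-based interval is a percentile bootstrap. This is only a minimal sketch and not what any particular tool does: the function name `bootstrap_ci`, the resample count, and the timings are all made up for the example.

```python
import random
import statistics

def bootstrap_ci(samples, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `samples`.

    Makes no Gaussian assumption: the interval comes from the empirical
    distribution of the statistic over resamples drawn with replacement.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical, right-skewed timings in seconds (two slow outlier runs).
timings = [1.00, 1.01, 1.01, 1.02, 1.02, 1.03, 1.03, 1.04, 1.30, 1.85]
lo, hi = bootstrap_ci(timings)
print(f"median = {statistics.median(timings):.3f} s, "
      f"95% CI = [{lo:.3f}, {hi:.3f}] s")
```

The interval is allowed to be asymmetric around the point estimate, which is the property a skewed timing distribution needs.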
>> Automatic isolation to the greatest extent possible (given appropriate permissions)
> This sounds interesting. Please feel free to open a ticket if you have any ideas.
Off the top of my head, some option that would:
* Bind to isolated CPUs, if booted with them (isolcpus=)
* Bind to a consistent set of cores/hyperthreads (the scheduler frequently sabotages benchmarking, especially if your cores have very different maximum frequencies)
* Warn if thermal throttling is detected during the run
* Warn if an inappropriate CPU governor is enabled
* Lock the program into RAM (probably hard to do without some sort of help from the program)
* Enable realtime priority if available (e.g., if isolcpus= is not enabled, or you're not on Linux)
Of course, sometimes you would _want_ to benchmark some of these effects, and that's fine. But most people probably won't, and won't know that they exist. I may easily have forgotten some.
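A rough sketch of the first, second, and fourth items, assuming Linux and the usual sysfs layout. The helper names are hypothetical, and treating anything other than the `performance` governor as suspicious is an assumption rather than a rule.

```python
import os
from pathlib import Path

def parse_cpu_list(text):
    """Parse a kernel CPU list such as "2-3,6" into a set of CPU indices."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus

def read_sysfs(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

def prepare_benchmark_cpus():
    """Pin this process (and its future children) and return warnings."""
    if not hasattr(os, "sched_setaffinity"):
        return ["CPU pinning is not available on this platform"]
    warnings = []
    # CPUs listed here were isolated at boot, e.g. via isolcpus=.
    isolated = parse_cpu_list(read_sysfs("/sys/devices/system/cpu/isolated") or "")
    if isolated:
        target = isolated
    else:
        target = {min(os.sched_getaffinity(0))}
        warnings.append("no isolated CPUs found; pinning to one ordinary core")
    try:
        os.sched_setaffinity(0, target)
    except OSError as err:
        warnings.append(f"could not pin to CPUs {sorted(target)}: {err}")
    for cpu in sorted(target):
        governor = read_sysfs(
            f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor")
        # Assumption: only "performance" is considered benchmark-friendly.
        if governor is not None and governor != "performance":
            warnings.append(f"cpu{cpu} uses the '{governor}' governor")
    return warnings

if __name__ == "__main__":
    for message in prepare_benchmark_cpus():
        print("warning:", message)
```

Because child processes inherit the affinity mask, a harness could pin itself this way before spawning the command under test.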
On the flip side (making things more random as opposed to less), something that randomizes the initial stack pointer would be nice, as I've sometimes seen this go really, really wrong (renaming a binary from foo to foo_new made it run >1% slower!).
This is something we do already. We set a `HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET` environment variable with a random-length value: https://github.com/sharkdp/hyperfine/blob/87d77c861f1b6c761a...
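The idea can be sketched as follows. On Linux the environment strings are copied onto the new process's initial stack, so padding the environment by a random amount shifts the stack addresses from run to run. The variable name comes from the comment above, while the maximum length, the filler character, and the helper name `randomized_env` are assumptions for illustration only.

```python
import os
import random
import subprocess
import sys

def randomized_env(max_offset=4096, rng=random):
    """Copy the current environment and add a random-length padding variable.

    max_offset and the "X" filler are illustrative assumptions.
    """
    env = dict(os.environ)
    env["HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET"] = "X" * rng.randrange(max_offset)
    return env

# Spawn a child with the padded environment; here the child just reports
# how long the padding variable is.
child = [
    sys.executable, "-c",
    "import os; print(len(os.environ['HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET']))",
]
result = subprocess.run(child, env=randomized_env(), capture_output=True, text=True)
print("padding length seen by child:", result.stdout.strip())
```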
Current defaults: "By default, it will perform at least 10 benchmarking runs and measure for at least 3 seconds." If your program takes 1s to run, it should take 10 seconds to benchmark.
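As a back-of-the-envelope check of that arithmetic, here is a simplified model of the quoted defaults (not hyperfine's actual scheduling logic; `planned_runs` is a hypothetical helper):

```python
import math

def planned_runs(seconds_per_run, min_runs=10, min_time=3.0):
    """Runs needed to satisfy both 'at least min_runs' and 'at least min_time'."""
    return max(min_runs, math.ceil(min_time / seconds_per_run))

for seconds in (1.0, 0.25):
    runs = planned_runs(seconds)
    print(f"{seconds} s per run -> {runs} runs, about {runs * seconds:g} s total")
```

For a 1 s program the 10-run minimum dominates (10 runs, about 10 s); for a 0.25 s program the 3-second minimum dominates (12 runs, about 3 s).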
Is it possible that your program was waiting for input that never came? One "gotcha" is that it expects each argument to be a full program, so if you ran `hyperfine ./a.out input.txt`, it will first bench a.out with no args, then try to bench input.txt (which will fail). If a.out reads from stdin when no argument is given, then it would hang forever, and I can see why you'd give up after a half hour.
We do close stdin to prevent this. So you can benchmark `cat`, for example, and it works just fine.
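The effect can be reproduced with a small sketch. It assumes the child's stdin is connected to the null device, which may differ from the exact mechanism hyperfine uses, and it uses a Python one-liner as a stand-in for `cat`:

```python
import subprocess
import sys

# Stand-in for `cat`: copy stdin to stdout until EOF.
cat_like = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read())"]

# With stdin attached to the null device, the child sees EOF immediately
# instead of blocking on a terminal that never sends input.
result = subprocess.run(cat_like, stdin=subprocess.DEVNULL,
                        capture_output=True, text=True, timeout=30)
print("exit code:", result.returncode, "output:", repr(result.stdout))
```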