Perhaps interesting (for some) to note that hyperfine is from the same author as at least a few other "ne{w,xt} generation" command line tools (that could maybe be seen as part of "rewrite it in Rust", but I don't want to paint the author with a brush they disagree with!!): fd (find alternative; https://github.com/sharkdp/fd), bat ("supercharged version of the cat command"; https://github.com/sharkdp/bat), and hexyl (hex viewer; https://github.com/sharkdp/hexyl). (And certainly others I've missed!)
Pointing this out because I myself appreciate comments that do this.
For myself, `fd` is the one most incorporated into my own "toolbox" -- used it this morning prior to seeing this thread on hyperfine! So, thanks for all that, sharkdp if you're reading!
Ok, end OT-ness.
It’s absolutely my preferred `find` replacement. Its CLI interface just clicks for me and I can quickly express my desires. Quite unlike `find`. `fd` is one of the first packages I install on a new system.
The "funny" thing for me about `fd` is that the set of operations I use for `find` are very hard-wired into my muscle memory from using it for 20+ years, so when I reach for `fd` I often have to reference the man page! I'm getting a little better from more exposure, but it's just different enough from `find` to create a bit of an uncanny valley effect (I think that's the right use of the term...).
Even with that I reach for `fd` for some of its quality-of-life features: respecting .gitignore, its speed, regex-ability. (Though not its choices with color; I am a pretty staunch "--color never" person, for better or worse!)
Anyway, that actually points to another good thing about sharkdp's tools: they have good man pages!!
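For anyone who hasn't tried it yet, a rough sketch of how the features mentioned above map onto flags (the pattern and paths are made up for illustration; the man page has the details):

```sh
# fd takes a regex and respects .gitignore by default
fd '\.rs$' src/

# opt back into ignored and hidden files, closer to plain find
fd --no-ignore --hidden '\.rs$'

# plain, uncolored output
fd --color never '\.rs$'
```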
Hyperfine is a great tool but when I was using it at Deno to benchmark startup time there was a lot of weirdness around the operating system apparently caching inodes of executables.
If you are looking at shaving sub-20 ms numbers, be aware that you may need to pull tricks, on macOS especially, to get real numbers.
Caching is something that you almost always have to be aware of when benchmarking command line applications, even if the application itself has no caching behavior. Please see https://github.com/sharkdp/hyperfine?tab=readme-ov-file#warm... on how to run either warm-cache benchmarks or cold-cache benchmarks.
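For reference, the two patterns described there look roughly like this (the command being benchmarked is a placeholder, and the drop_caches preparation is Linux-specific and needs sudo):

```sh
# warm-cache benchmark: run a few untimed warmup rounds first
hyperfine --warmup 3 'grep -r TODO .'

# cold-cache benchmark: drop file system caches before every timed run
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' 'grep -r TODO .'
```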
I'm fully aware, but it's not a problem that warmup runs fix. An executable freshly compiled will always benchmark differently than one that has "cooled off" on macOS, regardless of warmup runs.
I've tried to understand what the issue is (played with re-signing executables, etc.), but it's literally something about the inode of the executable itself. Most likely part of the macOS security system.
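One way to poke at this, as a hypothetical experiment rather than a fix: copy the binary so it gets a fresh inode and benchmark both side by side (./my_tool is a placeholder):

```sh
# the copy is byte-identical but lives on a new inode
cp ./my_tool ./my_tool_fresh
hyperfine --warmup 5 './my_tool --version' './my_tool_fresh --version'
```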
I've found pretty good results with the System Trace template in Xcode Instruments. You can also stack instruments, for example combining the file inspector with a virtual memory inspector.
I've run into some memory corruption with it sometimes, though, so be wary of that. Emerge Tools has an alternative for iOS at least; maybe one day they'll port it to macOS.
Windows has microsecond-precision counters (see QueryPerformanceCounter and friends).
I've also had a good experience using the 'perf'[^1] tools for when I don't want to install 'hyperfine'. Shameless plug for a small blog post about it, as I don't think it is that well known: https://usrme.xyz/tils/perf-is-more-robust-for-repeated-timi....
---
[^1]: https://www.mankier.com/1/perf
https://abuisman.com/posts/developer-tools/quick-page-benchm...
As mentioned here in the thread, when you want to go into single-millisecond optimisations it is not the best approach, since there is a lot of overhead (especially the way I demonstrate here), but it works very well for some sanity checks.
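A minimal sketch of the perf-based approach mentioned above (Linux only; ./my_tool is a placeholder):

```sh
# run the command 10 times; perf prints mean and stddev of wall-clock time on stderr
perf stat -r 10 ./my_tool > /dev/null
```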
Is it, though?
What I would expect a system like this to have, at a minimum:
* Robust statistics with p-values (not just min/max, compensation for multiple hypotheses, no Gaussian assumptions)
* Multiple stopping points depending on said statistics.
* Automatic isolation to the greatest extent possible (given appropriate permissions)
* Interleaved execution, in case something external changes mid-way.
I don't see any of this in hyperfine. It just… runs things N times and then does a naïve average/min/max? At that rate, one could just as well use a shell script and eyeball the results.
This is not included in the core of hyperfine, but we do have scripts to compute "advanced" statistics, and to perform t-tests here: https://github.com/sharkdp/hyperfine/tree/master/scripts
Please feel free to comment here if you think it should be included in hyperfine itself: https://github.com/sharkdp/hyperfine/issues/523
> Automatic isolation to the greatest extent possible (given appropriate permissions)
This sounds interesting. Please feel free to open a ticket if you have any ideas.
> Interleaved execution, in case something external changes mid-way.
Please see the discussion here: https://github.com/sharkdp/hyperfine/issues/21
> It just… runs things N times and then does a naïve average/min/max?
While there is nothing wrong with computing average/min/max, this is not all hyperfine does. We also compute modified Z-scores to detect outliers. We use that to issue warnings, if we think the mean value is influenced by them. We also warn if the first run of a command took significantly longer than the rest of the runs and suggest counter-measures.
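(For context: the modified Z-score is usually defined against the median and the median absolute deviation rather than the mean and standard deviation, roughly M_i = 0.6745 * (t_i - median(t)) / MAD(t), which is what makes it robust to the very outliers it is meant to flag. The exact constant and threshold hyperfine uses may differ; this is the textbook Iglewicz-Hoaglin form.)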
Depending on the benchmark I do, I tend to look at either the `min` or the `mean`. If I need something more fine-grained, I export the results and use the scripts referenced above.
> At that rate, one could just as well use a shell script and eyeball the results.
Statistical analysis (which you can consider to be basic) is just one reason why I wrote hyperfine. The other reason is that I wanted to make benchmarking easy to use. I use warmup runs, preparation commands and parametrized benchmarks all the time. I also frequently use the Markdown export or the JSON export to generate graphs or histograms. This is my personal experience. If you are not interested in all of these features, you can obviously "just as well use a shell script".
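To illustrate the kind of invocation described above (the command and parameter names are made up):

```sh
# sweep a parameter, discard warmup runs, and keep results for later
hyperfine --warmup 3 \
  -L threads 1,2,4,8 \
  --export-markdown results.md \
  --export-json results.json \
  './my_tool --threads {threads}'
```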
Personally, I'm all about the UNIX philosophy of doing one thing and doing it well. All I want is the process to be invoked k times to do a thing with warmup etc. etc. If I want additional stats, it's easy to calculate. I just `--export-json` and then once it's in a dataframe I can do what I want with it.
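That workflow can stay very small; as far as I know the JSON export keeps the raw per-run wall-clock samples in results[].times, so something like this is enough to get them into whatever stats tooling you prefer:

```sh
hyperfine --export-json results.json './my_tool'
jq '.results[0].times' results.json   # raw samples for further analysis
```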
The comment about statistics that I wanted to reply to has disappeared. That commenter said:
> I stand firm in my belief that unless you can prove how CLT applies to your input distributions, you should not assume normality. And if you don't know what you are doing, stop reporting means.
I agree. My research group stopped using Hyperfine because it ranks benchmarked commands by mean, and provides standard deviation as a substitute for a confidence measure. These are not appropriate for heavy-tailed, skewed, and otherwise non-normal distributions.
It's easy to demonstrate that most empirical runtime distributions are not normal. I wrote BestGuess [0] because we needed a better benchmarking tool. Its analysis provides measures of skew, kurtosis, and Anderson-Darling distance from normal, so that you can see how normal (or not) your distribution is. It ranks benchmark results using non-parametric methods. And, unlike many tools, it saves all of the raw data, making it easy to re-analyze later.
My team also discovered that Hyperfine's measurements are a bit off. It reports longer run times than other tools, including BestGuess. I believe this is due to the approach, which is to call getrusage(), then fork/exec the program to be measured, then call getrusage() again. The difference in user and system times is reported as the time used by the benchmarked command, but unfortunately this time also includes cycles spent in the Rust code for managing processes (after the fork but before the exec).
BestGuess avoids external libraries (we can see all the relevant code), does almost nothing after the fork, and uses wait4() to get measurements. The one call to wait4() gives us what the OS measured by its own accounting for the benchmarked command.
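A quick, rough way to cross-check any tool's user/system numbers is GNU time, which reports the child's own rusage as returned by the kernel's wait-family accounting (./my_tool is a placeholder; note the path, this is not the shell builtin):

```sh
/usr/bin/time -v ./my_tool   # -v is GNU time; on BSD/macOS use -l instead
```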
While BestGuess is still a work in progress (not yet at version 1.0), my team has started using it regularly. I plan to continue its development, and I'll write it up soon at [1].
[0] https://gitlab.com/JamieTheRiveter/bestguess
[1] https://jamiejennings.com
Hyperfine is great! I remember learning about it when comparing functions with and without tail recursion (I'm not sure if it was from the Go reference or the Rust reference). It provides simple configuration for unit-test-style benchmarks. But I have not tried it on a DBMS (e.g. with sysbench). Has anyone tried that?
Back when I needed it, multitime [1] reported peak memory usage, which hyperfine was not able to show. Maybe this has changed by now.
[1] https://tratt.net/laurie/src/multitime/