Also measure many times, and make sure any difference you see isn't accounted for by variation in the test system (CPU frequency scaling due to temperature, your test being run partly (or entirely) on low-power cores if your CPU has such a distinction, network contention if you aren't working locally or over a link dedicated to the tests, etc.).
Also, where IO is at all involved, remember to test both cold and warm starts, and make sure that if your new version seems faster in the results it actually is, rather than a previous run having “warmed up” something. Make sure any prep you do before each measurement doesn't inadvertently change the results – keep prep and the actual tests separate (and cool down between if trying to analyse cold-run performance). On this latter point I once had a heck of a time convincing someone that his new index hadn't sped up a SQL query by an order of magnitude – he was correctly running from cold each time but was adding the index and running the query immediately (without dropping cached pages between); the affected table was small enough to fit in memory and adding the index caused a full scan, so by the time the actual test ran the system was no longer cold.
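A minimal sketch of keeping prep separate from the timed operation (Python, illustrative only — it can't drop OS page caches from here, so it only addresses the in-process half of the cold/warm problem):

```python
import statistics
import time

def measure(setup, op, runs=10):
    """Time op() over several runs, redoing setup() fresh each run
    so that prep work can't warm anything up for the measurement."""
    samples = []
    for _ in range(runs):
        state = setup()              # prep: build data, open files, etc.
        start = time.perf_counter()  # only the operation itself is timed
        op(state)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# toy usage: prep builds a list, the measured op sums it
median_s = measure(lambda: list(range(100_000)), sum)
```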
Worth pointing out that this is actually what Knuth meant by premature optimisation. Nothing to do with optimising too early in a project or some sense of YAGNI, but rather optimising before you know where the actual time is spent.
I don't think the original paper by Knuth (https://dl.acm.org/doi/10.1145/356635.356640) supports this interpretation. In the paper, Knuth describes how, as one develops as a programmer, one learns to write in a style that is generally reasonably efficient. In particular, you should take note of hot loops, which are often quite obvious, and then write in the most readable style outside of the hot loops without too much thought about efficiency. By accepting some inefficiency, but writing in a generally efficient style, you can move much more quickly than if you are constantly bogged down in the details trying to squeeze out every possible CPU cycle.
What people tend to miss in my view is that he is advocating writing in an efficient style, which may not be obvious to an inexperienced developer. For example, writing a linear algebra routine in which you traverse column wise in an inner loop when the data is laid out row wise is going to be very slow and a simple change of algorithm or restructuring of the data can lead to an enormous speed up. This should be done immediately if discovered. But a skilled programmer will also be unlikely to traverse in the wrong order anyway because they already know the data layout and have thought about the order of traversal as part of their thinking process. In other words, the "optimization" of traversing in the fast way is something they would just do naturally and has no influence on the development time.
Conversely, they may also suspect that with some clever bit-twiddling hacks, they may be able to squeeze another 5-10% performance out of the routine. Unless you know that the code is being distributed at scale, this is almost certainly a waste of time. It will be hard to write as well as brittle and error prone. It deflects your attention from more important problems. It is evil both because of the opportunity cost of chasing these small efficiencies as well as the increased complexity that they often introduce for minimal gain.
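The traversal-order point above can be sketched even in Python (the effect is muted here because CPython boxes every int; in C, Fortran, or NumPy the gap from cache locality is far larger):

```python
N = 300
# row-major layout: matrix[i] is a contiguous row
matrix = [[1] * N for _ in range(N)]

def sum_row_wise(m):
    # inner loop walks along a row: good locality
    return sum(m[i][j] for i in range(N) for j in range(N))

def sum_col_wise(m):
    # inner loop jumps across rows: same answer, worse locality
    return sum(m[i][j] for j in range(N) for i in range(N))

# both orders compute the same result; only the access pattern differs
assert sum_row_wise(matrix) == sum_col_wise(matrix) == N * N
```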
I'm pasting the famous passage below, and I don't know how to interpret it as anything other than a call to correctly identify the performance critical part of one's code before spending significant time on optimization. If you can intuit that, more power to you! If you can't, you must measure, and Knuth gives the suggestion that your compiler should do this automatically for you:
"There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off."
I certainly agree with the interpretation that the _cost_ is maintainability of prematurely optimised code and this is the evil he's describing. But this is also all in the context of efficiency as a virtue in itself - i.e. something you should be seeking, but seeking in the correct places.
My problem is that people use this as an excuse not to consider optimization at all initially, because they think that once they have a working version, they can "just" profile and optimize the slowest few functions to get ideal performance. From the point of view of that singular task, yes, they achieved an optimization objective. From the point of view of overall performance vs what was possible given certain hardware, they could still be several orders of magnitude away from optimal.
A profiler can't tell you that an entire operation would become completely unnecessary if you maintain a certain invariant elsewhere, and this can go as far out as you let it, e.g. entirely changing your in-memory data model to prepare it for the operations you know you'll be performing in the critical path.
I know because I've been fortunate to work in environments full of seasoned experts writing world-scale software in C++ and Rust, and even there I was still able to optimize 100x-1Mx by zooming out and solving the same problem from a new perspective – something a profiler cannot help with.
And I'm _nothing_ compared to the people that come around and make it another 10x faster with SIMD or some cutting edge computer science research I'm not across. (Remember when NFA regex was a breakthrough and now looks like a joke compared to DFA?). Even a profiler hinting that a hot function might benefit from SIMD can't tell you how you should arrange your overall data structures to fully benefit from SIMD, or cache locality for that matter.
This isn't an attack on what you said, and what I'm saying is bordering on incoherent because there's so much more to the performance optimization iceberg than the "profiling" tip that most engineers are permanently stuck on. But no, every time aggressive holistic optimization conversations come up, more than half of the audience will dismiss it as "premature" or insist on just profiling and decline to learn anything. Then people like me have to inherit and rewrite their projects because they've ground our global scale to a halt.
I'm curious how this interacts with language selection. In particular, it looks like most of the programming languages in widespread use in 1974 were relatively efficient, compiled, statically typed languages (Fortran, COBOL, C), with the notable exception of BASIC. Would Knuth have considered choosing a performant language as "optimization" or just as part of "writing in an efficient style as any good programmer should do"?
The distinction is a lot more important today with the ~100x speed and power consumption differences between Ruby/Python/etc and C++/Java/Go/Rust/etc.
I think this has been a costly misinterpretation of "premature optimisation". This phrase has been taken as justification for writing inefficient code, understanding it as "performance doesn't matter [for what we do]".
It almost always matters, and I believe it is possible to write both efficiently and in a readable way. Unfortunately, inefficient code is well established, in habits and even in (sometimes mandatory) coding styles. This inefficiency also often translates to hard to follow code because of too much abstraction.
Maybe we could counter this misinterpretation of "premature optimisation" with "premature abstraction".
> This phrase has been taken as justification for writing inefficient code, understanding it as "performance doesn't matter [for what we do]".
There is no contradiction. Writing inefficient code is perfectly fine until you _know_ (this means you have measurements that show that) it is _too_ inefficient and has to be optimised.
How do you measure the "optimisation" without "inefficient code" as baseline to begin with?
I've watched another version of this talk (maybe at Strange Loop), but I'm pretty sure I hadn't seen this hang-up on measuring weird C++ artefacts against Python before; it comes before the interesting stuff, and for me it was a big disappointment.
Comparing the size of C++ int (which will be the 32-bit signed integer because hey why not) against Python's integers (which are just fully arbitrary big integers) is unfair.
But it doesn't stop there, next it measures std::list<int> which is a doubly linked list, a container that C++ keeps around partly because it is infested with the sort of programmers who've heard this data structure is sometimes a good choice (it is) and therefore reach for it first -- exactly the sort of bad optimization we're warning against. Python doesn't even have a doubly linked list.
Then it compares std::map<int,int> against Python's dict, but std::map is a red-black tree of individually allocated nodes while Python's dict is a hashtable, because of course it is. So these aren't anywhere close to comparing like with like either.
Not every day you see Emery be the top comment on HN.
I was lucky to be in his courses in grad-school. It's the most fun I've had in an American classroom, all while discussing serious systems stuff. A gem of a professor.
If you don't have a performance culture your app will not be performant. The idea that performance problems are easy to reproduce and a profiler will always point to a single, easy to fix problem is naive.
These online tech spaces always assume that saving on compute, storage, and memory are the ultimate targets for optimization.
I’ve worked (mostly on business/data teams) in companies for 10 years, and have to say, saving on technical debt and human inefficiency w/ a codebase typically far outweighs the computational metrics
How might we incorporate that in a metric-like way?
I agree. However, simple-to-understand code with high human efficiency is nearly impossible to measure. Honestly, maybe ChatGPT could give you the best evaluation/measure of that.
“write code clear enough that an LLM can look at it and create an accurate summary of the high-level goals” gives me similar vibes to “explain like I’m 5”, which seems like it could be an interesting metric (and my gut tells me that things will go horribly wrong as soon as somebody turns this metric into a target :D)
It's easy to measure. You sit a dev down in front of it and time how long it takes them to understand it. That's not just an estimate that's the key value itself.
What techniques do people use to profile code where the core path takes <<1 ms? A sampling profiler runs at 1 kHz at most, so I find aliasing artifacts hide the true source of hotspots, and valgrind is too slow. The other problem I have is that diffuse hotspots are hard to root-cause when running the optimized version (e.g. it's hard to spot bounds checking). I'm not confident that simply running the code under test longer helps with the aliasing artifacts (probably because if the core loop is running in 1 µs then I'd have to run for 1000x longer to remove the sampling bias due to Nyquist, if that's even the right strategy?)
>> What techniques do people use to profile code where the core path takes <<1ms?
Make a loop that runs the code 5000 times and time that. What do you mean by "diffuse hot spots"? If a piece of code taking 1ms is called infrequently then optimizing it is not going to have a meaningful impact on your overall execution time.
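That loop-and-divide approach can be sketched with Python's stdlib timeit (the function here is just a made-up stand-in for a sub-millisecond path):

```python
import timeit

def fast_path():
    # stand-in for a core path far below timer resolution
    return sum(range(1000))

# run the snippet many times and divide: per-call cost emerges
# from the aggregate even though a single call is too fast to time
n = 5000
total_s = timeit.timeit(fast_path, number=n)
per_call_s = total_s / n
```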
You can measure the time for longer running higher level functions that do a lot. If you then optimize some small function that's a leaf in the call graph, you can see the impact on the larger function - and it probably won't be much.
One thing that can happen when nobody cares about performance is that every part of the code gets little inefficiencies and fixing any one of them has very little impact, but fixing ALL of them can be really significant. In that case you should start by profiling or timing the higher level functions and finding ones that have relatively small call trees that you can work on. Once you optimize one piece, the effects of optimizing others are a bit more significant because they are a larger part of overall execution time.
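One way to get those higher-level timings is a plain cProfile run sorted by cumulative time; the call tree below is a made-up stand-in:

```python
import cProfile
import io
import pstats

def leaf():
    return sum(range(10_000))

def mid():
    return [leaf() for _ in range(50)]

def top():
    return [mid() for _ in range(5)]

prof = cProfile.Profile()
prof.enable()
top()
prof.disable()

# cumulative sort surfaces the higher-level functions first,
# with the cost of their call trees accounted underneath them
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
```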
Diffuse could be something like bounds checking in Rust. It’s not code I’ve written nor will any profiler call out the bounds check - they’ll at best point to the high level collection method but each one is a speed bump. Or the compiler generating inefficient code - spelunking through assembly hoping to spot the problem isn’t an effective use of time and profilers I’ve tried (valgrind and hotspot) have been unhelpful in giving me the info.
The advice you’ve provided is correct philosophy but not actually actionable - I’m having trouble finding the inefficiencies.
Profiling != benchmarking. Profiling is great to find _where_ you should setup benchmarks in your code. If you've got code in a path that consistently takes <1ms and isn't called enough times to significantly add to the profiling of the application, would it really warrant benchmark comparisons?
As others have said, benchmark frameworks are great. Many systems have support for high-resolution perf counters these days. Just ran a quick test on my local system, it's able to make a timer accurate to 100ns.
Benchmark frameworks let you hook into your code in such a way that you can easily run a single fragment 10k-10m times and produce an analysis of the result that includes mean time, std deviation, and memory allocations. They usually let you setup a side by side comparison so you can directly compare 2+ methods.
If you use a language with a JIT then good frameworks for those include a warmup period to get the best-JIT version of the code in place before collecting data.
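A rough stdlib-only stand-in for what such frameworks do (warmup, mean, standard deviation); real ones like google/benchmark, Criterion, or pytest-benchmark also handle outlier rejection and allocation tracking:

```python
import statistics
import timeit

def bench(fn, repeats=7, number=5000, warmup=2):
    # warmup passes first (matters most on JIT runtimes; harmless here)
    for _ in range(warmup):
        timeit.timeit(fn, number=number)
    runs = timeit.repeat(fn, repeat=repeats, number=number)
    per_call = [r / number for r in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)

# side-by-side comparison of two candidate implementations
mean_a, sd_a = bench(lambda: "-".join(str(i) for i in range(50)))
mean_b, sd_b = bench(lambda: "-".join(map(str, range(50))))
```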
Yes I’ve done all that. But I’m still at being able to do the operation 20M times/s (ie 50ns) whereas a prototype I tested could do 100M (10ns) so I’m having trouble spotting the speed bumps that are in the way.
I'm confused here. Does this core path only run once per second? I've had no problems profiling blocks of code that short, because it sounds like every inner loop to me.
If you're looking to optimize a few hundred instructions, then focus on reading and understanding the code manually at the instruction level. A profiler will never help you there, CPUs are too complex:
1. I don't trust that the profiler has equal probability of pausing on any specific instruction
2. CPUs do not execute instructions in sequence, but in parallel. So you will need to understand data dependencies & maybe pipeline structures for your target processor. If you're getting really serious, you'll want something like https://uops.info/table.html
So I'm toying with my own database implementation and I'm getting ~20M writes per second. The write operation is a critical path that takes 1/(20 MHz), i.e. 50 ns. A sampling profiler will only capture at most ~1 kHz (and even that is pushing it), and a hot spot here is probably 5-10 ns. Similarly, it could very well be memory transfer costs vs CPU, which I've found hard to distinguish.
I’ve had luck with valgrind but just reading individual instructions has been uninformative as the code path is too complex for that kind of analysis.
An optimized implementation should only incur overhead of around one hardware timestamp per function call, which is generally around 10-100 ns. That is sufficient for benchmarking quantitative differences at the 1000 ns level and even clear qualitative differences at the 100 ns level. The bias is also fairly stable, so you can generally even subtract it out with a little bit of care. And really, the microarchitectural state starts dominating at that point anyway, so anything more fine-grained than that starts depending on making sure the surrounding code did not mess up your L2 or branch predictor and such.
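A sketch of calibrating out that fixed timing bias — using Python's perf_counter_ns instead of a raw hardware timestamp, so the overhead is larger, but the subtraction idea is the same:

```python
import time

def timer_overhead(samples=10_000):
    # cost of back-to-back timestamps: the fixed bias of measuring itself
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / samples

def timed_ns(fn, overhead):
    start = time.perf_counter_ns()
    fn()
    raw = time.perf_counter_ns() - start
    return max(raw - overhead, 0.0)  # subtract the stable bias

overhead_ns = timer_overhead()
elapsed_ns = timed_ns(lambda: sum(range(1000)), overhead_ns)
```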
Do you have any link to something that does it? I’ve not been able to find anything. And the things I’m trying to optimize is 50ns -> 10ns so I’d need to be careful about the overhead but hopefully it can be set to a mode that only does it for instrumented functions I’m trying to profile rather than all entry/exit
Note that the exact tracing relies on Intel PT – support for AMD was added recently but uses perf, so it suffers from the same sampling/skew issues, but is still very useful.
Thanks for the tip. I thankfully went with Intel so it should work and looks like it might get close. My needs are a bit higher frequency than that but it should help with other use cases that aren’t as high frequency + maybe there’s a way to set it to only sample instrumented functions so that the timer cost is further amortized.
It's mind blowing, but also terrifying when you realize how many mistakes you've been making up until this point.
https://youtu.be/7g1Acy5eGbE
Or it may be that the conditions to reach that logic are too rare to make the time to measure and optimize worthwhile.
In any case, I understand it as: if you're not bothered about measuring the performance of something, it may be that you don't need to optimize it.
Eliminating breakdowns thus eliminates "waste" in required effort, coordination, etc.
So I don't see how human processes differ from technical ones that much...
https://github.com/janestreet/magic-trace
The most interesting results were cache hits. They could have a huge impact (like 1000x) if you could keep execution and data in lower-level caches.
Often the fix was cut-and-pasting repeated code (as opposed to loops or function calls), FP-like static functions, small stack frames, etc.