Oh man, don't get me started. This was a point in a talk I gave years ago called "Please Please Help the Compiler" (what I thought was a clever cut at the conventional wisdom of the time, "Don't Try to Help the Compiler").
I work on the MSVC backend. I argued pretty strenuously at the time that noexcept was costly and being marketed incorrectly. Perhaps the costs are worth it, but nonetheless there is a cost.
The reason is simple: there is a guarantee here that noexcept functions don't throw - if one does, std::terminate has to be called. That has to be implemented. There is some cost to that - conceptually, every noexcept function (or worse, every call to a noexcept function) is surrounded by a giant try/catch(...) block.
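To make that concrete, here's roughly what the guarantee means, as if you had written the handler yourself (a conceptual sketch, not what any compiler literally emits):

    #include <exception>
    #include <stdexcept>

    void may_throw() { throw std::runtime_error("boom"); }

    // What you write:
    void f() noexcept { may_throw(); }

    // What the guarantee conceptually requires of the implementation:
    void f_lowered() {
        try { may_throw(); }
        catch (...) { std::terminate(); }  // nothing may unwind past here
    }

    int main() { f(); }  // terminates instead of unwinding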
Yes, there are optimizations here. But it's still not free.
Less obvious: how does inlining work? What happens if you inline a noexcept function into a function that allows exceptions? Do we now have "regions" of noexcept-ness inside that function? (Answer: yes.) How do you implement that? Again, this is implementable, but it is even harder than the whole-function case, and a naive/early implementation might prohibit inlining across degrees of noexcept-ness to be correct/as-if. And guess what: this is what early versions of MSVC did, and this was our biggest problem - a problem which grew release after release as noexcept permeated the standard library.
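A sketch of the inlining problem (illustrative only; the function names are made up):

    #include <stdexcept>

    void g() { throw std::runtime_error("boom"); }  // allowed to throw

    void h() noexcept { g(); }  // an escaping exception must terminate

    void caller() {  // caller itself allows exceptions
        h();  // if h() is inlined here, the inlined call to g() must STILL
              // terminate rather than unwind into caller's handlers: a
              // "noexcept region" inside an exception-allowing function.
              // Refusing to inline h() at all is the easy-but-slow way to
              // stay correct.
        try { g(); } catch (...) { /* must never see h()'s exception */ }
    }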
Anyway. My point is, we need more backend compiler engineers on WG21 and not just front end, library, and language lawyer guys.
I argued then that if instead noexcept violations were undefined, we could ignore all this, and instead just treat it as the pure optimization it was being marketed as (ie, help prove a region can't throw, so we can elide entire try/catch blocks etc). The reaction to my suggestion was not positive.

https://github.com/TriangleCppDevelopersGroup/TerryMahaffeyC...

*edit except the stuff about fastlink
*edit 2 also I have since added a heuristic bonus for the "inline" keyword because I could no longer stand the irony of "inline" not having anything to do with inlining
*edit 3 ok, also statements like "consider doing X if you have no security exposure" haven't held up well

I would be very interested in an updated blog post on this if you felt so inclined!
> Anyway. My point is, we need more backend compiler engineers on WG21 and not just front end, library, and language lawyer guys.
Even better: the current way of working is broken. WG21 should only discuss papers that come with a preview implementation, just like in other language ecosystems.
We have had too many features approved with "on paper only" designs that were proven to be bad ideas when they finally got implemented - some of which were removed or changed in later ISO revisions - which already proves the point that this isn't working.
> I argued then that if instead noexcept violations were undefined, we could ignore all this, and instead just treat it as the pure optimization it was being marketed as (ie, help prove a region can't throw, so we can elide entire try/catch blocks etc).
Do you know if the reasoning for originally switching noexcept violations from UB to calling std::terminate was documented anywhere? The corresponding meeting minutes [0] describe the vote to change the behavior but not the reason(s). There's this bit, though:
> [Adamczyk] added that there was strong consensus that this approach did not add call overhead in quality exception handling implementations, and did not restrict optimization unnecessarily.

Did that view not pan out since that meeting?

[0]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n30...
I think WG21 has been violently against adding additional UB to the language because of some Hacker News articles a decade ago about people being alarmed at null pointer checks being elided, or at things happening that didn't match their expectations around signed int overflow, or whatever. Generally it seems a view has spread that compiler implementers view undefined behavior as a license to party, that we're generally having too much fun, and are not to be trusted.
In reality, undefined behavior is useful in the sense that (as in this case) it allows us not to have to write code to consider and handle certain situations - code which may make all situations slower - or allows certain optimizations to exist which work 99% of the time.
Regarding “not pan out”: I think the overhead of noexcept for the single function call case is fine, and inlining is and has always been the issue.

Well, clearly there is a cost.
It's kinda funny that C++, even in recent editions, generally reaches for the UB gun to enable optimizations, but somehow noexcept ended up meaning "well actually, try/catch std::terminate". I bet most C++-damaged people would expect throwing in a noexcept function to simply be UB and potentially blow their heap off or something, instead of being neatly defined behavior with invisible overhead.
Probably the right thing for noexcept would be to enforce a "noexcept may only call noexcept methods" rule, but that ship has sailed. I also understand that it would necessarily create the red/green method problem, but that's sort of unavoidable.
Unless you're C++-damaged enough to assume it's one of those bullshit gaslighting "it might actually not do anything lol" premature optimization keywords, like `constexpr`.
> I argued then that if instead noexcept violations were undefined, we could ignore all this, and instead just treat it as the pure optimization it was being marketed as (ie, help prove a region can't throw, so we can elide entire try/catch blocks etc). The reaction to my suggestion was not positive.
So instead of helping programmers actually write noexcept functions, you wanted to make this an even bigger footgun than it already is? How often are there try/catch blocks that are actually elidable in real-world code? How much performance would actually be gained by doing that, versus the cost of all the security issues this feature would introduce?
If the compiler actually checked that noexcept code can't throw exceptions (i.e. noexcept functions were only allowed to call other noexcept functions), and the only way to get exceptions in noexcept functions was calls to C code which then calls other C++ code that throws, then I would actually agree with you that this would have been OK as UB (since anyway there are no guarantees that even perfectly written C code that gets an exception wouldn't leave your system in a bad state). But with a feature that already relies on programmer care, and can break at every upgrade of a third party library, making this UB seems far too dangerous for far too little gain.
-fno-exceptions only prevents you from calling throw. If you don't want the overhead, you likely want -fno-asynchronous-unwind-tables plus that clang flag that specifies that extern "C" functions don't throw.
> there is a guarantee here that noexcept functions don't throw - if one does, std::terminate has to be called. That has to be implemented
Could you elaborate on how this causes more overhead than without noexcept? The fact that something has to be done when throwing an exception is true in both cases, right? Naively it'd seem like without noexcept, you raise the exception; and with noexcept, you call std::terminate instead. Presumably the compiler is already moving your exception throwing instructions off the happy hot path.
Very very basic test with Clang: https://godbolt.org/z/6aqWWz4Pe
Looks like both variations have similar code structure, with 1 extra instruction for noexcept.
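(For reference, the shape of the test is something like the following - my reconstruction, not necessarily the exact linked snippet:)

    void may_throw();  // opaque: the compiler must assume it can throw

    int plain(int x) {
        may_throw();
        return x + 1;
    }

    int marked(int x) noexcept {  // same body; an escape must terminate
        may_throw();
        return x + 1;
    }

On x86-64 the happy paths come out nearly identical, because the terminate-on-unwind obligation lives in unwind metadata rather than in inline instructions.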
Pick a different architecture - anything 32-bit. Exception handling on 64-bit Windows works differently: the overhead is in the PE headers instead of in the asm directly (and is in general lower). You don't have the setup and teardown in your example.
Throwing an exception has the same overhead in both cases. In the case of a noexcept function, the function has to (or used to have to, depending on the architecture) set up an exception handling frame and remove it when leaving.
> Naively it'd seem like without noexcept, you raise the exception; and with noexcept, you call std::terminate instead
Except you may call a normal function from a noexcept function, and this function may still raise an exception.
If you're on one of the platforms with sane exception handling, it's a matter of emitting different assembly code for the landing pad so that when unwinding it calls std::terminate instead of running destructors for the local scope. Zero additional overhead. If you're on old 32-bit Microsoft Windows using MSVC 6 or something, well, you might have problems. One of the lesser ones being increased overhead for noexcept.
I’m curious: where does the overhead of try/catch come from in a “zero-overhead” implementation?
Is it just that it forces the stack to be “sufficiently unwindable” in a way that might make it hard to apply optimisations that significantly alter the structure of the CFG? I could see inlining and TCO being tricky perhaps?
Or does Windows use a different implementation? Not sure if it uses the Itanium ABI or something else.
Everyone keeps glossing over the inlining issues, which I think are much larger.
“Zero overhead” refers to the actual function's codegen; there are still tables and stuff that have to be updated.
Our implementation of noexcept for the single function case is, I think, fine now. There is a single extra bit in the exception function info which is checked by the unwinder. Other than requiring exception info in cases where we otherwise wouldn't need it, there's no real cost.
The inlining case has always been both more complicated and more of a problem. If your language feature inhibits inlining in any situation, you have a real problem.
Doesn't every function already need exception unwinding metadata? If the function is marked noexcept, then can't you write the logical equivalent of "Unwinding instructions: Don't." and the exception dispatcher can call std::terminate when it sees that?
Nah, that was mostly about extern "C" functions, which technically can't throw (so the noexcept runtime stuff would be optimized out) but in practice there is a ton of code marked extern "C" which throws.
The most common place where noexcept improves performance is on move constructors and move assignments, when moving is cheaper than copying. If your type is not nothrow-moveable, std::vector will copy it instead of moving it when resizing, as the move constructor throwing would leave the vector in an invalid state (while the copy constructor throwing leaves the vector unchanged).
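You can watch this happen with a toy type (a minimal sketch; `Widget` is made up):

    #include <iostream>
    #include <vector>

    struct Widget {
        Widget() = default;
        Widget(const Widget&) { std::cout << "copy\n"; }
        Widget(Widget&&) noexcept { std::cout << "move\n"; }
        // ^ drop the noexcept and the reallocation below prints "copy"
        // instead: vector can't risk a move throwing halfway through.
    };

    int main() {
        std::vector<Widget> v(4);
        v.reserve(v.capacity() + 1);  // force reallocation: "move" x4
    }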
Platforms with setjmp-longjmp based exceptions benefit greatly from noexcept as there’s setup code required before calling functions which may throw. Those platforms are now mostly gone, though. Modern “zero cost” exceptions don’t execute a single instruction related to exception handling if no exceptions are thrown (hence the name), so there just isn’t much room for noexcept to be useful to the optimizer.
Outside of those two scenarios there isn’t any reason to expect noexcept to improve performance.
There is another standard library related scenario: hash tables. The std unordered containers will store the hash of each key unless your hash function is noexcept. Analogous to how vector needs noexcept move for fast reserve and resize, unordered containers need noexcept hash to avoid extra memory usage. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/unordered_as...
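So, per those docs, the hash functor needs to be noexcept for libstdc++ to skip the per-node cached hash (a sketch; this is libstdc++ behavior as described above, not something the standard mandates, and whether caching pays off is workload-dependent):

    #include <cstddef>
    #include <unordered_set>

    struct Id {
        int v;
        bool operator==(const Id& o) const { return v == o.v; }
    };

    struct IdHash {
        // noexcept here lets the container recompute hashes on rehash
        // instead of storing a size_t per node; remove it and every
        // node grows by a cached hash code.
        std::size_t operator()(const Id& id) const noexcept {
            return static_cast<std::size_t>(id.v);
        }
    };

    std::unordered_set<Id, IdHash> ids;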
For many key types and access patterns, storing the hash is faster anyway. I assume people who care about performance are already not using std::unordered_map though.
This is the correct analysis. The article's author could have saved themselves (and the reader) a good amount of blind data diving by learning more about exception processing beforehand.
That's quite interesting, and a huge amount of work has been done here - respect for that.
Here's what jumped out at me: the `noexcept` qualifier is not free in some cases - particularly when a qualified function could actually throw but is marked `noexcept`. In that case, a compiler still must set something up to fulfil the main `noexcept` promise - call `std::terminate()` if an exception is thrown. That means that putting `noexcept` on each and every function blindly, without any regard to whether the function could really throw or not (for example, `std::vector::push_back()` could throw on reallocation failure, so if a `noexcept`-qualified function calls it, a compiler must take that into account), doesn't actually test/benchmark/prove anything, since, as the author correctly said, you won't ever do this in a real production project.
It would be really interesting to take a look at the full code of the cases that showed very bad performance; however, here we're approaching the second issue: if this is the core benchmark code - https://github.com/define-private-public/PSRayTracing/blob/a... - then unfortunately it's totally invalid, since it measures time with `std::chrono::system_clock`, which isn't monotonic. Given how long the code takes to run, it's almost certain that the clock has been adjusted several times...
> in that case, a compiler still must set something up to fulfil the main `noexcept` promise - call `std::terminate()`
This is actually something that has been more of a problem in clang than gcc, due to LLVM IR limitations... but that is being fixed (or maybe already is?). There was a presentation about it at the 2023 LLVM Developers' Meeting which was recently published on their YouTube channel: https://www.youtube.com/watch?v=DMUeTaIe1CU
The short version (as I understand it) is that you don't really need to produce any code to call std::terminate; all you need is to tell the linker to leave a hole in the table which maps %rip to the required unwind actions. If the unwinder doesn't know what to do, it will call std::terminate per the standard.
IR didn't have a way of expressing this "hole", though, so instead clang was forced to emit an explicit "handler" to do the std::terminate call.
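The table-with-holes idea, as a toy model (nothing like a real unwinder's data structures, just the dispatch logic):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <exception>
    #include <vector>

    struct Range { std::uintptr_t lo, hi; int action; };

    void dispatch(const std::vector<Range>& table, std::uintptr_t pc) {
        auto it = std::find_if(table.begin(), table.end(),
            [&](const Range& r) { return pc >= r.lo && pc < r.hi; });
        if (it == table.end())
            std::terminate();  // the "hole": no handler code emitted anywhere
        else
            std::printf("run unwind action %d\n", it->action);
    }

    int main() {
        std::vector<Range> table{{0x1000, 0x2000, 7}};
        dispatch(table, 0x1800);  // in range: runs action 7
        dispatch(table, 0x3000);  // hole -> std::terminate
    }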
In MSVC we've also pretty heavily optimized the whole-function case, such that we no longer have a literal try/catch block around it (I think there is a single bit in our per-function unwind info that the unwinder checks, killing the program if it encounters it while unwinding). One extra branch but no increase in the unwind metadata size.
The inlining case was always the hard problem to solve though
> then unfortunately it's totally invalid, since it measures time with `std::chrono::system_clock`, which isn't monotonic. Given how long the code takes to run, it's almost certain that the clock has been adjusted several times
Monotonic clocks are mostly useful for short measurement periods. For long-term timing, wall-time clocks (with their adjustments) are more accurate because they will drift less.
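i.e., roughly (a sketch of the tradeoff; the `run_*` functions are placeholders):

    #include <chrono>

    // steady_clock is monotonic (never adjusted): best for short intervals.
    // system_clock tracks wall time; NTP may slew/step it, but over hours
    // that correction is exactly what keeps the total honest.
    template <class Clock, class F>
    auto time_with(F&& f) {
        auto t0 = Clock::now();
        f();
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            Clock::now() - t0);
    }

    // time_with<std::chrono::steady_clock>(run_one_function); // micro
    // time_with<std::chrono::system_clock>(run_whole_suite);  // 10+ hours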
Ah, that's a great correction, thank you!
Yes, indeed: due to drift, in order to discern second-plus (?) differences on different machines (or the same machine but different OSes?), one definitely needs to use wall-clock time; otherwise it's comparing apples to oranges. There are a lot of interesting questions related to that, but they're out of the scope of this thread. If I'm not mistaken, the author also timed some individual small functions, which, if correct, still poses a problem to me; but for measuring huge long-running tasks like a full suite running 10+ hours, they were probably right to choose a wall-clock timer.
However, before researching the results any further (for example, the -10% difference in the `noexcept` case is extremely interesting to debug down to the root cause), I'd still like to understand exactly how the code was run and measured. I didn't find a plausible-looking benchmark runner in their code base.
> I didn't know std::uniform_int_distribution doesn't actually produce the same results on different compilers
I think this is genuinely my biggest complaint about the C++ standard library. There are countless scenarios where you want deterministic random numbers (for testing if nothing else), so std's distributions are unusable. Fortunately you can just plug in Boost's implementation.
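Something like this (a sketch, assuming Boost.Random is available; the point is only that the engine is specified but the std distribution isn't):

    #include <random>
    #include <boost/random/uniform_int_distribution.hpp>

    int main() {
        std::mt19937 gen(42);  // engine output IS fully specified

        // Implementation-defined mapping to [1,6]: libstdc++, libc++ and
        // MSVC may all disagree, so not reproducible across stdlibs.
        std::uniform_int_distribution<int> std_d(1, 6);

        // Boost documents its algorithm: a fixed seed gives the same
        // sequence on every platform.
        boost::random::uniform_int_distribution<int> boost_d(1, 6);

        int unstable = std_d(gen);
        int stable = boost_d(gen);
        (void)unstable; (void)stable;
    }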
It's actually really important that uniform_int_distribution is implementation-defined. The 'right' way to do it on one architecture is probably not the right way to do it on a different architecture.
For instance, Apple's new CPUs have very fast division. A convenient and useful way to implement uniform_int_distribution relies on using modulo. So the implementation that runs on Apple's new CPUs ought to use the modulo instructions of the CPU.
On other architectures, the ISA might not even have a modulo instruction. In that case, it's very important that you don't try to emulate modulo in software; it's much better to rely on other, more complicated constructs to give a uniform distribution.
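For the curious, the modulo-based approach being referred to looks something like this (a sketch of the classic rejection method, not any particular library's code):

    #include <cstdint>

    // Assumes rng() returns uniformly distributed uint32_t values, n >= 1.
    std::uint32_t bounded(std::uint32_t (*rng)(), std::uint32_t n) {
        std::uint32_t threshold = (0u - n) % n;  // (2^32 - n) % n
        for (;;) {
            std::uint32_t r = rng();
            if (r >= threshold)  // reject the biased low range
                return r % n;    // cheap only where division is cheap
        }
    }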
C++ is also expected to run on GPUs. NVIDIA's CUDA and AMD's HIP are both implementations of C++. (These implementations are non-compliant given the nature of GPUs, but both they and the C++ standards committee have a shared goal of narrowing that gap.) In general, std::uniform_int_distribution uses loops to eliminate bias; the 'happy path' has relatively easily predicted branches, but there are instances where the branch is not easily predicted and it will, as often as not, have to loop in order to complete. Doing this on a GPU might be multiple orders of magnitude slower than another method that's better suited for a GPU.
Overzealously dictating an implementation is why C++ ended up with a relatively bad hash table and very bad regex in the standard. It's a mistake that shouldn't be made again.
But reproducibility is as important as performance for the vast majority of use cases, if these implementation-defined bits start to affect the observable outcomes. (That's why we define the required time complexity for many container-related functions but do not actually specify the exact algorithm; differences in Big-O time complexity are just large enough to be "observed".)
A common solution is to provide two versions of such features, one for the less reproducible but maximally performant version and another for common middle grounds that can be reproduced reasonably efficiently across many common platforms. In fact I believe `std::chrono` was designed in that way to sidestep many uncertainties in platform clock implementations.
> Overzealously dictating an implementation is why C++ ended up with a relatively bad hash table and very bad regex in the standard.
What parts of the standard dictate a particular regex implementation? IIRC the performance issues are usually blamed on ABI compatibility constraints rather than the standard making a fast(er) implementation impossible.

What do you think of Abseil hash tables randomizing themselves (piggybacking on ASLR) on each start of your program?
However, I personally disagree with them since I think it's really important to have _some_ basic reproducibility for things like reproducing the results of a randomized test. In that case, I'm going to avoid changing as much as possible anyways.
> There are countless scenarios where you want deterministic random numbers (for testing if nothing else), so std's distributions are unusable. Fortunately you can just plug in Boost's implementation.
I don't understand what your complaint is. If you're already plugging in alternative implementations, what stops you from just stubbing these random number generators with any implementation at all?
I don't feel like this article illuminates anything about how noexcept works. The asm diff at the end suggests _there is no difference_ in the emitted code. I plugged it into godbolt myself and see absolutely no difference. https://godbolt.org/z/jdro5jdnG
It seems the selected example function may not be exercising noexcept. I suppose the assumption is that operator[] is something that can throw, but... perhaps the machinery lives outside the function (so we should really examine function calls), or is never emitted without a try/catch, or operator[] (though not marked noexcept...) doesn't throw because OOB access is undefined behavior, or...?
You can't just look at the codegen of the function itself; you also have to consider the metadata, and the overhead of processing any metadata.
Specifically, here is where (as I said in other comments) it goes from a complicated/quality-of-implementation issue to "shit, this is complicated": when you consider inlining. If noexcept inhibits inlining in any conceivable circumstance, then it's having a dramatic (slightly indirect) impact on performance.
> I don't feel like this article illuminates anything about how noexcept works. The asm diff at the end suggests _there is no difference_ in the emitted code.
You are absolutely correct. The OP is basically testing the hypothesis "Wrapping a function in `noexcept` will magically make it faster," which is (1) nonsense to anyone who knows how C++ works, and also (2) trivially easy to falsify, because all you have to do is look at the compiled code. Same codegen? Then it's not going to be faster (or slower). You needn't spend all those CPU cycles to find out what you already know by looking.
There has been a fair bit of literature written on the performance of exceptions and noexcept, but OP isn't contributing anything with this particular post.

Here are two of my own blog posts on the subject. The first one is just an explanation of the "vector pessimization" which was also mentioned (obliquely) in OP's post — but with an actual benchmark where you can see why it matters.

https://quuxplusone.github.io/blog/2022/08/26/vector-pessimi...

https://godbolt.org/z/e4jEcdfT9
The second one is much more interesting, because it shows where `noexcept` can actually have an effect on codegen in the core language. TLDR, it can matter on functions that the compiler can't inline, such as when crossing ABI boundaries or when (as in this case) it's an indirect call through a function pointer.
https://quuxplusone.github.io/blog/2022/07/30/type-erased-in...

https://godbolt.org/z/1asa7Tjq9
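The gist of the second post, compressed (my paraphrase of the shape, not the post's exact code):

    struct Guard { ~Guard(); };  // non-trivial dtor must run on unwind

    int call_plain(void (*f)()) {
        Guard g;
        f();  // unknown callee: caller keeps an unwind path for ~Guard
        return 0;
    }

    // noexcept is part of the function type since C++17:
    int call_marked(void (*f)() noexcept) {
        Guard g;
        f();  // unwinding can't happen; the EH scaffolding can go away
        return 0;
    }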
I would like to have seen a comparison that actually includes -fno-exceptions, rather than just noexcept. My assumption is that to get a consistent gain from noexcept, you would need every function called to be explicitly noexcept, because much of the cost of exceptions is the code size and state required to support unwinding. So if that's where the performance cost of exception handling comes from, then as long as _anything_ can cause an exception (or, more accurately, unless every opaque call is explicitly indicated not to throw), that overhead remains.
That said, I'm still confused by the perf results of the article, especially the Perlin noise vs MSVC one. It's a sufficiently weird outlier that it makes me wonder if something in the compiler has a noexcept path that adds checks that aren't usually on (i.e. imagine the code has a "debug" mode that did bounds checks or something, but the function resolution you hit in the noexcept path always does the bounds check - I'm really not sure exactly how you'd get that to happen, but "non-default path was not benchmarked" is not exactly an uncommon occurrence).
Even a speedup of around 1% (if it is consistent and in a carefully controlled experiment) is significant for many workloads, if the workload is big enough.
The OP puts this within the fuzz, which it may be for that particular workload. But across a giant distributed system like YouTube or Google Search, it is a real gain.
Programs can be quite sensitive to how code is laid out, because of cache line alignment, cache conflicts, etc.
So random changes can have a surprising impact.
There was a paper a couple of years ago explaining this and how to measure compiler optimizations more reliably. Sadly, I do not recall the title/author.