terrymah · a year ago
Oh man, don't get me started. This was a point in a talk I gave years ago called "Please Please Help the Compiler" (what I thought was a clever cut at the conventional wisdom at the time of "Don't Try to Help the Compiler")

I work on the MSVC backend. I argued pretty strenuously at the time that noexcept was costly and being marketed incorrectly. Perhaps the costs are worth it, but nonetheless there is a cost

The reason is simple: there is a guarantee here that exceptions don't escape noexcept functions - if one tries to, std::terminate has to be called. That has to be implemented. There is some cost to that - conceptually every noexcept function (or worse, every call to a noexcept function) is surrounded by a giant try/catch(...) block.
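
A conceptual sketch of that guarantee (this is not what any particular compiler literally emits, and do_work is just a made-up callee):

```cpp
#include <exception>

void do_work() { /* stand-in for a callee that might throw */ }

void f() noexcept {
    do_work();
}

// Semantically, f behaves as if it had been written like this:
void f_as_if() {
    try {
        do_work();
    } catch (...) {
        std::terminate();  // the promise noexcept has to uphold
    }
}

int main() { f(); f_as_if(); }
```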

Yes, there are optimizations here. But it's still not free.

Less obvious: how does inlining work? What happens if you inline a noexcept function into a function that allows exceptions? Do we now have "regions" of noexcept-ness inside that function (answer: yes)? How do you implement that? Again, this is implementable, but it is even harder than the whole-function case, and a naive/early implementation might prohibit inlining across degrees of noexcept-ness to stay correct/as-if. And guess what, this is what early versions of MSVC did, and this was our biggest problem: a problem which grew release after release as noexcept permeated the standard library.
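
A sketch of the inlining problem (the function names are made up): once g is inlined into h, the compiler has to remember that an exception escaping the inlined body must still terminate the program, while an exception from the rest of h must still reach the catch.

```cpp
#include <stdexcept>

int may_throw(int x) {  // stand-in for an opaque callee
    if (x < 0) throw std::runtime_error("negative");
    return x;
}

int g(int x) noexcept {  // exceptions escaping g must call std::terminate
    return may_throw(x);
}

int h(int x) {
    try {
        // If g is inlined here, its body becomes a noexcept "region" inside h:
        // a throw from the first call must terminate the program, a throw from
        // the second call must land in the catch below.
        return g(x) + may_throw(x - 1);
    } catch (...) {
        return -1;
    }
}

int main() { return h(1) == 1 ? 0 : 1; }
```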

Anyway. My point is, we need more backend compiler engineers on WG21 and not just front end, library, and language lawyer guys.

I argued then that if instead noexcept violations were undefined, we could ignore all this, and instead just treat it as the pure optimization it was being marketed as (ie, help prove a region can't throw, so we can elide entire try/catch blocks etc). The reaction to my suggestion was not positive.

terrymah · a year ago
Oh, cool! I googled myself and someone actually archived the slides from the talk I gave. I think it holds up pretty well today

https://github.com/TriangleCppDevelopersGroup/TerryMahaffeyC...

*edit except the stuff about fastlink

*edit 2 also I have since added a heuristic bonus for the "inline" keyword because I could no longer stand the irony of "inline" not having anything to do with inlining

*edit 3 ok, also statements like "consider doing X if you have no security exposure" haven't held up well

jahnu · a year ago
Props for the edits ;)

I would be very interested in an updated blog post on this if you felt so inclined!

pjmlp · a year ago
> Anyway. My point is, we need more backend compiler engineers on WG21 and not just front end, library, and language lawyer guys.

Even better: the current way of working is broken. WG21 should only discuss papers that come with a preview implementation, just like other language ecosystems do.

We have had too many features approved as "on-paper only" designs that turned out to be bad ideas once they were finally implemented, some of which were then removed or changed in later ISO revisions. That alone proves the current process isn't working.

aw1621107 · a year ago
> I argued then that if instead noexcept violations were undefined, we could ignore all this, and instead just treat it as the pure optimization it was being marketed as (ie, help prove a region can't throw, so we can elide entire try/catch blocks etc).

Do you know if the reasoning for originally switching noexcept violations from UB to calling std::terminate was documented anywhere? The corresponding meeting minutes [0] describe the vote to change the behavior but not the reason(s). There's this bit, though:

> [Adamczyk] added that there was strong consensus that this approach did not add call overhead in quality exception handling implementations, and did not restrict optimization unnecessarily.

Did that view not pan out since that meeting?

[0]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n30...

terrymah · a year ago
I think WG21 has been violently against adding additional UB to the language, because of some Hacker News articles a decade ago about people being alarmed at null pointer checks being elided, or things not matching their expectations around signed int overflow, or whatever. Generally, a view seems to have spread that compiler implementers treat undefined behavior as a license to party, that we're generally having too much fun, and are not to be trusted.

In reality undefined behavior is useful in the sense that (as in this case) it allows us to not have to write code to consider and handle certain situations - code which may make all situations slower - or it allows certain optimizations to exist which work 99% of the time.

Regarding “not pan out”: I think the overhead of noexcept for the single function call case is fine, and inlining is and has always been the issue.

flamedoge · a year ago
> did not restrict optimization unnecessarily.

well clearly there is a cost

formerly_proven · a year ago
It's kinda funny that C++ even in recent editions generally reaches for the UB gun to enable optimizations, but somehow noexcept ended up meaning "well actually, try/catch std::terminate". I bet most C++-damaged people would expect throwing in a noexcept function to simply be UB and potentially blow their heap off or something, instead of being neatly defined behavior with invisible overhead.
cogman10 · a year ago
Probably the right thing for noexcept would be to enforce a "noexcept may only call noexcept methods" rule, but that ship has sailed. I also understand that it would necessarily create the red/green method problem, but that's sort of unavoidable.
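
A sketch of what that rule would mean in practice (the names here are made up): the call below is legal today, but under a strict "noexcept calls only noexcept" rule the annotation would have to propagate through the whole call graph, which is exactly the red/green coloring problem.

```cpp
#include <cstdio>

void log_line(const char* msg) {  // not marked noexcept
    std::puts(msg);
}

void tick() noexcept {
    log_line("tick");  // fine in real C++; would be rejected under the
                       // proposed transitive rule unless log_line were
                       // also marked noexcept
}

int main() { tick(); }
```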
shrimp_emoji · a year ago
Unless you're C++-damaged enough to assume it's one of those bullshit gaslighting "it might actually not do anything lol" premature optimization keywords, like `constexpr`.
tsimionescu · a year ago
> I argued then that if instead noexcept violations were undefined, we could ignore all this, and instead just treat it as the pure optimization it was being marketed as (ie, help prove a region can't throw, so we can elide entire try/catch blocks etc). The reaction to my suggestion was not positive.

So instead of helping programmers actually write noexcept functions, you wanted to make this an even bigger footgun than it already is? How often are there try/catch blocks that are actually elideable in real-world code? How much performance would actually be gained by doing that, versus the cost of all of the security issues that this feature would introduce?

If the compiler actually checked that noexcept code can't throw exceptions (i.e. noexcept functions were only allowed to call other noexcept functions), and the only way to get exceptions in noexcept functions was calls to C code which then calls other C++ code that throws, then I would actually agree with you that this would have been OK as UB (since anyway there are no guarantees that even perfectly written C code that gets an exception wouldn't leave your system in a bad state). But with a feature that already relies on programmer care, and can break at every upgrade of a third party library, making this UB seems far too dangerous for far too little gain.

rockwotj · a year ago
Added to my list of reasons why I compile with -fno-exceptions
jcelerier · a year ago
-fno-exceptions only prevents you from calling throw. If you don't want the overhead, you likely want -fno-asynchronous-unwind-tables plus that clang flag that specifies that extern "C" functions don't throw.
mcdeltat · a year ago
> there is a guarantee here that noexcept functions don't throw. std::terminate has to be called. That has to be implemented

Could you elaborate on how this causes more overhead than without noexcept? The fact that something has to be done when throwing an exception is true in both cases, right? Naively it'd seem like without noexcept, you raise the exception; and with noexcept, you call std::terminate instead. Presumably the compiler is already moving your exception-throwing instructions off the happy hot path.

Very very basic test with Clang: https://godbolt.org/z/6aqWWz4Pe Looks like both variations have similar code structure, with 1 extra instruction for noexcept.

hifromwork · a year ago
Pick a different architecture - anything 32-bit. Exception handling on 64-bit Windows works differently: the overhead is in the PE headers instead of in the asm directly (and is in general lower). You don't have the setup and teardown in your example.

Throwing an exception has the same overhead in both cases. In the case of a noexcept function, the function has to (or used to have to, depending on the architecture) set up an exception handling frame and remove it when leaving.

>Naively it'd seem like without noexcept, you raise the exception; and with noexcept, you call std::terminate instead

Except you may call a normal function from a noexcept function, and this function may still raise an exception.

bregma · a year ago
If you're on one of the platforms with sane exception handling, it's a matter of emitting different assembly code for the landing pad so that when unwinding it calls std::terminate instead of running destructors for the local scope. Zero additional overhead. If you're on old 32-bit Microsoft Windows using MSVC 6 or something, well, you might have problems. One of the lesser ones is increased overhead for noexcept.
denotational · a year ago
I’m curious: where does the overhead of try/catch come from in a “zero-overhead” implementation?

Is it just that it forces the stack to be “sufficiently unwindable” in a way that might make it hard to apply optimisations that significantly alter the structure of the CFG? I could see inlining and TCO being tricky perhaps?

Or does Windows use a different implementation? Not sure if it uses the Itanium ABI or something else.

terrymah · a year ago
Everyone keeps scanning over the inlining issues, which I think are much larger

“Zero overhead” refers to the actual function's code gen; there are still tables and stuff that have to be updated

Our implementation of noexcept for the single-function case I think is fine now. There is a single extra bit in the exception function info which is checked by the unwinder. Other than requiring exception info in cases where we otherwise wouldn't emit it, there isn't much cost.

The inlining case has always been both more complicated and more of a problem. If your language feature inhibits inlining in any situation you have a real problem

immibis · a year ago
Doesn't every function already need exception unwinding metadata? If the function is marked noexcept, then can't you write the logical equivalent of "Unwinding instructions: Don't." and the exception dispatcher can call std::terminate when it sees that?
dataflow · a year ago
I assume the /EHr- flag was introduced to mitigate this, right?
terrymah · a year ago
Nah, that was mostly about extern "C" functions, which technically can't throw (so the noexcept runtime stuff would be optimized out), but in practice there is a ton of code marked extern "C" which throws
plorkyeran · a year ago
The most common place where noexcept improves performance is on move constructors and move assignments when moving is cheaper than copying. If your type is not nothrow moveable, std::vector will copy it instead of moving when resizing, as the move constructor throwing would leave the vector in an invalid state (while the copy constructor throwing leaves the vector unchanged).

Platforms with setjmp-longjmp based exceptions benefit greatly from noexcept as there’s setup code required before calling functions which may throw. Those platforms are now mostly gone, though. Modern “zero cost” exceptions don’t execute a single instruction related to exception handling if no exceptions are thrown (hence the name), so there just isn’t much room for noexcept to be useful to the optimizer.

Outside of those two scenarios there isn’t any reason to expect noexcept to improve performance.
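
A minimal sketch of that first case (the type names are made up): on reallocation, std::vector falls back to copying when the element's move constructor isn't noexcept, because it has to preserve the strong exception guarantee.

```cpp
#include <cstdio>
#include <vector>

struct NothrowMove {
    NothrowMove() = default;
    NothrowMove(const NothrowMove&) { std::puts("copy"); }
    NothrowMove(NothrowMove&&) noexcept { std::puts("move"); }
};

struct ThrowingMove {
    ThrowingMove() = default;
    ThrowingMove(const ThrowingMove&) { std::puts("copy"); }
    ThrowingMove(ThrowingMove&&) { std::puts("move"); }  // not noexcept
};

int main() {
    std::vector<NothrowMove> a(1);
    a.reserve(a.capacity() + 1);  // reallocation moves: prints "move"

    std::vector<ThrowingMove> b(1);
    b.reserve(b.capacity() + 1);  // reallocation copies: prints "copy"
}
```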

jzwinck · a year ago
There is another standard library related scenario: hash tables. The std unordered containers will store the hash of each key unless your hash function is noexcept. Analogous to how vector needs noexcept move for fast reserve and resize, unordered containers need noexcept hash to avoid extra memory usage. See https://gcc.gnu.org/onlinedocs/libstdc++/manual/unordered_as...
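
A hedged sketch of that (the Id/IdHash types are made up): marking the hasher's call operator noexcept is the signal libstdc++ looks at when deciding whether to cache hash codes per node, though the implementation applies additional heuristics of its own, so this is an illustration rather than a guarantee.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

struct Id { int value; };
bool operator==(Id a, Id b) { return a.value == b.value; }

// Hypothetical hasher for illustration: the noexcept on operator() is what
// allows the implementation to consider skipping the per-node cached hash.
struct IdHash {
    std::size_t operator()(Id id) const noexcept {
        return std::hash<int>{}(id.value);
    }
};

int main() {
    std::unordered_set<Id, IdHash> ids;
    ids.insert(Id{42});
}
```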
anonymoushn · a year ago
For many key types and access patterns, storing the hash is faster anyway. I assume people who care about performance are already not using std::unordered_map though.
10tacobytes · a year ago
This is the correct analysis. The article's author could have saved themselves (and the reader) a good amount of blind data diving by learning more about exception processing beforehand.
Arech · a year ago
That's quite interesting, and a huge amount of work has been done here; respect for that.

Here's what jumped out at me: the `noexcept` qualifier is not free in some cases, particularly when the qualified function could actually throw but is marked `noexcept`. In that case, a compiler still must set something up to fulfil the main `noexcept` promise - call `std::terminate()` if an exception is thrown. That means that putting `noexcept` blindly on each and every function, without any regard to whether the function could really throw or not (for example, `std::vector::push_back()` could throw on reallocation failure, so if a `noexcept`-qualified function calls it, the compiler must take that into account), doesn't actually test/benchmark/prove anything, since, as the author correctly said, you won't ever do this in a real production project.

It would be really interesting to take a look at the full code of the cases that showed very bad performance. However, here we're approaching the second issue: if this is the core benchmark code: https://github.com/define-private-public/PSRayTracing/blob/a... then unfortunately it's totally invalid, since it measures time with `std::chrono::system_clock`, which isn't monotonic. Given how long the code takes to run, it's almost certain that the clock has been adjusted several times...
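
For reference, a minimal sketch of what I mean by measuring with a monotonic clock (`time_it` is just an illustrative helper):

```cpp
#include <chrono>
#include <cstdio>

// steady_clock is monotonic: it cannot jump when the system time is adjusted,
// unlike system_clock.
template <typename Fn>
std::chrono::milliseconds time_it(Fn&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}

int main() {
    const auto elapsed = time_it([] { /* benchmarked work goes here */ });
    std::printf("%lld ms\n", static_cast<long long>(elapsed.count()));
}
```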

bodyfour · a year ago
> in that case, a compiler still must set something up to fulfil the main `noexcept` promise - call `std::terminate()`

This is actually something that has been more of a problem in clang than gcc due to LLVM IR limitations... but that is being fixed (or maybe already is?). There was a presentation about it at the 2023 LLVM Developers' Meeting, which was recently published on their YouTube channel: https://www.youtube.com/watch?v=DMUeTaIe1CU

The short version (as I understand it) is that you don't really need to produce any code to call std::terminate; all you need to do is tell the linker to leave a hole in the table which maps %rip to the required unwind actions. If the unwinder doesn't know what to do, it will call std::terminate per the standard.

IR didn't have a way of expressing this "hole", though, so instead clang was forced to emit an explicit "handler" to do the std::terminate call.

terrymah · a year ago
In MSVC we've also pretty heavily optimized the whole function case such that we no longer have a literal try/catch block around it (I think there is a single bit in our per function unwind info that the unwinder checks and kills the program if it encounters while unwinding). One extra branch but no increase in the unwind metadata size

The inlining case was always the hard problem to solve though

zokier · a year ago
> then unfortunately it's totally invalid, since it measures time with `std::chrono::system_clock`, which isn't monotonic. Given how long the code takes to run, it's almost certain that the clock has been adjusted several times

Monotonic clocks are mostly useful for short measurement periods. For long-term timing, wall-time clocks (with their adjustments) are more accurate because they will drift less.

Arech · a year ago
Ah, that's a great correction, thank you! Yes, indeed: because of drift, in order to discern second-plus differences across different machines (or the same machine but different OSes?), one definitely needs to use wall-clock time, otherwise it's comparing apples to oranges. There are a lot of interesting questions related to that, but they are out of the scope of this thread. If I'm not mistaken, the author also timed some individual small functions, which, if correct, still poses a problem to me; but for measuring huge long-running tasks like a full suite running 10+ hours, they are probably right in choosing a wall-clock timer indeed.

However, before researching the results any further (for example, the -10% difference for the `noexcept` case is extremely interesting to debug down to the root cause), I'd still like to understand how exactly the code was run and measured. I didn't find a plausible-looking benchmark runner in their code base.

TillE · a year ago
> I didn't know std::uniform_int_distribution doesn't actually produce the same results on different compilers

I think this is genuinely my biggest complaint about the C++ standard library. There are countless scenarios where you want deterministic random numbers (for testing if nothing else), so std's distributions are unusable. Fortunately you can just plug in Boost's implementation.
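
To make the distinction concrete: the engines themselves (e.g. std::mt19937) are specified bit-for-bit; it's only the distributions that are implementation-defined. So a hand-rolled mapping like this sketch (a Lemire-style multiply-shift, slightly biased and purely illustrative) is portable where std::uniform_int_distribution isn't.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>

// Illustrative bounded mapping: unlike std::uniform_int_distribution, it
// produces the same values on every conforming implementation because
// std::mt19937's output sequence is fully specified by the standard.
std::uint32_t bounded(std::mt19937& gen, std::uint32_t bound) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(gen()) * bound) >> 32);
}

int main() {
    std::mt19937 gen(12345);  // fixed seed
    std::printf("%u\n", static_cast<unsigned>(bounded(gen, 100)));
}
```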

nwallin · a year ago
It's actually really important that uniform_int_distribution is implementation defined. The 'right' way to do it on one architecture is probably not the right way to do it on a different architecture.

For instance, Apple's new CPUs have very fast division. A convenient and useful way to implement uniform_int_distribution relies on using modulo. So the implementation that runs on Apple's new CPUs ought to use the modulo instructions of the CPU.

On other architectures, the ISA might not even have a modulo instruction. In this case, it's very important that you don't try to emulate modulo in software; it's much better to rely on other, more complicated constructs to give a uniform distribution.

C++ is also expected to run on GPUs. NVIDIA's CUDA and AMD's HIP are both implementations of C++. (These implementations are non-compliant given the nature of GPUs, but both they and the C++ standards committee have a shared goal of narrowing that gap.) In general, std::uniform_int_distribution uses rejection loops to eliminate bias; the happy path has relatively easily predicted branches, but there are cases where the branch is not easily predicted and the code will, as often as not, have to loop in order to complete. Doing this on a GPU might be multiple orders of magnitude slower than another method that's better suited for a GPU.

Overzealously dictating an implementation is why C++ ended up with a relatively bad hash table and very bad regex in the standard. It's a mistake that shouldn't be made again.

lifthrasiir · a year ago
But reproducibility is as important as performance for the vast majority of use cases, if these implementation-defined bits start to affect the observable outcomes. (That's why we define the required time complexity for many container-related functions but do not actually specify the exact algorithm; a difference in Big-O time complexity is just large enough to be "observed".)

A common solution is to provide two versions of such features, one for the less reproducible but maximally performant version and another for common middle grounds that can be reproduced reasonably efficiently across many common platforms. In fact I believe `std::chrono` was designed in that way to sidestep many uncertainties in platform clock implementations.

aw1621107 · a year ago
> Overzealously dictating an implementation is why C++ ended up with a relatively bad hash table and very bad regex in the standard.

What parts of the standard dictate a particular regex implementation? IIRC the performance issues are usually blamed on ABI compatibility constraints rather than the standard making a fast(er) implementation impossible.

myworkinisgood · a year ago
Nobody is using the standard library for high-performance random number implementations.
quotemstr · a year ago
> I think this is genuinely my biggest complaint about the C++ standard library

What do you think of Abseil hash tables randomizing themselves (piggybacking on ASLR) on each start of your program?

slaymaker1907 · a year ago
Their justification is here https://github.com/abseil/abseil-cpp/issues/720

However, I personally disagree with them, since I think it's really important to have _some_ basic reproducibility for things like reproducing the results of a randomized test. In that case, I'm going to avoid changing as much as possible anyway.

chipdart · a year ago
> There are countless scenarios where you want deterministic random numbers (for testing if nothing else), so std's distributions are unusable. Fortunately you can just plug in Boost's implementation.

I don't understand what your complaint is. If you're already plugging in alternative implementations, what stops you from stubbing these random number generators with any implementation at all?

akira2501 · a year ago
It's a compromised and goofy implementation with lots of warts. What's the point in having a /standard/ library then?
hoten · a year ago
I don't feel like this article illuminates anything about how noexcept works. The asm diff at the end suggests _there is no difference_ in the emitted code. I plugged it into godbolt myself and see absolutely no difference. https://godbolt.org/z/jdro5jdnG

It seems the selected example function may not be exercising noexcept. I suppose the assumption is that operator[] is something that can throw, but ... perhaps the machinery lives outside the function (so should really examine function calls), or is never emitted without a try/catch, or operator[] (though not marked noexcept...) doesn't throw b/c OOB is undefined behavior, or ... ?

terrymah · a year ago
You can't just look at the codegen of the function itself; you also have to consider the metadata, and the overhead of processing any metadata.

Specifically here (as I said in other comments), where it goes from a complicated/quality-of-implementation issue to "shit, this is complicated" is when you consider inlining. If noexcept inhibits inlining in any conceivable circumstance then it's having a dramatic (slightly indirect) impact on performance.

quuxplusone · a year ago
> I don't feel like this article illuminates anything about how noexcept works. The asm diff at the end suggests _there is no difference_ in the emitted code.

You are absolutely correct. The OP is basically testing the hypothesis "Wrapping a function in `noexcept` will magically make it faster," which is (1) nonsense to anyone who knows how C++ works, and also (2) trivially easy to falsify, because all you have to do is look at the compiled code. Same codegen? Then it's not going to be faster (or slower). You needn't spend all those CPU cycles to find out what you already know by looking.

There has been a fair bit of literature written on the performance of exceptions and noexcept, but OP isn't contributing anything with this particular post.

Here are two of my own blog posts on the subject. The first one is just an explanation of the "vector pessimization" which was also mentioned (obliquely) in OP's post — but with an actual benchmark where you can see why it matters. https://quuxplusone.github.io/blog/2022/08/26/vector-pessimi... https://godbolt.org/z/e4jEcdfT9

The second one is much more interesting, because it shows where `noexcept` can actually have an effect on codegen in the core language. TLDR, it can matter on functions that the compiler can't inline, such as when crossing ABI boundaries or when (as in this case) it's an indirect call through a function pointer. https://quuxplusone.github.io/blog/2022/07/30/type-erased-in...
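
Not the post's exact example, but a sketch of the flavor: since C++17, noexcept is part of the function type, so for an indirect call it's the pointer's type rather than the (invisible) callee that tells the compiler whether an exception can escape.

```cpp
using Callback = void (*)() noexcept;  // noexcept is part of the type since C++17

void run(Callback cb) {
    cb();  // the compiler may assume no exception propagates out of this call
}

void handler() noexcept {}

int main() { run(&handler); }
```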

hoten · a year ago
That's what I'm talking about! Thanks for sharing, I learned quite a few things about noexcept from your articles.
secondcoming · a year ago
The example is bad. Maybe this illustrates it better:

https://godbolt.org/z/1asa7Tjq9

olliej · a year ago
I would like to have seen a comparison that actually includes -fno-exceptions, rather than just noexcept. My assumption is that to get a consistent gain from noexcept, you would need every function called to be explicitly noexcept, because a bunch of the cost of exceptions is the code size and state required to support unwinding. So if that's where the performance cost of exception handling comes from, then if _anything_ can cause an exception (or, I guess more accurately, unless every opaque call is explicitly indicated not to cause an exception), that overhead remains.

That said, I'm still confused by the perf results of the article, especially the Perlin noise vs MSVC one. It's a sufficiently weird outlier that it makes me wonder if something in the compiler has a noexcept path that adds checks that aren't usually on (i.e. imagine the code has a "debug" mode that does bounds checks or something, but the function resolution you hit in the noexcept path always does the bounds check - I'm really not sure exactly how you'd get that to happen, but "the non-default path was not benchmarked" is not exactly an uncommon occurrence).

compiler-guy · a year ago
Even a speedup of around 1% (if it is consistent and in a carefully controlled experiment) is significant for many workloads, if the workload is big enough.

The OP treats this as within the noise, which it may be for that particular workload. But across a giant distributed system like YouTube or Google Search, it is a real gain.

muth02446 · a year ago
RE: unexpected performance degradation

Programs can be quite sensitive to how code is laid out, because of cache line alignment, cache conflicts, etc.

So random changes can have a surprising impact.

There was a paper a couple of years ago explaining this and how to measure compiler optimizations more reliably. Sadly, I do not recall the title/author.

Arech · a year ago
It would be super interesting to read that paper. Please post a link or some more details if you remember them.