pizlonator (u/pizlonator)

pizlonator commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

> ParkingLot just uses pthread mutex and cond.

That's interesting, I'm more familiar with the Rust parking-lot implementation, which uses futex on Linux [0].

> Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you

It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly. A futex lets you manage the atomic condition yourself, while generic parking utilities encode that state internally. A mutex implementation generally already has a built-in atomic condition with simpler state transitions for each thread in the queue, and so can avoid the additional overhead by making the futex call directly.

[0]: https://github.com/Amanieu/parking_lot/blob/739d370a809878e4...

pizlonator · 5 days ago

> It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly.

No, it absolutely isn’t.

The dominant cost of parking is whatever happens in the kernel and at the microarchitectural level when your thread goes to sleep. That cost is so dominant that whether you park with a futex wait or with a condition variables doesn’t matter at all.

(Source: I’ve done that experiment to death as a lock implementer back when I maintained Jikes RVM’s thin locks and then again when I wrote and maintained ParkingLot.)

pizlonator commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

ibraheemdev · 5 days ago

> And futexes aren’t the only way to get there. Alternatives:

> - thin locks (what JVMs use)

> - ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)

Worth nothing that somewhere under the hood, any modern lock is going to be using a futex (if supported). futex is the most efficient way to park on Linux, so you even want to be using it on the slow path. Your language's thread.park() primitive is almost certainly using a futex.

pizlonator · 5 days ago

ParkingLot just uses pthread mutex and cond.

Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you

pizlonator commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

viega · 5 days ago

If you baseline against what people do when they skip builtin locks, then yes, that's spot on.

Though, I was more coming from the assumption that most people are really learning about the primitives that are going to be there, and they're going to be reaching for in the real world.

So I was thinking more, "what does my language's standard library provide", which has mostly moved from sysv (or worse, sure) -> futex.

It's true though, there have been some recent defections to custom waiting. I haven't looked at any of the well-used parking lot implementations in any depth.

It's a bit of a tangent, but obviously, such constructs can be done completely in userland if you're running your own scheduler. But I assume most of them are not, so if they're not using a futex, I'd assume they're instead writing to a blocking FD and managing their own queues?

If so, how much of a win is that, really? I'm surprised it'd be worth the effort.

pizlonator · 5 days ago

> If you baseline against what people do when they skip builtin locks, then yes, that's spot on.

No idea what you mean by "builtin locks".

For example, the best baseline at the time futexes were introduced would have been MS's CRITICAL_SECTION, which was:

- Builtin in the sense that it shipped with the OS and part of that OS's standard library.

- Not a kernel call except on slow paths.

Prior to futexes, LinuxThreads already provided a lock implementation in pthread_mutex that was better than sysv, just worse than CRITICAL_SECTION, and worse than the futex-based one. But that old implementation already achieved the no-kernel-call fast path.

Also worth noting that today, the fastest widely available lock impl is MS's SRWLock, which obviously isn't futex based. So it's not like futexes are even the best

> So I was thinking more, "what does my language's standard library provide", which has mostly moved from sysv (or worse, sure) -> futex.

No, it hasn't. Before and after futexes, folks use pthread_mutex. Before and after futexes, pthread_mutex was faster than sysv. Before and after futexes, pthread_mutex was not based on sysv.

> It's a bit of a tangent, but obviously, such constructs can be done completely in userland if you're running your own scheduler. But I assume most of them are not, so if they're not using a futex, I'd assume they're instead writing to a blocking FD and managing their own queues?

Both ParkingLot and thin locks bottom out in a lock-and-condition-variable-per-thread implementation (or semaphore-per-thread if that's what the OS has). Then the lock and condition variable (or semaphore) may or may not use futexes.

> If so, how much of a win is that, really? I'm surprised it'd be worth the effort.

Portability. If you build a fast lock implementation on futexes, then it won't work on all OSes. You could get it to be portable to different OSes by virtue of the fact that the major OSes these days provide futex-like primitives, but even then, you'd be in a world of hurt because the details are different.

But ParkingLot can be built out of pthread_mutex/pthread_cond so it'll work well on any POSIX

pizlonator commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

aktau · 6 days ago

Thanks for that reference. Do you know if the JVM still uses thin locks? Did they migrate to thin locks? I ask because I found a 9 year old reference with a JVM calling futex: https://stackoverflow.com/questions/32262946/java-periodical....

pizlonator · 6 days ago

OpenJDK has used thin locks since back when it was called HotSpot.

The fact that the JVM is hanging in futexes doesn’t meant anything for the purpose of this discussion. Futexes are _the_ OS locking primitive on Linux so I would expect modern JVM thin locks to bottom out in a futex syscall. That doesn’t mean that the JVM is “using” futexes in the sense of the original post.

pizlonator commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

pizlonator · 6 days ago

> Going back to the original futuex paper in 2002, it was immediately clear that the futex was a huge improvement in highly concurrent environments. Just in that original paper, their tests with 1000 parallel tasks ran 20-120 times faster than sysv locks..

I think this is a misunderstanding.

The baseline isn’t sysv locks. The baseline isn’t even what Linux was doing before futexes (Linux had a very immature lock implementation before futexes).

The baseline is all of the ways folks implement locks if they don’t have futexes, which end up having roughly the same properties as a futex based lock:

- fast path that doesn’t hit kernel for either lock or unlock

- slow path that somehow makes the thread wait until the lock is available using some kernel waiting primitive.

The thing futexes improve is the size of the user level data structure that is used for representing the lock in the waiting state. That’s it.

And futexes aren’t the only way to get there. Alternatives:

- thin locks (what JVMs use)

- ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)

pizlonator commented on Undefined Behavior in C and C++ (2024) russellw.github.io/undefi... · Posted by u/imadr

gpderetta · 14 days ago

I think you have a non-standard definition. An escaping pointer is an address that the compiler cannot fully track (directly or indirectly). It could be to a syscall, it could be a separately compiled function (without LTO), it could even be to a function in the same translation unit if the compiler cannot inline that function nor do sufficient intraprocedural analysis.

Again, I'm not a compiler writer, but my understanding is that non escaping variables can be optimized in SSA form, escaped variables are otherwise treated as memory and the compiler must be significantly more conservative.

In any case, whether a pointer escapes or not depends purely on the compiler capabilities and optimization level, so it would not be sane making a code well defined or UB depending on the compiler or optimization level.

edit: to be more concrete, do you think that in my example the constant folding of the return into return 1 should be allowed? And if so, which variant of this code would prevent the optimization and why?

pizlonator · 13 days ago

> Again, I'm not a compiler write

I am a compiler writer.

The definition I gave in my post is general enough to cover all possible compilers (ones that have LTO, ones that are inside a DBT, etc).

Yes the constant folding should be allowed because the pointer to the local never escaped.

pizlonator commented on Undefined Behavior in C and C++ (2024) russellw.github.io/undefi... · Posted by u/imadr

OskarS · 14 days ago

> For example the reason why 2s complement took so long is because of some machine that ran C that still existed that was 1s complement.

You're misunderstanding me: as of C++20, there is no other representation in C++ for signed integers other than two's complement (no signed ones' complement, no signed magnitude, nothing else), but signed overflow is still UB. It's not because of obscure machines or hardware, such hardware is not relevant for C++20 and later. The reason for it is performance. From the accepted paper [1]:

> The following polls were taken, and corresponding modifications made to the paper. The main change between [P0907r0] and the subsequent revision is to maintain undefined behavior when signed integer overflow occurs, instead of defining wrapping behavior. This direction was motivated by:

> * Performance concerns, whereby defining the behavior prevents optimizers from assuming that overflow never occurs

You may disagree, you may think they're wrong, but their motivation is performance, that's why this is UB. It's right there in black and white. This was C++, not C, but it's not at all unthinkable that the C standard will also mandate two's complement at some point, and if they do, they almost certainly keep signed overflow undefined for exactly the same reason.

It's not hard to write code that optimizes much better when you use signed loop variables. One of my favorite examples is this function [2] to turn a 3D mesh inside out by flipping the edges of each triangle in a triangle mesh. The godbolt link has two versions of the same function, one with a signed loop variable, one with an unsigned one. The signed one auto-vectorizes and optimizes much better because it can assume that the loop variable never overflows (this version is C++, it's trivial to rewrite it in C and get the same results).

This is why signed overflow is UB.

[1]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p09...

[2]: https://godbolt.org/z/a1P5Y17fn

pizlonator · 13 days ago

I agree that the stated motivation for continuing to keep UB is performance.

I know that this is misguided based on my own perf tests and others’ perf tests.

Also, it’s wrong to say flat out that UB on signed ints is somehow necessary for perf when even a simple perf test shows that it just doesn’t matter, and the optimization it enables is quite obscure.

pizlonator commented on Undefined Behavior in C and C++ (2024) russellw.github.io/undefi... · Posted by u/imadr

gpderetta · 14 days ago

I didn't claim that. What I mean is that if a pointer escapes into an inlined function and no further, it will still prevent further optimizations if we apply your rule that only non-escaping locals don't get addresses. The main benefit of inlining is that it is effectively a simple way to do interprocedurally optimizations. I.e.

  inline void add(int* to, int what) { *to += what; }
  void foo();
  void bar() {
      int x = 0;
      add(&x, 1);
      foo();
      return x;
  }

By your rules, optimizing bar to return the constant 1 would not be allowed.

pizlonator · 14 days ago

I think you’re applying a very strange strawman definition to “nonescaping”. It’s certainly not the definition I would pick.

The right definition is probably something like:

- pointers that come out of the outside world (syscalls) are escaped. They are just integers.

- pointers to locals have provenance. They point to an abstract location. It is up to the implementation to decide when the location gets an integer value (is in an actual address) and what that value is. The implementation must do this no later than when the pointer to the local escapes.

- pointer values passed to the outside world (syscalls) escape.

- pointer values stored in escaped memory also escape, transitively

That’s one possible definition that turns the UB into implementation defined behavior. I’m sure there are others

pizlonator commented on Undefined Behavior in C and C++ (2024) russellw.github.io/undefi... · Posted by u/imadr

tialaramex · 14 days ago

> Second, you could define the meaning of OOB by just saying “pointers are integers"

This means losing a lot of optimisations, so in fact when you say you "don't buy" this argument you only mean that you don't care about optimisation. Which is fine, but this does mean the "improved" C isn't very useful in a lot of applications, might as well choose Java.

pizlonator · 14 days ago

> This means losing a lot of optimisations

You won’t lose “a lot” of optimizations and you certainly won’t lose enough for it to make a noticeable difference in any workload that isn’t SPEC

pizlonator commented on Undefined Behavior in C and C++ (2024) russellw.github.io/undefi... · Posted by u/imadr

gpderetta · 14 days ago

> nonescaping locals don’t get addresses

inlining, interprocedural optimizations.

For example, something as an trivial accessor member function would be hard to optimize.

pizlonator · 14 days ago

Inlining doesn’t require UB