> This fact has forced ... there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.
I'm not sure if you meant for these two paragraphs to be related, but asking to make sure:
- Isn't compare-and-swap (CMPXCHG on x86) also a read-modify-write operation, which you mention in the first quoted paragraph is slow?
- I think I've benchmarked LOCK CMPXCHG vs LOCK OR before, with various configurations of reading/writing threads. I was almost sure it was going to be an optimization, and the difference ended up being unobservable. IIRC, some StackOverflow posts led me to the notion that LOCK OR still needs to acquire ownership of the target address in memory (RMW). Do you have any more insights? Cases where LOCK OR is better? Or should I have used a different instruction to set a single bit atomically?
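For reference, the two shapes I was comparing were roughly these (a from-memory sketch, not the actual benchmark; on x86 I'd expect the first to compile to LOCK OR and the second to a LOCK CMPXCHG retry loop):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> flags{0};

    // Unconditional RMW: on x86 this typically becomes LOCK OR.
    void set_bit_or(unsigned bit) {
        flags.fetch_or(uint64_t{1} << bit);
    }

    // CAS loop: on x86 this typically becomes a LOCK CMPXCHG retry loop.
    void set_bit_cas(unsigned bit) {
        uint64_t old = flags.load(std::memory_order_relaxed);
        while (!flags.compare_exchange_weak(old, old | (uint64_t{1} << bit))) {
            // compare_exchange_weak refreshes `old` with the current value
            // on failure, so we just loop and try again.
        }
    }

Either way the core has to grab the cache line exclusively, which is what I assumed explained why I couldn't see a difference.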
As CAS has become more and more important as the world has scaled out, hardware companies have been more willing to favor "performance" in the cost / performance tradeoff. Meaning, it shouldn't surprise you if uncontended CAS is as fast as a fetch-and-or, even though the latter is obviously a much simpler operation logically.
But hardware platforms are a very diverse place.
Generally, if you can design your algorithm with a load-and-store, there's a very good chance you're going to deal with contention much better than w/ CAS. But if the best you can do is a load-and-store followed by a retry loop when the value isn't right, that probably isn't going to be better.
For instance, I have an in-memory debugging "ring buffer" that keeps an "epoch"; threads logging to the ring buffer fetch-and-add themselves an epoch, then mod by the buffer size to find their slot.
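Roughly this shape, as a sketch (names and sizes made up; assume readers tolerate records being overwritten mid-read):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kSlots = 4096;   // power of two, so the mod is a mask

    struct Record { uint64_t epoch; char msg[56]; };

    struct DebugRing {
        std::atomic<uint64_t> epoch{0};
        Record slots[kSlots];

        void log(const char *msg) {
            // fetch-and-add hands each writer its own epoch; mod picks the slot.
            uint64_t e = epoch.fetch_add(1, std::memory_order_relaxed);
            Record &r = slots[e & (kSlots - 1)];
            r.epoch = e;
            std::snprintf(r.msg, sizeof r.msg, "%s", msg);
            // If the buffer is too small, a writer lapping a slow writer can
            // scribble over the same slot -- that's the collision case below.
        }
    };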
Typically, the best performance will happen when I'm keeping one ring buffer per thread-- not too surprising, as there's no contention (though the impact of page faults can potentially slow this down).
If the ring buffer is big enough that a slow writer is never still writing when the next pass through the buffer comes back around to its slot, then the only contention is around the counter, and everything still tends to be pretty good-- but the work the hardware has to do to sync that value will 100% slow things down, despite the fact that there are no retries. If you can't use a big buffer, you have to do something different to get a true ring buffer: for instance, lock each record and send a fast writer back for a new index if it sees a locked slot. The contention still has the effect of slowing things down either way.
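The locked-record variant looks something like this (again just a sketch; a skipped epoch simply leaves a stale slot behind):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    struct Slot {
        std::atomic<bool> busy{false};   // tiny per-record try-lock
        uint64_t epoch;
        char msg[48];
    };

    struct LockedRing {
        std::atomic<uint64_t> epoch{0};
        Slot slots[4096];

        void log(const char *msg) {
            for (;;) {
                uint64_t e = epoch.fetch_add(1, std::memory_order_relaxed);
                Slot &s = slots[e & 4095];
                bool expected = false;
                if (!s.busy.compare_exchange_strong(expected, true,
                                                    std::memory_order_acquire))
                    continue;            // slow writer still here; grab a new index
                s.epoch = e;
                std::snprintf(s.msg, sizeof s.msg, "%s", msg);
                s.busy.store(false, std::memory_order_release);
                return;
            }
        }
    };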
The worst performance will come with the CAS operation though, because when lots of threads are debugging lots of things, there will be lots of retries.
I think this is a misunderstanding.
The baseline isn’t sysv locks. The baseline isn’t even what Linux was doing before futexes (Linux had a very immature lock implementation before futexes).
The baseline is all of the ways folks implement locks if they don’t have futexes, which end up having roughly the same properties as a futex based lock:
- fast path that doesn’t hit kernel for either lock or unlock
- slow path that somehow makes the thread wait until the lock is available using some kernel waiting primitive.
The thing futexes improve is the size of the user-level data structure used to represent the lock in the waiting state. That’s it.
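To make that concrete, here's roughly the shape of a futex-based lock on Linux, in the spirit of the simplified mutexes in Drepper's "Futexes Are Tricky" (a sketch, not production code -- no spin phase, no error handling):

    #include <atomic>
    #include <cstdint>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // The entire lock: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
    std::atomic<uint32_t> word{0};

    static void futex_wait(void *addr, uint32_t expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
    }
    static void futex_wake_one(void *addr) {
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }

    void lock() {
        uint32_t expected = 0;
        // Fast path: uncontended acquire is one CAS in userspace, no syscall.
        if (word.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;
        // Slow path: mark the lock contended and sleep until the holder wakes us.
        // FUTEX_WAIT returns immediately if the word is no longer 2, so a wakeup
        // landing between the exchange and the syscall can't be lost.
        while (word.exchange(2, std::memory_order_acquire) != 0)
            futex_wait(&word, 2);
    }

    void unlock() {
        // Fast path: if nobody marked the lock contended, no syscall either.
        if (word.exchange(0, std::memory_order_release) == 2)
            futex_wake_one(&word);
    }

The only per-lock state is that one 32-bit word; the kernel keys its wait queues off the word's address, so nothing has to be registered with the kernel ahead of time.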
And futexes aren’t the only way to get there. Alternatives:
- thin locks (what JVMs use)
- ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
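The parking-lot idea, for anyone unfamiliar, is basically "a futex implemented in userland": the lock word stays a byte or two, and all waiting state lives in a global hash table of queues keyed by the lock's address. A toy sketch of the waiting half, using a condition variable per bucket where real implementations (WebKit's WTF::ParkingLot, Rust's parking_lot crate) queue per-thread nodes so they can wake exactly one chosen thread:

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    // One bucket per hash of the lock's address.  Collisions are harmless:
    // an unrelated waiter that gets notified re-checks its predicate and
    // goes back to sleep.
    struct Bucket {
        std::mutex m;
        std::condition_variable cv;
    };

    static Bucket &bucket_for(const void *addr) {
        static Bucket buckets[256];
        return buckets[std::hash<const void *>{}(addr) % 256];
    }

    // Sleep until someone calls unpark_all(addr), but only while `should_wait`
    // still says so.  The check happens under the bucket lock, which is what
    // makes this futex-like: a wakeup can't slip in between the check and the wait.
    void park(const void *addr, std::function<bool()> should_wait) {
        Bucket &b = bucket_for(addr);
        std::unique_lock<std::mutex> g(b.m);
        while (should_wait())
            b.cv.wait(g);
    }

    void unpark_all(const void *addr) {
        Bucket &b = bucket_for(addr);
        std::lock_guard<std::mutex> g(b.m);
        b.cv.notify_all();
    }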
Though, I was more coming from the assumption that most people are really learning about the primitives that are actually going to be there, and that they're going to be reaching for in the real world.
So I was thinking more, "what does my language's standard library provide", which has mostly moved from sysv (or worse, sure) -> futex.
It's true though, there have been some recent defections to custom waiting. I haven't looked at any of the well-used parking lot implementations in any depth.
It's a bit of a tangent, but obviously, such constructs can be done completely in userland if you're running your own scheduler. But I assume most of them are not, so if they're not using a futex, I'd assume they're instead writing to a blocking FD and managing their own queues?
If so, how much of a win is that, really? I'm surprised it'd be worth the effort.