> This fact has forced ... there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.
I'm not sure if you meant for these two paragraphs to be related, but asking to make sure:
- Isn't compare-and-swap (CMPXCHG on x86) also a read-modify-write operation, which you mention in the first quoted paragraph is slow?
- I think I've benchmarked LOCK CMPXCHG vs LOCK OR before, with various configurations of reading/writing threads. I was almost sure it was going to be an optimization, and the difference ended up being unobservable. IIRC, some StackOverflow posts led me to the notion that LOCK OR still needs to acquire ownership of the target address in memory (RMW). Do you have any more insights? Cases where LOCK OR is better? Or should I have used a different instruction to set a single bit atomically?
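For reference, the two shapes I was comparing were roughly these (a from-memory sketch, not the actual benchmark; on x86 I'd expect the first to compile to LOCK OR and the second to a LOCK CMPXCHG retry loop):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> flags{0};

    // Unconditional RMW: on x86 this typically becomes LOCK OR.
    void set_bit_or(unsigned bit) {
        flags.fetch_or(uint64_t{1} << bit);
    }

    // CAS loop: on x86 this typically becomes a LOCK CMPXCHG retry loop.
    void set_bit_cas(unsigned bit) {
        uint64_t old = flags.load(std::memory_order_relaxed);
        while (!flags.compare_exchange_weak(old, old | (uint64_t{1} << bit))) {
            // compare_exchange_weak refreshes `old` with the current value
            // on failure, so we just loop and try again.
        }
    }

Either way the core has to grab the cache line exclusively, which is what I assumed explained why I couldn't see a difference.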
As CAS has become more and more important as the world has scaled out, hardware companies have been more willing to favor "performance" in the cost / performance tradeoff. Meaning, it shouldn't surprise you if uncontended CAS is as fast as a fetch-and-or, even though the latter is obviously a much simpler operation logically.
But hardware platforms are a very diverse place.
Generally, if you can design your algorithm with a load-and-store, there's a very good chance you're going to deal with contention much better than w/ CAS. But if the best you can do is a load-and-store followed by a retry loop when the value isn't right, that probably isn't going to be better.
For instance, I have an in-memory debugging "ring buffer" that keeps an "epoch"; threads logging to the ring buffer fetch-and-add themselves an epoch, then mod by the buffer size to find their slot.
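Roughly this shape, as a sketch (names and sizes made up; assume readers tolerate records being overwritten mid-read):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kSlots = 4096;   // power of two, so the mod is a mask

    struct Record { uint64_t epoch; char msg[56]; };

    struct DebugRing {
        std::atomic<uint64_t> epoch{0};
        Record slots[kSlots];

        void log(const char *msg) {
            // fetch-and-add hands each writer its own epoch; mod picks the slot.
            uint64_t e = epoch.fetch_add(1, std::memory_order_relaxed);
            Record &r = slots[e & (kSlots - 1)];
            r.epoch = e;
            std::snprintf(r.msg, sizeof r.msg, "%s", msg);
            // If the buffer is too small, a writer lapping a slow writer can
            // scribble over the same slot -- that's the collision case below.
        }
    };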
Typically, the best performance will happen when I'm keeping one ring buffer per thread-- not too surprising, as there's no contention (though the impact of page faults can potentially slow this down).
If the ring buffer is big enough that a slow writer is never still writing when the next pass through the buffer comes back around to its slot, then the only contention is around the counter, and everything still tends to be pretty good-- but the work the hardware has to do to sync that value will 100% slow things down, despite the fact that there are no retries. If you can't use a big buffer, you have to do something different to get a true ring buffer: for instance, lock each record and send a fast writer back for a new index if it sees a locked slot. The contention still has the effect of slowing things down either way.
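The locked-record variant looks something like this (again just a sketch; a skipped epoch simply leaves a stale slot behind):

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    struct Slot {
        std::atomic<bool> busy{false};   // tiny per-record try-lock
        uint64_t epoch;
        char msg[48];
    };

    struct LockedRing {
        std::atomic<uint64_t> epoch{0};
        Slot slots[4096];

        void log(const char *msg) {
            for (;;) {
                uint64_t e = epoch.fetch_add(1, std::memory_order_relaxed);
                Slot &s = slots[e & 4095];
                bool expected = false;
                if (!s.busy.compare_exchange_strong(expected, true,
                                                    std::memory_order_acquire))
                    continue;            // slow writer still here; grab a new index
                s.epoch = e;
                std::snprintf(s.msg, sizeof s.msg, "%s", msg);
                s.busy.store(false, std::memory_order_release);
                return;
            }
        }
    };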
The worst performance will come with the CAS operation though, because when lots of threads are debugging lots of things, there will be lots of retries.
I think this is a misunderstanding.
The baseline isn’t sysv locks. The baseline isn’t even what Linux was doing before futexes (Linux had a very immature lock implementation before futexes).
The baseline is all of the ways folks implement locks if they don’t have futexes, which end up having roughly the same properties as a futex based lock:
- fast path that doesn’t hit kernel for either lock or unlock
- slow path that somehow makes the thread wait until the lock is available using some kernel waiting primitive.
The thing futexes improve is the size of the user-level data structure used to represent the lock in the waiting state. That’s it.
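To make that concrete, here's roughly the shape of a futex-based lock on Linux, in the spirit of the simplified mutexes in Drepper's "Futexes Are Tricky" (a sketch, not production code -- no spin phase, no error handling):

    #include <atomic>
    #include <cstdint>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // The entire lock: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
    std::atomic<uint32_t> word{0};

    static void futex_wait(void *addr, uint32_t expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
    }
    static void futex_wake_one(void *addr) {
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
    }

    void lock() {
        uint32_t expected = 0;
        // Fast path: uncontended acquire is one CAS in userspace, no syscall.
        if (word.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;
        // Slow path: mark the lock contended and sleep until the holder wakes us.
        // FUTEX_WAIT returns immediately if the word is no longer 2, so a wakeup
        // landing between the exchange and the syscall can't be lost.
        while (word.exchange(2, std::memory_order_acquire) != 0)
            futex_wait(&word, 2);
    }

    void unlock() {
        // Fast path: if nobody marked the lock contended, no syscall either.
        if (word.exchange(0, std::memory_order_release) == 2)
            futex_wake_one(&word);
    }

The only per-lock state is that one 32-bit word; the kernel keys its wait queues off the word's address, so nothing has to be registered with the kernel ahead of time.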
And futexes aren’t the only way to get there. Alternatives:
- thin locks (what JVMs use)
- ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)
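The parking-lot idea, for anyone unfamiliar, is basically "a futex implemented in userland": the lock word stays a byte or two, and all waiting state lives in a global hash table of queues keyed by the lock's address. A toy sketch of the waiting half, using a condition variable per bucket where real implementations (WebKit's WTF::ParkingLot, Rust's parking_lot crate) queue per-thread nodes so they can wake exactly one chosen thread:

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    // One bucket per hash of the lock's address.  Collisions are harmless:
    // an unrelated waiter that gets notified re-checks its predicate and
    // goes back to sleep.
    struct Bucket {
        std::mutex m;
        std::condition_variable cv;
    };

    static Bucket &bucket_for(const void *addr) {
        static Bucket buckets[256];
        return buckets[std::hash<const void *>{}(addr) % 256];
    }

    // Sleep until someone calls unpark_all(addr), but only while `should_wait`
    // still says so.  The check happens under the bucket lock, which is what
    // makes this futex-like: a wakeup can't slip in between the check and the wait.
    void park(const void *addr, std::function<bool()> should_wait) {
        Bucket &b = bucket_for(addr);
        std::unique_lock<std::mutex> g(b.m);
        while (should_wait())
            b.cv.wait(g);
    }

    void unpark_all(const void *addr) {
        Bucket &b = bucket_for(addr);
        std::lock_guard<std::mutex> g(b.m);
        b.cv.notify_all();
    }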
Though, I was more coming from the assumption that most people are really learning about the primitives that are actually going to be there, and that they're going to be reaching for in the real world.
So I was thinking more, "what does my language's standard library provide", which has mostly moved from sysv (or worse, sure) -> futex.
It's true though, there have been some recent defections to custom waiting. I haven't looked at any of the well-used parking lot implementations in any depth.
It's a bit of a tangent, but obviously, such constructs can be done completely in userland if you're running your own scheduler. But I assume most of them are not, so if they're not using a futex, I'd assume they're instead writing to a blocking FD and managing their own queues?
If so, how much of a win is that, really? I'm surprised it'd be worth the effort.