It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
The epoch isn’t CAS’d; it’s FAA’d, so there’s no retry loop. The epoch is then used to determine whether contention is due to the tail meeting the head, or due to a wrap-around caused by slow writes.
There’s also a back-off scheme to ease contention for a full queue.
Though, I did originally have a variant that adds a fairly straightforward ‘help’ mechanism that makes the algorithm wait-free and reduces the computational complexity.
However, the extra overhead didn’t seem worth it, so I took it out pretty quickly. Iirc, the only place where the ring wouldn’t out-perform the wait-free variant is on tiny queues with a huge write imbalance.
If you go run the tests in the repo associated with the article, you’ll probably see that a ring with only 16 entries tends to start being non-performant at about a 4:1 writer-to-reader ratio. But iirc that effect goes away before 128 slots in the ring.
There, the ring still fits in a page, and even with a big imbalance I can’t remember seeing less than 1m ops per second on my Mac laptop.
Real world observational data beats worst case analysis, and I’ve never seen an issue for scenarios I consider reasonable.
But if my “unrealistic” is realistic to you, let me know and I can break out the wait-free version for you.
The terms relaxed, acquire, and release refer to how an atomic operation is ordered against other accesses to memory.
Counter to what the article states, a relaxed atomic is still atomic: it cannot tear, and for RMW atomics, no other access can slip in between the read and the write. But a relaxed atomic does not order other accesses, which can lead to unintuitive outcomes.
By contrast, once you've observed another thread's release store with an acquire load, you're guaranteed that your subsequent memory accesses "happen after" all of that thread's accesses from before the release store. That's what you'd intuitively expect; it's just that modern systems (which are really highly distributed systems, even on a single chip) pay a cost to establish this kind of guarantee, which is why you can opt out of it with relaxed atomics if you know what you're doing.
Which is based on: https://ieeexplore.ieee.org/document/9490347
However, this is well written and very easy to read.
However, I definitely had not seen that paper before, but I just spent a few minutes with it. Section 5.2 seems to cover their version. It's far from clear what their algorithm is, but they are talking about page eviction; it doesn't seem like they're using even a single fixed array for the ring, though I'm not 100% sure because the paper is pretty light on detail.
I was hoping to find at least one “c’mon, this is easy to avoid with X thing in the kernel/OS” info nugget dropped
The article does mention wanting to coordinate across multiple identical processes running on the same node in a wide variety of environments as the motivator.
So maybe it should be a feature, not a bug :)