Using a futex to allow ticket locks to block in the kernel does have a thundering herd problem. Fortunately you can do “smart sleeps” by measuring the time elapsed since the last wakeup and the change in the turn counter since then, and using that to estimate the time until the current thread’s turn (we don’t want to block the queue by oversleeping, so sleep for half of that estimate, then redo the estimate again, etc). (The same logic can be applied to spinning during the waiter’s timeslice, to avoid unnecessary cache coherence traffic.)
I assume you’re referring to Guy Blelloch’s work on wait-free hazard pointers? That’s very cool although I doubt it’s solving a practical problem. The practical issue with hazard pointers is the StoreLoad barrier on the read side (which can be moved to the GC side via membarrier(2) etc).
There's a couple of other ways to achieve wait freedom on hazard pointer loads (protect) but they haven't been published so they're kind of moot.
It seems like there’s these fundamental things in OSes that we just can’t improve, or I suppose can’t without breaking too much backward compatibility, so we are forced to do this.
There's a software equivalent of the Peter Principle where software or an API becomes increasingly complex to the point where no one understands it. They then attempt to fix that by adding more functionality (complexity).