I know the author doesn't intend this code to be "production ready", but I just wanted to point out a problem that may not be completely obvious if you are trying to use this structure for multithreaded communication.
The structure declares the read and write indices like so:
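Presumably that declaration is just two adjacent index members, roughly along these lines (a hypothetical reconstruction, not the article's actual code):

    #include <cstdint>

    template <typename T, typename IndexT = uint32_t>
    struct RingBuffer {
        IndexT read_idx;    // advanced by the consumer
        IndexT write_idx;   // advanced by the producer
        T*     data;        // backing storage
    };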
These two variables are going to be stored next to each other in memory. If IndexT is, say, a 32-bit int, then read_idx and write_idx will be 4 bytes each. A cache line is 64 bytes on Intel, so these two variables are going to end up on the same cache line.
The problem with this is that cache coherence protocols work at the granularity of lines, not bytes. So when one core writes anywhere on the line (for example, the producer writing write_idx), the other core (say, the consumer's core) has to invalidate its copy of that line and get the new value off the bus, or maybe even hit memory.
Regardless of the specifics, the point is that the consumer and producer cores can no longer write to their respective variables independently. Whenever one writes to its variable, cache coherence operations are needed to update the line that variable lives in on the other core.
This is false sharing, since the producer and consumer don't actually need to access each other's variable. The cores are just forced to share access because the variables are on the same cache line.
This can have a big impact on multithreaded workloads and bring down overall throughput.
One solution is to add some padding between the variables. Something like the layout sketched below will do: it puts the variables on different cache lines, and you might want to pad out some other variables onto different cache lines as well.
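A minimal sketch of that padding, assuming 64-byte cache lines and 32-bit indices (illustrative, not the commenter's exact snippet):

    #include <cstdint>

    struct RingBufferHeader {
        uint32_t read_idx;                     // consumer-owned
        char     pad[64 - sizeof(uint32_t)];   // push write_idx onto the next cache line
        uint32_t write_idx;                    // producer-owned
    };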
You probably know "false sharing" is the official computer-science term for this, but for those who don't and want to learn more about it:
https://en.wikipedia.org/wiki/False_sharing
Yes, it's not meant to be production ready. Still, the padding you added is particular to Intel/AMD CPUs and this article is more about constrained memory/embedded.
On Intel you would not care about the missing slot when you have 256GB memory available.
I've added the atomic<> in the GitHub repo and updated the article as well.
However, atomic<> adds an mfence instruction which is not really necessary and might add a ton of latency.
You can specify the memory ordering of operations to get the desired fences.
For an SPSC queue, just load-acquires and store-releases are sufficient, and they have no overhead at all on x86.
Don't use operator++ to increment (it will emit an expensive lock xadd); just code the explicit load + add + store sequence. It is safe in this specific case.
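A sketch of what that looks like on the producer side of an SPSC queue (illustrative names, power-of-two capacity assumed; not the article's actual code):

    #include <atomic>
    #include <cstdint>

    template <typename T, uint32_t N>            // N assumed to be a power of two
    struct SpscRing {
        std::atomic<uint32_t> read_idx{0};       // written only by the consumer
        std::atomic<uint32_t> write_idx{0};      // written only by the producer
        T data[N];

        bool try_push(const T& v) {              // called only from the producer thread
            uint32_t w = write_idx.load(std::memory_order_relaxed);  // we own write_idx
            uint32_t r = read_idx.load(std::memory_order_acquire);   // observe the consumer
            if (w - r == N) return false;                            // full
            data[w % N] = v;                                         // fill the slot first...
            write_idx.store(w + 1, std::memory_order_release);       // ...then publish: a plain mov on x86
            return true;                                             // explicit load+add+store, no lock xadd
        }
    };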
Does this actually help? Both producer and consumer have to read both pointers on every operation to determine whether the buffer is full or empty, even though each only updates one of the pointers, which is what makes the buffer thread-safe. So if there is actually an advantage to not having both pointers in the same cache line, it is not totally obvious, at least to me.
The usual trick is for each side to cache the last value of the other side's index, so you only need to refresh the remote value when you "catch up" (sketched below).
Even without caching, padding is still a win, as the shared-to-exclusive transition is less expensive than a request for ownership of an invalidated cache line (aka the "single writer principle").
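For example, on the producer side (building on the hypothetical SpscRing above; cached_read_idx is private to the producer, so the common case touches no line written by the consumer):

    uint32_t cached_read_idx = 0;   // producer's local copy of the consumer's index

    bool try_push(const T& v) {
        uint32_t w = write_idx.load(std::memory_order_relaxed);
        if (w - cached_read_idx == N) {                            // looks full: refresh the remote value
            cached_read_idx = read_idx.load(std::memory_order_acquire);
            if (w - cached_read_idx == N) return false;            // really full
        }
        data[w % N] = v;
        write_idx.store(w + 1, std::memory_order_release);
        return true;
    }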
I'm not confident that this helps. Both the read and write index must be loaded in order to write (avoid overflow) or read (avoid underflow) any value from the queue. They may as well be on the same cache line. In either the author's code or your code, one dirty cache line will be invalidated and then reloaded by the other core.
IMO, it takes a detailed analysis of the cache coherence protocol in use by the target processor to determine which method actually results in fewer bus transactions.
Is there a way of hinting this to the compiler, rather than adding explicit padding? Any way of saying "please don't put the following on the same cache line"?
http://en.cppreference.com/w/cpp/language/alignas
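For example, a sketch assuming 64-byte cache lines (alignas is C++11; the names are illustrative):

    #include <atomic>
    #include <cstdint>

    struct Indices {
        alignas(64) std::atomic<uint32_t> read_idx{0};    // starts on its own cache line
        alignas(64) std::atomic<uint32_t> write_idx{0};   // forced onto a different line
    };
    static_assert(sizeof(Indices) == 128, "each index occupies its own 64-byte line");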
I'm curious what you think the problems are that unbounded indices cause? Unsigned overflow is well defined in C++, so I don't see the problem here. AFAIK, the only issue would be the requirement for power-of-2 sizes.
Why is push() checking for overlap? I thought not caring about that was the point of a ring buffer?
[EDIT:] To clarify for those who didn't (couldn't?) read TFA's first sentence, it specifically invokes "I've been writing ring buffers wrong all these years" [0], recently featured on HN.
[0] https://www.snellman.net/blog/archive/2016-12-13-ring-buffer...
Depends on how you use it. For buffering realtime data, if latency is the chief concern, then by all means discard the old stuff. If the desired behavior is to block when the queue is full, then you need to check for an overlap.
I can't read the site because it's blocked for me, but isn't a ring-buffer a different class of data-structure than a queue? AIUI, queues are for non-lossy FIFO behavior.
I didn't see the referenced article and initially thought this was going to be a post about not using volatile instead of atomic load / store.
I suppose this could be a useful implementation; however, it's not clear how useful, since the main place it has an impact is when you have a large number of very small queues. The overhead associated with the unused element is sizeof(T) / (sizeof(T) * n + sizeof(RingBuffer) + malloc_overhead), where T is likely to be at most the size of a pointer on the platform. If we assume a standard x86-64 platform with sizeof(T) equal to 8, we can eyeball the struct and guess that GCC will probably spit out a 24-byte structure, since it's going to want 8-byte alignment on the T* after the uint32 values. On top of that, not counting malloc overhead, the array will take up n * sizeof(T) bytes. For regular malloc, the overhead on an 8-byte allocation is going to be an extra 12 bytes (according to malloc.c). Assuming we are allocating our ring buffer on the stack, this brings our total size up to 48 bytes for the "correct" ring buffer vs 56 bytes for the "wrong" ring buffer. Actual savings will be at most 14% and go to 0% as the size of the buffer goes to infinity, proportional to 1 / n, which is less than what one might assume. Consequently, for any buffer over length 2, the power-of-two buffer size requirement will likely waste more memory than it saves, since we are saving at most 8 bytes but losing the difference between a power of two and the desired queue size, so odds are it makes more sense to write a 1- or 2-element ring buffer as a special case than to use this implementation for all ring buffers.
Quite frankly, I wrote this more as an exercise. I have never used it in practice other than in the GitHub test. I saw people picking up on the other post and didn't really like the indices going unbounded, so I spent a few minutes thinking about how to solve it.
While I think the implementation per se is not that useful (I agree with you), I believe the actual trick can be reused in a different situation or type of container.
Check out the GitHub repo now; I added a choice of allocator. That is nice because the main use of this kind of tool is when you have a slab of mapped memory that you partition up.
I honestly don't see how any of these implementations are better than simply using a bool to track whether the queue is empty or not (as suggested in one of the comments on the original article). You could even use the highest bit of the write index to store the bool value if memory was an issue.
Some of these implementations also reduce the number of branches, which is generally a good performance idea in modern pipelined, branch-predicting CPUs, but not possible in all cases or even a guaranteed win for performance.
FWIW I checked what I did with the last ring buffer I wrote: I used position + length to eke out that last cell. As the original article notes, it's not a concurrency-friendly solution, but I wasn't writing for that in this case.
One possible reason is that storing a bool separately from the index makes it difficult to update the producer or consumer index atomically. With the implementations that store the full producer and consumer states as a single word each, only single-word atomic operations are necessary to build a lock-free ring. Storing the bool as the high bit of the index would also suffice.
Another approach gives each element a 'data ready' flag (i.e., the real element type inherits from T). This flag is set by the producer and cleared by the consumer. This would give you an empty/full indication.
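A sketch of that per-element flag (illustrative; the producer fills the payload and then publishes with a release store, the consumer copies it out and hands the slot back):

    #include <atomic>

    template <typename T>
    struct Slot : T {                        // the real element type inherits from T
        std::atomic<bool> ready{false};      // set by the producer, cleared by the consumer
    };

    template <typename T>
    bool produce(Slot<T>& s, const T& v) {
        if (s.ready.load(std::memory_order_acquire)) return false;   // slot still occupied
        static_cast<T&>(s) = v;                                      // fill the payload first
        s.ready.store(true, std::memory_order_release);              // then mark it ready
        return true;
    }

    template <typename T>
    bool consume(Slot<T>& s, T& out) {
        if (!s.ready.load(std::memory_order_acquire)) return false;  // nothing here yet
        out = static_cast<T&>(s);                                    // copy the payload out
        s.ready.store(false, std::memory_order_release);             // hand the slot back
        return true;
    }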
This link: http://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIF...
was posted in the previous discussion on ring buffers. It provides a solution that seems to me to be the best of any I have seen. See page 2, section 2.2. For a buffer of size 2^N, use N+1 bit indices. If all bits of the indices are equal, the buffer is empty. If all but the most significant are equal, the buffer is full.
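A sketch of that scheme for a 2^3 = 8 element buffer, i.e. 4-bit indices (illustrative; the indices live in ordinary integers and are masked to N+1 bits when incremented):

    #include <cstdint>

    constexpr uint32_t kAddrBits = 3;                            // buffer holds 2^3 = 8 elements
    constexpr uint32_t kIdxMask  = (1u << (kAddrBits + 1)) - 1;  // keep N+1 = 4 bits of index
    constexpr uint32_t kMsb      = 1u << kAddrBits;              // the extra, most significant bit

    inline uint32_t bump(uint32_t idx)            { return (idx + 1) & kIdxMask; }
    inline bool     empty(uint32_t r, uint32_t w) { return r == w; }           // all bits equal
    inline bool     full(uint32_t r, uint32_t w)  { return (r ^ w) == kMsb; }  // only the MSB differs
    inline uint32_t slot(uint32_t idx)            { return idx & (kMsb - 1); } // array position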
That is essentially what this implementation does too: wrapping the pointers after twice the size is just another way of saying you use one additional bit.
Power of 2 requirement stands, though.