ibraheemdev (u/ibraheemdev)

ibraheemdev commented on From Rust to reality: The hidden journey of fetch_max questdb.com/blog/rust-fet... · Posted by u/bluestreak

anematode · 6 months ago

With a relaxed ordering I'm not sure if that's right, since the ldumax would have no imposed ordering relation with the (atomic) decrement on another thread and so could very well have operated on the old value obtained by the non-atomic load

ibraheemdev · 6 months ago

It does make a difference of course if you're running fetch_max from multiple threads, adding a load fast-path introduces a race condition.

ibraheemdev commented on PEP 751: Pylock.toml peps.python.org/pep-0751/... · Posted by u/thatxliner

drzaiusx11 · 6 months ago

Opening with "there are at least five python lock formats pre-existing" then proposes a 6th. I'd like to know if poetry, uv, etc are willing to adopt this before going forward otherwise we're just splitting the community further.

That said, I do think having reproducible builds as an explicit goal is important here, as several pre-existing formats like requirements.txt are too lax on that front.

ibraheemdev · 6 months ago

pip, PDM, and uv already support PEP751 [0] and were involved in the design process.

[0]: https://discuss.python.org/t/community-adoption-of-pylock-to...

ibraheemdev commented on Shared_ptr<T>: the (not always) atomic reference counted smart pointer (2019) snf.github.io/2019/02/13/... · Posted by u/klaussilveira

Kranar · 7 months ago

The presentation you are making is both incorrect and highly misleading.

There are algorithms whose correctness depends on sequential consistency which can not be implemented in x86 without explicit barriers, for example Dekker's algorithm.

What x86 does provide is TSO semantics, not sequential consistency.

ibraheemdev · 7 months ago

I did not claim that x86 provides sequential consistency in general, I made that claim only for RMW operations. Sequentially consistent stores are typically lowered to an XCHG instruction on x86 without an explicit barrier.

From the Intel SDM:

> Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to ensure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory (see Section 8.1.2, “Bus Locking”).

ibraheemdev commented on Shared_ptr<T>: the (not always) atomic reference counted smart pointer (2019) snf.github.io/2019/02/13/... · Posted by u/klaussilveira

Kranar · 7 months ago

The concept of sequential consistency only exists within the context of a programming language's memory model. It makes no sense to speak about the performance of sequentially consistent operations without respect to the semantics of a programming language.

ibraheemdev · 7 months ago

Yes, what I meant was that the same instruction is generated by the compiler, regardless if the RMW operation is performed with relaxed or sequentially consistent ordering, because that instruction is strong enough in terms of hardware semantics to enforce C++'s definition of sequential consistency.

There is a pretty clear mapping in terms of C++ atomic operations to hardware instructions, and while the C++ memory model is not defined in terms of instruction reordering, that mapping is still useful to talk about performance. Sequential consistency is also a pretty broadly accepted concept outside of the C++ memory model, I think you're being a little too nitpicky on terminology.

ibraheemdev commented on Shared_ptr<T>: the (not always) atomic reference counted smart pointer (2019) snf.github.io/2019/02/13/... · Posted by u/klaussilveira

Kranar · 7 months ago

It's a common misconception to reason about memory models strictly in terms of hardware.

Sequential consistency is a property of a programming language's semantics and can not simply be inferred from hardware. It is possible for hardware operations to all be SC but for the compiler to still provide weaker memory orderings through compiler specific optimizations.

ibraheemdev · 7 months ago

I'm referring to the performance implications of the hardware instruction, not the programming language semantics. Incrementing or decrementing the reference count is going to require an RMW instruction, which is expensive on x86 regardless of the ordering.

ibraheemdev commented on Shared_ptr<T>: the (not always) atomic reference counted smart pointer (2019) snf.github.io/2019/02/13/... · Posted by u/klaussilveira

tialaramex · 7 months ago

There is no way the shared_ptr<T> is using the expensive sequentially consistent atomic operations.

Even if you're one of the crazy people who thinks that's the sane default, the value from analysing and choosing a better ordering rule for this key type is enormous and when you do that analysis your answer is going to be acquire-release and only for some edge cases, in many places the relaxed atomic ordering is fine.

ibraheemdev · 7 months ago

> There is no way the shared_ptr<T> is using the expensive sequentially consistent atomic operations.

All RMW operations have sequentially consistent semantics on x86.

It's not exactly a store buffer flush, but any subsequent loads in the pipeline will stall until the store has completed.

ibraheemdev commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

pizlonator · 7 months ago

ParkingLot just uses pthread mutex and cond.

Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you

ibraheemdev · 7 months ago

> ParkingLot just uses pthread mutex and cond.

That's interesting, I'm more familiar with the Rust parking-lot implementation, which uses futex on Linux [0].

> Sure that uses futex under the hood, but the point is, you use futexes on Linux because that’s just what Linux gives you

It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly. A futex lets you manage the atomic condition yourself, while generic parking utilities encode that state internally. A mutex implementation generally already has a built-in atomic condition with simpler state transitions for each thread in the queue, and so can avoid the additional overhead by making the futex call directly.

[0]: https://github.com/Amanieu/parking_lot/blob/739d370a809878e4...

ibraheemdev commented on Without the futex, it's futile h4x0r.org/futex/... · Posted by u/eatonphil

pizlonator · 7 months ago

> Going back to the original futuex paper in 2002, it was immediately clear that the futex was a huge improvement in highly concurrent environments. Just in that original paper, their tests with 1000 parallel tasks ran 20-120 times faster than sysv locks..

I think this is a misunderstanding.

The baseline isn’t sysv locks. The baseline isn’t even what Linux was doing before futexes (Linux had a very immature lock implementation before futexes).

The baseline is all of the ways folks implement locks if they don’t have futexes, which end up having roughly the same properties as a futex based lock:

- fast path that doesn’t hit kernel for either lock or unlock

- slow path that somehow makes the thread wait until the lock is available using some kernel waiting primitive.

The thing futexes improve is the size of the user level data structure that is used for representing the lock in the waiting state. That’s it.

And futexes aren’t the only way to get there. Alternatives:

- thin locks (what JVMs use)

- ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)

ibraheemdev · 7 months ago

> And futexes aren’t the only way to get there. Alternatives:

> - thin locks (what JVMs use)

> - ParkingLot (a futex-like primitive that works entirely in userland and doesn’t require that the OS have futexes)

Worth nothing that somewhere under the hood, any modern lock is going to be using a futex (if supported). futex is the most efficient way to park on Linux, so you even want to be using it on the slow path. Your language's thread.park() primitive is almost certainly using a futex.

ibraheemdev commented on Zig; what I think after months of using it strongly-typed-thoughts.n... · Posted by u/uaksom

ibraheemdev · a year ago

> The message has some weird mentions in (alloc565), but the actual useful information is there: a pointer is dangling.

The allocation ID is actually very useful for debugging. You can actually use the flags `-Zmiri-track-alloc-id=alloc565 -Zmiri-track-alloc-accesses` to track the allocation, deallocation, and any reads/writes to/from this location.