Always cool to see new mutex implementations and shootouts between them, but I don’t like how this one is benchmarked. Looks like a microbenchmark.
Most of us who ship fast locks use very large multithreaded programs as our primary way of testing performance. The things that make a mutex fast or slow seem to be different for complex workloads with varied critical section length, varied numbers of threads contending, and varying levels of contention.
(Source: I wrote the fast locks that WebKit uses, I’m the person who invented the ParkingLot abstraction for lock impls (now also used in Rust and Unreal Engine), and I previously did research on fast locks for Java and have a paper about that.)
To add to this, as the original/lead author of a desktop app that frequently runs with many tens of threads, I'd like to see numbers on performance in non-heavily contended cases. As a real-time (audio) programmer, I am more concerned with (for example) the cost to take the mutex even when it is not already locked (which is the overwhelming situation in our app). Likewise, I want to know the cost of a try-lock operation that will fail, not what happens when N threads are contending.
Of course, with Cosmopolitan being open source and all, I could do these measurements myself, but still ...
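Something like this rough sketch is all I mean by those measurements (plain pthreads and clock_gettime; the iteration count is arbitrary, and the failing try-lock case simply holds the mutex in the same thread, which a normal non-recursive mutex reports as busy):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    double t0;

    /* Case 1: lock/unlock with zero contention (the overwhelming case in our app). */
    t0 = now_ns();
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    printf("uncontended lock+unlock: %.1f ns/op\n", (now_ns() - t0) / ITERS);

    /* Case 2: a try-lock that always fails because the mutex is already held
       (held by ourselves here, which a normal non-recursive mutex reports as
       busy; a stricter test would have another thread hold it). */
    pthread_mutex_lock(&m);
    t0 = now_ns();
    for (int i = 0; i < ITERS; i++) {
        if (pthread_mutex_trylock(&m) == 0)   /* not expected to succeed */
            pthread_mutex_unlock(&m);
    }
    printf("failing trylock: %.1f ns/op\n", (now_ns() - t0) / ITERS);
    pthread_mutex_unlock(&m);
    return 0;
}
```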
Pro tip: if you really do know that contention is unlikely, and uncontended acquisition is super important, then it's theoretically impossible to do better than a spinlock.
Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.
Spinlocks also have excellent throughput under most contention scenarios, though at the cost of power and being unkind to other apps on the system. If you want your spinlock to be hella fast on contention just make sure you `sched_yield` before each retry (or `SwitchToThread` on Windows, and on Darwin you can do a bit better with `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).
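To make that concrete, a minimal sketch of that shape in C11 atomics (no backoff tuning, no fairness; an illustration, not a production lock):

```c
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_int held; } spinlock;   /* initialize to {0} */

static void spin_lock(spinlock *l) {
    for (;;) {
        /* Uncontended fast path: one atomic exchange with acquire ordering. */
        if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            return;
        /* Contended: yield before each retry, spinning on a plain load. */
        do {
            sched_yield();
        } while (atomic_load_explicit(&l->held, memory_order_relaxed));
    }
}

static void spin_unlock(spinlock *l) {
    /* Just a store-release -- no CAS or atomic RMW -- which is exactly what a
       lock with a sleep queue can't get away with on unlock. */
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```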
I should say, though, that if you're on Windows then I have yet to find a real workload where SRWLock isn't the fastest (provided you're fine with no recursion and with a lock that is word-sized). That lock has made some kind of deal with the devil AFAICT.
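For reference, basic SRWLock usage looks like this; the lock really is just one machine word with a static initializer, and it doubles as a reader-writer lock:

```c
#include <windows.h>

SRWLOCK lock = SRWLOCK_INIT;        /* one pointer-sized word, no destroy call */

void writer(void) {
    AcquireSRWLockExclusive(&lock);
    /* ... exclusive critical section ... */
    ReleaseSRWLockExclusive(&lock);
}

void reader(void) {
    AcquireSRWLockShared(&lock);    /* shared mode for read-only work */
    /* ... read-only critical section ... */
    ReleaseSRWLockShared(&lock);
}
```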
This style of mutex will also power PyMutex in Python 3.13. I have real-world benchmarks showing how much faster PyMutex is than the old PyThread_type_lock that was available before 3.13.
I wonder how much it will help in real code. The no-gil build is still easily 50% slower and the regular build showed a slowdown of 50% for Sphinx, which is why the incremental garbage collector was removed just this week.
Python development is in total chaos on all social and technical fronts due to incompetent and malicious leadership.
Definitely a microbenchmark and probably wouldn’t be generally representative of performance. This page gives pretty good standards for OS benchmarking practice, although admittedly it's geared more toward academia: https://gernot-heiser.org/benchmarking-crimes.html
Yeah, that specific benchmark is actually likely to prefer undesirable behaviors, for example pathological unfairness: clearly the optimal scheduling of those threads runs all the increments from the first thread first, then all from the second thread, and so on, because this minimizes inter-processor traffic.
A mutex that sleeps for a fixed amount (for example 100us) on a failed lock acquisition will probably get very close to that behavior (since it almost always bunches), and "win" the benchmark. Still, that would be a terrible mutex for any practical application where there is any contention.
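In other words, something like this degenerate lock (just a sketch of the behavior described above, not anyone's real implementation): it tends to let whichever thread is already running batch up its critical sections, which is exactly what this benchmark rewards.

```c
#include <stdatomic.h>
#include <unistd.h>

typedef struct { atomic_int held; } bad_mutex;   /* initialize to {0} */

static void bad_lock(bad_mutex *m) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
               &m->held, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;
        usleep(100);   /* fixed 100us nap: no wait queue, no handoff, no fairness */
    }
}

static void bad_unlock(bad_mutex *m) {
    atomic_store_explicit(&m->held, 0, memory_order_release);
}
```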
This is not to say that this mutex is not good (or that pthread mutexes are not bad), just that the microbenchmark in question does not measure anything that predicts performance in a real application.
And for all we know it’s absolute trash on real programs.
> The reason why Cosmopolitan Mutexes are so good is because I used a library called nsync. It only has 371 stars on GitHub, but it was written by a distinguished engineer at Google called Mike Burrows.
Indeed this is the first time I've heard of nsync, but Mike Burrows also wrote Google's production mutex implementation at https://github.com/abseil/abseil-cpp/blob/master/absl/synchr... I'm curious why this mutex implementation is absent from the author's benchmarks.
By the way if the author farms out to __ulock on macOS, this could be more simply achieved by just using the wait(), notify_one() member functions in the libc++'s atomic library.
A while ago there was also a giant thread related to improving Rust's mutex implementation at https://github.com/rust-lang/rust/issues/93740#issuecomment-.... What's interesting is that there's a detailed discussion of the inner workings of almost every popular mutex implementation.
Mike was a legend by the time I got to AV. The myth was that any time the search engine needed to be faster, he came in and rewrote a few core functions and went back to whatever else he was doing. Might be true, I just can't verify it personally. Extremely smart engineer who cares about efficiency.
We did not, however, run on one server for any length of time.
> The Burrows-Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2
Although it does get there eventually, that Rust thread is about Mara's work, which is why it mentions her January 2023 book.
The current Rust mutex implementation (which that thread does talk about later) landed earlier this year and although if you're on Linux it's not (much?) different, on Windows and Mac I believe it's new work.
That said, Mara's descriptions of the guts of other people's implementations are still interesting, just make sure to check whether they're outdated for your situation.
> although if you're on Linux it's not (much?) different
AFAIK one reason to switch was that mutexes on Linux and macOS were not guaranteed to be moveable, so every Rust Mutex had to box the underlying OS mutex and was not const-constructible. So this was a considerable change.
Possibly because it's C++ (as opposed to C)? I am speculating.
MSVC 2022's std::mutex is listed, though. (That said, GCC's / clang's std::mutex is not listed for Linux or macOS.)
absl::Mutex does come with some microbenchmarks with a handful of points of comparison (std::mutex, absl::base_internal::SpinLock) which might be useful to get an approximate baseline.
https://github.com/abseil/abseil-cpp/blob/master/absl/synchr...
> It's still a new C library and it's a little rough around the edges. But it's getting so good, so fast, that I'm starting to view not using it in production as an abandonment of professional responsibility.
What an odd statement. I appreciate the Cosmopolitan project, but these exaggerated claims of superiority are usually a pretty bad red flag.
I'd like to point out that Justine's claims are usually correct. It's just her shtick (or personality?) to use hyperbole and ego-stroking wording. I can see why some might see it as abrasive (it has caused drama before, namely in llamacpp).
I also meant to comment about the grandstanding in her post.
Technical achievement aside, when a person invents something new, the burden is on them to prove that the new thing is a suitable replacement of / improvement over the existing stuff. "I'm starting to view /not/ using [cosmo] in production as an abandonment of professional responsibility" is emotional manipulation -- it's guilt-tripping. Professional responsibility is the exact opposite of what she suggests: it's not jumping on the newest bandwagon. "a little rough around the edges" is precisely what production environments don't want; predictability/stability is frequently more important than peak performance / microbenchmarks.
Furthermore,
> The C library is so deeply embedded in the software supply chain, and so depended upon, that you really don't want it to be a planet killer.
This is just underhanded. She implicitly called glibc and musl "planet killers".
First, technically speaking, it's just not true; and even if the implied statement were remotely true (i.e., if those mutex implementations were in fact responsible for a significant amount of cycles in actual workloads), the emotional load / snide remark ("planet killer") is unjustified.
Second, she must know very well that whenever efficiency of computation is improved, we don't use that for running the same workloads as before at lower cost / smaller environmental footprint. Instead, we keep all CPUs pegged all the time, and efficiency improvements only ever translate to larger profit. A faster mutex too translates to more $$$ pocketed, and not to less energy consumed.
This case isn't abrasive, but it's certainly incoherent.
Name a single case where professional responsibility would require C code advertised as "rough around the edges" to be used anywhere near production. (The weasel words "starting to" do not help the logic of that sentence.)
I can definitely understand how OP could view this as a red flag. The author should amend it.
She claimed POSIX was changed for her use case, and it's not true; POSIX disallows what she said.
Indeed, the first few bullet points of her writeup on how the lock works (compare-and-swap for the uncontended case, futex for the contended case) describe how everybody has implemented locks for about 20 years, ever since the futex was introduced for exactly this. Win32 critical sections from even longer ago work the same way.
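For readers who haven't seen it, that recipe looks roughly like this (a Linux-only sketch in the spirit of Drepper's "Futexes Are Tricky", not Cosmopolitan's or nsync's actual code):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* state: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters */
typedef struct { atomic_int state; } mutex;

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, (void *)0, (void *)0, 0);
}

static void mutex_lock(mutex *m) {
    int c = 0;
    /* Fast path: uncontended compare-and-swap 0 -> 1, no syscall at all. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: advertise that there are waiters, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex(&m->state, FUTEX_WAIT_PRIVATE, 2);   /* sleep while state == 2 */
        c = atomic_exchange(&m->state, 2);         /* 0 means we got the lock */
    }
}

static void mutex_unlock(mutex *m) {
    /* If the old state was 2, someone may be sleeping: wake exactly one. */
    if (atomic_exchange(&m->state, 0) == 2)
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);
}
```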
The drama in llamacpp was not due to language, it was due to jart making false claims and having absolutely no idea how memory maps work, in addition to needlessly changing file headers to contain her name out of vanity.
I don’t think anybody can deny that jart is a smart person, but I think there is more to it than just abrasive language that people take issue with.
Justine seems like a fairly brilliant and creative person, but in production I don't think I'd want to use a libc that's "new" or "rough around the edges".
Getting "so fast" is not my first priority when I run things in production. That's stability, predictability, and reliability. Certainly performance is important: better-performing code can mean smaller infrastructure, which is great from a cost and environmental perspective. But fast comes last.
I feel that there’s a certain amount of hubris that comes along with spending long periods of time solo-coding on a computer, and perhaps unwittingly starved of social contact. Without any checks on you or your work’s importance (normally provided by your bog-standard “job”), your achievements take on a grandeur that they might not have broadly earned, as impressive as they might be.
An example is APE (which I otherwise feel is a very impressive hack). One criticism might be “oh, so I not only get to be insecure on one platform, I can be insecure on many all at once?”
The longer you spend in technology, the more you realize that there are extremely few win-wins and very many win-some, lose-somes (tradeoffs).
I find her tone of voice repulsive.
Why would you even post this here? Who do you think this is helping?
I think it's fair to comment not only on the subject, but on the writing itself, too.
And it might help Justine improve her writing (and reach a larger audience -- after all, blog posts intend to reach some audience, don't they?). Of course you can always say, "if you find yourself alienated, it's your loss".
It's especially a red flag since an enormous number of projects (almost all of them?) will never tolerate shipping fat binaries (i.e. what Cosmopolitan C is in reality).
The core of this is a library called nsync. It appears most of the improvements by Justine are upstreamed into nsync itself, which doesn't have any of the baggage of Cosmopolitan. Whatever your opinion of the project or author, they've made a good effort to not lock you in.
Agreed; this is what I've always (silently) thought of those fat binaries. Absolute stroke of genius, no doubt, and also a total abomination (IMO) from a sustainability perspective.
> I appreciate the Cosmopolitan project, but these exaggerated claims of superiority are usually a pretty bad red flag.
In general, I agree with your sentiment, but Justine is simply a different beast. She does tend to use hyperbolic language a fair amount, but she delivers so much awesomeness that I make an exception for her.
As gamedev I came to love slow mutexes that do a lot of debug things in all 'developer' builds. Have debug names/IDs, track owners, report time spent in contention to profiler, report ownership changes to profiler...
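A rough sketch of the kind of wrapper I mean (pthreads underneath; profiler_report_contention() is a stand-in for whatever marker API your profiler exposes):

```c
#include <pthread.h>
#include <time.h>

typedef struct {
    pthread_mutex_t impl;
    const char     *name;    /* debug name/ID that shows up in captures */
    pthread_t       owner;   /* who holds it right now (valid while locked) */
    int             locked;
} debug_mutex;

/* Hypothetical hook into your profiler of choice. */
extern void profiler_report_contention(const char *name, double blocked_ns);

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

static void debug_mutex_lock(debug_mutex *m) {
    if (pthread_mutex_trylock(&m->impl) != 0) {
        /* Contended: measure how long we sat blocked and tell the profiler. */
        double t0 = now_ns();
        pthread_mutex_lock(&m->impl);
        profiler_report_contention(m->name, now_ns() - t0);
    }
    m->owner  = pthread_self();
    m->locked = 1;
}

static void debug_mutex_unlock(debug_mutex *m) {
    m->locked = 0;
    pthread_mutex_unlock(&m->impl);
}
```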
People tend to structure concurrency differently, and games came to some patterns to avoid locks. But they are hard to use and require the programmer to restructure things. Most of the code starts as 'let's slap a lock here and try to pass the milestone'. Even fast locks will be unpredictably slow and will destroy realtime guarantees if there were any. They can be fast on average but the tail end is never going to go away. I don't want to be that guy who will come back to it chasing 'oh our game has hitches', but I am usually that guy.
Use slow locks, people. The ones that show up big and red in the profiler. Refactor them out when you see them being hit.
I know it's a tall order. I can count the people who know how to use a profiler on my fingers on an AAA production. And it's always like that no matter how many productions I see :)
ps. sorry for a rant. Please, continue good research into fast concurrency primitives and algorithms.
Even more tangentially: This is one of the reasons I'm having a great time developing a game in Rust.
You never want lock contention in a game if at all avoidable, and in a lot of cases taking a lock is provably unnecessary. For example: each frame is divided into phases, and mutable access to some shared resource only needs to happen in one specific phase (like the `update()` function before `render()`, or when hot-reloading assets). With scoped threads and Rust's borrowing rules, you can structure things to not even need a mutex, and be completely certain that you will receive a stern compiler error if the code changes in a way such that you do.
When possible, I'd always love to take a compiler error over a spike in the profiler.
Totally agree. Debugging features -- e.g. deadlock detection / introspection -- easily pay for themselves. If you're actually acquiring locks so frequently they matter, you should revisit your design. Sharing mutable state between thread should be avoided.
So on the one hand, all this Cosmo/ape/redbean stuff sounds incredible, and the comments on these articles are usually pretty positive and don’t generally debunk the concepts. But on the other hand, I never hear mention of anyone else using these things (I get that not everyone shares what they’re doing in a big way, but after so many years I’d expect to have seen a couple project writeups talk about them). Every mention of Cosmo/ape/redbean I’ve ever seen is from Justine’s site.
So I’ve gotta ask: Is there a catch? Are these tools doing something evil to achieve what they’re achieving? Is the whole thing a tom7-esque joke/troll that I don’t get because I’m not as deep into compilers/runtimes? Or are these really just ingenious tools that haven’t caught on yet?
APE works through cunning trickery that might get patched out any day now (and in OpenBSD, it has been).
Most people producing cross-platform software don't want a single executable that runs on every platform, they want a single codebase that works correctly on each platform they support.
With that in mind, languages like Go letting you cross-compile for all your targets (provided you avoid CGO) are delightful... but the 3-ways-executable magic trick of APE, while really clever, doesn't inspire confidence that it'll work forever, and for the most part it doesn't gain you anything. Each platform has its own packaging/signing requirements. You might as well compile a different target for each platform.
Even that is not a big deal in most cases, if you use zig to wrap CC: https://dev.to/kristoff/zig-makes-go-cross-compilation-just-...
Wasn't the ELF format modified upstream to accommodate Cosmo? That makes it kinda official. Still hard to see a use case for it. If you want everyone to be able to run your program, just write a web app, a Win32 program, or a Java applet. 20-year-old Java applets still run on modern JVMs.
Well, I'm used to not being most people, but I'd much rather be able to produce a single identical binary for my users that works everywhere than the platform-specific nonsense I have to go through right now. Having to maintain different special build processes for different platforms is a stupid waste of time.
Frankly this is how it always should have worked except for the monopolistic behavior of various platforms in the past.
> Most people producing cross-platform software don't want a single executable that runs on every platform
They don't? Having one file to download instead of a maze of "okay so what do you have" is way easier than the current mess. It would be very nice not to have to ask users what platform they're on, they just click a link and get the thing.
https://github.com/elastic/golang-crossbuild docker images for macOS and Windows (not Linux though)
https://github.com/rust-cross/rust-musl-cross docker images for Linux (yes, it's possible to use the same one for Go, you just need to put the Go binary there), so it's static with musl libc and sure not to have weird issues with old/new libc on old/new/whatever linuxes
(it might be over the top and the zig thing might effectively do the same thing, but I never got it to compile)
oh, and the apple-codesign Rust crate (via cargo) to get through the signing nonsense
I am only speaking for myself here. While cosmo and ape do seem very clever, I do not need this type of clever stuff in my work if the ordinary stuff already works fine.
Like for example if I can already cross-compile my project to other OSes and platforms or if I've got the infra to build my project for other OSes and platforms, I've no reason to look for a solution that lets me build one binary that works everywhere.
There's also the thing that ape uses clever hacks to be able to run on multiple OSes. What if those hacks break someday due to how executable formats evolve? What if nobody has the time to patch APE to make it compatible with those changes?
But my boring tools like gcc, clang, go, rust, etc. will continue to get updated and they will continue to work with evolving OSes. So I just tend to stick with the boring. That's why I don't bother with the clever because the boring just works for me.
Mozilla llamafile uses it. Bundles model weights and an executable into a single file, that can be run from any cosmo/ape platform, and spawns a redbean http server for you to interact with the LLM. Can also run it without the integrated weights, and read weights from the filesystem. It's the easiest "get up and go" for local LLMs you could possibly create.
Whether that in turn has any practical use beyond quickly trying out small models is another question.
reposting my comment from another time this discussion came up:
"Cosmopolitan has basically always felt like the interesting sort of technical loophole that makes for a fun blog post which is almost guaranteed to make it to the front page of HN (or similar) purely based in ingenuity & dedication to the bit.
as a piece of foundational technology, in the way that `libc` necessarily is, it seems primarily useful for fun toys and small personal projects.
with that context, it always feels a little strange to see it presented as a serious alternative to something like `glibc`, `musl`, or `msvcrt`; it’s a very cute hack, but if i were to find it in something i seriously depend on i think i’d be a little taken aback."
The problem with this take is that it is not grounded in facts that should determine if the thing is good or bad.
Logically having one thing instead of multiple makes sense as it simplifies the distribution and bytes stored/transmitted. I think the issue here is that it is not time tested. I believe multiple people in various places around the globe think about it and will try to test it. With time this will either bubble up or be stale.
> So I’ve gotta ask: Is there a catch? Are these tools doing something evil to achieve what they’re achieving?
it's not that complicated; they're fat binaries (plus, I guess, a lot of papering over the differences between the platforms) that exploit a quirk of the Thompson shell:
> One day, while studying old code, I found out that it's possible to encode Windows Portable Executable files as a UNIX Sixth Edition shell script, due to the fact that the Thompson Shell didn't use a shebang line.
(https://justine.lol/ape.html)
so the answer is simple: i can't think of anyone that wants to ship fat binaries.
> Is the whole thing a tom7-esque joke/troll that I don’t get because I’m not as deep into compilers/runtimes? Or are these really just ingenious tools that haven’t caught on yet?
If I went up to you in 2008, and said, hey, let's build a database that doesn't do schemas, isn't relational, doesn't do SQL, isn't ACID compliant, doesn't do joins, has transactions as an afterthought, and only does indexing sometimes, you'd think I was trolling you. And then in 2009, Mongodb came out and caught on in various places. So only time will tell if these are ingenious tools that haven't caught on yet. There's definitely a good amount of genius behind it, though only time will tell if it's remembered in the vein of tom7's harder drives or if it sees wider production use. I'll say that if golang supported it as a platform, I'd switch my build pipelines at work over to it, since it makes their output less complicated to manage as there's only a single binary to deal with instead of 3-4.
Last year I needed to make a small webapp to be hosted on a Windows server, and I thought RedBean would be ideal for it. Unfortunately it was too buggy (at least on Windows).
I don't know whether RedBean is production-ready now, but a year and a half ago, that was the catch.
Give the latest nightly build a try: https://cosmo.zip/pub/cosmos/bin/ Windows has been a long hard march, but we've recently hit near feature completion. As of last month, the final major missing piece of the puzzle was implemented, which is the ability to send UNIX signals between processes. Cosmopolitan does such a good job on Windows now that it's not only been sturdy for redbean, but much more mature and complicated software as well, like Emacs, GNU Make, clang, Qt, and more.
Most people aren't writing C as far as I know. We use Java, C#, Go, Python etc, some lunatics even use Node.
We generally don't care if some mutex is 3x faster than some other mutex. Most of the time if I'm even using a mutex which is rare in itself, the performance of the mutex is generally the least of my concerns.
I'm sure it matters to someone, but most people couldn't give two shits if they know what they're doing. We're not writing code where it's going to make a noticeable difference. There are thousands of things in our code we could optimize that would make a greater impact than a faster mutex, but we're not looking at those either because it's fast enough the way it is.
If it's so good, why haven't all C libraries adopted the same tricks?
My bet is that its tricks are only always-faster on certain architectures, or certain CPU models, or certain types of workload / access patterns... and proper benchmarking of varied workloads on all supported hardware would not show the same benefits.
Alternatively, maybe the semantics of the pthread API (that cosmopolitan is meant to be implementing) are somehow subtly different and this implementation isn't strictly compliant to the spec?
I can't imagine it's that the various libc authors aren't keeping up in state-of-the-art research on OS primitives...
Those projects often have dozens of other priorities beyond just one specific API, and obsessing over individual APIs isn't a good way to spend the limited time they have. In any case, as a concrete example to disprove your claim, you can look at malloc and string routines in your average libc on Linux.
glibc's malloc is tolerable but fails handily to more modern alternatives in overall speed and scalability (it fragments badly and degrades over time, not to mention it has a dozen knobs that can deeply impact real life workloads like MALLOC_ARENA_MAX). musl malloc is completely awful in terms of performance at every level; in a multithreaded program, using musl's allocator will destroy your performance so badly that it nearly qualifies as malpractice, in my experience.
musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks, and yes it absolutely shows up in non-trivial profiles, and yes improving this improves all programs nearly universally. glibc's optimized routines are good, but they can always, it seems, become faster.
These specific things aren't "oh, they're hyper specific optimizations for one architecture that don't generalize". These two things in particular -- we're talking 2-5x wall clock reduction, and drastically improved long-term working set utilization, in nearly all workloads for any given program. These are well explored and understood spaces with good known approaches. So why didn't they take them? Because, as always, they probably had other things to do (or conflicting priorities like musl prioritizing simplicity over peak performance, even when that philosophy is actively detrimental to users.)
I'm not blaming these projects or anything. Nobody sets out and says "My program is slow as shit and does nothing right, and I designed it that way and I'm proud of it." But the idea that the people working on them have made only the perfect pareto frontier of design choices just isn't realistic in the slightest and doesn't capture the actual dynamics of how most of these projects are run.
> musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks
Building GNU Make with Cosmo or glibc makes cold startup go 2x faster for me on large repos compared to building it with Musl, due to vectorized strlen() alone (since SIMD is 2x faster than SWAR). I sent Rich a patch last decade adding sse to strlen(), since I love Musl, and Cosmo is based on it. But alas he didn't want it. Even though he seems perfectly comfortable using ARM's strlen() assembly.
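For anyone wondering what SWAR means here: it's the classic word-at-a-time zero-byte scan, roughly like the sketch below (not musl's exact code; real implementations also have to worry about aliasing rules and page boundaries), versus SIMD doing the same test 16-64 bytes at a time.

```c
#include <stddef.h>
#include <stdint.h>

size_t swar_strlen(const char *s) {
    const char *p = s;
    /* Walk byte-by-byte until we're aligned to a word boundary. */
    while ((uintptr_t)p % sizeof(uint64_t)) {
        if (!*p) return (size_t)(p - s);
        p++;
    }
    /* Then test 8 bytes per iteration: (w - 0x01..) & ~w & 0x80.. is nonzero
       iff some byte in w is zero. */
    const uint64_t ones  = 0x0101010101010101ull;
    const uint64_t highs = 0x8080808080808080ull;
    const uint64_t *w = (const uint64_t *)p;
    while (!(((*w - ones) & ~*w) & highs))
        w++;
    /* The zero byte is somewhere in this word; finish byte-by-byte. */
    for (p = (const char *)w; *p; p++) {}
    return (size_t)(p - s);
}
```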
> glibc's malloc is tolerable but fails handily to more modern alternatives in overall speed and scalability
The focus and attention I put into cosmo mutexes isn't unique. I put that care into everything else too, and malloc() is no exception. Cosmo does very well at multi-threaded memory allocation. I can pick benchmark parameters where it outperforms glibc and jemalloc by 100x. I can also pick params where jemalloc wins by 100x. But I'm reasonably certain cosmo can go faster than glibc and musl in most cases while using less memory too. You have Doug Lea to thank for that.
Every day is a good day working on cosmo, because I can always find an opportunity to dive into another rabbit hole. Even ones as seemingly unimportant as clocks: https://github.com/jart/cosmopolitan/commit/dd8544c3bd7899ad...
Another example is hash maps: all the large companies built better maps in cpp (folly, absl), but the number of apps that are performance sensitive and still use std::unordered_map will be astounding forever.
(Why not upstream? ABI compatibility which is apparently a sufficient veto reason for anything in cpp)
Politics, not-invented-here syndrome, old maintainers.
It takes forever to change something in glibc, or the C++ equivalent.
There are many kinds of synchronization primitives. pthreads only supports a subset. If you are limiting yourself to them, you are most likely leaving performance on the table, but you gain portability.
> I can't imagine it's that the various libc authors aren't keeping up in state-of-the-art research on OS primitives...
is this sarcasm?
(I don't know any libc maintainers, but as a maintainer of a few thingies myself, I do not try to implement state-of-the-art research, I try to keep my thingies stable and ensure the performance is acceptable; implementing research is out of my budget for "maintenance")
But if you maintain a few thingies, you'd probably know about rival thingies that do a similar thing, right?
If the rival thingies got a speed boost recently, and they were open source, you'd want to have a look at how they did it and maybe get a similar speed boost for yourself.
> If it's so good, why haven't all C libraries adopted the same tricks?
A man and a statistician are walking down the street when they see a 50€ bill. The statistician keeps walking, but the man stops and says, "Hey, look at this cash on the floor!" The statistician, unimpressed, says, "Must be fake, or someone would have already picked it up," and continues walking. The other man grabs the cash and takes it.
My guess is: because what's in the current standard libraries and OSes is good enough.
Synchronizing multiple CPU cores is fundamentally slow; there's no way around it. They are far apart on the chip, and sometimes even on different chips with some link between them. When measuring time in CPU cycles, that latency is rather large.
Possible to avoid with good old software engineering, and over time people who wanted to extract performance from their multi-core CPUs became good at it.
When you’re computing something parallel which takes minutes, you’ll do great if you update the progress bar at a laughable 5 Hz. Synchronizing cores 5 times each second costs nothing regardless of how efficient is the mutex.
When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.
Another notable example is multimedia frameworks. These handle realtime data coming at high frequencies like 48 kHz for audio, and they do non-trivial compute in these effect transforms and codecs so they need multiple cores. But they can tolerate a bit of latency so they’re just batching these samples. This dramatically saves IPC costs because you only need to lock these mutexes at 100Hz when batching 480 samples = 10ms of audio.
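In code, the batching point is just this shape (push_block() is a hypothetical stand-in for whatever shared queue the framework uses):

```c
#include <pthread.h>

#define BLOCK_FRAMES 480   /* 10 ms of audio at 48 kHz */

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical hand-off into the shared queue read by the mixer thread. */
extern void push_block(const float *samples, int nframes);

void deliver(const float *samples, int nframes) {
    for (int off = 0; off < nframes; off += BLOCK_FRAMES) {
        int n = nframes - off < BLOCK_FRAMES ? nframes - off : BLOCK_FRAMES;
        /* One lock per 480-sample block: ~100 acquisitions per second instead
           of 48,000 if every sample were handed over individually. */
        pthread_mutex_lock(&queue_lock);
        push_block(samples + off, n);
        pthread_mutex_unlock(&queue_lock);
    }
}
```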
> When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.
That's not really the right way to do it. If you're actually using multiple cores in your code, you want to build a data dependency graph and let the cores walk that graph. Max node size ends up being something loosely like 10k items to be processed. You'll typically see hundreds of synchronization points per frame.
This is the base architecture of stuff like Thread Building Blocks and rust's rayon library.
> They are far apart on the chip, and sometimes even on different chips with some link between.
They aren't always. On a NUMA machine some are closer than others; M-series is an example. The cores are in clusters where some of the cache is shared, so atomics are cheap as long as it doesn't leave the cluster.
Threads and mutexes are the most complicating things in computer science. I am always skeptical of new implementations until they've been used for several years at scale. Bugs in these threading mechanisms often elude even the most intense scrutiny. When Java hit the scene in the mid 90s it exposed all manner of thread and mutex bugs in Solaris. I don't want the fastest mutex implementation - I want a reliable one.
Mutexes are far from the most 'complicated'. There are really not that many ways to implement them (efficiently). In most cases they are best avoided, esp. on the read paths.
This code benchmarks mutex contention, not mutex lock performance. If you're locking like this, you should reevaluate your code. Each thread locks and unlocks the mutex for every increment of g_chores. This creates an overhead of acquiring and releasing the mutex frequently (100,000 times per thread). This overhead masks the real performance differences between locking mechanisms because the benchmark is dominated by lock contention rather than actual work. Benchmarks such as this one are useless.
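Presumably the benchmark is shaped roughly like this (g_chores and the 100,000 iterations per thread are from the post; the thread count here is arbitrary), so nearly all of the run time is lock traffic rather than useful work:

```c
#include <pthread.h>

#define THREADS    8        /* arbitrary */
#define ITERATIONS 100000   /* per thread, as described */

static int g_chores;
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&g_lock);
        g_chores++;                    /* trivially small critical section */
        pthread_mutex_unlock(&g_lock);
    }
    return 0;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++) pthread_create(&t[i], 0, worker, 0);
    for (int i = 0; i < THREADS; i++) pthread_join(&t[i], 0);
    return 0;
}
```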