Readit News
pizlonator · a year ago
Always cool to see new mutex implementations and shootouts between them, but I don’t like how this one is benchmarked. Looks like a microbenchmark.

Most of us who ship fast locks use very large multithreaded programs as our primary way of testing performance. The things that make a mutex fast or slow seem to be different for complex workloads with varied critical section length, varied numbers of threads contending, and varying levels of contention.

(Source: I wrote the fast locks that WebKit uses, I’m the person who invented the ParkingLot abstraction for lock impls (now also used in Rust and Unreal Engine), and I previously did research on fast locks for Java and have a paper about that.)

PaulDavisThe1st · a year ago
To add to this, as the original/lead author of a desktop app that frequently runs with many tens of threads, I'd like to see numbers on performance in non-heavily contended cases. As a real-time (audio) programmer, I am more concerned with (for example) the cost to take the mutex even when it is not already locked (which is the overwhelming situation in our app). Likewise, I want to know the cost of a try-lock operation that will fail, not what happens when N threads are contending.

Of course, with Cosmopolitan being open source and all, I could do these measurements myself, but still ...

pizlonator · a year ago
Totally!

Pro tip: if you really do know that contention is unlikely, and uncontended acquisition is super important, then it's theoretically impossible to do better than a spinlock.

Reason: locks that have the ability to put the thread to sleep on a queue must do compare-and-swap (or at least an atomic RMW) on `unlock`. But spinlocks can get away with just doing a store-release (or just a store with a compiler fence on X86) to `unlock`.

Spinlocks also have excellent throughput under most contention scenarios, though at the cost of power and being unkind to other apps on the system. If you want your spinlock to be hella fast on contention just make sure you `sched_yield` before each retry (or `SwitchToThread` on Windows, and on Darwin you can do a bit better with `thread_switch(MACH_PORT_NULL, SWITCH_OPTION_DEPRESS, 1)`).
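
To make that concrete, here's a minimal sketch of such a spinlock in C11 atomics (my own toy code, not WebKit's locks): the uncontended acquire is a single atomic RMW, the unlock is only a store-release, and each contended retry yields first.

```c
// Toy test-and-test-and-set spinlock illustrating the points above.
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int held; } spinlock_t;

static void spin_lock(spinlock_t *l) {
    for (;;) {
        // Optimistic attempt: one atomic RMW on the uncontended path.
        if (!atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            return;
        // Contended: yield, then spin on a plain load until it looks free.
        do {
            sched_yield();
        } while (atomic_load_explicit(&l->held, memory_order_relaxed));
    }
}

static void spin_unlock(spinlock_t *l) {
    // The key point: no compare-and-swap here, just a store-release
    // (which on x86 is a plain store plus a compiler fence).
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

A sleeping lock can't do this, because its unlock must atomically observe whether any waiter is parked.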

ataylor284_ · a year ago
The writeup suggests this implementation is optimized for the not-locked case:

> uses an optimistic CAS (compare and swap) immediately, so that locking happens quickly when there's no contention

fulafel · a year ago
I'm sure you know but this is the opposite of what real-time conventionally means (you have hard deadlines -> worst case is the chief concern).
uvdn7 · a year ago
I was thinking the same. There are many mutexes out there; some are better at certain workloads than others. DistributedMutex and SharedMutex come to mind (https://github.com/facebook/folly/blob/main/folly/synchroniz..., https://github.com/facebook/folly/blob/main/folly/SharedMute...). Just like hashmaps, it's rarely the case that a single implementation is best under _all_ possible workloads.
pizlonator · a year ago
Yeah.

I should say, though, that if you're on Windows then I have yet to find a real workload where SRWLock isn't the fastest (provided you're fine with no recursion and with a lock that is word-sized). That lock has made some kind of deal with the devil AFAICT.

ngoldbaum · a year ago
This style of mutex will also power PyMutex in Python 3.13. I have real-world benchmarks showing how much faster PyMutex is than the old PyThread_type_lock that was available before 3.13.
0xDEADFED5 · a year ago
Can I use PyMutex from my own Python code?
miohtama · a year ago
Any rough summary?
kagrt · a year ago
I wonder how much it will help in real code. The no-GIL build is still easily 50% slower, and the regular build showed a 50% slowdown for Sphinx, which is why the incremental garbage collector was removed just this week.

Python development is in total chaos on all social and technical fronts due to incompetent and malicious leadership.

lovidico · a year ago
Definitely a microbenchmark, and probably not generally representative of performance. This page gives pretty good standards for OS benchmarking practice, although admittedly geared more toward academia: https://gernot-heiser.org/benchmarking-crimes.html
ot · a year ago
Yeah, that specific benchmark is actually likely to reward undesirable behaviors, for example pathological unfairness: clearly the optimal scheduling of those threads runs all the increments from the first thread, then all from the second thread, and so on, because this minimizes inter-processor traffic.

A mutex that sleeps for a fixed amount (for example 100us) on failed lock acquisition will probably get very close to that behavior (since it almost always bunches), and "win" the benchmark. Still, that would be a terrible mutex for any practical application where there is any contention.

This is not to say that this mutex is not good (or that pthread mutexes are not bad), just that the microbenchmark in question does not measure anything that predicts performance in a real application.
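
For context, the benchmark shape being criticized is roughly this (a sketch with invented parameters, not the article's actual harness). Because the critical section is a single increment, a lock that unfairly lets one thread run long bursts keeps the counter's cache line local and posts the best time:

```c
// Sketch of the contended-counter microbenchmark pattern. Thread count
// and iteration count are made up for illustration.
#include <pthread.h>

enum { NTHREADS = 4, ITERS = 100000 };

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static long g_counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&g_lock);
        g_counter++;  // tiny critical section: nearly all cost is the lock itself
        pthread_mutex_unlock(&g_lock);
    }
    return NULL;
}

static long run_benchmark(void) {
    pthread_t t[NTHREADS];
    g_counter = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return g_counter;
}
```

Any mutex passes the correctness check here; the wall-clock "winner" is largely determined by how much it bunches, exactly as described above.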

pizlonator · a year ago
Yeah! For all we know this performs great on real programs.

And for all we know it’s absolute trash on real programs.

intelVISA · a year ago
Fast locks are an oxymoron vs optimized CAS
xxs · a year ago
Indeed, b/c locks need to have their CAS.
kccqzy · a year ago
> The reason why Cosmopolitan Mutexes are so good is because I used a library called nsync. It only has 371 stars on GitHub, but it was written by a distinguished engineer at Google called Mike Burrows.

Indeed this is the first time I've heard of nsync, but Mike Burrows also wrote Google's production mutex implementation at https://github.com/abseil/abseil-cpp/blob/master/absl/synchr... I'm curious why this mutex implementation is absent from the author's benchmarks.

By the way if the author farms out to __ulock on macOS, this could be more simply achieved by just using the wait(), notify_one() member functions in the libc++'s atomic library.

A while ago there was also a giant thread related to improving Rust's mutex implementation at https://github.com/rust-lang/rust/issues/93740#issuecomment-.... What's interesting is that there's a detailed discussion of the inner workings of almost every popular mutex implementation.

BryantD · a year ago
Mike was a legend by the time I got to AV. The myth was that any time the search engine needed to be faster, he came in and rewrote a few core functions and went back to whatever else he was doing. Might be true, I just can't verify it personally. Extremely smart engineer who cares about efficiency.

We did not, however, run on one server for any length of time.

the-rc · a year ago
Burrows is also responsible for the Burrows Wheeler Transform, Bigtable, Dapper and Chubby, among others.
kristianp · a year ago
https://en.m.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_tran...

> The Burrows-Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2
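
For the curious, a naive sketch of the transform (real implementations such as bzip2's use suffix arrays rather than sorting explicit rotations): append a sentinel, sort all rotations, and output the last column, which bunches similar characters together so later compression stages do better.

```c
// Naive Burrows-Wheeler transform: sort all rotations of the input
// (terminated with a unique sentinel '$') and emit the last column.
#include <stdlib.h>
#include <string.h>

static const char *g_text;  // rotation source for the comparator
static size_t g_len;

static int cmp_rot(const void *a, const void *b) {
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < g_len; k++) {
        char ci = g_text[(i + k) % g_len], cj = g_text[(j + k) % g_len];
        if (ci != cj) return (unsigned char)ci - (unsigned char)cj;
    }
    return 0;
}

// out must hold strlen(s) + 2 bytes; s must not contain '$'.
static void bwt(const char *s, char *out) {
    size_t n = strlen(s) + 1;  // +1 for the sentinel
    char *t = malloc(n);
    memcpy(t, s, n - 1);
    t[n - 1] = '$';
    size_t *rot = malloc(n * sizeof *rot);
    for (size_t i = 0; i < n; i++) rot[i] = i;
    g_text = t; g_len = n;
    qsort(rot, n, sizeof *rot, cmp_rot);
    for (size_t i = 0; i < n; i++)  // last column = char preceding each rotation
        out[i] = t[(rot[i] + n - 1) % n];
    out[n] = '\0';
    free(rot); free(t);
}
```

For example, "banana" transforms to "annb$aa": the repeated 'a's and 'n's end up adjacent, which is what makes the output easier to compress.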

lokar · a year ago
Also a really nice person. And funny.
tialaramex · a year ago
Although it does get there eventually, that Rust thread is about Mara's work, which is why it eventually mentions her January 2023 book.

The current Rust mutex implementation (which that thread does talk about later) landed earlier this year and although if you're on Linux it's not (much?) different, on Windows and Mac I believe it's new work.

That said, Mara's descriptions of the guts of other people's implementations is still interesting, just make sure to check if they're out-dated for your situation.

SkiFire13 · a year ago
> although if you're on Linux it's not (much?) different

AFAIK one reason to switch was that mutexes on Linux and macOS were not guaranteed to be movable, so every Rust Mutex had to box the underlying OS mutex and was not const-constructible. So this was a considerable change.

loeg · a year ago
> I'm curious why [Abseil's] mutex implementation is absent from the author's benchmarks.

Possibly because it's C++ (as opposed to C)? I am speculating.

vitus · a year ago
> Possibly because it's C++ (as opposed to C)?

MSVC 2022's std::mutex is listed, though. (That said, GCC's / clang's std::mutex is not listed for Linux or macOS.)

absl::Mutex does come with some microbenchmarks with a handful of points of comparison (std::mutex, absl::base_internal::SpinLock) which might be useful to get an approximate baseline.

https://github.com/abseil/abseil-cpp/blob/master/absl/synchr...

diedyesterday · a year ago
He seems to have won an ACM award too, and there is a (possible) picture of him there: https://awards.acm.org/award-recipients/burrows_9434147
jonathrg · a year ago
> It's still a new C library and it's a little rough around the edges. But it's getting so good, so fast, that I'm starting to view not using it in production as an abandonment of professional responsibility.

What an odd statement. I appreciate the Cosmopolitan project, but these exaggerated claims of superiority are usually a pretty bad red flag.

tredre3 · a year ago
I'd like to point out that Justine's claims are usually correct. It's just her shtick (or personality?) to use hyperbole and ego-stroking wording. I can see why some might see it as abrasive (it has caused drama before, namely in llamacpp).
another-acct · a year ago
I also meant to comment about the grandstanding in her post.

Technical achievement aside, when a person invents something new, the burden is on them to prove that the new thing is a suitable replacement of / improvement over the existing stuff. "I'm starting to view /not/ using [cosmo] in production as an abandonment of professional responsibility" is emotional manipulation -- it's guilt-tripping. Professional responsibility is the exact opposite of what she suggests: it's not jumping on the newest bandwagon. "a little rough around the edges" is precisely what production environments don't want; predictability/stability is frequently more important than peak performance / microbenchmarks.

Furthermore,

> The C library is so deeply embedded in the software supply chain, and so depended upon, that you really don't want it to be a planet killer.

This is just underhanded. She implicitly called glibc and musl "planet killers".

First, technically speaking, it's just not true; and even if the implied statement were remotely true (i.e., if those mutex implementations were in fact responsible for a significant amount of cycles in actual workloads), the emotional load / snide remark ("planet killer") is unjustified.

Second, she must know very well that whenever efficiency of computation is improved, we don't use that for running the same workloads as before at lower cost / smaller environmental footprint. Instead, we keep all CPUs pegged all the time, and efficiency improvements only ever translate to larger profit. A faster mutex too translates to more $$$ pocketed, and not to less energy consumed.

I find her tone of voice repulsive.

jancsika · a year ago
This case isn't abrasive, but it's certainly incoherent.

Name a single case where professional responsibility would require C code advertised as "rough around the edges" to be used anywhere near production. (The weasel words "starting to" do not help the logic of that sentence.)

I can definitely understand how OP could view this as a red flag. The author should amend it.

asveikau · a year ago
She claimed POSIX changed for her use case, and it's not true; POSIX disallows what she said.

Indeed, the first few bullet points of her writeup on how the lock works (compare-and-swap for the uncontended case, futex for the contended case) describe how everybody has implemented locks for roughly 20 years, ever since the futex was introduced for exactly this purpose. Win32 critical sections from even longer ago work the same way.
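
That standard design looks roughly like this (a Linux-only sketch in the spirit of Ulrich Drepper's "Futexes Are Tricky" paper, not Cosmopolitan's or nsync's actual code):

```c
// Three-state futex mutex: 0 = free, 1 = locked, 2 = locked with waiters.
#define _GNU_SOURCE
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } mutex_t;

static void futex_wait(atomic_int *p, int val) {
    syscall(SYS_futex, p, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}
static void futex_wake_one(atomic_int *p) {
    syscall(SYS_futex, p, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

static void mutex_lock(mutex_t *m) {
    int c = 0;
    // Fast path: a single optimistic CAS when there is no contention.
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    // Slow path: mark the lock contended (2) and sleep until we grab it.
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex_wait(&m->state, 2);
        c = atomic_exchange(&m->state, 2);
    }
}

static void mutex_unlock(mutex_t *m) {
    // Unlike a spinlock, a sleeping lock must do an atomic RMW on unlock
    // to learn whether any waiter needs waking in the kernel.
    if (atomic_exchange(&m->state, 0) == 2)
        futex_wake_one(&m->state);
}
```

The contrast with the spinlock discussion elsewhere in the thread is visible in `mutex_unlock`: the wake decision is exactly why a sleeping lock cannot get away with a plain store-release.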

__turbobrew__ · a year ago
The drama in llamacpp was not due to language; it was due to jart making false claims, having absolutely no idea how memory maps work, and needlessly changing file headers to contain her name out of vanity.

I don’t think anybody can deny that jart is a smart person, but I think there is more to it than just abrasive language that people take issue with.

kelnos · a year ago
Justine seems like a fairly brilliant and creative person, but in production I don't think I'd want to use a libc that's "new" or "rough around the edges".

Getting "so fast" is not my first priority when I run things in production. That's stability, predictability, and reliability. Certainly performance is important: better-performing code can mean smaller infrastructure, which is great from a cost and environmental perspective. But fast comes last.

pmarreck · a year ago
I feel that there’s a certain amount of hubris that comes along with spending long periods of time solo-coding on a computer, and perhaps unwittingly starved of social contact. Without any checks on you or your work’s importance (normally provided by your bog-standard “job”), your achievements take on a grandeur that they might not have broadly earned, as impressive as they might be.

An example is APE (which I otherwise feel is a very impressive hack). One criticism might be “oh, so I not only get to be insecure on one platform, I can be insecure on many all at once?”

The longer you spend in technology, the more you realize that there are extremely few win-wins and very many win-some, lose-somes (tradeoffs).

gr4vityWall · a year ago
It came off as humor to me, at least.
orochimaaru · a year ago
I agree. It seemed like humor to me. I have been accused of using off color humor with hyperbole before. So maybe the appeal is limited.
fefe23 · a year ago
Have you considered that you may have a different kind of humor than Justine?

Why would you even post this here? Who do you think this is helping?

another-acct · a year ago
I think it's fair to comment not only on the subject, but on the writing itself, too.

And it might help Justine improve her writing (and reach a larger audience -- after all, blog posts intend to reach some audience, don't they?). Of course you can always say, "if you find yourself alienated, it's your loss".

jonathrg · a year ago
It doesn't clearly come across as a joke.
almostgotcaught · a year ago
It's especially a red-flag since an enormous number of projects (almost all of them?) will never tolerate shipping fat binaries (ie what cosmopolitan C is in reality).
ataylor284_ · a year ago
The core of this is a library called nsync. It appears most of the improvements by Justine are upstreamed into nsync itself, which doesn't have any of the baggage of Cosmopolitan. Whatever your opinion of the project or author, they've made a good effort not to lock you in.
another-acct · a year ago
Agreed; this is what I've always (silently) thought of those fat binaries. Absolute stroke of genius, no doubt, and also a total abomination (IMO) from a sustainability perspective.
intelVISA · a year ago
Ironic given the generous size of the average Go binary.
jay-barronville · a year ago
> I appreciate the Cosmopolitan project, but these exaggerated claims of superiority are usually a pretty bad red flag.

In general, I agree with your sentiment, but Justine is simply a different beast. She does tend to use hyperbolic language a fair amount, but she delivers so much awesomeness that I make an exception for her.

SleepyMyroslav · a year ago
Completely tangential:

As gamedev I came to love slow mutexes that do a lot of debug things in all 'developer' builds. Have debug names/IDs, track owners, report time spent in contention to profiler, report ownership changes to profiler...

People tend to structure concurrency differently, and games have arrived at some patterns to avoid locks. But they are hard to use and require the programmer to restructure things. Most code starts as 'let's slap a lock here and try to pass the milestone'. Even fast locks will be unpredictably slow and will destroy realtime guarantees if there were any. They can be fast on average, but the tail end is never going to go away. I don't want to be that guy who comes back to it chasing 'oh our game has hitches', but I am usually that guy.

Use slow locks people. The ones that show big red in profiler. Refactor them out when you see them being hit.

I know it's a tall order. On a AAA production I can count on my fingers the people who know how to use a profiler. And it's always like that, no matter how many productions I see :)

ps. sorry for a rant. Please, continue good research into fast concurrency primitives and algorithms.

simonask · a year ago
Even more tangentially: This is one of the reasons I'm having a great time developing a game in Rust.

You never want lock contention in a game if at all avoidable, and in a lot of cases taking a lock is provably unnecessary. For example: each frame is divided into phases, and mutable access to some shared resource only needs to happen in one specific phase (like the `update()` function before `render()`, or when hot-reloading assets). With scoped threads and Rust's borrowing rules, you can structure things to not even need a mutex, and be completely certain that you will receive a stern compiler error if the code changes in a way that breaks this.

When possible, I'd always love to take a compiler error over a spike in the profiler.

loeg · a year ago
Totally agree. Debugging features -- e.g. deadlock detection / introspection -- easily pay for themselves. If you're actually acquiring locks so frequently that they matter, you should revisit your design. Sharing mutable state between threads should be avoided.
Uehreka · a year ago
So on the one hand, all this Cosmo/ape/redbean stuff sounds incredible, and the comments on these articles are usually pretty positive and don’t generally debunk the concepts. But on the other hand, I never hear mention of anyone else using these things (I get that not everyone shares what they’re doing in a big way, but after so many years I’d expect to have seen a couple project writeups talk about them). Every mention of Cosmo/ape/redbean I’ve ever seen is from Justine’s site.

So I’ve gotta ask: Is there a catch? Are these tools doing something evil to achieve what they’re achieving? Is the whole thing a tom7-esque joke/troll that I don’t get because I’m not as deep into compilers/runtimes? Or are these really just ingenious tools that haven’t caught on yet?

amiga386 · a year ago
APE works through cunning trickery that might get patched out any day now (and in OpenBSD, it has been).

Most people producing cross-platform software don't want a single executable that runs on every platform, they want a single codebase that works correctly on each platform they support.

With that in mind, languages like Go letting you cross-compile for all your targets (provided you avoid CGO) are delightful... but the 3-ways-executable magic trick of APE, while really clever, doesn't inspire confidence that it'll work forever, and for the most part it doesn't gain you anything. Each platform has its own packaging/signing requirements. You might as well compile a different target for each platform.

cyberax · a year ago
> With that in mind, languages like go letting you cross compile for all your targets (provided you avoid CGO)

Even that is not a big deal in most of cases, if you use zig to wrap CC: https://dev.to/kristoff/zig-makes-go-cross-compilation-just-...

0x1ceb00da · a year ago
Wasn't the ELF format modified by upstream to accommodate Cosmo? That makes it kinda official. Still hard to see a use case for it. If you want everyone to be able to run your program, just write a web app, a Win32 program, or a Java applet. 20-year-old Java applets still run on modern JVMs.
hyperion2010 · a year ago
> Most people

Well, I'm used to not being most people, but I'd much rather be able to produce a single identical binary for my users that works everywhere than deal with the platform-specific nonsense I have to go through right now. Having to maintain different special build processes for different platforms is a stupid waste of time.

Frankly this is how it always should have worked except for the monopolistic behavior of various platforms in the past.

fragmede · a year ago
> Most people producing cross-platform software don't want a single executable that runs on every platform

They don't? Having one file to download instead of a maze of "okay so what do you have" is way easier than the current mess. It would be very nice not to have to ask users what platform they're on, they just click a link and get the thing.

karel-3d · a year ago
what worked for me for cross-compiling go:

https://github.com/elastic/golang-crossbuild docker images for macos and windows (not linux though)

https://github.com/rust-cross/rust-musl-cross docker images for linux (yes, it's possible to use the same one for go, you just need to put go binary there) so it's static with musl libc and sure to not have some weird issues with old/new libc on old/new/whatever linuxes

(it might be over the top and the zig thing might effectively do the same thing, but I never got it to compile)

oh and apple-codesign rust cargo to get through the signing nonsense

blenderob · a year ago
> Is there a catch?

I am only speaking for myself here. While cosmo and ape do seem very clever, I do not need this type of clever stuff in my work if the ordinary stuff already works fine.

Like for example if I can already cross-compile my project to other OSes and platforms or if I've got the infra to build my project for other OSes and platforms, I've no reason to look for a solution that lets me build one binary that works everywhere.

There's also the thing that ape uses clever hacks to be able to run on multiple OSes. What if those hacks break someday due to how executable formats evolve? What if nobody has the time to patch APE to make it compatible with those changes?

But my boring tools like gcc, clang, go, rust, etc. will continue to get updated and they will continue to work with evolving OSes. So I just tend to stick with the boring. That's why I don't bother with the clever because the boring just works for me.

sublimefire · a year ago
If the executable format changes then it would affect more than just ape/cosmopolitan. All of the old binaries would just stop working.
dundarious · a year ago
Mozilla llamafile uses it. Bundles model weights and an executable into a single file, that can be run from any cosmo/ape platform, and spawns a redbean http server for you to interact with the LLM. Can also run it without the integrated weights, and read weights from the filesystem. It's the easiest "get up and go" for local LLMs you could possibly create.
cowsandmilk · a year ago
Mozilla’s llamafile is primarily developed by jart. I wouldn’t view it as anyone else actually using cosmo/ape
jkachmar · a year ago
reposting my comment from another time this discussion came up:

"Cosmopolitan has basically always felt like the interesting sort of technical loophole that makes for a fun blog post which is almost guaranteed to make it to the front page of HN (or similar) purely based in ingenuity & dedication to the bit.

as a piece of foundational technology, in the way that `libc` necessarily is, it seems primarily useful for fun toys and small personal projects.

with that context, it always feels a little strange to see it presented as a serious alternative to something like `glibc`, `musl`, or `msvcrt`; it’s a very cute hack, but if i were to find it in something i seriously depend on i think i’d be a little taken aback."

sublimefire · a year ago
The problem with this take is that it is not grounded in facts that should determine if the thing is good or bad.

Logically, having one thing instead of multiple makes sense, as it simplifies distribution and the bytes stored/transmitted. I think the issue here is that it is not time-tested. I believe multiple people in various places around the globe are thinking about it and will try to test it. With time this will either bubble up or go stale.

wmf · a year ago
But what's actually bad about it?
int_19h · a year ago
Mozilla has a project called Llamafile (https://github.com/Mozilla-Ocho/llamafile) that's based on Cosmopolitan libc. And they do regularly publish popular models repackaged in that format on Hugging Face: https://huggingface.co/models?search=llamafile.

Whether that in turn has any practical use beyond quickly trying out small models is another question.

almostgotcaught · a year ago
> So I’ve gotta ask: Is there a catch? Are these tools doing something evil to achieve what they’re achieving?

it's not that complicated; they're fat binaries (plus, I guess, a lot of papering over the differences between the platforms) that exploit a quirk of the Thompson shell:

> One day, while studying old code, I found out that it's possible to encode Windows Portable Executable files as a UNIX Sixth Edition shell script, due to the fact that the Thompson Shell didn't use a shebang line.

(https://justine.lol/ape.html)

so the answer is simple: i can't think of anyone that wants to ship fat binaries.

fragmede · a year ago
Anyone focused on customer experience wants to ship a binary that just works, without customers having to know what a fat binary even is.
cycomanic · a year ago
Correct me if I'm wrong, but I always thought Apple universal binaries are fat binaries? So why did Apple build this ability if nobody wants it?
fragmede · a year ago
> Is the whole thing a tom7-esque joke/troll that I don’t get because I’m not as deep into compilers/runtimes? Or are these really just ingenious tools that haven’t caught on yet?

If I went up to you in 2008 and said, hey, let's build a database that doesn't do schemas, isn't relational, doesn't do SQL, isn't ACID compliant, doesn't do joins, has transactions as an afterthought, and only does indexing sometimes, you'd think I was trolling you. And then in 2009, MongoDB came out and caught on in various places. So only time will tell if these are ingenious tools that haven't caught on yet. There's definitely a good amount of genius behind it, though only time will tell if it's remembered in the vein of tom7's harder drives or if it sees wider production use. I'll say that if golang supported it as a platform, I'd switch my build pipelines at work over to it, since it makes their output less complicated to manage as there's only a single binary to deal with instead of 3-4.

gavindean90 · a year ago
Tbh I don’t know that there is a catch except that there is one core person behind whole project. I like that there are fresh ideas in the C space.
tangus · a year ago
Last year I needed to make a small webapp to be hosted on a Windows server, and I thought RedBean would be ideal for it. Unfortunately it was too buggy (at least on Windows).

I don't know whether RedBean is production-ready now, but a year and a half ago, that was the catch.

jart · a year ago
Give the latest nightly build a try: https://cosmo.zip/pub/cosmos/bin/ Windows has been a long hard march, but we've recently hit near feature completion. As of last month, the final major missing piece of the puzzle was implemented, which is the ability to send UNIX signals between processes. Cosmopolitan does such a good job on Windows now that it's not only been sturdy for redbean, but much more mature and complicated software as well, like Emacs, GNU Make, clang, Qt, and more.

sfn42 · a year ago
Most people aren't writing C as far as I know. We use Java, C#, Go, Python etc, some lunatics even use Node.

We generally don't care if some mutex is 3x faster than some other mutex. Most of the time if I'm even using a mutex which is rare in itself, the performance of the mutex is generally the least of my concerns.

I'm sure it matters to someone, but most people couldn't give two shits if they know what they're doing. We're not writing code where it's going to make a noticeable difference. There are thousands of things in our code we could optimize that would make a greater impact than a faster mutex, but we're not looking at those either because it's fast enough the way it is.

secondcoming · a year ago
What a pointless contribution
amiga386 · a year ago
If it's so good, why haven't all C libraries adopted the same tricks?

My bet is that its tricks are only always-faster for certain architectures, certain CPU models, or certain types of workload / access patterns... and a proper benchmarking of varied workloads on all supported hardware would not show the same benefits.

Alternatively, maybe the semantics of the pthread API (that cosmopolitan is meant to be implementing) are somehow subtly different and this implementation isn't strictly compliant to the spec?

I can't imagine it's that the various libc authors aren't keeping up in state-of-the-art research on OS primitives...

aseipp · a year ago
Those projects often have dozens of other priorities beyond just one specific API, and obsessing over individual APIs isn't a good way to spend the limited time they have. In any case, as a concrete example to disprove your claim, you can look at malloc and string routines in your average libc on Linux.

glibc's malloc is tolerable but loses handily to more modern alternatives in overall speed and scalability (it fragments badly and degrades over time, not to mention it has a dozen knobs that can deeply impact real-life workloads, like MALLOC_ARENA_MAX). musl's malloc is completely awful in terms of performance at every level; in a multithreaded program, using musl's allocator will destroy your performance so badly that it nearly qualifies as malpractice, in my experience.

musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks, and yes it absolutely shows up in non-trivial profiles, and yes improving this improves all programs nearly universally. glibc's optimized routines are good, but they can always, it seems, become faster.

These specific things aren't "oh, they're hyper specific optimizations for one architecture that don't generalize". These two things in particular -- we're talking 2-5x wall clock reduction, and drastically improved long-term working set utilization, in nearly all workloads for any given program. These are well explored and understood spaces with good known approaches. So why didn't they take them? Because, as always, they probably had other things to do (or conflicting priorities like musl prioritizing simplicity over peak performance, even when that philosophy is actively detrimental to users.)

I'm not blaming these projects or anything. Nobody sets out and says "My program is slow as shit and does nothing right, and I designed it that way and I'm proud of it." But the idea that the people working on them have made only the perfect pareto frontier of design choices just isn't realistic in the slightest and doesn't capture the actual dynamics of how most of these projects are run.

jart · a year ago
> musl doesn't even have things like SIMD optimized string comparison routines. You would be shocked at how many CPU cycles in a non-trivial program are spent on those tasks

Building GNU Make with Cosmo or glibc makes cold startup go 2x faster for me on large repos compared to building it with Musl, due to vectorized strlen() alone (since SIMD is 2x faster than SWAR). I sent Rich a patch last decade adding sse to strlen(), since I love Musl, and Cosmo is based on it. But alas he didn't want it. Even though he seems perfectly comfortable using ARM's strlen() assembly.
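
For readers unfamiliar with the jargon: SWAR means "SIMD within a register". A rough sketch of a SWAR strlen() (my own illustration; real libc versions handle the head, tail, and page boundaries much more carefully):

```c
// SWAR strlen sketch: scan 8 bytes per iteration using the classic
// zero-byte test. Note this deliberately reads the full word containing
// the terminator; aligning first keeps the load within one page.
#include <stdint.h>
#include <string.h>

static size_t swar_strlen(const char *s) {
    const char *p = s;
    // Handle leading bytes one at a time until p is 8-byte aligned.
    while ((uintptr_t)p % 8) {
        if (!*p) return (size_t)(p - s);
        p++;
    }
    const uint64_t ones = 0x0101010101010101ull;
    const uint64_t high = 0x8080808080808080ull;
    for (;;) {
        uint64_t w;
        memcpy(&w, p, 8);  // aligned 8-byte load
        // (w - ones) & ~w & high is nonzero iff some byte of w is zero.
        if ((w - ones) & ~w & high) {
            for (int i = 0; i < 8; i++)
                if (!p[i]) return (size_t)(p + i - s);
        }
        p += 8;
    }
}
```

Real SIMD versions do the same thing with 16-, 32-, or 64-byte vector compares, which is where the further 2x over SWAR comes from.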

> glibc's malloc is tolerable but fails handily to more modern alternatives in overall speed and scalability

The focus and attention I put into cosmo mutexes isn't unique. I put that care into everything else too, and malloc() is no exception. Cosmo does very well at multi-threaded memory allocation. I can pick benchmark parameters where it outperforms glibc and jemalloc by 100x. I can also pick params where jemalloc wins by 100x. But I'm reasonably certain cosmo can go faster than glibc and musl in most cases while using less memory too. You have Doug Lea to thank for that.

Every day is a good day working on cosmo, because I can always find an opportunity to dive into another rabbit hole. Even ones as seemingly unimportant as clocks: https://github.com/jart/cosmopolitan/commit/dd8544c3bd7899ad...

wrsh07 · a year ago
Another example is hash maps: all the large companies built better maps in C++ (folly, absl), but the number of performance-sensitive apps that still use std::unordered_map will be astounding forever.

(Why not upstream? ABI compatibility, which is apparently a sufficient veto reason for anything in C++.)

dist-epoch · a year ago
Politics, not-invented-here syndrome, old maintainers.

It takes forever to change something in glibc, or the C++ equivalent.

There are many kinds of synchronization primitives. pthreads only supports a subset. If you are limiting yourself to them, you are most likely leaving performance on the table, but you gain portability.

jitl · a year ago
> I can't imagine it's that the various libc authors aren't keeping up in state-of-the-art research on OS primitives...

is this sarcasm?

(I don't know any libc maintainers, but as a maintainer of a few thingies myself, I do not try to implement state-of-the-art research, I try to keep my thingies stable and ensure the performance is acceptable; implementing research is out of my budget for "maintenance")

amiga386 · a year ago
But if you maintain a few thingies, you'd probably know about rival thingies that do a similar thing, right?

If the rival thingies got a speed boost recently, and they were open source, you'd want to have a look at how they did it and maybe get a similar speed boost for yourself.

leni536 · a year ago
Are there ABI considerations for changing a pthread mutex implementation?
pnt12 · a year ago
> If it's so good, why haven't all C libraries adopted the same tricks?

A man and a statistician are walking down the street when they see a €50 bill. The statistician keeps walking, but the man stops and says, "Hey, look at this cash on the ground!" The statistician, unimpressed, says, "Must be fake, or someone would have already picked it up," and continues walking. The other man grabs the cash and takes it.

Const-me · a year ago
My guess is, because what's in the current standard libraries and OSes is good enough.

Synchronizing multiple CPU cores is fundamentally slow; there's no way around it. They are far apart on the chip, and sometimes even on different chips with some link between them. Measured in CPU cycles, that latency is substantial.

It's possible to avoid with good old software engineering, and over time, people who wanted to extract performance from their multi-core CPUs became good at it.

When you’re computing something parallel which takes minutes, you’ll do great if you update the progress bar at a laughable 5 Hz. Synchronizing cores 5 times each second costs nothing regardless of how efficient the mutex is.

When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60 Hz.

Another notable example is multimedia frameworks. These handle realtime data arriving at high frequencies, like 48 kHz for audio, and they do non-trivial compute in effect transforms and codecs, so they need multiple cores. But they can tolerate a bit of latency, so they batch the samples. This dramatically cuts IPC costs, because you only need to lock the mutexes at 100 Hz when batching 480 samples = 10 ms of audio.
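A minimal sketch of that batching pattern (names and sizes are illustrative): the producer fills a local block sample-by-sample, then publishes the whole 10 ms batch under a single lock acquisition, so the mutex is taken 100 times a second instead of 48,000.

```c
#include <pthread.h>
#include <string.h>

#define BATCH 480  /* 10 ms of audio at 48 kHz */

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static float g_shared[BATCH];  /* buffer handed off to the consumer thread */

/* Publish one batch under a single lock acquisition: 100 lock/unlock
 * pairs per second instead of one per sample. */
void publish_block(const float *samples) {
    pthread_mutex_lock(&g_lock);
    memcpy(g_shared, samples, sizeof(g_shared));
    pthread_mutex_unlock(&g_lock);
}
```

A real audio path would use a lock-free ring buffer on the consumer side to avoid priority inversion, but the amortization argument is the same.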

monocasa · a year ago
> When you’re computing something interactive like a videogame, it’s often enough to synchronize cores once per rendered frame, which often happens at 60Hz.

That's not really the right way to do it. If you're actually using multiple cores in your code, you want to build a data dependency graph and let the cores walk that graph. Max node size ends up being something loosely like 10k items to be processed. You'll typically see hundreds of synchronization points per frame.

This is the base architecture of stuff like Intel's Threading Building Blocks and Rust's rayon library.
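A stripped-down sketch of the idea (illustrative names; real systems like TBB and rayon use persistent worker pools and work stealing, not thread-per-chunk): split the frame's work into nodes of roughly 10k items, fan out, and join. Each join is one of the many per-frame synchronization points.

```c
#include <pthread.h>
#include <stddef.h>

#define CHUNK 10000  /* loosely the node granularity described above */

struct task { float *data; size_t n; };

static void *process_chunk(void *arg) {
    struct task *t = arg;
    for (size_t i = 0; i < t->n; i++)
        t->data[i] *= 2.0f;  /* stand-in for real per-item work */
    return 0;
}

/* Fork-join over fixed-size chunks: each chunk is one node of a
 * (here, trivially flat) dependency graph; the joins are the
 * synchronization points. */
void process_frame(float *data, size_t n) {
    enum { MAXT = 64 };
    pthread_t th[MAXT];
    struct task tasks[MAXT];
    size_t nt = 0;
    for (size_t off = 0; off < n && nt < MAXT; off += CHUNK, nt++) {
        tasks[nt].data = data + off;
        tasks[nt].n = (n - off < CHUNK) ? n - off : CHUNK;
        pthread_create(&th[nt], 0, process_chunk, &tasks[nt]);
    }
    for (size_t i = 0; i < nt; i++)
        pthread_join(th[i], 0);
}
```

With actual data dependencies, a node runs once all its inputs are ready, rather than everything joining in one barrier.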

astrange · a year ago
> They are far apart on the chip, and sometimes even on different chips with some link between.

They aren't always. On a NUMA machine some are closer than others; Apple's M-series is an example. The cores are in clusters where some of the cache is shared, so atomics are cheap as long as the data doesn't leave the cluster.

dumdood · a year ago
Threads and mutexes are the most complicated things in computer science. I am always skeptical of new implementations until they've been used for several years at scale. Bugs in these threading mechanisms often elude even the most intense scrutiny. When Java hit the scene in the mid 90s, it exposed all manner of thread and mutex bugs in Solaris. I don't want the fastest mutex implementation - I want a reliable one.
xxs · a year ago
Mutexes are far from the most 'complicated'. There really aren't that many ways to implement them (efficiently). In most cases they are best avoided, esp. on the read paths.
deviantbit · a year ago
This code benchmarks mutex contention, not mutex lock performance. If you're locking like this, you should reevaluate your code. Each thread locks and unlocks the mutex for every increment of g_chores. This creates an overhead of acquiring and releasing the mutex frequently (100,000 times per thread). This overhead masks the real performance differences between locking mechanisms because the benchmark is dominated by lock contention rather than actual work. Benchmarks such as this one are useless.