Jemalloc Postmortem - Readit News

I understand the decision to archive the upstream repo; as of when I left Meta, we (i.e. the Jemalloc team) weren’t really in a great place to respond to all the random GitHub issues people would file (my favorite was the time someone filed an issue because our test suite didn’t pass on Itanium lol). Still, it makes me sad to see. Jemalloc is still IMO the best-performing general-purpose malloc implementation that’s easily usable; TCMalloc is great, but is an absolute nightmare to use if you’re not using bazel (this has become slightly less true now that bazel 7.4.0 added cc_static_library so at least you can somewhat easily export a static library, but broadly speaking the point still stands).

I’ve been meaning to ask Qi if he’d be open to cutting a final 6.0 release on the repo before re-archiving.

At the same time it’d be nice to modernize the default settings for the final release. Disabling the (somewhat confusingly backwardly-named) “cache oblivious” setting by default so that the 16 KiB size-class isn’t bloated to 20 KiB would be a major improvement. This isn’t to disparage your (i.e. Jason’s) original choice here; IIRC when I last talked to Qi and David about this they made the point that at the time you chose this default, typical TLB associativity was much lower than it is now. On a similar note, increasing the default “page size” from 4 KiB to something larger (probably 16 KiB), which would correspondingly increase the large size-class cutoff (i.e. the point at which the allocator switches from placing multiple allocations onto a slab, to backing individual allocations with their own extent directly) from 16 KiB up to 64 KiB would be pretty impactful. One of the last things I looked at before leaving Meta was making this change internally for major services, as it was worth a several percent CPU improvement (at the cost of a minor increase in RAM usage due to increased fragmentation). There’s a few other things I’d tweak (e.g. switching the default setting of metadata_thp from “disabled” to “auto”, changing the extent-sizing for slabs from using the nearest exact multiple of the page size that fits the size-class to instead allowing ~1% guaranteed wasted space in exchange for reducing fragmentation), but the aforementioned settings are the biggest ones.

matoro · 3 months ago

That was me that filed the Itanium test suite failure. :)

apaprocki · 3 months ago

Ah, porting to HP Superdome servers. It’s like being handed a brochure describing the intricate details of the iceberg the ship you just boarded is about to hit in a few days.

A fellow traveler, ahoy!

boulos · 3 months ago

The Itanic was kind of great :). I'm convinced it helped sink SGI.

kabdib · 3 months ago

one of the best books on Linux architecture i've read was the one on the Itanium port

i think, because Itanic broke a ton of assumptions

kstrauser · 3 months ago

Stuff like this is what keeps me coming back here. Thanks for posting this!

What's hard about using TCMalloc if you're not using bazel? (Not asking to imply that it's not, but because I'm genuinely curious.)

Svetlitski · 3 months ago

It’s just a huge pain to build and link against. Before the bazel 7.4.0 change your options were basically:

1. Use it as a dynamically linked library. This is not great because you’re taking at a minimum the performance hit of going through the PLT for every call. The forfeited performance is even larger if you compare against statically linking with LTO (i.e. so that you can inline calls to malloc, get the benefit of FDO , etc.). Not to mention all the deployment headaches associated with shared libraries.

2. Painfully manually create a static library. I’ve done this, it’s awful; especially if you want to go the extra mile to capture as much performance as possible and at least get partial LTO (i.e. of TCMalloc independent of your application code, compiling all of TCMalloc’s compilation units together to create a single object file).

When I was at Meta I imported TCMalloc to benchmark against (to highlight areas where we could do better in Jemalloc) by pain-stakingly hand-translating its bazel BUILD files to buck2 because there was legitimately no better option.

As a consequence of being so hard to use outside of Google, TCMalloc has many more unexpected (sometimes problematic) behaviors than Jemalloc when used as a general purpose allocator in other environments (e.g. it basically assumes that you are using a certain set of Linux configuration options [1] and behaves rather poorly if you’re not)

[1] https://google.github.io/tcmalloc/tuning.html#system-level-o...

gazpacho · 3 months ago

I would love to see these changes - or even some sort of blog post or extended documentation explaining rational. As is the docs are somewhat barren. I feel that there’s a lot of knowledge that folks like you have right now from all of the work that was done internally at Meta that would be best shared now before it is lost.

EnPissant · 3 months ago

Do you have any opinions on mimalloc?

michaelcampbell · 3 months ago

> filed an issue because our test suite didn’t pass on Itanium lol

For the non low-level programmers in the bowels of memory allocators among us, why is this a "lol"?

Svetlitski · 3 months ago

The Itanium ISA was an infamous failure, never seeing widespread usage, hence people often referring to it as “The Itanic” (i.e. the much-touted ship that immediately sunk). The fact that anyone would be using it today at all is sort of hilariously niche, and is illustrative of how wide-ranging and obscure the issues filed to the GitHub repo could be. On a similar token I recall seeing an issue (or maybe it was a PR?) to fix our build on GNU Herd.

klabb3 · 3 months ago

> we (i.e. the Jemalloc team) weren’t really in a great place to respond to all the random GitHub issues people would file

Why not? I mean this is complete drive-by comment, so please correct me, but there was a fully staffed team at Meta that maintained it, but was not in the best place to manage the issues?

anonymoushn · 3 months ago

Well, to be blunt, the company does not care about this, so it does not get done.

xcrjm · 3 months ago

They said the team was not in a great place to do it, eg. they probably had competing priorities that overshadowed triaging issues.

einpoklum · 3 months ago

> TCMalloc is great, but is an absolute nightmare to use if you’re not using bazel

custom-malloc-newbie question: Why is the choice of build system (generator) significant when evaluating the usability of a library?

fc417fc802 · 3 months ago

Because you need to build it to use it, and you likely already have significant build related infrastructure, and you are going to need to integrate any new dependencies into that. I'm increasingly convinced that the various build systems are elaborate and wildly successful ploys intended only to sap developer time and energy.

CamouflagedKiwi · 3 months ago

Because you have to build it. If they don't use the same build system as you, you either want to invoke their system, or import it into yours. The former is unappealing if it's 'heavy' or doesn't play well as a subprocess; the latter can take a lot of time if the build process you're replicating is complex.

I've done both before, and seen libraries at various levels of complexity; there is definitely a point where you just want to give up and not use the thing when it's very complex.

Thaxll · 3 months ago

It's kind of wild that great software is hindered by a complicated build and integration process.

I’ve wondered about this before but never when around people who might know. From my outsider view, jemalloc looked like a strict improvement over glibc’s malloc, according to all the benchmarks I’d seen when the subject came up. So, why isn’t it the default allocator?

toast0 · 3 months ago

It is on FreeBSD. :P Change your malloc, change your life? May as well change your libc while you're there and use FreeBSD libc too, and that'll be easier if you also adopt the FreeBSD kernel.

I will say, the Facebook people were very excited to share jemalloc with us when they acquired my employer, but we were using FreeBSD so we already had it and thought it was normal. :)

favorited · 3 months ago

Disclaimer: I'm not an allocator engineer, this is just an anecdote.

A while back, I had a conversation with an engineer who maintained an OS allocator, and their claim was that custom allocators tend to make one process's memory allocation faster at the expense of the rest of the system. System allocators are less able to make allocation fair holistically, because one process isn't following the same patterns as the rest.

Which is why you see it recommended so frequently with services, where there is generally one process that you want to get preferential treatment over everything else.

mort96 · 3 months ago

The only way I can see that this would be true is if a custom allocator is worse about unmapping unused memory than the system allocator. After all, processes aren't sharing one heap, it's not like fragmentation in one process's address space is visible outside of that process... The only aspects of one process's memory allocation that's visible to other processes is, "that process uses N pages worth of resident memory so there's less available for me". But one of the common criticisms against glibc is that it's often really bad at unmapping its pages, so I'd think that most custom allocators are nicer to the system?

It would be interested in hearing their thoughts directly, I'm also not an allocator engineer and someone who maintains an OS allocator probably knows wayyy more about this stuff than me. I'm sure there's some missing nuance or context or which would've made it make sense.

jeffbee · 3 months ago

I don't think that's really a position that can be defended. Both jemalloc and tcmalloc evolved and were refined in antagonistic multitenant environments without one overwhelming application. They are optimal for that exact thing.

jeffbee · 3 months ago

These allocators often have higher startup cost. They are designed for high performance in the steady state, but they can be worse in workloads that start a million short-lived processes in the unix style.

kstrauser · 3 months ago

Oh, interesting. If that's the case, I can see why that'd be a bummer for short-lived command line tools. "Makes ls run 10x slower" would not be well received. OTOH, FreeBSD uses it by default, and it's not known for being a sluggish OS.

o11c · 3 months ago

For a long time, one of the major problems with alternate allocators is that they would never return free memory back to the OS, just keep the dirty pages in the process. This did eventually change, but it remains a strong indicator of different priorities.

There's also the fact that ... a lot of processes only ever have a single thread, or at most have a few background threads that do very little of interest. So all these "multi-threading-first allocators" aren't actually buying anything of value, and they do have a lot of overhead.

Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)

senderista · 3 months ago

> Semi-related: one thing that most people never think about: it is exactly the same amount of work for the kernel to zero a page of memory (in preparation for a future mmap) as for a userland process to zero it out (for its own internal reuse)

Possibly more work since the kernel can't use SIMD

vlovich123 · 3 months ago

That’s actually particular try to alternate allocators and not true for glibc if I recall correctly (it’s much worse at returning memory).

sanxiyn · 3 months ago

As far as I know there is no technical reason why jemalloc shouldn't be the default allocator. In fact, as pointed out in the article, it IS the default allocator on FreeBSD. My understanding is it is largely political.

kstrauser · 3 months ago

Now that I think about it, I could easily imagine it being left out of glibc because it doesn't build on Hurd or something.

Dead Comment

adityapatadia · 3 months ago

Jason, here is a story about how much your work impacts us. We run a decently sized company that processes hundreds of millions of images/videos per day. When we first started about 5 years ago, we spent countless hours debugging issues related to memory fragmentation.

One fine day, we discovered Jemalloc and put it in our service, which was causing a lot of memory fragmentation. We did not think that those 2 lines of changes in Dockerfile were going to fix all of our woes, but we were pleasantly surprised. Every single issue went away.

Today, our multi-million dollar revenue company is using your memory allocator on every single service and on every single Dockerfile.

Thank you! From the bottom of our hearts!

thewisenerd · 3 months ago

indeed! most image processing golang services suggest/use jemalloc

the top 3 from https://github.com/topics/resize-images (as of 2025-06-13)

imaginary: https://github.com/h2non/imaginary/blob/1d4e251cfcd58ea66f83...

imgproxy: https://web.archive.org/web/20210412004544/https://docs.imgp... (linked from a discussion in the imaginary repo)

imagor: https://github.com/cshum/imagor/blob/f6673fa6656ee8ef17728f2...

tecleandor · 3 months ago

Yep, imgproxy seems to use libvips, that recommends jemalloc. I was checking and this is a funny (not) bug report:

https://github.com/libvips/libvips/discussions/3019

jcupitt · 3 months ago

Those three all use libvips as the image processing engine, fwiw, so it's maybe not a very wide survey.

libvips is fairly highly threaded and does a lot of alloc/free, so it's challenging for most heap implementations.

laszlojamf · 3 months ago

I really don't mean to be snarky, but honest question: Did you donate? Nothing says thank you like some $$$...

onli · 3 months ago

It was a meta project and development ceased. For a regular project that expectation is fine, but here it does not apply IMHO.

We regularly donate to project via open collective. We frankly did not see here due to FB involvement I think.

masklinn · 3 months ago

> jemalloc was probably booted from Rust binaries sooner than the natural course of development might have otherwise dictated.

FWIW while it was a factor it was just one of a number: https://github.com/rust-lang/rust/issues/36963#issuecomment-...

And jemalloc was only removed two years after that issue was opened: https://github.com/rust-lang/rust/pull/55238

Aissen · 3 months ago

Interesting that one of the factor listed in there, the hardcoded page-size on arm64, is still is an unsolved issue upstream, and that forces app developers to either ship multiple arm64 linux binaries, or drop support for some platforms.

I wonder if some kind of dynamic page-size (with dynamic ftrace-style binary patching for performance?) would have been that much slower.

pkhuong · 3 months ago

You can run jemalloc configured with 16KB pages on a 4KB page system.

dazzawazza · 3 months ago

I've used jemalloc in every game engine I've written for years. It's just the thing to do. WAY faster on win32 than the default allocator. It's also nice to have the same allocator across all platforms.

I learned of it from it's integration in FreeBSD and never looked back.

jemalloc has help entertained a lot of people :)

Iwan-Zotow · 3 months ago

windows def allocator is pos. Jemalloc rules

ahartmetz · 3 months ago

>windows def allocator is pos

Wow, still? I remember allocator benchmarks from 10-15 years ago where there were some notable differences between allocators... and then Windows with like 20% the performance of everything else!

int_19h · 3 months ago

> windows def allocator

Which one of them? These days it could mean HeapAlloc, or it could mean malloc from uCRT.

chubot · 3 months ago

Nice post -- so does Facebook no longer use jemalloc at all? Or is it maintenance mode?

Or I wonder if they could simply use tcmalloc or another allocator these days?

Facebook infrastructure engineering reduced investment in core technology, instead emphasizing return on investment.

As of when I left Meta nearly two years ago (although I would be absolutely shocked if this isn’t still the case) Jemalloc is the allocator, and is statically linked into every single binary running at the company.

> Or I wonder if they could simply use tcmalloc or another allocator these days?

Jemalloc is very deeply integrated there, so this is a lot harder than it sounds. From the telemetry being plumbed through in Strobelight, to applications using every highly Jemalloc-specific extension under the sun (e.g. manually created arenas with custom extent hooks), to the convergent evolution of applications being written in ways such that they perform optimally with respect to Jemalloc’s exact behavior.

charcircuit · 3 months ago

Meta has a fork that they still are working on, where development is continuing.

https://github.com/facebook/jemalloc

nh2 · 3 months ago

The point of the blog post is that repo is over-focused on Facebook's needs instead of "general utility":

> as a result of recent changes within Meta we no longer have anyone shepherding long-term jemalloc development with an eye toward general utility

> we reached a sad end for jemalloc in the hands of Facebook/Meta

> Meta’s needs stopped aligning well with those of external uses some time ago, and they are better off doing their own thing.

burnt-resistor · 3 months ago

They take everything FLOSS and ruin it with bureaucracy, churn, breakage, and inconsideration to external use. They may claim FOSS broadly but it's mostly FOSS-washed, unusable garbage except for a few popular things.

The big recent change is that jemalloc no longer has any of its previous long-term maintainers. But it is receiving more attention from Facebook than it has in a long time, and I am somewhat optimistic that after some recent drama where some of that attention was aimed in a counterproductive direction that the company can aim the rest of it in directions that Qi and Jason would agree with, and that are well aligned with the needs of external users.

schrep · 3 months ago

Your work was so impactful over a long period from Firefox to Facebook. Honored to have been a small part of it.

lbrandy · 3 months ago

Suppose this is as good a place to pile-on as any.

Though this was not the post I was expecting to show up today, it was super awesome for me to get to have played my tiny part in this big journey. Thanks for everything @je (and qi + david -- and all the contributors before and after my time!).

liuliu · 3 months ago

Your leadership on continuing investing in core technologies in Facebook were as fruitful as it could ever being. GraphQL, PyTorch, React to name a few cannot happen without.

dao- · 3 months ago

Hmm, if I had to choose between not having Facebook and having React, I'd pick the former in a heartbeat. Not that this was a real choice, but it was nonetheless bitter to see colleagues join the behemoth that was Facebook.

Deleted Comment

meisel · 3 months ago

I believe there’s no other allocator besides jemalloc that can seamlessly override macOS malloc/free like people do with LD_PRELOAD on Linux (at least as of ~2020). jemalloc has a very nice zone-based way of making itself the default, and manages to accommodate Apple’s odd requirements for an allocator that have tripped other third-party allocators up when trying to override malloc/free.

glandium · 3 months ago

Note this requires hackery that relies on Apple not changing things in its system allocator, which has happened at least twice IIRC.

adgjlsfhk1 · 3 months ago

I believe mimalloc works here (but might be wrong).