Mimalloc Cigarette: Losing one week of my life catching a memory leak (Rust)

We had learned helplessness on a drag and drop bug in jquery UI. I had like three hours every second or third Friday and would just step through the code trying to find the bug. That code was so sketchy the jquery team was trying to rewrite it from scratch one component at a time, and wouldn’t entertain any bug discussions on the old code even though they were a year behind already.

After almost six months, I finally found a spot where I could monkey patch a function to wrap it with a short circuit if the coordinates were out of bounds. Not only fixed the bug but made drag and drop several times faster. Couldn’t share this with the world because they weren’t accepting PRs against the old widgets.

I’ve worked harder on bug fixes, but I think that’s the longest I’ve worked on one.

giancarlostoro · a year ago

One of my favorite most elusive bugs was a one liner change. I didn't understand the problem because nobody could reproduce it, or show it. Months later, after my boss told his boss it was fixed, despite never being able to test that it was fixed, I figured it out and fixed it. We had a gift card form, and we stored it in localStorage, if for any reason the person left the tab, and came back months later, it would show the old gift card with its old dated balance, it was a client-side bug. The fix was to use sessionStorage.

arghwhat · a year ago

For web, my favorite is JIT miscompilations. A tie between a mobile Safari bug that caused basic math operations to return 0 regardless of input values (basic, positive Numbers, no shenanigans), or a mobile Samsung browser bug where concatenating a specific single-character string with another single-character string would yield a Number.

Debugging errors in JS crypto and compression implementations that only occur at random, after at least some ten thousand iterations, on a mobile browser back when those were awful, and only if the debugger is closed/detached as opening it disabled the JIT was not fun.

It taught me to go into debugging with no assumptions about what can and cannot be to blame, which has been very useful later in even trickier scenarios.

contingencies · a year ago

It seems in the context of your story the old adage that organizations reproduce software in their own architecture again rings true, with multilayered bureaucracy, lies and promises resulting in "client state".

WalterBright · a year ago

My longest one was an uninitialized declaration of a local variable, which acquired ever-changing values.

This is why D, by default, initializes all variables. Note that the optimizer removes dead assignments, so this is runtime cost-free. D's implementation of C, ImportC, also default initializes all locals. Why let that stupid C bug continue?

Another that repeatedly bit me was adding a field, and neglecting to add initialization of it to all the constructors.

This is why D guarantees that all fields are initialized.

hinkley · a year ago

The first bug I remember writing was making native calls in Java to process data. I didn’t understand why in the examples they kept rerunning the handle dereference in every loop.

If native code calls back into Java, and the GC kicks in, all the objects the native code can see can be compacted and moved. So my implementation worked fine for all of the smaller test fixtures, and blew up half the time with the largest. Because I skipped a line to make it “go faster”.

I finally realized I was seeing raw Java objects in the middle of my “array” and changing the value of final fields into illegal pairings which blew everything the fuck up.

ckocagil · a year ago

Valgrind didn't catch it?

I wonder if there is something that could be done on language design level to have better "sympathy" to memory allocation, i.e. built upon having mmap/munmap as primitives instead of malloc/free; where language patterns are built around allocating pages instead of arbitrarily sized objects. Probably not practical for general high-level languages, but for e.g. embedded or high-performance stuff might make sense?

PaulDavisThe1st · a year ago

This seems to fail to understand that we already have both levels.

Every OS will provide some mechanism to get more pages. But it turns out that managing the use of those pages requires specialized handling, depending on the use case, as well as a bunch of boilerplate. Hence, we also have malloc and its many, many cousins to allocate arbitrary size objects.

You're always welcome to use brk(2) or your OS's equivalent if you just want pages. The question is, what are you going to do with each page once you have it? That's where the next level comes in ...

eschneider · a year ago

In general for embedded, you don't page memory even if you're running something like embedded linux.

For high performance stuff where you need low, predictable latency, you're probably not going to want to use dynamic memory at all.

loeg · a year ago

Not exactly what you're getting at, but you could maybe imagine an explicit version of malloc where allocations are destined either for thread-local only use, or shared use. Then locally freeing remote thread-local memory is an invalid operation and these kinds of assume-locality optimizations are valid on many structures. I think you can imagine a version of mmap that allows for thread-local mappings to help detect accidental misuse of local allocation.

bsder · a year ago

Zig passes allocators around explicitly. There is no implicit memory allocator.

The downside is that it makes things like "print" a pain in the ass.

The upside is that you can have multiple memory allocators with hugely different characteristics (arena for per frame resources, bump allocator for network resources, etc.).

dathinab · a year ago

most modern memory allocators use internally mmap, this is why it most times makes sense to not use the system allocate for long running programs

Generally given that page size isn't something you know at compiler (or even install size) and it can vary between each restart and it being between anything between ~4KiB and 1GiB and most natural memory objects being much less then 4KiB but some being potentially much more then 1GiB you kind don't want to leak anything related to page sizes into your business logic if it can be helper. If you still need to most languages have memory/allocation pools you can use to get a bit more control about memory allocation/free and reuse.

Also the performance issues mentioned have not much to do with memory pages or anything like that _instead they are rooted in concurrency controls of a global resource (memory)_. I.e. thread local concurrency syncronization vs. process concurrency synchronization.

mainly instead of using a fully general purpose allocator they used an allocator whiche is still general purpose but has a design bias which improves same-thread (de)allocation perf at cost of cross thread (de)allocation perf. And they where doing a ton of cross thread (de)allocations leading to noticeable performance degradation.

The thing is even if you hypothetically only had allocations at sizes multiple of a memory page or use a ton of manual mmap you still would want to use a allocator and not always directly free freed memory back to the OS as doing so and doing a syscall on every allocation tends to lead to major performance degradation (in many use cases). So you still need concurrency controls but they come at a cost, especially for cross thread synchronization. Even just lock-free controls based on atomic have a cost over thread local controls caused often largely by cache invalidation/synchronization.

kibwen · a year ago

Level 1 systems programmer: "wow, it feels so nice having control over my memory and getting out from under the thumb of a garbage collector"

Level 2 systems programmer: "oh no, my memory allocator is a garbage collector"

matklad · a year ago

The answer is clear: just don’t have a malloc implementation in your process' address space!

thebruce87m · a year ago

Welcome to embedded! It’s no heaps of fun!

poikroequ · a year ago

A bump allocator is all anyone really needs

seanthemon · a year ago

At the very bottom of everything is a garbage collector..

Soil is just the biggest swap meet in the world. Where every microbe, invertebrate and tree is just looking for someone else’s trash to turn into treasure.

riwsky · a year ago

Market forces: the ultimate garbage collector

"stackoverflow please help me how do i fix memory fragmentation"

amelius · a year ago

Level 3 system programmer: "get me out of this straight jacket and give me my garbage collector back so I can get stuff done"

ComputerGuru · a year ago

That's not how system programmers think..

forrestthewoods · a year ago

No. Just no.

For as painful as the debugging story was I have spent vastly more amounts of time working around garbage collectors to ship performant code.

Arnavion · a year ago

jemalloc also has its own funny problem with threads - if you have a multi-threaded application that uses jemalloc on all threads except the main thread, then the cleanup that jemalloc runs on main thread exit will segfault. In $dayjob we use jemalloc as a sub-allocator in specific arenas. (*) The application itself is fine in production because it allocates from the main thread too, but the unit test framework only runs tests in spawned threads and the main thread of the test binary just orchestrates them. So the test binary triggers this segfault reliably.

( https://github.com/jemalloc/jemalloc/issues/1317 Unlike what the title says, it's not Windows-specific.)

(*): The application uses libc malloc normally, but at some places it allocates pages using `mmap(non_anonymous_tempfile)` and then uses jemalloc to partition them. jemalloc has a feature called "extent hooks" where you can customize how jemalloc gets underlying pages for its allocations, which we use to give it pages via such mmap's. Then the higher layers of the code that just want to allocate don't have to care whether those allocations came from libc malloc or mmap-backed disk file.

CraigJPerry · a year ago

Tangent: what’s the ideal data structure for this problem?

If there were 20million rooms in the world with a price for each day of the year, we’d be looking at around 7billion prices per year. That’d be say 4Tb of storage without indexes.

The problem space seems to have a bunch of options to partition - by locality, by date etc.

I’m curious if there’s a commonly understood match for this problem?

FWIW with that dataset size, my first experiments would be with SQL server because that data will fit in ram. I don’t know if that’s where I’d end up - but I’m pretty sure it’s where I’d start my performance testing grappling with this problem.

jrpelkonen · a year ago

I think your premise is somewhat off. There might be 20 million hotel rooms in a world, but surely they are not individually priced, e.g. all king bed rooms in a given hotel have the same price per given day.

Sort of tl;dr: mimalloc doesn't actually free memory in a way that it can be reused on threads other than the one that allocated it; the free call marks regions for eventual delayed reclaim by the original thread. If the original thread calls malloc again, those regions are collected (1/N malloc calls). Or (C) you can explicitly invoke mi_collect[1] in the allocating thread (the Rust crate does not seem to expose this API).

[1]: https://github.com/microsoft/mimalloc/blob/dev/src/heap.c#L1...

The mimalloc crate just provides the GlobalAlloc impl that can be registered with libstd as the global allocator using the `#[global_allocator]` attr.

The underlying sys crate provides the binding for mimalloc API like `mi_collect`: https://docs.rs/libmimalloc-sys/0.1.39/libmimalloc_sys/fn.mi...

rurban · a year ago

The Annotated C++ Reference Manual:

“C programmers think memory management is too important to be left to the computer. LISP programmers think memory management is too important to be left to the user.”

IceTDrinker · a year ago

PSA: do not use floating point for monetary amounts

SAI_Peregrinus · a year ago

MS Excel uses floating point, and it's used a ton in finance. Don't use floating-point for monetary amounts if you don't know what rounding mode you've set.

koverstreet · a year ago

It's somewhat acceptable with double precision floats - never single precision floats.

But far better to just use integer cents.

nurettin · a year ago

I have used single precision floats in my latest project just to disprove this baloney.

smh · a year ago

You are using 32 bit floats to represent money?

Does your project correctly calculate $300,000.00 + $0.01, (or even just correctly represent the value $300,000.01) and if so, how?

zokier · a year ago