Boehm-Demers-Weiser conservative C/C++ Garbage Collector

Ah the spectre of the 80s, most famously used by the ancient fork of Mono that Unity uses, leading to GC-pause induced stutters that have plagued gaming for over a decade.

I'd estimate that the bad rap GC languages get can be traced back to this piece of code.

rbehrends · 2 years ago

This needs some qualifications.

The above problem is about latency of stop the world collectors in a domain that requires extremely low latency. And if you think that stop the world collections are representative of garbage collection as a whole (i.e. the "bad rap"), this is just not being up to the state of the art. (It's just that generational/incremental collection is hard to impossible without having support for write barriers/read barriers/snapshotting on the compiler or hardware side, which makes that a practical no-go for a language that was never designed to have GC support.)

But in terms of throughput, the BDW GC actually performs pretty well, competitive with or beating malloc()/free() libraries. This is plenty enough for a number of applications, especially when it comes to batch processing. In fact, even the stop the world latency is (combined with parallel marking) plenty good enough for a number of types of regular GUI applications, where you don't have the fairly extreme latency requirements of standard video games.

It is also worth noting that manual deallocation (especially via RAII) isn't a panacea for latency, as that is just as prone to large pauses due to cascading deletes [1].

[1] While in theory you can do that lazily, in this case you're losing timeliness of deletion and may actually have worse worst case latency than a modern GC that can stagger its work to have bounded pause times. The benefit of RAII is that you may be able to plan these deletions, but even that can be surprisingly difficult at times, requiring extra management to keep data structures alive longer than normal. [2]

[2] Note that lazily deleting objects in RC to avoid pause times is not necessarily the answer; for one thing, you're introducing much of the complexity associated with incremental GC again (especially having to do X amount of deallocation work for each allocation), or risking considerable space overhead. https://dl.acm.org/doi/abs/10.1145/964001.964019 It also remains a difficult challenge when you're dealing with objects that aren't of bounded size (e.g. large arrays).

gumby · 2 years ago

> It's just that generational/incremental collection is hard to impossible without having support for write barriers/read barriers/snapshotting on the compiler or hardware side, which makes that a practical no-go for a language that was never designed to have GC support.

It was demonstrated in the mid 80s that the MMU could be used to implement the write barrier (when I saw that I knew lisp machines were a dead end, though I kept using them for as long as was feasible). You can use this and other compiler support to implement it in a language not designed for GC support, no problem.

The issue I think you meant is that you can’t have a transporting collector (which also compacts your working set, even allowing you to return memory to the system). A language like C++ doesn’t anticipate an object’s address ever changing, so that’s a no go.

I suppose you could write special shared_ptr / unique_ptr that supported address mutation but you’d need a compiler flag to check for an object’s address being taken… but to some degree the destructor paradigm renders a lot of this moot.

zozbot234 · 2 years ago

The compiler support you need in practice is quite limited. There is an implementation of incremental cycle collection in Rust: https://github.com/chc4/samsara It's made possible because Rust can tell apart read-only and read-write references (except for references to interior mutable objects, but these are known to the compiler and references to them can be treated as read-write). This avoids a global stop-the-world for the entire program.

Cascading deletes are rare in practice, and if anything they are inherent to deterministic deletion, which is often a desirable property. When they're possible, one can often use arena allocation to avoid the issue altogether since arenas are managed as a single allocation.

dataangel · 2 years ago

I have been writing low latency code for a decade and never seen cascading deletes be a problem. GC on the other hand is a constant latency problem for people.

cplusplusfellow · 2 years ago

For all of the reasons you note above, this is why I've almost exclusively dealt with arena-style allocations of fixed-object-sizes when doing ultra low latency work.

torginus · 2 years ago

The problem isn't mark-and-sweep GCs it's that the Boehm GC is the Ford T model of garbage collectors.

yxhuvud · 2 years ago

Another big reason is the lack of value types in languages like Java. Modern languages with a GC tend to have those, and where it makes sense to use them they help, a lot.

pjmlp · 2 years ago

Note about "modern languages with GC", this was already a thing in Common Lisp, Eiffel, Modula-2+, Modula-3, Mesa/Cedar, Oberon (and its linage),...

Java's mistake was ignoring them, as its was designed as set-top box programming language as Oak, pivoted to applets, and then pushed all over the place.

amelius · 2 years ago

What are "value types"?

p_l · 2 years ago

Mono is nowhere close to major use case of it, it has been succesfully deployed in maaaany projects.

C++ crazies denigrating GC even when it visibly fixed their issues, or maybe especially when it did, was a phenomenon since 1990. Even when code with Boehm GC ran faster - and most importantly, didn't crash all the time.

jstimpfle · 2 years ago

How did GC visibly fix their issues?

Manually managed code doesn't crash all the time. It probably crashed way more in the 90s. There is an art to writing manually managed code that is maintainable and fast, sure, and it doesn't involve malloc() and free(). And then there is RAII, automating lots of the work.

The C++ crazies are AFAICS still most everywhere where there is lots of temporary objects and serious requirements on throughput & latency. There are frameworks and languages that support garbage collection that get used in this space too, but I don't know how much can be inferred from that.

Seems like C and C++ guys are constantly looking for ways to automate their garbage cleanup.

And seems like GC guys are constantly looking for ways to fix their web-of-objects memory layout, and to prevent GC from ruining their performance. Plus, GC isn't deterministic, it mostly works for memory but otherwise you'll see _lots_ of manual cleanup code (not making use of the GC), and it always feels a bit tricky with the GC still being part of the game. Not that I take a huge issue with C-like cleanup code -- I think RAII can be tricky too because it's hard to "read" what happens, and I often prefer the manual cleanup code. But compared to GC, at least RAII is deterministic.

pyeri · 2 years ago

You don't need to go that far for GC's bad rep considering that excruciatingly slow Python can be found TODAY!

adev_ · 2 years ago

> You don't need to go that far for GC's bad rep considering that excruciatingly slow Python can be found TODAY!

To be fair, the GC is probably the last reason of Python speed problems.

Many languages also designed with GC performs significantly better than python in term of raw performance [^1].

The extremely dynamic nature of python associated with its reference semantics are probably significantly more to blame here.

[^1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

torginus · 2 years ago

Afaik (C)Python uses reference counting, it's slowness comes from the interpreter itself.

amelius · 2 years ago

Python wasn't designed for speed.

DarkNova6 · 2 years ago

But you must understand, Python is safe and C is difficult. Therefore, to be fast we need Python (fast development that is because nobody _really_ cares about performance (or maintainability)).

If only more than two languages would exist!

ack_complete · 2 years ago

Sadly it was not just the GC, there were plenty of problems in Mono itself that adversely affected GC performance.

A significant one was a problem in the way that the Mono C# compiler generated foreach loops, which caused it to allocate an object for the enumerator that the regular .NET compiler was able to avoid. This caused foreach() to produce unnecessary garbage and resulted in a lot of Unity developers writing manual for() indexing loops instead of a more straightforward foreach() loop.

Another problem was a bug in the Mono standard library where one of the array containers -- I forget whether it was ArrayList or List<T> -- would fail to clear elements at the end on removals. This was innocuous for arithmetic types, but for reference types it would result in hidden entries in unused capacity holding onto stale references. In some cases this caused rather large trees of objects to persist unnecessarily, pinned invisibly by containers that had already been deleted from or were even empty.

mihaaly · 2 years ago

I assume that real time computing should have used real time computing compatible techniques. But luckily those exposed to other than games and game development might feel less aversion towards GC, probably even the other way around.

pjmlp · 2 years ago

Another reason being that many people don't get GC theory, nor that because a language has some form of automatic resource management, does not mean the language doesn't offer other mechanisms.

My experience with Cpp isn't that extensive, but what is the use-case of a garbage collector in this language? I always had the impression that with the well-defined lifetimes of objects, you wouldn't really create garbage to begin with, but I guess there are some use-cases I don't yet know about?

mike_hearn · 2 years ago

It's pretty useful. Chrome uses one extensively for example (called Oilpan). So does Unreal Engine. GC in C++ is much more widely used than people realize.

The problem is that big programs in the real world often don't have well defined lifetimes for everything. The idea that everything in your app can have its lifetime worked out in advance isn't true when you're modelling the world (a world), or lifetimes are controlled by other people (e.g. website authors).

Generally what you see is that these apps start out trying to do manual memory management, decide it's too difficult to do reliably at scale, and switch to a conservatively GCd heap with a custom collector or Boehm.

Note that Rust doesn't fix this problem. Rust just encodes the assumption of pre-known lifetimes much more strongly in the language. If you're not in such a domain then you have to fall back to refcounting, which is (a) slow and (b) easy to get wrong such that you leak anyway. Early versions of Rust used refcounting for everything and iirc they found anywhere between 10-15% of the resulting binaries was refcounting instructions!

pcwalton · 2 years ago

> Note that Rust doesn't fix this problem. Rust just encodes the assumption of pre-known lifetimes much more strongly in the language. If you're not in such a domain then you have to fall back to refcounting, which is (a) slow and (b) easy to get wrong such that you leak anyway. Early versions of Rust used refcounting for everything and iirc they found anywhere between 10-15% of the resulting binaries was refcounting instructions!

Well, modern idiomatic Rust only uses Arc/Rc on the few objects where it's needed, so the overhead of reference count adjustment is so tiny as to never show up. You typically only see reference count traffic be expensive when either (a) everything is reference counted, as in ancient Rust; or (b) on super-inefficient implementations of reference counting, as in COM where AddRef() and Release() are virtual calls.

steveklabnik · 2 years ago

> Early versions of Rust used refcounting for everything and iirc they found anywhere between 10-15% of the resulting binaries was refcounting instructions!

Do you happen to have a citation for this? I don’t remember ever hearing about it, but it’s possible this was before my time, as I started in the smart pointers era.

bfgeek · 2 years ago

Oilpan can theoretically be used as a library as well - see: https://v8.dev/blog/oilpan-library https://v8.dev/blog/tags/cppgc https://chromium.googlesource.com/v8/v8.git/+/HEAD/include/c...

Due to the nature of web engine workloads migrating objects to being GC'd isn't performance negative (as most people would expect). With care it can often end up performance positive.

There are a few tricks that Oilpan can apply. Concurrent tracing helps a lot (e.g. instead of incrementing/decrementing refs, you can trace on a different thread), in addition when destructing objects, the destructors typically become trivial meaning the object can just be dropped from memory. Both these free up main thread time. (The tradeoff with concurrent tracing is that you need atomic barriers when assigning pointers which needs care).

This is on top of the safey improvements you gain from being GC'd vs. smart pointers, etc.

One major tradeoff that UAF bugs become more difficult to fix, as you are just accessing objects which "should" be dead.

uluyol · 2 years ago

Not sure early versions of rust is the best example of refcounting overhead. There are a bunch of tricks you can use to decrease that, and it usually doesn't make sense to invest too much time into that type of thing while there is so much flux in the language.

Swift is probably a better baseline.

adgjlsfhk1 · 2 years ago

it's worth noting that percent of instructions is a bad metric since modern CPUs have lots of extra compute, so adding simple integer instructions that aren't in the critical path will often not affect the wall time at all.

FartyMcFarter · 2 years ago

> what is the use-case of a garbage collector in this language?

Same as other languages.

> I always had the impression that with the well-defined lifetimes of objects, you wouldn't really create garbage to begin with

There's no well defined lifetime of objects when it comes to dynamic allocation. For example, if you allocate something with the new keyword, there are no language guarantees that this won't leak memory.

Jeaye · 2 years ago

I'm using C++ to build jank, a native Clojure dialect on LLVM with C++ interop. All of Clojure's runtime objects are dynamically allocated and the churn of reference counting is far too slow, compared to garbage collection. I had originally started with an intrusive_ptr and atomic count and the Boehm GC was about 2x faster for that benchmark (and at least as much for every later benchmark).

Even outside of writing languages on top of C++, if you're using something like immer, for persistent immutable data structures in C++ (as jank does), it has memory policies for reference counting or garbage collection. This is because immutable data generally results in more garbage, even when transients are used for the most pathological cases. That garbage is the trade off for knowing your values will never be mutated in place. The huge win of that is complete thread safety for reading those values, as well as complete trust on reproducibility/purity and trivial memoization.

pjmlp · 2 years ago

Targeting .NET, hence C++/CLI, integration with Blueprints, hence Unreal C++.

I have been using the Boehm-Demers-Weiser GC for jank [1], the native Clojure dialect on LLVM with C++ interop. Since jank is a Clojure dialect, and Clojure is built primarily on persistent, immutable data structures, there are potentially a lot of references to objects, potentially cyclical, across many threads, and there's a lot of garbage being churned. I originally started with reference counting and RAII, using boost::intrusive_ptr and an atomic count. The GC was actually 2x faster, in the sequence benchmark I was trying to optimize.

At this point, jank is generally several times faster than the equivalent Clojure JVM code. I'm sure there are better GCs out there, in terms of performance, and I have my eye on MMTK [2] for a future upgrade, but the fact that remains is this: the Boehm GC is stupid simple to integrate and it's surprisingly fast. Compare it to MPS, MMTK, and others and both the documentation and the actual dev work required are worlds apart.

For a project which needs a GC but doesn't need to pick the absolute fastest one first, it seems like the best option, based on my research and my experience.

1: https://jank-lang.org/

2: https://www.mmtk.io/code

mtlmtlmtlmtl · 2 years ago

Just want to say, Never heard of Jank before and it looks awesome, great work!.

As more of a CL guy who's dabbled in Clojure but found it a bit too slow(but conversely I also miss the Clojure's data structures and transactional state in CL; multithread programming really isn't CLs strong suit, for obvious historical reasons), this might reinvigorate me. I may sit down one day soon and churn out some AOC or something in Jank just to get a feel.

Cheers :)

The shameless plugging pays off! One more HN reader converted. :D

Check it out, join our Slack, and stay tuned. jank's still under heavy development right now, and not ready for AOC this year, but it's getting close. The blog is the best way to watch the progress: https://jank-lang.org/blog/

Thanks for taking the time to comment!

pharmakom · 2 years ago

Immutable structures cannot be cyclical. Maybe I’m missing something?

Exactly as nlitened mentioned, Clojure has a few difference reference types, which are boxes that hold pointers to more immutable values. The values within those boxes can be changed transactionally. This allows for cyclical references.

nlitened · 2 years ago

In Clojure, the data structures may usually contain references to mutable cells, which in turn may form cycles.

NelsonMinar · 2 years ago

What I love about the Boehm collector is the hack that enables it to work in a language with no runtime typing like C. It literally guesses what's a pointer. In C pointers are basically integers but they're unusual numbers, often quite large, and you know the entire set of valid numbers if you control malloc(). Surprisingly this worked even in 32 bit architectures.

chubot · 2 years ago

I don't see that this is something to "love", and what you describe doesn't capture the full extent of the approximation.

What scared me off of Boehm GC is that there a bazillion config options -- e.g. it requires tuning and blacklisting to avoid memory leaks -- and probably even correctness because register scanning isn't portable.

The codebase is large and a bit scary, and there are seemingly lots of ways to use it poorly. People kept saying it's "drop in", but that seems like a myth.

At least personally it doesn't seem like the kind of thing I want to debug.

The opinions on Boehm generally seem very split -- a bunch of bad stories resulting in ripping it out of the codebase, and also people who think it's surprisingly good, and codebases that have used it for a very long time.

I think the Nix evaluator uses it and is carrying around patches, and Guile Scheme uses it. Though I think there is some desire to move away from it in both cases. But again people have used it for a very long time, which isn't nothing!

---

e.g. https://www.hboehm.info/gc/gcdescr.html

Recent versions of the collector use an approximate best fit algorithm by keeping free lists for several large block sizes. The actual implementation of GC_allochblk is significantly complicated by black-listing issues (see below).

...

The collector implements black-listing of pages, as described in Boehm, ``Space Efficient Conservative Collection'', PLDI '93, also available here.

During the mark phase, the collector tracks ``near misses'', i.e. attempts to follow a ``pointer'' to just outside the garbage-collected heap, or to a currently unallocated page inside the heap. Pages that have been the targets of such near misses are likely to be the targets of misidentified ``pointers'' in the future. To minimize the future damage caused by such misidentifications they will be allocated only to small pointerfree objects.

The collector understands two different kinds of black-listing. A page may be black listed for interior pointer references (GC_add_to_black_list_stack), if it was the target of a near miss from a location that requires interior pointer recognition, e.g. the stack, or the heap if GC_all_interior_pointers is set. In this case, we also avoid allocating large blocks that include this page.

"love" in the sense of someone loving their snaggletooth mutt. What's remarkable to me is despite the obviously terrible idea at the core of it, it seems to work pretty well for some practical use. (IIRC it was at the core of GNU Java as well, and I could have sworn the earliest Sun JVMs used it.)

I don't doubt it caused all sorts of problems too. It's just an interesting reminder that sometimes a really wacky hack can be pretty useful.

matheusmoreira · 2 years ago

> register scanning isn't portable

Certainly not but it wasn't particularly hard to implement either, I managed to do it with some simple inline assembly. Here's my programming language's register spilling code:

https://github.com/lone-lang/lone/blob/master/architecture/x...

https://github.com/lone-lang/lone/blob/master/architecture/a...

It walks the entire native stack and marks every pointer it finds! Handles pointers in registers by spilling them on the stack and just letting the stack scanner find them like all the others. It guesses pointers but in a way that makes false negatives impossible.

I assume everything must be properly aligned to work. Interestingly, it uses jmpbuf to spill the registers. There's also a compiler builtin to get the current stack frame which can be used to obtain the stack root if called at the beginning of the program. Really neat, and educational.

rurban · 2 years ago

I have a library which has an extremely slow free, around 2m for large files, because of unnaturally scattered allocation patterns, but this old conservative GC didn't help at all. It was about 40% slower with libgc. mimalloc was a bit better. Best would be an arena allocator without free, or at least a properly fast GC, like mps https://github.com/Ravenbrook/mps, but this would be too much work.

CyberDildonics · 2 years ago

If you have a program that takes 2 minutes to free allocations, that implies 100 million to 1 billion small allocations. An arena is not the answer, a normal data structure is (most likely a vector). I'm not sure an arena is ever really the answer, because it seems like a poor work around to multiple small allocations, which shouldn't happen in the first place.

Maybe that time is coming from something even more ridiculous like scanning an entire linked list to find each element, but if you can open up that library there are without a doubt terrible memory allocation techniques that should never happen going on.

naasking · 2 years ago

> An arena is not the answer, a normal data structure is (most likely a vector). I'm not sure an arena is ever really the answer, because it seems like a poor work around to multiple small allocations, which shouldn't happen in the first place.

The only way I can make sense of this claim is if you're saying the items he needs to operate on should be inlined into a vector. But that's basically just an arena.

shrubble · 2 years ago

Bjarne Stroustrup's view: https://www.stroustrup.com/bs_faq.html#garbage-collection

m4nu3l · 2 years ago

I once wrote a non-conservative generational GC for C++ just as an exercise, with the constraint that I could only use standard C++ (it was in C++17).

It worked based on a template type I called "gc_ptr<T>". So you could create one of these with the function template "make_gc(...)". This was the only way you could allocate a new garbage-collected object.

"make_gc" would check if a type descriptor for the type was already initialized and if not it would allocate a new one. It would then set a thread_local with the descriptor and the address of the current object being created.

Then it would call the object constructor. If the type being instantiated had gc_ptr members, these would be initialized by their constructor. The gc_ptr constructor would, in turn, check the descriptor of the parent object and add a record to it representing the member. The record would store the member offset calculated from its address and the one of the parent object.

Garbage-collected objects would also use reference counting for gc_ptr(s) on the stack or inside objects that were not garbage-collected. It's easy to know if a gc_ptr is being constructed inside a garbage-collected object or not: If it is, then the code is also executing inside make_gc, so I just need a thread_local flag.

For the mark and sweep, I would use an atomic flag to mark the GC as running. If the flag was set, then all gc_ptr would stop the thread (by waiting on a mutex) as soon as some code would try to swap/reassign them. This means that code that didn't touch them would be kept running during garbage collection.

I would start marking objects from those allocated in the custom arena I used that had a reference count different from zero.

I wanted to add containers that could allocate contiguous GC'ed objects by wrapping the standard containers but never finished that. GC'ed containers of gc_ptr(s) just worked fine.

I just re-ran the benchmarks I wrote to compare it against the Boehm-Demser-Weiser GC. It is quite faster in doing the mark-and-sweep (up to an order of magnitude faster with thousands of objects and a bit less than 100 MBs of GC'ed memory in total).

However, because of the atomic check, it was slower in swapping pointers by roughly an order of magnitude.

I never used it in any project.

frutiger · 2 years ago

Did you ever publish this library?

Animats · 2 years ago

> In our experience, the only examples we have found of a failure with the current collector, even in multi-threaded code, were contrived.

I once found a real incompatibilty. This collector assumes that a pointer to a block must point to somewhere in the block, or, in later versions, at least somewhere really close.

The original Numerical Recipes in C was a conversion of Numerical Recipes in FORTRAN. FORTRAN arrays start at 1. So, NR in C had an array allocator which returned an array address computed such that subscripts starting at 1 would work.[1] If you allocated a 100x100 array of 4-byte f32 values, the array address would be offset downward by (100+1)*4 bytes. That would not look like a pointer to the array to the garbage collector.

By the third edition, in 1986, the Numerical Recipes people had converted to C++, dropped FORTRAN array compatibility, and the problem went away.

[1] http://s3.amazonaws.com/nrbook.com/book_C210.html page 19.

itsphilos · 2 years ago