The case of a leaky goroutine

Thread leaks, which happen more frequently due to threaded async abstractions, such as the goroutine, are less often discussed than memory or CPU leaks, but are much more dangerous in multi-tenant container environments.

A thread leak can lock up your entire node, including all the control plane processes. A container spec doesn't provide an easy way to control thread/nproc/ulimit limits (you can still do it, but it's not straightforward), which in turns leaves pretty much every k8s deployment misconfigured and vulnerable to thread leaks.

kbolino · 2 years ago

Goroutines in a single process map onto a fixed number of threads. Even if you have goroutine leaks, you should not have thread leaks. Your program may deadlock or run out of memory, but it will not take the whole system down (at least, not in this way).

usrnm · 2 years ago

> Goroutines in a single process map onto a fixed number of threads

Not necessarilly true if you're using cgo

hedora · 2 years ago

My #1 complaint about Rust is that leaking a future is safe. It means the compiler can’t check for async coroutine leaks, and it breaks the borrow checker’s ability to say “nothing else has a reference to this any more”.

Anyway, we’re using golang for some stuff at work, and holy crap, I forgot how terrible it was to work in high level languages that don’t statically check for correct synchronization.

If C++-style concurrency is like a chainsaw, then golang concurrency is like a chainsaw in a bouncy castle.

masklinn · 2 years ago

> My #1 complaint about Rust is that leaking a future is safe.

There’s no other way given the leakpocalypse decision. You’d need an entirely new leak-proof language to fix that, and that means you need alternatives for Rc and Arc (or a way to prevent them creating a cycle).

favflam · 2 years ago

For what it is worth, I totally agree. I programmed a fairly large project only to find that goroutines were mysteriously blocking in extremely hard-to-debug subtle ways. So, I had to rewrite a huge chunk. I think goroutines are like Ackbar "It's a trap" trap.

I was liberal in usage with goroutines and channels. But after that experience, I decided to religiously track each goroutine in my head. If I reached the point of not being able to mentally map all goroutines running, I would cut out go. I started using callbacks more too to avoid the pernicious blocking I have experienced.

duped · 2 years ago

> My #1 complaint about Rust is that leaking a future is safe.

Do you mean futures that aren't polled to completion, tasks that aren't joined, or literal memory leaks that happen to own futures?

f_gergo · 2 years ago

What is a "correct synchronization"?

js2 · 2 years ago

Isn't this addressed by CPU requests/limits + pid limiting?

https://kubernetes.io/docs/concepts/policy/pid-limiting/

lxe · 2 years ago

It's just more annoying to set up and isn't as widely known, and doesn't work on a per-pod basis.

> PID limiting is a an important sibling to compute resource requests and limits. However, you specify it in a different way: rather than defining a Pod's resource limit in the .spec for a Pod, you configure the limit as a setting on the kubelet. Pod-defined PID limits are not currently supported.

People hate on Haskell async exceptions (with good reason), but one cool thing about them and the Haskell RTS is that you can almost [1] always cancel a thread from the outside. No need for the thread to cooperate like in Golang.

The entire `async` package is built on this. The `race` combinator is an especially cool application.

After doing a big project in Golang, I appreciated this more. We had our fair share of goroutine leaks.

[1] iirc, if the thread is not blocking on a syscall or allocating memory, it will not be yielding to the RTS.

treyd · 2 years ago

Rust's futures also get this through being poll-based. You cancel a future by dropping it. Futures compose really easily, so you can combine a bunch of futures in interesting trees of selects and joins and any that are not completed get cleaned up automatically and with little/no overhead when they go out of scope / when the task they're a part of completes. You don't think about them like separate chunks of work, you just think about them like types you can await on and yield a value, and the compiler flattens it all out into a state machine. All of the sync and composition combinators are implemented just in the traits/types of the tokio/futures libraries because the poll/waker abstraction is low level and versatile enough. As long as you don't go out of your way to write bad async code, there's no leaks for the same reasons there's no leaks in Rust code.

It feels a lot like the monadic composition of Haskell even if the means it achieves it are very different.

pdimitar · 2 years ago

> As long as you don't go out of your way to write bad async code

I've seen the light of async Rust and I believe (heh, sorry for the semi-flippant sarcastic remark here but I do genuinely love it when I do things right with async Rust) but I feel that writing bad async code is also not something that the compiler will actively dissuade you from. It goes to certain lengths but not quite far enough IMO.

I didn't want to do it but at one point I started reading how is async implemented and that actually lifted a big part of the mystical veil and helped me understand it better. Now if I can also completely internalize the lifetime semantics combined with async I'd be very proud of myself. (But it doesn't help that I am not working with Rust currently.)

atombender · 2 years ago

I wish Go recorded the timestamp of goroutine and let you access them.

An app I work on recently had a bug where goroutines would slowly build up over time. Turns out the bug is in the Growthbook SDK [1]. We can monitor the number of goroutines, but having a large number of goroutines waiting in the location that gets stuck is normal — we can only see such a problem over multiple days, in that the minimum value slowly goes up.

If Go could tell you the timestamp of the oldest goroutines as part of the pprof dump, we could have an alert, and it would work for any such leak.

[1] https://github.com/growthbook/growthbook-golang/pull/28

prosunpraiser · 2 years ago

Execution traces have a goroutine profile which outputs the count of goroutines as well. That can be used for an alert as well - though it would require parsing the trace output. They recently made some changes to give a structured API over trace data - maybe use that?

jmholla · 2 years ago

Language support would be great, but could you add logs that record the creating and destruction of goroutines, giving them a unique UUID so you can track which one's haven't exited?

Edit: Also, maybe the tool at this comment could've helped you? https://news.ycombinator.com/item?id=39817775

How would one log that? In our case the bug was in a goroutine started by a third-party library, so we don't have control over when it starts or exits.

I don't think Goleak would have helped here, because I believe it doesn't support concurrency. It's really designed to run in tests, not in production. It parses and searches stack traces, so it's not going to be performant.

mholt · 2 years ago

This article doesn't go into depth about capturing the profile to identify the leak, so if you want more instruction on one way to do that, I wrote an article recently called Profiling Caddy, which is geared toward Caddy but really works for any Go programs: https://caddyserver.com/docs/profiling

whateveracct · 2 years ago

shp0ngle · 2 years ago

Didn't Uber have some leaky goroutine detector? I vaguely remember seeing something like that, 5 years ago...

Ah yeah it's here.

https://github.com/uber-go/goleak

latchkey · 2 years ago

Uber also made something called fx, which is fantastic.

You don't have to use it, but when you do, it helps ensure that you organize your code in a way that becomes very easily testable. It enforces a modular approach to composing together golang services.

Being more easily testable helps prevent bugs, like these leaky goroutines.

Karrot_Kream · 2 years ago

`fx` is mostly just Dependency Injection in Go which has been a thing in Java forever.

I'm curious though, when do you reach for `fx` in a non-industrial project and when do you not? I still use the same patterns of separating out the implementation from the interface but I've been wiring in the dependencies by hand. I'm curious if folks reach for `fx` immediately or if it's something that requires thought to add. There's also Google's wire library [1] that does similar stuff but takes a compile time approach so it's a little easier to reason about if struct initialization screws up due to weird implicit things.

I still wire dependencies up by hand, but I'm curious what others do.

[1]: https://github.com/google/wire/tree/main

I have seen fx used in production and it was an unholy mess. I never wish this upon anyone. It makes Go into Java.

bakul · 2 years ago

If Go allowed something like "handle = go foo()", goroutine could be automatically terminated when handle goes out of scope or becomes dead and is garbage collected. You can also use handle to cancel a goroutine etc.

Go designers specifically avoided this model (of having a goroutine "id") for reasons I don't remember any more (may be to avoid making them heavier weight?) but this would be one way to stop leaky goroutines.

It's because just terminating a goroutine isn't safe with respect to I/O and defer chains.

I think Go has the internal plumbing to theoretically support this, though it might require inserting checks more often. Another way would be to make contexts first-class and automatically insert context checks even when not done (e.g. selects). And also all I/O has to be cancellable.

I suspect Go's designers prefers the current way in which cancellation is explicit.

> just terminating a goroutine isn't safe with respect to I/O and defer chains.

Agreed. The idea is to panic() if a goroutine has to be forcibly terminated due to GC, instead of a slow leak. Requires more thought though.

dagss · 2 years ago

The convention of passing ctx does almost the same thing though. Make a new context.WitCancel and pass it to the goroutine.

It just requires programmer cooperation, but as long as you pass ctx all through the stack down and handle err on the way back, it is not often you deal with it explicitly.

silisili · 2 years ago

It does, but feels extremely bolted on(because it was). To do properly , near every single function and method has to take a context, which gets repetitive. Then every single thing in the call stack needs to actually check for cancellation.

It feels like something similar should be baked in, and inherited from its parent by default(but overrideable). And a cancel would cancel the callstack. Would be nice to make this cleaner in go 2, I think.

ikiris · 2 years ago

If you actually code in go there, they expect you to handle it yourself though by not orphaning goroutines. Err groups or similar will take care of this. Basically it's people writing bad code because they can. There's already options to handle it available.

psnehanshu · 2 years ago

Just wondering if threads in Rust can suffer such problems?

Background: I am coming from the JS/TS/Node world, and have decided to jump onto a compiled language. I narrowed down my choices to Go and Rust and eventually decided to go with Rust, because it didn't use GC for memory management.

Filligree · 2 years ago

Less likely, but it certainly can. The problem isn't the garbage collector; it's the overall approach to threading. Rust has a different culture that makes problems like this probably less of an issue, but nothing stops you implementing a memory leak.

In fact, since it doesn't have a GC, you can trivially create a memory leak by creating a reference loop... though the ownership checker makes that in itself really difficult, and so it's again less likely to happen than it otherwise would be. At the cost of loops being hard to make even if you want them.

I have several production Rust projects under my belt and while I am nowhere near the level of most full-time Rust devs (I don't work with Rust full-time, it's opportunity based for me) I have noticed that it's very difficult to introduce such leaks with Rust.

It is possible to do it by using Arc / Rc and cyclical references (which the borrow checker makes infuriatingly difficult -- for good reason!) but you really have to go out of your way for it. And while there are real projects where you need such idioms I have found that you should not try to twist Rust's arm and just opt for something like arena allocator.

Thaxll · 2 years ago

Every language can leak memory and threads.

The explanation of the issue in ToDoneInterface really is not clear to me because of this:

> The defer close() seems to close well, but it’s on the wrong channel.

The `done` input channel is supposed to be closed by a caller, and the goroutine is closing the output channel, surely that's the point?

Now from what I know of go channels and understand of the code involved, the `done` channel may never get closed by the parents (and can never get closed at all if it's nil?), in which case the goroutine never receives a signal, never terminates, and leaks. But the explanation below the snippet confuses me completely.

And if that's that... what's the fix? Aside from not doing this sort of conversions? Just "git good scrub", try to make sure you don't rely on cancellation for progress, and hope you don't use raw background channels again, and don't forget to cancel your non-background channels?

subomi · 2 years ago

At this point, I think done chan should be an anti-pattern. Just use context for coordination and cancellation.

I get how a Context can be used for cancellation but how is it used for coordination?