Readit News
AkshitGarg · 9 months ago
I feel this benchmark compares apples to oranges in some cases.

For example, for node, the author puts a million promises into the runtime event loop and uses `Promise.all` to wait for them all.

This is very different from, say, the Go version, where the author creates a million goroutines and defers `wg.Done()` in each.

While this might be the idiomatic way of doing concurrency in the respective languages, it does not account for how goroutines are fundamentally different from promises, and how the runtimes do things differently. For JS, there's a single event loop. Counting the JS execution thread, the event loop thread, and whatever else the runtime uses for async I/O, the execution model is fundamentally different from Go's. Go (unless `GOMAXPROCS` is overridden) spawns an OS thread for every logical CPU your machine has, and then uses a userspace scheduler to distribute goroutines across those threads. It may spawn more OS threads to account for threads blocked in syscalls, although I don't think the runtime will spawn extra threads in this case.

It also depends on what the "concurrent tasks" (I know, concurrency != parallelism) are. Tasks such as reading a file or doing a network call are better done with something like promises, but CPU-bound tasks are better done with goroutines or Node worker_threads. It would be interesting to see how the memory usage changes when doing async I/O vs CPU-bound tasks concurrently in different languages.

n2d4 · 9 months ago
Actually, I think this benchmark did the right thing, that I wish more benchmarks would do. I'm much less interested in what the differences between compilers are than in what the actual output will be if I ask a professional Go or Node.js dev to solve the same task. (TBF, it would've been better if the task benchmarked was something useful, eg. handling an HTTP request.)

Go heavily encourages a certain kind of programming; JavaScript heavily encourages a different kind; and the article does a great job at showing what the consequences are.

rtpg · 9 months ago
But you wouldn't call a million tasks with `Promise.all` in Node, right? That's just not a thing that one does.

Instead, there's usually going to be some queue outside the VM that will leave you with _some_ sort of chunking and otherwise working in smaller, more manageable bits (that might, incidentally, be shaped in ways that the VM can handle in interesting ways).

It's definitely true that the "idiomatic" way of handling things is worth going into, but if part of your synthetic benchmark involves doing something quite out of the ordinary, it feels suspicious.

I generally agree that a "real" benchmark here would be nice. It would be interesting if someone could come up with the "minimum viable non-trivial business logic" that people could use for these benchmarks (perhaps coupled with automation tooling to run the benchmarks)

YetAnotherNick · 9 months ago
The fundamental problem is that there are two kinds of sleep function: one that actually sleeps, and one that is really a timer that calls a callback after the desired interval. A Promise is just syntactic sugar on top of the second kind. Go could certainly call another function after the desired interval using `time.Timer`.

I think a better comparison would be wasting CPU for 10 seconds instead of sleeping.

jonathanstrange · 9 months ago
No professional Go programmer would spawn 1M goroutines unless they're sure they have the memory for it (and even then, only if benchmarks indicate it, which is unlikely). Goroutines have a minimum stack size of 2 KiB to 8 KiB depending on the platform. You'd use a work-stealing approach with a reasonable number of goroutines instead. How many is reasonable needs to be tested, because it depends on how long each goroutine spends waiting for I/O or sleeping.
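A minimal sketch of that bounded-pool idea (a shared work queue rather than true work stealing, but the memory argument is the same; `runPool` is a made-up name):

```go
package main

import (
	"fmt"
	"sync"
)

// runPool processes numTasks tasks with a fixed number of worker
// goroutines, instead of spawning one goroutine per task.
func runPool(numTasks, numWorkers int, task func(id int)) {
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				task(id)
			}
		}()
	}
	for i := 0; i < numTasks; i++ {
		jobs <- i
	}
	close(jobs) // workers exit once the queue drains
	wg.Wait()
}

func main() {
	done := 0
	runPool(1000, 1, func(int) { done++ }) // single worker: no data race on done
	fmt.Println(done) // 1000
}
```

Memory is then bounded by `numWorkers` stacks plus the queue, regardless of how many tasks you push through it.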

But I can go further than that: No professional programmer should run 1M concurrent tasks on an ordinary CPU no matter which language because it makes no sense if the CPU has several orders of magnitudes less cores. The tasks are not going to run in parallel anyway.

Quothling · 9 months ago
> Go heavily encourages a certain kind of programming;

True, but it really doesn't encourage you to run 1M goroutines with the standard memory settings. Though it's probably fair to run Go wastefully when you're comparing it to `Promise.all`.

SPascareli13 · 9 months ago
As far as I know there is no way to do Promise-like async in Go; you HAVE to create a goroutine for each concurrent async task. If this is really the case then I believe the submission is valid.

But I do think that spawning a goroutine just to do a non-blocking task and get its return is kinda wasteful.

n2d4 · 9 months ago
You could in theory create your own event loop and then get the exact same behaviour as Promises in Go, but you probably shouldn't. Goroutines are the way to do this in Go, and it wouldn't be useful to benchmark code that would never be written in real life.
threeseed · 9 months ago
The requirement is to run 1 million concurrent tasks.

Of course each language will have a different way of achieving this task each of which will have their unique pros/cons. That's why we have these different languages to begin with.

jakewins · 9 months ago
The accounting here is weird though; Go isn’t using that RAM, it’s expecting the application to. The reason that doesn’t happen is that this is a micro benchmark that produces no useful work.

The way the results are presented, a reader may think the Go memory usage is the same kind of fixed overhead as the others' - boilerplate, ticket-to-play - and then the Go usage sounds super high.

But they are not the same; that memory is reserved in anticipation of a real-world program using it.

lmm · 9 months ago
> The requirement is to run 1 million concurrent tasks.

That's not a real requirement though. No business actually needs to run 1 million concurrent tasks with no concern for what's in them.

gleenn · 9 months ago
Also, for Java, Virtual Threads are a very new feature (Java 21 IIRC or somewhere around there). OS threads have been around for decades. As a heavy JVM user it would have been nice to actually see those both broken out to compare as well!
codetiger · 9 months ago
The original benchmark had the comparison between Java thread and Java Virtual thread. https://pkolaczk.github.io/memory-consumption-of-async/
xargon7 · 9 months ago
There's a difference between "running a task that waits for 10 seconds" and "scheduling a wakeup in 10 seconds".

The code for several of the low-memory-usage languages does the second, while the high-memory-usage results come from code that does the first. For example, on my machine the article's Go code uses 2.5GB of memory, but the following code uses only 124MB. That difference is in line with the Rust results.

  package main
  
  import (
    "os"
    "strconv"
    "sync"
    "time"
  )
  
  func main() {
    numRoutines, _ := strconv.Atoi(os.Args[1])
    var wg sync.WaitGroup
    for i := 0; i < numRoutines; i++ {
      wg.Add(1)
      time.AfterFunc(10*time.Second, wg.Done)
    }
    wg.Wait()
  }

mrighele · 9 months ago
I agree with you. Even something as simple as a loop like (pseudocode)

for (n=0;n<10;n++) { sleep(1 second); }

Changes the results quite a bit: for some reason Java uses a _lot_ more memory and takes longer (~20 seconds), C# uses more than 1 GB of memory, while Python struggles with just scheduling all those tasks and takes more than a minute (besides taking more memory). Node.js seems unfazed by this change.

I think this would be a more reasonable benchmark

neonsunset · 9 months ago
Indeed, looping over a Task.Delay likely causes a lot of churn in timer queues - that's 10M timers allocated and scheduled! If it is replaced with 'PeriodicTimer', the end result becomes more reasonable.

This (AOT-compiled) F# implementation peaks at 566 MB with WKS GC and 509 MB with SRV GC:

    open System
    open System.Threading
    open System.Threading.Tasks

    let argv = Environment.GetCommandLineArgs()

    [1..int argv[1]]
    |> Seq.map (fun _ ->
        task {
            let timer = PeriodicTimer(TimeSpan.FromSeconds 1.0)
            let mutable count = 10
            while! timer.WaitForNextTickAsync() do
                count <- count - 1
                if count = 0 then timer.Dispose()
        } :> Task)
    |> Task.WaitAll

To Go's credit, it remains at a consistent 2.53 GB and consumes quite a bit less CPU.

We're really spoiled for choice these days in compiled languages. It takes 1M coroutines to push the runtimes, and even at 100k - already far more than regular applications would see - the impact is easily tolerable. At 100k .NET consumes ~57 MB and Go consumes ~264 MB (and wins at CPU by up to 2x).

neonsunset · 9 months ago
Spawning a periodically waking up Task in .NET (say every 250ms) that performs work like sending out a network request would retain comparable memory usage (in terms of async overhead itself).

Even at 100k tasks the bottleneck is going to be the network stack (sending outgoing 400k RPS takes a lot of CPU and syscall overhead, even with SocketAsyncEngine!).

Doing so in Go would require either spawning Goroutines, or performing scheduling by hand or through some form of aggregation over channel readers. Something that Tasks make immediately available.

The concurrency primitive overhead becomes more important if you want to quickly interleave multiple operations at once. In .NET you simply do not await them at callsite until you need their result later - this post showcases how low the overhead of doing so is.

piterrro · 9 months ago
I don't know what a fair way to do this is for all the languages listed in the benchmark, but for Go vs. Node the only fair way would be to use a single goroutine to schedule timers and another one to pick them up when they tick. This way we don't create a huge number of stacks, and it's much more comparable to what you're really doing in Node.

Consider the following code:

    package main

    import (
      "os"
      "strconv"
      "time"
    )

    func main() {
      numTimers, _ := strconv.Atoi(os.Args[1])

      timerChan := make(chan struct{})

      // Goroutine 1: Schedule timers
      go func() {
        for i := 0; i < numTimers; i++ {
          timer := time.NewTimer(10 * time.Second)
          go func(t *time.Timer) {
            <-t.C
            timerChan <- struct{}{}
          }(timer)
        }
      }()

      // Goroutine 2: Receive and process timer signals
      for i := 0; i < numTimers; i++ {
        <-timerChan
      }
    }

Also for Node it's weird not to have Bun and Deno included. I suppose you can have other runtimes for other languages too.

In the end I think this benchmark is comparing different things and not really useful for anything...

theamk · 9 months ago
> high number of concurrent tasks can consume a significant amount of memory

Note the absolute numbers here: in the worst case, 1M tasks consumed 2.7 GB of RAM, i.e. ~2700 bytes of overhead per task. That'd still fit in the cheapest server with room to spare.

My conclusion would be the opposite: as long as per-task data is more than a few KB, the memory overhead of the task scheduler is negligible.

pkulak · 9 months ago
Except it’s more than that. Go and Java maintain a stack for every virtual thread. They are clever about it, but it’s very possible that doing anything more than a sleep would have blown up memory on those two systems.
bilbo0s · 9 months ago
I have a sneaking suspicion that if you do anything other than the sleep during these 1 million tasks, you'll blow up memory on all of these systems.

That's kind of the Achilles' heel of the benchmark. Any business needing to spawn 1 million tasks certainly wants to do something with them. It's the "do something with them" part that usually leads to difficulties for these things, not really the "spawn a million tasks" part.

cperciva · 9 months ago
This depends a lot on how you define "concurrent tasks", but the article provides a definition:

> Let's launch N concurrent tasks, where each task waits for 10 seconds and then the program exits after all tasks finish. The number of tasks is controlled by the command line argument.

Leaving aside semantics like "since the tasks aren't specified as doing anything with side effects, the compiler can remove them as dead code", all you really need here is a timer and a continuation for each "task" -- i.e. 24 bytes on most platforms. Allowing for allocation overhead and a data structure to manage all the timers efficiently, you might use as much as double that; with some tricks (e.g. function pointer compression) you could get it down to half that.

Eyeballing the graph, it looks like the winner is around 200MB for 1M concurrent tasks, so about 4x worse than a reasonably efficient but not heavily optimized implementation would be.

I have no idea what Go is doing to get 2500 bytes per task.

masklinn · 9 months ago
> I have no idea what Go is doing to get 2500 bytes per task.

TFA creates a goroutine (green thread) for each task (using a waitgroup to synchronise them). IIRC goroutines default to 2k stacks, so that’s about right.

One could argue it’s not fair and it should be timers, which would be much lighter. There’s no “efficient wait” for them, but that’s essentially the same as the appendix Rust program.

jakewins · 9 months ago
Fair or not, it’s a strange way to count - Go isn’t using that RAM. It’s preallocating it because any real world program will.
cperciva · 9 months ago
Aha, 2k stacks. I figured that stacks would be page size (or more) so 2500 seemed both too small for the thread to have a stack and too large for it to not have a stack.

2k stacks are an interesting design choice though... presumably they're packed, in which case stack overflow is a serious concern. Most threading systems will do something like allocating a single page for the stack but reserving 31 guard pages in case it needs to grow.

Mawr · 9 months ago
> Now Go loses by over 13 times to the winner. It also loses by over 2 times to Java, which contradicts the general perception of the JVM being a memory hog and Go being lightweight.

Well, if it isn't the classic unwavering confidence that an artificial "hello world"-like benchmark is in any way representative of real world programs.

phillipcarter · 9 months ago
Yes, but also, languages like Java and C# have caught up a great deal over the past 10 years and run incredibly smoothly. Most peoples' perception of them being slow is really just from legacy tech that they encountered a long time ago, or (oof) being exposed to some terrible piece of .NET Framework code that's still running on an underprovisioned IIS server.
fulafel · 9 months ago
Also of course Java and C# overwhelmingly use threads and not async for this kind of thing.
neonsunset · 9 months ago
Very few .NET projects rely on explicit threading. There is almost never a reason to do so when tasks can be used instead, like in this benchmark.
blixt · 9 months ago
While it’s nice to compare languages with simple idiomatic code I think it’s unfair to developers to show them the performance of an entirely empty function body and graphs with bars that focus on only one variable. It paints a picture that you can safely pick language X because it had the smaller bar.

I urge anyone making decisions from looking at these graphs to run this benchmark themselves and add two things:

- Add at least the most minimal real world task inside of these function bodies to get a better feel for how the languages use memory

- Measure the duration in addition to the memory to get a feel for the difference in scheduling between the languages

tossandthrow · 9 months ago
This urge is as old as statistics. And I dare say that most people, after reading the article in question, are well prepared to use the results for what they are.
blixt · 9 months ago
I can’t say I share your optimism. I’ve seen plenty of developers point to graphs like these as a reason for why they picked a language or framework for a problem. And it comes down to the benchmark how good of a proxy it actually is for such problems. I just hope that with enough feedback the author would consider making the benchmark more nuanced to paint a picture of why these differences in languages exist (as opposed to saying which languages “lose” or “win”).
sfn42 · 9 months ago
And by use the results for what they are you mean ignore them because they are completely useless?
JyB · 9 months ago
I’m still baffled that some people are bold enough to voluntarily post these kinds of mostly useless “benchmarks” that will inevitably be riddled with errors. I don’t know what pushes them. In the end you look like a clown more often than not.
wiseowise · 9 months ago
The fastest way to learn truth is by posting wrong thing on the internet, or something.
enginoid · 9 months ago
Trying things casually out of curiosity isn’t harmful. I expect people understand that these kinds of blog posts aren’t rigorous science to draw foundational conclusions from.

And the errors are a feature — I learn the most from the errata!