> This can lead to scalability issues in large clusters, as the number of connections that each node needs to maintain grows quadratically with the number of nodes in the cluster.
No, the total number of dist connections grows quadratically with the number of nodes, but the number of dist connections each node makes grows linearly: with N nodes, each node holds N-1 connections, while the cluster as a whole holds N(N-1)/2.
> Not only that, in order to keep the cluster connected, each node periodically sends heartbeat messages to every other node in the cluster.
IIRC, heartbeats are once every 30 seconds by default.
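If you'd rather check than rely on memory, the relevant knob is the kernel parameter net_ticktime; from a live shell (the output is just whatever your node is configured with):

    1> net_kernel:get_net_ticktime().
    60
    %% ticks go out a few times per tick interval on an otherwise idle
    %% connection, and a peer that stays silent for a whole interval is
    %% considered down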
> This can lead to a lot of network traffic in large clusters, which can put a strain on the network.
Let's say I'm right about 30 seconds between heartbeats, and you've got 1000 nodes. Every 30 seconds each node sends out 999 heartbeats (which almost certainly fit in a single tcp packet each, maybe less if they're piggybacking on real data exchanges). That's 999,000 packets every second, or 33k pps across your whole cluster. For reference, GigE line rate with full 1500 mtu packets is 80k pps. If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.
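Back-of-the-envelope, assuming the 30-second interval above:

    1> Heartbeats = 1000 * 999.   % one per peer, per node, per interval
    999000
    2> Heartbeats / 30.           % packets per second across the cluster
    33300.0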
> Historically, a "large" cluster in Erlang was considered to be around 50-100 nodes. This may have changed in recent years, but it's still something to be aware of when designing distributed Erlang systems.
I don't have recent numbers, but Rick Reed's presentation at Erlang Factory in 2014 shows a dist cluster with 400 nodes. I'm pretty sure I saw 1000+ node clusters too. I left WhatsApp in 2019, and any public presentations from WA are less about raw scale, because it's passé.
Really, 1000 dist connections is nothing when you're managing 500k client connections. Dist connections weren't even a big deal when we went to smaller nodes in FB.
It's good to have a solid backend network, and to try to bias towards fewer, larger nodes rather than more, smaller nodes. If you want to play with large-scale dist and spin up 1000 low-CPU, low-memory VMs, you might have some trouble. It makes sense to start with small nodes and whatever number makes you comfortable for availability, and then, when you run into limits, reach for bigger nodes until you get to the point where adding nodes is more cost effective: WA ran dual Xeon 2690 servers before the move to FB infra; Facebook had better economics with smaller single Xeon D nodes; I dunno what makes sense today, maybe a single-socket Epyc?
> That's 999,000 packets every second, or 33k pps across your whole cluster. For reference, GigE line rate with full 1500 mtu packets is 80k pps. If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.
Using up almost half of your pps every 30 seconds for cluster maintenance certainly seems like it's more than "not a big deal", no?
(Oops, I meant to say 999k packets every 30 seconds. Thanks everyone for running with the pps number)
If your switching fabric can only deal with 1Gbps, yes, you've used it halfway up with heartbeats. But if your network is 1x 48-port 1G switch and 44x 24-port 1G switches, you won't bottleneck on heartbeats, because that spine switch should be able to simultaneously send and receive at line rate on all ports, which is plenty of bandwidth. You might well bottleneck on other transmissions, but the nice thing about dist heartbeats is that, on a connection, each node sends heartbeats on a timer and will close the connection if it doesn't see a heartbeat within some timeframe; it's a requirement for progress, not a requirement for a timely response, so you can end up with epic round-trip times for net_adm:ping ... I've seen on the order of an hour once, over a long-distance dist connection with an unexpected bandwidth constraint.
It would probably be a lot more comfortable if your spine switch was 10G and your node switches had a 10G uplink, and you may want to consider LACP and doubling up all the connections. You might also want to consider other topologies, but this is just an illustration.
> If you actually have 1000 nodes worth of work, the heartbeats are not at all a big deal.
I think you're missing the fact that the heartbeats will be combined with existing packets. Hence the quoted bit. If you've got 1000 nodes, they should be doing something with that network such that an extra 50 bytes (or so) every 30s would not be an issue.
If you're at the point where you decided the 1000 nodes are required, you should probably have already considered the sensible alternatives first, and concluded this was somehow better.
The network saturation is just a necessary cost of running such a massive cluster.
I really have no idea what kind of system would require 1000 nodes that couldn't be replaced by 100 nodes that are 10x larger instead. And at that point, you should probably be thinking of ways to scale the network itself as well.
That's for 1000 nodes all sharing a single 1Gbps connection and sending 1500-byte heartbeats for some reason. Make them 150 bytes (more realistic) and now it's almost half the throughput of the 100Mbps router that for some reason you are sending every single packet through, for 1000 nodes.
I doubt these are even sent if there's actual traffic going through the network. I mean, why do you need to ping nodes, if they are already talking to you?
Wouldn't you have a lot more than one gigabit connection? I'm struggling to imagine the 1000-node cluster layout that sends all intra-cluster traffic through a single cable.
ahoy!! thanks for spotting the issue! I wrote the post in a stream of consciousness after a long day. I'll make that edit and call it out!
the statement about what historically constitutes a large erlang cluster was an anecdote told to me by Francesco Cesarini during lunch a few years ago — I'm not actually sure of the time frame (or my memory!)
likewise I'll update the post to reflect that! thanks ( ◜ ‿ ◝ ) ♡
> WA ran dual xeon 2690 servers before the move to FB infra; facebook had better economics with smaller single Xeon D nodes; I dunno what makes sense today, maybe a single socket Epyc?
that CPU is 8c/16t [1] ... having two of them in a system would mean 16c/32t.
AMD recently launched a 192c/384t monster CPU [2], and having two of them would mean having 384c/768t.
So a 100-node cluster today, powered by those AMD CPUs, would give you the same core count as a 1200-node cluster at the time.
references:
[1]: https://www.intel.com/content/www/us/en/products/sku/64596/i...
[2]: https://www.youtube.com/watch?v=S-NbCPEgP1A
The primitives for sending messages across a "cluster" are built into the language, yep. And lightweight processes with per-process garbage collection are magic for minimizing global GC pauses.
But all the hard work of making a distributed system work is not "batteries included" in Erlang. Back pressure on messages, etc. you end up having to build yourself.
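For example, the built-in part really is just this (a sketch; the node and process names are made up):

    -module(dist_sketch).
    -export([ask_remote/0]).

    %% 'stats@host2' and stats_collector are made-up names for illustration
    ask_remote() ->
        %% fire-and-forget send to a registered process on another node
        {stats_collector, 'stats@host2'} ! {sample, 42},
        %% or spawn work over there and wait for an answer; note there is
        %% no back pressure anywhere in this unless you add it yourself
        Self = self(),
        spawn('stats@host2', fun() -> Self ! {result, node()} end),
        receive
            {result, RemoteNode} -> RemoteNode
        after 5000 -> timeout
        end.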
We hit a limit at around 200 physical servers in a cluster back in 2015. Maybe it's gotten better since then. shrug.
As the author calls out, it was built for the 80s, when a million-dollar switch shouldn't fail. It isn't built with https://en.wikipedia.org/wiki/STONITH in mind, i.e. any node is trash; assume it will go away and maybe new ones will come back.
Speaking as someone who spent a lot of time in front of CRTs, this is NOT what an average CRT looks like. This would be a reason to send it to maintenance back then - it looks like breaking contacts or problems with the analog board.
A good CRT would show lines, but they'd be steady and not flicker (unless you self-inflict an interlaced display on a short phosphor screen). It might also show some color convergence issues, but that, again, is adjustable (or you'd send it to maintenance).
This looks like the kind of TV you wouldn't plug a VIC-20 into.
Uses @keyframes and text-shadow to try and mimic a CRT effect but makes the text unreadable (for me at least). The browser readability mode does work on the page though.
I think so too, and it works on my old LCD monitor. If I focus on it, I can see the subtle changes, but other than that, it does not make it any less readable for me.
OP, if you're reading this, the animation pegs my CPU. Relying on your readers to engage reader mode is going to turn away a lot of people who might otherwise enjoy your content.
If a person considers putting this type of effect on text reasonable, I really don't think they have anything of value to say in the text content anyways.
fn/n where n is parity of fn was confusing for me at first, but then I started to like the idea, esp. in dynamic languages it's helpful to have at least some sense of how things work by just looking at them.
this made me think about a universal function definition syntax that captures everything except implementation details of the function. something like:
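(a hypothetical notation, just to sketch the idea; this is not any existing language)

    %% hypothetical syntax, purely illustrative
    fn(a :: integer, b :: float, c :: T)
        -> boolean | [string] | [T] | throws
        uses foo, bar, d    %% taken from fn's context, not passed in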
which shows that the function fn receives a (integer), b (float), c (generic type T) and maybe returns something that's either a boolean or a list of strings or a list of T, or it throws and doesn't return anything. the function fn also makes use of foo and bar (other functions) and variable d which are not given to it as arguments but they exist in the "context" of fn.
Well, Erlang and Elixir both have a type checker, Dialyzer, which has a type specification syntax not too far from what you proposed, excluding the variable or function captures.
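Roughly the argument/return half of the idea, as a sketch of a Dialyzer spec (there is no equivalent for the captured foo/bar/d context):

    %% sketch of a spec for the hypothetical fn above; the captured
    %% context has no way to be declared
    -spec fn(A :: integer(), B :: float(), C :: T) ->
          boolean() | [string()] | [T] | no_return() when T :: term().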
minor nitpick, I think you meant arity. If it's a typo, sorry for the nitpick; if you had only ever heard it spoken out loud before: arity is the number of parameters (in the context of functions), parity is whether a number is even or odd
Looks similar to how languages with effect systems work. (Or appear to work, I haven't actually sat down with one yet.) They don't capture the exact functions, rather capturing the capabilities of said functions, which are the important bit anyway. The distinction between `print` vs `printline` doesn't matter too much; the fact you're doing `IO` does. It seems to be largely new languages trying this, but I believe there are a few addons for languages such as Haskell as well.
> fn/n where n is parity of fn was confusing for me at first, but then I started to like the idea, esp. in dynamic languages it's helpful to have at least some sense of how things work by just looking at them.
To be clear, this syntax is not just "helpful", it's semantically necessary given Erlang's compilation model.
In Erlang, foo/1 and foo/2 are separate functions; foo/2 could be just as well named bar/1 and it wouldn't make a difference to the compiler.
But on the other hand, `foo(X) when is_integer(X)` and `foo(X) when is_binary(X)` aren't separate functions — they're two clause-heads of the same function foo/1, and they get unified into a single function body during compilation.
So there's no way in Erlang for a (runtime) variable [i.e. a BEAM VM register] to hold a handle/reference to "the `foo(X) when is_integer(X)` part of foo/1" — as that isn't a thing that exists any more at runtime.
When you interrogate an Erlang module at runtime in the erl REPL for the functions it exports (`Module:module_info(exports)`), it gives you a list of {FunctionName, Arity} tuples — because those are the real "names" of the functions in the module; you need both the FunctionName and the Arity to uniquely reference/call the function. But you don't need any kind of type information; all the type information is lost from the "structure" of the function, becoming just validation logic inside the function body.
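Concretely (a minimal sketch; the module name m is made up):

    -module(m).
    -export([foo/1, foo/2]).

    %% two clause heads of the *same* function foo/1
    foo(X) when is_integer(X) -> {int, X};
    foo(X) when is_binary(X)  -> {bin, X}.

    %% a completely separate function that happens to share the name
    foo(X, Y) -> {X, Y}.

    %% in the shell:
    %% 1> m:module_info(exports).
    %% [{foo,1},{foo,2},{module_info,0},{module_info,1}]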
---
In theory, you could have compile-time type safety for function handles, with type erasure at runtime ala Java's runtime erasure of compile-time type parameters. I feel like Erlang is not the type of language to ever bother to add support for this — as it doesn't even accept or return function handles from most system functions, instead accepting or returning {FunctionName, Arity} tuples — but in theory it could.
> But Erlang itself, as a language, doesn't even have function-reference literals; you usually just pass {FunctionName, Arity} tuples around.
I don't really understand what a literal is, but isn't fun erlang:node/0 a literal? It operates the same way as 1, which I'm quite sure is a numeric literal:
Eshell V12.3.2.17 (abort with ^G)
1> F = fun erlang:node/0.
fun erlang:node/0
2> X = 1.
1
3> F.
fun erlang:node/0
4> X.
1
I set the variable to the thing, and then when I output the variable, I see the same value I set. AFAIK, 1 and fun erlang:node/0 must both be literals, as they behave the same; but I'm happy to learn otherwise.
I recommend taking a look at the various open source Riak applications too! Might not be updated to any sort of recent versions of erlang but was a great resource to me early on.
A single node erlang application would be one that doesn't use dist at all. Although, if it includes anything gen_event or similar, and it happens to be dist connected, unless it specifically checks, it will happily reply to remote Erlang processes.
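e.g. (using gen_server as the "or similar" case; the server and node names are made up):

    %% from any dist-connected node; my_server and 'app1@host2' are made up.
    %% the callback module needs no special code for this to work, which is
    %% exactly why it will happily reply to remote callers unless it checks.
    gen_server:call({my_server, 'app1@host2'}, get_state).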
my personal reality is that the majority of projects I've consulted on have seldom actually leveraged distributed erlang for anything. the concurrency part yes, clustering for the sake of availability or spreading load yes, but actually doing anything more complex than that has been the exception! ymmv tho!
Rock on, Erlang dude, rock on.
Elixir’s syntax for declaring type-specs can be found at https://hexdocs.pm/elixir/1.12/typespecs.html
https://koka-lang.github.io/koka/doc/index.html
https://www.unison-lang.org/docs/fundamentals/abilities/
https://dev.epicgames.com/documentation/en-us/uefn/verse-lan...
And a function returning one of three types sounds like a TERRIBLE idea.
Should have been an enum :)
Is there a list of projects that shows distributed erlang in action (taking advantage of its strengths and avoiding the pitfalls)?
distributed Erlang is like saying ATM machine