When I worked at Discord, we used BEAM hot code loading pretty extensively, built a bunch of tooling around it to apply and track hot-patches to nodes (which in turn could update the code on >100M processes in the system.) It allowed us to deploy hot-fixes in minutes (full tilt deploy could complete in a matter of seconds) to our stateful real-time system, rather than the usual ~hour long deploy cycle. We generally only used it for "emergency" updates though.
The tooling let us patch multiple modules at a time; it basically wrapped `:rpc.call/4` and `Code.eval_string/1` to propagate the update across the cluster, which is to say the hot-patch was deployed entirely over Erlang's built-in distribution.
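For readers curious what that looks like in practice, here's a minimal sketch of pushing a source-level hot-patch over distribution with exactly those two calls. The module and file names are made up, and the real tooling also tracked which patches were applied where:

```elixir
defmodule HotPatch do
  @moduledoc "Minimal sketch of cluster-wide hot patching over Erlang distribution."

  # Reads an Elixir source file and evaluates it on this node and every
  # connected node; evaluating a `defmodule` compiles and loads the module
  # into that node's code server.
  def apply_everywhere(path) do
    source = File.read!(path)

    for node <- [Node.self() | Node.list()] do
      # :rpc.call/4 runs the MFA on the remote node over distribution.
      {node, :rpc.call(node, Code, :eval_string, [source])}
    end
  end
end

# Usage from any node in the cluster (file name is hypothetical):
# HotPatch.apply_everywhere("patches/fix_presence.ex")
```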
This matches my experience. I spent a decade operating Erlang clusters, and hot code upgrades are a superpower for debugging a whole class of hard-to-track bugs. Without tracking of cluster state, though, they can be their own footgun when a hotpatch gets unpatched during a regular code deploy.
As for relups, I once tried starting a project to make them easier but eventually decided that the number of bazookas pointed at each and every toe made them basically a non-starter for anything that isn't trivial. And if it's trivial, it was already covered by nl-style tooling (network load: send a local module to all nodes in the cluster and hot load it).
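(For reference, the `nl` being described is `c:nl/1` from OTP's stdlib, also callable from IEx. A tiny hedged example with a made-up module name:)

```elixir
# Push the locally available object code for a module to every connected node.
# Under the hood this is roughly code:get_object_code/1 plus
# rpc:eval_everywhere(code, load_binary, ...), so the module's .beam must be
# findable on the local node.
:c.nl(MyApp.Worker)
```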
Where is the security problem? All code commits and builds can still be signed. All of this is just a more efficient way of deploying changes without dropping existing connections.
Are you suggesting that hot code replacement is somehow an attack vector?
Ericsson has been using this method for decades on critical infrastructure to patch switches without dropping live calls/connections, and it works. No need to fear Erlang/BEAM.
Erlang distribution shouldn't be used between nodes that aren't in the same security boundary, it promises and provides no isolation whatsoever. It's kind of inherent to what it does: it makes a bunch of nodes behave as part of a single large system, so compromising one node compromises the system as a whole.
In a use case like clustering together identical web servers, or message broker nodes like RabbitMQ, I don't think it's all that scary. It gives an attacker easier lateral movement, but that doesn't gain them a whole lot if all the nodes have the same permissions, operate on the same data, etc.
Depending on risk appetite and latency requirements you can also isolate clusters at the deployment / datacenter level. RabbitMQ for instance uses Erlang clustering within a deployment (nodes physically close together, in the same or nearly the same configuration) and a separate federation protocol between clusters. This acts as a bulkhead to isolate problems and attackers.
Code reloading on embedded Nerves devices is fantastic. If you have non-trivial hardware or state you can just hot load new code to test a fix live. Great for integration testing.
I literally used hot code reloading a few weeks back to fix a 4-20 mA circuit on a new beta firmware while a client was watching in remote Colorado. Told them I was “fixing a config”. Tested it on our device and then they checked it out over a satellite PLC system. Then I built an updated Nerves firmware and uploaded it. Made the client happy!
Note that I've found that scp'ing the files to /tmp and then compiling them with Code.compile works better than copy-pasting into IEx. The error messages get proper line numbers.
It's also very simple to write a helper function that compiles all the code in /tmp and then deletes it. I've got a similar one in my project that scp's over any changed Elixir files. It's pretty nice.
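Something along these lines, as a sketch (my own naming, assuming the files land as plain .ex sources in /tmp):

```elixir
defmodule DevHelpers do
  # Compile and load every Elixir file that was scp'd into /tmp, then clean up.
  # Code.compile_file/1 keeps real file/line info, so errors and stacktraces
  # point at the right place, unlike code pasted into IEx.
  def load_tmp do
    for path <- Path.wildcard("/tmp/*.ex") do
      modules = Code.compile_file(path)
      File.rm!(path)
      {path, Enum.map(modules, fn {mod, _bin} -> mod end)}
    end
  end
end
```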
I used to work on a pretty big elixir project that had many clients with long lived connections that ran jobs that weren't easily resumable. Our company had a language agnostic deployment strategy based on docker, etc which meant we couldn't do hot code updates even though they would have saved our customers some headache.
Honestly I wish we had had the ability to do both. Sometimes a change is tricky enough that the argument "hot code updates are complicated and will cause more issues than they solve" really does hold, and maybe a deploy that forces everyone to reconnect is best for that sort of change. But oftentimes we'd deploy some mundane thing where you don't have to worry about upgrading state in a running GenServer or whatever, and it would have been nice to have minimal impact.
Obviously that's even more complexity piled onto the system, but every time I pushed some minor change and caused a retry that (in a perfect world at least...) didn't need to retry, I winced a bit.
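For anyone who hasn't seen it, the GenServer state upgrade being referred to is the `code_change/3` callback. A minimal, hypothetical sketch (note it only runs when the process is suspended and upgraded via `:sys.change_code`, e.g. as part of a relup, not on a plain module reload):

```elixir
defmodule Jobs.Worker do
  use GenServer

  @impl true
  def init(job), do: {:ok, %{job: job, retries: 0}}

  # Maps the old state shape onto the new one during a hot upgrade:
  # old state: %{job: job}  ->  new state: %{job: job, retries: 0}
  @impl true
  def code_change(_old_vsn, %{job: job}, _extra) do
    {:ok, %{job: job, retries: 0}}
  end
end
```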
I work in gaming and have experienced the opposite side of this: many of our services have more than one "kind" of update, each with its own caveats and gotchas, so it takes an expert in the whole system (really, almost ALL of our systems) to determine which kind would be the least impactful, assuming nothing goes wrong. Not only is there a lot of complexity and lost productivity in managing this process ("Are we sure this change is zero-downtime-able?" "Does it need a schema reload?" etc.), but we often get it wrong. The result is that, in practice, anything even remotely questionable gets done during a full downtime where we kick players out.
Having the option to restart just one little corner of the full system to minimize impact is great for customer experience (if we don't screw it up), but very much the opposite for developer experience: it's crippling to velocity to need to discuss each change with multiple experts and determine the appropriate type of release.
We use hot code upgrades on kosmi.io with great success.
It's absolute magic and allows for very rapid development and ease of deploying fixes and updates.
We do have to use Distillery though, and have had to resort to a bunch of custom glue bash scripts, which I wish were more standardized because it's such a killer feature.
Due to Elixir's efficiency, everything is running on a single node despite thousands of concurrent users, so we haven't really experienced how it handles multiple nodes.
Nerves and hot code reloading got me into erlang after I watched a demo of patching code on a flying drone ~8 years ago.
While I can't imagine hot reloading is super practical in production, it does highlight that erlang/beam/otp has great primitives for building reliable production systems.
I have told so many people about that video over the years. It was one of the most amazing demonstrations of a programming language/ecosystem that I've ever seen. Yet I've never been able to find it again.
You have to be very very very careful when preparing relups. The alternative on Linux is to launch an entire new server on the same machine, then transfer the session data and the open sockets to it through IPC. I once asked Joe Armstrong whether this was as good as relups and why Erlang went the relup route. I don't remember the exact words and don't want to misquote him, but he basically said it was fine, and Erlang went with relups and hot patching because transferring connections (I guess they would have been hardware interfaces rather than sockets) wasn't possible when they designed the hot patch system.
Hot patching is a bit unsatisfying because you are still running the same VM afterwards. With socket migration you can launch a new VM if you want to upgrade your Erlang version. I don't know of a way to do it with existing software, but in principle, using something like HAProxy with suitable extensions, it should be possible to even migrate connections across machines.
State migration is possible, and yeah, if you want to upgrade BEAM itself, state migration would be effective where hot loading is not. If your VM gets pretty big, you might need to be careful about memory usage, though; the donor VM is likely not going to shrink as fast as the heir VM grows. If you were so inclined, C does allow for hot loading too, but I think it'd be pretty hard to bend BEAM into something you could hot load to upgrade.
Migrating socket state across machines is possible too, but I don't think it's anywhere close to mainstream. HAProxy is a lovely tool, but I'm pretty sure I saw something in its documentation that explicitly states that sort of thing is out of scope; they want to deal with user level sockets.
Linux has a TCP Repair feature which can be used as part of socket migration; but you'll also need to do something to forward packets to the new destination. Could be arping for the address from a new machine, or something fancier that can switch proportionally or ??? there's lots of options, depending on your network.
As much as I'd love to have a use case for TCP migration, it's a little bit too esoteric for me ... reconnecting is best avoided when possible, but I'm counting TCP migration as non-possible for purposes of the rule of thumb.
TCP migration on the same machine is real and it's not that big a deal, if that's what you meant by TCP migration. Doing it across machines is at best a theoretical possibility, I would agree. I have been wanting to look into CRIU more carefully, but I believe it uses the TCP Repair feature you mentioned. I'm unfamiliar with it, though.
The saying in the Erlang crowd is that a non-distributed system can't be really reliable, since the power cord is a single point of failure. So a non-painful way to migrate across machines would be great. It just hasn't been important enough (I guess) to make anyone willing to deal with the technical obstacles.
I wonder whether other OS's have supported anything like that.
I worked on a phone switch (programmed in C) a long time ago that let you do both software and hardware upgrades (swap CPU boards etc.) while keeping connections intact, but the hardware was specially designed for that.
Self-followup/correction to avoid misattributing something to Joe that I'm not sure he said. I don't remember him specifically saying there were technical obstacles to migrating connections from one BEAM to another. My main question to him was whether socket migration (such as with SCM_RIGHTS messages on Linux) was a viable alternative to relups. I expected him to say relups were better because [whatever] but instead he said migration was perfectly fine. I do think starting a new BEAM in such situations fits fine with the Erlang spirit of restarting crashed processes so that you start in a known state, rather than trying to recover inside the process.
Seems like they deploy Elixir on embedded Linux. The embedded Linux distro is Nerves which replaces systemd and boots to the BEAM VM instead as process 1, putting Elixir as close to the metal as they can.
I know next to nothing about any of the above (the assumption being I'm fool enough to try to simplify anyway), plus I know I've misused the concepts I wrote, but that's my point, so read the article. All simplifications are salads.
Don't worry - this is an accurate and concise summary of what Nerves is.
What I would add is: you do not have bash (or sh, or any other "conventional" shell). The BEAM is your userspace.
You can SSH in, but what you end up with is an IEx prompt (the Elixir REPL). This is surprisingly fine once you get used to it (and once you've built a few helpers for your use case).
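A tiny example of the kind of helper that accumulates, entirely my own naming, assuming your firmware loads an `.iex.exs` at the prompt (a common Nerves customization):

```elixir
# .iex.exs — convenience helpers available at the remote IEx prompt
defmodule H do
  def uptime_s, do: :erlang.statistics(:wall_clock) |> elem(0) |> div(1000)
  def mem_bytes, do: :erlang.memory(:total)
  def apps, do: Application.started_applications() |> Enum.map(&elem(&1, 0))
end
```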
IMO Hot Code Updates are a tantalizing tool that can be useful at times but are extremely easy to foot-gun and have little support. I suspect that the reason why no one has built a nice, formal framework for organizing and fanning out hot code changes to erlang nodes is that it's very hard to do well, involves making some educated guesses about the halting problem, and generally doesn't help you much unless you're already in a real bind.
Most of the benefits of hot code updates (with better understanding of the boundaries of changes) can be found through judicious rolling restarts that things like k8s make easier these days. Any time you have the capacity to hot patch code on a node, you probably have the capacity to hot patch the node's setup as well.
That said I think that someone could use the code reloading abilities of erlang to make a genuinely unparalleled production problem diagnostic toolkit - where you can take apart a problem as it is happening in real time. The same kinds of people who are excited about time traveling debugging should be excited about this imo.
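A fair bit of that toolkit already exists as OTP primitives you can drive from a remote IEx shell. A rough, hedged sketch (the registered name and state key are made up):

```elixir
pid = Process.whereis(MyApp.SessionRegistry)  # hypothetical registered process

:sys.get_state(pid)          # snapshot the live process state
:sys.get_status(pid)         # status report, including the current module
:sys.statistics(pid, true)   # start collecting message/queue statistics

# Surgically patch the live state (use with the same care as hot loading code):
:sys.replace_state(pid, fn state -> %{state | overload_threshold: 500} end)
```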
This and everything else said in this thread sounds so much like the PHP+FTP workflow. It's so good.