Seriously cool. That also reminds me of DragonFlyBSD's process checkpointing feature that offers suspend to disk. In Linux world there were many attempts, but AFAIK nothing simple and complete enough. To be fair I don't know if DF's implementation is that either.
This article[0] on LWN seems to suggest that Linux has no kernel support for checkpoint/restore because it's hard to implement (no arguments there). But hypervisors support checkpoint/restore for virtual machines, e.g. ESXi VMotion and KVM live migration, so it seems like these technical problems are solvable. Indeed all the benefits of VM migration seem to also apply to process migration (load balancing, service uptime, etc).
IMO, a big reason why Linux doesn't support it is that the big tech companies, who drive a lot of the development and funding, found that it's just not worth it.
Tech moved away from mainframe model of keeping single processes up for as long as possible and isolating software from the failures.
Instead they embrace failure at the software layer, which (when executed correctly) gives you the same high level of availability, plus layers of protection against other issues (a node going down in an expected way looks the same no matter what the root cause is), and makes systems more maintainable and upgradable overall.
Basically, why deal with complicated checkpoints at the kernel level, which doesn't understand your software, when you can instead deal with that at the software layer, with full control over it?
They seem to be different problems though, as that LWN article suggests. It's probably easier to checkpoint/restore an entire image's state than an individual process' state.
But even calls like gettimeofday() work differently when running on hypervisors than they do when running on bare metal.
Same reaction here. MOSIX was basically this idea developed for real. Turns out there are a bunch of secondary problems to do with namespaces and security, scheduling and resource use, topology, performance, and so on. In the end it never turned out to be all that practical, and I know of several efforts (e.g. LOCUS) that went quite far to reach the same conclusion. There's a reason that non-transparent "shared nothing" distributed computing came to dominate the computing landscape.
Whoa, I was just talking about OpenMOSIX the other day - at my college job, when we retired workstations, they would just sit in the back closet for years until facilities would get around to putting them up for auction. We set up a TFTP boot setup and had a few dozen nodes at any given time. It wasn't high performance by any measure, but it worked pretty transparently, and it was always fun to throw heavy synthetic workloads at it and watch the cluster rebalance and hear the fans spin up on the clunky old Pentiums.
I recall OpenMosix too, I remember playing with it on a couple of old machines a while ago. I thought it seemed quite a cool idea. I think I was trying to do mp3 encoding from CDs with it in a distributed fashion.
Funny almost exactly one year ago I was desperately trying to learn about and demo a Linux cluster.
I happened upon ClusterKnoppix and used that LiveCD as my demo which uses OpenMOSIX. And MPI as well but I was having a lot of trouble getting it all in my head and having to explain it during a presentation.
Came here to write about ClusterKnoppix! It was amazing and I had 7 small form factor PC's running a mini cluster. It was great for moving things around. Problem was it fell behind in releases and it became hard to get applications to run on it. Knoppix was how I got into Linux, it still ranks as my favorite distro. Thanks Klaus Knopper for your work in setting that up!
It would be nice if it worked on Raspberry Pi, or if there was a simple way to set OpenMOSIX up on Pis.
Came here for the MOSIX, not disappointed. I worked at a facility that ran it on a small science cluster, and it worked extremely well.
A fun wrinkle: when the Intel C compilers first came out, they had warnings about process migration -- there's an executable mode where the program checks what kind of CPU it's on at launch time, then branches to CPU-specific code blocks during the run. This can be dangerous: the check is only done once, so if the CPU type changes mid-run due to migration to different hardware, Bad Things can happen.
This sounds like the same feature that Intel's C compiler used to "cripple AMD" by switching to a slower code path when it detected that the program was running on a non-Intel CPU.
For people that want to try OpenMOSIX out, take a look at this site: http://dirk.eddelbuettel.com/quantian.html He has a distro called Quantian, with a big collection of science tools added. Shame it's Sunday; I'll need to wait a week to pull it down and see how well it flies.
Kerrighed was a similar project from INRIA in France. Process migration was more transparent than with OpenMOSIX (no special launch command like mosrun). It even supported thread migration!
MPI / MPICH were the first things I thought of when I saw the headline. I was like "that's not new... HPC has been doing message passing this way for decades." :)
What's old is new again -- I'm pretty sure QNX could do this in the 1990s.
QNX had a really cool way of doing inter-process communication over the LAN that worked as if it were local. Used it in my first lab job in 2001. You might not find it on the web, though. The API references were all (thick!) dead trees.
Edit: Looks like QNX4 couldn't fork over the LAN. It had a separate "spawn()" call that could operate across nodes.
It's nice to see people re-discover old school tech. In cluster computing this was generally called "application checkpointing"[1] and it's still in use in many different systems today. If you want to build this into your app for parallel computing you'd typically use PVM[2]/MPI[3]. SSI[4] clusters tried to simplify all this by making any process "telefork" and run on any node (based on a load balancing algorithm), but the most persistent and difficult challenge was getting shared memory and threading to work reliably.
It looks like CRIU support is bundled in kernels since 3.11[5], and works for me in Ubuntu 18.04, so you can basically do this now without custom apps.
Really cool idea! Thanks for providing so much detail in the post. I enjoyed it.
A somewhat related project is the PIOS operating system, written 10 years ago but still used today to teach the operating systems class there. The OS has different goals than your project, but it does support forking processes to different machines and then deterministically merging their results back into the parent process. Your post reminds me of it. There's a handful of papers that talk about the different things they did with the OS, as well as their best paper award at OSDI 2010.
Condor, a distributed computing environment, has done IO remoting (where all calls to IO on the target machine get sent back to the source) for several decades. The origin of Linux Containers was process migration.
I believe people have found other ways to do this. Personally I think the ECS model (like k8s, but the cloud provider hosts the k8s environment), where the user packages up all the dependencies and clearly specifies the IO mechanisms through late binding, makes a lot more sense for distributed computing.
I clicked through to mention Condor too... I first came across it in the 90's, and it seems like one of those obvious hacks that keeps being reinvented.
I was actually channeling the creator of Condor, Miron Livny, who has a history of going to talks about distributed computing and pointing out that "Condor already does that" for nearly everything that people try to tout as new and cool.
That goes back to the 1980s, with UCLA Locus. This was a distributed UNIX-like system. You could launch a process on another machine and keep I/O and pipes connected. Even on a machine with a different CPU architecture. They even shared file position between tasks across the network. Locus was eventually part of an IBM product.
A big part of the problem is "fork", which is a primitive designed to work on a PDP-11 with very limited memory. The way "fork" originally worked was to swap out the process, and instead of discarding the in-memory copy, duplicate the process table entry for it, making the swapped-out version and the in-memory version separate processes. This copied code, data, and the process header with the file info.
This is a strange way to launch a new process, but it was really easy to implement in early Unix.
Most other systems had some variant on "run" - launch and run the indicated image. That distributes much better.
I worked at Locus back in the 90s, when this technology was part of the AIX 1.2/1.3 family. The basic architecture allowed for heterogeneous clusters (i386 and i370 -- PCs and IBM mainframes all running on the same global filesystem). Pretty sure you couldn't migrate processes to a machine with a different architecture though. It was awesome to be able to "kill -MIGRATE" a long-running make job to somebody else's idle workstation, or use one of the built-in shell primitives to launch a new job on the fastest machine in the cluster: "fast make -j 10 gcc".
There's also an ergonomics to process creation APIs - rather than needing separate APIs for manipulating your child process vs manipulating your own process, fork() lets you use one to implement the other: fork(), configure the resulting process, then exec().
CreateProcess* on Windows is a relative monstrosity of complexity compared to fork/exec.
This can let you stream in new pages of memory only as they are accessed by the program, allowing you to teleport processes with lower latency since they can start running basically right away.
I guess this post is a little bit different, because VMs are designed to be portable across different hosts. Even hypervisor software without live migration still lets you freeze the VM’s state to a file which can be copied to a new host. However, an already running process is not designed to be portable in the same way.
https://www.dragonflybsd.org/cgi/web-man?command=sys_checkpo...
https://www.dragonflybsd.org/cgi/web-man?command=checkpoint&...
0. https://lwn.net/Articles/293575/
This is what it looks like: https://layerci.com/jobs/github/layerdemo/layer-gogs-demo/18
(in this run it restored a previous VM snapshot to skip the setup, and took a snapshot after the webserver started to give a lazy-loaded staging server)
MPI also comes to mind, but it's more focused on the IPC mechanisms.
I always liked Plan 9's approach, where every CPU is just a file and you execute code by writing to that file, even if it's on a remote filesystem.
I'm glad that's over with! But it was still fun.
https://www.agner.org/optimize/blog/read.php?i=49#49
Now that's a blast from the past.
https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysar...
Plan 9, SmallTalk
[1] https://en.wikipedia.org/wiki/Application_checkpointing [2] https://en.wikipedia.org/wiki/Parallel_Virtual_Machine [3] https://en.wikipedia.org/wiki/Message_Passing_Interface [4] https://en.wikipedia.org/wiki/Single_system_image [5] https://en.wikipedia.org/wiki/CRIU#Use
https://dedis.cs.yale.edu/2010/det/
https://research.cs.wisc.edu/htcondor/description.html
He's not wrong, but few people use Condor.
That's what "live migration" does; it can be done with an entire VM: https://en.wikipedia.org/wiki/Live_migration