As part of my job, I'm working on a software solution that needs to interact with one of the largest Italian HPC clusters (Cineca Leonardo, 270 PFLOPS). Of course, developing on the production system was out of the question, as it would have led to unbearably long feedback loops. I therefore started looking around for existing containerised solutions, but each one fell short in some way that kept it from suitably mocking our target system (no accounting, no MPI, out-of-date software, ...).
I thus decided that it was worth it to make my own virtual cluster from scratch, learning a thing or two about SLURM in the process. Even though it satisfies the particular needs of the project I'm working on, I tried to keep vHPC as simple and versatile as possible.
I proposed to the company that we open source it, and as of this morning (CET) vHPC is FLOSS for others to use and tweak. I'll be around to answer any questions.
* https://github.com/ComputeCanada/magic_castle
They link to various other projects that do cloud-y-HPC:
* AWS ParallelCluster [AWS]
* Cluster in the cloud [AWS, GCP, Oracle]
* Elasticluster [AWS, GCP, OpenStack]
* Google Cluster Toolkit [GCP]
* illume-v2 [OpenStack]
* NVIDIA DeepOps [Ansible playbooks only]
* StackHPC Ansible Role OpenHPC [Ansible Role for OpenStack]
Nvidia also offers free licenses for their Base Command Manager (BCM, formerly Bright Cluster Manager); pay for enterprise support, or hit up the forums:
* https://www.nvidia.com/en-us/data-center/base-command-manage...
* http://support.brightcomputing.com/manuals/10/
* http://support.brightcomputing.com/manuals/11/
Even surprisingly popular distributed-systems stuff is always really bad about "follow this 10-step copy/paste to deploy to EKS", and that's also obnoxious. In the first place, people want to see something basically working at a small scale first to check whether it's abandonware. But even after that, local prototyping without first setting up multiple repositories, then shipping multiple modified container images, and already having CI/CD for all of the above is really nice to have.
Not quite sure how well you looked, but there are a bunch of deployment systems for HPC, Ansible or otherwise:
* https://old.reddit.com/r/HPC/comments/1p4a3fq/what_imaging_s...
* My comment listing a bunch: https://news.ycombinator.com/item?id=46037792
I have worked 100% on 3 comparable systems over the past 10 years. Can you access it with ssh?
I find it super fluid to work on the HPC system directly to develop methods for huge datasets, using vim to code and tmux for sessions. I focus on constantly printing detailed log files with lots of debug output, plus an automated monitoring script that prints those logs in real time; a mixture of .out, .err and log.txt.
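For anyone curious what such a monitoring script could look like, here is a minimal sketch in Python. It is not the commenter's actual script: the function names (`read_new_lines`, `follow`) and the file names in the usage note are hypothetical. It polls each file and prints only complete new lines, prefixed with the file name.

```python
import time
from pathlib import Path


def read_new_lines(path, offsets):
    """Return complete lines appended to `path` since the previous call.

    `offsets` maps each file to the byte position already consumed.
    """
    path = Path(path)
    if not path.exists():
        return []
    start = offsets.get(path, 0)
    if path.stat().st_size < start:
        start = 0  # file was truncated or replaced: start over
    with path.open("rb") as fh:
        fh.seek(start)
        data = fh.read()
    # Only consume up to the last newline so half-written lines are not printed.
    end = data.rfind(b"\n") + 1
    offsets[path] = start + end
    return data[:end].decode("utf-8", errors="replace").splitlines()


def follow(paths, interval=1.0, rounds=None):
    """Poll `paths` and print new lines; `rounds=None` means poll forever."""
    offsets = {}
    done = 0
    while rounds is None or done < rounds:
        for p in paths:
            for line in read_new_lines(p, offsets):
                print(f"[{Path(p).name}] {line}", flush=True)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Something like `follow(["job.out", "job.err", "log.txt"])` in a tmux pane would then interleave the three files as they grow. Polling is the simplest option here; on a shared filesystem it is also less surprising than inotify-style watching.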
Our reference cluster has long queuing times during busy hours and requires 2FA for access, so we had extra incentives to have a self-contained solution to run on our development machines.
But this still runs on a single computer, so you wouldn’t use this to deploy a production cluster. This would be for testing in a virtual multi-node-ish setup.
Futile hope though. My company is still using SGE.
But I work in silicon and every company I've worked at uses SGE/SLURM for automated testing. SLURM absolutely sucks for that. They really want you to submit jobs as bash scripts; they can't handle a large number of jobs without using janky array jobs; and submitting a job and waiting for it to finish is kind of janky. Getting the output anywhere except a file is difficult. Nesting jobs is super awkward and buggy. All the command line tools feel like they're from the 80s: by default the column widths are like 5 characters (not an exaggeration).
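For what it's worth, a few of these pain points have partial workarounds using stock flags. The sketch below is only an illustration, not anyone's production wrapper: the helper names (`sbatch_command`, `run_and_wait`) and the example command are hypothetical. It relies on `sbatch --wrap` (no script file needed), `--wait` (block until the job ends and propagate its exit code), `--parsable` (print just the job ID), `--array`, and a wider `squeue --format`.

```python
import subprocess

# squeue's default format truncates job names to 8 characters; widen to 40.
WIDE_SQUEUE = ["squeue", "--format=%.18i %.9P %.40j %.8u %.2t %.10M %.6D %R"]


def sbatch_command(command, *, wait=True, array=None, output=None):
    """Build an sbatch argument list that runs `command` without a script file."""
    args = ["sbatch", "--parsable"]
    if wait:
        args.append("--wait")  # block until the job terminates
    if array is not None:
        args.append(f"--array={array}")  # e.g. "0-99%10": 100 tasks, 10 at a time
    if output is not None:
        args.append(f"--output={output}")  # e.g. "logs/%A_%a.out"
    args.append(f"--wrap={command}")
    return args


def run_and_wait(command, *, array=None, output=None):
    """Submit and block; returns (job_id, exit_code). Needs a working SLURM install."""
    proc = subprocess.run(
        sbatch_command(command, wait=True, array=array, output=output),
        capture_output=True,
        text=True,
    )
    job_id = proc.stdout.strip().split(";")[0]  # --parsable prints "jobid[;cluster]"
    return job_id, proc.returncode
```

For example, `run_and_wait("make test", array="0-99%10", output="logs/%A_%a.out")` would submit a throttled array and return once it has finished. It doesn't fix the output-goes-to-a-file problem, and array jobs are still array jobs.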
We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!
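A back-of-the-envelope version of that limit, as a sketch. The 4-ports-per-job figure is taken from the comment above; the port range is an assumption (the Linux default ephemeral range, `net.ipv4.ip_local_port_range = 32768 60999`), and which range actually applies depends on how the site configured things. The function name is hypothetical.

```python
def max_concurrent_jobs(port_low, port_high, ports_per_job=4, reserved=0):
    """Upper bound on simultaneous jobs if each one pins `ports_per_job` ports.

    `reserved` is the number of ports in the range used by everything else.
    """
    available = port_high - port_low + 1 - reserved
    return max(available, 0) // ports_per_job


# Assumed: Linux default ephemeral port range, nothing else using it.
linux_default = max_concurrent_jobs(32768, 60999)

# Assumed: every unprivileged port is available.
all_unprivileged = max_concurrent_jobs(1024, 65535)
```

With the default range that works out to roughly 7,000 jobs at best, and fewer once other connections on the controller are counted, which is in the same ballpark as "a few thousand".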
I don't think it would actually be that hard to write a modern replacement. The difficult bit is dealing with cgroups. I won't hold my breath for anyone in the silicon industry to write it though. Hardware engineers can't write software for shit.