The interplay of nf and groovy (how I wish they hadn't used groovy!) can be mind bending, but if you're writing your own thing you have a different optimization model than nf-core, which is trying to be one-size-fits-all.
You have tasks that require resources based on the input parameters; these run in Docker containers to pin down the environment, and you want to track the output of each step. Often these are embarrassingly parallel operations (e.g. I have 200 samples to run the same thing on).
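The "200 samples, same task" shape can be sketched with just the standard library. `process_sample` is a hypothetical stand-in for the real per-sample work; a full workflow engine would additionally run each task in its own Docker container and track provenance of the outputs.

```python
# Embarrassingly parallel: no task depends on another task's output,
# so all 200 can be submitted at once.
from concurrent.futures import ThreadPoolExecutor

def process_sample(sample: str) -> str:
    # Placeholder for the real work (alignment, QC, etc.).
    return f"{sample}.processed"

samples = [f"sample_{i:03d}" for i in range(200)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_sample, samples))

print(len(results))  # → 200
```

This is the easy part; the headaches the workflow managers abstract away are per-task resource requests, container images, retries, and resuming after a partial failure.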
Something like Dask, perhaps, but where you can specify a Docker image for the task?
What is the go-to in DevOps for similar tasks? GitHub Actions comes pretty close...
For bioinformatics, what is the unique selling point of Nextflow over, say, WDL/Cromwell?
You can do this on other systems but it’s nice to have the headache abstracted away for you.
The other major difference is the assumption of lifecycle. In most business domains you don't have researchers iterating on these things the way you do in bioinformatics. The newer ML/DS systems solve this problem better than, say, Airflow.
However all of them were rejections of prior models as well as the workflow solutions prominent in the business space.
Some features that bridge the gap:
1. Command-line tools are often used in steps of a bioinformatics pipeline. The workflow managers expect this and make them easier to use (e.g. https://github.com/snakemake/snakemake-wrappers).
2. Using file I/O to explicitly construct a DAG is built-in, which seems easier to understand for researchers than constructing DAGs from functions.
3. Built-in support for executing on a cluster through something like SLURM.
4. Running "hacky" shell or R scripts in steps of the pipeline is well-supported. As an aside, it's surprising how often a mis-implemented subprocess.run() or os.system() call causes issues.
5. There's a strong community building open-source bioinformatics pipelines for each workflow manager (e.g. nf-core, warp, snakemake workflows).
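Point 2 can be sketched in a few lines: a make-style scheduler derives the execution order purely from declared input/output files, which is the model (used by Snakemake) that researchers tend to find intuitive. The rule names and file names here are made up for illustration.

```python
# Each rule declares the file it produces and the files it needs; the
# DAG and execution order are inferred from file names alone.
rules = [
    {"out": "sample.bam",    "inputs": ["sample.fastq"], "name": "align"},
    {"out": "sample.vcf",    "inputs": ["sample.bam"],   "name": "call"},
    {"out": "sample.report", "inputs": ["sample.vcf"],   "name": "report"},
]

def build_order(target, rules, existing):
    # Recursively schedule whatever is needed to produce `target`,
    # stopping at files that already exist on disk.
    by_output = {r["out"]: r for r in rules}
    order = []
    def need(f):
        if f in existing:
            return
        rule = by_output[f]
        for dep in rule["inputs"]:
            need(dep)
        if rule["name"] not in order:
            order.append(rule["name"])
    need(target)
    return order

print(build_order("sample.report", rules, existing={"sample.fastq"}))
# → ['align', 'call', 'report']
```

Contrast this with Airflow-style systems, where you wire up a DAG of functions/operators explicitly and file handling is your own problem.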
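On point 4, the classic `subprocess.run()` mistake is worth spelling out: by default it does not raise when the command fails, so a broken step can silently hand an empty output file to the next one. The command here is just an illustration.

```python
import subprocess

# Bug: the non-zero exit status is silently ignored, and the
# pipeline would carry on as if the step succeeded.
result = subprocess.run(["false"])
assert result.returncode == 1

# Fix: check=True turns a non-zero exit status into an exception,
# so the workflow manager actually sees the failure.
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as e:
    print("step failed with exit code", e.returncode)
```

The bioinformatics workflow managers sidestep this class of bug by treating a step's shell block as failed on any non-zero exit by default.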
Airflow – and the other less domain-specific workflow managers – are arguably better for people with a stronger software engineering background. For someone who moved from wet lab to dry lab and is learning to code on the side, I think the bioinformatics workflow managers lower the barrier to entry.
As someone who is a software developer in the bioinformatics space (as opposed to the other way around) and has spent over 10 years deep in the weeds of both the bioinformatics workflow engines and more standard ones like Airflow - I would still reach for a bioinfx engine for that domain.
But - what I find most exciting is a newer class of workflow tools coming out that appear to bridge the gap, e.g. Dagster. From observation it seems like a case of parallel evolution coming out of the ML/etc world where the research side of the house has similar needs. But either way, I could see this space pulling eyeballs away from the traditional bioinformatics workflow world.
I had a job similar to that at a similar place (actually several places), but I ended up leaving because I was a one-man team and got burnt out. Writing software for scientific purposes and true R&D is very fun and interesting, and I think there is a lot of untapped potential for doing some interesting things there. But there is a balance between the wild west, what you first described, and what you later described. Keeping things organized enough to not be chaos but loose enough to not get siloed.
For example, the Broad Institute is super interesting, but having applied there several times, they are esoteric, to say the least, in their hiring. They pay well below market, and their process is opaque and slow and sometimes downright non-communicative. They are also not really open to remote work, so you gotta move there and commute to the heart of Cambridge. Budgets are set by folks maybe a couple years out of a PhD program, who will also make technical decisions in terms of the software design (the latter an assumption given my experience in similar places).
These organizations are also pretty traditional in their selection of stacks. Good luck trying to use a functional-first language, aside from maybe Scala (usually lots of Java stacks), and be prepared to write lots of Python, the only language that exists to many scientists. I once saw a Python signature (function name and arguments) spill over 10-20 lines, in a file over 10,000 lines long. They had given up on another software stack because “it wasn’t working for them”.
This is all painting with broad strokes, of course. But I think scientific organizations that would embrace software as a major component of their technological and scientific development would do well. There’s a lot of opportunity.
While they've moved away from it in the last few years, the Broad Institute had a huge investment in Scala. It's been in use there since at least 2010 and I believe longer. The primary software department was almost entirely Scala based for several years. That same department had pockets of Clojure as well.
This is an oversimplification, but IMO the easiest way of picturing it is this: instead of defining your graph as a forward-moving thing, with the orchestrator telling tasks when they can run, you shift to defining graph nodes that know their own dependencies and let the orchestrator know when they're runnable.
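A toy sketch of that inversion, with hypothetical node names: each node carries its own dependency list and answers "am I runnable?" itself, so the orchestrator's loop is just a dumb poll rather than the keeper of a precomputed schedule.

```python
class Node:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)   # nodes this one depends on
        self.done = False

    def runnable(self):
        # The node itself knows when it can run.
        return not self.done and all(d.done for d in self.deps)

def orchestrate(nodes):
    # The orchestrator only asks nodes whether they're ready.
    order = []
    while not all(n.done for n in nodes):
        ready = [n for n in nodes if n.runnable()]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for n in ready:          # these could run in parallel
            n.done = True
            order.append(n.name)
    return order

align = Node("align")
qc = Node("qc", deps=[align])
report = Node("report", deps=[align, qc])
print(orchestrate([report, qc, align]))  # → ['align', 'qc', 'report']
```

Note the orchestrator never sees edge directions; the correct order falls out of nodes declaring their own dependencies, which is roughly the mental shift described above.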