Readit News
geoffjentry commented on Nextflow: Data-Driven Computational Pipelines   nextflow.io/... · Posted by u/brianzelip
bafe · 3 years ago
Perhaps it is not as popular, but I found the Groovy syntax ideal for a DSL like that
geoffjentry · 3 years ago
I feel like Groovy pushes towards the worst of both worlds between an internal DSL and an external DSL. It's an internal DSL, so you get the language, but oh man, Groovy sucks
geoffjentry commented on Nextflow: Data-Driven Computational Pipelines   nextflow.io/... · Posted by u/brianzelip
a_bonobo · 3 years ago
As someone who dips into nextflow from time to time, I'd strongly suggest developing your pipeline based on an existing nf-core pipeline or the nf-core templates. nf-core comes with a bunch of nicer defaults, like profiles for SLURM, Singularity, and Docker, that help abstract some of the headaches away; plus you might get lucky and be able to just glue some of their modules together.
geoffjentry · 3 years ago
I don’t know. I started this route and then quickly switched to only dipping into nf-core when they had actual prior art.

The interplay of nf and groovy (how I wish they hadn't used groovy!) can be mind-bending, but if you're writing your own thing you have a different optimization model than nf-core, which is trying to be one-size-fits-all

geoffjentry commented on Nextflow: Data-Driven Computational Pipelines   nextflow.io/... · Posted by u/brianzelip
firecraker · 3 years ago
So my question to the non-bioinformatics folks: is this already a solved problem?

You have tasks which require resources based on the input parameters, these are run in Docker containers to ensure a consistent environment, and you want to track the output of each step. Often these are embarrassingly parallel operations (e.g. I have 200 samples to do the same thing on).

Something like Dask perhaps, but where you can specify a Docker image for the task?

What is the go-to in DevOps for similar tasks? GitHub Actions comes pretty close...

To the bioinformatics folks: what is the unique selling point of Nextflow over, say, WDL/Cromwell?

geoffjentry · 3 years ago
The big difference between bioinformatics workflow systems and everything else is what the typical payload of a DAG node is, and what optimizations that implies. Most other domains don't have DAG nodes that assume the payload is a crappy command-line call expecting its inputs/outputs to magically be in specific places on a POSIX file system.

You can do this on other systems but it’s nice to have the headache abstracted away for you.
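To make that concrete, here's a toy sketch of the per-node contract these engines handle for you: localize inputs into a scratch directory, run the shell command, collect the declared outputs. The helper name and shape are invented for illustration; this is not any real engine's API.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_task(command: str, inputs: dict[str, Path], outputs: list[str]) -> dict[str, Path]:
    """Toy model of one DAG node: the task's shell command assumes its files
    are 'magically' in place, and this wrapper makes that assumption true."""
    workdir = Path(tempfile.mkdtemp(prefix="task_"))
    for name, src in inputs.items():
        shutil.copy(src, workdir / name)               # localize inputs
    subprocess.run(command, shell=True, cwd=workdir, check=True)
    return {name: workdir / name for name in outputs}  # collect declared outputs
```

A real engine layers on top of this: remote staging, container images, retries, and caching keyed on inputs.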

The other major difference is the assumption of lifecycle. In most biz domains you don't have researchers iterating on these things the way you do in bioinf. The newer ML/DS systems solve this problem better than, say, Airflow

geoffjentry commented on Nextflow: Data-Driven Computational Pipelines   nextflow.io/... · Posted by u/brianzelip
bafe · 3 years ago
They try to address similar problems, but comparing snakemake and nextflow doesn't do either tool a favour. They use different computation models: nextflow is based on dataflow programming and therefore schedules processes dynamically as new data comes in, while snakemake is pull-based and schedules processes based on the DAG defined by the dependencies. Anyhow, they are both great tools.
geoffjentry · 3 years ago
While true, that's a minor distinction when comparing the cluster of bioinformatics workflow systems against workflow systems aimed at other domains
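The push side of that distinction can be sketched in miniature: plain Python queues and threads standing in for dataflow channels and processes (all names invented). Each stage fires as soon as an item arrives, with no precomputed schedule.

```python
from queue import Queue
from threading import Thread

# Push/dataflow model (Nextflow-style, toy version): a process runs whenever
# an item arrives on its input channel; a None "poison pill" signals shutdown.
def process(fn, inbox: Queue, outbox: Queue):
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)   # propagate shutdown downstream
            return
        outbox.put(fn(item))

samples, trimmed, aligned = Queue(), Queue(), Queue()
Thread(target=process, args=(lambda s: s + ".trimmed", samples, trimmed)).start()
Thread(target=process, args=(lambda s: s + ".bam", trimmed, aligned)).start()

for s in ["sampleA", "sampleB"]:
    samples.put(s)
samples.put(None)

results = []
while (r := aligned.get()) is not None:
    results.append(r)
```

The pull-based alternative would instead look at the requested final files and work backwards through declared dependencies to decide what to run.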
geoffjentry commented on Nextflow: Data-Driven Computational Pipelines   nextflow.io/... · Posted by u/brianzelip
esafak · 3 years ago
In all fairness, they predate the competition (2013): https://github.com/nextflow-io/nextflow/releases?page=25
geoffjentry · 3 years ago
As GP referenced CWL: while NF appeared first, in the bioinformatics world Nextflow, CWL, Snakemake, and WDL all erupted close enough to each other to be equal-ish. The people involved were aware of each other, but the projects were all so nascent that it wasn't clear whether it was worth joining forces or not. At the end of the day these all came from groups trying to scratch particular itches, and not everyone agreed on the right way to scratch.

However all of them were rejections of prior models as well as the workflow solutions prominent in the business space.

geoffjentry commented on Snakemake – A framework for reproducible data analysis   snakemake.github.io/... · Posted by u/gjvc
lebovic · 3 years ago
The bioinformatics workflow managers are designed around the quirkiness of bioinformatics, and they remove a lot of boilerplate. That makes them easier to grok for someone who doesn't have a strong programming background, at the cost of some flexibility.

Some features that bridge the gap:

1. Command-line tools are often used in steps of a bioinformatics pipeline. The workflow managers expect this and make them easier to use (e.g. https://github.com/snakemake/snakemake-wrappers).

2. Using file I/O to explicitly construct a DAG is built-in, which seems easier to understand for researchers than constructing DAGs from functions.

3. Built-in support for executing on a cluster through something like SLURM.

4. Running "hacky" shell or R scripts in steps of the pipeline is well-supported. As an aside, it's surprising how often a mis-implemented subprocess.run() or os.system() call causes issues.

5. There's a strong community building open-source bioinformatics pipelines for each workflow manager (e.g. nf-core, warp, snakemake workflows).
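Point 2 above, deriving the DAG from declared file inputs and outputs, can be sketched with the stdlib. This is a toy illustration of the pull-based idea, not Snakemake's actual machinery, and the filenames are made up.

```python
from graphlib import TopologicalSorter

# Rules declare the files they produce and the files they need (pull-based):
# the DAG falls out of filename dependencies, then runs in topological order.
rules = {
    "reads.trimmed": {"needs": ["reads.fastq"]},
    "reads.bam":     {"needs": ["reads.trimmed", "genome.fa"]},
    "report.html":   {"needs": ["reads.bam"]},
}

graph = {out: set(rule["needs"]) for out, rule in rules.items()}
order = list(TopologicalSorter(graph).static_order())
# every file appears after the files it depends on
```

For a researcher, "this rule needs these files and makes those files" is often a more natural mental model than wiring functions together by hand.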

Airflow – and the other less domain-specific workflow managers – are arguably better for people who have a stronger software engineering background. For someone who moved from wet lab to dry lab and is learning to code on the side, I think the bioinformatics workflow managers lower the barrier to entry.

geoffjentry · 3 years ago
> are arguably better for people who have a stronger software engineering background

As someone who is a software developer in the bioinformatics space (as opposed to the other way around) and who has spent over 10 years deep in the weeds of both the bioinformatics workflow engines and more standard ones like Airflow - I would still reach for a bioinfx engine for that domain.

But - what I find most exciting is a newer class of workflow tools coming out that appear to bridge the gap, e.g. Dagster. From observation it seems like a case of parallel evolution coming out of the ML/etc world where the research side of the house has similar needs. But either way, I could see this space pulling eyeballs away from the traditional bioinformatics workflow world.

geoffjentry commented on Consider working on genomics   claymcleod.dev/blog/2022-... · Posted by u/clmcleod
bmitc · 3 years ago
This is a great read. Thanks for the information. What you originally described is basically my dream job: software engineers working alongside scientists and engineers, where the software engineers become domain knowledgeable if not experts in certain areas.

I had a job similar to that at similar places, but I ended up leaving because I was a one-man team and got burnt out. Writing software for scientific purposes and true R&D is very fun and interesting, and I think there is a lot of untapped potential for doing some interesting things there. But there is a balance between the wild west, what you first described, and what you later described: keeping things organized enough to not be chaos but loose enough to not get siloed.

geoffjentry · 3 years ago
A really hard aspect to this is that there's a massive impedance mismatch between the research & production side of things. Working in the research side is pretty straightforward - although software development practices are going to be a lot looser & faster. Working in a production environment is straightforward, it's like any other software job. But - working at the confluence of those two states is incredibly difficult.
geoffjentry commented on Consider working on genomics   claymcleod.dev/blog/2022-... · Posted by u/clmcleod
laidoffamazon · 3 years ago
I live next to Broad's offices and see people leaving/entering the office at odd hours on Saturday and Sunday. That (and the fact that they pay about 75% what I made as a new grad) prevented me from ever applying there.
geoffjentry · 3 years ago
Keep in mind that there are wetlabs with experiments being conducted in them. Lab techs will be coming and going at all hours.
geoffjentry commented on Consider working on genomics   claymcleod.dev/blog/2022-... · Posted by u/clmcleod
bmitc · 3 years ago
Usually, scientific oriented companies or organizations have little regard for software as a domain, craft, etc. It’s just a thing that gets in the way, despite being vital. It’s almost just a utility to them rather than a differentiator and active component of the advanced work going on.

For example, the Broad Institute is super interesting, but having applied there several times, they are esoteric, to say the least, in their hiring. They pay well below market, and their process is opaque and slow and sometimes downright non-communicative. They are also not really open to remote work, so you gotta move there and commute to the heart of Cambridge. Budgets are set by folks maybe a couple years out of a PhD program, who will also make technical decisions in terms of the software design (the latter an assumption given my experience in similar places).

These organizations are also pretty traditional in their selection of stacks. Good luck trying to use a functional-first language, aside from maybe Scala (usually lots of Java stacks), and be prepared to write lots of Python, the only language that exists to many scientists. I once saw a Python signature (function name and arguments) spill over 10-20 lines, in a file over 10,000 lines long. They had given up on another software stack because “it wasn’t working for them”.

This is all painting with broad strokes, of course. But I think scientific organizations that would embrace software as a major component of their technological and scientific development would do well. There’s a lot of opportunity.

geoffjentry · 3 years ago
> Good luck trying to use a functional-first language, aside from maybe Scala

While they've moved away from it in the last few years, the Broad Institute had a huge investment in Scala. It's been in use there since at least 2010 and I believe longer. The primary software department was almost entirely Scala based for several years. That same department had pockets of Clojure as well.

geoffjentry commented on Airflow's Problem   stkbailey.substack.com/p/... · Posted by u/cloakedarbiter
sethjr5rtfgh · 4 years ago
Could you clarify what "keep track of state of assets" means?
geoffjentry · 4 years ago
It's a mindset shift to a more declarative model. The idea has also popped up in other niche orchestrators.

This is an oversimplification, but IMO the easiest way of picturing it is: instead of defining your graph as a forward-moving thing, with the orchestrator telling nodes when they can run, you shift to graph nodes that know their own dependencies and let the orchestrator know when they're runnable.
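A minimal sketch of that declarative shift (hypothetical structure, loosely asset-orchestrator-flavored): each node declares its dependencies and the orchestrator only ever asks which nodes are currently runnable, rather than pushing a precomputed schedule forward.

```python
# Declarative model (toy, invented names): nodes carry their own dependency
# list plus a freshness flag; the orchestrator's job reduces to one query.
assets = {
    "raw":    {"deps": [], "fresh": True},
    "clean":  {"deps": ["raw"], "fresh": False},
    "report": {"deps": ["clean"], "fresh": False},
}

def runnable(assets):
    """A node can run once all of its declared dependencies are fresh."""
    return [name for name, a in assets.items()
            if not a["fresh"] and all(assets[d]["fresh"] for d in a["deps"])]

while todo := runnable(assets):
    for name in todo:
        assets[name]["fresh"] = True  # stand-in for materializing the asset
```

The inversion is the point: adding a new node means declaring what it depends on, not editing a central forward-scheduling script.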

u/geoffjentry

Karma: 75 · Cake day: June 25, 2021