samuell · 3 years ago
In the (unusually long) introduction of our paper on SciPipe, we did a pretty thorough overview contrasting the pros and cons of the most similar tools, including Snakemake, as we basically tried most of them out before realizing that, at the time, they didn't solve our problems:

https://academic.oup.com/gigascience/article/8/5/giz044/5480...

(A little background, FWIW: We had extra tough requirements, as we were implementing the whole machine learning process in the pipeline, so we needed to combine parameter sweeps for hyperparameter optimization with cross-validation, and none of the tools, at least at the time, met our needs for dynamic scheduling together with fully modular components.

Snakemake and similar tools are, in our experience, great for situations where you have a defined set of outputs you want to be able to easily reproduce (think figures for an analysis), but they can become harder to reason about when the workflow is highly dynamic and the desired output files are hard to express in terms of file name patterns.
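To make the filename-pattern model concrete, here is a minimal hypothetical Snakefile sketch (rule and file names are made up): you ask for the final files, and Snakemake works backwards through the patterns to build the DAG.

```
# Hypothetical sketch: outputs are declared as filename patterns,
# and Snakemake infers the DAG from the requested files.
rule all:
    input:
        expand("plots/{sample}.png", sample=["a", "b"])

rule plot:
    input:
        "results/{sample}.tsv"
    output:
        "plots/{sample}.png"
    shell:
        "python plot.py {input} {output}"
```

When the workflow is highly dynamic, the awkward part is exactly this: every intermediate result needs a name that fits such a pattern.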

Nextflow has subsequently implemented the modular components part that we missed (and implemented in SciPipe), but we are still happy with SciPipe, as it provides things like an audit log per output file, workflows compilable to binaries, and great debugging (because it's plain Go), all with a very small codebase without external Go dependencies.)

krastanov · 3 years ago
You might like mandala (https://github.com/amakelov/mandala) - it is not a build recipe tool; rather, it is a tool that tracks the history of how your builds / computational graph have changed, and ties it to what the data looked like at each such step.
samuell · 3 years ago
That's a cool approach, thanks for sharing!
dodslaser · 3 years ago
Shameless plug for a project I'm somewhat involved in: Hydra-genetics provides a growing set of well-structured Snakemake modules for bioinformatics (NGS) workflows.

https://github.com/hydra-genetics/

remram · 3 years ago
FYI, the link to the "installation page" in the SciPipe repo's README is a 404.
samuell · 3 years ago
Ouch, thanks, fixed!
teekert · 3 years ago
Snakemake is a beautiful project that evolves and improves fast. Years ago I realized I needed to up my game from the usual bash-based NGS data processing pipelines I was writing. Based on several recommendations I chose Snakemake, and I have never regretted it: it worked perfectly on our PBS cluster, then on our Slurm cluster. I took some steps to make it run on K8s, which it supports, and most recently I'm still/again happy with my choice because Snakemake (together with Nextflow) seems to be the chosen framework for the GA4GH cloud work stream's "products" like WES and TES [0]. This also seems to be the tech stack that Amazon Omics and Microsoft Genomics focus on [1]. It enables many cool things, like "data visiting": just submit your Snakefile (the definition of a workflow, basically a DAG) to a WES API in the same data center where your data lives, and the analysis starts near the data. Brilliant.

I owe a lot to Snakemake and Johannes Köster, I hope some day I can repay him and his project.

[0] https://www.ga4gh.org/work_stream/cloud/

[1] https://github.com/Microsoft/tes-azure

bsmith89 · 3 years ago
I too owe a lot of my PhD and postdoc productivity to Snakemake. It's my bioinformatics super-power, allowing me to run a complex analysis, including downloading containers (Singularity/Apptainer) and other dependencies (conda), with one command.
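For the curious, that "one command" looks roughly like the following (a sketch; flags as of Snakemake 7, and the exact invocation depends on your setup):

```
snakemake --use-conda --use-singularity --cores 16
```

`--use-conda` creates and activates the per-rule conda environments, and `--use-singularity` pulls and runs the declared containers before executing each job.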

Great for reproducibility. Great for development. Great for scaling analyses.

Snakemake is vital infrastructure for my work.

tetris11 · 3 years ago
It's fantastic, but it doesn't scale laterally particularly well compared to plain Make.
ta988 · 3 years ago
What dimension are you referring to?
matthew_stone · 3 years ago
100% agree, and it's wonderful to see Snakemake on the top of HN.

Snakemake is an invaluable tool in bioinformatics analysis. It's a testament to Johannes' talent and dedication that, even with the relatively limited resources of an academic developer, Snakemake has remained broadly useful and popular.

Super nice guy too, he's always been remarkably responsive and helpful. I saw him present on Snakemake back when he was a postdoc, and it really changed my approach to pipeline development.

krastanov · 3 years ago
Snakemake is great, but it does feel like just a slightly more modern Make.

I am pretty excited about research projects that tie the recipe and the computation closer together so that you do not preserve just the last recipe, but the whole history of exploratory computation and analysis.

E.g. mandala (https://github.com/amakelov/mandala), a project of a colleague of mine which is basically semantic git for your computational graph and data at the same time.

lebovic · 3 years ago
I work with Snakemake for computational biology. I see a lot of confusion as to why Snakemake exists when workflow management tools like Airflow exist, which mirrors my sentiment when moving from normal software to bio software.

Snakemake is used mostly by researchers who write code, not software engineers. Their alternative is writing scripts in bash, Python, or R; Snakemake is an easy-to-learn way to convert those scripts into a reproducible pipeline that others can use. It's popular in bioinformatics.
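As a sketch of what that conversion looks like (a hypothetical example; the script and file names are made up), an existing R script becomes a pipeline step via Snakemake's `script:` directive, which exposes the declared inputs and outputs inside the script (as `snakemake@input` / `snakemake@output` in R):

```
# Hypothetical rule wrapping a researcher's existing R script.
rule normalize:
    input:
        "data/counts.tsv"
    output:
        "results/normalized.tsv"
    script:
        "scripts/normalize.R"
```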

Snakemake can also execute jobs remotely on a shared cluster or in the cloud. It has built-in support for common executors like SLURM, AWS, and TES[1].

Snakemake isn't perfect, but it helps researchers jump from "scripts that only work on their laptop" to "reproducible pipelines using containers" that easily run on clusters and cloud computing. Running these pipelines is still pretty quirky[2], but is better than the alternative of unmaintained and untested scripts.

There are other workflow managers further down the path of a domain-specific language, like Nextflow, WDL, or CWL. Nextflow is a dialect of Java/Groovy that is notoriously difficult to learn for researchers. Snakemake, in comparison, is built on Python and has a less steep learning curve and fewer quirks.

There are other Python based workflow managers like Prefect, Metaflow, Dagster, and Redun. They're great for software engineers, but don't bridge the gap as well with researchers-who-write-code.

[1] TES is an open standard for workflow task execution that's usable with most bioinformatics workflow managers, like HTML for browsers.

[2] I'm trying to fix this (flowdeploy.com), as are others (e.g. nf-tower). I think the quirkiness will fade over time as tooling gets better.

oldelpaso66 · 3 years ago
I don't get why you claim something like Airflow doesn't bridge the gap well with researchers who write code. I've worked with WDL extensively, and I still think that Airflow is a superior tool. The second I need any sort of branching logic in my pipeline, the ways of solving it feel like you are working against the tool, not with it.
lebovic · 3 years ago
The bioinformatics workflow managers are designed around the quirkiness of bioinformatics, and they remove a lot of boilerplate. That makes them easier to grok for someone who doesn't have a strong programming background, at the cost of some flexibility.

Some features that bridge the gap:

1. Command-line tools are often used in steps of a bioinformatics pipeline. The workflow managers expect this and make them easier to use (e.g. https://github.com/snakemake/snakemake-wrappers).

2. Using file I/O to explicitly construct a DAG is built-in, which seems easier to understand for researchers than constructing DAGs from functions.

3. Built-in support for executing on a cluster through something like SLURM.

4. Running "hacky" shell or R scripts in steps of the pipeline is well-supported. As an aside, it's surprising how often a mis-implemented subprocess.run() or os.system() call causes issues.

5. There's a strong community building open-source bioinformatics pipelines for each workflow manager (e.g. nf-core, warp, snakemake workflows).
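To illustrate the subprocess point (4) with a minimal, self-contained Python sketch: `subprocess.run()` does not raise on a non-zero exit status by default, so a failing tool can go unnoticed and the pipeline happily continues; `check=True` turns the failure into an exception.

```python
import subprocess

# Without check=True a failing command is silent: run() just returns
# a CompletedProcess whose returncode happens to be non-zero.
result = subprocess.run(["false"])
print("returncode:", result.returncode)  # non-zero, no exception raised

# With check=True the same failure raises CalledProcessError,
# so the pipeline step actually stops instead of shipping partial output.
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as e:
    print("caught failure, exit code:", e.returncode)
```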

Airflow – and the other less domain-specific workflow managers – are arguably better for people with a stronger software engineering background. For someone who moved from the wet lab to the dry lab and is learning to code on the side, I think the bioinformatics workflow managers lower the barrier to entry.

jghn · 3 years ago
The problem with Airflow is that each step of the DAG for a bioinformatics workflow is generally going to be running a command line tool. And it'll expect files to have been staged in and living in the exact right spot. And it'll expect files to have been staged out from the exact right spot.

This can all be done with Airflow, but the bioinformatics workflow engines understand that this is a first class use case for these users, and make it simpler.

johanneskoester · 3 years ago
Awesome to see Snakemake discussed here, and many thanks for the positive feedback below. Thanks to the awesome and very active community, Snakemake has evolved very fast in recent years, while being widely used (on average >10 new citations per week in 2022). The best overview of the main features and capabilities can be found in our rolling paper, which is updated from time to time when important new features become available. It also contains various design patterns that are helpful when designing more complex workflows with Snakemake. You can find the paper here: https://doi.org/10.12688/f1000research.29032.2

And here is a little spoiler: I am currently working together with Vanessa Sochat on implementing a comprehensive plugin system for Snakemake, which will be used for executor backends but soon also for Snakemake's remote filesystem, logging, and scripting language support (and maybe more). The current implementations of those things will be moved to plugins but still maintained by the core team. This approach will enable a further democratization of Snakemake's functionalities because people can easily add new plugins without the need to integrate them into the main codebase. The plugin APIs will be kept particularly stable, so that Snakemake updates will in most cases not require any changes in the plugin implementations.

bafe · 3 years ago
I love Snakemake. It almost saved my PhD, but then I found Nextflow, which suited my type of problems better. What is slightly off-putting about Snakemake is that the internal API isn't well documented; I wanted to contribute a new remote for SQL databases and had to figure out most of the methods by comparing with other examples. Anyway, my PR has been inactive for months, which surprised me since they usually tend to review and approve quickly.
jerrygenser · 3 years ago
I see it can be used to define and run workflows. But for reproducibility on top of the execution of operations on data, I'm wondering if there's any way to version the underlying data?

Or would you typically use this in addition to a tool like DVC? I've used DVC a bit, and while it's quite good for data versioning, I find the workflow aspect clunky.

samuell · 3 years ago
The way a workflow tool like Snakemake can help here is generally by letting the filenames pretty much describe how each particular output was created, meaning data outputs can act as immutable in a sense.

What I mean is that rather than creating a new version of a file, if you run the same analysis with a different set of parameters, it should generate a new file with a different name rather than a new version of the old one. This also makes it easier to compare outputs produced with different parameters, etc.
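As a sketch of that pattern (a hypothetical rule; `train_tool` and the paths are made up), the parameter value becomes a wildcard in the output path, so runs with different settings never overwrite each other:

```
# Hypothetical: the learning rate is part of the output filename,
# so results/model_lr0.01.bin and results/model_lr0.1.bin coexist.
rule train:
    input:
        "data/train.tsv"
    output:
        "results/model_lr{lr}.bin"
    shell:
        "train_tool --lr {wildcards.lr} -i {input} -o {output}"
```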

That said, there are workflow platforms which support data versioning, such as Pachyderm (https://pachyderm.com), but it is a bit more heavyweight as it runs on top of Kubernetes.

bafe · 3 years ago
The reliance on filenames to define (parametric) dependencies was among the reasons I later adopted Nextflow; its model fit my computational dependencies better. In the meantime Snakemake has grown, and many DAGs that were hard to describe back then can now be expressed directly with Snakemake primitives.
jlduan · 3 years ago
I am very fond of https://zenodo.org, especially for small datasets for scientific publications.
bafe · 3 years ago
There is indeed an old issue asking whether they could return a provenance graph using the PROV ontology in JSON format: https://github.com/snakemake/snakemake/issues/2077 I think it's a good task to work on if you'd like to contribute.
samesense · 3 years ago
I use Snakemake along with DVC. It detects changes in files and reruns steps to produce downstream files as needed.
troymc · 3 years ago
Pachyderm has data versioning / data version control built-in. I guess other tools do too.
Protostome · 3 years ago
This is a fantastic project. It's crucial to note that Snakemake is an extension of Python, meaning you can directly incorporate Python code into your Snakefiles.

In our team, we utilize Snakemake along with Singularity. The key operations, such as model training and inference that aren't straightforward shell commands, are compartmentalized using containers. Snakemake significantly simplifies the process of integrating these various modules.
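For instance (a hedged sketch with made-up names), ordinary Python can sit right next to the rules, e.g. to compute the sample list, while a `container:` directive points a rule at an image:

```
# A Snakefile is parsed as Python, so plain Python code can be mixed in.
import glob

SAMPLES = [p.split("/")[-1].removesuffix(".fastq")
           for p in glob.glob("data/*.fastq")]

rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)

rule align:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.bam"
    container:
        "docker://example/aligner:latest"  # hypothetical image
    shell:
        "aligner -i {input} -o {output}"  # hypothetical tool
```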