Readit News
Posted by u/chordol a year ago
Ask HN: What is the simplest data orchestration tool you've worked with?
Along the lines of Airflow, Prefect, Dagster, Argo, etc. What produced the least WTF per minute?
rasmusab · a year ago
Pure python scripts, maybe using the #%%-convention (https://code.visualstudio.com/docs/python/jupyter-support-py...) so you get the best of both notebooks and scripts, in a right-sized instance/container/machine. And if you need to run jobs in parallel, then orchestrate using make, like so: https://www.sumsar.net/blog/makefile-recipe-python-data-pipe...
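For a concrete picture, a cell-convention script is just a plain `.py` file with `# %%` markers, runnable top-to-bottom as a script but also cell by cell in VS Code or Jupyter. A minimal sketch (file and variable names are made up):

```python
# pipeline.py -- a plain Python script using the "# %%" cell convention.
import json
from pathlib import Path

# %% Extract
raw = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]

# %% Transform
total = sum(row["value"] for row in raw)

# %% Load
Path("out").mkdir(exist_ok=True)
Path("out/summary.json").write_text(json.dumps({"total": total}))
```

Each such script can then become a Makefile target whose output file is the target and whose inputs are prerequisites, and `make -j` runs independent targets in parallel, which is the approach the linked post describes.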
niwtsol · a year ago
Yeah, I love this: pure Python with cron or periodic tasks (e.g., Django) works great. Add a Celery task for parallelization, and if you pipe logs/alerts into a Slack channel, you can actually get really far without needing a "proper" orchestration layer.

I recently took over an Airflow system from a former colleague, and in our case, it’s just overly complex for what’s really a pretty simple data flow.
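A Slack alert channel like the one described can be as small as a custom logging handler. A sketch, with the HTTP poster injected as a callable so nothing here is a real Slack API call (in practice you'd pass a function that POSTs to your incoming-webhook URL):

```python
import logging

class SlackHandler(logging.Handler):
    """Forward log records to a Slack-style webhook.

    `post` is an injected callable; injecting it keeps this sketch
    testable and free of any real Slack credentials.
    """
    def __init__(self, post, level=logging.ERROR):
        super().__init__(level=level)
        self.post = post

    def emit(self, record):
        # Slack incoming webhooks accept a JSON payload with a "text" key.
        self.post({"text": self.format(record)})

# Usage: collect payloads in a list instead of actually posting.
sent = []
logger = logging.getLogger("jobs")
logger.addHandler(SlackHandler(sent.append))
logger.error("nightly import failed")
```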

mettamage · a year ago
I don’t know much about Airflow.

But isn’t it just also Python with cron?

vitorbaptistaa · a year ago
My experience entails:

* Luigi -- extensive usage (4y+)

* Makefiles -- (15y+)

* GitHub Actions -- (4y+)

* Airflow -- little usage (<6 months)

* Dagster -- very little, just trying it out

* Prefect -- just followed tutorial

Although it lacks a lot of the monitoring and the advanced web UI other platforms have (maybe because of that), Luigi is the simplest to reason about, IMHO.

For a new project that will require complex orchestrations, I'd probably go with Dagster or Prefect nowadays. Dagster seems more complex and more powerful with its data lineage functionality, but I have very little experience with either tool.

If it's a simple project, a mix of Makefiles + GH Actions can work well.
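A sketch of that mix: a scheduled GitHub Actions workflow that just calls `make` (the cron expression and Makefile target are made up):

```yaml
# .github/workflows/pipeline.yml
name: nightly-pipeline
on:
  schedule:
    - cron: "0 4 * * *"   # every day at 04:00 UTC
  workflow_dispatch: {}    # also allow manual runs
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make all       # hypothetical Makefile target
```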

vector_spaces · a year ago
Is there anything even more lightweight, where you don't have to write your code any differently? For instance, say I have 10 jobs that don't depend on each other, all of them pretty small.

Dagster and even Luigi feel like overkill, but I'd still like to plug those jobs into a unified interface where I can view previous runs, mainly logs and exit codes. Being able to do some light job configuration or add retries would be nice, but not required. For the moment I just use a logging handler that writes to a database table, and that's fine.
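That database-backed handler can be a few lines of stdlib Python. A sketch against SQLite (table, column, and logger names are made up):

```python
import logging
import sqlite3

class SQLiteHandler(logging.Handler):
    """Minimal handler writing each log record to a database table."""
    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS job_log "
            "(ts REAL, level TEXT, message TEXT)"
        )

    def emit(self, record):
        self.conn.execute(
            "INSERT INTO job_log VALUES (?, ?, ?)",
            (record.created, record.levelname, self.format(record)),
        )
        self.conn.commit()

# Usage: every warning or above lands in the job_log table.
conn = sqlite3.connect(":memory:")
log = logging.getLogger("small-jobs")
log.addHandler(SQLiteHandler(conn))
log.warning("job 3 exited non-zero")
```

Querying the table then gives the "previous runs" view; a tiny web page or notebook over it is the unified interface.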

disgruntledphd2 · a year ago
I think that Airflow 2 implemented a decorator mode which you can just use on plain functions.

Honestly, just use Airflow; it has its issues, but it sucks in well-known and predictable ways.
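That decorator mode is Airflow 2's TaskFlow API. A rough sketch, not runnable without an Airflow installation (2.4+ for the `schedule` argument); the DAG id, schedule, and task bodies are made up:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def etl():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(sum(rows))

    load(extract())  # data passed between tasks via XCom under the hood

etl()  # instantiating the decorated function registers the DAG
```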

cicdw · a year ago
One of the goals of Prefect's SDK is to be minimally invasive from a code standpoint (in the simplest case you only need two lines to convert a script to a `flow`). Our deployment model also makes infrastructure job config a first-class citizen, so you might have a good time trying it out. (Disclosure: I work at Prefect.)
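For reference, the two lines in question are the import and the decorator. A minimal sketch (needs `prefect` installed; the function name is made up):

```python
from prefect import flow  # line 1

@flow  # line 2: the script's entry point is now a tracked flow
def nightly_report():
    print("doing the actual work")

if __name__ == "__main__":
    nightly_report()  # runs locally, with state and logs recorded
```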
PaulHoule · a year ago
Straightforward programs in languages like Java, Python, etc.

The tools you describe all have the endpoint "you can't get there from here" and the only difference is if it takes you 5 seconds, 5 minutes, 5 days, 5 weeks or 5 months to learn that.

djsjajah · a year ago
A few people have mentioned Dagster, and I took a look at it for some machine learning things I was playing with, but then I found dvc (data version control [1]) and I think it is fantastic. I think it also has more applications than just machine learning: really anything with data. If you have a bunch of shell scripts that write to files to pass data around, then dvc might be a good fit. It will do things like only rerun steps if it needs to. Also, for totally non-data stuff, Prefect is great.

[1] https://dvc.org
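The rerun-only-what-changed behaviour comes from declaring each step's inputs and outputs. A sketch of a `dvc.yaml` (stage, script, and file names are made up):

```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - raw/data.csv
      - prepare.py
    outs:
      - data/clean.csv
  train:
    cmd: python train.py
    deps:
      - data/clean.csv
      - train.py
    outs:
      - model.pkl
```

`dvc repro` then re-executes a stage only when one of its `deps` has changed since the last run.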

saturn8601 · a year ago
I used to work for an automation company that produced a product called ActiveBatch. It was such an amazing tool for just drag-and-drop automation. Its focus was on full-fledged workflow automation, not just data orchestration.

What I loved was its simplicity plus its out-of-the-box features. Setting it up just took a simple MS SQL DB + an installer. Bam, you were up and running with an absolutely rock-solid scheduler (I've seen a million+ jobs running on it without it breaking a sweat). Then you could install (or use it to deploy) execution agents on all the servers you wanted as workers.

It also installed a robust desktop GUI that had so many services built in and ready to go (anything from executing scripts all the way to performing direct actions against countless products a company would have, or against various cloud services).

There were so many pre-built actions where all you had to do was input credentials, and it would enumerate the appropriate properties from that service automatically. Then you could connect things together (i.e., pull something from the cloud, process it on some other server, store it, pass it along to another service, whatever you wanted).

The only problem was that this is very much a B2B application, and their sales team is really only interested in selling to enterprises, not end users. I really wish we had something like this that regular people could download.

Everything I've seen listed here requires extensive setup, requires coding, or lacks a robust desktop GUI, offering instead some half-baked web GUI that might require dropping back down to scripts/coding. You could set up hundreds or thousands of automated steps in ActiveBatch without writing a single line of code. I miss that product.

margor · a year ago
I worked adjacent to people who ran thousands of jobs in ActiveBatch. That software was indeed very simple to use, and its GUI might have been awesome, but it was a double-edged sword: with hundreds of people working on it, it became a maintenance nightmare, and promoting changes between environments was non-existent, causing multiple incidents.

Mind you, it might have just been the culture at that place, but I don't think it's as good an example as you make it out to be. Sure, it was easy to get started with and made life easier at the beginning, but running it at scale was not in any way easy.

saturn8601 · a year ago
How far back are you talking about? Do you remember what version? When I was working there, improvements were being made to "Change Management", i.e. promoting changes from Test -> Production, and after I left I heard the improvements continued. At the time it was a ~50-person company that was very focused, so this was a pain point they were well aware of.
pramodbiligiri · a year ago
The parent comment made me curious enough to go look it up. Is it this same ActiveBatch that you both are referring to? https://www.advsyscon.com/
rich_sasha · a year ago
I wrote my own in half a day. Worked 24/7 for 3 years... then I quit.

Seriously, it took me much less time than setting up Airflow. It even had a webpage in the end, with all the tasks, a tree view, downstream and upstream tasks (these were incremental improvements beyond the initial half day), a CLI... the works.

I now know the points of fragility I didn't know before, but I'd do it again.
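Not this commenter's actual code, but the core of such a half-day orchestrator fits in a page: a task registry, a topological order over declared dependencies, and a run loop. A sketch using the stdlib's `graphlib` (task names are invented):

```python
from graphlib import TopologicalSorter

# Task registry: name -> (function, upstream task names).
tasks = {}

def task(name, depends_on=()):
    """Decorator that registers a function as a named task."""
    def register(fn):
        tasks[name] = (fn, tuple(depends_on))
        return fn
    return register

def run_all():
    """Run every task once, dependencies first; record each outcome."""
    graph = {name: deps for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, _ = tasks[name]
        try:
            fn()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results

@task("extract")
def extract():
    pass

@task("load", depends_on=["extract"])
def load():
    pass

run_results = run_all()
```

The web page, tree view, and CLI mentioned above are then incremental layers over `tasks` and the stored `results`.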

itfollowsthen · a year ago
At my last startup I asked a friend to help me debug an Airflow DAG. He just pip-installed Prefect, and I've never really looked back. At the time, everything else felt too hard to figure out.
speedgoose · a year ago
I like having containers running as CronJobs or Deployments in Kubernetes, but Argo Workflows has been a pretty reliable Kubernetes add-on for the more advanced scenarios.

However, it’s only simple if you are already familiar with software containers and Kubernetes. But that’s perhaps better to learn than having to deal with dependency hell in Python or Java.
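For reference, a containerised job on a Kubernetes schedule is a short manifest. A sketch with a made-up name, image, and schedule:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"   # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: etl
              image: registry.example.com/etl:latest
```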