Netflix Conductor: Open-source workflow orchestration engine

kelnos · 5 years ago

I set up Conductor where I work while evaluating workflow engines, and overall wasn't too happy with it. The default datastore is this Netflix-specific thing (Dynomite) that's built on top of redis. It's not particularly easy to integrate operationally into non-Netflix infrastructure, and Conductor itself hard dependencies on several services.

The programming model for workflows/tasks felt a little cumbersome, and after digging into the Java SDK/Client, I wasn't impressed with the code quality.

We did have some contacts at Netflix to help us with it, but some aspects (like dyomite itself, and its sidecar, dynomite-manager) felt abandoned with unresponsive maintainers.

We've started using Temporal[0] (née Cadence) recently, and while it's not quite production-ready, it's been great to work with, and, just as critically, very easy to deal with operationally in our infrastructure. The Temporal folks are mostly former Uber developers who worked on Cadence, and since they're building a business around Temporal, they've been much more focused and responsive.

[0] https://temporal.io/

doktorhladnjak · 5 years ago

The Temporal founders worked on AWS SWF too before building Cadence at Uber. They have a lot of experience in this area and are making the product better with each iteration no doubt. I enjoyed using Cadence at Uber and definitely sad about not having it at my current company.

One of the founders, mfateev, is around elsewhere on this thread answering questions.

skissane · 5 years ago

> I set up Conductor where I work while evaluating workflow engines, and overall wasn't too happy with it. The default datastore is this Netflix-specific thing (Dynomite) that's built on top of redis. It's not particularly easy to integrate operationally into non-Netflix infrastructure, and Conductor itself hard dependencies on several services.

I find the RDBMS-based backends (Postgres or MySQL, depending on your preference of DB) much easier to get going with. Of course, by default you don't get the same HA as Dynomite – but that might not be a big issue in a particular use case, and RDBMS clustering/failover solutions can address that. If you are already in an environment with lots of RDBMS, RDBMS can be a sensible choice.

There are other options to consider too – Conductor supports direct use of Redis (including with Redis Cluster and Redis Sentinel) instead of going via Dynomite, and it also supports Cassandra. Someone should really create a pro/con comparison of all the different storage options, that was the biggest thing I was missing when evaluating it.

freeqaz · 5 years ago

Workflows and orchestration are my jam -- that's what we're trying to simplify over at https://refinery.io

Conductor is a cool piece of tech, and it's a well-established player in a rapidly growing space for workflow engines.

I used to work at Uber and that company had microservice-hell for a while. They built the project Cadence[0] to alleviate that. It is similar to Conductor in many ways.

One project to watch out for is Argo[1] which is a CNCF-backed project.

There are also some attempts[2] to standardize the workflow spec.

Serverless adds a whole new can of worms to what orchestration engines have to manage, and I'm very curious to see how things evolve in the future. Kubernetes adds a whole dimension of complexity to the problem space, as well.

If anybody is interested in chatting about microservice hell or complex state machines for business logic, I'd be excited to chat. I'm always looking for more real world problems to help solve (as an early stage startup founder) and more exposure to what others are struggling with is helpful!

0: https://github.com/uber/cadence

1: https://argoproj.github.io/argo/

2: https://serverlessworkflow.github.io/

jayd16 · 5 years ago

Hey, wait a second. Are these just a modern incarnation of the enterprise service bus? Is there a significant difference?

zo1 · 5 years ago

By modern you mean using "docker", have an "io domain", and use node-js/go/rust? That seems to be the pattern these days. Take existing items, rewrite/repackage and io-ify it, maybe make it a SAAS product while at it.

devonkim · 5 years ago

ESBs historically required a domain-aware object model (when I saw them they were routed based upon rich types and terms like CORBA and RMI were still in vogue). This is more obvious when you look down the page for features in Apache Camel and they mention different data type support. Well, a workflow or orchestration engine doesn't need _any_ idea of your business layer data model to make decisions on where they need to be broadcast, but it's a convenience similar to listening to specific topics and segments in Kafka after querying a data broker service. Decoupling routing and representation of objects means that the bus is now dumb but the scheduler / router is now interchangeable. Which is how we can see so many possible competitors to Conductor and EBPM and all that now as shown in these comments.

extrapickles · 5 years ago

Yes. The only difference from what I can tell is that each service/node on the bus is a VM/Container rather than a bespoke built machine.

dragonwriter · 5 years ago

> Are these just a modern incarnation of the enterprise service bus?

No, ESB is the messaging plane, workflow orchestration is a higher level service, on top of the messaging plane.

Answerawake · 5 years ago

Not sure if this relates but I was wondering if you heard of ASCI ActiveBatch? If so what are your thoughts on it?

I'd like to have some easy to set up orchestration/job scheduling engine in my team so we can help clean up the tangled mess that we are in but something like Ansible seems like too much work to setup and add more jobs over time. I tried Activebatch and I wish there was some free or cheap alternative of it.

freeqaz · 5 years ago

I hadn't heard of it before. Thanks for sharing!

What are the types of problems you're hoping to untangle, more specifically? I assume you want something like... A workflow where each step maps to some provisioning process?

If that guess is reasonably true, then I immediately think of Terraform. You can specify dependencies and write hooks that trigger callbacks when certain steps are called. We use Terraform a bunch under-the-hood for Refinery, and it's great. I haven't used Ansible (only read about it) so I can't contrast them very well.

If you want to chat some more, I'd be curious to hear more about what you're trying to build and see if there is a way we could collaborate. Or if I can offer any recommendations for software that I know of. My email is free at refinery.io :)

dap · 5 years ago

Thanks for the resources! It seems like a lot of these implementations expect to run as independently managed services in some microservice architecture. Are you aware of any workflow engines implemented as a library inside your application, maybe with storage backed by an external database? I think that you could still have a highly available model, provided the database supported that.

hukola · 5 years ago

https://github.com/ing-bank/baker is one such library for JVM languages. The state is kept in Cassandra or in-memory. We've been running production workloads with it for the last 2.5 years. A feature comparison with alternatives can be found here https://ing-bank.github.io/baker/sections/feature-comparison....

mianos · 5 years ago

Prefect. It's a workflow engine in a python library.

manyxcxi · 5 years ago

Funny, even though Conductor isn’t designed to be- this is exactly how we use it. In a Kotlin and Spring Boot codebase nonetheless!

AlphaSite · 5 years ago

Like celery?

Ataraxy · 5 years ago

The idea of a pseudo flow based programming worfklow has always appealed to me because I think having logic as components that can be used as legos makes sense. It seemed like the other stuff out there was either too overkill/obtuse or even too low level for what I had in mind. So I started writing something for personal use in JS that fits my mental model of what that would be.

In a nutshell it's just a runner and uses a project called moleculer which is a microservice framework that I'm more or less just using as an rpc client to execute the tasks of the workflow.

One thing I've been debating with myself is how I should work with the dataflow. Would you say it's better to have the results of every node merged into a singular context across an entire flow that is passed to every other node/block (ie. always one input, the context, one output the context), or would it be better to explicitly declare and pass in inputs/outputs.

Here's one project that does it that way: https://github.com/danielduarte/flowed

One benefit of this seems to be that nodes can run in parallel the moment their dependencies are met without having to care about each other.

I poked around on your site and I like what you have to offer. You appear to do it the singular context way which I suppose makes sense because it seems like every block is an individual lambda. This also seems like overkill for my purposes because I'm interested in being able to do granular things if I wish such as a simpe rule action. I'm not sure having an individual lambda to run "if email exists return true" would be practical. That and the warm up time/latency.

The other thing is managing something like npm dependencies might be annoying across blocks.

Being able to create arbitrary API endpoints is nice though.

I wish there were something a little more in between than all of these enterprisey offerings.

halfmatthalfcat · 5 years ago

My company is slowly switching from Tekton to Argo (much to my chagrin). They seem very similar in that they’re incredibly alpha and fighting over similar parts of the stack.

Argo seems more CD focused right now whereas Tekton really is a “toolbox”/make what you want.

freeqaz · 5 years ago

That's funny because the Tekton website[0] has this as their first text:

"Tekton is a powerful and flexible open-source framework for creating CI/CD systems, allowing developers to build, test, and deploy across cloud providers and on-premise systems."

Do you know why they are switching, out of curiosity?

The market is definitely very young and I don't think anybody has really nailed down every use case.

There are the parts of the market targeting "business model" use cases like Camunda/Zeebe.

Then there are ETL-style systems like Airflow for dealing with massive data processing.

And still you've got the CI/CD side of things like Argo/Tekton for automating complex build systems/running tests.

Then for systems like Netflix Conductor, Uber Cadence, and AWS Step Functions (among others), you have systems trying to abstract on top of existing complex systems (microservices, etc).

That's not even including low-code spaces like Zapier or IFTTT that try to target making integrations trivial.

It's a crazy world!

0: https://tekton.dev/

theptip · 5 years ago

I had Tekton in my mental filing cabinet under "Cloud Native CI/CD", how's the UI / DAG-visualization for that toolchain? Like the sibling comment, I'd also be interested in hearing your A/B thoughts on Argo vs. Tekton for the generic workflow management usecase.

devmunchies · 5 years ago

interesting project you started. I had used snaplogic before and its similar, although not built specifically for AWS serverless. Do you have any videos of any walkthroughs, demos, or tutorials?

EDIT: nevermind, i see your getting started docs have lots of video clips.

freeqaz · 5 years ago

I hadn't heard of Snaplogic before. Had to dig in. Thanks for sharing -- this is helpful context to have. There are so many companies across different spaces that it's hard to find them all!

Glad you found the docs here[0] (for anybody else who is curious).

We're still iterating on Refinery a lot and trying to find product market fit. If it doesn't make sense or you're confused, that feedback is super helpful for me. The goal is to build something people actually want to use and it's an iterative process to get there!

0: https://docs.refinery.io/getting-started/

faizshah · 5 years ago

Is there a serverless workflow solution for google cloud yet?

CraigJPerry · 5 years ago

What kind of support is present for testing workflows?

mfateev · 5 years ago

Temporal (Cadence) comes with Unit testing framework. The most interesting feature of it is ability to skip time automatically when workflow is blocked waiting for something. It allows testing long running workflows in milliseconds without any need to change timeouts.

freeqaz · 5 years ago

It's fairly limited right now to testing single blocks in the editor.

We recently added support for Git-based development to address testing. The idea there is that each block is just a chunk of code living in a Git repo, so you can use whatever testing tools you'd like to.

It's actually pretty slick -- it's bidirectional. You can use Git to make a commit, like normal, and then refresh the editor to see it. And you can make also changes in the editor and it will commit + push to Git. If you want to play with that, I can share details on how to enable it (it's fairly hidden in the UI atm).

In terms of testing workflows together as a chain, that's still TBD. It is something that we'd like to ship eventually though. Along with staging environments and canary deployments.

On our immediate roadmap is to get the core of the tool open-sourced so that people can start playing with it outside of our hosted platform. We have it gated behind a credit-card form right now to fight fraud and because of some technical limitations.

theptip · 5 years ago

Quick notes from skimming the docs:

* Conductor implements a workflow orchestration system which seems at the highest level to be similar to Airflow, with a couple of significant details.

* There are no "workers", instead tasks are executed by existing microservices.

* The Orchestrator doesn't push work to workers (e.g. Airflow triggering Operators to execute a DAG), instead the clients poll the orchestrator for tasks and execute when they find them.

My hot take:

If you already have a very large mesh of collaborating microservices and want to extract an orchestration layer on top of their tasks, this system could be a good fit.

Most of what you're doing here can also be implemented in Airflow, using an HTTPOperator or GRPCOperator that triggers your services to initiate their task. You don't get things like pausing though. On the other hand, you do get the ability to run simple/one-off tasks in an Airflow operator, instead of having to build a service to run your simple Python function.

I'm unsure on whether push/pull is better; I think it largely depends on your context. I'm inclined to say that for most cases, having the orchestrator push tasks out over HTTP is a better default, since you can simply load-balance those requests and horizontally scale your worker pool, and it's easier to test a pipeline manually (e.g. for development environments) if the workers respond to simple HTTP requests, instead of having to provide a stub/test implementation of the orchestrator. (In particular I'm thinking about "running the prod env on your local machine in k8s" -- this isn't practical at Netflix scale though.)

dilly_li · 5 years ago

Is there a workflow tool that’s designed with micro service in mind?

My particular use case: - several workers process the data on the workers’ local threads - several workers serve as relays to interface with external third party services, hold all the necessary credentials, and conduct cron-like checking. - the ETL tool doesn’t directly provision these workers.

The second point is part of the reason why we don’t want to use Airflow’s k8s operator.

But it doesn’t seem like there is a better option in terms commonly used and robustness. So we are leaning towards write some custom operators and sensors to make Airflow more friendly to micro services.

Thought?

tupac_speedrap · 5 years ago

We've used Conductor at my workplace for about a year now. The grounding is pretty solid but the documentation is pretty pants once you dig into it. We have to resort to digging into github issues to find fairly fundamental features that aren't really documented. I feel Conductor is something Netflix has open-sourced and then sort of dumped on the OS community.

For example there isn't any examples of how to implement workers using their Java client, we had to dig up a blog post to do that, although it is fairly simple a very basic example of implementing the Worker interface would be nice.

They also do not make it clear the exact relationship between tasks and workflows and it's hard to find any good examples of relatively complex workflows and task definitions available on the internet other than Netflix's barebones documenatation and the kitchen-sink workflow they provide, which is broken by default on the current API.

Also the configuration makes mention of so many fields that are pretty much undocumented, like you can swap out your persistence layer for something else but I would have no idea how that works.

TheColorYellow · 5 years ago

Suprised to see Camunda isn't mentioned here more.

Open-Source BPMN compliant workflow processing with a history of success. Goldman Sachs supposedly runs their internal org with it.

Slightly different target use case, but Camunda has really shined in microservices orchestration and I find implementing complex workflow and managing task dependencies much easier with it.

swyx · 5 years ago

do you have some recommended resources to learn more about BPMN? what's your take on BPMN vs other approaches? (JSON or cadence/temporal style "workflow as code")

_gfrc · 5 years ago

Very interesting. Looks a lot like zeebe [0], which uses BPMN for the workflow definition. This makes it easier to communicate the processes with the rest of the company. I never used it in production, just played around with it for a demo.

[0] https://zeebe.io/

theptip · 5 years ago

I've looked at Zeebe, and Camunda too - likewise, just in a demo capacity.

Interested in folks' experiences deploying these tools, as this sounds like a potentially very useful way of modeling business workflows that span multiple services.

MrSaints · 5 years ago

I've used Conductor, Zeebe, and Cadence all in close to production capacity. This is just my personal experience.

Conductor's JSON DSL was a bit of a nightmare to work with in my opinion. But otherwise, it did the job OK-ish. Felt more akin to Step Functions.

Arguably, Zeebe was the easiest to get started with once you get past the initial hurdle of BPMN. Their model of job processing is very simple, and because of that, it is very easy to write an SDK for it in any language. The biggest downside is that it is far from production ready, and there are ongoing complaints in their Slack about its lack of stability, and relatively poor performance. Zeebe does not require an external storage since workflows are transient, and replicated using their own RocksDB, and Raft set-up. You need to export, and index the workflows if you want to keep a history of it or even if you want to manage them. It is very eventually consistent.

With both Conductor, and Zeebe however, if you have a complex enough online workflow, it starts getting very difficult to model them in their respective DSLs. Especially if you have a dynamic workflow. And that complexity can translate to bugs at an orchestration level which you do not catch unless running the different scenarios.

Cadence (Temporal) handles this very well. You essentially write the workflow in the programming language itself, with appropriate wrappers / decorators, and helpers. There is no need to learn a new DSL per se. But, as a result, building an SDK for it in a specific programming language is a non-trivial exercise, and currently, the stable implementations are in Java, and Go. Performance, and reliability wise, it is great (relies on Cassandra, but there are SQL adapters, though, not mature yet).

We have somewhat settled on Temporal now having worked with the other two for quite some time. We also explored Lyft's Flyte, but it seemed more appropriate for data engineering, and offline processing.

As it is mentioned elsewhere here, we also use Argo, but I do not think it falls in the same space as these workflow engines I have mentioned (which can handle the orchestration of complex business logic a lot better rather than simple pipelines for like CI / CD or ETL).

Also worth mentioning is that we went with a workflow engine to reduce the boilerplate, and time / effort needed to write orchestration logic / glue code. You do this in lots of projects without knowing. We definitely feel like we have succeeded in that goal. And I feel this is an exciting space.

TheColorYellow · 5 years ago

I've worked with Camunda extensively, which Zeebe is based on.

I've found Camunda to be incredible. The APIs are implemented well and the work flow processing paradigm is easy to work with. Setting up the Camunda engine as a web server in a Spring project and integrating with external sources is great.

I've found there can be some performance issues when running a single engine, but clustering is easily enabled and you can adjust with a dedicated worker paradigm too.

Great piece of tech honestly. Haven't worked with Zeebe yet but am excited to.

shadykiller · 5 years ago

Can someone explain how and where to use a Workflow Orchestration Engine ?

juancampa · 5 years ago

These are useful for tasks that can last an arbitrarily long amount of time. Think about the process of signing up a user and waiting for her to click on a email verification link. This process can literally never end (the user never clicked) but more commonly it takes a few minutes.

It's easier to implement these things if you can write the code like:

   await sendVerificationEmail();
   await waitForUserToClick();   // This could take forever
   await sendWelcomeEmail();

If you do the above in a "normal" program, said program could stay in memory forever (consuming RAM, server can't restart, etc). The workflow engine will take care of storing intermediate state so you can indeed write the above code.

The other option is to implement a state machine via your database and some state column, but the code doesn't look as pretty as the above three lines.

Note that this particular tool seems to be more declarative than my example above (it uses JSON to do define the steps), so instead of using an `if` statement, you'd need to declare a "Decision".

Hope this helps!

vvladymyrov · 5 years ago

So is it like https://github.com/uber/cadence, AWS SWF, Amazon Step Functions?

acjohnson55 · 5 years ago

Yep. This is the dream, IMO. Serverless functions that can await for arbitrary spans of time. It's incredible to me how much effort we spend working around this limitation. Cadence/Temporal are pretty close to this ideal, but I need to look into what Conductor has going on.

bogomipz · 5 years ago

Interesting. So what holds the state in a workflow orchestrator? In your example what if user never clicks? Or if a server is rebooted? Is the state persisted to disk and then garbage collected after after a certain threshold?

stingraycharles · 5 years ago

I’ve seen them a lot, but two of the more canonical use cases are (complex) ETL and Continuous Delivery processes.

In both these cases, before these types of systems existed, it were useful a dozens of ad-hoc scripts that handled it. Monitoring, testing, instrumentation and visibility were usually after-thoughts and often not worth the trouble.

Workflow Orchestration Engines solve these problems. Each of these workflow engines have a slightly different angle, slightly better at different use cases, and as usual it can be quite important to select the right tool for the job.

But if you have found one, it can be a real boost to the quality of your processes.

ramon · 5 years ago

I prefer Power Automate / Logic Apps interface, it would be cool if there was a Power Automate open source imagine the number of plugins for that cloud tool that would come up? It's a valuable tool and part of the O365 ecosystem and could be even greater, more strategy and vision into that product would make O365 and Azure a leader in components and integrations this is the most valuable thing in the end of it all.

catmanjan · 5 years ago

There is, it's called Apache nifi

It's ripe for someone to make it saas

ramon · 5 years ago

Dude awesome tip, just saw it now thank you! Great project! Watching the videos now... this is ground breaking stuff and little marketing for now on it... impressive I'm impressed with Nifi.

zok3102 · 5 years ago

Checkout flogo.io - specifically Flogo Flow action https://github.com/project-flogo/flow for orchestration. You can use a Web UI, Golang SDK or a hand-edit the JSON DSL to build your orchestration flows. Deploy as binaries/containers/functions or on a cloud service like TIBCO Cloud Integration.

Disclaimer: I work at TIBCO

ramon · 5 years ago

Wow.. the product looks good and it's totally open I can choose between UI or coding.. I saw I can code in golang it would be cool be to be able to code in NodeJS as well thinking about AWS Lambda.. agregate different providers there like the Azure Functions and Google Functions.. all within your own Kubernetes. this is a very nice concept and seems like a great architecture. Loved the tip thanks!