Posted by u/abelanger a year ago
Show HN: Hatchet – Open-source distributed task queue (github.com/hatchet-dev/ha...)
Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves for 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, and data loss can occur under OOM conditions if you're not careful; using PG helps avoid an entire class of problems.
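To illustrate, here's a rough sketch of the transactional-enqueue + SKIP LOCKED pattern in plain Postgres via psycopg2 (illustrative table and column names, not our actual schema):

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=app")

    def enqueue(cur, name, payload):
        # Runs inside the caller's transaction, so the task only becomes
        # visible if the surrounding business writes also commit.
        cur.execute(
            "INSERT INTO tasks (name, payload, status) VALUES (%s, %s, 'queued')",
            (name, json.dumps(payload)),
        )

    def dequeue_one():
        # SKIP LOCKED lets many workers poll concurrently without
        # blocking on each other's in-flight rows.
        with conn, conn.cursor() as cur:
            cur.execute("""
                UPDATE tasks SET status = 'running'
                WHERE id = (
                    SELECT id FROM tasks
                    WHERE status = 'queued'
                    ORDER BY id
                    FOR UPDATE SKIP LOCKED
                    LIMIT 1
                )
                RETURNING id, name, payload
            """)
            return cur.fetchone()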

We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

kcorbitt · a year ago
I love your vision and am excited to see the execution! I've been looking for exactly this product (postgres-backed task queue with workers in multiple languages and decent built-in observability) for like... 3 years. Every 6 months I'll check in and see if someone has built it yet, evaluate the alternatives, and come away disappointed.

One important feature request that probably would block our adoption: one reason why I prefer a postgres-backed queue over eg. Redis is just to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.

(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)

ako · a year ago
Funny how this is vision now. I started my career 29 years ago at a company that built exactly this, but based on Oracle. The agents would run on Solaris, AIX, VAX/VMS, HP-UX, Windows NT, IRIX, etc. It was also used to create an automated CI/CD pipeline to build all binaries on all of these different systems.
sixdimensional · a year ago
Because people don’t know what they don’t know, and learning from others (along with human knowledge sharing and transfer) doesn’t seem to be what society often prioritizes in general.

I'm not so much talking about the original post - I think it’s awesome what they are building, and clearly they have learned by observing other things.

throwawaymaths · a year ago
This has also basically existed for years as an open-source, drop-in library in Elixir called Oban (no sidecar dependencies outside of Postgres; the pro version has a web dashboard and a complex task zoo).
abelanger · a year ago
Thank you, appreciate the kind words! What boxes are you looking to check?

Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: https://news.ycombinator.com/item?id=39643940.

It would take some work to replace this with listen/notify in Postgres, less work to replace this with an in-memory component, but we can't provide the same guarantees in that case.

jaggederest · a year ago
I come to this only as an interested observer, but my experience with listen/notify is that it outperforms rabbitmq/kafka in small to medium operations and has always pleasantly surprised me. You might find out it's a little easier than you think to slim your dependency stack down.
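For reference, the basic listen/notify loop is only a few lines with psycopg2 (a generic sketch, nothing Hatchet-specific; channel name is made up):

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=app")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    with conn.cursor() as cur:
        cur.execute("LISTEN task_events;")

    while True:
        # Wait for the socket to become readable, then drain notifications.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            print("event on", note.channel, "payload:", note.payload)

    # A publisher elsewhere just runs:
    #   SELECT pg_notify('task_events', '{"task_id": 123}');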
kcorbitt · a year ago
Boxes-wise, I'd like a management interface at least as good as the one Sidekiq had in Rails for years. Would also need some hard numbers around performance and probably a bit more battle-testing before using this in our current product.
bevekspldnw · a year ago
You can do a fair amount of this with Postgres using locks out of the box. It’s not super intuitive but I’ve been using just Postgres and locks in production for many years for large task distribution across independent nodes.
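Roughly, the pattern looks like this (a simplified sketch with made-up names, not my production code):

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True

    def try_claim(task_id: int) -> bool:
        # Non-blocking: returns True if this session now holds the lock.
        # Advisory locks are tied to the session, so a crashed worker
        # releases its claims automatically when the connection drops.
        with conn.cursor() as cur:
            cur.execute("SELECT pg_try_advisory_lock(%s)", (task_id,))
            return cur.fetchone()[0]

    def release(task_id: int) -> None:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_advisory_unlock(%s)", (task_id,))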
magic_hamster · a year ago
For what it's worth, RabbitMQ is extremely low maintenance, fire and forget. In the multiple years we've used it in production I can't remember a single time we had an issue with rabbit or that we needed to do anything after the initial set up.

BenjieGillam · a year ago
Not sure if you saw it but Graphile Worker supports jobs written in arbitrary languages so long as your OS can execute them: https://worker.graphile.org/docs/tasks#loading-executable-fi...

Would be interested to know what features you feel it’s lacking.

kcorbitt · a year ago
That's interesting! Would that still involve each worker node needing to have Node.js installed to run the process that actually reads from the queue? That's doable, but makes the deployment story a little more annoying/complicated if I want a worker that just runs Python or Rust or something.

Feature-wise, the biggest missing pieces from Graphile Worker for me are (1) a robust management web ui and (2) really strong documentation.

simplyinfinity · a year ago
Hope I'm not misunderstanding, but have you checked out Gearman? While I haven't used it personally, I've used a similar thing in C#, namely Hangfire.
rubenfiszel · a year ago
Windmill is built exactly like that - which boxes are left unchecked for it, if you've had time to review it?
yencabulator · a year ago
Note that Hatchet is MIT-licensed and Windmill is AGPL-3.0 - that's enough of a reason for many.
doctorpangloss · a year ago
Why does the RabbitMQ dependency matter?

It was pretty painless for me to set up and write tests against. The operator works well and is really simple if you want to save money.

I mean, isn’t Hatchet another dependency? Graphile Worker? I like all these things, but why draw the line at one thing over another based on what is essentially aesthetics?

You better start believing in dependencies if you’re a programmer.

eska · a year ago
Introducing another piece of software, instead of using one you already run anyway, introduces new failure modes. That’s hardly aesthetics.

As a professional I’m allergic to statements like “you better start believing in X”. How can you even have objective discourse at work like that?

blandflakes · a year ago
And you better start critically assessing dependencies if you're a programmer. They aren't free; this is a wild take.
otabdeveloper4 · a year ago
> You better start believing in dependencies if you’re a programmer.

Yeah, faith will be your last resort when the resulting Tower of Babel fails in modes hitherto unknown to man.

jerrygenser · a year ago
Something I really like about some pub/sub systems is Push subscriptions. For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint where events are pushed to.

The nice thing about this is that you can use a runtime like cloud run or lambda and allow that runtime to scale based on http requests and also scale to zero.

Setting up autoscaling for workers can be a little bit more finicky - e.g. in Kubernetes you might set up KEDA autoscaling based on queue-depth metrics, but these might need to be exported from RabbitMQ.

I suppose you could have a setup where your daemon worker is making HTTP requests, and in that sense "pushes" to the place where jobs are actually running, but this adds another level of complexity.

Is there any plan to support a push model, where jobs are pushed over HTTP to daemons that hold the HTTP connections open?
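For example, a push-style worker is just an HTTP handler - a minimal sketch (endpoint path and payload shape are made up):

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/tasks/handle", methods=["POST"])   # hypothetical push target
    def handle_task():
        job = request.get_json()
        do_work(job)        # your actual task logic
        return "", 204      # 2xx acks the task; a 5xx tells the queue to retry

    def do_work(job):
        print("processing", job)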

abelanger · a year ago
I like that idea, basically the first HTTP request ensures the worker gets spun up on a lambda, and the task gets picked up on the next poll when the worker is running. We already have the underlying push model for our streaming feature: https://docs.hatchet.run/home/features/streaming. Can configure this to post to an HTTP endpoint pretty easily.

The daemon feels fragile to me, why not just shut down the worker client-side after some period of inactivity?

jerrygenser · a year ago
I think it depends on the http runtime. One of the things with cloud run is that if the server is not handling requests, it doesn't get CPU time. So even if the first request is "wake up", it wouldn't get any CPU to poll outside of the request-response cycle.

You can configure Cloud Run to always allocate CPU, but it's a lot more expensive. I don't think it would be a good autoscaling story, since autoscaling is based on HTTP requests being processed. (Maybe it can be done via CPU, but that may not be what you want - it may not even be CPU-bound.)

jsmeaton · a year ago
https://cloud.google.com/tasks is such a good model and I really want an open source version of it (or to finally bite the bullet and write my own).

Having http targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren’t tied to whatever backend the task system supports.

Set up a separate scaling group and away you go.
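For reference, enqueueing an HTTP-target task with the Cloud Tasks Python client looks roughly like this (from memory, so treat it as a sketch; project, queue, and URL values are placeholders):

    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "my-queue")

    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://example.com/tasks/handle",   # your app's endpoint
            "headers": {"Content-Type": "application/json"},
            "body": b'{"user_id": 42}',
        }
    }
    # Cloud Tasks pushes the payload to the URL and retries on non-2xx.
    client.create_task(request={"parent": parent, "task": task})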

jamescmartinez · a year ago
Mergent (YC S21 - https://mergent.co) might be precisely what you're looking for in terms of a push-over-HTTP model for background jobs and crons.

You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.

Happy to answer any questions here or over email james@mergent.co

tonyhb · a year ago
You might want to look at https://www.inngest.com for that. Disclaimer: I'm a cofounder. We released event-driven step functions about 20 months ago.
jerrygenser · a year ago
Looks cool, but it looks like it's TypeScript-only. If there is a JSON payload, couldn't any web server handle it?
yencabulator · a year ago
> For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint where events are pushed to.

That just means that there's a lightweight worker that does the HTTP POST to your "subscriber". With retries etc, just like it's done here.

sixdimensional · a year ago
There are some tools, like Apache NiFi, which call this pattern an HTTP listener. It’s also basically a kind of sink, and it also sort of resembles a webhook architecture.
lysecret · a year ago
Yep we are using cloud tasks and pub sub a lot. Another big benefit is that the GCP infra is literally “pushing” your messages even if your infra goes down.
alexbouchard · a year ago
The push queue model has major benefits, as you mentioned. We've built Hookdeck (hookdeck.com) on that premise. I hope we see more projects adopt it.
leetrout · a year ago
Just pointing out even though this is a "Show HN" they are, indeed, backed by YC.

Is this going to follow the "open core" pattern or will there be a different path to revenue?

abelanger · a year ago
Yep, we're backed by YC in the W24 batch - this is evident on our landing page [1].

We're both second time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open core model:

1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance, upgrades, and since we sit on a critical path, with uptime.

2. Hyperscalers folding your product in to their offering [3]. This is a bigger risk but is also a bit of a "champagne problem".

Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.

Based on all of this, here's where we land on things:

1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self hosting.

2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.

3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.

Would love to hear everyone's thoughts on this.

[1] https://hatchet.run

[2] https://github.com/porter-dev/porter

[3] https://www.elastic.co/blog/why-license-change-aws

echelon · a year ago
I got flagged, but I want to reiterate that you need legal means of stopping AWS from simply lifting your product wholesale. Just look at all the other companies they've turned into their own thankless premium offerings.

Put in a DAU/MAU/volume/revenue clause that pertains specifically only to hyperscalers and resellers. Don't listen to the naysayers telling you not to do it. This isn't their company or their future. They don't care if you lose your business or that you put in all of that work just for a tech giant to absorb it for free and turn it against you.

Just do it. Do it now and you won't get (astroturfed?) flak for that decision later from people who don't even have skin in the game. It's not a big deal. I would buy open core products with these protections -- it's not me you're protecting yourselves against, and I'm nowhere in the blast radius. You're trying not to die in the miasma of monolithic cloud vendors.

MuffinFlavored · a year ago
> path to revenue

There have to be at least 10 different ways across the various cloud providers to run a distributed task queue: Amazon, Azure, GCP.

Self-hosting RabbitMQ, etc.

I'm curious how they are able to convince investors that there is a sizable portion of the market that doesn't already have this solved (or has it solved and is willing to migrate).

Kinrany · a year ago
There will be space for improvement until every cloud has a managed offering with exactly the same interface. Like Docker, Postgres, S3.
leetrout · a year ago
I am curious to see how they differentiate themselves on observability in the longer run.

Compared to RabbitMQ, it should be easier to see what is in the queue itself without mutating it, for instance.

Aeolun · a year ago
> I'm curious how they are able to convince investors that there is a sizable portion of market they think doesn't already have this solved

Is there any task queue you are completely happy with?

I use Redis, but it’s only half of the solution.

wodenokoto · a year ago
Wasn’t the first Dropbox introduction also a Show HN?

I don’t think this is out of place.

leetrout · a year ago
I am not saying it is out of place, but for such a long-winded explanation of what they are doing, I feel the missing "YC W24" was surprising.
bluehadoop · a year ago
How does this compare against Temporal/Cadence/Conductor? Does Hatchet also support durable execution?

https://temporal.io/

https://cadenceworkflow.io/

https://conductor-oss.org/

abelanger · a year ago
It's very similar - I used Temporal at a previous company to run a couple million workflows per month. The gRPC networking with workers is the most similar component; I especially liked that I only had to worry about an HTTP/2 connection with mTLS instead of a different broker protocol.

Temporal is a powerful system, but we were getting to the point where it took a full-time engineer to build an observability layer around Temporal. Integrating workflows in an intuitive way with OpenTelemetry and logging was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.

We have a section on the docs page for durable execution [1]; also see the comment on HN [2]. Like I mention in that comment, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow - users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This is also something that requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure about compatibility with the way most people write their functions.

[1] https://docs.hatchet.run/home/features/durable-execution

[2] https://news.ycombinator.com/item?id=39643881

[3] https://github.com/temporalio/sdk-python

bicijay · a year ago
Well, you just got a user. I love the concept of Temporal, but I can't justify the infrastructure overhead needed to make it work to the higher-ups... and the cloud offering is a bit expensive for small companies.
dangoodmanUT · a year ago
> we were getting to the point where it took a full-time engineer to build an observability layer around Temporal

We did it in like 5 minutes by adding in OTel traces? And maybe another 15 to add their Grafana dashboard?

What obstacles did you experience here?

Kinrany · a year ago
With NATS in the stack, what's the advantage over using NATS directly?
abelanger · a year ago
I'm assuming specifically you mean Nex functions? Otherwise NATS gives you connectivity and a message queue - it doesn't (or didn't) have the concept of task executions or workflows.

With regards to Nex -- it isn't fully stable and only supports JavaScript/WebAssembly. It's also extremely new, so I'd be curious to see how things stabilize in the coming year.

bruth · a year ago
(Disclaimer: I am a NATS maintainer and work for Synadia)

The parent comment may have been referring to the fact that NATS has support for durable (and replicated) work queue streams, so those could be used directly for queuing tasks and having a set of workers dequeue concurrently. And this is regardless of whether you want to use Nex or not. Nex is indeed fairly new, but the team is iterating on it quickly and we are dogfooding it internally to keep stabilizing it.

Another benefit of NATS is the built-in multi-tenancy, which allows distinct applications/teams/contexts to have an isolated set of streams and messaging. It acts as a secure namespace.

NATS supports clustering within a region or across regions. For example, Synadia hosts a supercluster in many different regions across the globe and across the three major cloud providers. As it applies to distributed work queues, you can place work queue streams in a cluster within a region/provider closest to the users/apps enqueuing the work, and then deploy workers in the same region for optimizing latency of dequeuing and processing.

Could be worth a deeper look on how much you could leverage for this use case.

Kinrany · a year ago
I wasn't thinking of Nex - I didn't realize Hatchet includes compute and doesn't just store tasks.

Still, it seems like NATS + any lambda implementation + a dumb service that wakes lambdas when they need to process something would be simple to set up, and in combination would do the same thing.

rapnie · a year ago
I recently found Nex in the context of wasmCloud [0] and its ability to support long-running tasks/workflows. My impression is that Nex indeed still needs a good while to mature. There was also a talk [1] about using Temporal here. For Hatchet it may be interesting to check out (note: I am not affiliated with wasmCloud, nor currently using it).

[0] https://wasmcloud.com

[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...

toddmorey · a year ago
I need task queues where the client (web browser) can listen to the progress of the task through completion.

I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.

Wondering if perhaps the Postgres underpinnings here would make that possible.

EDIT: seems so! https://docs.hatchet.run/home/features/streaming

abelanger · a year ago
Yep, exactly - Gabe has also been thinking about providing per-user signed URLs to task executions so clients can subscribe more easily without a long-lived token. So basically, you would start the workflow from your API, and pass back the signed URL to the client, where we would then provide a React hook to get task updates automatically. We need this ourselves once we open our cloud instance up to self-serve, since we want to provision separate queues per user, with a Hatchet workflow of course.
toddmorey · a year ago
Awesome to hear!
rad_gruchalski · a year ago
If you need to listen for the progress only, try server-sent events, maybe?: https://en.wikipedia.org/wiki/Server-sent_events

It's dead simple: the existence of the URI means the topic/channel/what-have-you exists; to access it, one needs to know the URI; data is streamed but there's no access to old data; multiple consumers are no problem.
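A minimal SSE endpoint looks something like this (a generic Flask sketch; the progress values are stand-ins for real task state):

    import json
    import time
    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/tasks/<task_id>/events")
    def task_events(task_id):
        def stream():
            # Each "data:" line is delivered to the browser's EventSource
            # as soon as it's yielded.
            for pct in (10, 50, 100):   # stand-in for real task progress
                yield "data: " + json.dumps({"task": task_id, "progress": pct}) + "\n\n"
                time.sleep(1)
        return Response(stream(), mimetype="text/event-stream")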

sroussey · a year ago
Ah nice! I am writing a job queue this weekend for a DAG-based task runner, so the timing is great. I will have a look. I don't need anything too big, but I have written backends using PostgreSQL (FOR UPDATE SKIP LOCKED for the win), SQLite, and in-memory storage, depending on what I want to use it for.

I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.

Interested to look at your particular approach.

SCUSKU · a year ago
Looks pretty great! My biggest issue with Celery has been that the observability is pretty bad. Even if you use Celery Flower, it still just doesn’t give me enough insight when I’m trying to debug some problem in production.

I’m all for just using Postgres in service of the grug brain philosophy.

Will definitely be looking into this, congrats on the launch!

abelanger · a year ago
Appreciate it, thank you! We've spent quite a bit of time in the Celery Flower console. Admittedly it's been a while - I'm not sure if they've added views for chains/groups/etc.; it was just a linear task view when I used it.

A nice thing in Celery Flower is viewing the `args, kwargs`, whereas Hatchet operates on JSON request/response bodies, so some early users have mentioned that it's hard to get visibility into the exact typing/serialization that's happening. Something for us to work on.

9dev · a year ago
In case you’re stuck with Celery for a while: I was hit with this same problem, and solved it by adding a sidecar HTTP server thread to the Python workers that would expose metrics written by the workers into a multithreaded registry. This has been working amazingly well in production for over two years now, and makes it really straightforward to get custom metrics out of a distributed Celery app.
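Roughly, the shape of it (a from-memory sketch, not our exact production code), using prometheus_client and Celery signals:

    from celery.signals import task_postrun, task_failure, worker_ready
    from prometheus_client import Counter, start_http_server

    TASKS_COMPLETED = Counter("celery_tasks_completed_total", "Completed tasks", ["task"])
    TASKS_FAILED = Counter("celery_tasks_failed_total", "Failed tasks", ["task"])

    @worker_ready.connect
    def start_metrics_server(**kwargs):
        # Serves /metrics on :9100 from a background thread inside the worker.
        start_http_server(9100)

    @task_postrun.connect
    def record_completion(sender=None, **kwargs):
        TASKS_COMPLETED.labels(task=sender.name).inc()

    @task_failure.connect
    def record_failure(sender=None, **kwargs):
        TASKS_FAILED.labels(task=sender.name).inc()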
kamikaz1k · a year ago
Any chance you could share more specifics about your solution?