Why build another managed queue? We wanted something with the benefits of fully transactional enqueueing - particularly for dependent, DAG-style execution - and we feel strongly that Postgres handles 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur under out-of-memory conditions if you're not careful; using PG avoids an entire class of problems.
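(For anyone who hasn't seen the pattern, a minimal sketch of a SKIP LOCKED dequeue is below - the `tasks` table and column names are hypothetical, not Hatchet's actual schema.)

```python
# Minimal sketch of a SKIP LOCKED dequeue (hypothetical `tasks` table, not
# Hatchet's actual schema). Competing workers each lock a different row, so
# no task is handed out twice, and the claim itself is transactional.
import psycopg2

conn = psycopg2.connect("dbname=queue")

def claim_task():
    with conn:  # commit/rollback the claim atomically
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE tasks
                   SET status = 'running', started_at = now()
                 WHERE id = (
                       SELECT id FROM tasks
                        WHERE status = 'queued'
                        ORDER BY created_at
                        FOR UPDATE SKIP LOCKED
                        LIMIT 1
                 )
                RETURNING id, payload;
                """
            )
            return cur.fetchone()  # None if the queue is empty
```

If the worker dies before the transaction commits, the row lock is released and the task stays 'queued', which is what makes the claim safe under failure.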
We also wanted something that is significantly easier to use and debug for application developers. Too often the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for application tasks based on exported Prometheus metrics). We're building this type of observability directly into Hatchet.
What do we mean by "distributed"? You can run workers (the instances that run tasks) across multiple VMs, clusters, and regions - they are invoked remotely over a long-lived gRPC connection with the Hatchet queue. We've optimized latency to get task start times down to 25-50ms, and much more optimization is on the roadmap.
We also support a number of extra features you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but we didn't want to spend additional time on the exchange logic until we'd built a stable underlying queue. We're also considering NATS for engine-engine and engine-worker connections.
We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.
One important feature request that would probably block our adoption: one reason I prefer a Postgres-backed queue over e.g. Redis is simply to simplify our infra by having fewer servers and technologies in the stack. Adding RabbitMQ is definitely an extra dependency I'd really like to avoid.
(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)
This isn't so much about the original post - I think what they're building is awesome, and clearly they have learned from observing other systems.
Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: https://news.ycombinator.com/item?id=39643940.
It would take some work to replace this with listen/notify in Postgres, and less work to replace it with an in-memory component, but we can't provide the same guarantees in that case.
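For anyone curious, the listen/notify primitive in question looks roughly like this (a generic asyncpg sketch, not our engine code). The catch is that NOTIFY is fire-and-forget - a listener that's disconnected misses the message - which is where the weaker guarantees come from:

```python
# Generic sketch of Postgres LISTEN/NOTIFY pub/sub with asyncpg. Connection
# strings and channel names are made up. Notifications are fire-and-forget:
# a listener that is down when NOTIFY fires never sees the message.
import asyncio
import asyncpg

async def main():
    listener = await asyncpg.connect("postgresql://localhost/queue")

    def on_event(conn, pid, channel, payload):
        print(f"got event on {channel}: {payload}")

    await listener.add_listener("task_events", on_event)

    publisher = await asyncpg.connect("postgresql://localhost/queue")
    await publisher.execute("SELECT pg_notify('task_events', $1)", "task-123 completed")

    await asyncio.sleep(1)  # give the notification time to arrive
    await listener.close()
    await publisher.close()

asyncio.run(main())
```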
https://renegadeotter.com/2023/11/30/job-queues-with-postrgr...
Would be interested to know what features you feel it’s lacking.
Feature-wise, the biggest missing pieces from Graphile Worker for me are (1) a robust management web ui and (2) really strong documentation.
It was pretty painless for me to set up and write tests against. The operator works well and is really simple if you want to save money.
I mean, isn't Hatchet another dependency? Graphile Worker too? I like all these things, but why draw the line at one over another based on what's essentially aesthetics?
You better start believing in dependencies if you’re a programmer.
As a professional I’m allergic to statements like “you better start believing in X”. How can you even have objective discourse at work like that?
Yeah, faith will be your last resort when the resulting tower of Babel fails in modes hitherto unknown to man.
The nice thing about this is that you can use a runtime like Cloud Run or Lambda and allow that runtime to scale based on HTTP requests and also scale to zero.
Setting up autoscaling for workers can be a little more finicky; e.g. in Kubernetes you might set up KEDA autoscaling based on queue depth metrics, but these might need to be exported from RabbitMQ (see the sketch below).
I suppose you could have a setup where your daemon worker is making HTTP requests and in that sense "push" to the place where jobs actually run, but this adds another level of complexity.
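On the KEDA point above, the export doesn't have to be much - a rough sketch of the kind of sidecar you'd have Prometheus/KEDA scrape (queue name, port, and poll interval are all made up):

```python
# Rough sketch of exporting RabbitMQ queue depth as a Prometheus gauge that
# KEDA (or any autoscaler) can act on. Queue name, port, and poll interval
# are made up for illustration.
import time
import pika
from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge("task_queue_depth", "Number of messages waiting in the queue")

def main():
    start_http_server(9100)  # exposes /metrics for Prometheus/KEDA to scrape
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    while True:
        # passive declare returns the current message count without side effects
        ok = channel.queue_declare(queue="tasks", passive=True)
        QUEUE_DEPTH.set(ok.method.message_count)
        time.sleep(5)

if __name__ == "__main__":
    main()
```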
Is there any plan to support a push model, where you can push jobs over HTTP to daemons that hold the HTTP connections open?
The daemon feels fragile to me - why not just shut down the worker client-side after some period of inactivity?
You can configure Cloud Run to always allocate CPU, but it's a lot more expensive. I don't think it would be a good autoscaling story either, since autoscaling is based on HTTP requests being processed (maybe it can be done via CPU, but that may not be what you want - the workload may not even be CPU-bound).
Having HTTP targets means you get the rate limiting, middleware, and observability your regular application already uses, and you aren't tied to whatever backend the task system supports.
Set up a separate scaling group and away you go.
You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.
Happy to answer any questions here or over email james@mergent.co
That just means there's a lightweight worker that does the HTTP POST to your "subscriber", with retries etc., just like it's done here.
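Concretely, the contract on the receiving side is just an HTTP handler along these lines (hypothetical endpoint and payload shape, not any particular vendor's API):

```python
# Hypothetical receiving end of a push-based queue: the queue service POSTs
# the job payload, treats a 2xx response as success, and retries on anything
# else. Endpoint path and payload shape are made up for illustration.
from fastapi import FastAPI, Response

app = FastAPI()

def do_work(payload: dict) -> None:
    ...  # your actual task logic

@app.post("/tasks/handle")
async def handle_task(payload: dict, response: Response):
    try:
        do_work(payload)
        return {"status": "ok"}         # 200 -> the queue marks the task done
    except Exception:
        response.status_code = 500      # non-2xx -> the queue schedules a retry
        return {"status": "error"}
```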
Is this going to follow the "open core" pattern or will there be a different path to revenue?
We're both second time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open core model:
1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance, upgrades, and (since we sit on a critical path) uptime.
2. Hyperscalers folding your product into their offering [3]. This is a bigger risk, but it's also a bit of a "champagne problem".
Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.
Based on all of this, here's where we land on things:
1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self-hosting.
2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.
3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.
Would love to hear everyone's thoughts on this.
[1] https://hatchet.run
[2] https://github.com/porter-dev/porter
[3] https://www.elastic.co/blog/why-license-change-aws
Put in a DAU/MAU/volume/revenue clause that applies specifically to hyperscalers and resellers. Don't listen to the naysayers telling you not to do it. This isn't their company or their future. They don't care if you lose your business, or that you put in all of that work just for a tech giant to absorb it for free and turn it against you.
Just do it. Do it now and you won't get (astroturfed?) flak for that decision later from people who don't even have skin in the game. It's not a big deal. I would buy open core products with these protections - it's not me you're protecting yourselves against, and I'm nowhere in the blast radius. You're trying not to die in the miasma of monolithic cloud vendors.
There have to be at least 10 different ways to run a distributed task queue across the cloud providers - Amazon, Azure, GCP - plus self-hosting RabbitMQ, etc.
I'm curious how they were able to convince investors that a sizable portion of the market doesn't already have this solved (or has it solved and is willing to migrate).
Compared to RabbitMQ, it should be easier to see what is in the queue itself without mutating it, for instance.
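e.g. with a Postgres-backed queue, peeking is just a read-only query (table and column names hypothetical):

```python
# Peeking at queued work without consuming it - a plain read-only query.
# Table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=queue")
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, payload, created_at FROM tasks "
        "WHERE status = 'queued' ORDER BY created_at LIMIT 20"
    )
    for task_id, payload, created_at in cur.fetchall():
        print(task_id, created_at, payload)
```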
Is there any task queue you are completely happy with?
I use Redis, but it’s only half of the solution.
I don’t think this is out of place
https://temporal.io/
https://cadenceworkflow.io/
https://conductor-oss.org/
Temporal is a powerful system, but we were getting to the point where it took a full-time engineer to build an observability layer around it. Integrating workflows with OpenTelemetry and logging in an intuitive way was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.
We have a section in the docs on durable execution [1]; also see this comment on HN [2]. As I mention there, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow; for now, users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This also requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure it's compatible with the way most people write their functions.
[1] https://docs.hatchet.run/home/features/durable-execution
[2] https://news.ycombinator.com/item?id=39643881
[3] https://github.com/temporalio/sdk-python
We did it in like 5 minutes by adding in OTel traces? And maybe another 15 to add their Grafana dashboard?
What obstacles did you experience here?
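(For reference, the five-minute version is more or less the standard OTel SDK setup plus a span around the task code - something like this, with a made-up OTLP endpoint.)

```python
# Roughly what "adding in OTel traces" amounts to: standard OpenTelemetry SDK
# setup plus a span around each unit of work. The OTLP endpoint is made up.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("worker")

def run_task(payload: dict) -> None:
    with tracer.start_as_current_span("run_task") as span:
        span.set_attribute("task.payload_size", len(payload))
        ...  # actual work
```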
With regard to Nex: it isn't fully stable and only supports JavaScript/WebAssembly. It's also extremely new, so I'd be curious to see how things stabilize in the coming year.
The parent comment may have been referring to the fact that NATS supports durable (and replicated) work-queue streams, so those could be used directly for queuing tasks and having a set of workers dequeue concurrently - regardless of whether you use Nex or not. Nex is indeed fairly new, but the team is iterating on it quickly and we are dogfooding it internally to keep stabilizing it.
The other benefit of NATS is the built-in multi-tenancy, which allows distinct applications/teams/contexts to have an isolated set of streams and messaging. It acts as a secure namespace.
NATS supports clustering within a region or across regions. For example, Synadia hosts a supercluster spanning many regions across the globe and the three major cloud providers. As it applies to distributed work queues, you can place work-queue streams in a cluster within the region/provider closest to the users/apps enqueuing the work, and then deploy workers in the same region to minimize the latency of dequeuing and processing.
Could be worth a deeper look at how much you could leverage for this use case.
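A minimal nats-py sketch of the pattern described above - a durable work-queue stream with a shared pull consumer, so each task is delivered to exactly one worker (stream and subject names are made up):

```python
# Minimal nats-py sketch: a durable work-queue stream plus a shared pull
# consumer, so each task is delivered to exactly one worker. Stream and
# subject names are made up.
import asyncio
import nats
from nats.js.api import RetentionPolicy

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    await js.add_stream(
        name="TASKS",
        subjects=["tasks.>"],
        retention=RetentionPolicy.WORK_QUEUE,  # messages are removed once acked
    )

    # Producer side: enqueue a task
    await js.publish("tasks.images.resize", b'{"image_id": 123}')

    # Worker side: many workers can share the same durable consumer
    sub = await js.pull_subscribe("tasks.>", durable="workers")
    for msg in await sub.fetch(batch=1, timeout=5):
        print("processing", msg.data)
        await msg.ack()

    await nc.close()

asyncio.run(main())
```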
Still, it seems like NATS + any lambda implementation + a dumb service that wakes lambdas when they need to process something would be simple to set up, and in combination would do the same thing.
I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.
Wondering if perhaps the Postgres underpinnings here would make that possible.
EDIT: seems so! https://docs.hatchet.run/home/features/streaming
It's dead simple: the existence of the URI means the topic/channel/what-have-you exists; to access it you need to know the URI; data is streamed but there's no access to old data; and multiple consumers are no problem.
I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.
Interested to look at your particular approach.
I’m all for just using Postgres in service of the grug brain philosophy.
Will definitely be looking into this, congrats on the launch!
A nice thing in Celery Flower is viewing the `args, kwargs`, whereas Hatchet operates on JSON request/response bodies, so some early users have mentioned that it's hard to get visibility into the exact typing/serialization that's happening. Something for us to work on.
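In the meantime, one pattern that helps is validating the JSON body at the edge of the task so the typing is explicit in your own code - a rough sketch with Pydantic (the model shape is hypothetical):

```python
# One way to get explicit typing back when a queue hands you raw JSON bodies:
# validate the payload at the edge of the task. The model shape is hypothetical.
from pydantic import BaseModel

class ResizeInput(BaseModel):
    image_id: int
    width: int
    height: int

def handle_task(body: dict) -> dict:
    task_input = ResizeInput.model_validate(body)  # raises if the JSON doesn't match
    # ... do the work with well-typed fields ...
    return {"image_id": task_input.image_id, "status": "resized"}
```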