Posted by u/abelanger 5 months ago
Show HN: Hatchet v1 – A task orchestration platform built on Postgres (github.com/hatchet-dev/ha...)
Hey HN - this is Alexander from Hatchet. We’re building an open-source platform for managing background tasks, using Postgres as the underlying database.

Just over a year ago, we launched Hatchet as a distributed task queue built on top of Postgres with a 100% MIT license (https://news.ycombinator.com/item?id=39643136). The feedback and response we got from the HN community was overwhelming. In the first month after launching, we processed about 20k tasks on the platform — today, we’re processing over 20k tasks per minute (>1 billion per month).

Scaling up this quickly was difficult — every task in Hatchet corresponds to at minimum 5 Postgres transactions and we would see bursts on Hatchet Cloud instances to over 5k tasks/second, which corresponds to roughly 25k transactions/second. As it turns out, a simple Postgres queue utilizing FOR UPDATE SKIP LOCKED doesn’t cut it at this scale. After provisioning the largest instance type that CloudSQL offers, we even discussed potentially moving some load off of Postgres in favor of something trendy like Clickhouse + Kafka.
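For readers who haven't seen the pattern, here is a minimal sketch of the kind of SKIP LOCKED dequeue being described, assuming a hypothetical `tasks` table and psycopg; it is not Hatchet's schema or its actual queries:

```python
# Minimal FOR UPDATE SKIP LOCKED dequeue -- a sketch of the generic pattern
# discussed above, NOT Hatchet's actual schema or queries.
import psycopg

DEQUEUE_SQL = """
WITH next_task AS (
    SELECT id
    FROM tasks                      -- hypothetical table
    WHERE status = 'queued'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED          -- skip rows already claimed by other workers
)
UPDATE tasks
SET status = 'running', started_at = now()
FROM next_task
WHERE tasks.id = next_task.id
RETURNING tasks.id, tasks.payload;
"""

def dequeue_one(conn: psycopg.Connection):
    """Claim a single queued task, or return None if none are available."""
    with conn.transaction():
        return conn.execute(DEQUEUE_SQL).fetchone()
```

Each worker polls this in a loop; under enough workers and a deep backlog, that polling itself becomes the bottleneck described above.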

But we doubled down on Postgres, and spent about 6 months learning how to operate Postgres databases at scale and reading the Postgres manual and several other resources [0] during commutes and at night. We stuck with Postgres for two reasons:

1. We wanted to make Hatchet as portable and easy to administer as possible, and felt that implementing our own storage engine specifically on Hatchet Cloud would be disingenuous at best, and in the worst case, would take our focus away from the open source community.

2. More importantly, Postgres is general-purpose, which is what makes it both great but hard to scale for some types of workloads. This is also what allows us to offer a general-purpose orchestration platform — we heavily utilize Postgres features like transactions, SKIP LOCKED, recursive queries, triggers, COPY FROM, and much more.

Which brings us to today. We’re announcing a full rewrite of the Hatchet engine — still built on Postgres — together with our task orchestration layer which is built on top of our underlying queue. To be more specific, we’re launching:

1. DAG-based workflows that support a much wider array of conditions, including sleep conditions, event-based triggering, and conditional execution based on parent output data [1].

2. Durable execution — a function’s ability to recover from failure by caching intermediate results and automatically replaying them on a retry (see the sketch after this list). We call a function with this ability a durable task. We also support durable sleep and durable events, which you can read more about here [2].

3. Queue features such as key-based concurrency queues (for implementing fair queueing), rate limiting, sticky assignment, and worker affinity.

4. Improved performance across every dimension we’ve tested, which we attribute to six improvements to the Hatchet architecture: range-based partitioning of time series tables, hash-based partitioning of task events (for updating task statuses), separating our monitoring tables from our queue, buffered reads and writes, switching all high-volume tables to use identity columns, and aggressive use of Postgres triggers.
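To make the replay idea in item 2 concrete, here is a framework-agnostic toy sketch of caching step results and replaying them on retry. It only illustrates the general technique; it is not Hatchet's implementation, and the names are made up:

```python
# Toy illustration of the replay idea behind durable execution: each step's
# result is persisted the first time it runs, so a retry of the whole function
# replays cached results instead of re-executing completed steps.
import json
from pathlib import Path

class DurableRun:
    """Caches each step's result so a retry replays instead of re-executing."""
    def __init__(self, run_id: str, store_dir: str = "./durable_state"):
        self._path = Path(store_dir) / f"{run_id}.json"
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._cache = json.loads(self._path.read_text()) if self._path.exists() else {}

    def step(self, name: str, fn, *args):
        if name in self._cache:
            return self._cache[name]                    # replayed from cache
        result = fn(*args)
        self._cache[name] = result
        self._path.write_text(json.dumps(self._cache))  # persist before continuing
        return result

def charge_card(order_id: int) -> str:
    return f"charge-{order_id}"      # stand-in for a real side effect

def create_label(order_id: int) -> str:
    return f"label-{order_id}"

def process_order(run_id: str, order_id: int):
    run = DurableRun(run_id)
    payment = run.step("charge", charge_card, order_id)
    label = run.step("ship", create_label, order_id)
    return payment, label

# A crash between "charge" and "ship" followed by a retry with the same run_id
# replays the cached charge result and resumes at the ship step.
```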

We've also removed RabbitMQ as a required dependency for self-hosting.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.

[0] https://www.postgresql.org/docs/

[1] https://docs.hatchet.run/home/conditional-workflows

[2] https://docs.hatchet.run/home/durable-execution

followben · 5 months ago
How does this compare to other pg-backed python job runners like Procrastinate [0] or Chancy [1]?

[0] https://github.com/procrastinate-org/procrastinate/

[1] https://github.com/TkTech/chancy

gabrielruttner · 5 months ago
Gabe here, one of the Hatchet founders. I'm not very familiar with these runners, so someone please correct me if I missed something.

These look like great projects for getting something running quickly, but they will likely experience many of the challenges Alexander mentioned under load. They look quite similar to our initial implementation, which used FOR UPDATE and maintained direct connections from workers to PostgreSQL instead of a central orchestrator (a separate issue that deserves its own post).

One of the reasons for this decision was to performantly support more complex scheduling requirements and durable execution patterns -- things like dynamic concurrency [0] or rate limits [1], which can be quite tricky to implement in a worker-pull model where there will likely be contention on the orchestration tables.

They also appear to be pure queues for running individual tasks, and in Python only. We've been working hard on our Python, TypeScript, and Go SDKs.

I'm excited to see how these projects approach these problems over time!

[0] https://docs.hatchet.run/home/concurrency [1] https://docs.hatchet.run/home/rate-limits

TkTech · 5 months ago
Chancy dev here.

I've intentionally chosen simple over performance when the choice is there. Chancy still happily handles millions of jobs and workflows a day with dynamic concurrency and global rate limits, even in low-resource environments. But it would never scale horizontally to the same level you could achieve with RabbitMQ, and it's not meant for massive multi-tenant cloud hosting. It's just not the project's goal.

Chancy's aim is to be the low dependency, low infrastructure option that's "good enough" for the vast majority of projects. It has 1 required package dependency (the postgres driver) and 1 required infrastructure dependency (postgres) while bundling everything inside a single ASGI-embeddable process (no need for separate processes like flower or beat). It's used in many of my self-hosted projects, and in a couple of commercial projects to add ETL workflows, rate limiting, and observability to projects that were previously on Celery. Going from Celery to Chancy is typically just replacing your `delay()/apply_async()` with `push()` and swapping `@shared_task()` with `@job()`.
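As a rough illustration of that migration path: the Celery side below uses Celery's real API, while the Chancy side is shown only in comments, because the `@job()` and `push()` names are quoted from the comment above and the import path and exact call shape are assumptions to verify against Chancy's docs:

```python
# Before: a Celery task (Celery's real API).
from celery import Celery, shared_task

app = Celery("demo", broker="redis://localhost:6379/0")  # Celery still needs a broker

@shared_task()
def resize_image(image_id: int) -> None:
    ...  # business logic is unchanged by the migration

# Enqueue with Celery:
#   resize_image.delay(42)  or  resize_image.apply_async(args=[42])

# After: per the comment above, the same function under Chancy becomes roughly
# the following. The decorator and push() names are quoted from the comment;
# the import path and exact signature are assumptions, so check Chancy's docs:
#
#   @job()
#   def resize_image(image_id: int) -> None:
#       ...
#
#   await chancy.push(resize_image, 42)
```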

If you have hundreds of employees and need to run hundreds of millions of jobs a day, it's never going to be the right choice - go with something like Hatchet. Chancy's for teams of one to dozens that need a simple option while still getting things like global rate limits and workflows.

wcrossbow · 5 months ago
Another good one is pgqueuer https://github.com/janbjorge/pgqueuer
INTPenis · 5 months ago
Celery also has a Postgres backend, but maybe it's not as well integrated.
igor47 · 5 months ago
It's just a results backend; you still have to run RabbitMQ or Redis as a broker.
diarrhea · 5 months ago
This is very exciting stuff.

I’m curious: When you say FOR UPDATE SKIP LOCKED does not scale to 25k queries/s, did you observe a threshold at which it became untenable for you?

I’m also curious about the two points of:

- buffered reads and writes

- switching all high-volume tables to use identity columns

What do you mean by these? Were those (part of) the solution to scale FOR UPDATE SKIP LOCKED up to your needs?

abelanger · 5 months ago
I'm not sure of the exact threshold, but the pathological case seemed to be (1) many tasks in the backlog, (2) many workers, (3) workers long-polling the task tables at approximately the same time. This would consistently lead to very high spikes in CPU and result in a runaway deterioration on the database, since high CPU leads to slower queries and more contention, which leads to higher connection overhead, which leads to higher CPU, and so on. There are a few threads online which documented very similar behavior, for example: https://postgrespro.com/list/thread-id/2505440.

Those other points are mostly unrelated to the core queue, and more related to helper tables for monitoring, tracking task statuses, etc. But it was important to optimize these tables because unrelated spikes on other tables in the database could start getting us into a deteriorated state as well.

To be more specific about the solutions here:

> buffered reads and writes

To run a task through the system, we need to write the task itself, write the retry instance of that task to the queue, write an event each time the task is queued, started, completed or failed, etc. Generally one task will correspond to many writes along the way, not all of which need to be extremely latency sensitive. So we started buffering items coming from our internal queues and flushing them once every 10ms, which helped considerably.
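A stripped-down sketch of that kind of write buffer (flush on a ~10ms timer or when a batch fills up), assuming psycopg and a hypothetical `task_events` table; it shows the shape of the technique rather than Hatchet's actual code:

```python
# Sketch of a flush-every-10ms write buffer for low-priority rows such as task
# events. Illustrates the buffering technique described above; `task_events`
# is a hypothetical table, not Hatchet's schema.
import queue
import threading
import psycopg

INSERT_SQL = "INSERT INTO task_events (task_id, kind, payload) VALUES (%s, %s, %s)"

class BufferedWriter:
    def __init__(self, dsn: str, flush_interval: float = 0.010, max_batch: int = 1000):
        self._conn = psycopg.connect(dsn)
        self._queue: queue.Queue = queue.Queue()
        self._flush_interval = flush_interval
        self._max_batch = max_batch
        threading.Thread(target=self._run, daemon=True).start()

    def write(self, task_id: int, kind: str, payload: str) -> None:
        """Called on the hot path; never touches Postgres directly."""
        self._queue.put((task_id, kind, payload))

    def _run(self) -> None:
        while True:
            batch = []
            try:
                # Wait up to one flush interval for the first item...
                batch.append(self._queue.get(timeout=self._flush_interval))
                # ...then drain whatever else has accumulated, up to max_batch.
                while len(batch) < self._max_batch:
                    batch.append(self._queue.get_nowait())
            except queue.Empty:
                pass
            if batch:
                # Flush the whole batch in a single transaction.
                with self._conn.transaction():
                    self._conn.cursor().executemany(INSERT_SQL, batch)
```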

> switching all high-volume tables to use identity columns

We had originally combined some of our workflow tables with our monitoring tables -- this table was called `WorkflowRun` and it was used both for concurrency queues and for serving the API. This table used a UUID as the primary key, because we wanted UUIDs over the API instead of auto-incrementing IDs. The UUIDs caused some headaches down the line when trying to delete batches of data and prevent index bloat.
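In DDL terms, the identity-column shape described here looks roughly like the following: a bigint identity primary key for internal use (ordering, batch deletes) with the UUID kept as a separately indexed, API-facing column. The table and column names are hypothetical, not Hatchet's schema:

```python
# Hypothetical DDL showing the identity-column pattern, applied via psycopg.
import psycopg

CREATE_TABLE = """
CREATE TABLE workflow_runs_example (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal: ordering, batch deletes
    external_id uuid NOT NULL DEFAULT gen_random_uuid(),          -- what the API exposes
    status      text NOT NULL DEFAULT 'queued',
    created_at  timestamptz NOT NULL DEFAULT now()
)
"""

CREATE_INDEX = "CREATE UNIQUE INDEX ON workflow_runs_example (external_id)"

def migrate(dsn: str) -> None:
    with psycopg.connect(dsn) as conn:   # commits on successful exit
        conn.execute(CREATE_TABLE)
        conn.execute(CREATE_INDEX)
```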

chaz6 · 5 months ago
Out of interest, did you try changing the value of commit_delay? This parameter allows multiple transactions to be written together under heavy load.
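For context on the knob being suggested: commit_delay is specified in microseconds and only takes effect when at least commit_siblings other transactions are in progress. A sketch of trying it out, with illustrative values rather than recommendations, and assuming superuser access:

```python
import psycopg

# autocommit is required: ALTER SYSTEM cannot run inside a transaction block.
with psycopg.connect("dbname=app user=postgres", autocommit=True) as conn:
    conn.execute("ALTER SYSTEM SET commit_delay = 1000")   # microseconds, i.e. a 1ms group-commit window
    conn.execute("ALTER SYSTEM SET commit_siblings = 5")   # default; min concurrent txns before delaying
    conn.execute("SELECT pg_reload_conf()")                # both settings are reloadable, no restart needed
```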
diarrhea · 5 months ago
Thank you! Very insightful, especially the forum link and the observation around UUIDs bloating indexes.
morsecodist · 5 months ago
This is great timing. I am in the process of designing an event/workflow driven application, and nothing I looked at felt quite right for my use case. This feels really promising. Temporal was close, but it just didn't feel like the perfect fit. I like the open source license a lot; it gives me more confidence designing an application around it. The conditionals are also great. I have been looking for something just like CEL, and despite my research I had never heard of it. It is exactly how I want my expressions implemented; I was on the verge of trying to build something like this myself.
stephen · 5 months ago
Do queue operations (enqueue a job & mark this job as complete) happen in the same transaction as my business logic?

Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic _and_ my background job enqueue both atomically commit, or atomically fail?"

Same thing for performing jobs: if my worker's business logic commits, but the job later retries (b/c marking the job as complete is a separate transaction), then oof, that's annoying.

And I might as well be using SQS at that point.
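A bare-bones sketch of the atomic-enqueue property being described, assuming hypothetical `orders` and `tasks` tables living in the same database; both inserts commit or roll back together:

```python
# Sketch of "enqueue in the same transaction as business logic", with
# hypothetical `orders` and `tasks` tables. If either insert fails, both roll
# back: there is no window where the order exists without its background job.
import json
import psycopg

def create_order_with_job(conn: psycopg.Connection, customer_id: int) -> int:
    with conn.transaction():                      # one atomic unit
        order_id = conn.execute(
            "INSERT INTO orders (customer_id) VALUES (%s) RETURNING id",
            (customer_id,),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO tasks (kind, payload, status) VALUES (%s, %s, 'queued')",
            ("send_receipt", json.dumps({"order_id": order_id})),
        )
    return order_id
```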

williamdclt · 5 months ago
My understanding is that Hatchet isn't just a queue, it's a workflow orchestrator. You can use it as a queue, but that's kind of like using a computer as a calculator: it works, but it'd likely be simpler to just use a calculator.

On your point of using transactions for idempotency: you’re right that it’s a great advantage of a db-based queue, but I’d be wary about taking it as a holy grail for a few reasons:

- it locks you into using a db-based queue. If for any reason you don’t want to anymore (eg you’re reaching scalability issues) it’ll be very difficult to switch to another queue system as you’re relying on transactions for idempotency.

- you only get transactional idempotency for db operations. Any other side effect won’t be automatically idempotent: external API calls, sending messages to other queues, writing files…

- if you decide to move some of your domain to another service, you lose transactional idempotency (it’s now two databases)

- relying on transactionality means you're not resilient to duplicate tasks in the queue (duplicate publishing). That can easily happen: a bug in the publisher, two users triggering an action concurrently… it's quite often a very normal thing for the same action to be triggered multiple times

So I’d avoid having my tasks rely on transactionality for idempotency, your system is much more resilient if you don’t

lyu07282 · 5 months ago
Just no, your tasks should be idempotent. Distributed transactions are stupid.
williamdclt · 5 months ago
They’re not talking about distributed transactions: it’s not about a task being published and consumed atomically, it’s about it being consumed and executed atomically.
nik736 · 5 months ago
The readme assumes users with dark mode outweigh users without (the logo is white, so it's invisible without dark mode). Would be interesting to see stats from GitHub on this!
lysecret · 5 months ago
This is awesome and I will take a closer look! One question: we ran into issues using Postgres as a message queue with messages that need to be TOASTed / have large payloads (50 MB+).

The only fix we could find was using unlogged tables and a full vacuum on a schedule. We aren't big Postgres experts, but since you are, I was wondering whether you have fixed this issue / whether this framework works well for large payloads.

igor47 · 5 months ago
Don't put them in the queue. Put the large payload into an object store like S3/GCS and put a reference into the DB or queue.
szvsw · 5 months ago
Yep - this is also the method officially recommended by Hatchet, sometimes called payload thinning.
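A small sketch of that payload-thinning pattern: upload the large body to object storage, enqueue only a reference, and have the worker fetch it on demand. The bucket name and task shape are hypothetical; put_object/get_object are boto3's real S3 calls:

```python
# Payload thinning, as suggested above: keep the large body in object storage
# and pass only a reference through the queue.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-task-payloads"   # hypothetical bucket

def enqueue_large_task(enqueue, body: bytes) -> None:
    key = f"payloads/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)          # the 50 MB body lives in S3...
    enqueue({"payload_ref": {"bucket": BUCKET, "key": key}})  # ...the queued row stays tiny

def run_task(task: dict) -> bytes:
    ref = task["payload_ref"]
    obj = s3.get_object(Bucket=ref["bucket"], Key=ref["key"])
    return obj["Body"].read()                                 # worker fetches on demand
```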
fabcairo · 5 months ago
This looks super promising, really like the deep PostgreSQL integration and the effort toward durable execution.

One aspect I’d be curious to hear more about (and might be worth expanding on in docs or future posts) is how hatchet holds up operationally in production. For example, what does a typical alerting setup look like for common failure modes? And since the system relies on partitioned tables and tuned schemas, how do you approach migrations or schema changes without downtime?

A lot of open-source job orchestration systems shine at the core execution model but fall short when it comes to observability and smooth day-2 operations. If Hatchet nails that too, it’s a huge win.

sgarland · 5 months ago
> Improved performance across every dimension we’ve tested, which we attribute to six improvements to the Hatchet architecture: range-based partitioning of time series tables, hash-based partitioning of task events (for updating task statuses), separating our monitoring tables from our queue, buffered reads and writes, switching all high-volume tables to use identity columns, and aggressive use of Postgres triggers.

Amazing what you can do when you read the manual, eh?

Seriously though, that’s awesome, and I’m very happy to see someone leaning hard into RDBMS features like triggers instead of shying away.