I strongly believe that being obvious about steps with `step.run` is important: it improves o11y, makes things explicit, and you can see transactional boundaries.
`Wait()` also does that. And the examples in the package documentation don't show the context as a way for the caller to be notified that things are done (that's what `Wait()` is for), but as a way for the callees (the callbacks passed to Do) to abort early.
This is mostly confirmed by the discussion dantillberg linked above, where someone suggests passing the errgroup's context down to the callbacks as a parameter, and the package author replies they don't do that because the lack of inference makes for nasty boilerplate (https://github.com/golang/go/issues/34510#issuecomment-53961...).
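To make that concrete, here's a rough sketch of the errgroup pattern being discussed (my own example, not lifted from any docs): the derived context exists so the callbacks can abort early once a sibling fails, while the caller learns that everything finished via `Wait()`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	g.Go(func() error {
		// First failure cancels ctx for the sibling goroutines.
		return errors.New("boom")
	})

	g.Go(func() error {
		select {
		case <-time.After(time.Second): // pretend to do some work
			return nil
		case <-ctx.Done(): // callee notices the cancellation and aborts early
			return ctx.Err()
		}
	})

	// Wait(), not the context, is how the caller finds out everything is done.
	fmt.Println(g.Wait())
}
```

Note the callbacks capture `ctx` via closure rather than receiving it as a parameter, which is exactly the boilerplate trade-off the package author mentions in that issue.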
I just figured it was worth calling out that the exactly-once semantics don't extend to external side effects (which is what orchestration is for), which is a big caveat.
Want to send an email, but the app crashes before committing? Now you're at-least-once.
You can compress the window that causes at-least-once semantics, but it's always there. For this reason, this blog post oversells the capabilities of these types of systems as a whole. DBOS (and Inngest, see the disclaimer below) try to get as close to exactly once as possible, but the risk always exists, which is why you should always use idempotency keys in external API requests when the API supports them. Defense in layers.
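Rough sketch of what I mean by layering idempotency on top (the endpoint, helper, and key format are made up; the Stripe-style `Idempotency-Key` header is just one example of what a provider might accept):

```go
package notify

import (
	"bytes"
	"fmt"
	"net/http"
)

// sendEmail calls a hypothetical email API. The idempotency key is derived
// from the run/step identity, so if the app crashes after the request but
// before committing the step result, the retried request dedupes server-side.
func sendEmail(runID, stepID string, body []byte) error {
	req, err := http.NewRequest(http.MethodPost, "https://api.example.com/emails", bytes.NewReader(body))
	if err != nil {
		return err
	}
	// Same run + step => same key on every retry.
	req.Header.Set("Idempotency-Key", fmt.Sprintf("%s:%s:send-welcome-email", runID, stepID))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("send email: %s", resp.Status)
	}
	return nil
}
```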
Disclaimer: I built the original `step.run` APIs at https://www.inngest.com, which offers similar things on any platform... without being tied to DB transactions.
First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.
As to architecture:
Cloudflare has had some outages recently. However, what's their uptime over the longer term? If an individual site took on the infra challenges itself, would it do better? I don't think so.
But there’s a more interesting argument in favour of the status quo.
Assuming Cloudflare's uptime is above average, having outages hit everything at once is actually better for the average internet user.
It might not be intuitive, but think about it.
How many Internet services does someone depend on to get something done, say an hour of their work? Maybe 10 directly, and another 100 indirectly? (Make up your own numbers, but it's probably quite a few.)
If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.
On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.
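Back-of-envelope version of that, using the made-up numbers above and assuming you're blocked whenever any one service you depend on is down:

```go
package main

import "fmt"

func main() {
	services := 100.0
	downtimePerService := 1.0 // hours per year, per service

	// All outages overlap (everyone is on the same provider).
	correlated := downtimePerService
	// Worst case: outages never overlap, so blocked time adds up.
	independent := services * downtimePerService

	fmt.Printf("correlated outages:  ~%.0f blocked hours/year\n", correlated)
	fmt.Printf("independent outages: ~%.0f blocked hours/year\n", independent)
}
```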
It's not really a bad end-user experience that every service uses Cloudflare. It's more a question of why Cloudflare's stability seems to be going downhill.
And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.