I was responsible for Stripe's API abstractions, including webhooks and /events, for a number of years. Some interesting tidbits:
Many large customers eventually had some issue with webhooks that required intervention. Stripe retries webhooks that fail for up to 3 days: I remember $large_customer coming back from a 3 day weekend and discovering that they had pushed bad code and failed to process some webhooks. We'd often get requests to retry all failed webhooks in a time period. The best customers would have infrastructure to do this themselves off of /v1/events, though this was unfortunately rare.
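For readers who want to be one of those "best customers": the recovery tooling amounts to paging through an events feed and reprocessing anything newer than the last event you know you handled. A minimal sketch of that pattern, where `fetch_events` and `handle_event` are hypothetical stand-ins for a real API client and your business logic:

```python
# Sketch of the "replay from /v1/events" recovery pattern.
# fetch_events and handle_event are hypothetical stand-ins for a real
# API client and your own processing code.

def replay_missed_events(fetch_events, handle_event, last_processed_id):
    """Walk the events feed oldest to newest, reprocessing anything
    after last_processed_id. Returns the id of the newest event handled."""
    events = sorted(fetch_events(), key=lambda e: e["id"])
    newest = last_processed_id
    for event in events:
        if event["id"] > last_processed_id:
            handle_event(event)
            newest = event["id"]
    return newest

# Example with an in-memory feed standing in for the API:
feed = [{"id": 3, "type": "charge.succeeded"},
        {"id": 1, "type": "charge.succeeded"},
        {"id": 2, "type": "invoice.paid"}]
handled = []
newest = replay_missed_events(lambda: feed, handled.append,
                              last_processed_id=1)
# handled now contains events 2 and 3, in order
```

The key design point is that the consumer owns its cursor (`last_processed_id`), so a bad deploy only costs you the time it takes to rewind it and rerun.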
The biggest challenges with webhooks:
- Delivery: some customers timing out connections at 30s, causing the queues to get backed up (Stripe was much smaller back then).
- Versioning: synchronous API requests can use a version specified in the request, but webhooks, by virtue of rendering the object and showing its changed values (there was a `previous_attributes` hash), need to be rendered to a specific version. This made upgrading API versions hard for customers.
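To illustrate the versioning point: `previous_attributes` is essentially a diff of the fields that changed, rendered alongside the full object. A rough sketch of how such a payload might be assembled (field names and structure are illustrative, not Stripe's actual renderer):

```python
def build_event_payload(event_type, new_obj, old_obj):
    """Render an event carrying the full new object plus a
    previous_attributes hash holding only the fields that changed."""
    previous = {k: v for k, v in old_obj.items()
                if new_obj.get(k) != v}
    return {"type": event_type,
            "data": {"object": new_obj,
                     "previous_attributes": previous}}

payload = build_event_payload(
    "customer.updated",
    {"id": "cus_1", "email": "new@example.com", "name": "Ada"},
    {"id": "cus_1", "email": "old@example.com", "name": "Ada"})
# only the changed field ("email") appears in previous_attributes
```

The versioning pain follows directly from this shape: both the object and the diff have to be rendered to one specific API version, so a consumer can't mix versions per request the way they can with synchronous calls.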
There was constant discussion about building some non-webhook pathway for events, but they all have challenges and webhooks + /v1/events were both simple enough for smaller customers and workable for larger customers.
Shameless plug but I've built https://hookdeck.com precisely to tackle some of these problems. It generally falls onto the consumer to build the necessary tools to process webhooks reliably. I'm trying to give everyone the opportunity to be the "best customers" as you are describing them. Stripe is big inspiration for the work.
Do you provide the ability to consume, translate then forward? I am after a ubiquitous endpoint i can point webhooks at and then translate to the schema of another service and send on. You could then share these 'recipes' and allow customers to reuse well known transforms.
> We'd often get requests to retry all failed webhooks in a time period.
(I worked on the same team as bkrausz, non-concurrently).
For teams that are building webhooks into your API, I'd recommend including UI to view webhook attempts and resend them individually or in bulk by date range. Your customers are guaranteed to have a bad deploy at some point.
At Lawn Love, we naively coupled our listening code directly to the Stripe webhook... but it worked flawlessly for years. I wasn't a big fan of the product changes necessitating us switching from the Transfer API for sending money to the complicated--and very confusing for the lawn pros--Connect product, but its webhooks also ran without issue from the moment we first implemented them. So thanks for making my life somewhat easier, Mr. Krausz.
Like many others, I now pattern my own APIs after Stripe's.
Don’t fully thank me, I was also the architect of the Transfers API to Connect transition :). There’s a lot I would have done differently there were I doing it again, though much of the complexity (e.g. the async verification webhooks) were to satisfy compliance needs. Hard to say how much easier the v1 could’ve been given the constraints at the time, though I’m very impressed with the work Stripe has done since to make paying people easier (particularly Express).
Pretty easy for a customer to set up an SQS queue and a Lambda for receiving them rather than relying on their infrastructure to do all the actual receiving. Way more reliable than coupling your code directly to the callback.
This is precisely what we do where I work. We have a service which has just one responsibility - receive webhooks, do very basic validation that they're legitimate, then ship the payload off to an SQS queue for processing. Doing it this way means that whatever's going on in the service that wants the data, the webhooks get delivered, and we don't have to worry about how 3rd party X have configured retries.
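A minimal sketch of that receive-validate-enqueue split, with HMAC signature checking and an in-memory deque standing in for SQS. The secret and signature scheme here are illustrative, not any particular provider's:

```python
import hmac, hashlib, json
from collections import deque

SECRET = b"whsec_example"          # shared signing secret (illustrative)
queue = deque()                    # stand-in for SQS

def receive_webhook(body: bytes, signature: str) -> int:
    """Do only the cheap work inline: verify the signature, enqueue the
    raw payload, and return an HTTP status. Processing happens elsewhere."""
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 400                 # reject forged or corrupted payloads
    queue.append(body)             # a durable enqueue (SQS) would go here
    return 200

body = json.dumps({"type": "invoice.paid"}).encode()
sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
status = receive_webhook(body, sig)
```

Because the handler does nothing but validate and enqueue, it stays fast and available even when the downstream consumer is slow or down, which is exactly what keeps the sender's retry logic out of the picture.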
These reasons are exactly why we started Svix[1] (we do webhooks as a service). I wish we existed to serve you guys back when you started working on it. :)
I always laugh when people end up with designs like this. They could have just used SMTP! It's designed to reliably deliver messages to distributed queues using a loosely-coupled interface while still being extensible. It scales to massive amounts of traffic. It's highly failure-resistant and will retry operations in various scenarios. And it's bi-directional. But it's not "cool" technology or "web-based" so developers won't consider it.
Watch me get downvoted like crazy by all Nodejs developers. Even though they could accomplish exactly what they want with much less code and far less complex systems to maintain.
I pitched an idea like this years ago: to backfill one ticketing system into a shiny new system that could read an email inbox. The idea was that if we dropped an email in that inbox, in its desired format, for each old ticket's updates, the new system would do all the necessary inserts and voila. They told me no -- not because of any technical reason, but because their email infrastructure was required to be audited by the SEC, and they would have opened themselves up to significantly more auditing. Instead, I ended up having to do it through painful, painful SQL.
Lesson being, that sometimes there are unexpected reasons why a specific piece of technology shouldn't be used.
I actually did use SMTP as queuing middleware for a registrar platform years ago.
It worked very well.
EDIT: To add some context, my team had come off building a webmail platform, and so we'd done lots of interesting stuff to qmail and knew it inside out. We then launched the .name tld and built a model registrar platform that on registration would bring up web and mail forwarding for users that wanted it. We used SMTP to handle the provisioning of those while keeping the registration part decoupled from the servers handling the forwarding. We also used it to live-update a custom DNS server I wrote.
Honestly this is so stupid brilliant I love it (stupid as in I can’t believe I hadn’t considered this). Honestly it really is about storing, sending, and checking messages so SMTP makes so much sense!
I’ve been building for the web for 15 years, and it shows how deeply I can hyper-focus on certain communications implementations without looking at pre-existing options that really meet a large number of use cases. I suppose it also means making sure your data consumers are comfortable working with the protocol, but it’s a really top-notch idea.
SMTP used to be a lot more reliable than it is now. Now, with all the changes to help with blocking spam, you have to be very careful or have a lot of control over the receiving server to ensure you actually get delivery. Some anti-spam systems will just discard if the matching rules indicate the spam likelihood score is above a certain threshold, and mistakes in rules at system levels can and do happen.
But here's another way you could (ab)use the mail system for delivery, provide a mailbox for the client and just allow IMAP or POP access and throw the messages into that. The client can log in to access and process them (which they would likely be automating on their own mailboxes anyway). It does mean it's housed at the provider, but it's also pretty easy to scale. There's lots of info on how to set up load balanced dovecot clusters out there, and even specialized middleware modes (dovecot director) to make it work better so you can scale it to very very large systems.
SMTP would raise too many questions, from how both datacenters tolerate it (spam), to who will manage the receiving server itself and certificates on your side, and overall security of this setup. For a nodejs developer it’s really easier to spin up a separate handmade queue process rather than managing SMTP-related things. Webhook (for runtime) and long-polled /events?since= (for startup) have all the upsides with few downsides.
When designing something like this as a service, the biggest question is what other developers will find easy to use. Every cheap host supports inbound HTTP requests, and most web developers know how to receive them.
Stripe needs to be usable by both the developers building intense, scalable, reliable systems and the people teaching themselves to code in a limited context on a limited platform.
>And it's bi-directional. But it's not "cool" technology or "web-based" so developers won't consider it.
I might be missing a point or two here, but I don't see how SMTP can work for this case at all. You would require every API consumer to set up an SMTP server (which is another piece of infrastructure to maintain), and then somehow add a layer of authentication so the recipient can control who can post messages to that server (overhead for the publisher per new customer). Then we still haven't resolved the issues on the customer side (bad code could pop all the messages, and now we might need the publisher to replay them again).
I haven't even started to think about security and network hardening challenges yet. Again, I might be missing the point but this is not a case of cool tech overuse to me.
I'm confused, "use SMTP" doesn't even type-check for me. Isn't SMTP just a transfer protocol? Meaning it defines a bunch of commands and gives them meanings (like EHLO and DATA and such), just like how HTTP defines commands like GET and POST and all that? Isn't the problem here about e.g. the storage & retry logic rather than about the data transfer itself? Can't you retry transmission as frequently as you like using whatever protocol you like? How does transferring the data over SMTP gain you anything compared to HTTP?
What about events that need faster than 1 minute response times? Any push notification like system is going to be just as error prone. And what about multiple message handlers? And what happens when the send fails? Did someone write the code to check the inbox for them and handle them? When a send fails multiple times, is that logged and is there a system for clients to check that log? Message transfer isn't the hard problem in this domain.
I think it depends on the developer. There's developers hammering out boring business logic as fast as possible and there's developers with a deep understanding of machine internals, protocols, and infrastructure. For the former, SMTP is black magic they'd probably never think of, and involves engaging the one infra person who's always busy.
It also means standing up and managing "infrastructure"
Developers won't be able to use the existing email systems of the company, too critical and managed by another team. They will never be able to reconfigure it and get API access to read emails. Note that it may or may not be reliable at all (depends on the company and the IT who manages it).
Developers won't be able to set up new email servers for that use case. Security will never open the firewall for email ports. If they do, the servers will be hammered by vulnerability scanners and spam as soon as they're running. Note that large companies like banks run port scanners, and they will detect your rogue email servers and shut them down (speaking from experience).
At this point in my career (10 years in the game), let me simply defend node as the tool that got me here. Using it then to bootstrap my career was just as practical as using SMTP as you describe now.
There are better options than SMTP. Basically any message-oriented middleware / message queuing service can provide this. It's great for both sides, maintenance/outages can happen independently, as long as the queue stays online and has space everything is fine.
E-Mail isn't trustworthy. You may get a confirmation that an initial SMTP server accepted a mail, but that's it. There's also no good way to detect that an endpoint (receiver address) is gone for good to stop sending messages.
You will probably point me to SMTP success messages, but a removed mailbox might only be known by a backend server.
Also mail infrastructure will potentially include heavy spam filters etc. making it quite inconvenient. Not even mentioning security aspects with limited availability of transport layer encryption with proper signatures.
I think that would be a great solution for these types of scenarios.
In an enterprise setting it becomes more complex if a 365 subscription is required, or active directory authentication is needed to receive emails. Does someone need to monitor the inbox to confirm it's working etc.
But after you mentioned it, I do wish that this was an alternative to webhooks that more service providers offered.
We used to do this for domain name registrations and it worked fairly well for years. However once you've been added to a spam blacklist it quickly breaks down, especially for time critical operations such as domain name renewals when you're scrabbling around trying to appease the Spamhaus gods.
SMTP doesn't reliably deliver messages, implementations of it do. A webshit could easily create an SMTP server (with the help of a library written by someone with actual programming skills) that silently drops messages when any error occurs instead of implementing all that robustness.
This is a hill I find myself frequently fighting on (and losing): webhooks are terrible to maintain, because they start from the premise "this never breaks" and that's about where development in an organization stops.
The only event API I ever want is notifications there's new data, and then an interface by which I can query all new data which has arrived by some sort of index marker - because this is fundamentally reliable. It means whatever happens to my system, I can reliably recover missed events, skipped events, or rebuild from previous events.
And this is in fact exactly how something like Kafka actually works! Complete with first-class support for compacting queues to produce valid "summarized" starting points.
Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather then what happens way too often which is just "oh yeah, we'll develop that when it breaks".
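The slow-path-first idea above can be sketched as a cursor-based pull API, with any push channel reduced to a hint to poll sooner. Class and method names here are illustrative:

```python
# Sketch of the slow-path-first design: the pull API is the source of
# truth, and any push channel is just a hint to poll sooner.

class EventLog:
    """Append-only log queried by cursor (names are illustrative)."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events)        # new cursor position

    def since(self, cursor):
        """Everything after the caller's cursor, plus the new cursor."""
        return self._events[cursor:], len(self._events)

log = EventLog()
log.append({"type": "order.created"})
log.append({"type": "order.paid"})

events, cursor = log.since(0)      # full rebuild from scratch
more, cursor = log.since(cursor)   # caught up: nothing new yet
```

Because rebuild-from-zero and catch-up-after-downtime are the same call with different cursors, the recovery path exists from day one rather than being bolted on after the first outage.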
I agree. A simple ping with the latest ID (which you can optionally use to fetch events from your last ID up to the newest ID). Then go get the events, which likely reuses existing code. Polling is crap.
Extra points for being able to set something like 1s between pings (now you see why I like the optional ID for a range).
> Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather then what happens way too often which is just "oh yeah, we'll develop that when it breaks".
I think this is quite an interesting and important point. When we talk about "doing the simple thing first", too often we end up building something that is technically simple but fickle. The trick to making the simple thing reliable is to figure out which part is the slow path (or failure mode), and then build only that. Unfortunately, it often means our result ends up technically "boring", since all the interesting optimizations are what we cut out, but I think that's worth it if the end result is a more useful product.
It's something I've been working with and thinking about for a while. I think it applies to a way broader scope than this discussion.
(I worked on the same team as bkrausz, elsewhere on this thread, albeit not concurrently).
Yes, this is pretty much the right thing to do. It can be a bit more work for the API consumer, partly because they need to track the state of their last-read ID, and there are more moving parts.
If you're building a webhook+events system like Stripe's, you might consider adding an option for a mostly-empty webhook body, which can speed things up in this use-case, but still allows "the easy way" of just processing the event from within the webhook body.
(For readers thinking of implementing this, note that "query for new data" means hitting a dedicated /events api, not individual tables, which might have unpleasant load/performance consequences).
My company has recently switched to Microsoft Teams, where unsupported integrations happen via webhooks. For example, if we wanted to be able to trigger builds in Jenkins or Gitlab, or acknowledge alerts via AlertManager, we'd have to set them up as webhooks to the appropriate service.
The problem is that all of those services are internal to our network, and aren't accessible from the outside world. We cannot set up a webhook to Jenkins because Jenkins does not have a publicly accessible URL. We cannot set up a webhook to Gitlab, or to Prometheus, or to Sentry, or anything else, because those are all internal services.
The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.
Alternately, we have that new, public-facing server buffer those requests and have other services poll them, somehow, so that it cannot connect in, but now we're getting into the same situation as described in the article.
If there were an API, I could easily create a small daemon that would watch for events and dispatch them accordingly, and then respond to them as needed; instead, my only option is to build some kind of Frankenstein - or to give up entirely, which is the more reasonable solution.
Then again, this is Microsoft Teams, where creating an application requires an Azure account and jumping through a ton of hoops, so they're no stranger to stupid ideas that no one wants to deal with.
My company's internal apps use a mix of VPNs and IP fenced load balancers. We are migrating to app proxy.
No inbound connections + access based on Azure AD identity with conditional access (restrict apps to Intune enabled corporate devices) and MFA is an absolute killer.
My only complaint is that connectors are not very DevOps friendly. Cloudflare Tunnel is much better in this area.
You might look into Cloudflare Tunnel (formerly Argo). It is free and allows you to poke a hole in your firewall to a specific service. If that meets your security requirements.
I don't believe Cloudflare Tunnel is free, the free tier pricing page [1] lists Argo Smart Routing at "Starting at $5 per month" ("Argo includes: Smart Routing, Tunnel, and Tiered Caching")
> The only option there would be to create a new, public-facing server
This is a problem with receiving any inbound data from a third party. At least with HTTP, it's pretty trivial to set up a robust reverse proxy with nginx.
I'm finishing a browser based application platform where the applications installed expose a RPC api, so in the end all applications can call others in the same local(or remote) node/s.
The beauty of this is that you can also compose with other nodes and form a distributed service, by calling the local service as a proxy and routing the requests to the other nodes of the same API.
It took more time than I'd predicted because it's also expected to deliver UI and most of the 'HTML5' API to native applications (instead of JavaScript), which is a massive platform by now (and the #1 reason why newcomers to browser technology can't compete, given the feature-creep tax imposed on them).
The idea is also to distribute over a DHT so you can just serve your application over torrent without needing to register anything..
The only way to get there is by empowering users and developers and taking some of the control from the cloud platform giants.
In my point of view the only way to break the browser monopoly now is to create a new path forward, a branch.. its not the time to follow the rules, its time to break them or else the future doesn't look so bright in my opinion..
> The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.
That's not the only solution -- you could also develop a bot that will do those specific things.
In the days of yore I know of at least three companies that were using IRC bots to similar effect long before webhooks ever existed.
Because of that prior experience, this is how I currently manage a similar set of problems, albeit not on Teams in my current role.
Really good point that corporate firewalls can trip you up. With slack it was so much easier to call into their events API than receive an outgoing webhook for precisely this reason.
The downside was that the event api required a huge amount of scope, so if you weren’t careful and were compromised, someone could use that token to scrape all messages in the system.
We use this same long-polling based /events API interface for all official clients (web, mobile, terminal), our interactive bots ecosystem (https://zulip.com/api/running-bots), and many integrations (E.g. bridges with IRC/Matrix/etc.).
We also offer webhooks, because some platforms like Heroku or AWS Lambda make it much easier to accept incoming HTTP requests than do longpolling, but the events system has always felt like a nicer programming model.
(Zulip's events system was inspired by separate ~2012 conversations I had with the Meteor and Quora founders about the best way to do live updates in a web application).
There are lots of reasons to want to immediately respond to an external event besides building an eventually consistent data syncing system. Polling an API endpoint works fine for the latter case, but not much else.
A good platform should offer both of these and more (for example Slack does webhooks, REST endpoint, websocket-based streaming and bulk exports), and let the client pick what they want based on their use case.
Long-polling is the way to immediately retrieve events. It's more efficient and lower latency than waiting for a sender to initiate a TCP and TLS handshake.
A persistent connection has a cost. Your statement may be true in some circumstances but definitely not all. Namely, for infrequent events it is much more efficient to be notified than to be asking nonstop. Sure, the latency is lowest if the connection is already established, but for efficiency the answer is not cut and dry but is rather a tradeoff decision based on the expected patterns.
One nice benefit of long polling is the built in catch-up-after-a-break functionality: When the client initiates the poll, it tells the server the state it knows about (timestamp, sequence number, hash, whatever), and the server either replies right away if it's different, or waits and replies once it's different.
With webhooks, as in the article, you only get state changes; you need some separate mechanism to achieve (or recover) the initial state.
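A minimal sketch of that catch-up behavior on the server side, assuming the state the client reports is a simple integer version number (names are illustrative):

```python
import threading

class LongPollState:
    """Server-side sketch: reply immediately if the client is behind,
    otherwise block (up to a timeout) until the state changes."""
    def __init__(self):
        self._version = 0
        self._changed = threading.Condition()

    def update(self):
        with self._changed:
            self._version += 1
            self._changed.notify_all()

    def poll(self, known_version, timeout=5.0):
        with self._changed:
            if self._version != known_version:
                return self._version           # catch-up: reply right away
            self._changed.wait(timeout)        # otherwise wait for a change
            return self._version

state = LongPollState()
state.update()                     # something happened while client was away
v = state.poll(known_version=0)    # stale client gets the new version at once
```

The same `poll` call serves both cases: a client that was offline gets its missed update immediately, and an up-to-date client just parks until there is something new.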
Someone has to maintain an always-running listener for `/events`. If a server does that, and triggers client calls, we call that webhooks. If a client does that, and triggers internal functions, it's what the op describes. I think that for APIs, `/events` should indeed be the fundamental feature, and "webhooks" should be a nice-to-have service on top of `/events`, for those who don't want to maintain a local subscriber.
If the webhook events are coming at some sort of a brisk pace, the sender well may be able to reuse an already-open connection. And if they're rather infrequent, is the efficiency or latency likely to be a significant concern?
> To mitigate both of these issues, many developers end up buffering webhooks onto a message bus system like Kafka, which feels like a cumbersome compromise.
Kafka solves exactly the issue that the author is complaining about. This is a safeguard to ensure that data isn't dropped in the event of an issue, and provides mechanisms to replay events.
The tradeoff between pushing and polling have been argued since forever.
In other news, mechanics who work with bolts often do so with ratchets. This is a cumbersome compromise, just give me Torx fasteners!
It would if the source was pushing into the Kafka stream directly. It doesn't solve the problem of going out of sync if my code to push to the Kafka stream is entirely down and I miss POSTs.
(And, of course, I don't want Kafka. I want Google PubSub. No, wait, I mean SQS. No, wait, I mean I want zeroMQ. No, I mean....)
The question is: who maintains the queue of events, and pays for it?
Certainly the event producer is in a better position to maintain a queue without missing events, but it also means they need to buffer more data in their queue system to accommodate your receiver's downtime.
Not disagreeing with your point, and I'm sure you already know this, I just wanted to point out (for the benefit of people that don't have other options) that it is possible to build "webhooks" in such a way that you're confident nothing is dropped and nothing goes (permanently) out of sync. (At least, AFAIK -- correct me if this sounds wrong!)
Conceptually, the important thing is each stage waits to "ACK" the message until it's durably persisted. And when the message is sent to the next stage, the previous stage _waits for an ACK_ before assuming the handoff was successful.
In the case that your application code is down, the other party should detect that ("Oh, my webhook request returned a 502") and handle it appropriately -- e.g. by pausing their webhook queue and retrying the message until it succeeds, or putting it on a dead-letter queue, etc. Your app will be "out of sync" until it comes back online and the retries succeed, but it will eventually end up "in sync."
Of course, the issue with this approach is most webhook providers... don't do that (IME). It seems like webhooks are often viewed as a "best-effort" thing, where they send the HTTP request and if it doesn't work, then whatever. I'd be inclined to agree that kind of "throw it over the fence" webhook is not great and risks permanent desync. But there are situations where an async messaging flow is the right decision and believe it or not, it can work! :)
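A toy simulation of the ACK-after-durable-persist handoff described above. The flaky receiver and retry counts are contrived purely for illustration:

```python
# Sketch of the "ACK only after durable persist" handoff: the sender
# retries until the receiver confirms the message was persisted.

def deliver_with_retries(message, receive, max_attempts=5):
    """Retry delivery until the receiver ACKs; return the attempt count.
    Raises if the message is never acknowledged (dead-letter territory)."""
    for attempt in range(1, max_attempts + 1):
        if receive(message):            # True == persisted, safe to ACK
            return attempt
    raise RuntimeError("message dead-lettered after retries")

store = []
outages = iter([False, False, True])    # receiver is down for two attempts

def flaky_receiver(message):
    if not next(outages):               # simulate two failed deliveries
        return False
    store.append(message)               # durably persist, then ACK
    return True

attempts = deliver_with_retries({"id": "evt_1"}, flaky_receiver)
```

The receiver is "out of sync" for the first two attempts, but because the sender never treats an un-ACKed delivery as done, the message lands exactly once when the receiver recovers.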
As long as you guarantee delivery to your message queue before acknowledging receipt, you should be golden.
Also, swapping out one messaging system for another is trivial. Pick the one best suited to the environment you're working in, and if that environment changes, changing message queues is going to be one of the easiest transitions you'll make.
Yeah I was scratching my head reading this article; they're bending so far backwards to avoid the obvious solution that I thought they were gearing up to pitch some competing tech.
> If the sender's queue starts to experience back-pressure, webhook events will be delayed, and it may be very difficult for you to know that this slippage is occurring
I've never before seen anyone try to argue that properly dealing with backpressure is a bad thing. The author's proposed model makes this situation even worse. With kafka, consumers can continue processing the event stream and you can continue to serve reads from your primary datastore. With the author's model the event stream lives in your primary datastore, so if that starts to lock up the blast radius is much larger.
Are you going to expose your Kafka brokers directly to your integration partners? Are they going to use the Kafka client library and wire protocol to send you data? That’s the thing about webhooks, HTTP is universal and if you’re comfortable exposing anything externally, it’s going to be a web service.
That's a pretty straight-forward design that's widely used, robust, and easy to put together. I've probably done that same workflow 100s of times without issue.
As long as you guarantee the message was pushed to the queue before acknowledging, that will be fabulously reliable. You need to make contingencies for duplicate messages, but that's not usually difficult.
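The duplicate-message contingency usually comes down to idempotency keyed on the event id. A minimal sketch (in practice the seen-id set would live in durable storage, not process memory):

```python
# Sketch of an idempotent consumer: remember processed event ids so
# redelivered duplicates are acknowledged without repeating side effects.

processed_ids = set()
side_effects = []

def handle_event(event):
    if event["id"] in processed_ids:
        return "duplicate"             # ACK the redelivery, do nothing
    side_effects.append(event["type"])  # the real work happens once
    processed_ids.add(event["id"])
    return "processed"

handle_event({"id": "evt_1", "type": "invoice.paid"})
result = handle_event({"id": "evt_1", "type": "invoice.paid"})  # redelivery
```

At-least-once delivery plus an idempotent consumer gives you effectively-once processing, which is the practical ceiling for this kind of pipeline.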
It's a common writing style as of late, set down a premise and solve that premise decisively.
Now, if that premise isn't based in reality, or if it's already been solved some other way, discredit it without giving it too much air time.
A one liner about kafka being cumbersome and then building your own solution, warts and all, doesn't need to exist in the same thought if you've made the reader mentally disregard it as a possible solution.
Totally, things can get very reliable once you start processing webhooks asynchronously. Personally I've found it pretty cumbersome and complicated to build the necessary infrastructure in the past. I've been building https://hookdeck.com as a simpler alternative specifically for ingesting incoming webhooks.
Are events and webhooks mutually exclusive? How about a combination of both: events for consuming at leisure, webhooks for notification of new events. This allows instant notification of new events but allows for the benefits outlined in the article.
What about supporting fast lookup of the event endpoint, so it can be queried more frequently?
I think that a combo of webhooks / events is nice, but "what scope do we cut?" is an important question. Unfortunately, it feels like the events part is cut, when I'd argue that events is significantly more important.
Webhooks are flashier from a PM perspective because they are perceived as more real-time, but polling is just as good in practice.
Polling is also completely in your control, you will get an event within X seconds of it going live. That isn't true for webhooks, where a vendor may have delays on their outbound pipeline.
I think that's what the author was getting at, after reading through the whole article. The idea isn't to get rid of webhooks, but provide an endpoint that can be used when webhooks won't necessarily work.
Very similar to how I built my previous application.
1) /events for the source of truth (i.e. cursor-based logs)
2) websockets for "nice to have" real-time updates as a way to hint the clients to refetch what's new
Yeah... I'd go so far as to argue that this is the only architecture that should even ever be considered, as only having one half of the solution is clearly wrong.
This is the way to go and I'd love to see more APIs with a robust events endpoint for polling & reconciliation. Deletes are especially hard to reconcile with many APIs since they aren't queryable, and you need to individually check whether every ID still exists. Shopify, I'm looking at you.
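Reconciling deletes without a deletes feed reduces to a set difference between the IDs you hold locally and the IDs the API still returns. A minimal sketch (the IDs are illustrative):

```python
# Sketch of reconciling deletes when the API has no "deleted" feed:
# diff the ids you hold locally against the ids the API still returns.

def find_deleted(local_ids, remote_ids):
    """Anything we have locally that the remote no longer lists
    was deleted upstream."""
    return sorted(set(local_ids) - set(remote_ids))

deleted = find_deleted(
    local_ids=["prod_1", "prod_2", "prod_3"],
    remote_ids=["prod_1", "prod_3"])
```

The catch, as the comment notes, is that "the IDs the API still returns" means paging through the entire remote collection, which is exactly the expensive full scan a proper events endpoint would let you avoid.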
Yes to the combination of both. I worked on architecture and was responsible for large-scale systems at Google. Reliable giant-scale systems do both event subscription and polling, often at the same time, with idempotency guarantees.
Sorry if I'm daft, could you/someone explain why one would want to use both at the same time for the same system?
One thing that makes sense: if you go down use polling so you can work at your own pace. But this isn't really at the same time. When/why does it make sense to do both simultaneously?
I don't think the original comment meant long polling (i.e. keeping the connection alive), they meant periodically call the endpoint to check for events.
There's a much better approach than /events or webhooks: add synchronization directly into HTTP itself.
The underlying problem is that HTTP is a state transfer protocol, not a state synchronization protocol. HTTP knows how to transfer state between client and server once, but doesn't know how to update the client when the state changes.
When you add a /events resource, or a webhooks system, you're trying to bolt state synchronization onto a state transfer protocol, and you get a network-layer mismatch. You end up with the equivalent of HTTP request/response objects inside of existing HTTP request/responses, like you see in /events! You end up sending "DELETE" messages within a GET to an /events resource. This breaks REST.
A much better approach is to just fix HTTP, and teach it how to synchronize! We're doing that in the Braid project (https://braid.org) and I encourage anyone in this space to consider this approach. It ends up being much simpler to implement, more general, and more powerful.
You may send POST /events instead. It also breaks “REST”, which is just a sort of obsession rather than a requirement here, but more importantly it wouldn’t break idempotence and proxy caching that GET implies.
Edit: from the network point of view, it’s either call-back, a persistent call-wait/socket, or polling. The exact protocol is irrelevant, because it’s networking limits and efficiency that prevent everyone from having a persistent connection to everyone. A persistent connection can’t be much better than any other persistent connection in that regard, and what happens inside is an unrelated story. Or am I missing something?
...and you can get these features for free using off-the-shelf polyfill libraries. If you're in Javascript, try braidify: https://www.npmjs.com/package/braidify
For teams building webhooks into their API, I'd recommend including a UI to view webhook attempts and resend them individually or in bulk by date range. Your customers are guaranteed to have a bad deploy at some point.
Like many others, I now pattern my own APIs after Stripe's.
Edit: not sure why I'm being downvoted. I work at Stripe and this is literally how it works.
Watch me get downvoted like crazy by all Nodejs developers. Even though they could accomplish exactly what they want with much less code and far less complex systems to maintain.
Lesson being that sometimes there are unexpected reasons why a specific piece of technology shouldn't be used.
I didn't downvote you but I bet they come from this part. People don't like this kind of negativity.
> But it's not "cool" technology or "web-based" so developers won't consider it. Watch me get downvoted like crazy by all Nodejs developers.
It worked very well.
EDIT: To add some context, my team had come off building a webmail platform, and so we'd done lots of interesting stuff to qmail and knew it inside out. We then launched the .name tld and built a model registrar platform that on registration would bring up web and mail forwarding for users that wanted it. We used SMTP to handle the provisioning of those while keeping the registration part decoupled from the servers handling the forwarding. We also used it to live-update a custom DNS server I wrote.
Why you need to insult a whole body of people, rather than just make a claim about the technology, I don’t know.
I’ve been building for the web for 15 years, and it shows how far I can hyper-focus on certain communications implementations without looking at pre-existing options that really meet a large number of use cases. I suppose it also means making sure your data consumers are comfortable working with the protocol, but it’s a really top-notch idea.
But here's another way you could (ab)use the mail system for delivery, provide a mailbox for the client and just allow IMAP or POP access and throw the messages into that. The client can log in to access and process them (which they would likely be automating on their own mailboxes anyway). It does mean it's housed at the provider, but it's also pretty easy to scale. There's lots of info on how to set up load balanced dovecot clusters out there, and even specialized middleware modes (dovecot director) to make it work better so you can scale it to very very large systems.
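A minimal sketch of what consuming such a provider-hosted mailbox could look like over POP3, using only the Python standard library. The host, credentials, and the convention of a JSON event body are assumptions for illustration; deleting a message after processing plays the role of the ack.

```python
# Hypothetical consumer for events delivered as email messages.
import json
import poplib
from email import message_from_bytes

def parse_event_message(raw_bytes):
    """Extract a JSON event payload from an RFC 822 message body."""
    msg = message_from_bytes(raw_bytes)
    return json.loads(msg.get_payload(decode=True))

def drain_mailbox(host, user, password):
    """Fetch, parse, and delete every pending event message."""
    conn = poplib.POP3_SSL(host)
    conn.user(user)
    conn.pass_(password)
    events = []
    count, _size = conn.stat()
    for i in range(1, count + 1):
        _resp, lines, _octets = conn.retr(i)
        events.append(parse_event_message(b"\r\n".join(lines)))
        conn.dele(i)            # "ack": remove only once processed
    conn.quit()
    return events
```

The mailbox itself acts as the durable retry buffer: anything not yet deleted is still pending on the next poll.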
Stripe needs to be usable by both the developers building intense, scalable, reliable systems and the people teaching themselves to code in a limited context on a limited platform.
I might be missing a point or two here, but I don't see how SMTP can work for this case at all. You would require every API consumer to set up an SMTP server (which is another piece of infrastructure to maintain), and then somehow have a layer of authentication so the recipient can control who posts messages on that server (overhead for the publisher per new customer). Then we still haven't resolved the issues on the customer side (bad code could pop all the messages, and now we might require the publisher to replay them again).
I haven't even started to think about security and network hardening challenges yet. Again, I might be missing the point but this is not a case of cool tech overuse to me.
I think it depends on the developer. There are developers hammering out boring business logic as fast as possible, and there are developers with a deep understanding of machine internals, protocols, and infrastructure. For the former, SMTP is black magic they'd probably never think of, and it involves engaging the one infra person who's always busy.
It also means standing up and managing "infrastructure"
Developers won't be able to use the existing email systems of the company: too critical and managed by another team. They will never be able to reconfigure them or get API access to read emails. Note that they may or may not be reliable at all (depending on the company and the IT team that manages them).
Developers won't be able to set up new email servers for that use case. Security will never open the firewall for email ports. If they do, the servers will be hammered by vulnerability scanners and spam as soon as they're running. Note that large companies like banks run port scanners; they will detect your rogue email servers and shut them down (speaking from experience).
At this point in my career (10 years in the game), let me simply defend node as the tool that got me here. Using it then to bootstrap my career was just as practical as using SMTP as you describe now.
But... Why?
The HTTP protocol is so much easier to manage, load balance, use, etc.
I think the bigger issue is that consumption isn't particularly friendly. Also, you still haven't solved the versioning issues.
You will probably point me to SMTP success messages, but a removed mailbox might only be known by a backend server.
Also, mail infrastructure will potentially include heavy spam filters etc., making it quite inconvenient. Not to mention the security aspects, given the limited availability of transport-layer encryption with proper signatures.
In an enterprise setting it becomes more complex if a 365 subscription is required, or active directory authentication is needed to receive emails. Does someone need to monitor the inbox to confirm it's working etc.
But after you mentioned it, I do wish that this was an alternative to webhooks that more service providers offered.
The only event API I ever want is notifications there's new data, and then an interface by which I can query all new data which has arrived by some sort of index marker - because this is fundamentally reliable. It means whatever happens to my system, I can reliably recover missed events, skipped events, or rebuild from previous events.
And this is in fact exactly how something like Kafka actually works! Complete with first-class support for compacting queues to produce valid "summarized" starting points.
Any streaming system essentially should never start as a streaming system - it should start as a slow-path pull-based system, and have a fast-path push system added on top of it if needed - because then you've built your recovery path already, rather than what happens way too often, which is just "oh yeah, we'll develop that when it breaks".
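The "slow-path pull-based system" plus the compaction idea from the Kafka comment can be illustrated with a toy log (all names and structure here are made up for the sketch): consumers pull from an offset they track themselves, and compaction keeps only the latest entry per key as a valid "summarized" starting point for rebuilds.

```python
# Toy pull-based event log with Kafka-style key compaction.

class EventLog:
    def __init__(self):
        self.entries = []          # list of (offset, key, value)
        self.next_offset = 0

    def append(self, key, value):
        self.entries.append((self.next_offset, key, value))
        self.next_offset += 1

    def read_from(self, offset):
        """Slow path: a consumer pulls everything at/after its offset."""
        return [e for e in self.entries if e[0] >= offset]

    def compact(self):
        """Keep only the latest entry per key; offsets are preserved, so
        existing consumer cursors remain valid."""
        latest = {}
        for off, key, value in self.entries:
            latest[key] = (off, key, value)
        self.entries = sorted(latest.values())
```

A push channel added later only needs to say "new offsets exist"; recovery is already the normal read path.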
Extra points for being able to set something like 1s between pings (now you see why I like the option of an ID range).
I think this is quite an interesting and important point. When we talk about "doing the simple thing first", too often we end up building something that is technically simple but fickle. The trick to making the simple thing reliable is to figure out which part is the slow path (or failure mode), and then build only that. Unfortunately, it often means our result ends up technically "boring", since all the interesting optimizations are what we cut out, but I think that's worth it if the end result is a more useful product.
It's something I've been working with and thinking about for a while. I think it applies to a way broader scope than this discussion.
Yes, this is pretty much the right thing to do. It can be a bit more work for the API consumer, partly because they need to track the state of their last-read ID, and there are more moving parts.
If you're building a webhook+events system like Stripe's, you might consider adding an option for a mostly-empty webhook body, which can speed things up in this use case, but still allows "the easy way" of just processing the event from within the webhook body.
(For readers thinking of implementing this, note that "query for new data" means hitting a dedicated /events api, not individual tables, which might have unpleasant load/performance consequences).
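The thin-body option described above can be sketched like this (the handler, the fetch contract, and the state shape are hypothetical names for illustration): the webhook is only a wake-up signal, and the /events API remains the source of truth.

```python
# Thin-payload webhook: ignore the body, pull from /events instead.

def handle_thin_webhook(state, fetch_events, process):
    """Treat the webhook as a hint; fetch authoritative events ourselves."""
    new_events, state["last_seen_id"] = fetch_events(state["last_seen_id"])
    for event in new_events:
        process(event)      # should be idempotent: retries may repeat events
    return 200              # ack only after the events are handled

# Stand-in for GET /events?after=<id>, for demonstration:
def fake_fetch(after):
    pending = [e for e in (101, 102, 103) if e > after]
    return pending, (pending[-1] if pending else after)
```

A nice property: a lost or duplicated webhook costs nothing, because the next wake-up (or a scheduled poll) fetches from the same cursor.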
The problem is that all of those services are internal to our network, and aren't accessible from the outside world. We cannot set up a webhook to Jenkins because Jenkins does not have a publicly accessible URL. We cannot set up a webhook to Gitlab, or to Prometheus, or to Sentry, or anything else, because those are all internal services.
The only option there would be to create a new, public-facing server, set it up with a domain name and SSL certificate, expose it to the world, and then give it access to those services - which defeats the point of having those services internal and secure if we just create a non-internal system and give it access to them.
Alternately, we have that new, public-facing server buffer those requests and have other services poll them, somehow, so that it cannot connect in, but now we're getting into the same situation as described in the article.
If there were an API, I could easily create a small daemon that would watch for events and dispatch them accordingly, and then respond to them as needed; instead, my only option is to build some kind of Frankenstein - or to give up entirely, which is the more reasonable solution.
Then again, this is Microsoft Teams, where creating an application requires an Azure account and jumping through a ton of hoops, so they're no stranger to stupid ideas that no one wants to deal with.
My company's internal apps use a mix of VPNs and IP fenced load balancers. We are migrating to app proxy.
No inbound connections + access based on Azure AD identity with conditional access (restrict apps to Intune enabled corporate devices) and MFA is an absolute killer.
My only complaint is that the connectors are not very DevOps friendly. Cloudflare Tunnel is much better in this area.
You might look into Cloudflare Tunnel (formerly Argo). It is free and allows you to poke a hole in your firewall to a specific service, if that meets your security requirements.
https://www.cloudflare.com/products/tunnel/
This is a problem with receiving any inbound data from a third party. At least with HTTP, it's pretty trivial to set up a robust reverse proxy with nginx.
The beauty of this is that you can also compose with other nodes and form a distributed service, by calling the local service as a proxy and routing the requests to the other nodes of the same API.
It took more time than I'd predicted because it's also expected to deliver UI and most of the 'HTML5' API to native applications (instead of Javascript), which is a massive platform by now (and the #1 reason why newcomers to browser technology can't compete, given the feature-creep tax imposed on them).
The idea is also to distribute over a DHT so you can just serve your application over torrent without needing to register anything.
The only way to get there is by empowering users and developers and taking some of the control from the cloud platform giants.
From my point of view, the only way to break the browser monopoly now is to create a new path forward, a branch. It's not the time to follow the rules; it's time to break them, or else the future doesn't look so bright, in my opinion.
That's not the only solution -- you could also develop a bot that will do those specific things.
In the days of yore I know of at least three companies that were using IRC bots to similar effect long before webhooks ever existed.
Because of that prior experience, this is how I currently manage a similar set of problems, albeit not on Teams in my current role.
The downside was that the event api required a huge amount of scope, so if you weren’t careful and were compromised, someone could use that token to scrape all messages in the system.
Slack recently added socket mode for precisely this reason: https://api.slack.com/apis/connections/socket
* https://zulip.com/api/real-time-events
* https://zulip.com/api/register-queue
* https://zulip.readthedocs.io/en/latest/subsystems/events-sys...
We use this same long-polling based /events API interface for all official clients (web, mobile, terminal), our interactive bots ecosystem (https://zulip.com/api/running-bots), and many integrations (E.g. bridges with IRC/Matrix/etc.).
We also offer webhooks, because some platforms like Heroku or AWS Lambda make it much easier to accept incoming HTTP requests than to do long polling, but the events system has always felt like a nicer programming model.
(Zulip's events system was inspired by separate ~2012 conversations I had with the Meteor and Quora founders about the best way to do live updates in a web application).
A good platform should offer both of these and more (for example Slack does webhooks, REST endpoint, websocket-based streaming and bulk exports), and let the client pick what they want based on their use case.
With webhooks, as in the article, you only get state changes; you need some separate mechanism to achieve (or recover) the initial state.
Kafka solves exactly the issue that the author is complaining about. This is a safeguard to ensure that data isn't dropped in the event of an issue, and provides mechanisms to replay events.
The tradeoffs between pushing and polling have been argued since forever.
In other news, mechanics who work with bolts often do so with ratchets. This is a cumbersome compromise, just give me Torx fasteners!
(And, of course, I don't want Kafka. I want Google PubSub. No, wait, I mean SQS. No, wait, I mean I want ZeroMQ. No, I mean....)
Certainly the event producer is in a better position to maintain a queue without missing events, but it also means they need to buffer more data in their queue system to accommodate your receiver's downtime.
Conceptually, the important thing is each stage waits to "ACK" the message until it's durably persisted. And when the message is sent to the next stage, the previous stage _waits for an ACK_ before assuming the handoff was successful.
In the case that your application code is down, the other party should detect that ("Oh, my webhook request returned a 502") and handle it appropriately -- e.g. by pausing their webhook queue and retrying the message until it succeeds, or putting it on a dead-letter queue, etc. Your app will be "out of sync" until it comes back online and the retries succeed, but it will eventually end up "in sync."
Of course, the issue with this approach is most webhook providers... don't do that (IME). It seems like webhooks are often viewed as a "best-effort" thing, where they send the HTTP request and if it doesn't work, then whatever. I'd be inclined to agree that kind of "throw it over the fence" webhook is not great and risks permanent desync. But there are situations where an async messaging flow is the right decision and believe it or not, it can work! :)
Also, swapping out one messaging system for another is trivial. Pick the one best suited to the environment you're working in, and if that environment changes, changing messaging queues is going to be one of the easiest transitions you'll make.
> If the sender's queue starts to experience back-pressure, webhook events will be delayed, and it may be very difficult for you to know that this slippage is occurring
I've never before seen anyone try to argue that properly dealing with backpressure is a bad thing. The author's proposed model makes this situation even worse. With kafka, consumers can continue processing the event stream and you can continue to serve reads from your primary datastore. With the author's model the event stream lives in your primary datastore, so if that starts to lock up the blast radius is much larger.
HTTP Endpoint -> Push to message queue (kafka, SQS, etc) -> Acknowledge receipt
That's a pretty straight-forward design that's widely used, robust, and easy to put together. I've probably done that same workflow 100s of times without issue.
As long as you guarantee the message was pushed to the queue before acknowledging, that will be fabulously reliable. You need to make contingencies for duplicate messages, but that's not usually difficult.
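A minimal sketch of the receiving end of that pipeline, including the duplicate handling mentioned: acknowledge only after the enqueue succeeds, and dedupe by event id since retries can deliver the same event twice. The in-memory queue and seen-id set are stand-ins for Kafka/SQS and a persistent store.

```python
# Webhook endpoint -> durable enqueue -> ack, with dedupe on retries.
import queue

class WebhookReceiver:
    def __init__(self, q):
        self.q = q
        self.seen_ids = set()     # in production: a persistent store

    def receive(self, event):
        if event["id"] in self.seen_ids:
            return 200            # duplicate delivery: ack, don't re-enqueue
        self.q.put(event)         # the enqueue must succeed first...
        self.seen_ids.add(event["id"])
        return 200                # ...and only then do we acknowledge
```

If the enqueue raises, no 200 is returned, so the sender retries; combined with the dedupe set, that gives effectively-once processing downstream.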
Now, if that premise isn't based in reality, or if it's already been solved some other way, discredit it without giving it too much air time.
A one liner about kafka being cumbersome and then building your own solution, warts and all, doesn't need to exist in the same thought if you've made the reader mentally disregard it as a possible solution.
I think that a combo of webhooks / events is nice, but "what scope do we cut?" is an important question. Unfortunately, it feels like the events part is cut, when I'd argue that events is significantly more important.
Webhooks are flashier from a PM perspective because they are perceived as more real-time, but polling is just as good in practice.
Polling is also completely in your control, you will get an event within X seconds of it going live. That isn't true for webhooks, where a vendor may have delays on their outbound pipeline.
Here's a talk that explains the relationship between synchronization and HTTP in more detail: https://youtu.be/L3eYmVKTmWM?t=235
Oh yes, changing HTTP is so easy.
You can add it to your own website with a few simple headers, and a response that stays open (like SSE) to send multiple updates when state changes: https://datatracker.ietf.org/doc/html/draft-toomim-httpbis-b...
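For reference, the framing of a standard SSE (`text/event-stream`) response, which the long-lived response described above resembles, is simple to produce. This is the generic SSE wire format, not anything Braid-specific:

```python
# Build one SSE frame: optional "id:" line, one "data:" line per line of
# payload, terminated by a blank line.

def sse_frame(data, event_id=None):
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"
```

A server just writes successive frames to the open response whenever state changes; the `id:` field doubles as a resume cursor via the client's `Last-Event-ID` header.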