Second Life has an HTTPS long polling channel between client and server. It's used for some data that's too bulky for the UDP connection, not too time sensitive, or needs encryption. This has caused much grief.
On the client side, the poller uses libcurl. Libcurl has timeouts. If the server has nothing to send for a while, libcurl times out. The client then makes the request again.
This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.
On top of that, the real server is front-ended by an Apache server. This just passes through relevant requests, blocking the endless flood of junk HTTP requests from scrapers, attacks, and search engines. Apache has a timeout, and may close a connection that's in a long poll and not doing anything.
Additional trouble can come from middle boxes and proxy servers that don't like long polling.
There are a lot of things out there that just don't like holding an HTTP connection open. Years ago, a connection idle for a minute was fine. Today, hold a connection open for ten seconds without sending any data and something is likely to disconnect it.
The end result is an unreliable message channel. It has to have sequence numbers to detect duplicates, and can lose messages. For a long time, nobody had discovered that, and there were intermittent failures that were not understood.
In the original article, the chart section labelled "loop" doesn't mention timeout handling. That's not good.
If you do long polling, you probably need to send something every few seconds to keep the connection alive.
It's not clear what a safe interval is.
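For illustration, a minimal sketch of that keep-alive idea in Node.js (the endpoint, payload shape, and 5-second interval are assumptions, not anything from the thread). Leading newlines are legal JSON whitespace, so the heartbeat doesn't break a JSON-parsing client:

```ts
import { createServer } from "node:http";

// Stand-in for however the app really learns that data is ready.
function waitForMessage(): Promise<unknown> {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ messages: ["hello"] }), 25_000)
  );
}

const server = createServer((_req, res) => {
  res.writeHead(200, {
    "Content-Type": "application/json",
    "Cache-Control": "no-cache",
  });

  // Heartbeat: write *something* every few seconds so middleboxes see traffic.
  const heartbeat = setInterval(() => res.write("\n"), 5_000);

  waitForMessage().then((payload) => {
    clearInterval(heartbeat);
    res.end(JSON.stringify(payload));
  });
});

server.listen(8080);
```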
Every problem you just listed is 100% in your control and configurable, so the issue isn't long polling, it's your setup/configs. If your client (libcurl) times out a request, set the timeout higher. If Apache is your web server and it disconnects idle clients, increase the timeout and tell it not to buffer the request but to pass it straight back to the app server. If there's a cloud LB somewhere (sounds like it, because ALB defaults to a 10s idle timeout), increase the timeouts...
Every timeout in every hop of the chain is within your control to configure. Set up a subdomain and send long polling requests through that, so the timeouts can be set higher without impacting regular HTTP requests or opening yourself up to a slow-client DDoS.
Why would you try to do long polling and not configure your request chain to be able to handle them without killing idle connections? The problems you have only exist because you're allowing them to exist. Set your idle timeouts higher. Send keepalives more often. Tell your web servers to not do request buffering, etc.
All of that is extremely easy to test and verify. Does the request live longer than your polling interval? Yes? Great, you're done! No? Tune some more timeouts and log the request chain everywhere you can until you know where the problems lie. Knock them out one by one, going back to the origin, until you get what you want.
Long polling is easy to get right from an operations perspective.
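As a concrete example of what the parent is describing, here is a hypothetical nginx config for a dedicated long-poll subdomain; the upstream name and timeout values are assumptions to tune per deployment:

```nginx
# Long-poll-only subdomain, so the relaxed timeouts never apply to
# ordinary requests. Values are starting points, not gospel.
server {
    server_name poll.example.com;

    location / {
        proxy_pass         http://app_backend;   # assumed upstream name
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
        proxy_buffering    off;     # hand bytes straight back, no buffering
        proxy_read_timeout 120s;    # let the poll sit idle well past default
        proxy_send_timeout 120s;
    }
}
```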
Whilst it's possible you may be correct, I do have to point out you are, I believe, lecturing John Nagle, known for Nagle's algorithm, used in most TCP stacks in the world.
That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue. Perhaps with the client specifying the last id it saw.
And it's worth noting that you can't just ignore this problem if you're using websockets - websockets disconnect sometimes for a variety of reasons. It may be less frequent than a long-polling timeout, but if you don't have some mechanism of detecting that messages weren't ack'd and retransmitting them the next time the user connects, messages will get lost eventually.
> That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue.
How does that help? You can't pop from a queue over HTTP because when the client disconnects you don't know whether it saw your response or not.
I'm new to websockets, please forgive my ignorance — how is sending some "heartbeat" data over long polling different from the ping/pong mechanism in websockets?
I mean, in both cases, it's a TCP connection over (eg) port 443 that's being kept open, right? Intermediaries can't snoop the data if it's SSL, so all they know is "has some data been sent recently?" Why would they kill long-polling sessions after 10sec and not web socket ones?
Sending an idle message periodically might help. But the Apache timeout for persistent HTTPS connections is now only 5 seconds.[1] So you need rather a lot of idle traffic if the server side isn't tolerant of idle connections.
Why such a short timeout? Attackers can open connections and silently drop them to tie up resources. This is why we can't have nice things.
[1] https://httpd.apache.org/docs/2.4/mod/core.html#keepalivetim...
> If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.
I think it's a premise of reliable long-polling that the server can hold on to messages across the changeover from one client request to the next.
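A minimal sketch of that premise, assuming the client echoes back the last sequence number it saw (all names invented): messages stay buffered until a later request proves they were seen, so a response lost to a timeout is simply re-sent; at worst the client sees a duplicate, which the sequence numbers let it drop.

```ts
type Message = { seq: number; body: string };

class Outbox {
  private buffer: Message[] = [];
  private nextSeq = 1;

  push(body: string): void {
    this.buffer.push({ seq: this.nextSeq++, body });
  }

  // Called on each long-poll request; `lastSeen` is the highest seq the
  // client has processed. Anything newer is (re)delivered.
  fetch(lastSeen: number): Message[] {
    this.buffer = this.buffer.filter((m) => m.seq > lastSeen);
    return [...this.buffer];
  }
}
```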
Yeah, some servers close connections when there’s no data transfer. When the backend holds the connection while polling the database until a timeout occurs or the database returns data, it needs to send something back to the client to keep the connection alive. I wonder what could be sent in this case and whether it would require special client-side logic.
It (SignalR) is also fire-and-forget, with fallback to HTTP if web sockets aren't available. I believe if web sockets don't work it can fall back to HTTP long polling instead, but don't quote me on that.
All the downsides of web sockets mentioned in the article are handled for you. Plus you can re-use your existing auth solution. Easily plug logging stuff in, etc. etc. Literally all the problems mentioned by the author are dealt with.
Given the author mentions C# is part of their stack, I don't know why they didn't mention SignalR or use that instead of rolling their own solution.
Phoenix predates Blazor, but they are both server-side rendered frameworks.
In terms of real time updates, Phoenix typically relies on LiveView, which uses web sockets and falls back to long-polling if necessary. I think SignalR is the closest equivalent in the .Net world.
The icing on the cake is that you can also enable Phoenix channels to fall back to long polling in your endpoint config. The generator sets it to false by default.
hah seriously. my app uses web sockets extensively, but since we are also using Phoenix, it's never been a source of conflict in development. it really was just drop it in and scale to thousands of users.
every websocket setup is painless when running on a single server or handling very few connections...
I was on the platform/devops/man-with-many-hats team for an Elixir shop running Phoenix in k8s. WS gets complicated even in Elixir when you have 2+ app instances behind a round-robin load balancer. You now need to share broadcasts between app servers. Here's a situation you have to solve for with any app at scale, regardless of language:
app server #1 needs to send a publish/broadcast message out to a user, but the user who needs that message isn't connected to app server #1 that generated the message, that user is currently connected to app server #2.
How do you get a message from one app server to the other one which has the user's ws connection?
A bad option is sticky connections. User #1 always connects to server #1. Server #1 only does work for users connected to it directly. Why is this bad? Hot spots. Overloaded servers. Underutilized servers. Scaling complications. Forecasting problems. Goes against the whole concept of horizontal scaling and load balancing. It doesn't handle side-effect messages, ie user #1000 takes some action which needs to broadcast a message to user #1 which is connected to who knows where.
The better option: you need to broadcast to a shared broker, something all app servers share a connection to, so they can subscribe to the messages they should handle and then pass them down the user's ws connection. This is a message broker. Postgres can be that broker, just look at Oban for real-world proof. Throw in pg's listen/notify and you're off to the races. But that's heavy from a resources-per-db-connection perspective, so let's avoid the ACID db for this then. Ok. Redis is a good option, or since this is Elixir land, use the built-in distributed Erlang stuff.
But we're not running raw Elixir releases on Linux, we're running inside of containers, on top of k8s. The whole distributed Erlang concept goes to shit once the Erlang procs are isolated from each other and not in their perfect Goldilocks getting-started-readme world. So ok, in containers in k8s, each app server needs to know about all the other app servers running, so how do you do that? Hmm, service discovery! Ok, well, k8s has service discovery already, so how do I tell the Erlang VM about the other nodes that I got from k8s etcd? Ah, a hex package, cool. libcluster to the rescue: https://github.com/bitwalker/libcluster
So we'll now tie the boot process of our entire app to fetching the other app server pod ips from k8s service discovery, then get a ring of distributed erlang nodes talking to each other, sharing message passing between them, this way no matter which server the lb routes the user to, a broadcast from any one of them will be seen by all of them, and the one who holds the ws connection will then forward it down the ws to the user.
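For a concrete picture, here's a sketch of that shared-broker fanout, assuming Redis pub/sub (via ioredis) and the ws package rather than distributed Erlang; channel name and message shapes are invented. Every app server would run this:

```ts
import Redis from "ioredis";
import type { WebSocket } from "ws";

const sub = new Redis();
const pub = new Redis();

// userId -> sockets connected to *this* app server only.
const localSockets = new Map<string, Set<WebSocket>>();

sub.subscribe("broadcast").catch(console.error);
sub.on("message", (_channel, raw) => {
  const { userId, payload } = JSON.parse(raw);
  for (const socket of localSockets.get(userId) ?? []) {
    socket.send(JSON.stringify(payload));
  }
});

// Called from any server; delivery happens wherever the user is connected.
export function broadcastToUser(userId: string, payload: unknown): Promise<number> {
  return pub.publish("broadcast", JSON.stringify({ userId, payload }));
}
```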
So now there's a non trivial amount of complexity and risk that was added here. More to reason about when debugging. More to consider when working on features. More to understand when scaling, deploying, etc. More things to potentially take the service down or cause it not to boot. More things to have race conditions, etc.
Nothing is ever so easy you don't have to think about it.
Would there be any technical benefit to this over using server sent events (SSE)?
Both are similar in that they hold the HTTP connection open and have the benefit of being simply HTTP (the big plus here). SSE (at least to me) feels like it's more suitable for some use cases where updates/results could be streamed in.
A fitting use case might be where you're monitoring all job IDs on behalf of a given client. Then you could move the job monitoring loop to the server side and continuously yield results to the client.
One of the drawbacks, as I learned: SSE connections count against the browser's limit of ~6 open connections per domain (on HTTP/1.x). This can quickly become a limiting factor when you open the same web page in multiple tabs.
Syncing state across multiple tabs and windows is always a bit tricky. For SSE, I'd probably reach for the BroadcastChannel API. Open the SSE connection in the first tab and have it broadcast events to any other open tab or window.
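A sketch of that approach, combining the Web Locks API for "first tab wins" leader election with a BroadcastChannel; the channel name and endpoint are hypothetical:

```ts
const channel = new BroadcastChannel("sse-events");
const handle = (data: string) => console.log("event:", data); // placeholder

void navigator.locks.request("sse-leader", () => {
  // Only the tab that wins the lock gets here; other tabs queue behind it
  // and one of them takes over if this tab closes.
  const source = new EventSource("/events");
  source.onmessage = (e) => {
    handle(e.data);              // BroadcastChannel skips the sender,
    channel.postMessage(e.data); // so the leader handles its own copy
  };
  return new Promise<never>(() => {}); // hold the lock for the tab's lifetime
});

channel.onmessage = (e) => handle(e.data);
```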
Good point! We did consider SSE, but ultimately decided against it due to the way we have to re-implement response payloads (one for application/json and one for text/event-stream).
I've not personally witnessed this, but people on the internets have said that _some_ proxies/LBs have problems with SSE due to the way they buffer responses.
> we have to re-implement response payloads (one for application/json and one for text/event-stream)
I am curious about what you mean here. The 'text/event-stream' format allows for arbitrary event payloads; it just provides structure so EventSource can parse them.
You should only need one 'text/event-stream' implementation and should be able to send the same JSON via a normal or an SSE response.
I tried using SSE and found it didn't work for my use case: it was broken on mobile. When users switched from the browser to an app and back, the SSE connection was broken and they wouldn't receive state updates. It was easier to do long polling.
The standard way to fix that is to send ping messages every ~15 seconds or something over the SSE stream. If the client doesn’t get a ping in any 20 second window, assume the sse stream is broken somehow and restart it. It’s complex but it works.
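Roughly, on the client side (event name, URL, and timings are assumptions):

```ts
const handleData = (d: string) => console.log("update:", d); // placeholder

function connect(url: string): void {
  const source = new EventSource(url);
  let watchdog: ReturnType<typeof setTimeout> | undefined;

  const arm = () => {
    clearTimeout(watchdog);
    watchdog = setTimeout(() => {
      source.close();
      connect(url); // real code would add backoff/jitter here
    }, 20_000);
  };

  arm();
  source.addEventListener("ping", arm); // server pings every ~15s
  source.onmessage = (e) => {
    arm(); // any traffic counts as liveness
    handleData(e.data);
  };
}

connect("/events");
```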
The big downside of SSE in mobile Safari - at least a few years ago - is you got a constant loading spinner on the page. That's bad UX.
Not sure about the default `EventSource` object in JavaScript, but the Microsoft polyfill that I use (https://github.com/Azure/fetch-event-source) supports `POST` and there's an option `openWhenHidden` which controls how it reacts to users tabbing away.
That's the dumbest thing I've ever heard. SSE is just a response type for normal HTTP. Explain exactly what you mean, because that's like saying they are removing response types from HTTP.
I think this article is tying a lot of unrelated decisions to "Websocket" vs "Long-polling" when they're actually independent. A long-polling server could handle a websocket client with just a bit of extra work to handle keep-alive.
For the other direction, to support long-polling clients if your existing architecture is websockets which get data pushed to them by other parts of the system, just have two layers of servers: one which maintains the "state" of the connection, and then the HTTP server which receives the long polling request can connect to the server that has the connection state and wait for data that way.
Having done this, I don't think I'd reduce it to "just a little bit of work" to make it hum in production.
Everything in between your UI components and the database layer needs to be reworked to work in the connection-oriented (Websockets) model of the world vs request-oriented world.
> Everything in between your UI components and the database layer needs to be reworked to work in the connection-oriented (Websockets) model of the world vs request-oriented world.
How so? As a minimal change, the thing on the server end of the websocket could just do the polling of your database on its own while the connection is open (using authorization credentials supplied as the websocket is being opened). If the connection dies, stop polling. This has the nice property that you're in full control of the refresh rate, can implement coordinated backoffs if the database is overloaded, etc.
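A sketch of that minimal change using the ws package; the auth check and status query are stand-ins for the app's real logic:

```ts
import { WebSocketServer } from "ws";

const isAuthorized = (cookie?: string) => Boolean(cookie); // placeholder check
const pollJobStatus = async () => ({ done: false });       // placeholder query

const wss = new WebSocketServer({ port: 8081 });

wss.on("connection", (socket, req) => {
  // Credentials arrive with the upgrade request (e.g. a cookie header).
  if (!isAuthorized(req.headers.cookie)) {
    socket.close(4401, "unauthorized");
    return;
  }

  // The server owns the refresh rate; back off here if the database is loaded.
  const timer = setInterval(async () => {
    socket.send(JSON.stringify(await pollJobStatus()));
  }, 2_000);

  // Connection dies -> stop polling, exactly as described above.
  socket.on("close", () => clearInterval(timer));
});
```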
In the context of Node.js, which is what the OP said, yes, it is easier. But it's a newer thing and most people don't realize timers in Node are awaitable yet, so the other way is less about "works everywhere" and more "this is just what I know".
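For reference, the awaitable timers being referred to live in node:timers/promises (Node 15+); a small sketch, with a hypothetical endpoint:

```ts
// Importing under another name avoids shadowing the global setTimeout.
import { setTimeout as sleep } from "node:timers/promises";

async function pollLoop(): Promise<void> {
  for (;;) {
    const res = await fetch("http://localhost:8080/poll"); // hypothetical endpoint
    console.log(await res.text());
    await sleep(1_000); // awaitable; no `new Promise(r => setTimeout(r, 1000))` idiom
  }
}

pollLoop();
```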
I like long polling, it's easy to understand from start to finish, and from the client's perspective it just works like a very slow connection. You have to keep track of retries and client-side cancelled connections so that you have one, but only one (and the right one), outstanding request at hand to answer.
One thing that seems clumsy in the code example is the loop that queries the data again and again. Would be nicer if the data update could also resolve the promise of the response directly.
Hard disagree. Long polling can have complex message ordering problems. You have completely different mechanisms for message passing from client-to-server and server-to-client. And middle boxes can and will stall long polled connections, stopping incremental message delivery. (Or you can use one http query per message - but that’s insanely inefficient over the wire).
Websockets are simply a better technology. With long polling, the devil is in the details and it’s insanely hard to get those details right in every case.
> you can use one http query per message - but that’s insanely inefficient over the wire
Use one http response per message queue snapshot. Send no more than N messages at once. Send empty status if the queue is empty for more than 30-60 seconds. Send cancel status to an awaiting connection if a new connection opens successfully (per channel singleton). If needed, send and accept "last" id/timestamp. These are my usual rules for long-polling.
Prevents: connection overhead, congestion latency, connection stalling, unwanted multiplexing, sync loss, respectively.
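Those rules translate into a fairly small handler. An in-memory, single-process sketch (all names invented):

```ts
type Respond = (body: string) => void;
const MAX_BATCH = 50;    // "no more than N messages at once"
const IDLE_MS = 30_000;  // "empty status after 30-60 seconds"

const queues = new Map<string, string[]>();  // channel -> pending messages
const waiters = new Map<string, Respond>();  // channel -> the one open poll

export function poll(channel: string, respond: Respond): void {
  // Per-channel singleton: cancel whoever was already waiting.
  const prev = waiters.get(channel);
  if (prev) {
    waiters.delete(channel);
    prev(JSON.stringify({ status: "cancel" }));
  }

  const queue = queues.get(channel) ?? [];
  if (queue.length > 0) {
    respond(JSON.stringify({ status: "ok", messages: queue.splice(0, MAX_BATCH) }));
    return;
  }

  waiters.set(channel, respond);
  setTimeout(() => {
    if (waiters.get(channel) === respond) { // still waiting, nothing arrived
      waiters.delete(channel);
      respond(JSON.stringify({ status: "empty" }));
    }
  }, IDLE_MS);
}

export function push(channel: string, message: string): void {
  const queue = queues.get(channel) ?? [];
  queue.push(message);
  queues.set(channel, queue);

  const waiter = waiters.get(channel);
  if (waiter) {
    waiters.delete(channel);
    waiter(JSON.stringify({ status: "ok", messages: queue.splice(0, MAX_BATCH) }));
  }
}
```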
One of the problems, back in 2001, was that Netscape didn't render correctly if the connection was still open. Hah. I am sure this issue was fixed a long, long time ago, but perhaps there are other issues.
Nowadays I prefer SSE to long polling and websockets.
The idea is: the client doesn't know that the server has new data before it makes a request. With a very simple SSE stream the client is told that new data is there, and then it can request the new data separately if it wants. That said, SSE has a few quirks. One of them: on HTTP/1 the connection counts toward the maximum limit of 6 concurrent connections per browser and domain, so if you have several tabs, you need a SharedWorker to share the connection between the tabs (this quirk probably also applies to long polling and websockets). Another quirk: SSE can't transmit binary data and has some limitations in the textual data it can represent. But for this use case that doesn't matter.
I would use websockets only if you have a real bidirectional data flow or need to transmit complex data.
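A sketch of that notify-then-fetch pattern (endpoints and event name are hypothetical): the SSE stream only says *that* something changed; the client pulls the actual data over plain HTTP.

```ts
const source = new EventSource("/api/changes");

source.addEventListener("changed", async () => {
  const res = await fetch("/api/data"); // ordinary cacheable GET
  render(await res.json());
});

function render(data: unknown): void {
  console.log("fresh data:", data); // placeholder for real UI updates
}
```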
You could have your job status update push an update into an in-memory or distributed cache and check that in your long poll rather than a DB lookup, but that may require adding a bunch of complexity to wire the completion of the task to updating said cache. If your database is tuned well and you don’t have any other restrictions (e.g. serverless where you pay by the IO), it may be good enough and come out in the wash.
Neither Server-Sent Events nor WebSockets have replaced all use cases of long polling reliably. The connection limit of SSE comes up a lot, even if you’re using HTTP/2. WebSockets, on the other hand, are unreliable as hell in most environments. Also, WS is hard to debug, and many of our prod issues with WS couldn’t even be reproduced locally.
Detecting changes in the backend and propagating them to the right client is still an unsolved problem. Until then, long polling is a surprisingly simple and robust solution that works.
> The connection limit of SSE comes up a lot, even if you’re using HTTP/2.
I'm considering using SSE for an app. I'm curious, what problems you've run into? At least the docs say you get 100 connections between the server and a client, but it can be negotiated higher if needed it seems?
https://developer.mozilla.org/en-US/docs/Web/API/EventSource
SSEs are great and more reliable than websockets at a smaller scale, so I'd reach for them despite the issues. That being said, some web servers don't play well with SSE and you'll need to fiddle with it. If you control the web server, then it's not much of a problem.
Since the article mentioned Postgres by name, isn't this a case for using its asynchronous notification features? Servers can LISTEN on a channel, and a Postgres trigger can NOTIFY them when the data changes.
No polling needed, regardless of the frontend channel.
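A sketch with the node-postgres (pg) driver; the channel name and payload are invented:

```ts
import { Client } from "pg";

async function main(): Promise<void> {
  // A dedicated connection LISTENs; the writer (or a trigger) NOTIFYs.
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await client.query("LISTEN job_updates");

  client.on("notification", (msg) => {
    // msg.payload carries whatever NOTIFY sent, e.g. a job id to fan out.
    console.log("job changed:", msg.payload);
  });
}

main();

// On the writing side, inside the transaction that updates the job:
//   SELECT pg_notify('job_updates', '12345');
```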
It would be easier to run Hypermode’s Dgraph as the database and use GraphQL subscriptions from the frontend. But nobody ever got fired for choosing postgres.
lol
https://http.dev/102
100 Continue could be usable as a workaround. Would probably require at a bare minimum some extra integration code on the client side.
Personally I would have enjoyed solving that problem instead of hacking around it, but that's me.
Only that layer knows the URLs for endpoints, the protocols, and the connections - and the proxies between them and your app/components.
I just don't see the point. It doesn't work in the browser and it shadows global.setTimeout which is confusing. Meanwhile the idiom works everywhere.
> You have completely different mechanisms for message passing from client-to-server and server-to-client
Is this a problem? Why should this even be symmetric?
In terms of not getting fired: Postgres is a lot more innovative than most databases, despite the insinuation of the old "nobody ever got fired for buying IBM" line.
By innovative I mean uniquely consistent in shipping performance-related features over the last 10-20 years.