Back to basics: Why we chose long-polling over websockets

Long polling has some problems of its own.

Second Life has an HTTPS long polling channel between client and server. It's used for some data that's too bulky for the UDP connection, not too time sensitive, or needs encryption. This has caused much grief.

On the client side, the poller uses libcurl. Libcurl has timeouts. If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.

On top of that, the real server is front-ended by an Apache server. This just passes through relevant requests, blocking the endless flood of junk HTTP requests from scrapers, attacks, and search engines. Apache has a timeout, and may close a connection that's in a long poll and not doing anything.

Additional trouble can come from middle boxes and proxy servers that don't like long polling.

There are a lot of things out there that just don't like holding an HTTP connection open. Years ago, a connection idle for a minute was fine. Today, hold a connection open for ten seconds without sending any data and something is likely to disconnect it.

The end result is an unreliable message channel. It has to have sequence numbers to detect duplicates, and can lose messages. For a long time, nobody had discovered that, and there were intermittent failures that were not understood.

In the original article, the chart section labelled "loop" doesn't mention timeout handling. That's not good. If you do long polling, you probably need to send something every few seconds to keep the connection alive. Not clear what a safe number is.

wutwutwat · a year ago

Every problem you just listed is 100% in your control and able to be configured, so the issue isn't long polling, it's your setup/configs. If your client (libcurl) times out a request, set the timeout higher. If apache is your web server and it disconnects idle clients, increase the timeout, tell it not to buffer the request and to pass it straight back to the app server. If there's a cloud lb somewhere (sounds like it because alb defaults to a 10s idle timeout), increase the timeouts...

Every timeout in every hop of the chain is within your control to configure. Setup a subdomain and send long polling requests through that so the timeouts can be set higher and not impact regular http requests or open yourself up to slow client ddos.

Why would you try to do long polling and not configure your request chain to be able to handle them without killing idle connections? The problems you have only exist because you're allowing them to exist. Set your idle timeouts higher. Send keepalives more often. Tell your web servers to not do request buffering, etc.

All of that is extremely easy to test and verify functioanlity. Does the request live longer than your polling interval? Yes? Great you're done! No? Tune some more timeouts and log the request chain everywhere you can until you know where the problems lie. Knock them out one by one going back to the origin until you get what you want.

Long polling is easy to get right from an operations perspective.

moritonal · a year ago

Whilst it's possible you may be correct, I do have to point out you are, I believe, lecturing John Nagle, known for Nagle's algorithm, used in most TCP stacks in the world.

tbillington · a year ago

> Every timeout in every hop of the chain is within your control to configure.

lol

Deleted Comment

mike-cardwell · a year ago

That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue. Perhaps with the client specifying the last id it saw.

profmonocle · a year ago

And it's worth noting that you can't just ignore this problem if you're using websockets - websockets disconnect sometimes for a variety of reasons. It may be less frequent than a long-polling timeout, but if you don't have some mechanism of detecting that messages weren't ack'd and retransmitting them the next time the user connects, messages will get lost eventually.

lmm · a year ago

> That race condition has nothing to do with long polling, it's just poor design. The sender should stick the message in a queue and the client reads from that queue.

How does that help? You can't pop from a queue over HTTP because when the client disconnects you don't know whether it saw your response or not.

interroboink · a year ago

I'm new to websockets, please forgive my ignorance — how is sending some "heartbeat" data over long polling different from the ping/pong mechanism in websockets?

I mean, in both cases, it's a TCP connection over (eg) port 443 that's being kept open, right? Intermediaries can't snoop the data if its SSL, so all they know is "has some data been sent recently?" Why would they kill long-polling sessions after 10sec and not web socket ones?

Animats · a year ago

Sending an idle message periodically might help. But the Apache timeout for persistent HTTPS connections is now only 5 seconds.[1] So you need rather a lot of idle traffic if the server side isn't tolerant of idle connections.

Why such a short timeout? Attackers can open connections and silently drop them to tie up resources. This is why we can't have nice things.

[1] https://httpd.apache.org/docs/2.4/mod/core.html#keepalivetim...

Myrmornis · a year ago

> If the server has nothing to send for a while, libcurl times out. The client then makes the request again. This results in a race condition if the server wants to send something between timeout and next request. Messages get lost.

I think it's a premise of reliable long-polling that the server can hold on to messages across the changeover from one client request to the next.

rednafi · a year ago

Yeah, some servers close connections when there’s no data transfer. When the backend holds the connection while polling the database until a timeout occurs or the database returns data, it needs to send something back to the client to keep the connection alive. I wonder what could be sent in this case and whether it would require special client-side logic.

mhitza · a year ago

In the HTTP model, technically status code 102 Processing would fit best. Though, no longer part of the HTTP specification.

https://http.dev/102

100 Continue could be usable as a workaround. Would probably require at a bare minimum some extra integration code on the client side.

Articles like this make me happy to use Phoenix and LiveView every day. My app uses WebSockets and I don’t think about them at all.

pipes · a year ago

Is this similar to Microsoft's blazer?

mattmanser · a year ago

MS's web socket solution is called signalr.

It is also fire and forget with fall over to http if web sockets aren't available. I believe if web sockets don't work it can fall over to http long polling instead, but don't quote me on that.

All the downsides of web sockets mentioned in the article are handled for you. Plus you can re-use your existing auth solution. Easily plug logging stuff in, etc. etc. Literally all the problems mentioned by the author are dealt with.

Given the author mentions C# is part of their stack I don't know why they didn't mention signalr or use that instead of rolling their own solution.

abrookewood · a year ago

Phoenix predates Blazer, but they are both server-side rendered frameworks.

In terms of real time updates, Phoenix typically relies on LiveView, which uses web sockets and falls back to long-polling if necessary. I think SignalR is the closest equivalent in the .Net world.

pipes · a year ago

Odd, I wonder why I got down voted for it, it was a genuine question

dugmartin · a year ago

The icing on the cake is that you can also enable Phoenix channels to fallback to longpolling in your endpoint config. The generator sets it to false by default.

cultofmetatron · a year ago

hah seriously. my app uses web sockets extensively but since we are also using Phoenix, its never been source of conflict in development. it really was just drop it and scale to thousands of users.

arrty88 · a year ago

Why couldn’t nodejs with uWS library or golang + gorilla handle 10s of thousands of connections?

jtchang · a year ago

This is using elixir right?

abrookewood · a year ago

Yes.

zacksiri · a year ago

I was thinking this exact thing as I was reading the article.

wutwutwat · a year ago

every websocket setup is painless when running on a single server or handling very few connections...

I was on the platform/devops/man w/ many hats team for an elixir shop running Phoenix in k8s. WS get complicated even in elxir when you have 2+ app instances behind a round robin load balancer. You now need to share broadcasts between app servers. Here's a situation you have to solve for w/ any app at scale regardless of language

app server #1 needs to send a publish/broadcast message out to a user, but the user who needs that message isn't connected to app server #1 that generated the message, that user is currently connected to app server #2.

How do you get a message from one app server to the other one which has the user's ws connection?

A bad option is sticky connections. User #1 always connects to server #1. Server #1 only does work for users connected to it directly. Why is this bad? Hot spots. Overloaded servers. Underutilized servers. Scaling complications. Forecasting problems. Goes against the whole concept of horizontal scaling and load balancing. It doesn't handle side-effect messages, ie user #1000 takes some action which needs to broadcast a message to user #1 which is connected to who knows where.

The better option: You need to broadcast to a shared broker. Something all app servers share a connection to so they can themselves subscribe to messages they should handle, and then pass it to the user's ws connection. This is a message broker. postgres can be that broker, just look at oban for real world proof. Throw in pg's listen/notify and you're off to the races. But that's heavy from a resources per db conn perspective so lets avoid the acid db for this then. Ok. Redis is a good option, or since this is elixir land, use the built in distributed erlang stuff. But, we're not running raw elixir releases on linux, we're running inside of containers, on top of k8s. The whole distributed erlang concept goes to shit once the erlang procs are isolated from each other and not in their perfect Goldilocks getting started readme world. So ok, in containers in k8s, so each app server needs to know about all the other app servers running, so how do you do that? Hmm, service discovery! Ok, well, k8s has service discovery already, so how do I tell the erlang vm about the other nodes that I got from k8s etcd? Ah, a hex package cool. lib_cluster to the rescue https://github.com/bitwalker/libcluster

So we'll now tie the boot process of our entire app to fetching the other app server pod ips from k8s service discovery, then get a ring of distributed erlang nodes talking to each other, sharing message passing between them, this way no matter which server the lb routes the user to, a broadcast from any one of them will be seen by all of them, and the one who holds the ws connection will then forward it down the ws to the user.

So now there's a non trivial amount of complexity and risk that was added here. More to reason about when debugging. More to consider when working on features. More to understand when scaling, deploying, etc. More things to potentially take the service down or cause it not to boot. More things to have race conditions, etc.

Nothing is ever so easy you don't have to think about it.

conradfr · a year ago

Did you try the Redis adapter that ships with Phoenix.PubSub (which Channels use)?

Muromec · a year ago

If only there was a way to send messages from one host to another in elixir

leansensei · a year ago

Same here. It truly is a godsend.

tzumby · a year ago

I came here to say exactly this! Elixir and OTP (and by extension LiveView) are such a good match for the problem described in the post.

j45 · a year ago

I was kind of wondering how something hadn't solve this at all, compared to a solution not readily being on one's path.

diggan · a year ago

Articles like this make me happy to use Microsoft FrontPage and cPanel, I don't think about HTTP or WebSockets at all.

ipnon · a year ago

CharlieDigital · a year ago

Would there be any technical benefit to this over using server sent events (SSE)?

Both are similar in that they hold the HTTP connection open and have the benefit of being simply HTTP (the big plus here). SSE (at least to me) feels like it's more suitable for some use cases where updates/results could be streamed in.

A fitting use case might be where you're monitoring all job IDs on behalf of a given client. Then you could move the job monitoring loop to the server side and continuously yield results to the client.

vindex10 · a year ago

I got interested and found this nice thread on SO: https://stackoverflow.com/a/5326159

One of the drawbacks, as I learned - SSE have limit on number of up to ~6 open connections (browser + domain name). This can quickly become a limiting factor when you open the same web page in multiple tabs.

bioneuralnet · a year ago

I've gotten around this by using the Page Visibility API - https://developer.mozilla.org/en-US/docs/Web/API/Page_Visibi.... Close the SSE connection when the page is hidden, and re-open when it becomes visible again.

As the other two comments mentioned, this is a restriction with HTTP/1.1 and it would apply also to long polling connections as well.

_heimdall · a year ago

Syncing state across multiple tabs and windows is always a bit tricky. For SSE, I'd probably reach for the BroadcastChannel API. Open the SSE connection in the first tab and have it broadcast events to any other open tab or window.

arresin · a year ago

…if you’re using http/1.1. It’s not an issue with 2+

Klonoar · a year ago

Not an issue if you’re using HTTP/2 due to how multiplexing happens.

lunarcave · a year ago

Good point! We did consider SSE, but ultimately decided against it due to the way we have to re-implement response payloads (one for application/json and one for text/event-stream).

I've not personally witnessed this, but people on the internets have said that _some_ proxies/LBs have problems with SSE due to the way it does buffering.

gorjusborg · a year ago

> we have to re-implement response payloads (one for application/json and one for text/event-stream)

I am curious about what you mean here. The 'text/event-stream' allows for abitrary event formats, it just provides structure for EventSource to be able to parse.

You should only need one 'text/event-stream' and should be able send the same JSON via normal or SSE response.

csumtin · a year ago

I tried using SSE and found it didn't work for my use case, it was broken on mobile. When users switched from the browser to an app and back, the SSE connection was broken and they wouldn't receive state updates. Was easier to do long polling

josephg · a year ago

The standard way to fix that is to send ping messages every ~15 seconds or something over the SSE stream. If the client doesn’t get a ping in any 20 second window, assume the sse stream is broken somehow and restart it. It’s complex but it works.

The big downside of sse in mobile safari - at least a few years ago - is you got a constant loading spinner on the page. Thats bad UX.

Not sure about the default `EventSource` object in JavaScript, but the Microsoft polyfill that I use (https://github.com/Azure/fetch-event-source) supports `POST` and there's an option `openWhenHidden` which controls how it reacts to users tabbing away.

bythreads · a year ago

SSE are being removed from various stacks and implementations

sudodevnull · a year ago

That's the dumbest thing I've ever heard. SSE is just a response type for normal HTTP. Explain exactly what you mean cause that's like saying that are removing response types from HTTP.

yuliyp · a year ago

I think this article is tying a lot of unrelated decisions to "Websocket" vs "Long-polling" when they're actually independent. A long-polling server could handle a websocket client with just a bit of extra work to handle keep-alive.

For the other direction, to support long-polling clients if your existing architecture is websockets which get data pushed to them by other parts of the system, just have two layers of servers: one which maintains the "state" of the connection, and then the HTTP server which receives the long polling request can connect to the server that has the connection state and wait for data that way.

harrall · a year ago

It sounded like the author(s) just had existing request-oriented code and didn’t want to rewrite it to be connection-oriented.

Personally I would enjoyed solving that problem instead of hacking around it but that’s me.

Author here.

Having done this, I don't think I'd reduce it to "just a little bit of work" to make it hum in production.

Everything in between your UI components and the database layer needs to be reworked to work in the connection-oriented (Websockets) model of the world vs request-oriented world.

> Everything in between your UI components and the database layer needs to be reworked to work in the connection-oriented (Websockets) model of the world vs request-oriented world.

How so? As a minimal change, the thing on the server end of the websocket could just do the polling of your database on its own while the connection is open (using authorization credentials supplied as the websocket is being opened). If the connection dies, stop polling. This has the nice property that you're in full control of the refresh rate, can implement coordinated backoffs if the database is overloaded, etc.

It's a good pattern to have a vanilla js network manager / layer in fe for this exact reason - makes swapping network technologies a lot simpler.

Only that knows url for endpoints, protocols and connections - and proxies between them and your app / components

wereHamster · a year ago

Unrelated to the topic in the article…

    await new Promise(resolve => setTimeout(resolve, 500));

In Node.js context, it's easier to:

    import { setTimeout } from "node:timers/promises";
    await setTimeout(500);

hombre_fatal · a year ago

I haven't used that once since I found out that it exists.

I just don't see the point. It doesn't work in the browser and it shadows global.setTimeout which is confusing. Meanwhile the idiom works everywhere.

joshmanders · a year ago

You can alias it if you're worried about shadowing.

    import { setTimeout as loiter } from "node:timers/promises";
    await loiter(500);

treve · a year ago

Is that easier? The first snippet is shorter and works on any runtime.

In the context of Node.js, where op said, yes it is easier. But it's a new thing and most people don't realize timers in Node are awaitable yet, so the other way is less about "works everywhere" and more "this is just what I know"

baumschubser · a year ago

I like long polling, it’s easy to understand from start to finish and from client perspective it just works like a very slow connection. You have to keep track of retries and client-side cancelled connections to have one but only one (and the right one) of requests at hand to answer to.

One thing that seems clumsy in the code example is the loop that queries the data again and again. Would be nicer if the data update could also resolve the promise of the response directly.

Hard disagree. Long polling can have complex message ordering problems. You have completely different mechanisms for message passing from client-to-server and server-to-client. And middle boxes can and will stall long polled connections, stopping incremental message delivery. (Or you can use one http query per message - but that’s insanely inefficient over the wire).

Websockets are simply a better technology. With long polling, the devil is in the details and it’s insanely hard to get those details right in every case.

wruza · a year ago

you can use one http query per message - but that’s insanely inefficient over the wire

Use one http response per message queue snapshot. Send no more than N messages at once. Send empty status if the queue is empty for more than 30-60 seconds. Send cancel status to an awaiting connection if a new connection opens successfully (per channel singleton). If needed, send and accept "last" id/timestamp. These are my usual rules for long-polling.

Prevents: connection overhead, congestion latency, connection stalling, unwanted multiplexing, sync loss, respectively.

You have completely different mechanisms for message passing from client-to-server and server-to-client

Is this a problem? Why should this even be symmetric?

_nalply · a year ago

One of them 2001 was that Netscape didn't render correctly if the connection is still open. Hah. I am sure this issue has been fixed a long, long time ago, but perhaps there are other issues.

Nowadays I prefer SSE to long polling and websockets.

The idea is: the client doesn't know that the server has new data before it makes a request. With a very simple SSE the client is told that new data is there then it can request new data separately if it wants. This said, SSE has a few quirks, one of them that on HTTP/1 the connection counts to the maximum limit of 6 concurrent connections per browser and domain, so if you have several tabs, you need a SharedWorker to share the connection between the tabs. But probably this quirk also appllies to long polling and websockets. Another quirk, SSE can't transmit binary data and has some limitations in the textual data it represents. But for this use case this doesn't matter.

I would use websockets only if you have a real bidirectional data flow or need to transmit complex data.

moribvndvs · a year ago

You could have your job status update push an update into an in-memory or distributed cache and check that in your long poll rather than a DB lookup, but that may require adding a bunch of complexity to wire the completion of the task to updating said cache. If your database is tuned well and you don’t have any other restrictions (e.g. serverless where you pay by the IO), it may be good enough and come out in the wash.

Neither Server-Sent Events nor WebSockets have replaced all use cases of long polling reliably. The connection limit of SSE comes up a lot, even if you’re using HTTP/2. WebSockets, on the other hand, are unreliable as hell in most environments. Also, WS is hard to debug, and many of our prod issues with WS couldn’t even be reproduced locally.

Detecting changes in the backend and propagating them to the right client is still an unsolved problem. Until then, long polling is surprisingly simple and a robust solution that works.

pas · a year ago

Robust WS solutions need a fallback anyway, and unless you are doing something like Discord long polling is a reasonable option.

infamia · a year ago

> The connection limit of SSE comes up a lot, even if you’re using HTTP/2.

I'm considering using SSE for an app. I'm curious, what problems you've run into? At least the docs say you get 100 connections between the server and a client, but it can be negotiated higher if needed it seems?

https://developer.mozilla.org/en-US/docs/Web/API/EventSource

SSEs are great and more reliable than websockets in a smaller scale. So I'd reach for it despite the issues. But that being said, some websevers don't play well with SSE and you'll need to fiddle with it. If you control the webserver, then it's not much of a problem.

imglorp · a year ago

Since the article mentioned Postgres by name, isn't this a case for using its asynchronous notification features? Servers can LISTEN to a channel and PG can TRIGGER and NOTIFY them when the data changes.

No polling needed, regardless of the frontend channel.

Yes, but the problems of detecting that changeset and delivering it to the right connection remains to be solved in the app layer.

cluckindan · a year ago

It would be easier to run Hypermode’s Dgraph as the database and use GraphQL subscriptions from the frontend. But nobody ever got fired for choosing postgres.

I have relatively recently taken steps towards Postgres from it's abiality to be at the center of so much until a project outgrows it.

In terms of not getting fired - Postgres is a lot more innovative than most databases, and the insinuation of IBM.

By innovative I mean uniquely putting in performance related items for the last 10-20 years.