> The (sadly all too) common approach to rarely occurring bugs & edge cases: Pretend like the problem doesn't exist. Blame it on faulty networking, solar flares, etc.
How to tell if someone hasn't worked with a piece of software in production yet? They've never blamed a bug on cosmic radiation :D
https://socket.io/docs/v4/how-it-works/#disconnection-detect... says:

> At a given interval (the `pingInterval` value sent in the handshake) the server sends a PING packet and the client has a few seconds (the `pingTimeout` value) to send a PONG packet back. If the server does not receive a PONG packet back, it will consider that the connection is closed.
So far this describes what we were already doing prior to the fix mentioned in the blog post. However, this next sentence is where Socket.io's solution diverges from ours:
> Conversely, if the client does not receive a PING packet within `pingInterval + pingTimeout`, it will consider that the connection is closed.
Indeed, that looks like a solid way to handle client-side recognition of a broken connection!
--
That said, I'm a little confused because I cannot find any mention of `pingTimeout` in their JS client [0], and `pingInterval` is only mentioned in an implementation of a test server [1]. I wonder if I'm looking at the wrong thing.

[0]: https://github.com/socketio/socket.io-client/search?q=pingti...

[1]: https://github.com/socketio/socket.io-client/search?q=pingin...
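For illustration, that client-side rule is easy to sketch in TypeScript. This is a minimal sketch only, not Socket.io's actual code: the constants are placeholder values (Socket.io sends the real `pingInterval`/`pingTimeout` in its handshake), and the `isPing` predicate stands in for its packet parsing.

```ts
// Sketch: treat the connection as dead if no server PING arrives
// within pingInterval + pingTimeout. Values are placeholders; the real
// ones come from the server's handshake.
const PING_INTERVAL = 25_000; // ms
const PING_TIMEOUT = 20_000; // ms

function watchServerPings(ws: WebSocket, isPing: (data: unknown) => boolean) {
  let deadline: ReturnType<typeof setTimeout> | undefined;

  const arm = () => {
    clearTimeout(deadline);
    deadline = setTimeout(() => {
      // No PING in time: presume the connection is broken and close it,
      // letting the application's reconnect logic take over.
      ws.close();
    }, PING_INTERVAL + PING_TIMEOUT);
  };

  ws.addEventListener("message", (ev) => {
    if (isPing(ev.data)) arm(); // re-arm the watchdog on every server PING
  });
  ws.addEventListener("close", () => clearTimeout(deadline));
  arm();
}
```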
This is what confused me. The discussion made a good case for leaving it up to the browser to implement, but I can't understand why they didn't then require it in the browser WebSocket implementation: they even suggested it and then forgot about it.
Yes! I was both surprised and confused when I saw this. Unless I'm missing something, it means that every application implementing WebSockets has to reinvent the wheel, creating its own client-side ping/pong handler using Data Frames, since browsers neither send Control Frames automatically nor expose an API for sending or acting on them.
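To make that concrete, here's a hedged sketch of that reinvented wheel: an application-level heartbeat over ordinary Data Frames, since the browser API exposes no control-frame ping. The `"ping"`/`"pong"` message strings and timings are made up, and the sketch assumes a cooperating server that echoes `"pong"` for each `"ping"`.

```ts
// Application-level heartbeat over Data Frames (the "reinvented wheel").
// Assumes a server that replies "pong" to every "ping" data frame.
function connectWithHeartbeat(
  url: string,
  intervalMs = 15_000,
  timeoutMs = 5_000,
): WebSocket {
  const ws = new WebSocket(url);
  let pongDeadline: ReturnType<typeof setTimeout> | undefined;

  const timer = setInterval(() => {
    if (ws.readyState !== WebSocket.OPEN) return;
    ws.send("ping"); // an ordinary data frame, not a Control Frame
    pongDeadline = setTimeout(() => ws.close(), timeoutMs); // no pong: give up
  }, intervalMs);

  ws.addEventListener("message", (ev) => {
    if (ev.data === "pong") clearTimeout(pongDeadline);
  });
  ws.addEventListener("close", () => {
    clearInterval(timer);
    clearTimeout(pongDeadline);
    // ...schedule a reconnect here
  });
  return ws;
}
```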
A classic case of a TCP half-open connection. The client/browser side still thinks the WebSocket/TCP connection is alive. This happens because the client is not actively sending any data outbound, which would eventually have surfaced the broken connection. It would be nice if the browser side of the WebSocket connection could also initiate the PING/PONG mechanism.
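For what it's worth, the OS-level counterpart is TCP keepalive, which a server can opt into. A sketch in Node follows (the port and 30-second delay are arbitrary example values); note that kernel keepalive defaults are far too slow for interactive apps and aren't controllable from browser JavaScript, which is why the application-layer ping/pong above is still needed.

```ts
import { createServer } from "node:net";

// Server-side mitigation for half-open TCP connections: ask the OS to send
// keepalive probes after 30s of silence, so dead peers eventually surface
// as socket errors instead of hanging forever.
const server = createServer((socket) => {
  socket.setKeepAlive(true, 30_000);
  socket.on("error", (err) => {
    console.log("peer is gone:", err.message);
  });
});

server.listen(9000); // example port
```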
Interesting read, thanks. I've delved into websockets and hit some interesting issues. I don't think I've had this scenario - that I know of - but this is good to know.
This is a practical implementation when working with WebSockets: when the server gets an error or times out waiting for the client's pong, it closes the connection; meanwhile, if the client sends a "health check" message (whatever message value you choose) and receives no response, it closes the connection and reconnects.
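The server half of that is straightforward with a library that exposes Control Frames. Here's a sketch using the Node `ws` package, following the heartbeat pattern from its docs (the port and 30-second interval are arbitrary); browsers answer PING Control Frames with PONGs automatically, so no client code is needed for this half.

```ts
import { WebSocketServer, WebSocket } from "ws";

type TrackedSocket = WebSocket & { isAlive?: boolean };

const wss = new WebSocketServer({ port: 8080 }); // example port

wss.on("connection", (ws: TrackedSocket) => {
  ws.isAlive = true;
  // Browsers reply to PING Control Frames with PONGs automatically.
  ws.on("pong", () => { ws.isAlive = true; });
});

// Every 30s: terminate anyone who never answered the previous PING,
// then PING everyone else.
const interval = setInterval(() => {
  for (const client of wss.clients as Set<TrackedSocket>) {
    if (client.isAlive === false) {
      client.terminate(); // no PONG since last PING: assume half-open
      continue;
    }
    client.isAlive = false;
    client.ping();
  }
}, 30_000);

wss.on("close", () => clearInterval(interval));
```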
https://socket.io/docs/v4/how-it-works/
See the disconnection detection section.
I do wish we didn't all have to reinvent this wheel though…
https://www.w3.org/Bugs/Public/show_bug.cgi?id=13104
Such a good insight -- seems obvious, but too often the source of gotchas, bad data, and bad user experience.
It’s application layer keepalive.