throwaway892238 · 3 years ago
Yes!!! I have been saying for years that lower level protocols are a bad joke at this point, but nobody in the industry wants to invest in making things better. There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

What's kind of hilarious about this paper is, these are just the network-layer problems! It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols. And then there's all the non-backend problems!

And that's just TCP. We still lack any way to communicate up and down the stack of an entire transaction, for example for debugging purposes. We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

alexgartrell · 3 years ago
> corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is ridiculous.

Hyperscalers see an immediate ROI from efficiency/reliability improvements and actively invest in TCP alternatives all the time. It's just really hard.

Networking companies see an ability to differentiate their products from their peers and work on this kind of thing as well. I did a 3 second google for "QUIC acceleration Mellanox" and got a hit on Nvidia's blog right away.

You just can't trivially replace something with an investment totaling 50 years of clock time and thousands of years of engineer time. It will either take a long time or a massive shift in needs/technology. FWIW, I wouldn't be surprised if the high-performance RDMA networks being put together for AI workloads were the thing that grew into the "next" thing.

oconnor663 · 3 years ago
> 50 years of clock time and thousands of years of engineer time

It's not just the size of the investment, it's that it's the protocol everyone uses to talk to other people's machines, and you can't upgrade or replace other people's machines.

samgaw · 3 years ago
> FWIW, I wouldn't be surprised if the high-performance RDMA networks being put together for AI workloads were the thing that grew into the "next" thing.

Maybe we were just early in giving (HFT) customers RDMA back in ~2007[1][2] but I don't see it entering the mainstream anytime soon. And after a relatively short 20 years of adoption, the "next" thing for hyperscalers is not going to be the next thing for everyone else.

[1] https://downloads.openfabrics.org/Media/IB_LowLatencyForum_2...

[2] https://www.thetradenews.com/wombat-and-voltaire-break-milli...

kortilla · 3 years ago
> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is severe bullshit on two fronts:

- there is an immediate return on value - Google was driving this a decade+ ago for improvements in the data center (things like doubled+cancelable rpc, tcp cubic, quic, etc)

- academia constantly attempts to make these improvements as well because researchers are super incentivized to dethrone TCP for the glory. There are constant attempts to re-invent various layers (IP, TCP, the non-existent upper layers of the OSI model, etc.) that come out of academic conferences every year.

The reason we’re still here is because our current stacks have been heavily optimized and tooled for production workloads. NICs can transparently re-assemble TCP segments for the OS and they can segment before transmit. You have to have a damn good value prop to throw away everything from software and hardware to careers and curriculum. It has to be a shitload better than the security nightmare of “return back each layer of the stack”.

twawaaay · 3 years ago
I don't think you realise why this is so hard.

The basic reason is that software at every level expects TCP/IP. And you can't drop in a translation layer because it will require at least the same amount of overhead as "real" TCP/IP.

It is not a local problem, it is a global problem that affects basically every single piece of non-trivial software in existence.

Even if you construct your datacenter with the new protocol you will run into the problem that you can't run anything in it. Want Python? Sorry, have to rewrite it. And every Python library. And every Python application. Then you need to deal with the problem that people who can run their scripts on their own machines can't run them in the datacenter. And so on.

The reason nobody wants to do this is that they would be investing a huge amount of money to solve a problem for everybody else. Because the only way to make a TCP/IP replacement work is to make it completely free and available to everybody.

There are much better ways to allocate your funds and precious top-level engineers, ways that let you distance yourself from the competition, at least temporarily.

bsder · 3 years ago
> corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

And yet every time hardware designers get the chance they redesign Ethernet and IPv4--poorly.

See: HDMI 2.0+, USB 3.0+, Thunderbolt 3.0+, etc.

My suspicion is that this paper works fine between pairs of peers and immediately goes straight to hell after that. It is extremely suspicious that there is zero mention of SCTP, and that it only compares to TCP and not UDP.

The problem with RPC is that multiple organizations must agree on meaning. And that's just not going to fly. It is damn near a miracle that a huge number of institutions all agree on the Ethernet/IP command "Please take this bag of bytes closer to the machine named: <string of bytes>."

dooglius · 3 years ago
Why do you say that these protocols are worse than Ethernet/IPv4? I'm not intimately familiar with any at L2/L3, but I don't think any have hacks as bad as ARP. (USB does have some weirdness at L1 though I know.)
idlehand · 3 years ago
Never thought about that before. Ethernet supports extremely high data rates. USB-C for integrated charging and data transfer makes sense, but why are there HDMI cables?
friendzis · 3 years ago
> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols

I think this is more of an artefact of horizontal scaling and port contention. The de-facto standard discovery mechanism, DNS, does not resolve ports, so the "well-known port" abstraction kinda fails. HTTP as a tunnel mostly avoids/sidesteps this problem.

> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them.

This is weird take or I don't understand it. If you can communicate with an edge node in another network, but the edge node has issues communicating with some inner node (on your behalf), then, as a user, you have no hope of fixing that connectivity issue anyway, regardless of whether layered approach is used or not. This may be related to previous point about http as universal tunnel. Yes, this is a problem, but in a way that communications are effectively terminated at the edge node and monstrosity of stuff happens behind the scenes

patrec · 3 years ago
> De-facto standard discovery mechanism DNS does not work with ports

Yes, it does, see SRV records.
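A rough sketch of what SRV-based discovery gives you: each record carries a priority, a weight, a port, and a target, and clients are supposed to pick the lowest priority and break ties by weighted random choice (RFC 2782). The records below are made up for illustration:

```python
import random

# Hypothetical parsed SRV records; each tuple is
# (priority, weight, port, target) per RFC 2782.
RECORDS = [
    (0, 60, 993, "mail1.example.com"),
    (0, 40, 993, "mail2.example.com"),
    (10, 0, 993, "backup.example.com"),
]

def pick_srv(records, rng=random):
    """Pick one (port, target): lowest priority wins,
    ties broken by weighted random choice."""
    lowest = min(p for p, _, _, _ in records)
    group = [r for r in records if r[0] == lowest]
    total = sum(w for _, w, _, _ in group)
    roll = rng.uniform(0, total) if total else 0
    for _, w, port, target in group:
        roll -= w
        if roll <= 0:
            return port, target
    return group[-1][2], group[-1][3]

port, host = pick_srv(RECORDS, random.Random(1))
print(host, port)  # one of mail1/mail2; backup is only used if both are gone
```

A real resolver would fetch these records for a name like `_imap._tcp.example.com` instead of hard-coding them, which is exactly the port discovery the parent comment says DNS lacks.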

AtNightWeCode · 3 years ago
I would say ports are mainly a problem on layers below transport even though some tech overuse ports.
simplotek · 3 years ago
> Yes!!! I have been saying for years that lower level protocols are a bad joke at this point, but nobody in the industry wants to invest in making things better. There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

If this were true, then how do you explain that the likes of AWS, the same company that ended up investing in developing their own processor line, don't seem to think any of the pet peeves you mentioned are worth fixing?

emn13 · 3 years ago
It's not obvious to me that replacing TCP really is harder than designing your "own" chip. Scarequotes here because those graviton chips (that's what you're referring to, I think?) are of course ARM chips, so they're not designing something fresh; they're adapting a very mature design to their own needs. In terms of interoperability, a custom chip based on a standard design is probably a simpler, more locally addressable problem than new network protocols.

Isn't it plausible that graviton was designed yet TCP retained simply because graviton as a project is easier to complete successfully?

gjulianm · 3 years ago
> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

I work at a company that builds network troubleshooting/observability tools, and we have some pretty experienced analysts to tell you what's wrong with the network. With that context: your idea of having a tool automatically diagnose network issues is a pipe dream.

The problem with networks is that they're very complex systems, with multiple elements along the way, made by different manufacturers, often with different owners; failures aren't always easily reproducible; and there is human configuration (and therefore error) at almost every step of the way. Even if a tool that "returns each layer of the stack" would be useful, it still would be far from enough to diagnose issues.

EricE · 3 years ago
> The problem with networks is that they're very complex systems, with multiple elements along the way, made by different manufacturers, often with different owners

Ah, how people forget the early days of networking. I remember vividly the early days of the Networld/Interop trade show - Interop was in the name because if, as a vendor, your equipment couldn't integrate with the show network, they would throw your booth off the show floor.

That's how bad interoperability in the early days was!

duped · 3 years ago
Every major corporation has multiple research organizations doing nothing but invest in things that don't have immediate shareholder value.

What you're talking about, though, isn't just coming up with new ideas or even new products. It's replacing hundreds of billions in infrastructure wholesale. The scale at which these changes need to happen to be practical is the cluster level in a single data center. If you can propose something that fits that bill, there are a few companies willing to pay you millions in salary as an engineering fellow to do it.

simplotek · 3 years ago
> What you're talking about though isn't just coming up with new ideas or even new products. It's replacing hundreds of billions in infrastructure wholesale.

I'd put it differently: it's paying up hundreds of billions in infrastructure to have some sort of gain.

And which gain is that exactly?

I see a lot of "the world is dumb but I am smart" comments in this thread, but I saw no one presenting any clear advantage or performance-improvement claim about hypothetical replacements. I see a lot of "we need to rewrite things" comments, but not a single case made with a clear tradeoff presented. Every criticism of TCP/IP in this thread sounds like change for the sake of change, without any tangible improvement or clear performance gain in mind.

Wouldn't that explain why TCP is so prevalent, and no one in their right mind thinks of replacing it?

arka2147483647 · 3 years ago
> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols. And then there's all the non-backend problems!

The paper argues, in section '3.1 Stream orientation', that stream orientation is a problem for TCP; it says that most apps send messages instead, and that a better protocol should handle messages natively, etc. Which is a good point, I think.

But back to TCP. What do you do, if you need to send Messages between applications in TCP? Preferably those Messages would be encrypted also.

You could make up your own protocol, but you probably would rather not! So you use something that is readily available and does messages, encryption, etc. It would be nice if there were also ready-to-use load balancers, caches, tools to debug it, etc.

Now, what would be such a protocol.

Why HTTPS, of course.

So I kind of think that the lack of a low-level message protocol has led us, as an industry, to coalesce these features bit-by-bit on top of HTTP. It's not perfect by any means, but it does the job.

pclmulqdq · 3 years ago
HTTPS adds a tremendous amount of overhead to give you messaging. It's a lot better from a hyperscaler's perspective to replace TCP and not use the byte stream abstraction. After all, networks send messages. It's silly to throw that away at one layer and try to get it back at the next layer.
ajross · 3 years ago
Surely most of your ideas are already being deployed in QUIC/HTTP3. It just happens inside a UDP datagram, for compatibility. Really you're not going to see any new IP protocol layers, there's too much quirky hardware on the network that wouldn't be able to handle it. If we can't even get IPv6 to work all the way to the client, we're never seeing new values for the protocol byte.
vlovich123 · 3 years ago
Don’t the hyperscaled cloud providers run totally segmented networks? What’s stopping them from using something proprietary internally and just exposing TCP at the end for termination of client connections?
bigDinosaur · 3 years ago
Your ideas are interesting, can you link to or explain a concrete example though? The idea of everything magically debugging itself doesn't apply to a single piece of software I've ever seen, so I'm curious what kind of design would lead to that being possible.
Areading314 · 3 years ago
Here's an example of an improvement for sending large files over long distances: the Tsunami protocol. It tries to get the best of both worlds, limiting the detrimental effect of synchronous round trips in TCP for file transfers:

https://tsunami-udp.sourceforge.net/

jstimpfle · 3 years ago
> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP

If you have ever used multiple TCP or UDP connections in parallel on a single machine (doesn't matter if server or client) then you should realize that ports are required.

Apart from that, you can run HTTP on ports other than 80. You can also use HTTP to load balance or do service discovery by means of redirects. (Caveat: I don't work in this field and can't say how well the approach works in practice.)

robertlagrant · 3 years ago
> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

This is just not true. Stuff needs to be funded and worth doing, and the internet, like almost everything, is built on making things worth paying for; but loads of improvements are being made everywhere.

fragmede · 3 years ago
What is QUIC in your book?

Given, say, $50 million of dev time, what would you go about fixing? And in what way?

OmarAssadi · 3 years ago
In addition to QUIC, KCP [1] is another reliable low-latency protocol that sits on top of UDP that might be interesting. And unlike RFC 9000/9001 (QUIC), encryption is optional. I haven't really seen it mentioned much outside of primarily China-focused projects, like V2Ray [2], but there is also some English information in their Git repo [3].

[1]: <https://github.com/skywind3000/kcp>

[2]: <https://www.v2fly.org/en_US/>

[3]: <https://github.com/skywind3000/kcp/blob/master/README.en.md>

Luker88 · 3 years ago
IMHO QUIC is nice, but a disappointment, since it could have been so much more.

It does not handle unreliable messages, still only does (multi)streaming, has no direct support for multicast, has 0-RTT which needs a lot of stuff to be done manually TheRightWay or you risk amplification attacks, dropped the (imho) under-researched forward error correction, and more.

I just restarted working on what I consider to be the solution to this, federated authentication and a bit more, but $50M is too far to be even a dream since I am not google.

Areading314 · 3 years ago
Doesn't QUIC still run over TCP? I thought it was a replacement for HTTP not TCP (Edit: looks like it replaces TCP and HTTP)
still_grokking · 3 years ago
> There are so many improvements we could be making, but corporations don't see any "immediate shareholder value", so they sit around happy as pigs in shit with the status quo.

I would affirm that. It's imho true for almost everything in IT tech.

How computers "work" today is just pure madness when you look at all closely.

Everything's a result of "historic accidents" back in the day, and from there the usual race to the bottom driven by market forces.

Nobody is willing to touch any of the lower layers, no matter how crazy they are from today's viewpoint. We just shovel new layers on top to paper over the mistakes of the past. Nothing gets repaired, or, what would be even more important, rethought from the ground up in light of new technological possibilities and changed requirements.

I understand from the economic standpoint how this comes about. But I'm also quite sure we haven't made any fundamental improvements in the last 50 years of computing.

It's a very bad sign when everything in a field that's not even really 100 years old has been frozen in time for 50 years because everything's so fragile and complex that fundamental changes aren't possible. This looks like a textbook example of a house of cards…

Given how vital IT tech is to modern life I fear that this will crash at some point in the worst way possible.

And even if it doesn't crash, which I really, strongly hope, we will never have nice things again, as none of the old rotten things can reasonably be changed.

KaiserPro · 3 years ago
> We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues

That's virtual networking, but it introduces latency if it's not well configured.

> But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf

Not really; assuming you have the right fabric, it's nowhere near as hard as that. Plus you seem to be forgetting that there is more to the network than TCP. There is a whole physical layer with lots of semantics that greatly affect how easy it is to debug the higher levels.

starfallg · 3 years ago
> We still lack any way to communicate up and down the stack of an entire transaction, for example for debugging purposes. We should have a way to forward every single layer of the stack across each hop, and return back each layer of the stack, so that we can programmatically determine the exact causes of network issues, automatically diagnose them, and inform the user how to solve them. But right now, you need a human being to jump onto the user's computer and fire up an assortment of random tools in mystical combinations and use human intuition to divine what's going on, like a god damn Networking Gandalf. And we've been doing it this way for 40+ years.

This violates the principle of encapsulation that the entire field of networking is based on, not to mention being a massive security hole.

Deleted Comment

cdogl · 3 years ago
I'll defer to experts on the network-layer problems, but I'm not sure what you see as the problem with converging on HTTP. It's awkward and inelegant, but as a backend application developer I never feel like it gets in my way.
guenthert · 3 years ago
> It completely ignores that the "port number" abstraction for service identification has completely failed due to the industry glomming onto HTTP as some sort of universal tunnel encapsulation for all application-layer protocols.

Nobody forces them to, though. It would be much easier to publish a standard port-number mapping than to develop one (or multiple) new protocols. Now you just need to motivate people to use it.

peter_retief · 3 years ago
At last hopefully there is light at the end of the tunnel. Big question for me is who is going to build it?

Dead Comment

KaiserPro · 3 years ago
I get where this is coming from, but no. We don't need to replace TCP in the datacentre.

Why?

because for things that are low latency, need rigid flow control, or have some other 99.99%-utilisation use case, one doesn't use TCP. (Storage, which is high throughput, low latency, and has rigid flow control, doesn't use TCP; well, ignoring NFS and iSCSI.)

Look, if it really was that much of a problem then everyone in datacentres would move to RDMA over InfiniBand. For shared-memory clusters, that's what's been done for years, but for general-purpose computing it's pretty rare. Most of the time it's not worth the effort. InfiniBand is cheap now, so it's not that hard to deploy RDMA[1]-type interconnects. Having a reliable layer 2 with inbuilt flow control solves a number of issues, even if you are just slamming IP over the top.

Shit, even 25/100 gig is cheap now, so most of your problems can be solved by putting extra NICs in your servers and having fancypants distributed LACP-type setups on your top-of-rack/core network.

The biggest issue is that it's not the network that's constraining throughput; it's either processing or some other non-network IO.

[1]I mean it is hard, but not as hard as implementing a brand new protocol and expecting it to be usable and debuggable.

pclmulqdq · 3 years ago
The reality of today's large datacenters is that almost all of them have almost all of their traffic on TCP unless the owners of the datacenter have made a conscious effort to not use TCP. The highest-traffic applications, usually databases and storage systems, pretty much all use TCP unless you are buying a purpose-built HPC scale-out storage system (like a Lustre cluster). Most people who build a datacenter today use databases or object stores for storage, not Lustre or dedicated fiber channel SANs. On top of that, pub/sub systems all use TCP today, logging tends to be TCP, etc.
KaiserPro · 3 years ago
Fibre channel is dead, long live fibre channel.

I agree a lot of things are on TCP, but I don't think it's a massive problem, unless you are running close to the limit of your core network. And one solution to that is to upgrade your core network....

Failing that, implement some load balancing/partitioning system to make sure data-processing affinity is best matched. This is the better solution, because it yields other advantages as well. But it's not the easiest, unless you have a good scheduler

wmf · 3 years ago
You're missing the fact that Stanford is the farm team for Google and Google is hyperscale. At scale, your "just spend more money" solutions are in fact more expensive than creating a new protocol. And like k8s, the new protocol can be sold to startups so they can "be like Google".
KaiserPro · 3 years ago
You're missing the point that maybe, just maybe, I'm part of a team that looks after >5 million servers.

You might also divine that while TCP can be a problem, a bigger problem is data affinity. Shuttling data from the rack next door costs less than from one in the next hall, and significantly less than from the datacentre over. With each internal hop, the risk of congestion increases.

You might also divine that changing everything from TCP to a new, untested protocol across all services, with all that associated engineering effort, plus translation latency, might not be worth it. Especially as now all your observability and protocol routing tools don't work.

Quick maths: a faster top-of-rack switch possibly costs about the same as 5 days of wages for a mid-level Google engineer. How many new switches do you think you could buy with the engineering effort required to port everything to the new protocol, and have it stable and observable?

As a side note, "oh but they are Google" is not a selling point. Google has Google problems, half of which are related to their performance/promotion system, which penalises incremental changes in favour of $NEW_THING. HTTP/2.0 was also largely a Google effort designed to tackle latency over lossy network connections, which it fundamentally didn't do, because a whole bunch of people didn't understand how TCP worked and were shocked to find out that mobile performance was shit.

still_grokking · 3 years ago
Google does not use K8s internally.

They never did, they won't ever do that!

K8s does not scale. Especially not to "Google scale".

First step to "be like Google" would be to ditch all that (Docker-like) "container" madness and just compile static binaries. Then use something like Mesos to distribute workloads. Build literally everything as custom-made, purpose-built solutions, and avoid almost anything off the shelf.

"Being like Google" means not using any third party cloud stuff, but build your own in-house.

But this advice wouldn't sell GCP accounts, so Google does not tell you that. Instead they tell you some marketing balderdash about "how to be like Google".

ksec · 3 years ago
AWS is true hyperscale, even more so than Google. And yet their spend-more-money-on-hardware solution seems to work fine.
TheRealDunkirk · 3 years ago
God, I love it when the talk turns hyper-technical around here, and the Jedi masters turn up.

Dead Comment

amluto · 3 years ago
The paper explicitly addresses Infiniband.
KaiserPro · 3 years ago
Not really. They conflate InfiniBand with RoCE, which, given that they have different congestion-control semantics, I'd say is a bit of a whoopsey.

If they are using RoCE, are they using DCB to avoid loss (well, make it "lossless")? The paper implies otherwise.

bayindirh · 3 years ago
IB does not run TCP/IP by default. You can either run TCP over IB, which has a performance penalty, or you can run it directly in Ethernet mode, which is something completely different.
colinmhayes · 3 years ago
> The biggest issue is that it's not the network that's constraining throughput; it's either processing

To be fair the paper talks a bit about how TCP makes multithreading slower compared to a message based system.

counttheforks · 3 years ago
> Storage, which is high throughput, low latency and has rigid flow control, doesn't [well ignore NFS and iscisi] use TCP)

So storage doesn't use TCP, except for the protocols that are actually used, which do use TCP?

KaiserPro · 3 years ago
Depends on what you are using. For connecting block stores, you'll use some sort of fabric: Fibre Channel, SAS, NVMe over something or other.

If you are using GPFS, then you can do stuff over IB, but I don't know how that works. Lustre I imagine does lustre things over RDMA.

For everything else, NFS all the things. pNFS means that you can just throw servers at the problem and let the network figure it out.

But again, if IO speed is critical, you move IO over to a dedicated fabric of some sort. For most things NFS is good enough. (Except databases: it's possible but not great. But then, depending on your Docker setup, you might be kneecapping your performance because overlayfs is causing IO amplification.)

josephg · 3 years ago
> The data model for TCP is a stream of bytes. However, this is not the right data model for most datacenter applications. Datacenter applications typically exchange discrete messages to implement remote procedure calls

This isn't just a datacenter problem. Every single network protocol I've ever created or implemented is message based, not stream based. Every messaging system. Every video game. Every RPC transport.

But, because we can't have nice things, message framing has to be re-implemented on top of TCP in a different, custom way by every single protocol. I've basically got message framing-over-TCP in muscle memory at this point, in each of the variants you commonly see.
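For illustration, here is one common variant of that muscle memory: a minimal length-prefix framing sketch (4-byte big-endian length before each message; nothing standard implied, just one of the usual home-grown schemes):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a message with its 4-byte big-endian length."""
    return struct.pack("!I", len(payload)) + payload

def unframe(buf: bytes):
    """Extract (messages, leftover) from a byte stream that may hold
    zero or more complete frames plus the start of a partial one."""
    msgs = []
    while len(buf) >= 4:
        (n,) = struct.unpack("!I", buf[:4])
        if len(buf) < 4 + n:
            break  # partial frame: keep buffering until more bytes arrive
        msgs.append(buf[4:4 + n])
        buf = buf[4 + n:]
    return msgs, buf

# Simulate a read that caught the second frame mid-flight.
stream = frame(b"hello") + frame(b"world")[:6]
msgs, rest = unframe(stream)
print(msgs)  # [b'hello'] -- the partial frame stays in the buffer
```

Every protocol picks its own variant (varint lengths, delimiters, fixed headers), which is exactly the duplicated work being complained about.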

The only kinda-sorta exceptions I know about are HTTP/1.1 and telnet. But HTTP/1.1 is still a message oriented protocol; just with file-sized messages. (And even this stops being true with http2 anyway).

In my opinion, the real problem is the idea that "everything is a file". Byte streams aren't a very useful abstraction. "Everything is a stream of messages" would be a far better base metaphor for computing.

yaantc · 3 years ago
SCTP [1] is there to provide a reliable message based protocol. And it does work inside a datacenter. The issue is outside the datacenter: it doesn't work reliably across the Internet due to middle boxes dumping anything not TCP or UDP...

But inside a controlled environment like a datacenter, it works. It's been used in the telecommunication world to carry control messages in the radio access and core networks for example. So it's been tested at scale for critical applications.

[1] https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...

cryptonector · 3 years ago
> The only kinda-sorta exceptions I know about are HTTP/1.1 and telnet. But HTTP/1.1 is still a message oriented protocol; just with file-sized messages. (And even this stops being true with http2 anyway).

No, HTTP/2 and QUIC do not change the semantics of HTTP.

Also, you can have endless streams with HTTP/1.1: just use chunked encoding to POST/PUT and use Range: bytes=0- and chunked encoding for GET and chunked encoding for POST response bodies. In HTTP/2 there's only the equivalent of chunked encoding -- there's no definite content length in HTTP/2.
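A toy sketch of the chunked encoding in question: each chunk is a hex length, CRLF, the data, CRLF, and a zero-length chunk terminates the body. This is a simplified illustration that ignores chunk extensions and trailers:

```python
def chunk(data: bytes) -> bytes:
    """Encode one HTTP/1.1 chunk: hex size, CRLF, data, CRLF."""
    return b"%x\r\n" % len(data) + data + b"\r\n"

def dechunk(body: bytes) -> bytes:
    """Decode a complete chunked body back into the original bytes."""
    out, i = b"", 0
    while True:
        j = body.index(b"\r\n", i)
        n = int(body[i:j], 16)
        if n == 0:
            return out  # zero-length chunk ends the body
        out += body[j + 2:j + 2 + n]
        i = j + 2 + n + 2  # skip the data and its trailing CRLF

wire = chunk(b"hello, ") + chunk(b"world") + chunk(b"")  # b"0\r\n\r\n" terminator
print(dechunk(wire))  # b'hello, world'
```

The sender can emit chunks forever without ever declaring a total length, which is what makes endless streams possible over HTTP/1.1.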

josephg · 3 years ago
Correct me if I’m wrong, but doesn’t h2 still break up requests and responses into smaller message frames in order to do multiplexing?

Those message frames are what I’m talking about - as I understand it, they are, yet again, a message oriented protocol layered on top of tcp.

amluto · 3 years ago
Once chunked encoding is in the picture, even HTTP/1.1 sends messages, not streams, under the hood.
MisterTea · 3 years ago
> In my opinion, the real problem is the idea that "everything is a file".

Files are just an indexed list of bytes that can represent anything. Think of them as objects.

> Byte streams aren't a very useful abstraction. "Everything is a stream of messages" would be a far better base metaphor for computing.

I don't understand this. A stream of messages is a stream of bytes. Its bytes all the way down.

Aissen · 3 years ago
Because most protocols need to handle message loss, retransmits, and proper ordering? And we haven't even started to talk about congestion yet… TCP is useful, and while I'd like to see a message-based protocol drop one (or more) of those constraints in favor of something custom, I feel like it would end up re-implementing the same features, because these are very useful properties to have…

Edit: the proposal in the article is actually quite sensible, but requires redesigning your apps… And I'd like to see how it performs: TCP is a hugely optimized beast (when it works well), with hardware offloads, kernel optimizations, etc.

Dead Comment

ithkuil · 3 years ago
Even TCP itself uses discrete messages under the hood :-)
josephg · 3 years ago
Hah yes; although TCP segments can be arbitrarily refragmented and rejoined as they travel through the network before reaching your destination.

If you ever play an indie game which seems unusually janky over wifi, it's probably because the code isn't correctly rejoining fragmented network packets at the application level. Ethernet is remarkably well behaved in this regard. Wifi is much better at shaking out buggy code.
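The fix for that class of bug is to buffer received bytes and only emit complete messages. A minimal illustrative sketch, assuming 4-byte big-endian length-prefix framing:

```python
# The buggy pattern assumes each recv() returns exactly one game message,
# which happens to work on a quiet LAN and breaks when TCP refragments.
# The fix: accumulate bytes and only pop off complete framed messages.
import struct

def extract_messages(buffer: bytearray):
    """Pop every complete length-prefixed message off the front of
    `buffer`, leaving any trailing partial message in place."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack(">I", bytes(buffer[:4]))
        if len(buffer) < 4 + length:
            break  # the rest of this message hasn't arrived yet
        messages.append(bytes(buffer[4 : 4 + length]))
        del buffer[: 4 + length]
    return messages
```

Feed it whatever recv() happens to return; it yields zero or more whole messages per call.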

pgorczak · 3 years ago
Finding the right abstraction isn't easy in a network stack. TCP's is less useful for application logic, but it reflects the way the protocol works internally: a bunch of bytes goes in on one side, and bytes stream out on the other side in chunks whose size depends on congestion, the physical layer, and other factors. A message-based API hides these facts or leaks them, depending on how you look at it.
bruce343434 · 3 years ago
What’s the issue with fitting messages in streams?
bheadmaster · 3 years ago
One of the issues I can think of is head-of-line blocking [0].

If you're sending messages of different priorities over the same channel and an error occurs while sending a low-priority message, high-priority messages will have to wait until the low-priority message is properly re-transmitted.

[0] https://en.wikipedia.org/wiki/Head-of-line_blocking
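A toy model (purely illustrative) of that effect: when a single ordered channel forces in-order delivery, prompt arrivals wait behind a retransmitted message:

```python
# Head-of-line blocking sketch: message n cannot be delivered before
# every message with a lower sequence number, so one slow retransmit
# delays everything queued behind it.

def deliver_in_order(arrivals):
    """arrivals: list of (seq, arrival_time). Returns {seq: delivery_time},
    where each message is delivered no earlier than all lower seqs."""
    delivered = {}
    ready = 0.0
    for seq, t in sorted(arrivals):
        ready = max(ready, t)  # can't deliver before predecessors
        delivered[seq] = ready
    return delivered

# seq 0 is lost and retransmitted (arrives at t=300); seqs 1 and 2
# arrived promptly but are stuck behind it until then.
times = deliver_in_order([(0, 300.0), (1, 10.0), (2, 11.0)])
```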

imtringued · 3 years ago
I don't bother implementing that crap anymore. I just use websockets, which are message-oriented out of the box.
langsoul-com · 3 years ago
What do you build that requires implementing network protocols?
josephg · 3 years ago
Multiplayer video games, realtime updating webpages, database bindings, sync protocols (I play around with CRDTs a lot), p2p distributed systems, whatever really!

Implementing a wire protocol seems to come up about once every couple of years. And for context, I've been programming now for about 30 years.

Luker88 · 3 years ago
A few years ago (when QUIC was coming out) I was developing the theory of a new transport/encryption/authentication protocol. The focus was as much on transport as on the built-in federated authentication.

There was not much interest in the field and I had a lot of the theory and formal proofs, but no implementation.

This month I found a cofounder and we are reordering a lot of the information and presentation, we should start asking for funds in more or less a month.

I still believe my solution to be much more complete than anything in use today (again: on paper), but since there seems to be some interest today, I'll ask here: can anyone suggest some seed funds to check out for a startup? We will be based half US, half EU.

For more details, fenrirproject.org (again: old stuff there, ignore the broken code)

mikepurvis · 3 years ago
Very interesting stuff, but not going to lie, I have trouble imagining how you'd build a viable company around this kind of thing.

Infrastructure/protocol companies are always going to be a very tough sell (Sandstorm) unless there's a compelling freemium model like with GitLab, Cloudbees, Sentry, etc.

Luker88 · 3 years ago
The infrastructure/protocol will need to remain open, since this kind of thing works better the bigger the user base is. It will probably spin off into its own foundation as soon as it is viable.

The income will come from another project built directly on top of this, managing the domain and its users/devices, plus other stuff, mainly for businesses.

I don't see much need to go into details right now, but we have a clear distinction in mind between what is infrastructure and what will be the product.

Again, still in the housekeeping phase, just looking for potential future funds once we finish this phase

kanwisher · 3 years ago
Why don’t you actually build gasp a prototype before asking for money
Luker88 · 3 years ago
Yeah, thank you for the kind comments implying I don't need money (aka: my time has no value), and for the ironic suggestion to build the prototype.

As I said, the project was started a few years back, and since I did not have the time to work on it then, maybe that means my life does not give me the time and money to build this on the side.

But I'll always find it funny how half of the people go "you need to have solid theory proofs before" and the other half goes "where is the working code".

As I said, we just started some housekeeping and are not ready to launch, and as many point out, it's hard to make money on infrastructure. I know; I did not ask how to make money on this. The idea is to keep the base as open as possible and make money on other services built on top. I was only asking for pointers to funds interested in tech loosely connected to this. And if they don't like our current state or something else, fine; there's no need for you to do their job for them, and in a witty way, too.

throwaway41597 · 3 years ago
Depending on the scope, it's not always possible to self-fund while starting a project.

Deleted Comment

coder543 · 3 years ago
I'm just going to mention that NATS can be used as a general purpose transport, with encryption and a surprisingly capable authentication and authorization system. It also supports federating into clusters and superclusters. NATS has also come a long way in the last several years, in case anyone is thinking of some experiences they had years ago when it didn't have all these features.

The question would have to be "what does your idea/project offer that NATS doesn't already offer?"

I have no affiliation with NATS, but I wish that people were paying more attention to it. It solves a lot of problems people have.

hknmtt · 3 years ago
the thing you want to make requires ZERO money.
tptacek · 3 years ago
"We hypothesize that flow-consistent routing is responsible for virtually all of the congestion that occurs in the core of datacenter networks".

Flow-consistent routing is the constraint that packets for a given TCP 4-tuple get routed through the same network path, rather than balanced across all viable paths; locking a flow to a particular path makes it unlikely that segments will be received out of order on the destination, which TCP handles poorly.
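A toy sketch of that idea (illustrative only, not how any particular switch implements it): ECMP-style hashing maps a 4-tuple deterministically to one of N links, so a flow never reorders, but two elephant flows can share one link for their entire lifetime:

```python
# Flow-consistent (ECMP-style) path selection: every packet of a given
# TCP 4-tuple hashes to the same link index. Deterministic per flow,
# oblivious to load, hence the persistent hot spots the paper describes.
import zlib

def pick_link(src_ip: str, src_port: int,
              dst_ip: str, dst_port: int, n_links: int) -> int:
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_links
```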

hinkley · 3 years ago
Or, by sending the traffic over all routes, there is no way to keep one server from monopolizing all traffic, because each route is oblivious to the stress currently being experienced by all its peers. It has to set a policy using local data, not global data.

The usual failure mode for clever people thinking about software is taking their third-person omniscient view of the system status and thinking they can write software that replicates what a human would do in that situation. We are still so very far from human-level intuition and reasoning.

wmf · 3 years ago
Ultimately one server cannot inject more than one link's worth of traffic (e.g. 100 Gbps) into the network, which is a tiny fraction of total capacity. Researchers have gotten really good results with "spray and pray" for sub-RTT flows combined with latency and queue-depth feedback for multi-RTT flows.
throwaway892238 · 3 years ago
And we could totally construct systems that take some approximation of a global internet state into local routing decisions. But that might devalue some incumbent player's position in the market (or create a new privileged set of players) so even if we made a POC, it wouldn't get adopted.
ghshephard · 3 years ago
This is true, and the congestion mentioned here is subtle and not called out: typically flows are handled in a stateless manner by load balancers that hash on some set of MAC/IP/port features of the packet. This is where congestion occurs, and the paper mentions it here:

    All that is needed for congestion is for two large flows
    to hash to the same intermediate link; this hot spot will persist 
    for the life of the flows and cause delays for any other
    messages that also pass over the affected link.
It makes logical sense, but I'd love to see the evidence for this.

topranks · 3 years ago
“Elephant” flows are definitely a thing.

It all depends on the application and the overall use of the network.

With sufficient flows and a mix of sizes it'll still tend to even out. But if you've got significant high-throughput, long-lived flows, this is definitely something you might hit.

mlerner · 3 years ago
I wrote a summary of one of the approaches for replacing TCP mentioned in the paper (called Homa) here: https://www.micahlerner.com/2021/08/15/a-linux-kernel-implem...
defrost · 3 years ago
Jumping to the end:

> TCP is the wrong protocol for datacenter computing.

> Every aspect of TCP’s design is wrong: there is no part worth keeping.

I cannot disagree and Ousterhout argues well.

> Homa offers an alternative that appears to solve all of TCP’s problems.

I'm well behind the curve on protocols and now I have something to learn more about.

> The best way to bring Homa into widespread usage is to integrate it with the RPC frameworks that underlie most large-scale datacenter applications.

More or less the case for whatever replaces TCP in a tight computing warehouse setup.

colechristensen · 3 years ago
> Every aspect of TCP’s design is wrong

The driver of most of a global network of computers, one which has been wildly successful beyond dreams before it was real… probably deserves a better verdict than “every aspect is wrong”. It has worked fantastically well, and chasing the long tail of performance improvements isn't the same as declaring that what got us here is wrong.

teraflop · 3 years ago
You're cherry-picking an interpretation of a single sentence, when it should be read in the context of the preceding one: Ousterhout says every aspect of TCP's design is wrong for (modern) datacenter computing. He's not saying bad decisions were made at the time it was designed, nor even that it's badly designed for other use cases today.

The first few paragraphs of the article give even more context:

> The TCP transport protocol has proven to be phenomenally successful and adaptable. [...] It is an extraordinary engineering achievement to have designed a mechanism that could survive such radical changes in underlying technology.

> However, datacenter computing creates unprecedented challenges for TCP. [...] The datacenter environment, with millions of cores in close proximity and individual applications harnessing thousands of machines that interact on microsecond timescales, could not have been envisioned by the designers of TCP, and TCP does not perform well in this environment

hinkley · 3 years ago
My Distributed Computing professor said, “now we are going to discuss why Ethernet is a terrible protocol but we use it anyway.”

Like democracy, everything else we’ve tried is even worse.

petesergeant · 3 years ago
"Specifically, Homa aims to replace TCP, which was designed in the era before modern data center environments existed. Consequently, TCP doesn’t take into account the unique properties of data center networks (like high-speed, high-reliability, and low-latency). Furthermore, the nature of RPC traffic is different - RPC communication in a data center often involve enormous amounts of small messages and communication between many different machines."[0]

0: https://www.micahlerner.com/2021/08/15/a-linux-kernel-implem...

Spivak · 3 years ago
Everything about the protocol being wrong for the specific case of machines directly wired to one another over a high-speed, reliable network is not an admonishment of the protocol in general. And the protocol, being an abstract concept, doesn't have feelings to hurt.
curious_cat_163 · 3 years ago
"Although Homa is not API-compatible with TCP, it should be possible to bring it into widespread usage by integrating it with RPC frameworks."

I was about to rant that Prof. Ousterhout should just deploy some of his students and get that transport/RPC integration done to prove his point. But then I tried to look for it first and found this:

https://www.usenix.org/system/files/atc21-ousterhout.pdf

Has anybody tried it in an actual data-center?