This is a wonderful article. Thanks for sharing. As always, Cloudflare blog posts do not disappoint.
It’s very interesting that they are essentially treating IP addresses as “data”. Once you look at the problem through a distributed-systems lens, the solution here maps onto distributed systems almost perfectly.
- Replicating a piece of data on every host in the fleet is expensive, but fast and reliable. The compromise is usually to keep one replica per region; the same way they share a single /32 IP address within a region.
- “Sending a datagram to IP X” is no different from “fetching data X from a distributed system”. This is essentially the underlying philosophy of soft-unicast. Just as data lives in a distributed system/cloud, you no longer know where an IP address is located.
It’s ingenious.
They said they don’t like stateful NAT, which is understandable. But the load balancer still has to be stateful to perform the routing correctly. A follow-up blog post on how they coordinate port/data movements (moving a port from server A to server B) would be interesting, as that is state management too (not very different from moving data in a distributed system, again).
I have a lot of trouble mapping your comment to the content of the article. It is about the egress addresses, the ones CloudFlare uses as the source when fetching from origin servers. Those addresses need to be separated by the region of the end user ("eyeball"/browser) and the CloudFlare service they are using (CDN or WARP).
The cost they are working around is the cost of IPv4 addresses, versus the combinatorial explosion in their allocation scheme (they need number of services * number of regions * whatever dimension they add next, because IP addresses are nothing like data).
I am not sure where you see data replication in this scheme?
It's not meant to be a perfect analogy. The replication analogy is mostly about the tradeoff between performance and cost, so it's less about "replicating" the IP addresses (which is not happening). On that front, maybe "distribution" would be a better term: instead of storing a single piece of data on a single host (unicast), they are distributing it to a set of hosts.
Overall, it seems like they are essentially treating IP addresses as data, which becomes most obvious when they talk about soft-unicast.
Anyway, I just found it interesting to look at this through this lens.
By stateful NAT they mean connection tracking. In the described solution, the LB/router doesn’t track connections - it simply looks up the server via a local mapping from port range to server, and forwards the packet.
Incidentally, this is exactly how GCP Cloud NAT works.
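For illustration only, a minimal sketch of what that stateless lookup could look like; the slice boundaries and server names below are made up, not Cloudflare's actual layout:

    # Hypothetical port-slice table for one shared egress IP. Each server
    # owns a contiguous range of source ports; the router keeps no
    # per-connection state and only consults this static table.
    PORT_SLICES = [
        (2048, 4095, "server-a"),
        (4096, 6143, "server-b"),
        (6144, 8191, "server-c"),
    ]

    def route_return_packet(dst_port: int) -> str:
        """Pick the server that owns dst_port on the shared egress IP."""
        for lo, hi, server in PORT_SLICES:
            if lo <= dst_port <= hi:
                return server
        raise LookupError(f"no server owns port {dst_port}")

    # A reply arriving at shared_ip:5000 is forwarded to server-b,
    # with no connection-tracking table involved.
    assert route_return_packet(5000) == "server-b"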
Whenever I see the name Marek Majkowski come up, I know the blog post is going to be good.
I had to solve this exact problem a year ago when attempting to build an anycast forward proxy, and quickly came to the conclusion that it'd be impossible without a massive infrastructure presence. Ironically, I was using CF connections to debug how they might go about this problem; when I realized they were just using local unicast routes for egress traffic, I stopped digging any deeper.
Maintaining a routing table in Unimog to forward lopsided egress connections to the correct DC is brilliant and shows what is possible when you have a global network to play with. However, I wonder whether this opens up an attack vector where previously distributed connections are now being forwarded & centralized at a single DC, especially if they are all destined for the same port slice...
> However, while anycast works well in the ingress direction, it can't operate on egress. Establishing an outgoing connection from an anycast IP won't work. Consider the response packet. It's likely to be routed back to a wrong place - a data center geographically closest to the sender, not necessarily the source data center!
Slightly OT question, but why wouldn't this be a problem with ingress, too?
E.g. suppose I want to send a request to https://1.2.3.4. What I don't know is that 1.2.3.4 is an anycast address.
So my client sends a SYN packet to 1.2.3.4:443 to open the connection. The packet is routed to data center #1. The data center duly replies with a SYN/ACK packet, which my client answers with an ACK packet.
However, due to some bad luck, the ACK packet is routed to data center #2 which is also a destination for the anycast address.
Of course, data center #2 doesn't know anything about my connection, so it just drops the ACK or replies with a RST. In the best case, I can eventually resend my ACK and reach the right data center (with a multi-second delay); in the worst case, the connection setup will fail.
Why does this not happen on ingress, but is a problem for egress?
Even if the handshake uses SYN cookies and got through on data center #2, what would keep subsequent packets that I send on that connection from being routed to random data centers that don't know anything about the connection?
Yep, it can happen that your packet gets routed to a different DC from a prior packet. But the routers in between the client and the anycast destination will do the same thing if the environment is the same. So to get routed to a new location, you would usually need either:
* A new (usually closer) DC comes online. That will probably be your destination from now on.
* The prior DC (or a critical link on the path to it) goes down.
The crucial thing is that the client will typically be routed to the closest destination to it. In the egress case the current DC may not be the closest DC to the server it is trying to reach so the return traffic would go to the wrong place. This system of identifying a server with unique IP/port(s) means that CF's network can forward the return traffic to the correct place.
Dang, this stuff is super interesting. As someone who knows very little about networking and DNS, I find both your comment and the parent comment incredibly insightful and illuminating! Thanks for sharing.
It works because the route to 1.2.3.4 is relatively stable. The routes would only change and end up at data center #2 if data center #1 stopped announcing the routes. In that case the connection would just re-negotiate to data center #2.
In addition to what others have said about route stability, there is possibly a more relevant point to make: for 'ingress' traffic, the anycast IP is the destination, and for 'egress' traffic it is the source. This distinction is important because the system doesn't have any reasonable way to anticipate where the reply traffic to the anycast IP will actually route to. A packet is launched to the Internet from an anycast IP, and the replies could come to anywhere that IP is announced - most likely not the same place the connection originated. Contrast to when the (unicast) client makes the first move, and the anycast node that gets the initial packets is highly likely to get any follow up packets as well, so in the vast majority of cases, this works just fine.
As others have mentioned, this is not often a problem because routing is normally fairly stable (at least compared to the lifetime of a typical connection). For longer lived connections (e.g. video uploads), it’s more of a problem.
Also, there are a fair number of ASes that attempt to load balance traffic between multiple peering points, without hashing (or only using the src/dst address and not the port). This will also cause the problem you described.
In practice it’s possible to handle this by keeping track of where the connections for an IP address typically ingress and sending packets there instead of handling them locally. Again, since it’s only a few ASes that cause problems for typical connections, it is also possible to figure out which IP prefixes experience the most instability and only turn on this overlay for them.
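- See: https://news.ycombinator.com/item?id=10636547
- And: https://news.ycombinator.com/item?id=17904663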
Besides, SCTP/QUIC-aware load balancers (or proxies) are detached from IPs and should continue to hum along just fine regardless of which server IP the packet ends up at.
As an industry, we are bad at deprecating old protocols like IPv4. This is a genius hack for a problem we have because IPv6 is not adopted widely enough for serving legacy-IP users to become a droppable liability for the business. The ROI is still high enough for us to “innovate” here. I applaud the solution but mourn the fact that we still need this.
I guess ingress is next, then? Two layers of Unimog to achieve stability before TCP/TLS termination maybe.
I've been thinking a lot about this in my own enterprise, and I've increasingly come to the conclusion that IP itself is the wrong abstraction for how the majority of modern networked compute works. IPv6, as a (quite old itself) iteration on top of IPv4 with a bunch of byzantine processes and acronyms tacked on, is solving the wrong problem.
Originally IP was a way to allow discrete physical computers in different locations owned by different organizations to find each other and exchange information autonomously.
These days most compute actually doesn't look like that. All my compute is in AWS. Rather than being autonomous it is controlled by a single global control plane and uniquely identified within that control plane.
So when I want my services to connect to each other within AWS, why am I still dealing with these complex routing algorithms and obtuse numbering schemes?
AWS knows exactly which physical hosts my processes are running on and could at a control plane level connect them directly. And I, as someone running a business, could focus on the higher level problem of 'service X is allowed to connect to service Y' rather than figuring out how to send IP packets across subnets/TGWs and where to configure which ports in NACLs and security groups to allow the connection.
Similarly my ISP knows exactly where Amazon and CloudFlare's nearest front doors are so instead of 15 hops and DNS resolutions my laptop could just make a request to Service X on AWS. My ISP could drop the message in AWS' nearest front door and AWS could figure out how to drop the message on the right host however they want to.
I know there's a lot of legacy cruft and also that there are benefits of the autonomous/decentralized model vs central control for the internet as a whole, but given the centralized reality we're in, especially within the enterprise, I think it's worth reevaluating how we approach networking and whether the continuing focus on IP is the best use of our time.
> my laptop could just make a request to Service X on AWS
I was looking for the "just" that handwaves away the complexity and I was not disappointed.
How do you imagine your laptop expressing a request in a way that it makes it through to the right machine? Doing a traceroute to amazon.com, I count 26 devices between me and it. How will those devices know which physical connection to pass the request over? Remember that some of them will be handling absurd amounts of traffic, so your scheme will need to work with custom silicon for routing as well as doing ok on the $40 Linksys home unit. What are you imagining that would be so much more efficient that it's worth the enormous switching costs?
I also have questions about your notion of "centralization". Are you saying that Google, Microsoft, and other cloud vendors should just... give up and hand their business to AWS? Is that also true for anybody who does hosting, including me running a server at home? If so, I invite you to read up on the history of antitrust law, as there are good reasons to avoid a small number of people having total control over key economic sectors.
> So when I want my services to connect to each-other within AWS why am I still dealing with these complex routing algorithms and obtuse numbering schemes?
> AWS knows exactly which physical hosts my processes are running on and could at a control plane level connect them directly. And I, as someone running a business, could focus on the higher level problem of 'service X is allowed to connect to service Y' rather than figuring out how to send IP packets across subnets/TGWs and where to configure which ports in NACLs and security groups to allow the connection.
You shouldn't be? Doesn't AWS number your machines for you automatically and give you a unique ID you can use with DNS to reach it? And also provide a variety of 'ingress' services to abstract load balancing and security as well? I'm not a consumer of AWS services in my dayjob, but isn't this their entire raison d'etre? Otherwise you may as well just run much cheaper VMs elsewhere.
> Similarly my ISP knows exactly where Amazon and CloudFlare's nearest front doors are so instead of 15 hops and DNS resolutions my laptop could just make a request to Service X on AWS. My ISP could drop the message in AWS' nearest front door and AWS could figure out how to drop the message on the right host however they want to.
Uhm, aside from handwaving away how your ISP is going to give you a direct, no-hops connection to AWS, this is pretty much exactly what your ISP is doing. Hell, in some cases, your ISP has abstracted the underlying backbone hops too using something like MPLS, and this is completely invisible to you as an end user. You or your laptop don't have to think about the network part of things at all. You ask to connect to s3, your laptop looks up the service's IP address (unique ID) in DNS, sends some packets, and your ISP routes them to CloudFlare's nearest front doors.
There are some good arguments to be made for a message-passing focused rather than connection focused protocol model, but that doesn't seem to be what you're talking about. What you seem to be talking about is doing away with routing altogether, and even in a relatively centralized internet, that just makes zero sense. We will continue to need the aggregation layer, we will continue to have multiple routes to a resource through multiple hops that need to be resolved into a path, and we'll continue to need a way to uniquely identify a service endpoint.
The IP addresses you see as an AWS customer aren’t the same used to route packets between hosts. That said, there’s a huge amount of commodity infrastructure built up that understands IP addresses and routing layers, so unless a new scheme offers tremendous benefits, it won’t get adoption.
At least from a security perspective, though, IP ACLs are falling out of favor relative to service-based identities, which is a good thing.
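You can see how AWS internally does networking here: https://m.youtube.com/watch?v=ii5XWpcYYnI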
> Rather than being autonomous it is controlled by a single global control plane and uniquely identified within that control plane.
By default, sure. You can easily bring your own IPs into AWS and use them instead, and I don't think it's hard to imagine the pertinent use cases and risk management this brings.
You misunderstand what "the internet" is.
"The internet" is a process and methodology for connectivity between LANs.
And 48 addressing bits is more than enough to solve this problem.
TCP/IP is full of cruft and causes a shitload of unnecessary problems, but increasing the number of bits in the addressing space solves literally none of them.
The industry will not transition to v6 unless:
1) The cost of not doing so is higher than the cost of sticking with v4. Because of all the numerous clever tricks and products designed to mitigate v4's limitations, the cost argument still favors v4 for most people in most situations.
or
2) We admit that v6 needs to be rethought and rethink it. I understand why v6 does not just increase IP address bits from 32 to 128, but at this point I think everyone has admitted that v6 is simply too difficult for most IT departments to implement. In particular, the complexity of the new assignment schemes like prefix delegation and SLAAC needs to be pared back. Offer a minimum set of features and spin off everything else.
Spot on with part 2. Ignore all the folks saying IPv6 is "simple" or works well.
I am absolutely no expert, and I could get my head around IPv4, but with IPv6 I always end up running into a fuss. I really wish they'd expanded the address space to 64 bits, made a few other tweaks, and called it good. Maybe call it IPv5? Is there any chance of doing something like that?
So many things that are trivial or well known in IPv4 are a total nightmare with IPv6. Some quick examples:
Internet service providers will happily give you a block of static IPv4 addresses for a price. AT&T easily goes up to a 64-address block, even on residential. It is almost impossible to get a static block of IPv6 in the US.
Let's say you are an SMB and you want WAN failover. With IPv4 this is simple. You can either get two blocks of static IPs for your upstreams and route them directly as appropriate to your servers, or go behind a NAT and do a failover option. When the failover happens, your internal network is relatively unaffected.
Now try to do this with IPv6. You can't get static IPs to do direct routing with, NPT and everything else is a mess, and the latency of having your entire network renumber when the WAN side flaps is stupid and annoying.
In many SMB contexts folks are very used to DHCP: they use it to trigger TFTP boot scripts, zero-touch phone provisioning, passing out time servers and other info, and lots more. The set of end-user devices (printers, phones, security cameras, intercoms, industrial IoT) that can be configured and supported with IPv6 is so poor, and the complexity is so high.
Not all ISPs offer prefix delegation to end-user sites. And because IPv6 has an insane minimum subnet size, a lot of things that only need two IPs (think a separate network for customer premises equipment) now need 18,446,744,073,709,551,616 addresses.
The GSE debacle means that instead of a very large 64-bit address space we got an insane 128-bit address space. Seriously, how about 96 bits, or anything a bit more reasonable?
Even things like ICMPv6: if you just let it through the firewall you could be asking for trouble, but blocking it also causes IPv6 problems. Ugh. Oh, but it's simpler than IPv4, they say.
> In particular, the complexity of the new assignment schemes like prefix delegation and SLAAC needs to be paired back.
Prefix delegation to customers is necessary if you want to avoid NAT. IPv4 stuck with delegating just one IP address per customer (household) because everyone used NAT in home routers due to IP address conservation.
SLAAC was introduced because people wanted a simpler alternative to more complex DHCP. That is why it is mandatory, while DHCP is optional.
> but at this point I think everyone has admitted that v6 is simply too difficult for most IT departments to implement.
I do not think that IPv6 is too difficult to implement; I rarely hear this argument. The main reason there is no transition is that there is no economic or political incentive for individual organizations to do so, so there is a coordination problem.
I'm surprised that Cloudflare isn't all IPv6 when Cloudflare is the client. That would solve their address problems. Maybe charge more if your servers can't talk IPv6. Or require it for the free tier.
It's useful that they use client-side certificates. (They call this "authenticated origin pull", but it seems to be client-side certs.)
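https://medium.com/@ss23/leveraging-cloudflares-authenticate...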
Sadly Cloudflare seems to treat client certificates as an optional nifty feature as opposed to the critical feature that it is. And even some of the settings that look secure aren’t:
Authenticated origin pulls should not be “useful”. They should be on and configured securely by default, and any insecure setting should get a loud warning.
Hi there! In the article I only briefly mentioned IPv6; it was long enough already.
Indeed, we do happy-eyeballs where we can and _strongly_ prefer IPv6 when the origin host resolves AAAA. However, we still need IPv4, since there is still a big chunk of traffic that only works over IPv4.
If the client gives us a chance, we'll strongly prefer IPv6.
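For readers unfamiliar with the idea, a rough sketch of "prefer IPv6 when AAAA resolves, fall back to IPv4". This is not Cloudflare's implementation, and real Happy Eyeballs (RFC 8305) races the two address families with a short head start for IPv6 rather than trying them strictly in order:

    import socket

    def connect_prefer_ipv6(host: str, port: int, timeout: float = 3.0) -> socket.socket:
        """Try IPv6 (AAAA) candidates first, then fall back to IPv4 (A)."""
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Put AF_INET6 results ahead of AF_INET ones.
        candidates.sort(key=lambda c: 0 if c[0] == socket.AF_INET6 else 1)
        last_error = None
        for family, socktype, proto, _canonname, sockaddr in candidates:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)
                return sock
            except OSError as err:
                last_error = err
                sock.close()
        raise last_error or OSError(f"no usable address for {host}:{port}")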
Given the discussion under https://news.ycombinator.com/item?id=33743935 about opening a connection to 1.2.3.4:443 (I'm prompted by the same curiosity about ingress load-balancing, statelessly)...
How does the ingress "router" load-balance incoming connections, which it must (even if the "router" is a default host or cluster)? CF isn't opening TCP, then HTTP just to send redirects to another IP for the same cluster.
I guess hashing on IP and port is already readily used in routing decisions, so hashing the 4-tuple of the inbound packets (CF-addr [fixed, shared], CF-port [443], eyeball-addr, eyeball-port) provides a consistent choice of server.
I guess that this is good for 99.9 percent of connections, which are short-lived and opportunistically kept open or reused. I suppose other long-lived connections might take place in the context of an application that tracks data above and outside of TCP alone. I'm grasping for a missing middle, in size of use case, and can't quickly name things that people might proxy but that need stable connections. CloudFlare's reverse proxying to web servers would count, if the web-fronting had to traverse someone else's proxy layer.
What are the rough edges here? What are the next challenges here to build around?
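To make the hashing guess above concrete, a minimal sketch of 4-tuple hashing for picking a server, with a hypothetical server pool; real load balancers typically do this in hardware, or with schemes like Maglev or consistent hashing rather than a naive modulo:

    import hashlib

    SERVERS = ["edge-1", "edge-2", "edge-3"]  # hypothetical pool in one DC

    def pick_server(cf_addr: str, cf_port: int, eyeball_addr: str, eyeball_port: int) -> str:
        """Hash the 4-tuple so every packet of a flow lands on the same server."""
        key = f"{cf_addr}:{cf_port}|{eyeball_addr}:{eyeball_port}".encode()
        digest = hashlib.sha256(key).digest()
        return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]

    # Every packet of one TCP connection carries the same 4-tuple, so it
    # keeps hashing to the same server -- no per-connection state needed.
    assert pick_server("198.51.100.1", 443, "203.0.113.7", 51812) == \
           pick_server("198.51.100.1", 443, "203.0.113.7", 51812)

The obvious rough edge is that changing the size of the pool reshuffles flows under naive modulo hashing, which is one reason consistent-hashing-style schemes exist.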
It's strange/sad how 50% of their problem is caused by geofencing, which is caused by archaic laws.
40% is caused by archaic protocols that don't allow enough addresses or local efficient P2P or ISP caching of public content (sort of what IPFS aimed to do), which would alleviate much of the need for CDNs in the first place. The remaining 10% is just solving hard problems.
I'm a little surprised that splitting by port number gives servers enough connections; maybe they are connection-pooling some of them between their egress and the final destination. If there were truly a 1:1 mapping of all user requests to TCP connections, then I'd expect there to be ~thousands of simultaneous connections to, say, Walmart today (Black Friday), which is also probably on an anycast address, limiting the number of unique src:dst IP and port tuples to 65K per Cloudflare egress IP address. Maybe that ends up being not so bad, and DCs can scale by adding new IPs? https://blog.cloudflare.com/how-to-stop-running-out-of-ephem... covers a lot of the details in solving that and similar problems.
TLDR: Cloudflare is using five bits from the port number as a subnetting & routing scheme, with optional content policy semantics, for hosts behind anycast addressing and inside their network boundary.
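Back-of-the-envelope arithmetic for the port-exhaustion concern above, taking the five-bit slice in this summary at face value (an assumption of this sketch, not a figure from the article):

    # Concurrent TCP connections are bounded by distinct 4-tuples. With a
    # fixed egress IP and a fixed destination ip:443, the only free
    # variable is the source port.
    total_ports = 2 ** 16           # 65,536 possible source ports
    slice_bits = 5                  # assumed, per the TLDR above
    slices = 2 ** slice_bits        # 32 port slices per egress IP
    ports_per_slice = total_ports // slices

    print(slices, ports_per_slice)  # 32 slices of 2,048 ports each
    # So one server's slice of one egress IP caps out around ~2k concurrent
    # connections to a single destination ip:port, and a whole egress IP
    # around ~65k -- hence connection pooling and adding more egress IPs
    # per DC as ways to scale.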