Discord is entirely down right now, both the website and the app itself. Amusingly, a lot of the sites that normally track outages are also down, which made me think it was my internet at first. Downdetector, monitortheinternet, etc.
Lots of other big sites that are down: Patreon, npmjs, DigitalOcean, Coinbase, Zendesk, Medium, GitLab (502), Fiverr, Upwork, Udemy
Edit: 15 min later, looks like things are starting to come back up
My iPhone actually popped up a message saying that my wifi didn't appear to have internet access. That seemed strange and obviously false, since I was actively using the internet on it and on the laptop next to it, but now it makes sense: it must have been pinging something backed by Cloudflare!
Discord attempted to route me to: everydayconsumers.com/displaydirect2?tp1=b49ed5eb-cc44-427d-8d30-b279c92b00bb&kw=attorney&tg1=12570&tg2=216899.marlborotech.com_47.36.66.228&tg3=fbK3As-awso
I can live without creepy instant messengers, but it's shocking just how much everything else relies on one central system. And why is it always Cloudflare?
I did a pickup order a few weeks ago when, of all things, T-Mobile SMS went down for 3+ hours. I couldn't go in the restaurant (covid) and I couldn't text them the parking spot # I was sitting at in a packed parking lot. I got a flood of about 50 texts a few hours later. I sat there for about an hour waiting for a $9 sandwich. I have no idea if they didn't get my order until late, or if they finally realized it was me, or what. About 45 minutes in I decided to just give up on the day and take a nap, and woke up to a door knock.
Kudos to the people at Discord. Just a few minutes after I got disconnected they had already tweeted about the issue. A few minutes after that, they had a message in their desktop app confirming it was an issue with Cloudflare. All while Cloudflare's status page said there were 'minor outages'.
As a percentage of total traffic, a 'minor' outage for Cloudflare probably equates to a significant outage for a non-trivial amount of the internet.
It will also be especially noticeable to end-users, because sites using Cloudflare are typically high-traffic sites, and so a 'minor' issue that affects only a handful of sites is still going to be noticed by millions of people.
I wonder if they are all using Cloudflare's free DNS stuff or if they're paying for business accounts?
My stuff is on Netlify (for the next week or so) and the rest is on a VPS bought from a local business who isn't reselling cloud resources. I'm kinda glad I moved all my stuff from cloudflare.
Crazily, my local name resolution started failing, because I have these name servers configured: 192.168.0.99, 1.1.1.1, and 8.8.8.8. The first does the local resolution, but macOS wasn't consulting it because 1.1.1.1 was failing?? Crazy. When I removed 1.1.1.1 from the list, everything started working.
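The failure mode described above can be sketched as a toy model. This is not how macOS's resolver actually works, and the answer tables below are made up; it just illustrates how a dead server early in the list stalls or breaks every lookup:

```python
import time

# Toy answer tables standing in for real DNS servers (addresses are examples).
LOCAL = {"nas.lan": "192.168.0.42"}          # 192.168.0.99: local names only
PUBLIC = {"example.com": "93.184.216.34"}    # 8.8.8.8: public names only
DEAD = None                                  # 1.1.1.1 during the outage

def query(table, name, timeout):
    """Simulate one lookup: a dead server 'hangs' for the full timeout."""
    if table is None:
        time.sleep(timeout)      # packets go unanswered
        raise TimeoutError(name)
    return table.get(name)       # None stands in for "no such name here"

def resolve(name, servers, timeout=0.1):
    """Strict-ordered failover: try servers in the listed order."""
    for table in servers:
        try:
            answer = query(table, name, timeout)
            if answer is not None:
                return answer
        except TimeoutError:
            continue             # burn the whole timeout, then move on
    raise LookupError(name)

# With the dead resolver listed second, every public lookup stalls for one
# full timeout before 8.8.8.8 answers -- and an OS that gives up early, or
# that stops consulting later servers, fails outright.
print(resolve("nas.lan", [LOCAL, DEAD, PUBLIC]))      # 192.168.0.42
print(resolve("example.com", [LOCAL, DEAD, PUBLIC]))  # 93.184.216.34
```

Real resolvers add caching and server scoring on top of this, which is exactly why behavior during a partial outage can look so erratic.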
Thought something like this was going on. At first I thought it was my router and restarted everything - to no avail. Glad to see confirmation that it wasn't an issue on my end.
Freenode's IRC servers were down, which was unexpected for me. I was expecting old-school communication networks not to have a dependency on Cloudflare.
It really defies the original vision of the internet to have so many services depend on a single company. Almost every news site I was reading dropped off at once. I thought for a second that I lost internet in my own house.
Yes, it's really odd: core backbone providers can go down and everything works like it's supposed to. Even trans-Pacific cables can be cut and things will usually keep working, with only increased latency. But there is not much redundancy for many companies at this layer; having redundant DNS providers is surely possible but not something we think about very often, and of course many of the sites that are down depend on the proxy and DDoS mitigation services too.
On my home network I use Google as a backup DNS provider so the whole internet didn't go dark for me, but I don't have a backup DNS host for my company's DNS records.
Redundant DNS is possible, but challenging when you're making use of features like geo DNS that don't lend themselves to easy replication via zone transfer.
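At the plain-records level, redundancy really is just delegating to nameservers at two unrelated providers; a hypothetical zone fragment (provider hostnames are made up for illustration):

```
; example.com -- NS records split across two independent providers
example.com.  86400  IN  NS  ns1.provider-a.example.
example.com.  86400  IN  NS  ns2.provider-a.example.
example.com.  86400  IN  NS  ns1.provider-b.example.
example.com.  86400  IN  NS  ns2.provider-b.example.
```

Resolvers will spread queries across all listed NS records, so the zone stays resolvable if one provider goes dark. The catch is exactly the parent's point: both providers must serve identical answers, and provider-specific geo or latency-based responses don't survive a plain zone transfer.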
I imagine most people would never expect something like this to happen, so having a fallback option for when Cloudflare has a huge interruption of service like this is just unthinkable.
Agreed, but the real problem is DDoS and nobody seems to know how to globally solve it. Fighting DDoS is expensive, so you see consolidation. It's well and good to live in a tiny farming town but when raiders start attacking every week, those castle walls and guards start to look really appealing.
It's nice that Cloudflare provides their services for free but scrubbing has existed for a long time. With your own address space and an appropriate budget it's not difficult to have Cloudflare/Akamai/AWS announce your IP space with a higher weight than a direct path to your infrastructure. That will give you a little bit more fault tolerance for incidents like these.
That's what we get for externalizing costs. It's not hard to track down sources, but network operators usually let it be, hence the incentives are probably counter-productive.
Agreed, but I think people really underestimated the forces at work that would cause so much consolidation into a couple internet giants.
The original idea was that with the barrier to entry being so low, anyone and everyone could set up their own websites, mail servers, etc.
But with it being so easy to compare and contrast service (i.e. the market being so open), it means that the competitive forces naturally consolidate to a winner-take-all model. If when starting out Cloudflare was just 5% better than the competition, it could have easily taken the vast majority of the mindshare on the internet. Couple that with the fact that there are huge advantages with scale to a business like Cloudflare's, and it's not hard to see how so much of the internet has become dependent on it.
DNS is far less of a single point of failure, and more decentralized, than Cloudflare. Nameservers can be, and are, operated redundantly via simple resolver-side round-robin scheduling, and the TLD servers should have longer TTLs that allow plenty of caching. The root zone even has anycast, thanks to using UDP. Take a moment to look at DoH and laugh.
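The caching point is worth spelling out: with long TTLs, a resolver keeps serving answers even while the authoritative servers are unreachable. A toy sketch (not a real resolver, just the cache logic):

```python
import time

class TtlCache:
    """Tiny stub of a caching resolver: answers survive upstream
    outages until their TTL expires."""
    def __init__(self):
        self.store = {}  # name -> (answer, expiry time)

    def resolve(self, name, upstream, ttl=86400, now=None):
        now = time.monotonic() if now is None else now
        hit = self.store.get(name)
        if hit and hit[1] > now:
            return hit[0]                    # cached: no upstream needed
        answer = upstream(name)              # may raise if provider is down
        self.store[name] = (answer, now + ttl)
        return answer

cache = TtlCache()
cache.resolve("example.com", lambda n: "93.184.216.34", now=0)

def dead(name):                  # authoritative servers stop answering
    raise TimeoutError(name)

# An hour into the outage, a day-long TTL still serves the cached answer.
print(cache.resolve("example.com", dead, now=3600))   # 93.184.216.34
```

Short TTLs (common when fronting with a CDN, so records can be repointed quickly) trade away exactly this resilience.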
You can also register your domain on multiple TLDs.
> Unlike previous DNS replacement proposals, D3NS is reverse compatible with DNS and allows for incremental implementation within the current system.
And the worst part is that if you try to raise concerns about Cloudflare now, it gets brushed off as "CF already proxies half the internet; if it goes down, our stuff will be a minor concern".
I don't understand why the big companies don't always have at least two CDN providers, so they can fail over to another one if something like this happens.
I know a lot of big companies do, but I am always surprised when you see ones that don't.
My CRM was nonfunctional. That's some critical infrastructure for me. And then I'm wondering: is it me, or is it my CRM? Turns out it's door #3: Cloudflare.
The point of the status page is so you can point to it for your five nines SLA and go "look? we were only down for one hour". As soon as the money relies on the metric, the metric will reflect the money.
Despite their update, I like how they're saying only their recursive DNS had "degraded performance" while authoritative is "operational". The entire reason everything blew up was that their authoritative nameservers weren't responding.
Ahh, I remember when AWS went down (I think it was 2 years ago now?), or at least a data center in us-east. A majority of the internet went down, and the status page went down as well. Man, good times.
Status pages are a marketing channel not a channel for developers most of the time. It most likely has to go through some layers before someone updates the status page.
I don't think it's just Cloudflare; I just had a fun 10 minutes seeing servers start flipping on my Server monitoring service[1]. This has only happened once or twice per year, and is usually due to weird global DNS issues.
(To give an update, I'm seeing from my monitoring systems (about 15 points around the globe) sporadic outages for Microsoft, Apple, Reddit, Bing, Node.js, Twitter, Yahoo, and YouTube. And my own servers (not behind CF at all) are also flipping up and down. It started around 21:14 UTC.)
It was interesting that we saw our domains affected from the USA but from Mexico everything looked OK.
The crazier thing is that I tried to login to our CloudFlare account, it never sent me the 2FA code... I still haven't been able to login (Enterprise account)
We were down (downforeveryoneorjustme.com) completely, but back up now (as of a few minutes ago). Our domain wasn't even resolving; we use Cloudflare for frontend and DNS.
We had a surge of people checking if Discord was down on our site, then I noticed everything went down shortly after. Discord is still the top check right now.
I can't ever remember hitting these kind of traffic numbers before.
Funny, I tried to use your site because the website I was trying to access stopped working. But your site was also down so I figured it was just my internet being cranky :/
Thanks! Yep, we have a lot of things on the todo. We want to add more user-focused / location-based outage information since our site is still too reliant on simple HTTP checks to report downtime. This is especially a problem with a Discord outage, for example, where the frontend website is not down, but there might be problems with the API, apps, or other components.
And I'd like to be able to have our site communicate outages like this Cloudflare one, where more than one site might be affected by a larger provider. Automating that is difficult.
This is still a side project, though, so I mostly work on it when I get the urge :)
Something's wonky, because it's not just Cloudflare. One of my personal sites that runs on nothing but a VPS is down, and I noticed my Unifi AP disconnect from its controller a little bit ago. Fiber cut? Routing issues?
Huh. My Ubiquiti was reporting WAN link down during this outage. I'm using ATT fiber. I'm wondering if "link down" doesn't mean what I think it means. Now that I check, it says "WAN iface [eth2] transition to state [inactive]". I'm wondering if that means link down or if it's doing service checking.
I actually have a WAN2 configured but not plugged in and it was set to "Load Balancing: Failover Only" ... I wonder if all of my 'connection issues' were software assuming my network link is down and switching interfaces to an unavailable one.
To reply to myself: if you have a second interface configured for failover, it actually tests against ping.ubnt.com. I bet every single time my ATT fiber has "gone out" for a minute or two at a time, it's been bogus.
root@USG-PRO-4:~# show load-balance watchdog
Group wan_failover
  eth2
    status: Running
    pings: 2
    fails: 0
    run fails: 0/3
    route drops: 0
    ping gateway: ping.ubnt.com - REACHABLE

  eth3
    status: Waiting on recovery (0/3)
    failover-only mode
    pings: 1
    fails: 1
    run fails: 3/3
    route drops: 1
    ping gateway: ping.ubnt.com - DOWN
    last route drop : Fri Jul 17 17:32:58 2020
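A watchdog that hinges on a single ping target will mistake that target's outage for a link failure. One mitigation is requiring several independent targets to fail before declaring the WAN down; a rough sketch of the idea (target list is illustrative, and this is not what the USG firmware actually does):

```python
def link_is_down(ping, targets, quorum=None):
    """Declare the WAN down only if `quorum` of independent targets fail.

    `ping` is any callable returning True on success; the targets should
    sit behind unrelated providers, so that one CDN or DNS outage can't
    masquerade as a dead link.
    """
    if quorum is None:
        quorum = len(targets)        # default: ALL targets must fail
    failures = sum(1 for t in targets if not ping(t))
    return failures >= quorum

# Simulated Cloudflare-style outage: one target fails, the others answer.
reachable = {"ping.ubnt.com": False, "8.8.8.8": True, "9.9.9.9": True}
print(link_is_down(reachable.get, list(reachable)))   # False: link is fine
```

With a single target (the degenerate case, quorum of one) the check fails over on any upstream hiccup, which matches the "bogus" outages described above.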
We can't keep going on like this. The vulnerability of centralised internet infrastructure is a huge problem for everyone. Somebody, somewhere, really ought to sort it all out.
10-20 minute router misconfigurations and subsequent fixes are sometimes a fact of life. Big network infrastructure is complicated, and sometimes the best-laid route tables of mice and men do go abloop and die.
Outages happen no matter what the infrastructure is. There's no solution, they're just something you need to recognize and handle, which Cloudflare seemingly did relatively quickly here.
I feel like for a lot of sites CF & CDNs are the only way to survive Reddit/HN/etc - do you disagree?
I definitely agree with you in concept, but then I think back to how frequently script kiddies took down sites ~10 years ago, or whatever. I feel like what has changed is the massive CDNs in front of so many sites.
So while I do want a better solution, I'm not sure what it looks like. Thoughts?
a) complexity: trick your servers into doing something hard
b) volumetric: overwhelm your servers with a lot of traffic
c) volumetric, part two: overwhelm your servers with a lot of requests, so you respond with a lot of traffic
A and C are things you can work on yourself --- try to limit the amount of work your server does in response to requests, and/or make resource-consuming responses require resource-consuming requests; and monitor and fix hotspots as they're found.
B is tricky; there are two ways to survive a volumetric attack: either have enough bandwidth to drop the packets on your end, or convince the other end to drop the packets (usually called null routing). Null routes work great, but they usually drop all packets to a particular destination IP, which means you need to move your service to another IP if you want it to stay online. That's hard to do if your IP needs to stay fixed for a meaningful time (the TTL for glue records at TLDs is usually at least a day), and IP space is limited, so if your attackers are quick at moving attacks, you could run out of IPs to use. Some attacks are going above 1 Tbps now, so that's a lot of bandwidth if you need to accept and drop; and of course, the more bandwidth people get so they can weather attacks, the more bandwidth that can be used to attack others if it's not well secured.
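For the A/C side, the usual first line of defense is rejecting excess requests cheaply, before any expensive work happens. A minimal per-client token-bucket sketch (the rate and burst numbers are purely illustrative):

```python
import time

class TokenBucket:
    """Allow `rate` requests per second, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over the limit: reject before doing any real work

bucket = TokenBucket(rate=5, burst=10)
results = [bucket.allow() for _ in range(15)]
print(results.count(True))   # roughly the burst of 10 passes; the rest drop
```

In practice you'd keep one bucket per client IP (and expire idle ones), which blunts C-style amplification and crude A-style abuse from any single source --- though, as above, it does nothing against a distributed flood that saturates your pipe.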
(Visit at your own risk.)
Hack?
I even checked to see if an AWS region was down once I realised it wasn't on my side (I thought it might have been my ISP's DNS servers or something).
The next move was to check Hacker News - thankfully it's not also hosted on Cloudflare, ha!
Ah, well. This too shall pass.
That is why if you have this question, you should go to google.com
My guess is that there are more resources invested in making sure google.com stays up than for any other site on the internet.
I loathe Discord, and I can barely contain myself with schadenfreude at this news.
We REALLY need a truly decentralized, distributed DNS system that is not owned by private entities.
https://ieeexplore.ieee.org/document/7530014/authors#authors
What's the point of a status page if it doesn't reflect the real status...
It's either the status page goes down with everything else or the status page is wrong. Great.
EDIT: Looks like it's accurate now, 20 minutes later.
[1]https://en.wikipedia.org/wiki/Goodhart%27s_law
[1] https://servercheck.in/
That could be the slogan for 2020
Level 3 or Telia going offline is perfectly survivable for any customer who has multiple upstreams.
One question is how to do DDoS protection without somebody like Cloudflare. Some new protocol for edge caching, perhaps?