Readit News
Animats · 4 years ago
There's still no connectivity to Facebook's DNS servers:

    > traceroute a.ns.facebook.com
      traceroute to a.ns.facebook.com (129.134.30.12), 30 hops max, 60 byte packets
      1  dsldevice.attlocal.net (192.168.1.254)  0.484 ms  0.474 ms  0.422 ms
      2  107-131-124-1.lightspeed.sntcca.sbcglobal.net (107.131.124.1)  1.592 ms  1.657 ms  1.607 ms 
      3  71.148.149.196 (71.148.149.196)  1.676 ms  1.697 ms  1.705 ms
      4  12.242.105.110 (12.242.105.110)  11.446 ms  11.482 ms  11.328 ms
      5  12.122.163.34 (12.122.163.34)  7.641 ms  7.668 ms  11.438 ms
      6  cr83.sj2ca.ip.att.net (12.122.158.9)  4.025 ms  3.368 ms  3.394 ms
      7  * * *
      ...
So they're hours into this outage and still haven't re-established connectivity to their own DNS servers.

Animats · 4 years ago
"facebook.com" is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" itself is registered with "registrarsafe.com".

I'm not sure of all the implications of those circular dependencies, but it probably makes it harder to get things back up if the whole chain goes down. That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

Anyway, until "a.ns.facebook.com" starts working again, Facebook is dead.

Animats · 4 years ago
Notes as Facebook comes back up:

"registrarsafe.com" is back up. It is, indeed, Facebook's very own registrar for Facebook's own domains. "RegistrarSEC, LLC and RegistrarSafe, LLC are ICANN-accredited registrars formed in Delaware and are wholly-owned subsidiaries of Facebook, Inc. We are not accepting retail domain name registrations." Their address is Facebook HQ in Menlo Park.

That's what you have to do to really own a domain.

robalfonso · 4 years ago
This is not completely accurate. The whole reason a registrar with domain abc.com can use ns1.abc.com is that glue records are established at the registry; this allows a bootstrap that keeps you out of a circular dependency. All that said, it’s usually a bad idea. Someone as large as Facebook should have nameservers across zones, i.e. a.ns.fb.com, b.ns.fb.org, c.ns.fb.co, etc.
jacurtis · 4 years ago
Facebook does operate their own private Registrar, since they operate tens of thousands of domains. Most of these are misspellings and domains from other countries and so forth.

So yes, the registrar that is to blame is themselves.

Source: I know someone within the company that works in this capacity.

thiht · 4 years ago
> That's also probably why we're seeing the domain "facebook.com" for sale on domain sites. The registrar that would normally provide the ownership info is down.

That’s not how it works. The info of whether a domain name is available is provided by the registry, not by the registrars. It’s usually done via a domain:check EPP command or via a DAS system. It’s very rare for registrar to registrar technical communication to occur.

Although the above is the clean way to do it, it’s common for registrars to just perform a dig on a domain name to check if it’s available because it’s faster and usually correct. In this case, it wasn’t.
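The shortcut described above can be sketched roughly as follows. This is a minimal illustration, not any registrar's actual code: both function names are hypothetical, and the DNS response code is passed in as a plain string so the sketch needs no network access. The authoritative answer would still come from the registry (for example via EPP `domain:check`).

```python
# Hypothetical sketch: why "just dig it" can misreport a registered domain.
# The DNS response code (rcode) is supplied by the caller as a string.

def naive_is_available(rcode: str) -> bool:
    """Shortcut: treat any lookup that does not succeed as 'available'."""
    return rcode != "NOERROR"


def cautious_is_available(rcode: str):
    """Return True, False, or None (unknown).

    Only NXDOMAIN hints that the name may be unregistered. SERVFAIL and
    similar failures say nothing about registration, so the caller should
    fall back to asking the registry.
    """
    if rcode == "NOERROR":
        return False
    if rcode == "NXDOMAIN":
        return True
    return None
```

During the outage, lookups for facebook.com came back SERVFAIL, so the naive version would report the domain as available while the cautious one would report "unknown".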

BillinghamJ · 4 years ago
When the NS hostname is dependent on the domain it serves, "glue records" cover the resolution to the NS IP addresses, so there's no circular-dependency issue.
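A toy model may make the glue-record point concrete. This is a sketch, not a real resolver: the `COM_ZONE` dictionary and the function name are assumptions made for illustration, and the name and address are the ones shown in the traceroute at the top of the thread.

```python
# Toy model of a delegation with glue records; not a real resolver.
# COM_ZONE stands in for the parent (.com) zone data held at the registry.

COM_ZONE = {
    "delegations": {"facebook.com.": ["a.ns.facebook.com."]},
    "glue": {"a.ns.facebook.com.": "129.134.30.12"},
}


def nameserver_addresses(domain: str, parent_zone: dict) -> list:
    """Return the IPs of a domain's nameservers using only parent-zone data."""
    addresses = []
    for ns_name in parent_zone["delegations"].get(domain, []):
        glue_ip = parent_zone["glue"].get(ns_name)
        if glue_ip is None:
            # Without glue, resolving an in-zone NS name would require
            # asking the very servers we are trying to locate.
            raise LookupError(f"no glue for {ns_name}")
        addresses.append(glue_ip)
    return addresses
```

Glue only supplies the address. If that address is unreachable, as it was here, the delegation still leads nowhere.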
john37386 · 4 years ago
Good catch. Hopefully, they won't need an email sent to fb.com from registrarsafe.com to update an important record to fix this. What a loop.
mdtancsa · 4 years ago
It's partially there. C and D are still not in the global tables according to routeviews, i.e. 185.89.219.12 is still not being advertised to anyone. My peers to them in Toronto have routes from them, but I'm not sure how far they are supposed to go inside their network. (Past hop 2 is them.)

    % traceroute -q1 -I a.ns.facebook.com
    traceroute to a.ns.facebook.com (129.134.30.12), 64 hops max, 48 byte packets
     1  torix-core1-10G (67.43.129.248)  0.133 ms
     2  facebook-a.ip4.torontointernetxchange.net (206.108.35.2)  1.317 ms
     3  157.240.43.214 (157.240.43.214)  1.209 ms
     4  129.134.50.206 (129.134.50.206)  15.604 ms
     5  129.134.98.134 (129.134.98.134)  21.716 ms
     6  *
     7  *

    % traceroute6 -q1 -I a.ns.facebook.com
    traceroute6 to a.ns.facebook.com (2a03:2880:f0fc:c:face:b00c:0:35) from 2607:f3e0:0:80::290, 64 hops max, 20 byte packets
     1  toronto-torix-6  0.146 ms
     2  facebook-a.ip6.torontointernetxchange.net  17.860 ms
     3  2620:0:1cff:dead:beef::2154  9.237 ms
     4  2620:0:1cff:dead:beef::d7c  16.721 ms
     5  2620:0:1cff:dead:beef::3b4  17.067 ms
     6  *
     7  *
     8  *

boshomi · 4 years ago
Kevin Beaumont:

   »The Facebook outage has another major impact: lots of mobile apps constantly poll Facebook in the background = everybody is being slammed who runs large scale DNS, so knock on impacts elsewhere the longer this goes on.«

https://twitter.com/GossiTheDog/status/1445118907187175427

Twisol · 4 years ago
Oh my gosh, their IPv6 address contains "face:b00c"...

> 2a03:2880:f0fc:c:face:b00c:0:35

mikefromhome · 4 years ago
dead beef sounds about right
kiernanmcgowan · 4 years ago
My suspicion is that since a lot of internal comms runs through the FB domain and since everyone is still WFH, it's probably a massive issue just to get people talking to each other to solve the problem.
okwubodu · 4 years ago
I don’t know how true it is but a few reports claim employees can’t get into the building with their badges.
still_grokking · 4 years ago
You mean the same problem as when GMail goes down and Googlers can't reach each other?

I guess good decentralized public communication services could solve those issues for everybody.

threevox · 4 years ago
LOL - score one against building out all tooling internally (a la Amazon and apparently Facebook too)
strulovich · 4 years ago
Those communications are done over irc at FB for exactly this purpose.
rStar · 4 years ago
time to start working at your mfing desk again, johnson
justinzollars · 4 years ago
What do you think will be the impact on WFH and office requirements?
secondcoming · 4 years ago
Unlikely, PagerDuty was invented for this kind of thing
winternett · 4 years ago
Heck of a coincidence I must say...

I can imagine this affects many other sites that use FB for authentication and tracking.

If people pay proper attention to it, this is not just an average run of the mill "site outage", and instead of checking on or worrying about backups of my FB data (Thank goodness I can afford to lose it all), I'm making popcorn...

Hopefully law makers all study up and pay close attention.

What transpires next may prove to be very interesting.

forgotpwd16 · 4 years ago
Indeed, what happened shows a good reason not to rely only on social log-in for various sites.
kossTKR · 4 years ago
NYT tech reporter Sheera Frenkel gives us this update:

>Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors.

https://twitter.com/sheeraf/status/1445099150316503057

adriancooney · 4 years ago
Got a good chuckle imagining a fuming Zuckerberg not being allowed into his office, thinking the world is falling apart.
rootusrootus · 4 years ago
I just got off a short pre-interview conversation with a manager at Instagram and he had to dial in with POTS. I got the impression that things are very broken internally.
askvictor · 4 years ago
How much of modern POTS is reliant on VOIP? In Australia at least, POTS has been decommissioned entirely, but even where it's still running, I'm wondering where IP takes over?
wolverine876 · 4 years ago
This person has a POTS line in their current location, and a modem, and the software stack to use it, and Instagram has POTS lines and modems and software that connect to their networks? Wow. How well do Instagram and their internal applications work over 56K?
otikthecessna · 4 years ago
I read that as POTUS at first and paused for a minute
dividedbyzero · 4 years ago
What is POTS?
lbruder · 4 years ago
Looks like they misconfigured a web interface that they can't reach anymore now that they're off the net.

"anyone have a Cisco console cable lying around?"

CommieBobDole · 4 years ago
The only one they have is serial and the company's one usb-to-serial converter is missing.
john37386 · 4 years ago
Yeah the patch to fix BGP to reach the DNS is sent by email to @facebook.com. Ooops no DNS to resolve the MX records to send the patch to fix the BGP routers.
yoelo · 4 years ago
Seriously? Is that how it works?
alexvoda · 4 years ago
Can someone explain why it is also down when trying to access it via Tor using its onion address: http://facebookwkhpilnemxj7asaniu7vnjjbiltxjqhye3mhbshg7kx5t...

Or when trying ips directly: https://www.lifewire.com/what-is-the-ip-address-of-facebook-...

I would have expected a DNS issue to not affect either of these.

I can understand the onionsite being down if facebook implemented it the way a thirdparty would (a proxy server accessing facebook.com) instead of actually having it integrated into its infrastructure as a first class citizen.

spiantino · 4 years ago
You can get through to a web server, but that web server uses DNS records or those routes to hit other services necessary to render the page. So the server you hit will also time out eventually and return a 500
gamacodre · 4 years ago
The issue here is that this outage was a result of all the routes into their data centers being cut off (seemingly from the inside). So knowing that one of the servers in there is at IP address "1.2.3.4" doesn't help, because no-one on the outside even knows how to send a packet to that server anymore.
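As a rough sketch of what "no one knows how to send a packet there" means, the snippet below models a routing table with longest-prefix matching and then removes the covering prefix, as a BGP withdrawal would. The function name, the prefix, and the next-hop label are assumptions for illustration; only the address comes from the thread.

```python
import ipaddress


def lookup(routes, address):
    """Longest-prefix match: return the next hop for address, or None."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


# Illustrative table: the prefix and next-hop label are assumptions.
prefix = ipaddress.ip_network("129.134.30.0/24")
routes = {prefix: "peer-facebook"}
before = lookup(routes, "129.134.30.12")

# A withdrawal removes the prefix from the table.
del routes[prefix]
after = lookup(routes, "129.134.30.12")
```

With the prefix present, the lookup yields a next hop; after the withdrawal it yields nothing, even though the destination address itself has not changed. The table here has no default route, which is the situation on backbone routers that carry the full table.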

KaiserPro · 4 years ago
routing was down _everywhere_ so tor is getting a better experience than most people by getting a 500 error
keithnoizu · 4 years ago
DNS is back, looks like systems are still coming online.

kossTKR · 4 years ago
Reddit r/Sysadmin user that claims to be on the "Recovery Team" for this ongoing issue:

>As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC). There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures.

User is providing live updates of the incident here:

https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like...

guidopallemans · 4 years ago
He just deleted all his updates.

user:

https://old.reddit.com/user/ramenporn

some messages:

* This is a global outage for all FB-related services/infra (source: I'm currently on the recovery/investigation team).

* Will try to provide any important/interesting bits as I see them. There is a ton of stuff flying around right now and like 7 separate discussion channels and video calls.

* Update 1440 UTC:

    As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).

    There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified.

    Part of this is also due to lower staffing in data centers due to pandemic measures.

Narushia · 4 years ago
The 1440 UTC update is also archived on the Wayback Machine: https://web.archive.org/web/20211004171424/https://old.reddi...

And archive.today: https://archive.ph/sMgCi

yholio · 4 years ago
Essentially, they locked themselves out with an uninspired command line at the exact moment the datacenter was being hijacked by ape-people.

Yup, corporate comms won't love these status updates.

Ueland · 4 years ago
And there his account went poof, thanks for archiving.
dschiavu · 4 years ago
It's time to decentralize and open up the Internet again, as it once was (ie. IRC, NNTP and other open protocols) instead of relying on commercial entities (Google, Facebook, Amazon) to control our data and access to it.
meragrin_ · 4 years ago
The account has been deleted as well.

pmlnr · 4 years ago
> the people with physical access is separate from the people with knowledge of [...]

Welcome to the brave new world of troubleshooting. This will seriously bite us one day.

rvnx · 4 years ago
I like how FB decided to send "ramenporn" as their spokesperson.
cheese_van · 4 years ago
This is why so many teams fight back against the audit findings:

"The information systems office did not enforce logical access to the system in accordance with role-based access policies."

Invariably, you want your best people to have full access to all systems.

jfrunyon · 4 years ago
I can't fathom how they didn't plan for this. In any business of size, you have to change configuration remotely on a regular basis, and can easily lock yourself out on a regular basis. Every single system has a local user with a random password that we can hand out for just this kind of circumstance...
formerly_proven · 4 years ago
This sounds like something that might have been done with security in mind. Although generally speaking, remote hands don't have to be elite hackors.
suyash · 4 years ago
folks with physical access are also denied. source - https://twitter.com/YourAnonOne/status/1445100431181598723
gbil · 4 years ago
this is not new, this is everyday life with helping hands, on duty engineers, l2-l3 levels telling people with physical access which commands to run etc. etc. etc.
munk-a · 4 years ago
Telecommunication satellite communication issues might seriously shut down whole regions if they occur.

lmilcin · 4 years ago
I don't think so. I bet nobody is ever going to make that mistake at FB again after today.
est · 4 years ago
I think it's the same with supply chains.
dsr_ · 4 years ago
It just bit FB.
RobRivera · 4 years ago
like today! xD
IceWreck · 4 years ago
> Even in the biggest of organizations, they still have to wait for somebody to race down to the datacenter and plug his laptop into a router.

I love this comment.

MuffinFlavored · 4 years ago
Imagine having a huge portion of the digital world internationally riding on your shoulders...
yawnxyz · 4 years ago
for something as distributed as Facebook, do multiple somebodies all have to race down to each individual datacenter and plug their laptops into the routers?

As someone with no experience in this, it sounds like a terrifying situation for the admins...

bennyp101 · 4 years ago
Interesting that they published stuff about their BGP setup and infrastructure a few months ago - maybe a little tweak to roll backs is needed.

"... We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates..."

tedmiston · 4 years ago

    # todo: add rollbacks

pbhjpbhj · 4 years ago
Surely Facebook don't update routing systems between data centres (IIRC the situation) when they don't have people present to fix things if they go wrong? Or have an out-of-band connection (satellite, or dial-up (?), or some other alternate routing?).

I must be misunderstanding this situation here.

[Aside: I recall updating wi-fi settings on my laptop and first checking I had direct Ethernet connection working ... and that when I didn't have anything important to do (could have done a reinstall with little loss). Is that a reasonable analogy?]
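One common safeguard against locking yourself out is a change that reverts on its own unless the operator confirms it; Junos, for example, offers this as `commit confirmed`. The sketch below shows the idea only. The `Router` class and its method names are hypothetical, and nothing in the thread says whether Facebook's tooling had an equivalent.

```python
class Router:
    """Hypothetical device model with confirm-or-revert config changes."""

    def __init__(self, config):
        self.config = dict(config)
        self._pending_rollback = None

    def commit_confirmed(self, new_config):
        """Apply new_config but keep the old one until confirm() is called."""
        self._pending_rollback = dict(self.config)
        self.config = dict(new_config)

    def confirm(self):
        """The operator still has access: make the change permanent."""
        self._pending_rollback = None

    def timer_expired(self):
        """No confirmation arrived in time: revert locally on the device."""
        if self._pending_rollback is not None:
            self.config = self._pending_rollback
            self._pending_rollback = None
```

The point of the design is that the revert runs on the device itself, so it still works when the change has cut off every remote path to it.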

kerng · 4 years ago
Wondering how Facebook communicates internally now. Most of their work streams likely depend on Facebook's systems, which are all down.

Can engineers and security teams even access prod systems anymore? Like, would "Bastion" hosts be reachable?

Wonder if they use Signal and Slack now?

alasdair_ · 4 years ago
There are various non-FB fallback measures, including IRC as a last-ditch method. The IRC fallback is usually tested once a year for each engineer.
flyingswift · 4 years ago
FB uses a separate IRC instance for these kinds of issues, at least when I used to work there
not2b · 4 years ago
I would think that their internal network would correctly resolve facebook.com even though they've borked DNS for the external world, or if not they could immediately fix that. So at least they'd be able to talk to each other.
markchristian · 4 years ago
To the communication angle, I've worked at two different BigCo's in my career, and both times there was a fallback system of last resort to use when our primary systems were unavailable.
ThinkBeat · 4 years ago
I haven't worked for a FAANG but it would be unthinkable that FB does not have backup measures in place for communications entirely outside of Facebook.

Hmm well I mean for key people, ops and so on. Not for every employee.

Only a few people need that type of access, and they should have it ready. If they need to bring in more people, there should be an easy way to do it.

Maybe the internal FB Messenger app has a slide button to switch to the backup network for those in need.

xgme · 4 years ago
Facebook does use IRC and Zoom as a fallback.
gfosco · 4 years ago
Actually, in this situation: Discord.
slaymaker1907 · 4 years ago
If they planned ahead, they should have had their oncalls practice on the backup systems (like Signal/Slack/Zoom) before now.
LordHumungous · 4 years ago
My team set up a discord lol
jptech · 4 years ago
Don't they have a separate instance for internal communications?
tiborsaas · 4 years ago
"I believe the original change was 'automatic' (as in configuration done via a web interface). However, now that connection to the outside world is down, remote access to those tools don't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally."

Hmm, could be a UI/UX bug then :)

sjg007 · 4 years ago
Seems odd to not have a redundant backdoor on a different network interface. Maybe that is too big of a security risk but idk.
cotillion · 4 years ago
So, does anyone know where one can buy an LTE gateway with a serial port interface? Asking for a friend.
daper · 4 years ago
Our security team complained that we have some services, like monitoring or SSH access to some jump hosts, accessible without a VPN, because a VPN should be mandatory to access all internal services. I'm afraid that once we comply we could end up in a situation similar to the one Facebook is in now...
mekatter · 4 years ago
These are readily available, OpenGear and others have offered them forever. I can't believe fb doesn't have out of band access to their core networking in some fashion. OOB access to core networking is like insurance, rarely appreciated until the house is on fire.
Pasorrijer · 4 years ago
Facebook is likely scrambling private jets as we speak to get the right people to the right places.
aero-glide2 · 4 years ago
Reminds me of that episode in Mr Robot
larntz · 4 years ago
This tweet seems to confirm it is a bgp issue...

https://twitter.com/GossiTheDog/status/1445063880963674121?s...

adamredwoods · 4 years ago
Cloudflare also confirmed it:

https://twitter.com/jgrahamc/status/1445068309288951820

Also, the Domain name is for sale???

https://whois.domaintools.com/facebook.com

snickersnee11 · 4 years ago
Just imagine the amount of stress on these people; I hope the money is really worth it.
mov31tmov31t · 4 years ago
It shouldn't be too stressful. Well-managed companies blame processes rather than people, and have systems set up to communicate rapidly when large-scale events occur.

It can be sort of exciting, but it's not like there is one person typing at a keyboard with a hundred managers breathing down their neck. These resolutions are collaborative, shared efforts.

yupper32 · 4 years ago
The stress for me usually goes away once the incident is fully escalated and there's a team with me working on the issue. I imagine that happened quite quick in this case...
tomjen3 · 4 years ago
This is a one-off event, not a chronic stress trigger. I find them invigorating personally, as long as everybody concerned understands that this is not good in the long run, and that you are not going to write your best code this way.
ds206 · 4 years ago
Well, those comments have been deleted now... I guess someone's boss didn't like the unofficial updates going out? :)
winternett · 4 years ago
Also, equally important to note, there was a massive exposé on Facebook yesterday that is reverberating across social media and news networks, and today, when I tried to make a post including the tag #deletefacebook, my post mysteriously could not be published and the page refreshed, mysteriously wiping my post...

This is possibly the equivalent of a corporate watergate if you ask me... Just my personal opinion as a developer though... Not presented as fact... But hrmmm.

blobbers · 4 years ago
So what you're saying is facebook... deleted itself?

The singularity is happening. It realized it would end society, so it ended itself.

r721 · 4 years ago
Archived version: https://archive.is/QvdmH
strenholme · 4 years ago
The Reddit post is down but not before it was archived: https://archive.is/QvdmH and https://archive.is/TNrFv

cecilpl2 · 4 years ago
User has now deleted the update.
rexreed · 4 years ago
I am sure this is not what they specifically mean by fail fast and break things often.
wolverine876 · 4 years ago
> Reddit r/Sysadmin user that claims to be on the "Recovery Team"

They have time to make public posts, and think it's a good idea?

Sure, I'm on the 'Recovery Team' too! How about you?

mdip · 4 years ago
If it's anything like my past employers, they probably have a lot of time. They probably also got in a lot of trouble.

When we'd have situation bridges put in place to work a critical issue, there would usually be 2-3 people who were actively troubleshooting and a bunch of others listening in, there because "they were told to join" but with little-to-nothing to do. In the worst cases, there was management there, also.

Most of the time I was one of the 2 or 3 and generally preferred if the rest of them weren't paying much attention to what was going on. It's very frustrating when you have a large group of people who know little about what's going on injecting their opinions while you're feverishly trying to (safely) resolve a problem.

It was so bad that I once announced[0] to a C-Level and VP that they needed to exit the bridge immediately, because the discussion had devolved into finger-pointing. All of management was "kicked out". We were close to solving it but technical staff was second-guessing themselves in the presence of folks with the power to fire them. 30 minutes later we were working again. My boss at the time explained that management created their own bridge and the topic was "what to do about 'me'", which quickly went from "fire me" to "get them all a large Amazon gift card". Despite my undiplomatic handling of the situation, that same C-Level negotiated to get me directly beneath them during a reorganization about six months later, and I stayed in that spot for years with a very good working relationship. One of my early accomplishments was to limit any of management's participation in situation bridges to once/hour, and only when absolutely necessary, for status updates assuming they couldn't be gotten any other way (phones always worked, but the other communication options may not have).

[0] This was the 16th hour of a bridge that started at 11:00 PM after a full work day early in my career -- I was a systems person with a title equivalent to 'peon', we were all very raw by then and my "announcement" was, honestly, very rude, which I wasn't proud of. Assertive does not have to be rude, but figuring out the fine line between expressing urgency and telling people off is a skill that has to be learned.

cwkoss · 4 years ago
Uh oh that user deleted their account. Hope they are OK.
costcofries · 4 years ago
Looks like those updates have now been deleted
PeterCorless · 4 years ago
Comment now seems to be deleted by user.
sbierwagen · 4 years ago
That reddit comment has been deleted.
qnsi · 4 years ago
he started deleting the comments

rocho · 4 years ago
For Facebook and WhatsApp it looks like a DNS issue, name resolution fails with SERVFAIL:

    $ dig facebook.com

    ; <<>> DiG 9.16.21 <<>> facebook.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23982
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ;; QUESTION SECTION:
    ;facebook.com.   IN A

    ;; Query time: 16 msec
    ;; SERVER: 8.8.8.8#53(8.8.8.8)
    ;; WHEN: Mon Oct 04 17:53:00 CEST 2021
    ;; MSG SIZE  rcvd: 41
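
For anyone scripting this check, the status can be pulled out of saved `dig` output with a small parser. This is a sketch under the assumption that the output has the header format shown above; the function name is made up for illustration.

```python
import re


def dig_status(dig_output: str):
    """Return the status field from dig's ->>HEADER<<- line, or None."""
    match = re.search(r"->>HEADER<<-.*?status:\s*([A-Z]+)", dig_output)
    return match.group(1) if match else None


# Header line copied from the dig output in this comment.
sample = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23982"
status = dig_status(sample)
```

A SERVFAIL here means the resolver could not get an answer from the authoritative servers, which is different from NXDOMAIN (the name does not exist).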

r721 · 4 years ago
John Graham-Cumming:

>Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. This is what it looked like to @Cloudflare.

https://twitter.com/jgrahamc/status/1445065270272434176 (thread)

UPD

>About five minutes before Facebook's DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook's ASN.

https://twitter.com/jgrahamc/status/1445068309288951820

simlevesque · 4 years ago
Maybe they tried everything else before that.

At first it was working but they couldn't serve responses: https://i.imgur.com/UaCtOiX.png

Notice the "2020"

rvnx · 4 years ago
The servers struggle to reply with even a basic 5xx answer.

Two possibilities:

- the DNS services internally have issues (most likely, as this could explain the snowball effect)

- it could be also a core storage issue and all their VMs are relying on it and so they don't want to block third-party websites and think it will last for a long time, so they prefer to answer nothing for now in the DNS (so it will fail instantly to the client, and drain the application/database servers so they can reboot with less load)

ctur · 4 years ago
It isn't just DNS. If you happen to have cached entries, the site is returning errors as well.
Nextgrid · 4 years ago
Presumably the DNS being down also wreaks havoc in their internal infrastructure as services can no longer resolve each other's names.
ikiris · 4 years ago
agreed, they fell off the internet according to routeviews

WillPostForFood · 4 years ago
I'm seeing similar DNS errors for many non-Facebook sites.
robjan · 4 years ago
My ISP's DNS server went down a few minutes after the Facebook outage, presumably because all the residential customers' devices keep querying.
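Clients that retry immediately and forever are what turn one failing name into load on everyone's resolvers. A common mitigation is exponential backoff with jitter, sketched below. The function name and the default values are assumptions; this is not a description of how any Facebook app behaves.

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=None):
    """Yield one delay in seconds per retry: exponential growth, full jitter."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly below the ceiling so that many
        # clients do not all retry at the same instant.
        yield rng.uniform(0, ceiling)
```

Each retry waits a random time below a ceiling that doubles up to `cap`, which spreads the retries out instead of synchronizing them.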
rstupek · 4 years ago
Seeing the same thing with 8.8.8.8 name servers. Everything I query returns an error
Spare_account · 4 years ago
Do you have some examples?
john37386 · 4 years ago
Even the Name servers are not returning any values. That's bad.

dig @8.8.8.8 +short facebook.com NS

These are usually anycasted, meaning that one IP returned in the NS records is in fact several servers spread across several regions. Traffic is steered to the closest one through BGP agreements with ISPs. Very interesting, because it seems that it took one DNS entry misconfiguration to withdraw millions of dollars' worth of devices from the internet.

LordRishav · 4 years ago
It's always DNS
sysadmindotfail · 4 years ago
>It's always DNS

How is this not the top comment? Underrated

Animats · 4 years ago
Even Google's 8.8.8.8 DNS server says can't find, SERVFAIL.
Hokusai · 4 years ago
Is this related in any way to what happened to Slack recently in their DNS?
skywhopper · 4 years ago
So far the pattern isn't the same. Slack published a DNSSEC record that got cached and then deleted it, which broke clients that tried to validate DNSSEC for slack.com. But in this case, the records are just completely gone. As if "facebook.com", "instagram.com", et al just didn't exist.
hulitu · 4 years ago
Thank god we have DoH.
dvratil · 4 years ago
It's DNS over HTTPS. It relies on the same system as plain DNS, so DoH won't really help in this case...

jbverschoor · 4 years ago
Same here on facebook.com , [api]whatsapp.com (instagram.com works)
elboru · 4 years ago
Is it just me or HN also feels kinda laggy?
bentcorner · 4 years ago
General tip: If HN is being laggy and you're determined you want to waste some time here, open it in a private window. HN works extremely quickly if it doesn't know who you are.
iamthemalto · 4 years ago
Wow this really works, thank you. What actually is the reason for it being much faster in a private window? Is there so much tracking going on in a normal window?
Ancalagon · 4 years ago
This would make for a very good deep-dive technical discussion in an interview setting, I'm using this.
elboru · 4 years ago
That explains why it works fine in my computer, where I haven’t logged in. Thanks for the tip.
quaintdev · 4 years ago
This works like charm. Thank you!
rocho · 4 years ago
I can confirm, HN, GitHub and Slack are very slow for me as well. Google is very fast, on the other hand.
pilsetnieks · 4 years ago
All running their DNS on AWS. My guess is that AWS is seeing a massive flood of failed and retried DNS requests for facebook properties, similar to what jgrahamc mentions here for Cloudflare: https://twitter.com/jgrahamc/status/1445066136547217413
gcoguiec · 4 years ago
Dropping that many BGP routes will have its high latency toll on the whole internet backbone for minutes/hours, I'm not surprised. I wonder if the recent LE's DST Root CA X3 deprecation has something to do with the outage (some DC internal tool/API not accessible because its certificate is expired or something like that).
alexellisuk · 4 years ago
Also slow here. I can't see anything on the AWS Service Dashboard https://status.aws.amazon.com
yk · 4 years ago
People either have to work, creating load on GitHub, or waste their time elsewhere, creating load on HN and Slack.
szundi · 4 years ago
People probably got more time to work.
ggerules · 4 years ago
Also slow for me.
i_like_apis · 4 years ago
Probably traffic related. Lots of people reallocated to checking other sites.
eeegnu · 4 years ago
Probably people flooding in to see if anyone knows why things are down. Even Google speed test was down, presumably from too many people testing if it's their internet that's at issue.
kzrdude · 4 years ago
I'd guess that automatic processes dominate. Maybe billions of phone apps polling for facebook connectivity (FB messenger is down, for example).
rocky1138 · 4 years ago
A couple of years ago, an admin at Hacker News asked those of us who are just reading to log out because their system is architected in such a way that logged in users use more resources than anonymous ones. So, if you're feeling altruistic, log out of HN!
busymom0 · 4 years ago
Logging out does work! It's probably serving a cached page.
foobarbecue · 4 years ago
Internet's got a case of the Mondays for sure
busymom0 · 4 years ago
Yep. I am the developer of HN client HACK for iOS and Android and a bunch of users emailed me asking if my app was broken. Looks like something bigger is afoot.
alexdumitru · 4 years ago
Something's wrong with your app. It's not working at all, while Harmonic works perfectly.
gridder · 4 years ago
Best HN client app ever. Thanks for the great work!
rocho · 4 years ago
I can confirm, HN, GitHub and Slack are very slow for me as well. Google is very fast, on the other hand.

EDIT: actually HN failed to post this comment the first time I posted it!

flypaca · 4 years ago
Not just you. It is very laggy on my end too.
neom · 4 years ago
HN lagging, BBC was also very laggy about 30 minutes ago, and 35 minutes ago our whole company got booted out of their various hangouts simultaneously apart from the people in the states.
14 · 4 years ago
Definitely laggy for me as well. I went to Facebook and couldn’t get in, so I came here to check. The load time made me think my wifi must not be working, with two sites not opening, but then HN finally opened. Then I tried to hit reply to your post and again it seemed like it wouldn’t load, then it finally did. So yes, laggy; usually this is the one site that loads almost instantly.
ourcat · 4 years ago
Here too. Just had the "We're having some trouble serving your request. Sorry!" error.
Jyaif · 4 years ago
With the Facebook properties down, the rest of the internet will have a significant increase in usage.
dcminter · 4 years ago
Plus I don't know about you, but I came to HN just now specifically to check if there was any insight into why it was down! The thundering herd just arrived :)
amelius · 4 years ago
Yes, it's slow here as well, and posting this comment failed the first and second time.
comeonseriously · 4 years ago
Can confirm. HN, YT, Google, etc are all a bit laggy for me at the moment (eating lunch so I'm trying to entertain myself).
amelius · 4 years ago
Yes, it's slow here as well, and posting this comment failed the first and second and third and fourth time.
bradenb · 4 years ago
This is either a hilarious accident or genius comedy.
amelius · 4 years ago
Yes, it's slow here as well, and posting this comment failed the first and second and third time.
amelius · 4 years ago
Yes, it's slow here as well, and posting this comment failed the first time.
raymondh · 4 years ago
It is slow for me too.
Yuioup · 4 years ago
Same here. Sounds like another Cloudflare-like problem.

jose-cl · 4 years ago
Yes, me too (I'm in South America).
Jamie9912 · 4 years ago
Yep, struggled to load the homepage and this
donkarma · 4 years ago
yeah lagging for me too

cvhashim · 4 years ago
Some internet backbone provider is probably down itself.
leafygreene · 4 years ago
Or some country has started a war.
yawnxyz · 4 years ago
Funnily enough, I went to https://www.isitdownrightnow.com/ to check if Facebook is down, and isitdownrightnow is down itself... probably from the massive number of requests coming in to check if Facebook is down.
EvanAnderson · 4 years ago
Seems like the perfect time to launch isisitdowndownrightnow.com.
epalm · 4 years ago
Seems like it should be isisitdownrightnowdownrightnow.com
lostmsu · 4 years ago
You missed one rightnow in the middle
michaelmior · 4 years ago
I personally like https://isup.me (alias of downforeveryoneorjustme.com) because it's much shorter.

isup.me/facebook gets me what I want.

thrdbndndn · 4 years ago
Their methodology is flawed, it seems.

It says Google is down but it's not. [1]

[1] https://downforeveryoneorjustme.com/google

skizm · 4 years ago
https://downdetector.com/ seems to be working for me at least.
spiantino · 4 years ago
It's amusing that the top 3 trending reports are the FB sites that are down, and then the mobile carriers themselves, presumably because when FB doesn't load they assume it's their mobile network's fault. People really do think FB is the internet
horsellama · 4 years ago
'Unusual traffic patterns detected' now
cecilpl2 · 4 years ago
aaronharnly · 4 years ago
Amusingly, that returns:

> Is Facebook down right now?

> Uh oh! Something went wrong on our side. It's not you, it's us. Feel free to contact us if this persists.

jbkkd · 4 years ago
Noticed the same. I started to suspect my mobile plan ran out
zekrioca · 4 years ago
Which, in turn, reminds me of this paper [1] (from someone who previously worked at Facebook).

TL;DR: Metastable failures occur in open systems with an uncontrolled source of load, where a trigger causes the system to enter a bad state that persists even when the trigger is removed.

[1] Metastable Failures in Distributed Systems - https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
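
To make the TL;DR concrete, here is a toy simulation of a retry storm. It is only a sketch: the numbers, the `simulate` function, and the assumption that an overloaded server loses half its useful capacity are all invented for illustration and do not come from the paper.

```python
def simulate(ticks, new_per_tick, capacity, trigger_tick, burst, retry=True):
    """Return the number of successfully served requests per tick.

    Toy model (all parameters are illustrative assumptions):
    - clients send `new_per_tick` fresh requests every tick;
    - a one-off `burst` of extra requests arrives at `trigger_tick`;
    - if `retry` is true, every failed request is retried on the next tick.
    """
    backlog = 0  # failed requests waiting to be retried
    goodput = []
    for t in range(ticks):
        offered = new_per_tick + backlog + (burst if t == trigger_tick else 0)
        if offered <= capacity:
            served = offered
        else:
            # Assumption: an overloaded server wastes half of its capacity
            # on requests that time out before they complete.
            served = capacity // 2
        failed = offered - served
        backlog = failed if retry else 0
        goodput.append(served)
    return goodput


with_retries = simulate(10, new_per_tick=80, capacity=100, trigger_tick=3, burst=100)
without_retries = simulate(10, new_per_tick=80, capacity=100, trigger_tick=3, burst=100, retry=False)
print(with_retries)     # goodput never recovers after the one-tick burst
print(without_retries)  # goodput dips for one tick, then recovers
```

The burst lasts a single tick in both runs. With retries, the backlog alone keeps the offered load above capacity, so the bad state sustains itself after the trigger is gone.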

homeskool · 4 years ago
yep down for me too
zrail · 4 years ago
hugops for the engineers having to deal with this. It's incredibly stressful and I personally feel like they deserve some empathy, even if I don't like Facebook.

I wonder if maybe part of the lesson will be to run the root of your authoritative DNS hierarchy on separate infrastructure with a separate domain name. Using facebook.com as your root is cool and all but when that label disappears it causes huge issues.
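
A rough way to spot this kind of dependency is to check whether every nameserver for a zone has a hostname inside that same zone. The sketch below is only a string comparison with hypothetical helper names (`in_zone`, `all_nameservers_in_zone`); it does no DNS lookups and says nothing about where the servers are hosted.

```python
def in_zone(nameserver: str, zone: str) -> bool:
    """True if the nameserver's hostname sits at or below the zone name."""
    ns = nameserver.rstrip(".").lower()
    z = zone.rstrip(".").lower()
    return ns == z or ns.endswith("." + z)


def all_nameservers_in_zone(zone: str, nameservers: list[str]) -> bool:
    """True if every listed nameserver is named under the zone it serves."""
    return all(in_zone(ns, zone) for ns in nameservers)


# a.ns.facebook.com is the name from this thread; ns1.example.net is a
# made-up hostname standing in for a server under a separate domain.
print(all_nameservers_in_zone("facebook.com", ["a.ns.facebook.com"]))
print(all_nameservers_in_zone("facebook.com", ["a.ns.facebook.com", "ns1.example.net"]))
```

When the first check is true for the full nameserver list, losing the zone's own name also makes every one of its nameserver hostnames unresolvable without glue records.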

chasd00 · 4 years ago
There will be so many meetings over this. If PowerPoint were listed on the stock exchange, I'd say now's a good time to buy, hah.
poetaster · 4 years ago
I used to do this properly. Once, vanity got the better of me. Got some work to do. TGF SQL.
antocv · 4 years ago
It's alive!

    drill @1.1.1.1 www.facebook.com
    ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 2172
    ;; flags: qr rd ra ; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;; www.facebook.com. IN A

    ;; ANSWER SECTION:
    www.facebook.com. 3401 IN CNAME star-mini.c10r.facebook.com.
    star-mini.c10r.facebook.com. 3403 IN A 31.13.72.36

strenholme · 4 years ago
Kinda sorta. There are four DNS servers for Facebook: 129.134.30.12, 129.134.31.12, 185.89.218.12, and 185.89.219.12.

Of those, only 185.89.219.12 is up right now (Edit: all four DNS servers are now up). For people who want to add Facebook to hosts.txt, the A record (IP) I’m getting right now is 157.240.11.35 (it was 31.13.70.36).
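
For anyone who wants to repeat this check without `dig` or `drill`, here is a minimal sketch that sends a hand-built DNS query over UDP to each of the four addresses above. The helper names `build_query` and `server_answers` are made up for this example, and the check only confirms that some reply with a matching ID came back; it does not parse the answer.

```python
import socket
import struct

# The four addresses listed in the comment above.
FACEBOOK_NS = ["129.134.30.12", "129.134.31.12", "185.89.218.12", "185.89.219.12"]


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (class IN)."""
    # Header: ID, flags (all zero: standard query, no recursion), 1 question.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def server_answers(ip: str, name: str = "facebook.com", timeout: float = 2.0) -> bool:
    """Return True if the server at `ip` sends back any reply to our query."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]

# Usage (needs network access):
#   for ip in FACEBOOK_NS:
#       print(ip, "up" if server_answers(ip) else "no answer")
```

A timeout here cannot tell a down server from a missing route, which matches what the traceroute at the top of the thread showed.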

kblev · 4 years ago
"Sorry, something went wrong. Facebook © 2020"
simonklitj · 4 years ago
Yes, even Facebook falls prey to the wrong copyright year. Anyway, I got further now to a page that says "Account Temporarily Unavailable." and has the old Facebook layout. Would love a peek inside the Facebook codebase to see how this happens, hah!
daniellehmann · 4 years ago
See e.g. https://www.digwebinterface.com/?hostnames=facebook.com&ns=a... for responses from different nameservers.
RedShift1 · 4 years ago
Nothing of value was lost
finolex1 · 4 years ago
I get that this is in jest, but a lot of people rely on WhatsApp and FB Messenger for messaging.
m-chrzan · 4 years ago
A lot of people, out of habit, rely on high fructose corn syrup for calories.
dekerta · 4 years ago
There are plenty of ways to communicate with friends and family. If Facebook is down long enough, many people will just move to something else. (And I hope they do)
ekianjo · 4 years ago
Making poor choices seems to be the curse of humankind.
simlevesque · 4 years ago
Instagram messaging is also very popular, at least around me.
annadane · 4 years ago
Sure they do. And it's why Whatsapp needs to be broken off from Facebook, because they blatantly lied about it and only bought it to kill off their competition
bborud · 4 years ago
Maybe they shouldn't.
riffic · 4 years ago
they shouldn't.
paul7986 · 4 years ago
They relied on AOL Instant Messenger too...
erdos4d · 4 years ago
I certainly do and I dream of the day that everyone I message switches, so I can too.
subsaharancoder · 4 years ago
A lot of people, many of them home based businesses, also rely on FB Marketplace as a primary source of income.
subsaharancoder · 4 years ago
Many people don't realize that with the 2020 lockdown and next to zero face to face transactions happening, platforms like FB Marketplace provided an opportunity for many people to set up businesses and generate income. I understand the angst people have with FB, but there's a bigger world out there beyond our keyboards.
tantalor · 4 years ago
That's terrifying.
walrus01 · 4 years ago
For one example of this, look at certain ethnic food catering/delivery services that exist in many major cities and operate almost exclusively on Facebook.
belter · 4 years ago
I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced....Finally!
madeofpalk · 4 years ago
I can't message my friends on whatsapp :(
heywherelogingo · 4 years ago
Seize the moment - switch to Signal.
can16358p · 4 years ago
Just because a company has questionable or even straight evil business practices doesn't mean that literally millions of companies/people don't rely on them to do business and communicate.
solmag · 4 years ago
It is a good start.
winter_blue · 4 years ago
Well, I know you jest, but a lot of conversations, with many people, over years and years would be lost. It'd be akin to hundreds of email threads with friends being deleted.
mcheung610 · 4 years ago
But positive social value was gained
johnwheeler · 4 years ago
much value was gained!
ozfive · 4 years ago
This cannot be said enough.
blowski · 4 years ago
On the contrary, it's said far too much. Facebook is extremely valuable for a lot of people. I dislike Facebook as much as most people on here, but saying "it's totally pointless" is silly and it's not surprising that those who say it are ignored by those who use Facebook.
mattfrommars · 4 years ago
Facebook bashing is getting old. It's 2021, dammit.
zthrowaway · 4 years ago
And every year it makes the world worse.