It just occurred to me to wonder if Facebook has a Twitter account and if they used it to update people about the outage. It turns out they do, and they did, which makes sense. Boy, it must have been galling to have to use a competing communication network to tell people that your network is down.
It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).
Ah, I saw that one, but it wasn't verified so I figured it was an imposter. It has only a handful of tweets from 2009 and 1 from 2012, but it could really be him, I suppose.
I think you need to readjust your expectations; it's not reasonable to expect a fully fleshed-out RCA blog post within hours of incident resolution. Most other cloud providers take a few days for theirs.
Disagree -- it's here to establish something that a lot of people have been speculating about, which is whether it's hacking-related. It doesn't say much because its purpose is to deliver a single bit of information: { hacking: boolean }
It's less vague than you realize. It points out that the problem was within Facebook's network, between its datacenters. This not only suggests it's related to Express Backbone, but also suggests that the DNS BGP withdrawal which Cloudflare observed was not the primary issue.
It's not a full root cause analysis, to be sure, and leaves many open questions, but I definitely wouldn't describe it as painfully vague.
A point of distinction - there is no "DNS BGP withdrawal".
DNS is related to BGP only in that, without the right BGP routes in the routers, no packets can reach Facebook's networks, and thus Facebook's DNS servers.
That their DNS servers were taken out was a side effect of the root issue - they withdrew all the routes to their networks from the rest of the Internet.
Not picking on you - but there has been a lot of confusion around DNS that is mostly a red herring, and people should just drop it from the conversation. Everything on Facebook's networks disappeared, not just DNS. The main issue is that they effectively took a pair of scissors to every one of their internet connections - d'oh!
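To make that concrete: Facebook's authoritative nameservers sit inside Facebook's own address space, so once the routes were withdrawn, queries to them simply timed out. Here's a rough sketch of what a resolver was running into, assuming the third-party dnspython package (the nameserver IP below is illustrative of a.ns.facebook.com and may not be current):

    # Rough sketch only - assumes dnspython; the nameserver IP is illustrative.
    import dns.exception
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["129.134.30.12"]  # an authoritative Facebook nameserver
    resolver.timeout = 3
    resolver.lifetime = 3

    try:
        answer = resolver.resolve("facebook.com", "A")
        print("facebook.com ->", [rr.address for rr in answer])
    except dns.exception.Timeout:
        # With no BGP route covering that address, the query packets never arrive.
        # The DNS software is fine; the network path to it is gone.
        print("authoritative nameserver unreachable: a routing problem, not a DNS one")

A public recursor like 8.8.8.8 fails the same way once its cached records expire, which is why from the outside it looked like "DNS is down".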
It's important for many stakeholders to understand that it wasn't a hack/exploit or a malicious third party or a malicious insider.
It's much better for some random committee in Congress to debate antitrust forever, instead of bigger committees and agencies debating national security threats.
Even though the angle grinder story wasn’t accurate, it’d still be interesting to know what percentage of the time to fix the outage was spent on regaining physical access:
I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.
DR downtime was about an hour, but the bank fired him anyway.
Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.
Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
"I worked with a network engineer who misconfigured a router that was connecting a bank to it's DR site. The engineer had to drive across town to manually patch into the router to fix it.
DR downtime was about an hour, but the bank fired him anyway."
so prod wasn't down and he fixed it in a hour and they fired the guy who knew how to fix such things so quickly. Idiot manager at the bank.
Had a DBA once who was playing around with database projects in Visual Studio, and in the course of it he managed to hose the production database. This caused our entire system to go down.
Prostrate, he came before the COO expecting to be canned with much malice. The COO just asked if he learned his lesson and said all is forgiven.
I agree it was very heavy-handed, but I suspect there was more at play (not the first mistake, and some regulatory reporting that may have looked bad for higher-ups).
Facebook has a very healthy approach to incident response (one of the reasons it's so rare for the site to go down at all despite the enormous traffic and daily code pushes).
Unless there was some kind of nefarious intent, it's very unlikely anyone will be 'punished'. The likely ramifications will be around changes to processes, tests, automations, and fallbacks to 1) prevent the root sequence of events from happening again and 2) make it easier to recover from similar classes of problems in the future.
I've never understood companies that fire individuals when policies were followed and an incident happened anyway. Or when no policies existed. Or when policies were routinely bypassed.
Organizational failures require organizational solutions. That seems pretty obvious.
Harsh. Unless there is more to the story, being fired for a mistake like that is ridiculous. Everyone fucks up occasionally, and on the scale of fuck-ups I've certainly done worse than a 1-hour DR site outage, as I'm sure pretty much anyone who's ever run infrastructure has. A consistent pattern of fucking up is grounds for termination, but not any one-off instance, unless there was an extreme level of negligence on display.
> Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
I bring this sort of thing up all the time in disaster planning. There are scales of disaster so big that business continuity is simply not going to be a priority.
> DR downtime was about an hour, but the bank fired him anyway.
The US, not even once.
The guy should have had "reload in 10", an outage window, and a config review. There must be more to this story than it being a fireable offence for causing a P2 outage for an hour.
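For anyone who hasn't used it, "reload in 10" is the classic IOS safety net: schedule a reboot before touching a remote box, so that if the change cuts you off, the router comes back in ten minutes on its last saved config. A minimal sketch of that workflow, assuming the third-party netmiko library and a Cisco IOS device (host, credentials, and config lines are placeholders):

    # Minimal sketch of the "reload in 10" safety net - netmiko is assumed, and
    # the device details / config lines are placeholders, not anyone's real setup.
    from netmiko import ConnectHandler

    device = {
        "device_type": "cisco_ios",
        "host": "192.0.2.1",       # placeholder management address
        "username": "netops",
        "password": "example-only",
    }

    risky_change = [
        "interface GigabitEthernet0/1",
        " description uplink-to-DR-site",
    ]

    conn = ConnectHandler(**device)

    # 1. Schedule a reload. If the change below locks us out, the box reboots
    #    in 10 minutes and comes back with its last saved (working) config.
    conn.send_command_timing("reload in 10")
    conn.send_command_timing("\n")          # answer the [confirm] prompt

    # 2. Apply the change to the running config only - do not save yet.
    conn.send_config_set(risky_change)

    # 3. Still able to reach the device? Cancel the pending reload, then save.
    conn.send_command_timing("reload cancel")
    conn.send_command_timing("write memory")

    conn.disconnect()

The key detail is not saving until you've confirmed you can still reach the device; the scheduled reload only helps if the startup config is still the known-good one.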
So this is pure conspiracy theory, but to me this could be a security issue. What if something deep in the core of your infrastructure is compromised? Everything at risk? I'd ask my best engineer, he'd suggest shutting it down, and the best way to do that is to literally pull the plug on what makes you public. Tell everyone we accidentally messed up a BGP config and that's it.
BGP is public routing information and multiple external sources are able to confirm that aspect of the story. It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.
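This is easy to check yourself, since the route collectors are public. Here's a small sketch that asks RIPEstat which prefixes AS32934 (Facebook's ASN) is announcing; during the outage a chunk of these prefixes, including the ones covering Facebook's DNS servers, disappeared from the collectors' view. The endpoint and JSON field names follow RIPEstat's public API and should be treated as an assumption here:

    # Sketch only: query RIPEstat's public "announced-prefixes" endpoint for
    # Facebook's ASN and count the prefixes currently visible to RIPE RIS.
    import json
    import urllib.request

    ASN = "AS32934"  # Facebook
    URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

    with urllib.request.urlopen(URL, timeout=10) as resp:
        payload = json.load(resp)

    prefixes = [entry["prefix"] for entry in payload["data"]["prefixes"]]
    print(f"{ASN} announces {len(prefixes)} prefixes visible to RIPE RIS collectors")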
Many have pointed out that a couple of weeks ago Facebook had a paper out on how they had implemented a fancy new automated system to manage their BGP routes.
Whoops! Never attribute to malice that which can more easily be explained by stupidity and all that.
It was interesting to visit the subreddits of random countries (e.g. /r/Mongolia) and see the top posts all asking whether fb/Insta/WhatsApp being down was local or global. I got the impression this morning that it was only affecting NA and Europe, but it looks like it was totally global. The number of people trying to log in must be staggering.
Having been on the team that issued postmortems before, I can tell you that we said as little as possible, in as vague a way as possible, while meeting our minimum legal requirements. Actual Facebook customers (i.e. those who pay money to Facebook) will get a slightly more detailed release. But the whole goal is to give as little information as possible while appearing to be open. As an engineer that makes me growl, but that's how it is in this litigious world -- you don't want to give someone a reason to sue.
> It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).
He does: https://twitter.com/finkd
It was amazing. I'm sad he removed it.
hint: "some people".
Why? Why couldn't you just post that the RCA is still ongoing and that proper updates will follow? Otherwise all you get is meaningless fluff.
https://mobile.twitter.com/mikeisaac/status/1445196576956162...
> The guy should have had "reload in 10", an outage window, and a config review. There must be more to this story than it being a fireable offence for causing a P2 outage for an hour.
But yeah, likely not.
I wasn't aware that Stanley Kubrick was now in NetOps. /s
Will a real postmortem follow? Or is this the best we are gonna get?