It just occurred to me to wonder if Facebook has a Twitter account and if they used it to update people about the outage. It turns out they do, and they did, which makes sense. Boy, it must have been galling to have to use a competing communication network to tell people that your network is down.
It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).
Ah, I saw that one, but it wasn't verified so I figured it was an imposter. It has only a handful of tweets from 2009 and 1 from 2012, but it could really be him, I suppose.
I think you need to readjust your expectations; it's not reasonable to expect a fully fleshed-out RCA blog post within hours of incident resolution. Most other cloud providers take a few days for theirs.
Disagree -- it's here to establish something that a lot of people have been speculating about, which is whether it's hacking-related. It doesn't say much because its purpose is to deliver a single bit of information: { hacking: boolean }
It's less vague than you realize. It points out that the problem was within Facebook's network, between its datacenters. This not only suggests it's related to Express Backbone, but also suggests that the DNS BGP withdrawal which Cloudflare observed was not the primary issue.
It's not a full root cause analysis, to be sure, and leaves many open questions, but I definitely wouldn't describe it as painfully vague.
A point of distinction - there is no "DNS BGP withdrawal".
DNS is related to BGP only in that, without the right BGP routes in the routers, no packets can reach Facebook's networks, and thus Facebook's DNS servers.
That their DNS servers were taken out was a side effect of the root issue - they withdrew all the routes to their networks from the rest of the Internet.
Not picking on you - but there has been a lot of confusion around DNS that is mostly a red herring, and people should just drop it from the conversation. Everything on Facebook's networks disappeared, not just DNS. The main issue is that they effectively took a pair of scissors to every one of their internet connections - d'oh!
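To make that concrete: Facebook's authoritative nameservers sit inside Facebook's own address space, so once the routes were withdrawn, queries to them simply timed out. Here's a rough sketch of what a resolver was running into, assuming the third-party dnspython package (the nameserver IP below is illustrative of a.ns.facebook.com and may not be current):

    # Rough sketch only - assumes dnspython; the nameserver IP is illustrative.
    import dns.exception
    import dns.resolver

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["129.134.30.12"]  # an authoritative Facebook nameserver
    resolver.timeout = 3
    resolver.lifetime = 3

    try:
        answer = resolver.resolve("facebook.com", "A")
        print("facebook.com ->", [rr.address for rr in answer])
    except dns.exception.Timeout:
        # With no BGP route covering that address, the query packets never arrive.
        # The DNS software is fine; the network path to it is gone.
        print("authoritative nameserver unreachable: a routing problem, not a DNS one")

A public recursor like 8.8.8.8 fails the same way once its cached records expire, which is why from the outside it looked like "DNS is down".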
It's important for many stakeholders to understand that it wasn't a hack/exploit or a malicious third party or a malicious insider.
It's much better for some random committee in Congress to debate antitrust forever, instead of bigger committees and agencies debating national security threats.
Even though the angle grinder story wasn’t accurate, it’d still be interesting to know what percentage of the time to fix the outage was spent on regaining physical access:
I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.
DR downtime was about an hour, but the bank fired him anyway.
Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.
Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
"I worked with a network engineer who misconfigured a router that was connecting a bank to it's DR site. The engineer had to drive across town to manually patch into the router to fix it.
DR downtime was about an hour, but the bank fired him anyway."
so prod wasn't down and he fixed it in a hour and they fired the guy who knew how to fix such things so quickly. Idiot manager at the bank.
Had a DBA once who was playing around with database projects in Visual Studio, and in the course of it he managed to hose the production database. This caused our entire system to go down.
Prostrate, he came before the COO expecting to be canned with much malice. The COO just asked if he learned his lesson and said all is forgiven.
I agree it was very heavy-handed, but I suspect there was more at play (not the first mistake, and some regulatory reporting that may have looked bad for higher-ups).
Facebook has a very healthy approach to incident response (one of the reasons it's so rare for the site to go down at all despite the enormous traffic and daily code pushes).
Unless there was some kind of nefarious intent, it's very unlikely anyone will be 'punished'. The likely ramifications will be around changes to processes, tests, automations, and fallbacks to 1) prevent the root sequence of events from happening again and 2) make it easier to recover from similar classes of problems in the future.
I've never understood companies that fire individuals when policies were followed and an incident happened anyway. Or when no policies existed. Or when policies were routinely bypassed.
Organizational failures require organizational solutions. That seems pretty obvious.
Harsh. Unless there is more to the story, being fired for a mistake like that is ridiculous. Everyone fucks up occasionally, and on the scale of fuck-ups I've certainly done worse than a 1-hour DR site outage, as I'm sure pretty much anyone who's ever run infrastructure has. A consistent pattern of fucking up is grounds for termination, but not any one-off instance, unless there was an extreme level of negligence on display.
> Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.
I bring this sort of thing up all the time in disaster planning. There are scales of disaster so big that business continuity is simply not going to be a priority.
> DR downtime was about an hour, but the bank fired him anyway.
The US, not even once.
The guy should have had "reload in 10", an outage window, and a config review. There must be more to this story than it being a fireable offence for causing a P2 outage for an hour.
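For anyone who hasn't used it, "reload in 10" is the classic IOS safety net: schedule a reboot before touching a remote box, so that if the change cuts you off, the router comes back in ten minutes on its last saved config. A minimal sketch of that workflow, assuming the third-party netmiko library and a Cisco IOS device (host, credentials, and config lines are placeholders):

    # Minimal sketch of the "reload in 10" safety net - netmiko is assumed, and
    # the device details / config lines are placeholders, not anyone's real setup.
    from netmiko import ConnectHandler

    device = {
        "device_type": "cisco_ios",
        "host": "192.0.2.1",       # placeholder management address
        "username": "netops",
        "password": "example-only",
    }

    risky_change = [
        "interface GigabitEthernet0/1",
        " description uplink-to-DR-site",
    ]

    conn = ConnectHandler(**device)

    # 1. Schedule a reload. If the change below locks us out, the box reboots
    #    in 10 minutes and comes back with its last saved (working) config.
    conn.send_command_timing("reload in 10")
    conn.send_command_timing("\n")          # answer the [confirm] prompt

    # 2. Apply the change to the running config only - do not save yet.
    conn.send_config_set(risky_change)

    # 3. Still able to reach the device? Cancel the pending reload, then save.
    conn.send_command_timing("reload cancel")
    conn.send_command_timing("write memory")

    conn.disconnect()

The key detail is not saving until you've confirmed you can still reach the device; the scheduled reload only helps if the startup config is still the known-good one.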
So this is pure conspiracy theory, but to me this could be a security issue. What if something deep in the core of your infrastructure is compromised? Everything at risk? I'd ask my best engineer, he'd suggest shutting it down, and the best way to do that is to literally pull the plug on what makes you public. Tell everyone we accidentally messed up a BGP config and that's it.
BGP is public routing information and multiple external sources are able to confirm that aspect of the story. It makes for a good conspiracy theory but the BGP withdrawal is as real as the Moon landing.
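This is easy to check yourself, since the route collectors are public. Here's a small sketch that asks RIPEstat which prefixes AS32934 (Facebook's ASN) is announcing; during the outage a chunk of these prefixes, including the ones covering Facebook's DNS servers, disappeared from the collectors' view. The endpoint and JSON field names follow RIPEstat's public API and should be treated as an assumption here:

    # Sketch only: query RIPEstat's public "announced-prefixes" endpoint for
    # Facebook's ASN and count the prefixes currently visible to RIPE RIS.
    import json
    import urllib.request

    ASN = "AS32934"  # Facebook
    URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

    with urllib.request.urlopen(URL, timeout=10) as resp:
        payload = json.load(resp)

    prefixes = [entry["prefix"] for entry in payload["data"]["prefixes"]]
    print(f"{ASN} announces {len(prefixes)} prefixes visible to RIPE RIS collectors")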
Many have pointed out that a couple of weeks ago Facebook had a paper out on how they had implemented a fancy new automated system to manage their BGP routes.
Whoops! Never attribute to malice that which can more easily be explained by stupidity and all that.
It was interesting to visit the subreddits of random countries (e.g. /r/Mongolia) and see the top posts all asking whether fb/Insta/WhatsApp being down was local or global. I got the impression this morning that it was only affecting NA and Europe, but it looks like it was totally global. The number of people trying to log in must be staggering.
Having been on the team that issued postmortems before, I can tell you that we said as little as possible, in as vague a way as possible, while meeting our minimum legal requirements. Actual Facebook customers (i.e. those who pay money to Facebook) will get a slightly more detailed release. But the whole goal is to give as little information as possible while appearing to be open. As an engineer that makes me growl, but that's how it is in this litigious world -- you don't want to give someone a reason to sue.
> It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).
He does: https://twitter.com/finkd
It was amazing. I'm sad he removed it.
hint: "some people".
Why? Why couldn't you just post that the RCA is still ongoing and that proper updates will follow? Otherwise all you get is meaningless fluff.
https://mobile.twitter.com/mikeisaac/status/1445196576956162...
> The guy should have had "reload in 10", an outage window, and a config review. There must be more to this story than it being a fireable offence for causing a P2 outage for an hour.
But yeah, likely not.
I wasn't aware that Stanley Kubrick was now in NetOps. /s
Will a real postmortem follow? Or is this the best we are gonna get?