throw0101c · 2 years ago
Seems like more testing is in order:

> Upon the outage, both banks immediately activated IT disaster recovery and business continuity plans.

> "However," according to Tan, "both banks encountered technical issues which prevented them from fully recovering their affected systems at their respective backup datacenters – DBS due to a network misconfiguration and Citibank due to connectivity issues."

Perhaps flipping between them on a regular basis, at least for a short time, so the 'passive' part of the active-passive pair gets some work, is worth considering.
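
A minimal sketch of what a scheduled drill could look like (the site names, health endpoints, and the promotion step are all hypothetical placeholders):

```python
import datetime
import urllib.request

# Hypothetical site endpoints; real ones would come from config or service discovery.
SITES = {
    "dc-east": "https://dc-east.example.internal/health",
    "dc-west": "https://dc-west.example.internal/health",
}

def healthy(url: str) -> bool:
    """Treat a site as healthy if its health endpoint answers 200 within 5s."""
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

def scheduled_flip(active: str, passive: str) -> str:
    """Promote the passive site on a schedule so it regularly carries real load.

    Returns the name of the site that should now be active."""
    if not healthy(SITES[passive]):
        print(f"{datetime.date.today()}: {passive} failed pre-flip health check; staying on {active}")
        return active
    # The promotion itself (DNS/VIP change, replication role swap) is
    # environment-specific and deliberately stubbed out here.
    print(f"{datetime.date.today()}: flipping traffic {active} -> {passive}")
    return passive

if __name__ == "__main__":
    print("active site is now:", scheduled_flip("dc-east", "dc-west"))
```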

oooyay · 2 years ago
Kind of a spicy opinion as an SRE, but if your disaster recovery strategy isn't part of your operational strategy then it's not actually a strategy. It's a moonshot in the face of failure.
kevincox · 2 years ago
That's a strategy. And depending on your downtime tolerance it may be an acceptable strategy. In this case it probably wasn't an appropriate strategy.
nyc_data_geek1 · 2 years ago
In the business continuity/disaster recovery space we like to say, if you don't regularly test your DR plan, you don't really have one.
hinkley · 2 years ago
Moonshot is being generous.

It’s two steps away from Apollo 13. Build a CO2 scrubber out of parts in this box. You have twelve hours until the crew suffocates and dies.

dylan604 · 2 years ago
"No recovery plan is worth anything if it is not tested regularly."

"An untested recovery plan is worth less than the paper on which it is printed."

I'm sure there are many other versions of the quote.

hinkley · 2 years ago
Have you seen the video of the guy who is on some sort of “adventure” outing? They are way up in the air on a platform where you are supposed to jump across planks from one platform to another. Someone is filming from either a platform or a plateau next to the rig.

There’s a harness connected to a rail above you to catch you if you fall.

In the video the man gets halfway across and the cable attached to his harness just pops off. The instructor behind him instinctively reaches out as if he can use The Force to catch him if he biffs, but he makes it to the other side unharmed, and is all smiles until the cable catches up and bops him in the back, at which point he turns around and connects the dots.

It’s a bit like that. And this is the image that comes into my mind every time I discover we have untested recovery processes while we are in the middle of a recovery.

user3939382 · 2 years ago
If you haven’t tested your backup you don’t have one.
bsder · 2 years ago
"Backup--very, very boring. Restore--very, very exciting."
pphysch · 2 years ago
But think of all the boxes you cannot check if you are honest about your recovery plan.
f001 · 2 years ago
I’m surprised at this to be honest. At my employer (a bank) we have to do a flip a couple times a year and just continue running from the alternate DC until the next flip. Pain in the butt on the weekends of the flip but at least we know for sure that our DR plan works and is good enough to serve our regular load…
stuff4ben · 2 years ago
We used to do this for some large Artifactory clusters (~1bn artifacts) we ran on both the east and west coast of the US. Failing over to our DR cluster and our (internal) global DNS load balancer handled it just fine. Made upgrading those clusters seamless for the most part. Gave me confidence we could continue development in the event our main site fell into the ocean for some reason.
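
As a toy model of the DNS side of that (the cluster names and health probe are made up; the real thing was an internal global load balancer):

```python
from typing import Callable

# Resolve a service name to whichever cluster currently passes its health
# check, preferring the primary. This is the trick behind a seamless DR flip:
# clients keep using one name while the answer changes underneath them.
def make_resolver(primary: str, standby: str,
                  is_up: Callable[[str], bool]) -> Callable[[], str]:
    def resolve() -> str:
        return primary if is_up(primary) else standby
    return resolve

# Simulate the east-coast cluster falling into the ocean mid-stream.
up = {"artifactory-east": True, "artifactory-west": True}
resolve = make_resolver("artifactory-east", "artifactory-west", lambda c: up[c])

print(resolve())                 # artifactory-east
up["artifactory-east"] = False   # primary goes down
print(resolve())                 # artifactory-west
```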
SandraBucky · 2 years ago
Bureaucracy triumphs over engineering considerations most of the time. The engineering division at a bank might suggest testing the backup, but top execs might disagree, since it's an unnecessary risk to an operational system. Of course, they will play the blame game afterward, and some low-level contractor will take the heat because of the inability to make decisions at the top. Disclaimer: I worked as an ELV contractor for an asset management megacorp.
bell-cot · 2 years ago
> Perhaps flipping between them on a regular basis [...] is worth considering.

True. But getting the PHBs at a bank (and perhaps the banking regulators) to sign off on doing that...could prove to be a non-trivial task.

paulddraper · 2 years ago
Probably observer bias, but I've never seen a recovery plan work like it's supposed to.
BirAdam · 2 years ago
I’ve worked a few places where the company did regular disaster recovery testing to ensure stuff worked how it was intended to. Most of the time, it worked. Every once in a while, it failed.
tgsovlerkhgsel · 2 years ago
Seeing this play out in IT makes me really worried when it comes to critical infrastructure.

On paper, there are plans and redundancies upon redundancies, and everything is perfectly safe.

In practice, I haven't seen them exercised, and some of these plans sound incredibly optimistic...

HideousKojima · 2 years ago
Or you could do chaos engineering: https://en.m.wikipedia.org/wiki/Chaos_engineering
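
A minimal sketch of the idea, assuming a Kubernetes staging environment (the namespace is a placeholder; the kubectl invocations are standard ones):

```python
import random
import subprocess

# Chaos-engineering sketch: delete one random pod in a *staging* namespace
# and rely on the deployment controller to replace it. Never point this at
# production without the organizational buy-in discussed in this thread.
NAMESPACE = "staging"

def random_pod() -> str | None:
    """Pick a random pod name from the namespace, or None if it is empty."""
    names = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, check=True).stdout.split()
    return random.choice(names) if names else None

def kill_one() -> None:
    pod = random_pod()
    if pod:
        subprocess.run(["kubectl", "delete", "pod", pod, "-n", NAMESPACE],
                       check=True)
        print(f"chaos: deleted {pod}; now verify the service stayed healthy")

if __name__ == "__main__":
    kill_one()
```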
stuff4ben · 2 years ago
I would venture to say that most companies that aren't tech startups are not equipped with the knowledge, the hardware, or the political fortitude to do something like that. Especially when it's customers' money at stake. It's one thing if Joe Schmo can't watch the latest Great British Baking Show. It's another matter entirely if Joe's paycheck is missing.
TheIronMark · 2 years ago
It's odd that regular testing wasn't part of a compliance framework for them. BC/DR testing has been part of every security/compliance framework I've worked with.
captainkrtek · 2 years ago
Something something, the spare tire was flat.
psychlops · 2 years ago
The overheating datacenter didn't stop the transactions. What did was the banks' technical inability to execute a failover that should have been seamless.

Outages happen.

jimt1234 · 2 years ago
IMHO, "fail over" will always fail. Any multi-data-center architecture should actively utilize all data centers, all the time, maybe dedicate more traffic to one or the other. But, if you've got a "passive" data center that you wanna activate during an outage, you're gonna have a bad day. Every time.
ms_frizzle · 2 years ago
My experience contradicts your hyperbole. Over the past two decades I've been part of several failovers with active/passive and build-from-scratch DR solutions. Most succeeded. The ones that didn't succeed on the first attempt did so on the second. Not a perfect track record, but it does happen with proper documentation, testing, and skilled staff to handle it.
cm2187 · 2 years ago
How to make this problem even worse: make all the banks in the street use the same datacenters, namely either AWS or Azure.
bobbiechen · 2 years ago
In the EU, the Digital Operational Resilience Act (DORA) mandates that financial services companies must manage their cloud risk - https://www.digital-operational-resilience-act.com/

Could lead to more multi-cloud adoption among financial services companies that are in scope.

(no relation to the other DORA, DevOps Research and Assessment, by the way)

lawlessone · 2 years ago
AWS version "Overheating employee stopped 2.5M bank transacti..."
xyst · 2 years ago
Blaming contractors/subcontractors is so hot right now.

Cloudflare. Equinix. No accountability.

ToucanLoucan · 2 years ago
Does this surprise you? What is a corporation if not a legal entity designed to launder responsibility?

Have a problem with a service and you find out the company you bought from subcontracts to an LLC you've never heard of, which in turn subcontracts to four other LLCs, one of which is actually liable, but only if you can figure out which one. Oh, and by the way, yours has two employees and its headquarters is a UPS Store box. Good fucking luck.

gottorf · 2 years ago
> What is a corporation if not a legal entity designed to launder responsibility?

Corporations (and other legal entities used for business) are legal fictions that allow people to pool capital for productive endeavors and, yes, with limited liability on the part of the owners, because we as a society decided that encouraging productive endeavors this way is better than the alternative.

In a world with no limited liability, you could happily wring the owner's neck if they screwed up, but I'd imagine society as a whole would be a lot poorer, because very few people would take that risk to engage in productive endeavors.

> Have a problem with a service and you find out the company you bought from subcontracts to an LLC you've never heard of, which in turn subcontracts to four other LLCs, one of which is actually liable

It doesn't matter, because the company you actually paid money to is the one that's responsible to you.

kristjank · 2 years ago
Considering how every vendor has been shoving cloud, serverless, and anything-but-on-premise solutions down everyone's throat for the better part of this century, does that really surprise you?

It's shitty landlord syndrome all over again. Tenants should have the right to complain and wave their sharpened pitchforks at them.

screwturner68 · 2 years ago
You get what you pay for. These companies have been racing to the bottom for years, trying to save that extra penny because IT is considered a cost center and always has been. In addition, moving to the cloud is just a way to shrug off responsibility: it's not our fault, it's AWS's... Actually, it is your fault, because you were trying to save money and handed off your business to the lowest-priced contractor, and now you're surprised that the contractor doesn't care about your business nearly as much as you do.
throw0101c · 2 years ago
> […] is so hot right now.

Literally in this case.


devaiops9001 · 2 years ago
We are no longer in an era where this should be allowed to happen, even for, say, a legacy application that runs on Windows 2000 and needs NTFS on a network block device.

After 9/11, the "zero nines" whitepaper was released to push for cloud (never let a disaster go to waste).

I have hands-on deployed many K8s+Ceph clusters. Ceph's block storage can be deployed multi-AZ, where multiple datacenters, each with their own power systems, sit in close proximity. There are methods like Raft (used by K8s and Patroni), and a really great method demonstrated by Google's CloudSQL that uses a SQL table to infer which SQL server should be the "leader".
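
A self-contained sketch of that lease-in-a-table pattern (sqlite3 stands in for CloudSQL here, and the real implementation surely differs in detail):

```python
import sqlite3
import time

# One row holds the current leader and its lease expiry. A node becomes
# leader only if the lease is free, stale, or already its own.
LEASE_SECONDS = 10

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE leader (
                    id INTEGER PRIMARY KEY CHECK (id = 1),
                    node TEXT,
                    lease_until REAL)""")
conn.execute("INSERT INTO leader VALUES (1, NULL, 0)")

def try_acquire(node: str) -> bool:
    """Atomically take or renew the lease; the UPDATE's WHERE clause is the election."""
    now = time.time()
    cur = conn.execute(
        """UPDATE leader SET node = ?, lease_until = ?
           WHERE lease_until < ? OR node = ?""",
        (node, now + LEASE_SECONDS, now, node))
    conn.commit()
    return cur.rowcount == 1

print(try_acquire("replica-a"))  # True: lease was free
print(try_acquire("replica-b"))  # False: replica-a holds an unexpired lease
print(try_acquire("replica-a"))  # True: the current leader renews its own lease
```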

So like wtf guys.

forgingahead · 2 years ago
The article seems to be talking about an outage that happened in Oct 2023, but then references a LinkedIn post that was written in 2021?

One of the banks, DBS, has had several outages in recent years, but this seems like sloppy writing.

dan_can_code · 2 years ago
I wonder what the cost of a datacenter within Singapore is. The cost per square meter in that country, which is the most expensive in the world, must be huge.
latchkey · 2 years ago
I've run equipment in dc's in multiple places in the US, Germany, and Singapore. By far, the most expensive was Singapore.
Scoundreller · 2 years ago
Is it because it costs the operator a lot to run one or because they can charge a lot?

(I guess over time those should converge, but sometimes it takes a while, or it doesn't.)

c_o_n_v_e_x · 2 years ago
Singapore DCs are built vertically to save on footprint. Quite a few DCs were built on the west side of the island, where it's industrial and land is/was cheaper.
BirAdam · 2 years ago
Well, considering you'd have most of the needed utilities already connected and present, a large population of well-educated people, and extremely high-bandwidth connections in close proximity, it's probably only slightly higher than elsewhere.