throw0101c · 2 years ago
Seems like more testing is in order:

> Upon the outage, both banks immediately activated IT disaster recovery and business continuity plans.

> "However," according to Tan, "both banks encountered technical issues which prevented them from fully recovering their affected systems at their respective backup datacenters – DBS due to a network misconfiguration and Citibank due to connectivity issues."

Perhaps flipping between them on a regular basis, at least for a short time, so the 'passive' part of the active-passive pair gets some work, is worth considering.
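
A minimal sketch of what a scheduled drill could look like (the site names, health endpoints, and the promotion step are all hypothetical placeholders):

```python
import datetime
import urllib.request

# Hypothetical site endpoints; real ones would come from config or service discovery.
SITES = {
    "dc-east": "https://dc-east.example.internal/health",
    "dc-west": "https://dc-west.example.internal/health",
}

def healthy(url: str) -> bool:
    """Treat a site as healthy if its health endpoint answers 200 within 5s."""
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

def scheduled_flip(active: str, passive: str) -> str:
    """Promote the passive site on a schedule so it regularly carries real load.

    Returns the name of the site that should now be active."""
    if not healthy(SITES[passive]):
        print(f"{datetime.date.today()}: {passive} failed pre-flip health check; staying on {active}")
        return active
    # The promotion itself (DNS/VIP change, replication role swap) is
    # environment-specific and deliberately stubbed out here.
    print(f"{datetime.date.today()}: flipping traffic {active} -> {passive}")
    return passive

if __name__ == "__main__":
    print("active site is now:", scheduled_flip("dc-east", "dc-west"))
```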

oooyay · 2 years ago
Kind of a spicy opinion as an SRE, but if your disaster recovery strategy isn't part of your operational strategy then it's not actually a strategy. It's a moonshot in the face of failure.
kevincox · 2 years ago
That's a strategy. And depending on your downtime tolerance it may be an acceptable strategy. In this case it probably wasn't an appropriate strategy.
nyc_data_geek1 · 2 years ago
In the business continuity/disaster recovery space we like to say, if you don't regularly test your DR plan, you don't really have one.
hinkley · 2 years ago
Moonshot is being generous.

It’s two steps away from Apollo 13. Build a CO2 scrubber out of parts in this box. You have twelve hours until the crew suffocates and dies.

dylan604 · 2 years ago
"No recovery plan is worth anything if it is not tested regularly."

"An untested recovery plan is worth less than the paper on which it is printed."

I'm sure there are many other versions of the quote.

hinkley · 2 years ago
Have you seen the video of the guy who is on some sort of “adventure” outing? They are way up in the air on a platform where you are supposed to jump across planks from one platform to another. Someone is filming from either a platform or a plateau next to the rig.

There’s a harness connected to a rail above you to catch you if you fall.

In the video the man gets halfway across and the cable attached to his harness just pops off. The instructor behind him instinctively reaches out as if he can use The Force to catch him if he biffs, but he makes it to the other side unharmed, and is all smiles until the cable catches up and bops him in the back, at which point he turns around and connects the dots.

It’s a bit like that. And this is the image that comes into my mind every time I discover we have untested recovery processes while we are in the middle of a recovery.

user3939382 · 2 years ago
If you haven’t tested your backup you don’t have one.
bsder · 2 years ago
"Backup--very, very boring. Restore--very, very exciting."
pphysch · 2 years ago
But think of all the boxes you cannot check if you are honest about your recovery plan.
f001 · 2 years ago
I’m surprised at this to be honest. At my employer (a bank) we have to do a flip a couple times a year and just continue running from the alternate DC until the next flip. Pain in the butt on the weekends of the flip but at least we know for sure that our DR plan works and is good enough to serve our regular load…
stuff4ben · 2 years ago
We used to do this for some large Artifactory clusters (~1bn artifacts) we ran on both the east and west coast of the US. Failing over to our DR cluster and our (internal) global DNS load balancer handled it just fine. Made upgrading those clusters seamless for the most part. Gave me confidence we could continue development in the event our main site fell into the ocean for some reason.
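
As a toy model of the DNS side of that (the cluster names and health probe are made up; the real thing was an internal global load balancer):

```python
from typing import Callable

# Resolve a service name to whichever cluster currently passes its health
# check, preferring the primary. This is the trick behind a seamless DR flip:
# clients keep using one name while the answer changes underneath them.
def make_resolver(primary: str, standby: str,
                  is_up: Callable[[str], bool]) -> Callable[[], str]:
    def resolve() -> str:
        return primary if is_up(primary) else standby
    return resolve

# Simulate the east-coast cluster falling into the ocean mid-stream.
up = {"artifactory-east": True, "artifactory-west": True}
resolve = make_resolver("artifactory-east", "artifactory-west", lambda c: up[c])

print(resolve())                 # artifactory-east
up["artifactory-east"] = False   # primary goes down
print(resolve())                 # artifactory-west
```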
SandraBucky · 2 years ago
Bureaucracy triumphs over engineering considerations most of the time. The engineering division at a bank might suggest testing the backup, but top execs might disagree, since it's an unnecessary risk to an operational system. Of course, they will play the blame game afterward, and some low-level contractor will take the heat because of the inability to make decisions at the top. Disclaimer: I worked as an ELV contractor for an asset management megacorp.
bell-cot · 2 years ago
> Perhaps flipping between them on a regular basis [...] is worth considering.

True. But getting the PHBs at a bank (and perhaps the banking regulators) to sign off on doing that...could prove to be a non-trivial task.

paulddraper · 2 years ago
Probably observer bias, but I've never seen a recovery plan work like it's supposed to.
BirAdam · 2 years ago
I’ve worked a few places where the company did regular disaster recovery testing to ensure stuff worked how it was intended to. Most of the time, it worked. Every once in a while, it failed.
tgsovlerkhgsel · 2 years ago
Seeing this play out in IT makes me really worried when it comes to critical infrastructure.

On paper, there are plans and redundancies upon redundancies, and everything is perfectly safe.

In practice, I haven't seen them exercised, and some of these plans sound incredibly optimistic...

HideousKojima · 2 years ago
Or you could do chaos engineering: https://en.m.wikipedia.org/wiki/Chaos_engineering
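
A minimal sketch of the idea, assuming a Kubernetes staging environment (the namespace is a placeholder; the kubectl invocations are standard ones):

```python
import random
import subprocess

# Chaos-engineering sketch: delete one random pod in a *staging* namespace
# and rely on the deployment controller to replace it. Never point this at
# production without the organizational buy-in discussed in this thread.
NAMESPACE = "staging"

def random_pod() -> str | None:
    """Pick a random pod name from the namespace, or None if it is empty."""
    names = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True, text=True, check=True).stdout.split()
    return random.choice(names) if names else None

def kill_one() -> None:
    pod = random_pod()
    if pod:
        subprocess.run(["kubectl", "delete", "pod", pod, "-n", NAMESPACE],
                       check=True)
        print(f"chaos: deleted {pod}; now verify the service stayed healthy")

if __name__ == "__main__":
    kill_one()
```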
stuff4ben · 2 years ago
I would venture to say that most companies that aren't tech startups are not equipped with the knowledge, the hardware, or the political fortitude to do something like that. Especially when it's customers' money at stake. It's one thing if Joe Schmo can't watch the latest Great British Baking Show. It's another matter entirely if Joe's paycheck is missing.
TheIronMark · 2 years ago
It's odd that regular testing wasn't part of a compliance framework for them. BC/DR testing has been part of every security/compliance framework I've worked with.
captainkrtek · 2 years ago
Something something, the spare tire was flat.
psychlops · 2 years ago
The overheating datacenter didn't stop the transactions. What did was the banks' technical inability to execute a failover that should have been seamless.

Outages happen.

jimt1234 · 2 years ago
IMHO, "fail over" will always fail. Any multi-data-center architecture should actively utilize all data centers, all the time, maybe dedicate more traffic to one or the other. But, if you've got a "passive" data center that you wanna activate during an outage, you're gonna have a bad day. Every time.
ms_frizzle · 2 years ago
My experience contradicts your hyperbole. Over the past two decades I've been part of several failovers with active/passive and build-from-scratch DR solutions. Most succeeded. The ones that didn't succeed on the first attempt did so on the second. Not a perfect track record, but it does happen with proper documentation, testing, and skilled staff to handle it.
cm2187 · 2 years ago
How to make this problem even worse: make all the banks in the street use the same datacenters, namely either AWS or Azure.
bobbiechen · 2 years ago
In the EU, the Digital Operational Resilience Act (DORA) mandates that financial services companies must manage their cloud risk - https://www.digital-operational-resilience-act.com/

Could lead to more multi-cloud adoption among financial services companies that are in scope.

(no relation to the other DORA, DevOps Research and Assessment, by the way)

lawlessone · 2 years ago
AWS version "Overheating employee stopped 2.5M bank transacti..."
xyst · 2 years ago
Blaming contractors/subcontractors is so hot right now.

Cloudflare. Equinix. No accountability.

ToucanLoucan · 2 years ago
Does this surprise you? What is a corporation if not a legal entity designed to launder responsibility?

Have a problem with a service and you find out the company you bought from subcontracts to an LLC you've never heard of, which in turn subcontracts to four other LLCs, one of which is actually liable, but only if you can figure out which one. Oh, and by the way, yours has two employees and its headquarters is a UPS Store box. Good fucking luck.

gottorf · 2 years ago
> What is a corporation if not a legal entity designed to launder responsibility?

Corporations (and other legal entities used for business) are legal fictions that allow people to pool capital for productive endeavors and, yes, with limited liability on the part of the owners, because we as a society decided that encouraging productive endeavors this way is better than the alternative.

In a world with no limited liability, you could happily wring the owner's neck if they screwed up, but I'd imagine society as a whole would be a lot poorer, because very few people would take that risk to engage in productive endeavors.

> Have a problem with a service and you find out the company you bought from subcontracts to an LLC you've never heard of, which in turn subcontracts to four other LLCs, one of which is actually liable

It doesn't matter, because the company you actually paid money to is the one that's responsible to you.

kristjank · 2 years ago
Considering how every vendor has been shoving cloud, serverless, and anything-but-on-premise solutions down everyone's throat for the better part of this century, does that really surprise you?

It's shitty landlord syndrome all over again. Tenants should have the right to complain and wave their sharpened pitchforks at them.

screwturner68 · 2 years ago
You get what you pay for. These companies have been racing to the bottom for years, trying to save that extra penny because IT is considered a cost center and always has been. In addition, moving to the cloud is just a way to shrug off responsibility: it's not our fault, it's AWS's... Actually, it is your fault, because you were trying to save money and handed off your business to the lowest-priced contractor, and now you're surprised that the contractor doesn't care about your business nearly as much as you do.
throw0101c · 2 years ago
> […] is so hot right now.

Literally in this case.


devaiops9001 · 2 years ago
We are no longer in an era where this should be allowed to happen, even for, say, a legacy application that runs on Windows 2000 and needs NTFS on a network block device.

After 9/11, the "zero nines" whitepaper was released to push for cloud (never let a disaster go to waste).

I have hands-on deployed many K8s+Ceph clusters. Ceph's block storage can be deployed multi-AZ, where multiple datacenters, each with their own power systems, sit in close proximity. There are methods like Raft (used by K8s and Patroni), and a really great method demonstrated by Google's CloudSQL that uses a SQL table to infer which SQL server should be the "leader".
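
A self-contained sketch of that lease-in-a-table pattern (sqlite3 stands in for CloudSQL here, and the real implementation surely differs in detail):

```python
import sqlite3
import time

# One row holds the current leader and its lease expiry. A node becomes
# leader only if the lease is free, stale, or already its own.
LEASE_SECONDS = 10

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE leader (
                    id INTEGER PRIMARY KEY CHECK (id = 1),
                    node TEXT,
                    lease_until REAL)""")
conn.execute("INSERT INTO leader VALUES (1, NULL, 0)")

def try_acquire(node: str) -> bool:
    """Atomically take or renew the lease; the UPDATE's WHERE clause is the election."""
    now = time.time()
    cur = conn.execute(
        """UPDATE leader SET node = ?, lease_until = ?
           WHERE lease_until < ? OR node = ?""",
        (node, now + LEASE_SECONDS, now, node))
    conn.commit()
    return cur.rowcount == 1

print(try_acquire("replica-a"))  # True: lease was free
print(try_acquire("replica-b"))  # False: replica-a holds an unexpired lease
print(try_acquire("replica-a"))  # True: the current leader renews its own lease
```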

So like wtf guys.

forgingahead · 2 years ago
The article seems to be talking about an outage that happened in Oct 2023, but then references a LinkedIn post that was written in 2021?

One of the banks, DBS, has had several outages in recent years, but this seems like sloppy writing.

dan_can_code · 2 years ago
I wonder what the cost of a datacenter within Singapore is. The cost per square meter in that country, which is the most expensive in the world, must be huge.
latchkey · 2 years ago
I've run equipment in dc's in multiple places in the US, Germany, and Singapore. By far, the most expensive was Singapore.
Scoundreller · 2 years ago
Is it because it costs the operator a lot to run one or because they can charge a lot?

(I guess over time those should converge, but sometimes it takes a while, or it doesn't.)

c_o_n_v_e_x · 2 years ago
Singapore DCs are built vertically to save on footprint. Quite a few DCs were built on the west side of the island, where it's industrial and land is/was cheaper.
BirAdam · 2 years ago
Well, considering you'd have most of the needed utilities already connected and present, a large population of well-educated people, and extremely high-bandwidth connections in close proximity, it's probably only slightly higher than elsewhere.