> Upon the outage, both banks immediately activated IT disaster recovery and business continuity plans.
> "However," according to Tan, "both banks encountered technical issues which prevented them from fully recovering their affected systems at their respective backup datacenters – DBS due to a network misconfiguration and Citibank due to connectivity issues."
Perhaps regularly flipping between them, at least for a short time each, so the 'passive' half of the active-passive pair gets some real work, is worth considering.
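A minimal sketch of what that scheduled flip might look like. All names here (`Datacenter`, `swap_roles`) are illustrative, not from any real system; in practice the swap step would repoint DNS or a global load balancer and gate on health checks.

```python
from dataclasses import dataclass

@dataclass
class Datacenter:
    name: str
    active: bool

def swap_roles(primary: Datacenter, secondary: Datacenter) -> None:
    """Promote the passive DC and demote the active one.

    In a real system this is where you'd repoint DNS / the global
    load balancer and verify health checks before completing the swap.
    """
    primary.active, secondary.active = secondary.active, primary.active

dc_a = Datacenter("east", active=True)
dc_b = Datacenter("west", active=False)

# Drilling this every few months means the "passive" site regularly
# carries real load, so its config drift surfaces early, not mid-outage.
swap_roles(dc_a, dc_b)
assert dc_b.active and not dc_a.active
```

The point of the drill isn't the swap itself but everything around it: the runbook, the DNS TTLs, and the people executing it all get exercised.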
Kind of a spicy opinion as an SRE, but if your disaster recovery strategy isn't part of your operational strategy then it's not actually a strategy. It's a moonshot in the face of failure.
Have you seen this video of the guy who is on some sort of “adventure” outing? They are way up in the air on some sort of platform where you are supposed to jump across planks from one platform to another. Someone is filming from a platform next to the rig.
There’s a harness connected to a rail above you to catch you if you fall.
In the video the man gets halfway across and the cable attached to his harness just pops off. The instructor behind him instinctively reaches out as if he can use The Force to catch him if he biffs, but he makes it to the other side unharmed, and is all smiles until the cable catches up and bops him in the back. At which point he turns around and connects the dots.
It’s a bit like that. And this is the image that comes into my mind every time I discover we have untested recovery processes while we are in the middle of a recovery.
I’m surprised at this to be honest. At my employer (a bank) we have to do a flip a couple times a year and just continue running from the alternate DC until the next flip. Pain in the butt on the weekends of the flip but at least we know for sure that our DR plan works and is good enough to serve our regular load…
We used to do this for some large Artifactory clusters (~1bn artifacts) we ran on both the east and west coast of the US. We'd fail over to our DR cluster, and our (internal) global DNS load balancer handled the cutover just fine. Made upgrading those clusters seamless for the most part. Gave me confidence we could continue development in the event our main site fell into the ocean for some reason.
Bureaucracy triumphs over engineering considerations most of the time. The engineering division at a bank might suggest testing the backup, but top execs might disagree because it's an unnecessary risk to an operational system. Of course they will play the blame game afterward, and some low-level contractor will take the heat because of the inability to make decisions at the top.
Disclaimer: I worked as ELV contractor for an asset management megacorp.
I’ve worked a few places where the company did regular disaster recovery testing to ensure stuff worked how it was intended to. Most of the time, it worked. Every once in a while, it failed.
I would venture to say that most non-tech companies are not equipped with the knowledge, the hardware, or the political fortitude to do something like that. Especially when it's customers' money at stake. It's one thing if Joe Schmo can't watch the latest Great British Baking Show. It's another matter entirely if Joe's paycheck is missing.
It's odd that regular testing wasn't part of a compliance framework for them. BC/DR testing has been part of every security/compliance framework I've worked with.
IMHO, "fail over" will always fail. Any multi-data-center architecture should actively utilize all data centers, all the time, maybe dedicate more traffic to one or the other. But, if you've got a "passive" data center that you wanna activate during an outage, you're gonna have a bad day. Every time.
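The "actively utilize all data centers" approach above can be sketched as a weighted pick. Datacenter names and weights here are made up for illustration; losing a site is then just setting its weight to zero, since the other site is already warm.

```python
import random

# Illustrative active-active split: both sites always carry live
# traffic, just in uneven proportions.
WEIGHTS = {"dc-primary": 0.8, "dc-secondary": 0.2}

def pick_datacenter(weights: dict[str, float]) -> str:
    """Choose a datacenter in proportion to its weight."""
    r = random.random()
    cumulative = 0.0
    for dc, w in weights.items():
        cumulative += w
        if r < cumulative:
            return dc
    return dc  # fall through on floating-point rounding

# Both sites see real traffic every day, so neither is a cold standby.
counts = {dc: 0 for dc in WEIGHTS}
for _ in range(10_000):
    counts[pick_datacenter(WEIGHTS)] += 1
```

In a real deployment the weighting lives in the global load balancer or DNS layer rather than application code, but the failure mode is the same: flipping a weight on a site that already serves traffic, not cold-starting one that never has.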
My experience contradicts your hyperbole. Over the past two decades I've been part of several failovers with active/passive and build-from-scratch DR solutions. Most succeeded. The ones that didn't succeed on the first attempt did so on the second. Not a perfect track record, but it does work with proper documentation, testing, and skilled staff to handle it.
Does this surprise you? What is a corporation if not a legal entity designed to launder responsibility?
Have a problem with a service and you find out the company you bought from subcontracts to an LLC you've never heard of, which in turn subcontracts to four other LLCs, one of which is actually liable but only if you can figure out which one, and oh by the way yours has two employees and its headquarters is a UPS Store box, good fucking luck.
> What is a corporation if not a legal entity designed to launder responsibility?
Corporations (and other legal entities used for business) are legal fictions that allow people to pool capital for productive endeavors and yes, with limited liability on the parts of the owners, because we as a society decided that encouraging productive endeavors this way is better than the alternative.
In a world with no limited liability, you could happily wring the owner's neck if they screwed up, but I'd imagine society as a whole would be a lot poorer off, because very few people would take that risk to engage in productive endeavors.
> Have a problem with a service and you find out the company you bought from subcontracts to an LLC you've never heard of which in turn subcontracts to four other LLCs one of which is actually liable
It doesn't matter, because the company you actually paid money to is the one that's responsible to you.
Considering how every vendor has been shoving cloud, serverless, and anything-but-on-premise solutions down everyone's throat for the better part of the century so far, does that really surprise you?
It's shitty landlord syndrome all over again. Tenants should have the right to complain and wave their sharpened pitchforks at them.
You get what you pay for. These companies have been racing to the bottom for years, trying to save that extra penny because IT is considered a cost center and always has been. In addition, moving to the cloud is just a way to shrug off responsibility: it's not our fault, it's AWS... Actually it is your fault, because you were trying to save money and handed off your business to the lowest-priced contractor, and now you're surprised that the contractor doesn't care about your business nearly as much as you do.
We are no longer in an era where this should be allowed to happen, even for, say, a legacy application that runs on Windows 2000 and needs NTFS on a network block device.
After 9/11, the "zero nines" whitepaper was released to push for cloud (never let a disaster go to waste).
I have hands-on deployed many K8s+Ceph clusters. Ceph's block storage can be deployed multi-AZ, where multiple datacenters, each with their own power systems, are in close proximity. There are methods like Raft (used by K8s and Patroni), and a really neat method demonstrated by Google Cloud SQL that uses a SQL table to decide which SQL server should be the "leader".
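A lease-based sketch of the "leader row in a SQL table" idea, using SQLite to stand in for the shared database. The table and column names are assumptions for illustration, not any real product's schema: a node becomes leader if the lease row is free or expired, and renews the lease while it holds it.

```python
import sqlite3

LEASE_SECONDS = 10

# Shared table standing in for the database all candidates can reach.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leader ("
    "  id INTEGER PRIMARY KEY CHECK (id = 1),"  # exactly one leader row
    "  holder TEXT,"
    "  expires REAL)"
)

def try_acquire(node: str, now: float) -> bool:
    """Become leader if the lease is free or expired; renew if already held."""
    row = conn.execute("SELECT holder, expires FROM leader WHERE id = 1").fetchone()
    if row is None:
        conn.execute("INSERT INTO leader VALUES (1, ?, ?)", (node, now + LEASE_SECONDS))
        return True
    holder, expires = row
    if holder == node or expires < now:
        conn.execute(
            "UPDATE leader SET holder = ?, expires = ? WHERE id = 1",
            (node, now + LEASE_SECONDS),
        )
        return True
    return False

assert try_acquire("sql-1", now=0.0)      # lease is free -> sql-1 leads
assert not try_acquire("sql-2", now=5.0)  # lease still valid -> rejected
assert try_acquire("sql-2", now=20.0)     # lease expired -> sql-2 takes over
```

A real implementation would wrap the read-then-write in a transaction (or a single conditional UPDATE) so two candidates can't both win the race, and would use the database server's clock rather than each node's.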
Singapore DCs are built vertically to save on footprint. Quite a few were built on the west side of the island, where it's industrial and land is/was cheaper.
Well, considering you’d have most of the needed utilities already connected and present, a large population of well educated people, and extremely high bandwidth connections in close proximity, it’s probably only slightly higher than elsewhere.
It’s two steps away from Apollo 13. Build a CO2 scrubber out of parts in this box. You have twelve hours until the crew suffocates and dies.
"An untested recovery plan is worth less than the paper on which it is printed."
I'm sure there are many other versions of the quote
True. But getting the PHBs at a bank (and perhaps the banking regulators) to sign off on doing that could prove to be a non-trivial task.
On paper, there are plans and redundancies upon redundancies, and everything is perfectly safe.
In practice, I haven't seen them exercised, and some of these plans sound incredibly optimistic...
Outages happen.
Could lead to more multi-cloud adoption among financial services companies that are in scope.
(no relation to the other DORA, DevOps Research and Assessment, by the way)
Cloudflare. Equinix. No accountability.
One of the banks, DBS, has had several outages in recent years, but this seems like sloppy writing.
(I guess over time those should converge, but sometimes it takes a while, or it never happens.)