I remember that outage. It was finally blamed (as described in this brief) on phone switch manufacturer DSC. IIRC this killed the company. Their SLA with their customers was something like three minutes of downtime per decade.
DSC was our customer at Cygnus. They were interesting as a customer (tough requirements but they paid a lot for them). For example if they reported a bug and got a fix they diffed the binaries and looked at every difference to be sure that the change was a result of the fix, and nothing else (no, they didn’t want upgrades).
> Their SLA with their customers was something like three minutes of downtime per decade.
That is insane. I really feel like modern SLAs are only getting worse - so much so that most companies fudge them, and try their hardest to never declare any sort of outage.
The Bell System standard was an extremely high Erlang number, which of course they strove to provide at the lowest cost. That meant an extremely high utilization rate on the hardware, which in turn meant extreme uptime (compare to the then-contemporary PTT QoS even in major economies like France). (A rough Erlang B sketch follows at the end of this comment.)
This is also why the software itself was designed with so many internal defenses and what I would consider an “immune system”. I’ve never seen anything like it even on an aircraft control system. That is mentioned in passing in the brief article but is easily missed if you don’t know what it’s referring to.
Most of what is done on the Internet at, say, “layer 5 or above” isn’t at all important so there’s no need for this level of SLA, but the actual backbone carriers do still carry SLAs at around that level. With packet switching it’s easier for them to provide than it was in the days of the 4ESS and 5ESS.
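Aside on the "Erlang number" above, for readers who haven't engineered trunk groups: the Erlang B formula ties offered traffic, trunk count, and blocking probability together, which is where the cost-versus-quality trade-off described in that comment lives. A minimal sketch of the calculation in C, with purely hypothetical traffic figures:

    /* Erlang B blocking probability: the chance that a new call finds all
     * trunks busy, given an offered load in erlangs. Uses the standard
     * recurrence B(E,0)=1, B(E,m)=E*B(E,m-1)/(m+E*B(E,m-1)).
     * The traffic figures in main() are hypothetical, for illustration only. */
    #include <stdio.h>

    static double erlang_b(double erlangs, int trunks)
    {
        double b = 1.0;                          /* B(E, 0) = 1 */
        for (int m = 1; m <= trunks; m++)
            b = (erlangs * b) / (m + erlangs * b);
        return b;
    }

    int main(void)
    {
        double offered = 100.0;                  /* hypothetical offered load, erlangs */
        for (int trunks = 100; trunks <= 130; trunks += 10)
            printf("%3d trunks at %.0f erlangs offered -> blocking %.4f\n",
                   trunks, offered, erlang_b(offered, trunks));
        return 0;
    }

The point of the exercise: to hit an aggressive blocking target at minimum cost you run the trunk groups close to their engineered load, so any switch downtime immediately shows up as blocked calls.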
Telco is a highly regulated business. If you go down, the FCC and the state regulator are all over you. If you break 911 service, heaven help you.
As a result, five or six nines of availability is pretty standard. Six nines means you can lose no more than about 31 seconds a year (quick arithmetic sketch at the end of this comment). Traditional telco consequently had all kinds of cool tech in it, e.g. the Nortel DMS had live code patching as early as 1990, IIRC.
At one time I worked at a vendor of IP telco switches, aiming to replace the legacy Nortel and Lucent with smaller tech. We had to learn some very, very hard lessons about reliability, but we eventually got there.
Today, I see cloud hyperscalers claiming that they can run telco workloads, but I remain pretty skeptical until they can prove that they can switch a call, mid-stream, from one node to another, without losing the audio, while transcoding it from one codec to another. I'm not saying that public cloud needs to make the same tech choices regarding resilience, but today's web tech absolutely will not cut it.
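Back-of-the-envelope for the "31 seconds a year" figure above: the downtime budget is just the seconds in a year times one minus the availability target. A quick illustrative sketch:

    /* Downtime budget per year for N nines of availability.
     * Six nines (99.9999%) works out to roughly 31.5 seconds per year. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double seconds_per_year = 365.25 * 24.0 * 3600.0;  /* ~31,557,600 s */

        for (int nines = 3; nines <= 6; nines++) {
            double availability = 1.0 - pow(10.0, -nines);       /* e.g. 0.999999 */
            double budget_s = seconds_per_year * (1.0 - availability);
            printf("%d nines (%.4f%%): about %.1f seconds of downtime allowed per year\n",
                   nines, availability * 100.0, budget_s);
        }
        return 0;
    }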
Most modern SLAs are worthless. The penalty is meaningless, and/or "up" is defined so loosely that a service failing 100% of requests still counts as "up" because it's responding at all, or a single customer can have a total outage while the service counts as up because it's servicing others.
Networking is the last bastion of SLAs that actually seem to matter.
> For example if they reported a bug and got a fix they diffed the binaries and looked at every difference to be sure that the change was a result of the fix, and nothing else (no, they didn’t want upgrades).
According to a tech who actually applied the updates (Long Lines group on FB), "The DSC incident occurred in June 1991, unrelated to this AT&T incident."
Thanks. It was reported otherwise in the news back then, but TBH I'd trust someone likely to have been the person responsible, or close to that person.
(it might sound crazy these days to trust "someone on the Internet" but really: what's the incentive not to tell the truth about something like this?)
> Clearly, the use of C programs and compilers contributed to the breakdown. A more structured programming language with stricter compilers would have made this particular defect much more obvious.
Nice to see that "should have used Rust" has been a thing since before Rust existed.
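For context on the quoted claim: published accounts of this outage describe a misplaced "break" statement in nested C control flow inside the 4ESS recovery software. The sketch below is a simplified, hypothetical reconstruction of that class of defect (names and structure are illustrative, not the actual switch source): in C, "break" binds to the nearest enclosing switch or loop, never to an if, so a line the author expected to run on that path is silently skipped.

    /* Hypothetical reconstruction of the class of bug described in public
     * accounts of the outage; names are illustrative, not real 4ESS code.
     * The inner break exits the enclosing switch, not the if, so
     * initialize_modes_pointer() is skipped on that path and the final
     * dereference reads an uninitialized (here NULL) pointer. */
    #include <stddef.h>

    enum { THING1, THING2, STUFF, OTHER_STUFF };

    static int *modes_pointer = NULL;

    static void do_first_stuff(void) { /* ... */ }
    static void do_later_stuff(void) { /* ... */ }
    static void initialize_modes_pointer(void) { static int mode; modes_pointer = &mode; }

    int network_code(int line, int x, int y)
    {
        switch (line) {
        case THING1:
            /* ... */
            break;
        case THING2:
            if (x == STUFF) {
                do_first_stuff();
                if (y == OTHER_STUFF)
                    break;              /* author meant: leave the if block...    */
                do_later_stuff();
            }
            initialize_modes_pointer(); /* ...but the break jumps past this line, */
            break;                      /* straight out of the switch.            */
        default:
            /* ... */
            break;
        }
        return *modes_pointer;          /* undefined behaviour when init was skipped */
    }

    int main(void)
    {
        /* Taking the THING2 path with y == OTHER_STUFF skips the initialization
         * and then dereferences a NULL pointer. */
        return network_code(THING2, STUFF, OTHER_STUFF);
    }

A stricter language, or even a compiler warning for this pattern, would have flagged the skipped initialization, which is the point the quoted brief is making.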
This is an example of why you want interoperable diversity in complex distributed systems.
By having everything so standardized and consistent, they had the exact same failure mode everywhere and lost redundant fault tolerance. If they had different interoperable switches, running different software, the outage wouldn't have been absolute.
When large complex distributed systems grow organically over time, they tend to wind up with diversity. It usually takes a big centralized project focused on efficiency to destroy that property.
I appreciate this comment. In my world of packet pushing, I try to promote vendor diversity for this reason.
The practical downsides of this diversity lie in the complexity of the interop (often slowing feature velocity), operations, and procurement/support.
But failures like the AT&T 4ESS outage have occurred in IP networks too, for example via BGP bugs. Diversity alleviates some of the global impact.
There are other ways of accomplishing this, like staged rollouts, without giving up the cost efficiency of implementing your own network only once, and while avoiding a combinatorial explosion in testing complexity.
You can sometimes play this game with vendors because you want them to give you an interoperable interface so that you avoid vendor lock-in and get better pricing, but that's a secondary benefit, and staged rollouts should still be performed even if you have heterogeneous software.
Staged rollouts do not protect you from long-lurking bugs. Even in this AT&T case they most certainly did a staged rollout, simply because they couldn't shut off the entire phone network to run an update across all systems at once.
Obviously you do everything possible to stop an outage like this happening...
But when it inevitably does, you should be prepared for a full, simultaneous system restart, i.e. so that no 'bad' signals or data from the old system can impact the new one.
That is the sort of thing you should practice in the staging environment from time to time, just in case it's ever needed. It could have taken this outage from many hours down to a matter of minutes.
You should also design all your code to be rollbackable... But for the very rare case where a rollback won't solve the problem (e.g. an outage caused by changes outside your organisation's control), you also need to be able to do a rapid code change, recompile, and push. Many companies aren't able to do this; for example, their release process involves multiple days' worth of interlocked manual steps.
If someone wants to read a lot more detail about this incident, there is a book about it. It's been a decade or two since I read it, but I remember it being well written.
‘The Day the Phones Stopped Ringing’ by Leonard Lee
> ‘The Day the Phones Stopped Ringing’ by Leonard Lee
Definitely going to look into this! The whole *ESS architecture still underpins a lot of the telephony system. There are quite a few still running, even though other TDM equipment is being phased out.
That sounds pretty diligent!
https://en.wikipedia.org/wiki/Protel
I briefly had an interest in learning Q, then looked at some code: https://github.com/KxSystems/cookbook/blob/master/start/buil...
Why not just build what you need with C/arrow/parquet?
Don't get yourself in that position.
http://sunnyday.mit.edu/nasa-class/Ariane5-report.html
https://news.ycombinator.com/item?id=30556601