I remember that outage. It was finally blamed (as described in this brief) on phone switch manufacturer DSC. IIRC this killed the company. Their SLA with their customers was something like three minutes of downtime per decade.
DSC was our customer at Cygnus. They were interesting as a customer (tough requirements but they paid a lot for them). For example if they reported a bug and got a fix they diffed the binaries and looked at every difference to be sure that the change was a result of the fix, and nothing else (no, they didn’t want upgrades).
> Their SLA with their customers was something like three minutes of downtime per decade.
That is insane. I really feel like modern SLAs are only getting worse - so much so that most companies fudge them, and try their hardest to never declare any sort of outage.
The Bell System standard was an extremely high Erlang number, which of course they strove to provide at the lowest cost. That meant an extremely high utilization rate on the hardware, which in turn meant extreme uptime (compare to the then-contemporary PTT QoS even in major economies like France). (A rough Erlang B sketch follows at the end of this comment.)
This is also why the software itself was designed with so many internal defenses and what I would consider an “immune system”. I’ve never seen anything like it even on an aircraft control system. That is mentioned in passing in the brief article but is easily missed if you don’t know what it’s referring to.
Most of what is done on the Internet at, say, “layer 5 or above” isn’t at all important so there’s no need for this level of SLA, but the actual backbone carriers do still carry SLAs at around that level. With packet switching it’s easier for them to provide than it was in the days of the 4ESS and 5ESS.
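Aside on the "Erlang number" above, for readers who haven't engineered trunk groups: the Erlang B formula ties offered traffic, trunk count, and blocking probability together, which is where the cost-versus-quality trade-off described in that comment lives. A minimal sketch of the calculation in C, with purely hypothetical traffic figures:

    /* Erlang B blocking probability: the chance that a new call finds all
     * trunks busy, given an offered load in erlangs. Uses the standard
     * recurrence B(E,0)=1, B(E,m)=E*B(E,m-1)/(m+E*B(E,m-1)).
     * The traffic figures in main() are hypothetical, for illustration only. */
    #include <stdio.h>

    static double erlang_b(double erlangs, int trunks)
    {
        double b = 1.0;                          /* B(E, 0) = 1 */
        for (int m = 1; m <= trunks; m++)
            b = (erlangs * b) / (m + erlangs * b);
        return b;
    }

    int main(void)
    {
        double offered = 100.0;                  /* hypothetical offered load, erlangs */
        for (int trunks = 100; trunks <= 130; trunks += 10)
            printf("%3d trunks at %.0f erlangs offered -> blocking %.4f\n",
                   trunks, offered, erlang_b(offered, trunks));
        return 0;
    }

The point of the exercise: to hit an aggressive blocking target at minimum cost you run the trunk groups close to their engineered load, so any switch downtime immediately shows up as blocked calls.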
Telco is a highly regulated business. If you go down, the FCC and the state regulator are all over you. If you break 911 service, heaven help you.
As a result, five or six nines of availability is pretty standard. Six nines means you can lose no more than about 31 seconds a year (quick arithmetic sketch at the end of this comment). Traditional telco consequently had all kinds of cool tech in it, e.g. the Nortel DMS had live code patching as early as 1990, IIRC.
At one time I worked at a vendor of IP telco switches, aiming to replace the legacy Nortel and Lucent with smaller tech. We had to learn some very, very hard lessons about reliability, but we eventually got there.
Today, I see cloud hyperscalers claiming that they can run telco workloads, but I remain pretty skeptical until they can prove that they can switch a call, mid-stream, from one node to another, without losing the audio, while transcoding it from one codec to another. I'm not saying that public cloud needs to make the same tech choices regarding resilience, but today's web tech absolutely will not cut it.
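Back-of-the-envelope for the "31 seconds a year" figure above: the downtime budget is just the seconds in a year times one minus the availability target. A quick illustrative sketch:

    /* Downtime budget per year for N nines of availability.
     * Six nines (99.9999%) works out to roughly 31.5 seconds per year. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double seconds_per_year = 365.25 * 24.0 * 3600.0;  /* ~31,557,600 s */

        for (int nines = 3; nines <= 6; nines++) {
            double availability = 1.0 - pow(10.0, -nines);       /* e.g. 0.999999 */
            double budget_s = seconds_per_year * (1.0 - availability);
            printf("%d nines (%.4f%%): about %.1f seconds of downtime allowed per year\n",
                   nines, availability * 100.0, budget_s);
        }
        return 0;
    }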
Most modern SLAs are worthless. The penalty is meaningless, and/or "up" is defined so loosely that a service failing 100% of requests still counts as "up" because it's responding at all, or a single customer can have a total outage while the service counts as up because it's servicing others.
Networking is the last bastion of SLAs that actually seem to matter.
> For example if they reported a bug and got a fix they diffed the binaries and looked at every difference to be sure that the change was a result of the fix, and nothing else (no, they didn’t want upgrades).
According to a tech who actually applied the updates (Long Lines group on FB), "The DSC incident occurred in June 1991, unrelated to this AT&T incident."
Thanks. It was reported otherwise in the news back then, but TBH I'd trust someone likely to have been the person responsible, or close to that person.
(it might sound crazy these days to trust "someone on the Internet" but really: what's the incentive not to tell the truth about something like this?)
> Clearly, the use of C programs and compilers contributed to the breakdown. A more structured programming language with stricter compilers would have made this particular defect much more obvious.
Nice to see that "should have used Rust" has been a thing since before Rust existed.
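For context on the quoted claim: published accounts of this outage describe a misplaced "break" statement in nested C control flow inside the 4ESS recovery software. The sketch below is a simplified, hypothetical reconstruction of that class of defect (names and structure are illustrative, not the actual switch source): in C, "break" binds to the nearest enclosing switch or loop, never to an if, so a line the author expected to run on that path is silently skipped.

    /* Hypothetical reconstruction of the class of bug described in public
     * accounts of the outage; names are illustrative, not real 4ESS code.
     * The inner break exits the enclosing switch, not the if, so
     * initialize_modes_pointer() is skipped on that path and the final
     * dereference reads an uninitialized (here NULL) pointer. */
    #include <stddef.h>

    enum { THING1, THING2, STUFF, OTHER_STUFF };

    static int *modes_pointer = NULL;

    static void do_first_stuff(void) { /* ... */ }
    static void do_later_stuff(void) { /* ... */ }
    static void initialize_modes_pointer(void) { static int mode; modes_pointer = &mode; }

    int network_code(int line, int x, int y)
    {
        switch (line) {
        case THING1:
            /* ... */
            break;
        case THING2:
            if (x == STUFF) {
                do_first_stuff();
                if (y == OTHER_STUFF)
                    break;              /* author meant: leave the if block...    */
                do_later_stuff();
            }
            initialize_modes_pointer(); /* ...but the break jumps past this line, */
            break;                      /* straight out of the switch.            */
        default:
            /* ... */
            break;
        }
        return *modes_pointer;          /* undefined behaviour when init was skipped */
    }

    int main(void)
    {
        /* Taking the THING2 path with y == OTHER_STUFF skips the initialization
         * and then dereferences a NULL pointer. */
        return network_code(THING2, STUFF, OTHER_STUFF);
    }

A stricter language, or even a compiler warning for this pattern, would have flagged the skipped initialization, which is the point the quoted brief is making.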
This is an example of why you want interoperable diversity in complex distributed systems.
By having everything so standardized and consistent, they had the exact same failure mode everywhere and lost redundant fault tolerance. If they had different interoperable switches, running different software, the outage wouldn't have been absolute.
When large complex distributed systems grow organically over time, they tend to wind up with diversity. It usually takes a big centralized project focused on efficiency to destroy that property.
I appreciate this comment. In my world of packet pushing, I try to promote vendor diversity for this reason.
The practical downsides of this diversity lie in the complexity of the interop (often slowing feature velocity), operations, and procurement/support.
But failures like the AT&T 4ESS outage have occurred in IP networks too, for example via BGP bugs. Diversity alleviates some of the global impact.
There are other ways of accomplishing this, like staged rollouts, without giving up the cost efficiency of implementing your own network only once, and while avoiding a combinatorial explosion in testing complexity.
You can sometimes play this game with vendors because you want them to give you an interoperable interface so that you avoid vendor lock-in and get better pricing, but that's a secondary benefit, and staged rollouts should still be performed even if you have heterogeneous software.
Staged rollouts do not protect you from long-lurking bugs. Even in this AT&T case they most certainly did a staged rollout, simply because they couldn't shut off the entire phone network to run an update across all systems at once.
Obviously you do everything possible to stop an outage like this happening...
But when it inevitably does, you should be prepared for a full, simultaneous system restart, i.e. so that no 'bad' signals or data from the old system can impact the new one.
That is the sort of thing you should practice in the staging environment from time to time, just in case it's ever needed. It could have taken this outage from many hours down to a matter of minutes.
You should also design all your code to be rollbackable... But for the very rare case where a rollback won't solve the problem (e.g. an outage caused by changes outside your organisation's control), you also need to be able to do a rapid code change, recompile, and push. Many companies aren't able to do this; for example, their release process involves multiple days' worth of interlocked manual steps.
If someone wants to read a lot more detail about this incident, there is a book about it. It's been a decade or two since I read it, but I remember it being well written.
‘The Day the Phones Stopped Ringing’ by Leonard Lee
> ‘The Day the Phones Stopped Ringing’ by Leonard Lee
Definitely going to look into this! The whole *ESS architecture still underpins a lot of the telephony system. There are quite a few still running, even though other TDM equipment is being phased out.
That sounds pretty diligent!
https://en.wikipedia.org/wiki/Protel
I briefly had an interest in learning Q, then looked at some code: https://github.com/KxSystems/cookbook/blob/master/start/buil...
Why not just build what you need with C/arrow/parquet?
Don't get yourself in that position.
http://sunnyday.mit.edu/nasa-class/Ariane5-report.html
https://news.ycombinator.com/item?id=30556601