In the entire history of the Bell System, no electromechanical exchange was ever down for more than 30 minutes for any reason other than a natural disaster, with one exception: a major fire in New York City that left 170,000 phones out of service for three weeks.[1] The Bell System pulled in resources and people from all over the system to replace and rewire several floors of equipment and cabling.
That record has not been maintained in the digital era.
The long distance system did not originally need the Bedminster, NJ network control center to operate. Bedminster sent routing updates periodically to the regional centers, but they could fall back to static routing if necessary. There was, by design, no single point of failure. Not even close. That was a basic design criterion in telecom prior to electronic switching. The system was designed to have less capacity but still keep running if parts of it went down.
[1] https://www.youtube.com/watch?v=f_AWAmGi-g8
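To make that design principle concrete, here is a minimal sketch (my own illustration, not anything from the actual Bell System; all names and numbers are invented) of "periodic dynamic updates with a static fallback": the regional center prefers fresh routes from the control center but quietly degrades to a baked-in table when those updates stop arriving.

    import time

    # Hypothetical illustration of "dynamic routing with a static fallback".
    # None of these names come from the real Bell System; it is just the shape
    # of the idea: no single point of failure, only reduced efficiency.

    STATIC_ROUTES = {"212": "trunk-east", "415": "trunk-west"}   # baked-in fallback
    MAX_UPDATE_AGE = 300  # seconds before dynamic routes are considered stale

    class RegionalCenter:
        def __init__(self):
            self.dynamic_routes = {}
            self.last_update = 0.0

        def receive_update(self, routes):
            """Periodic routing update from the (optional) network control center."""
            self.dynamic_routes = dict(routes)
            self.last_update = time.time()

        def route(self, area_code):
            """Prefer fresh dynamic routes; otherwise fall back to static routing."""
            fresh = (time.time() - self.last_update) < MAX_UPDATE_AGE
            if fresh and area_code in self.dynamic_routes:
                return self.dynamic_routes[area_code]
            return STATIC_ROUTES.get(area_code, "default-trunk")

    center = RegionalCenter()
    print(center.route("212"))          # control center down: static route still works
    center.receive_update({"212": "optimized-trunk-7"})
    print(center.route("212"))          # with updates: better route, same service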
That electromechanical system also switched significantly fewer calls than its digital counterparts!
Most modern-day telcos that I have seen still have multiple power supplies/line cards/uplinks in place, designed for redundancy. However, the new systems can also do so much more and are so much more flexible that they can be configured out of existence just as easily!
Some of this is also just poor software. On some of the big carrier-grade routers you can configure many things, but the combination of things that you configure may also just cause things not to work correctly, or even worse, pull down the entire chassis. I don't have first-hand experience of how good the early-2000s software was, but I would guess that configurability/flexibility has come at a serious cost to the reliability of the network.
The expectation should be that, as you switch more and more, so that the cost of a 30-minute pause gets higher and higher, the situation would improve; a more modern system might have been expected to boast that it never had a break lasting more than, say, 30 seconds outside of a natural disaster.
A relatively famous example of the extent to which Indiana Bell went to avoid disrupting telephone service: rotating and relocating its headquarters over a few weeks.
https://en.wikipedia.org/wiki/AT%26T_Building_(Indianapolis)
Surely that downtime claim is apocryphal. The early decades were plagued by poor service due to maintenance, mechanical failure, and human error. Manual switchboards were the primary method of connecting calls, and the intense workload and physical demands on operators often led to service disruptions. In 1907, over 400 operators went on strike in Toronto, severely impacting phone service; the strike was driven by wage disputes, increased working hours, and poor working conditions.
They didn't have downtime logs, but that doesn't mean that the rapid growth of telephone demand didn't outpace the Bell System's capacity to provide adequate service. The company struggled to balance expansion with maintaining service quality, leading to intermittent service issues.
The Bell System faced significant public dissatisfaction due to poor service quality. This was compounded by internal issues such as poor employee morale and fierce competition.
Bell Canada had a major outage on July 17, 1999, when a tool was dropped on the bus bar for the main battery power and ignited the hydrogen from the batteries in one of the exchanges in downtown Toronto. The fire department insisted that all power in the area be shut down, which led to the main switch that handled long distance call routing for all 1-800 numbers being offline for the better part of a day.
One thing that was fascinating about the Rogers outage was on the wireless side: because "just" the core was down, the towers were still up.
So mobile phones would connect to the tower just enough to register, but not be able to do anything, not even call 9-1-1, and without trying to fail over to other mobile networks. Devices showed zero bars, but field test mode would show some handshake succeeding.
(The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)
Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?), or disable that eSIM, but you'd have to know that you can do that. Unsure if you'd still be at the mercy of Rogers being the most powerful signal and still failing to get your 9-1-1 call through.
Rogers claimed to have no ability to power down the towers without a truck-roll (which is another aspect where widespread OOB could have come in handy).
Various stories of radio stations (which Rogers also owns a lot of) not being able to connect the studio to the transmitter, so some tech went with an mp3 player to play pre-recorded "evergreen" content. Others just went off-air.
https://www.theregister.com/2022/07/25/canadian_isp_rogers_o...
> Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?)
In sane handsets (ones where the battery is still removable), that tool was and still is a fingernail, which most have on their person.
I believe the innovation of the need for a special SIM eject tool was bestowed upon us by the same fruit company that gave us floppy and optical drives without manual eject buttons over 30 years ago.
I have fond memories of the fruit company taking the next step: removing the floppy drive bezel from the drive and instead having the floppy drive slot be part of the overall chassis front panel. Of course, their mechanical tolerances were nothing like they are today, so if you looked at the computer crosseyed, the front panel would fail to align to the actual internal disk path, and ejecting the disk would cause it to get stuck behind the front chassis panel. One could rescue it by careful wiggling with a tool to guide the disk through the slot or by removing the entire front panel.
Meanwhile “PCs” had a functional but ugly rectangular opening the size of the entire drive, and the drive had its own bezel, and imperfect alignment between drive and case looked a bit ugly but had no effect on function.
(I admit I’m suspicious that Apple’s approach was a cost optimization, not an aesthetic optimization.)
You could operate the ejection mechanism by hand both on optical and floppy disk drives with an uncurled paperclip (or a SIM card ejection tool were they to exist at that point in time). But I wouldn't ascribe the introduction of the motorized tray to the fruit company, it was the wordmark company: https://youtu.be/bujOWWTfzWQ
Sounds like a problem that should be (rather easily) fixable in the Operating System, no?
If the emergency call doesn’t go through, try the call over a different network.
This would also mitigate problems we see from time to time where emergency calls don’t work because the uplink to the emergency call center was impacted either physically or by a bad software update.
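Purely as a sketch of the suggestion above, and assuming a hypothetical handset API (the real 3GPP emergency-call procedures are far more involved, and every name here is made up), the failover logic could look something like this:

    import time

    # Illustrative only: hypothetical handset logic that retries an emergency call
    # on any other visible network if the home network accepts the attach but the
    # call itself never completes.

    EMERGENCY_TIMEOUT = 10  # seconds to wait before giving up on one network

    def place_emergency_call(networks, dial):
        """Try the home network first, then every other visible network."""
        for net in sorted(networks, key=lambda n: not n.is_home):
            call = dial(net, "911")
            deadline = time.time() + EMERGENCY_TIMEOUT
            while time.time() < deadline:
                if call.connected():
                    return net          # success: report which network carried the call
                time.sleep(0.5)
            call.abort()                # tower answered but the core is dead: move on
        raise RuntimeError("no network could complete the emergency call")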
> (The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)
How does one know ahead of time if any particular change is "enterprise-risking"? It appeared to be a fairly routine set of changes that were going just fine:
> The report summary says that in the weeks leading up to the outage, Rogers was undergoing a seven-phase process to upgrade its network. The outage occurred during the sixth phase of the upgrade.
* https://www.cbc.ca/news/politics/rogers-outage-human-error-s...
It turns out that they self-DoSed certain components:
> Staff at Rogers caused the shutdown, the report says, by removing a control filter that directed information to its appropriate destination.
> Without the filter in place, a flood of information was sent into Rogers' core network, overloading and crashing the system within minutes of the control filter being removed.
* Ibid
> In a letter to the CRTC, Rogers stated that the deletion of a routing filter on its distribution routers caused all possible routes to the internet to pass through the routers, exceeding the capacity of the routers on its core network.
* https://en.wikipedia.org/wiki/2022_Rogers_Communications_out...
> Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The core network routers allow Rogers wireline and wireless customers to access services such as voice and data. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration.
* https://crtc.gc.ca/eng/publications/reports/xona2024.htm
These types of things happen:
> In October, Facebook suffered a historic outage when their automation software mistakenly withdrew the anycasted BGP routes handling its authoritative DNS rendering its services unusable. Last month, Cloudflare suffered a 30-minute outage when they pushed a configuration mistake in their automation software which also caused BGP routes to be withdrawn.
* https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-ou...
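To make the failure mode concrete, here is a toy model, not Rogers' actual configuration and with invented numbers, of why pulling a route filter is so violent: the distribution routers go from passing a filtered subset of prefixes to passing something like a full internet table, which the core routers simply cannot hold.

    # Toy model of the failure mode: a policy filter on distribution routers
    # normally limits how many prefixes reach the core. Remove it, and the core
    # is asked to hold the full table, which it cannot.

    FULL_TABLE = 900_000        # rough size of a full internet routing table (illustrative)
    FILTERED_TABLE = 50_000     # what the filter used to let through (invented number)
    CORE_CAPACITY = 250_000     # invented capacity of a core router's route memory

    def routes_reaching_core(filter_in_place: bool) -> int:
        return FILTERED_TABLE if filter_in_place else FULL_TABLE

    for filter_in_place in (True, False):
        n = routes_reaching_core(filter_in_place)
        status = "OK" if n <= CORE_CAPACITY else "OVERLOAD -> core routers crash"
        print(f"filter={filter_in_place}: {n:,} routes into core ({status})")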
BTW: anyone who wants to really experience how complex internet routing is, go and join DN42 (https://dn42.dev). This is a fake internet built on a network of VPN tunnels and using the same routing systems. As long as you're just acting as a leaf node, it's pretty straightforward. If you want to attach to the network at multiple points and not just VPN them all to the same place, now you have to design a network just like an ISP would, with IGP and so on.
Router config changes are simultaneously very commonplace and incredibly risky.
I've seen outages caused by a single bad router advertisement that caused global crashes due to route poisoning interacting with a vendor bug. RPKI enforcement caused massive congestion on transit links. Route leaks have DoSed entire countries (https://www.internetsociety.org/blog/2017/08/google-leaked-p...). Even something as simple as a peer removing rules for clearing ToS bits resulted in a month of 20+ engineers trying to figure out why an engineering director was sporadically being throttled to ~200kbps when trying to access Google properties.
Running a large-scale production network is hard.
edit: in case it is not obvious: I agree entirely with you -- the routine config changes that do risk the enterprise are often very hard to identify ahead of time.
I’m reminded of when an old AT&T building went on sale as a house, and one of its selling points was that you could get power from two different power companies if you wanted. This highlighted to me the level of redundancy required to take such things seriously. It probably cost the company a lot to hook up the wires, and I doubt the second power company paid anything for the hookup. Big Bell did it there, and I’m sure they did it everywhere else too.
Edit: I bet it had diesel generators when it was in service with AT&T to boot.
> I bet it had diesel generators when it was in service with AT&T to boot.
20 to 25 years ago I visited a telecom switch center in Paris, the one under the Tuileries garden next to the Louvre. They had a huge, empty diesel generator room. The generators had all been replaced by a small turbine (not sure it's the right English term), just the same as what's used to power a helicopter. It was in a relatively small soundproof box, with a special vent for the exhaust, kind of lost on the side of a huge underground room.
As the guy in charge explained to us, it was much more compact and convenient. The big risk was in getting it started; that was the tricky part. Once started, it was extremely reliable.
> Edit: I bet it had diesel generators when it was in service with AT&T to boot.
That's where AT&T screwed up in Nashville when their DC got bombed. They relied on natural gas generators for their electrical backup. No diesel tank farm. Big fire = the fire department shuts down natural gas as widely as deemed necessary, and everything slowly dies as the UPS batteries drain.
They also didn't have roll-up generator electrical feed points, so they had to figure out how to wire those up once they could get access again, delaying recovery.
https://old.reddit.com/r/sysadmin/comments/kk3j0m/nashville_...
I've seen some power outages in California, and noticed that Comcast/Xfinity had generator trailers rolled up next to telephone poles, probably powering the low-voltage network infrastructure below the power lines.
It’s trivial when you have the resources that come from being one of Canada’s 3 telecom oligopoly members.
Unfortunately the CRTC is run by former execs/management of Bell, Telus, and Rogers, and our anti-competition bureau doesn't seem to understand its purpose when it consistently allows these 3 to buy up any and all small competitors that gain even a regional market share.
Meanwhile their service is mediocre and overpriced, which they’ll chalk up to geographical challenges of operating in Canada while all offering the exact same plans at the exact same prices, buying sports teams, and paying a reliable dividend.
It's worse than that: 2 of the 3 telecom oligopoly members share (most of) their entire wireless network, with one providing most towers in the West and the other in the East.
I'm sure those 2 compete very hard with each other with that level of co-dependency.
There is OOB for carriers and OOB for non-carriers. OOB for carriers is significantly more complex and resource-intensive than OOB for non-carriers. This topic (OOB or forgo it) has been beaten to death over the last 20 years in operator circles; the responsible consensus is that trying to shave a percentage off operating expenses by cheaping out on your OOB is wrong. That said, it does shock me that one of the tier-1 carriers in Canada was this... ignorant? Did they never expect it to rain or something? Wild.
When I see out of band management at remote locations (usually for a dedicated doctors network run by the health authority that gets deployed at offices and clinics) it's generally analog phone line -> modem -> console port. Dialup is more than enough if all you need to do is reset a router config.
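For flavor, a minimal sketch of what driving that analog line -> modem -> console port path might look like with pyserial and plain Hayes AT commands; the device path, phone number, and prompts are placeholders, and a real deployment would sit behind a console server with proper authentication:

    import serial   # pyserial

    # Hypothetical dial-up out-of-band console session: dial the modem attached to
    # the router's console port, wait for carrier, then talk to the console.
    MODEM_DEV = "/dev/ttyUSB0"      # placeholder serial device for the local modem
    OOB_NUMBER = "13065550123"      # placeholder phone number of the remote modem

    def oob_session():
        with serial.Serial(MODEM_DEV, 9600, timeout=5) as port:
            port.write(b"ATDT" + OOB_NUMBER.encode() + b"\r")   # Hayes dial command
            banner = port.read(256)                              # expect "CONNECT ..."
            if b"CONNECT" not in banner:
                raise RuntimeError("no carrier: the OOB path itself is down")
            port.write(b"\r")                # wake up the console, e.g. a login prompt
            print(port.read(1024).decode(errors="replace"))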
Not 100% out of band for a telco though, unless they made sure to use a competitor's lines.
Here in Australia, POTS lines have been completely decommissioned; the UK's will be switched off by the end of 2025, and I'm assuming there are similar timelines in lots of other countries.
They're on the way out in France, too. New buildings don't get copper anymore, only fiber.
However, as I understand it, at least for commercial use, the phone company provides some kind of box that has battery-backing so it can provide phone service for a certain duration in case of emergency.
Reminds me of a data center that said they had a backup connection, and I pointed out that only one fiber was coming into the data center. They said, "Oh, it's on a different lambda[1]" :-)
[1] Wavelength-division multiplexing sends multiple signals over the same fiber by using different wavelengths for different channels. Each wavelength is sometimes referred to as a lambda.
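To make "lambda" concrete: each channel is just a different carrier wavelength on the same strand. A small illustration using the standard ITU DWDM grid (my own example, nothing specific to that data center):

    # Each "lambda" is a distinct carrier wavelength on the same fiber.
    # Channels below follow the common ITU DWDM grid: 193.1 THz anchor, 100 GHz spacing.
    C = 299_792_458  # speed of light, m/s

    def channel_wavelength_nm(n, spacing_ghz=100):
        freq_hz = (193_100 + n * spacing_ghz) * 1e9     # 193.1 THz + n * spacing
        return C / freq_hz * 1e9                        # metres -> nanometres

    for n in (-1, 0, 1):
        print(f"channel {n:+d}: {channel_wavelength_nm(n):.2f} nm")
    # Two services on channels -1 and +1 share one physical fiber but not one lambda,
    # so a backhoe cutting that fiber still takes out "both" paths.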
1. The risk, when you use a competitor's service, of your competitor cutting off service, especially at an inopportune time (like your service undergoing a major disruption, where cutting off your OOBM would be kicking you while you are down, but such is business).
2. The risk that you and your competitor unknowingly share a common dependency, like utility lines; if the common dependency fails then both you and your OOBM are offline.
The whole point of paying for and maintaining an OOBM is to manage and compensate for the risks of disruption to your main infrastructure. Why would you knowingly add risks you can't control for on top of a framework meant to help you manage risk? It misses the point of why you have the OOBM in the first place.
Maybe 10-15 years ago there was a local Rogers outage that was an example of the #2 failure you're describing. From what I recall, SaskTel had a big bundle of about 3,000 twisted pairs running under a park. Some of those went to a SaskTel tower, some to SaskTel residential wireline customers, and some went to a Rogers facility. Along comes a backhoe and slices through the entire bundle.
> configurability/flexibility has come at a serious cost to the reliability of the network
Unreliability is unreliability even if it comes through software, and we should treat broken software as broken, not as "just a software error".
> If the emergency call doesn’t go through, try the call over a different network.
I didn’t think this failure mode was even possible.
> a small turbine (not sure it's the right English term)
That's the right English word, yes. And that's pretty cool!
> an old AT&T building went on sale as a house
Listing removed a couple of weeks ago.