In the entire history of the Bell System, no electromechanical exchange was ever down for more than 30 minutes for any reason other than a natural disaster, with one exception: a major fire in New York City that left 170,000 phones out of service for three weeks.[1] The Bell System pulled in resources and people from all over the system to replace and rewire several floors of equipment and cabling.
That record has not been maintained in the digital era.
The long distance system did not originally need the Bedminster, NJ network control center to operate. Bedminster sent routing updates periodically to the regional centers, but they could fall back to static routing if necessary. There was, by design, no single point of failure. Not even close. That was a basic design criterion in telecom prior to electronic switching. The system was designed to have less capacity but still keep running if parts of it went down.
[1] https://www.youtube.com/watch?v=f_AWAmGi-g8
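To make that design principle concrete, here is a minimal sketch (my own illustration, not anything from the actual Bell System; all names and numbers are invented) of "periodic dynamic updates with a static fallback": the regional center prefers fresh routes from the control center but quietly degrades to a baked-in table when those updates stop arriving.

    import time

    # Hypothetical illustration of "dynamic routing with a static fallback".
    # None of these names come from the real Bell System; it is just the shape
    # of the idea: no single point of failure, only reduced efficiency.

    STATIC_ROUTES = {"212": "trunk-east", "415": "trunk-west"}   # baked-in fallback
    MAX_UPDATE_AGE = 300  # seconds before dynamic routes are considered stale

    class RegionalCenter:
        def __init__(self):
            self.dynamic_routes = {}
            self.last_update = 0.0

        def receive_update(self, routes):
            """Periodic routing update from the (optional) network control center."""
            self.dynamic_routes = dict(routes)
            self.last_update = time.time()

        def route(self, area_code):
            """Prefer fresh dynamic routes; otherwise fall back to static routing."""
            fresh = (time.time() - self.last_update) < MAX_UPDATE_AGE
            if fresh and area_code in self.dynamic_routes:
                return self.dynamic_routes[area_code]
            return STATIC_ROUTES.get(area_code, "default-trunk")

    center = RegionalCenter()
    print(center.route("212"))          # control center down: static route still works
    center.receive_update({"212": "optimized-trunk-7"})
    print(center.route("212"))          # with updates: better route, same service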
That electromechanical system also switched significantly fewer calls than its digital counterparts!
Most modern-day telcos that I have seen still have multiple power supplies/line cards/uplinks in place, designed for redundancy. However, the new systems can also do so much more and are so much more flexible that they can be configured out of existence just as easily!
Some of this is also just poor software. On some of the big carrier-grade routers you can configure many things, but the combination of things that you configure may also just cause things not to work correctly, or even worse, pull down the entire chassis. I don't have first-hand experience of how good the early-2000s software was, but I would guess that configurability/flexibility has come at a serious cost to the reliability of the network.
The expectation should be that, as you switch more and more, so that the cost of a 30-minute pause gets higher and higher, the situation would improve; a more modern system might have been expected to boast that it never had a break lasting more than, say, 30 seconds outside of a natural disaster.
A relatively famous example of the extent to which Indiana Bell went to avoid disrupting telephone service: rotating and relocating its headquarters over a few weeks.
https://en.wikipedia.org/wiki/AT%26T_Building_(Indianapolis)
Surely that downtime claim is apocryphal. The early decades were plagued by poor service due to maintenance, mechanical failure, and human error. Manual switchboards were the primary method of connecting calls, and the intense workload and physical demands on operators often led to service disruptions. In 1907, over 400 operators went on strike in Toronto, severely impacting phone service; the strike was driven by wage disputes, increased working hours, and poor working conditions.
They didn't have downtime logs, but that doesn't mean that the rapid growth of telephone demand didn't outpace the Bell System's capacity to provide adequate service. The company struggled to balance expansion with maintaining service quality, leading to intermittent service issues.
The Bell System faced significant public dissatisfaction due to poor service quality. This was compounded by internal issues such as poor employee morale and fierce competition.
Bell Canada had a major outage on July 17, 1999, when a tool was dropped on the bus bar for the main battery power and ignited the hydrogen from the batteries in one of the exchanges in downtown Toronto. The fire department insisted that all power in the area be shut down, which led to the main switch that handled long distance call routing for all 1-800 numbers being offline for the better part of a day.
One thing that was fascinating about the Rogers outage was on the wireless side: because "just" the core was down, the towers were still up.
So mobile phones would connect to the tower just enough to register, but not be able to do anything, not even call 9-1-1, and without trying to fail over to other mobile networks. Devices showed zero bars, but field test mode would show some handshake succeeding.
(The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)
Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?), or disable that eSIM, but you'd have to know that you can do that. Unsure if you'd still be at the mercy of Rogers being the most powerful signal and still failing to get your 9-1-1 call through.
Rogers claimed to have no ability to power down the towers without a truck-roll (which is another aspect where widespread OOB could have come in handy).
Various stories of radio stations (which Rogers also owns a lot of) not being able to connect the studio to the transmitter, so some tech went with an mp3 player to play pre-recorded "evergreen" content. Others just went off-air.
https://www.theregister.com/2022/07/25/canadian_isp_rogers_o...
> Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?)
In sane handsets (ones where the battery is still removable), that tool was and still is a fingernail, which most have on their person.
I believe the innovation of the need for a special SIM eject tool was bestowed upon us by the same fruit company that gave us floppy and optical drives without manual eject buttons over 30 years ago.
I have fond memories of the fruit company taking the next step: removing the floppy drive bezel from the drive and instead having the floppy drive slot be part of the overall chassis front panel. Of course, their mechanical tolerances were nothing like they are today, so if you looked at the computer crosseyed, the front panel would fail to align to the actual internal disk path, and ejecting the disk would cause it to get stuck behind the front chassis panel. One could rescue it by careful wiggling with a tool to guide the disk through the slot or by removing the entire front panel.
Meanwhile “PCs” had a functional but ugly rectangular opening the size of the entire drive, and the drive had its own bezel, and imperfect alignment between drive and case looked a bit ugly but had no effect on function.
(I admit I’m suspicious that Apple’s approach was a cost optimization, not an aesthetic optimization.)
You could operate the ejection mechanism by hand both on optical and floppy disk drives with an uncurled paperclip (or a SIM card ejection tool were they to exist at that point in time). But I wouldn't ascribe the introduction of the motorized tray to the fruit company, it was the wordmark company: https://youtu.be/bujOWWTfzWQ
Sounds like a problem that should be (rather easily) fixable in the Operating System, no?
If the emergency call doesn’t go through, try the call over a different network.
This would also mitigate problems we see from time to time where emergency calls don’t work because the uplink to the emergency call center was impacted either physically or by a bad software update.
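Purely as a sketch of the suggestion above, and assuming a hypothetical handset API (the real 3GPP emergency-call procedures are far more involved, and every name here is made up), the failover logic could look something like this:

    import time

    # Illustrative only: hypothetical handset logic that retries an emergency call
    # on any other visible network if the home network accepts the attach but the
    # call itself never completes.

    EMERGENCY_TIMEOUT = 10  # seconds to wait before giving up on one network

    def place_emergency_call(networks, dial):
        """Try the home network first, then every other visible network."""
        for net in sorted(networks, key=lambda n: not n.is_home):
            call = dial(net, "911")
            deadline = time.time() + EMERGENCY_TIMEOUT
            while time.time() < deadline:
                if call.connected():
                    return net          # success: report which network carried the call
                time.sleep(0.5)
            call.abort()                # tower answered but the core is dead: move on
        raise RuntimeError("no network could complete the emergency call")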
> (The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)
How does one know ahead of time if any particular change is "enterprise-risking"? It appeared to be a fairly routine set of changes that were going just fine:
> The report summary says that in the weeks leading up to the outage, Rogers was undergoing a seven-phase process to upgrade its network. The outage occurred during the sixth phase of the upgrade.
* https://www.cbc.ca/news/politics/rogers-outage-human-error-s...
It turns out that they self-DoSed certain components:
> Staff at Rogers caused the shutdown, the report says, by removing a control filter that directed information to its appropriate destination.
> Without the filter in place, a flood of information was sent into Rogers' core network, overloading and crashing the system within minutes of the control filter being removed.
* Ibid
> In a letter to the CRTC, Rogers stated that the deletion of a routing filter on its distribution routers caused all possible routes to the internet to pass through the routers, exceeding the capacity of the routers on its core network.
* https://en.wikipedia.org/wiki/2022_Rogers_Communications_out...
> Rogers staff removed the Access Control List policy filter from the configuration of the distribution routers. This consequently resulted in a flood of IP routing information into the core network routers, which triggered the outage. The core network routers allow Rogers wireline and wireless customers to access services such as voice and data. The flood of IP routing data from the distribution routers into the core routers exceeded their capacity to process the information. The core routers crashed within minutes from the time the policy filter was removed from the distribution routers configuration.
* https://crtc.gc.ca/eng/publications/reports/xona2024.htm
These types of things happen:
> In October, Facebook suffered a historic outage when their automation software mistakenly withdrew the anycasted BGP routes handling its authoritative DNS rendering its services unusable. Last month, Cloudflare suffered a 30-minute outage when they pushed a configuration mistake in their automation software which also caused BGP routes to be withdrawn.
* https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-ou...
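To make the failure mode concrete, here is a toy model, not Rogers' actual configuration and with invented numbers, of why pulling a route filter is so violent: the distribution routers go from passing a filtered subset of prefixes to passing something like a full internet table, which the core routers simply cannot hold.

    # Toy model of the failure mode: a policy filter on distribution routers
    # normally limits how many prefixes reach the core. Remove it, and the core
    # is asked to hold the full table, which it cannot.

    FULL_TABLE = 900_000        # rough size of a full internet routing table (illustrative)
    FILTERED_TABLE = 50_000     # what the filter used to let through (invented number)
    CORE_CAPACITY = 250_000     # invented capacity of a core router's route memory

    def routes_reaching_core(filter_in_place: bool) -> int:
        return FILTERED_TABLE if filter_in_place else FULL_TABLE

    for filter_in_place in (True, False):
        n = routes_reaching_core(filter_in_place)
        status = "OK" if n <= CORE_CAPACITY else "OVERLOAD -> core routers crash"
        print(f"filter={filter_in_place}: {n:,} routes into core ({status})")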
BTW: anyone who wants to really experience how complex internet routing is, go and join DN42 (https://dn42.dev). This is a fake internet built on a network of VPN tunnels and using the same routing systems. As long as you're just acting as a leaf node, it's pretty straightforward. If you want to attach to the network at multiple points and not just VPN them all to the same place, now you have to design a network just like an ISP would, with IGP and so on.
Router config changes are simultaneously very commonplace and incredibly risky.
I've seen outages caused by a single bad router advertisement that caused global crashes due to route poisoning interacting with a vendor bug. RPKI enforcement caused massive congestion on transit links. Route leaks have DoSed entire countries (https://www.internetsociety.org/blog/2017/08/google-leaked-p...). Even something as simple as a peer removing rules for clearing ToS bits resulted in a month of 20+ engineers trying to figure out why an engineering director was sporadically being throttled to ~200kbps when trying to access Google properties.
Running a large-scale production network is hard.
edit: in case it is not obvious: I agree entirely with you -- the routine config changes that do risk the enterprise are often very hard to identify ahead of time.
I’m reminded of when an old AT&T building went on sale as a house, and one of its selling points was that you could get power from two different power companies if you wanted. This highlighted to me the level of redundancy required to take such things seriously. It probably cost the company a lot to hook up the wires, and I doubt the second power company paid anything for the hookup. Big Bell did it there, and I’m sure they did it everywhere else too.
Edit: I bet it had diesel generators when it was in service with AT&T to boot.
> I bet it had diesel generators when it was in service with AT&T to boot.
20 to 25 years ago I visited a telecom switch center in Paris, the one under the Tuileries garden next to the Louvre. They had a huge, empty diesel generator room. The generators had all been replaced by a small turbine (not sure it's the right English term), just the same as what's used to power a helicopter. It was in a relatively small soundproof box, with a special vent for the exhaust, kind of lost on the side of a huge underground room.
As the guy in charge explained to us, it was much more compact and convenient. The big risk was in getting it started; that was the tricky part. Once started, it was extremely reliable.
> Edit: I bet it had diesel generators when it was in service with AT&T to boot.
That's where AT&T screwed up in Nashville when their DC got bombed. They relied on natural gas generators for their electrical backup. No diesel tank farm. Big fire = the fire department shuts down natural gas as widely as deemed necessary, and everything slowly dies as the UPS batteries drain.
They also didn't have roll-up generator electrical feed points, so they had to figure out how to wire those up once they could get access again, delaying recovery.
https://old.reddit.com/r/sysadmin/comments/kk3j0m/nashville_...
I've seen some power outages in California, and noticed that Comcast/Xfinity had generator trailers rolled up next to telephone poles, probably powering the low-voltage network infrastructure below the power lines.
It’s trivial when you have the resources that come from being one of Canada’s 3 telecom oligopoly members.
Unfortunately the CRTC is run by former execs/management of Bell, Telus, and Rogers, and our anti-competition bureau doesn't seem to understand its purpose when it consistently allows these 3 to buy up any and all small competitors that gain even a regional market share.
Meanwhile their service is mediocre and overpriced, which they’ll chalk up to geographical challenges of operating in Canada while all offering the exact same plans at the exact same prices, buying sports teams, and paying a reliable dividend.
It's worse than that: 2 of the 3 telecom oligopoly members share (most of) their entire wireless network, with one providing most towers in the West and the other in the East.
I'm sure those 2 compete very hard with each other with that level of co-dependency.
There is OOB for carriers and OOB for non-carriers. OOB for carriers is significantly more complex and resource-intensive than OOB for non-carriers. This topic (OOB or forgo it) has been beaten to death over the last 20 years in operator circles; the responsible consensus is that trying to shave a percentage off operating expenses by cheaping out on your OOB is wrong. That said, it does shock me that one of the tier-1 carriers in Canada was this... ignorant? Did they never expect it to rain or something? Wild.
When I see out of band management at remote locations (usually for a dedicated doctors network run by the health authority that gets deployed at offices and clinics) it's generally analog phone line -> modem -> console port. Dialup is more than enough if all you need to do is reset a router config.
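For flavor, a minimal sketch of what driving that analog line -> modem -> console port path might look like with pyserial and plain Hayes AT commands; the device path, phone number, and prompts are placeholders, and a real deployment would sit behind a console server with proper authentication:

    import serial   # pyserial

    # Hypothetical dial-up out-of-band console session: dial the modem attached to
    # the router's console port, wait for carrier, then talk to the console.
    MODEM_DEV = "/dev/ttyUSB0"      # placeholder serial device for the local modem
    OOB_NUMBER = "13065550123"      # placeholder phone number of the remote modem

    def oob_session():
        with serial.Serial(MODEM_DEV, 9600, timeout=5) as port:
            port.write(b"ATDT" + OOB_NUMBER.encode() + b"\r")   # Hayes dial command
            banner = port.read(256)                              # expect "CONNECT ..."
            if b"CONNECT" not in banner:
                raise RuntimeError("no carrier: the OOB path itself is down")
            port.write(b"\r")                # wake up the console, e.g. a login prompt
            print(port.read(1024).decode(errors="replace"))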
Not 100% out of band for a telco though, unless they made sure to use a competitor's lines.
Here in Australia, POTS lines have been completely decommissioned; the UK's will be switched off by the end of 2025, and I'm assuming there are similar timelines in lots of other countries.
They're on the way out in France, too. New buildings don't get copper anymore, only fiber.
However, as I understand it, at least for commercial use, the phone company provides some kind of box that has battery-backing so it can provide phone service for a certain duration in case of emergency.
Reminds me of a data center that said they had a backup connection, and I pointed out that only one fiber was coming into the data center. They said, "Oh, it's on a different lambda[1]" :-)
[1] Wavelength-division multiplexing sends multiple signals over the same fiber by using different wavelengths for different channels. Each wavelength is sometimes referred to as a lambda.
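To make "lambda" concrete: each channel is just a different carrier wavelength on the same strand. A small illustration using the standard ITU DWDM grid (my own example, nothing specific to that data center):

    # Each "lambda" is a distinct carrier wavelength on the same fiber.
    # Channels below follow the common ITU DWDM grid: 193.1 THz anchor, 100 GHz spacing.
    C = 299_792_458  # speed of light, m/s

    def channel_wavelength_nm(n, spacing_ghz=100):
        freq_hz = (193_100 + n * spacing_ghz) * 1e9     # 193.1 THz + n * spacing
        return C / freq_hz * 1e9                        # metres -> nanometres

    for n in (-1, 0, 1):
        print(f"channel {n:+d}: {channel_wavelength_nm(n):.2f} nm")
    # Two services on channels -1 and +1 share one physical fiber but not one lambda,
    # so a backhoe cutting that fiber still takes out "both" paths.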
1. The risk, when you use a competitor's service, of your competitor cutting off service, especially at an inopportune time (like your service undergoing a major disruption, where cutting off your OOBM would be kicking you while you are down, but such is business).
2. The risk that you and your competitor unknowingly share a common dependency, like utility lines; if the common dependency fails then both you and your OOBM are offline.
The whole point of paying for and maintaining an OOBM is to manage and compensate for the risks of disruption to your main infrastructure. Why would you knowingly add risks you can't control for on top of a framework meant to help you manage risk? It misses the point of why you have the OOBM in the first place.
Maybe 10-15 years ago there was a local Rogers outage that was an example of the #2 failure you're describing. From what I recall, SaskTel had a big bundle of about 3,000 twisted pairs running under a park. Some of those went to a SaskTel tower, some to SaskTel residential wireline customers, and some went to a Rogers facility. Along comes a backhoe and slices through the entire bundle.
> configurability/flexibility has come at a serious cost to the reliability of the network
Unreliability is unreliability even if it comes through software, and we should treat broken software as broken, not as "just a software error".
> If the emergency call doesn’t go through, try the call over a different network.
I didn’t think this failure mode was even possible.
> a small turbine (not sure it's the right English term)
That's the right English word, yes. And that's pretty cool!
> an old AT&T building went on sale as a house
Listing removed a couple of weeks ago.