How Southwest Airlines melted down

nthitz · 3 years ago

burlesona · 3 years ago

It’s fascinating that the same hopscotch travel pattern that allows SWA to offer better service to more places is also what caused the network to suffer cascading failure. Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.

Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through. Despite the stress, the SWA crews were helpful, empathetic, and polite. They handled it better than I would have if I had been in their shoes.

TheCondor · 3 years ago

Is resumption itself difficult, or is it resumption plus making whole the tens of thousands of customers who were supposed to be moved a week prior?

No idea what SkySolver actually does in totality. I'm sure it's complicated, but I would think a flight crew could indicate where they are right now and then it could maybe pick up the next possible course they could perform. Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?

They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.

Of course, it's easy to bitch about it not being hard when I've never seen the code. Maybe it also tracks hours and does payroll and a dozen other functions.

TimTheTinker · 3 years ago

It's essentially an optimizing constraint solving problem. As far as I know, there are 2 main known approaches to this class of problem:

- use a tree traversal algorithm with a lot of built-in optimizations and pruning. Google OR-Tools is one well-known solver.

- use a set of meta-heuristics (Tabu search, simulated annealing, etc.) against arbitrary predicate expressions. The best example of this kind of solver is OptaPlanner.

The advantage of tree traversal is that it is exhaustive and guaranteed to be optimal; but it requires a fixed computing budget for a given set of constraints. Given a large compute cluster it's ideal, but on individual machines/servers these solutions tend to take hours or days. Southwest's system would likely require a significant supercomputer to run its scheduling through an exhaustive optimizing constraint solver, since it would have to re-run the entire solution (or significant numbers of subtrees) when any variable changes (which happens likely several times per second).

Meta-heuristics are more flexible and allow for all sorts of interesting, convenient features that can be very helpful when your compute budget is less than a TOP500 supercomputer. They offer time-constrained solving, real-time monitoring of a solution in progress (you can watch it get better as more of the solution space is searched), and over-constrained solving (where the requested solution is impossible, but we need to make a best-effort attempt).
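
To make the time-constrained and over-constrained ideas concrete, here is a minimal sketch of simulated annealing in Python. It is not how OptaPlanner or any airline system works internally; the function name `anneal`, the airport codes, and the cost function (one penalty point per crew that is not at its flight's origin) are all assumptions made up for illustration.

```python
import math
import random
import time


def anneal(crew_locations, flight_origins, budget_seconds=0.05, seed=0):
    """Assign crew i to flight assignment[i], minimising airport mismatches.

    Hypothetical toy model: needs at least two flights and one crew per flight.
    Returns the best assignment found within the time budget and its cost.
    """
    rng = random.Random(seed)
    n = len(flight_origins)
    assignment = list(range(n))

    def cost(a):
        # One penalty point per crew that is not at its flight's origin.
        return sum(crew_locations[c] != flight_origins[f] for c, f in enumerate(a))

    current = cost(assignment)
    best, best_cost = assignment[:], current
    deadline = time.monotonic() + budget_seconds
    temperature = 1.0

    # Stop when the time budget runs out or a perfect assignment is found.
    while time.monotonic() < deadline and best_cost > 0:
        i, j = rng.sample(range(n), 2)
        assignment[i], assignment[j] = assignment[j], assignment[i]
        candidate = cost(assignment)
        delta = candidate - current
        # Always accept improvements; accept worse moves with decaying probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if current < best_cost:
                best, best_cost = assignment[:], current
        else:
            assignment[i], assignment[j] = assignment[j], assignment[i]  # undo swap
        temperature = max(0.01, temperature * 0.999)

    return best, best_cost


# Over-constrained case: two crews in BWI, but one flight leaves from DAL.
# No assignment is perfect, so the solver returns a best-effort answer.
print(anneal(["BWI", "BWI"], ["DAL", "BWI"]))
```

Because violations are penalties rather than hard failures, the search always has some answer to hand back when the budget expires, which is the property that matters when the schedule changes faster than an exhaustive solver can finish.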

masklinn · 3 years ago

> Is resumption difficult

The resumption itself.

> Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?

It's not a problem of management (well it is a problem of 15 years of management fuckup which is the root cause), it's a problem of the scheduling system needing manual updates when things didn't go to plan.

At this point it's completely screwed, so it needs to be completely reset, as in the entire scheduling system needs to be reconfigured from empty, more or less.

And because Southwest operates entirely on point to point, the same cascading properties which led to its complete collapse mean it needs to restart in a somewhat synchronised manner, otherwise you fly 3 planes, there's no followup, and you're hosed again.

> They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.

The system completely lost track of crews, so all of them need to be relocated, their work cycles reworked, flight slots need to be reallocated, flights need to be re-encoded.

lbotos · 3 years ago

I suspect crews timing out also plays a big part in this planning challenge:

https://flyingbynumbers.com/cabin-crew-duty-timeouts/

scarface74 · 3 years ago

I’m by no means an expert in this area. But from my experience writing field service software in another life, even on a city wide scale when you may have 20 people and around 200 stops it got tedious when the system went down and you had to manually schedule it.

I used to write field service software for ruggedized Windows CE devices.

DrBazza · 3 years ago

> Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.

We had days like that in the UK but with the rail system. And when it happens it’s also due to snow. We’ve seen it on a global scale recently due to covid putting ships in the wrong places so that optimised shipping routes become a mess.

amluto · 3 years ago

I don’t see how a full system reboot should take days. If you don’t care about serving customers (which they don’t right now), then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule. With few or no paying customers on each plane, there should be plenty of capacity to move misplaced crew members around. None of this needs to approximate normal routing, and fewer segments than normal are needed, since passengers can be ignored.

That being said, the software sucks. Southwest may have lost track of where their employees are. The ground crews are quitting. I wouldn’t be utterly shocked if management doesn’t even have a good overview of where the planes are.

(Obviously anyone halfway competent could hack up a script to find all the planes based on ADS-B data in a few hours. And it wouldn’t be terribly hard to text a link to all crew asking them to fill out a simple form with their location, nearest airport, and when they can get there. But this requires competence and agility.)

jasonwatkinspdx · 3 years ago

It's far more complicated than that. For example, there's a lot of strict regulation around how many hours crew can work, how much rest they need between, etc. Likewise planes can't just randomly fly wherever they want, whenever they want. There's mandatory maintenance, etc. You also have a limited number of hangars and jetways at each airport, so you have to coordinate how planes move around to not overload things. Oh, and we need to be sure there's enough fuel, fresh oxygen bottles, and so many other things.

The problem is nothing like writing a script to scrape ADS-B data. This is the classic fallacy of a programmer thinking the most cartoonish imagination of the technology is the problem while being completely blind to the fundamental difficulty of organizing large groups of humans in some activity.

smaudet · 3 years ago

As others have said, you are ridiculously oversimplifying the process.

If you are running a little video game that plots flight charts, sure perhaps you could write such a script. You cannot, however, script a bunch of logistics support together across disparate, independent, multinational airports.

Add the obligatory paperwork (we don't really want to be shipping people around in coffins, enabling human trafficking or illegal substance smuggling, trust me), and add your standard bit of managerial incompetence managed from an Excel document and SharePoint, and it's actually downright impressive if it's only a couple days to completely reboot the system.

Arguably, management doing the standard "oh that will never happen" is probably why it's not even better - you would think the airports would be able to automatically "fix" themselves with a self healing protocol, but that was probably deemed too expensive and left out of the feature set.

nemothekid · 3 years ago

>then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule.

I'm confused, maybe my calibration of the scale involved here is wrong, but how do you not see this process taking days? Even if you had perfect information of where all the planes and crews were and where they should be, just coordinating with various airports to schedule the flights would take days; and that's assuming every plane is (1) flying empty, (2) fueled and (3) isn't carrying any passenger luggage.

irjustin · 3 years ago

I'm downvoting.

It's simple to say, "oh it would be this easy..." and not consider a single real life scenario.

> problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule

This alone would take days under your own recommendation.

mgsouth · 3 years ago

Well, let's see... A bit of internet searching shows SWA has about 60,000 total employees. Roughly 6,000 of those are pilots. So probably about 10,000 cabin crew (3 or 4 flight attendants, 2 pilots per flight). Most of those 15,000+ people need to be in exactly the expected place at the expected time, or things will get snarled up. If any one of those 5 or 6 people is a no-show then the plane can't legally fly passengers. There's limited fungibility; one of the pilots has to be a Captain. You don't want two Captains flying, if for no other reason than some other flight will then be short a Captain. One of the cabin crew has to be a Purser (supervisor). It's really preferable to have at least one of the pilots flying to the airports they usually frequent. Off-shift crew still need to be in the proper place at the proper time, or the plane has to stop mid-cycle. Then there's the ground and gate crews.

And you can't just say "go to your usual starting airport". Flights are 24/7. Days of the week matter. Holidays matter. Even your local burger joint doesn't have cast-in-stone schedules. Much less a huge interconnected network that's trying to reboot. Even if 80% of the crew could go to a "usual airport", how would you know if you were in the 20%? Oh... everybody has to phone in, or the system has to send a huge number of notifications out. System crash. (Remember, normally the system depends on scheduling in advance, and only a few last-minute changes need to be handled.)

And how do these what, 20k+ people get where they should be? If they aren't already at that airport, then normally by hopping on a SWA flight. Which aren't happening. So now what? Book flights on other airlines? Who's going to be doing the booking? How does it get paid? How many employees have a company credit card (any?) Use the employee's cards? How many are maxed out for Christmas?

So let's suppose you've scheduled everyone and everybody knows where they're supposed to be and when. What happens when 10% call back in (without crashing the system!) and say they can't get there on time? (Remember, many SWA customers are stuck and are having a hard time arranging alternate travel. Crew without access to SWA flights would have the same trouble.) You're going to either apply massive changes to the existing schedule, or start all over and reschedule everyone. Which is exactly the problem they're currently having. They can't handle massive corrections. What's worse, a crew member may think they're good, and then their travel gets delayed somehow.

How sensitive is the scheduling? If 10% of the crew are no-show or delayed then about 9% of the planes are affected; every 8 hours 90 planes will have the following 8 hours of flights cancelled or delayed (throwing further monkey wrenches in the schedule). About 40 of those planes will be missing a pilot, which means they can't even be deadheaded to where they're supposed to be next.
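
For comparison, here is a small worked calculation under a different, explicitly assumed model: every crew member independently fails to show with the same probability, and a flight is affected if any one of its crew is missing. This is not necessarily the model behind the 9% figure above, and the function name is made up for illustration.

```python
def share_of_flights_affected(no_show_rate, crew_per_flight):
    """Probability that at least one of a flight's crew is missing.

    Assumes independent no-shows, which is a simplification.
    """
    return 1 - (1 - no_show_rate) ** crew_per_flight


for crew_size in (5, 6):
    share = share_of_flights_affected(0.10, crew_size)
    print(f"{crew_size} crew per flight: {share:.1%} of flights affected")
```

Under that assumption the share of affected flights grows quickly with crew size, since a flight needs all 5 or 6 people at once.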

So how to reboot? They've had to cancel two-thirds of their flights, so apparently they're able to keep 1/3 of them going. Keep those flying so they can shuttle crew around. You're initially only scheduling 1/3 of the crew, so the poor overloaded system can handle it. 2/3 of the flights are just outright cancelled, days in advance, so the customer support load is reduced. ("I'm sorry, your flight has been cancelled. We can't reschedule you until x days from now." vs. much back-and-forth trying to find something with an overloaded system.) Slowly add additional crew and flights so the number of phone-ins is kept manageable.

benced · 3 years ago

Something worth pointing out here is that according to levels.fyi, a tech lead there who has worked for 20 years has a TC of 174K. It’s unlikely their technical talent would be up to what you propose which is related to the apparent decision at Southwest to underinvest - by their own reckoning - in technology.

MR4D · 3 years ago

I wonder if it’s that or simply a lack of slack in their system.

It seems to me that just like pre-staged inventory helps in logistics management, extra planes and crews in the rotation could improve operations under these circumstances.

EMM_386 · 3 years ago

With a normal airline, you have pilots sitting on "reserve" at bases, who can be called in at any time to fill in any gaps that may occur. They are being paid but are not flying, it's quite a good gig if you can get on the reserve list.

I don't know how this is handled at Southwest, who does not fly hub-and-spoke and thus doesn't have a bunch of pilots sitting reserve around a base at, say, Atlanta.

mjevans · 3 years ago

In the past I had a job where some contract required a trained body to be on site 24/7. The company hired EXACTLY enough workers to fill the position, with _zero_ slack for anything.

That lack of slack is hell. It makes any disruption, even minor ones, require the other workers to work more time. Major disruptions mean soul-crushing crunch level hours to just get by.

Slack _must_ be planned into a system, otherwise there won't be any safety / recovery margin, and you're seeing the results live with Southwest's implosion.

Tao3300 · 3 years ago

They probably had that kind of wiggle room once upon a time. Then some sort of pandemic turned everything to just enough planes and skeleton crews.

cratermoon · 3 years ago

> Hence the need for a “full system reboot” over many days.

My understanding is that the full system reboot wouldn't have taken all that long, it's just that the company was trying to do a major fix while keeping whatever was still sort-of-working running. As any sysop will tell you, patching a running system is all kinds of crazy risky.

bogomipz · 3 years ago

>"Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through."

Interesting. You don't say how far before Christmas you were traveling. Had this crazy weather system already started moving from West to East at that point? Or was the system buckling just from passenger volume at that point, i.e. similar to the Summer meltdown that Southwest had?

firstSpeaker · 3 years ago

I imagine the system cannot account for where all the planes, staff, passengers are and where they want to go economically.

bumby · 3 years ago

According to the article, the system produces some nonsensical tasking:

"In one example during the storm, the system assigned a pilot to deadhead on a flight from Baltimore to Manchester, N.H., and then back to Baltimore the next day, without ever flying a plane"*

* The article defines deadheading as sending a pilot as a passenger to get to another location.

It would be interesting to look at what the system is trying to optimize for to make such choices.

icambron · 3 years ago

I've told this story a few times, but maybe 10 years ago I had a cross-country JetBlue flight that was delayed perhaps 6 hours. It was a few days after a major storm. Like Southwest here, JetBlue didn't have much flex capacity and relied on the daisy chain to keep on chaining. Our plane had gotten stuck somewhere, so they had to find a different one at some far-away airport and fly it in, which took hours. But the kicker was that when the plane finally landed, the crew already onboard couldn't man the flight because that would exceed their duty limits. The airline didn't realize this ahead of time, so they had to gather a new crew (like literally call them in), which added a couple of hours to the delay.

Naively, I'd assumed these kinds of things were handled in some sort of mission-control center with warnings from rule engines blinking on some big screen and a team of crack operators mapping out what needed done. But clearly that wasn't so: they were just making things up as they went along. Sounds like Southwest is in a similar spot, but this time on a much bigger scale.

seandoe · 3 years ago

> clearly that wasn't so: they were just making things up as they went along.

Where did you get your information? I have experience in the industry and scheduling logistics is clearly not how you describe it. The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.

icambron · 3 years ago

Which information? That they belatedly realized they'd run out of duty hours and had to call in new crew, after the plane with the soon-to-expire crew had landed? They told us that while we waited at the gate, including updates about the expected time that the newly-called-in crew would arrive. They were quite transparent about everything.

Or are you asking why I think they're making it up as they go along? That is my conclusion, to be sure. It's one thing for dominoes to fall, but if you are in a situation where the dominoes are falling and you are not able to predict which dominoes will fall next and respond accordingly, you are making things up as you go along. I'd have expected that almost any decision they could even hypothetically make would run through a system that checked it for violations of constraints, which would have told them way ahead of time that they needed a fresh crew, and they'd have had the entire flight of the replacement plane to get one in (IIRC it flew Miami->Boston just to get us the plane).
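
The kind of ahead-of-time constraint check described above could look roughly like this. It is only a sketch: the 14-hour limit, the function name `crew_times_out`, and all the times are hypothetical, and real duty rules have many more conditions than a single cap.

```python
from datetime import datetime, timedelta

# Hypothetical limit for illustration only; real duty rules are more involved.
MAX_DUTY = timedelta(hours=14)


def crew_times_out(duty_start, inbound_eta, turnaround, next_block_time,
                   max_duty=MAX_DUTY):
    """Return True if the crew would exceed max_duty before finishing the next leg."""
    next_leg_end = inbound_eta + turnaround + next_block_time
    return next_leg_end - duty_start > max_duty


# Arbitrary example day and times.
duty_start = datetime(2020, 1, 1, 6, 0)
on_time_eta = datetime(2020, 1, 1, 15, 0)
delayed_eta = on_time_eta + timedelta(hours=2)
turnaround = timedelta(minutes=45)
block = timedelta(hours=3, minutes=30)

print(crew_times_out(duty_start, on_time_eta, turnaround, block))   # on-time inbound
print(crew_times_out(duty_start, delayed_eta, turnaround, block))   # delayed inbound
```

Run against the inbound flight's updated ETA, a check like this would flag the need for a fresh crew while the replacement plane is still in the air rather than after it lands.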

noobermin · 3 years ago

>The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.

Sounds like an argument for nationalization.

bombcar · 3 years ago

I’ve seen it happen and it’s always strange that they seem to not realize the crew will be unusable until they arrive … I assume they had been trying to get another crew at the same time.

ipqk · 3 years ago

There are several time limits, but I believe that one of them stops when pulling back from the gate, i.e. once the plane is moving, they're granted X more hours, but if they haven't left yet, then they've timed out. So the arriving crew may plausibly be usable upon arrival, but a few minutes late may be all it takes to time them out.

I've been on a delayed plane where one of the pilots timed out while sitting at the gate, so we had a second delay finding another pilot.

mrandish · 3 years ago

There's also the fun wrinkle that not all crews or crew members are current on ratings (i.e. training courses) to operate all types/versions of planes the airline has in service.

You also need sufficient open gates of the right type in the right place during the right time window or you idle a plane (which can then idle a crew). It's pretty easy to imagine how such a system which may work at a certain level of load can be tipped into a runaway scenario.

As a systems person I'd be an interested reader if Southwest were to someday release an incident analysis and post-mortem of the type that good IT organizations do.

nradov · 3 years ago

Sometimes the dispatchers expect that the crew will still be legal, but then the incoming flight gets delayed just enough to push them over the limit.

walrus01 · 3 years ago

Once you realize how many people in critical industries (power grid, telecom, global cargo/logistics) are in fact making things up as they go along, and there aren't really any highly organized and responsible people running the show, you start to worry.

isiahl · 3 years ago

Reminds me of this Onion headline: "Smart, Qualified People Behind The Scenes Keeping America Safe: ‘We Don't Exist’"

https://www.theonion.com/smart-qualified-people-behind-the-s...

Beltalowda · 3 years ago

What is the alternative? Extensive playbooks for every possible scenario that may or may not make sense in the actual scenario because a few critical details are different? A "mission-control center" is really no different: it's just people making stuff up on the spot.

In the end, there is no substitute for human judgement in the situation itself.

ghaff · 3 years ago

Or you realize a lot of things involving real world physical systems are hard and customers aren't fine with paying "whatever."

nostromo · 3 years ago

The actual answer is buried at the end of a long article.

> Unlike many rival airlines, Southwest’s planes generally hop from one city to another, rather than orbiting a major hub. That approach lets Southwest maximize use of its planes and crew, but the daisy chain structure also makes its network more delicate—problems in one corner of the country can be difficult to contain

phpisthebest · 3 years ago

That is only part of the issue. Not every airline is a 100% hub system; most are a pretty big mix.

Further, a lot of major hubs were impacted by the storm, yet those airlines were able to transition. Why?

Well, most airlines have mobile apps, and web portals and other technology so their crews can be reassigned in almost real time (just like I would get auto booked on a new flight before I even knew my flight was canceled via the mobile app).

Instead Southwest has systems from the 80's that require crews and customers to call and talk to an actual live human...

masklinn · 3 years ago

> Instead Southwest has systems from the 80's that require crews and customers to call and talk to an actual live human...

From what I understand (from /r/flying) it's the reporting / fixing of issues which is manual. This means when the reporting / fixing is overloaded and not given a respite it can't catch up, keeps getting more overloaded, and ultimately the scheduling system completely fell over (it lost track of crews entirely is what I understand).
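
A toy model of why a manual correction queue cannot catch up once it is overloaded: if reports arrive faster than they can be processed, the backlog grows every hour. The numbers and the function name below are made up for illustration and are not Southwest's actual figures.

```python
def simulate_backlog(arrivals_per_hour, capacity_per_hour, hours, initial_backlog=0):
    """Return the backlog of unprocessed reports at the end of each hour."""
    backlog = initial_backlog
    history = []
    for _ in range(hours):
        # New reports come in, staff clear what they can, the rest carries over.
        backlog = max(0, backlog + arrivals_per_hour - capacity_per_hour)
        history.append(backlog)
    return history


print(simulate_backlog(80, 100, 6))    # normal day: capacity exceeds arrivals
print(simulate_backlog(300, 100, 6))   # storm: arrivals exceed capacity
```

With no respite in arrivals, the only way out of the second case is to cut the inflow (cancel flights) or reset the schedule, which matches the description above.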

"Mobile apps, and web portals and other technology" is orthogonal to the issue at hand.

Ekaros · 3 years ago

Major hub getting impacted is less of an issue. Either the planes and crews are stuck there, you cancel flights and send them home. Or they fly to nearest available location and wait until they can get back to the hub.

And explaining that FAA or someone banned flying from this airport is much simpler.

masklinn · 3 years ago

I mean... it's not entirely untrue but at the same time airlines have been winding down their "hub and spoke" model for a point to point one for a while.

That's in part what doomed the A380, which was popular with airlines still going strong with hub-and-spoke (Emirates being by far the most prominent one) but is worthless in a point-to-point model.

ChrisMarshallNY · 3 years ago

> the A380

My worst nightmare was pulling into a gate, and seeing one of those puppies arriving on the same concourse.

Immigration queues are bad enough, with 747s, but the A380 is much worse.

nradov · 3 years ago

There are still a number of major airports that are slot constrained, especially now that passenger demand is growing again. It seems like there could be a profitable market for an efficient twin-engine airliner larger than the Boeing 777X. Perhaps even a double-decker?

ComputerGuru · 3 years ago

Southwest is statistically the worst airline in terms of delays and cancellations but has deluded its customers into thinking it's the best (according to surveys asking people to rate airlines on their reliability).

https://www.insidehook.com/daily_brief/travel/airlines-fewes...

skellington · 3 years ago

Thanks for not understanding statistics and linking to an article that also doesn't understand basic math.

SW has a high number of delays and cancellations BECAUSE THEY FLY A HIGH VOLUME OF PEOPLE. By percentage, they are in the middle of the pack for both delays and cancellations, which isn't great, but they are not the worst by any means.

How are HN people so consistently bad with basic information?

enjoylife · 3 years ago

Probably because the raw data is often hidden from the readers so it’s hard to corroborate a story’s statistical narrative.

Here is the data which backs up the majority of these low effort cancellation related news articles.

https://public.tableau.com/app/profile/flightaware/viz/Airli...

vl · 3 years ago

But also SW attracts a very specific kind of customer. If you fly for business, or just can afford another airline, why would you fly Southwest?

noyoudumbdolt · 3 years ago

They are actually more expensive, which is why they do two things to fool their customers:

1. On their flight search page, it shows the prices on a per-leg basis. Every other airline shows round trip prices. That way, the initial result on Southwest looks great and it’s only when you get to the final payment page that you realize it’s actually twice as expensive as you thought.

2. They refuse to share their data with any of the third party flight search engines like Google Flights or Expedia. Again, so people don’t realize how expensive they actually are.

ComputerGuru · 3 years ago

SW is hardly even the cheapest. If not booking months in advance, SW is almost always twice the price of United or American, at least in my parts.

deathanatos · 3 years ago

The statistics in that article are of the "damned lies" variety: none of the values are normalized (they compared simple number of cancellations and delays without taking that as a per flight value, or perhaps better, per passenger) and they treat all delays (and cancellations) as equal; I'll take an airline often delayed by 5 minutes over an airline sometimes delayed by 3 hours.
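
A quick sketch of the normalization point, using invented numbers for two fictional airlines (none of this is real data): ranking by raw cancellation count and ranking by cancellations per flight can disagree.

```python
# Hypothetical figures for illustration only; not real airline data.
airlines = {
    "Airline A": {"flights": 4000, "cancelled": 80},
    "Airline B": {"flights": 1000, "cancelled": 40},
}


def cancellation_rate(stats):
    """Cancellations as a fraction of flights operated."""
    return stats["cancelled"] / stats["flights"]


most_cancellations = max(airlines, key=lambda name: airlines[name]["cancelled"])
worst_rate = max(airlines, key=lambda name: cancellation_rate(airlines[name]))

print("Most cancellations (raw count):", most_cancellations)
print("Worst cancellation rate:", worst_rate)
```

The same idea extends to weighting by delay length instead of counting every delay as equal.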

Perhaps it's true nonetheless, but the numbers there won't tell you.

(And IME, it's perhaps true that SWA is often delayed … but by tolerable amounts. Compared to delays I've endured with Delta, where, e.g., a flight was delayed longer than the time it would take the plane to drive at highway speeds, from where it was coming from. Or … also Delta … where I was cancelled on twice in the same flight. They wanted to go 0-3 but I gave up and bought a ticket on … SWA.)

tyingq · 3 years ago

Perhaps they are thinking frequency for a trip they take often. It matters less that your Dallas->Houston flight is late when there's another one in 30 minutes during peak times.

variant · 3 years ago

Deluded? Or could it be that customers value economy over predictability?

kube-system · 3 years ago

I believe this has changed in recent years due to similar hiccups, and their reliability in prior years was previously good.

ghaff · 3 years ago

I suspect also that, like JetBlue in its early years, a lot of its flights were out of secondary airports that are generally less exposed to a lot of operational disruptions (e.g. they flew into Hobby in Houston early on). They also had the advantage of being in a region that, per its name, probably has fewer weather issues in general.

The pattern I've seen over the years is that upstart airlines, as they grow, end up having to look--for various reasons--a lot more like legacy carriers over time. Whether that's flying into areas with seasonal bad weather, flying out of default airports, instituting various forms of passenger status, etc.

tmpburning · 3 years ago

You are lucky if your flights are on time 75% of the time on average, with any airline.

quickthrower2 · 3 years ago

Usually better, I find, as they overestimate the flight time as if headwinds are happening every time

alanbernstein · 3 years ago

(in 2022)

marze · 3 years ago

I find it especially ironic that SWA's system failed them, and this large failure was preceded by worse and worse "near failures", since SWA is in the aviation business.

In the aviation arena, high reliability is maintained in part by careful analysis of "near failures": lessons are extracted and improvements are made to aircraft designs, procedures, etc.

By contrast, the "near failures" of the SWA system as a whole don't appear to have been utilized to motivate system improvements.

masklinn · 3 years ago

Per comments on /r/southwestairlines for 20 years the C-suite were Wall Street boys and set that as management culture, as long as shares were up things were fine.

Frontlines have been emitting concerns and warnings for years but management didn’t care until an ops CEO (Bob Jordan) got in recently (as in 2022), but now there’s 20 years of ops neglect to deal with, and less than a year is nowhere near enough to start enacting real changes.

Sakos · 3 years ago

On that note, I think it's interesting how glowingly people talk about Herb Kelleher. Wouldn't he be responsible for allowing that tech debt to build until it finally caused the system to fail catastrophically?

igetspam · 3 years ago

A friend of mine wrote on this topic today as well.

https://www.seat31b.com/2022/12/the-great-southwest-meltdown...

thepasswordis · 3 years ago

I'm surprised they haven't tried to blame a cyberattack yet.

That said, I feel like these sorts of catastrophic ultra-fragile McKinsey-consulted-to-death failures we keep seeing in various industries are basically a giant signal to any adversaries that says "Hi! Check out how easy it would be to grind this entire industry to a halt!"

Resiliency is literally the opposite of efficiency. These systems need to have slack, aka inefficiency built into them. Unfortunately the business culture has moved towards ultra fragile, ultra efficient thinking.

ghaff · 3 years ago

In part because, in this case, customers will buy the ticket that is $10 cheaper.

factsarelolz · 3 years ago

That could potentially increase their cyber security insurance premiums if not cause the insurer to drop them immediately. Not to mention the broader impact on the market and industry as a whole.