reactordev · 2 years ago
So they forgot to "geographically disparate" fence their queries. Having built a flight navigation system before, I know this bug. I've seen this bug. I've followed the spec to include a geofence to avoid this bug.
sam0x17 · 2 years ago
Why on earth do they not have GUIDs for these navigation points if the names are not globally unique and inter-region routes are commonplace?
nwallin · 2 years ago
1. Pilots occasionally have to fat finger them into ruggedized I/O devices and read them off to ATC over radios.

2. These are defined by the various regional aviation authorities. The US FAA will define one list (and the names will be unique within the US), the EU will have one (EASA?), etc.

The AA965 crash (1995-12-20) was due to an aliased waypoint name. Colombia had two waypoints with the same name within 150 nautical miles of each other. (the name was 'R') This was in violation of ICAO regulations from like the '70s.

https://en.wikipedia.org/wiki/American_Airlines_Flight_965

f1shy · 2 years ago
The names have to be entered manually by pilots if, e.g., they change the route. They have to be transmitted over the air by humans. So they must be short and simple.
amoerie · 2 years ago
Long story: because changing identifiers is a considerable refactoring, and it takes coordination with multiple worldwide distributed partners to transition safely from the old to the new system, all to avoid a hypothetical issue some software engineer came up with

Short story: money. It costs money to do things well.

paulddraper · 2 years ago
Aviation protocols are extremely backwards compatible and low-tech compatible.

You need to be able to read, write, hear, and speak the identifier. (And receive/transmit in morse code)

Would it be okay to have an "area code prefix" in the identifier? Plausible (but practically speaking too late for that)

gabereiser · 2 years ago
FAA regulations state that fixes, navs, and waypoints must be phonetically transmittable over radio.

E.g. Yankee = YANKY. The pilot and ATC must be location-aware. Apparently their software is not.

Topgamer7 · 2 years ago
I would guess because humans have to read this and ascertain meaning from it. Not everyone is a technical resource.
tortue0 · 2 years ago
They do, and use lat/lon in some cases. Reviewing and inputting that (when done manually) is another story - but it's technically possible.
gavinsyancey · 2 years ago
It sounds like for actual processing they replace them with GPS coordinates (or at least augment them with such). But this is the system that is responsible for actually doing that...
ExoticPearTree · 2 years ago
Because they need to be short, that's why they are 5 letters long. And need to be understood phonetically very quickly by pilots.
epanchin · 2 years ago
What3words would be a better solution than a GUID, being transmittable over radio.
cutler · 2 years ago
Just curious, what language was it developed in?

Deleted Comment

tppiotrowski · 2 years ago
The ICAO standard, effective from 1978, is to duplicate identifiers only if they are more than 600 nmi (690 mi; 1,100 km) apart.
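That separation rule is cheap to check mechanically. A rough sketch (the waypoint data below is made up for illustration):

```python
from math import radians, sin, cos, asin, sqrt

NMI_LIMIT = 600  # ICAO: duplicate identifiers must be at least this far apart

def distance_nmi(a, b):
    """Great-circle distance in nautical miles (haversine); points are (lat, lon)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3440.065 * asin(sqrt(h))  # mean Earth radius ~3440 nmi

def too_close_duplicates(waypoints):
    """Yield (name, point_a, point_b) for same-named waypoints under the limit."""
    seen = {}
    for name, lat, lon in waypoints:
        for prev in seen.get(name, []):
            if distance_nmi(prev, (lat, lon)) < NMI_LIMIT:
                yield name, prev, (lat, lon)
        seen.setdefault(name, []).append((lat, lon))

# Illustrative data: the two Colombian 'R' beacons were ~150 nmi apart
waypoints = [
    ("R", 4.70, -74.15),
    ("R", 3.54, -76.38),
    ("DVL", 48.11, -98.91),
]
violations = list(too_close_duplicates(waypoints))  # flags the 'R' pair
```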
WalterBright · 2 years ago
> the backup system applied the same logic to the flight plan with the same result

Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires to activate the trim. The attachment was through a comparator, that would shut off the authority of both boxes if they didn't agree.

The boxes were designed with:

1. different algorithms

2. different programming languages

3. different CPUs

4. code written by different teams with a firewall between them

The idea was that bugs from one box would not cause the other to fail in the same way.
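A toy sketch of the comparator idea: two independently written trim laws whose outputs must agree within a tolerance before either is given authority. The trim formulas here are invented purely for illustration:

```python
def trim_a(airspeed, pitch):
    """Channel A: one hypothetical trim law."""
    return 0.5 * pitch - 0.01 * airspeed

def trim_b(airspeed, pitch):
    """Channel B: same requirement, independently implemented
    (algebraically rearranged here to stand in for a second team's code)."""
    return (50.0 * pitch - airspeed) / 100.0

def comparator(a_out, b_out, tolerance=0.05):
    """Pass the command through only if both channels agree;
    otherwise remove authority from both (fail-safe, pilot takes over)."""
    if abs(a_out - b_out) <= tolerance:
        return (a_out + b_out) / 2.0
    return None  # authority removed

cmd = comparator(trim_a(250, 2.0), trim_b(250, 2.0))  # channels agree
```

A common-mode bug would still defeat this, which is exactly why the two channels were developed by firewalled teams.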

dhx · 2 years ago
This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.

Air traffic control systems should at least be 2oo3[1] (3 systems independently developed of which 2 must concur at any given time) so that a failure of one system would still allow the other two to continue operation without impacting availability of the aviation industry.

Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.

[1] https://en.wikipedia.org/wiki/Triple_modular_redundancy

[2] https://en.wikipedia.org/wiki/Instrument_flight_rules#Separa...

[3] https://en.wikipedia.org/wiki/Visual_flight_rules#Controlled...
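A minimal sketch of the 2oo3 voting logic: the output is whatever at least two of three independent channels agree on (within a tolerance), so one failed or silent channel doesn't stop the system:

```python
def vote_2oo3(a, b, c, tol=1e-6):
    """Return a value agreed by at least two of three channels, else None.
    A channel that has failed (crashed, offline) is represented by None."""
    for x, y in ((a, b), (a, c), (b, c)):
        if x is not None and y is not None and abs(x - y) <= tol:
            return (x + y) / 2
    return None  # no majority: system-level failure

# One channel disagrees -> the other two outvote it
assert vote_2oo3(1.0, 1.0, 5.0) == 1.0
# One channel dead -> the surviving pair still carries the system
assert vote_2oo3(None, 2.0, 2.0) == 2.0
```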

mannykannot · 2 years ago
> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.

But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.

As for the benefits of independent development: it might have helped, but the chances of this being so are probably not as much as one would have hoped if one thought programming errors are essentially random defects analogous to, say, weaknesses in a bundle of cables; I had a bit more to say about it here:

https://news.ycombinator.com/item?id=37476624

wcarss · 2 years ago
This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!
iudqnolq · 2 years ago
In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced, but it seems to me the solution would be to spend a bit more money reviewing the one implementation rather than writing two new ones from scratch.
DocTomoe · 2 years ago
> Human backup is not possible because of human resourcing

This is an artificial constraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?".

Considering this bug only showed up now, chances are there was a project manager who decided the risk was extremely low and not worth spending another 200k or so of yearly operating expenses on.

wavemode · 2 years ago
First thought that came to my mind as well when I read it. This failover system seems to be more designed to mitigate hardware failures than software bugs.
WalterBright · 2 years ago
I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.

A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.

shatnersbassoon · 2 years ago
"When a failsafe system fails, it fails by failing to fail safe."

J. Gall

borissk · 2 years ago
Different teams often make the same mistake. The system you describe is not perfect, but makes sense.
WalterBright · 2 years ago
I neglected to mention there was a third party that reviewed the algorithms to verify they weren't the same.

Nothing is perfect, though, and the pilot is the backup for failure of that system. I.e. turn off the stab trim system.

chii · 2 years ago
if this is true, then would it be a better investment to have the 2nd team produce a fuzz testing/systematic testing mechanism instead of producing a secondary copy of the same system?

In fact, make it adversarial testing such that this team is rewarded (may be financially) if mistakes or problems are found from the 1st team's program.

fransje26 · 2 years ago
As a side note, too bad they knowingly didn't reuse such an approach for the MAX..
WalterBright · 2 years ago
The MAX system relied on the pilot remembering the stab trim cutoff switch and what it was for.
jojobas · 2 years ago
Wouldn't trim be a number for which a significant tolerance is permissible at any given time? Or does "agree" mean "within a preset tolerance"?
WalterBright · 2 years ago
Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack; that's why there's a "clock" to synchronize things.
f1shy · 2 years ago
I would be very interested in knowing which languages were used. Do you know which were? Thanks
WalterBright · 2 years ago
One of them was Pascal. This was around 1980 or so.
lbriner · 2 years ago
I seem to remember another problem at NATS which had the same effect. Primary fell over so they switched over to a secondary that fell over for the exact same reason.

It seems like you should only failover if you know the problem is with the primary and not with the software itself. Failing over "just because" just reinforces the idea that they didn't have enough information exposed to really know what to do.

The bit that makes me feel a bit sick though is that they didn't have a method called "ValidateFlightPlan" that throws an error if for any reason it couldn't be parsed and that error could be handled in a really simple way. What programmer would look at a processor of external input and not think, "what do we do with bad input that makes it fall over?". I did something today for a simple message prompt since I can't guarantee that in all scenarios the data I need will be present/correct. Try/catch and a simple message to the user "Data could not be processed".
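A sketch of what that boundary could look like. The function name and flight-plan fields are hypothetical; the shape is just "validate each plan, reject bad ones without dying":

```python
class FlightPlanError(Exception):
    """A problem with one flight plan -- never fatal to the whole system."""

def validate_flight_plan(plan):
    """Hypothetical up-front check; field names are illustrative only."""
    waypoints = plan.get("waypoints") or []
    if len(waypoints) < 2:
        raise FlightPlanError("need at least an entry and an exit point")
    if plan.get("exit") not in waypoints:
        raise FlightPlanError("declared exit point is not on the route")
    return plan

def safe_process(plans):
    """Accepted plans go through; rejected ones are reported, not fatal."""
    accepted, rejected = [], []
    for plan in plans:
        try:
            accepted.append(validate_flight_plan(plan))
        except FlightPlanError as exc:
            rejected.append((plan, str(exc)))  # "Data could not be processed"
    return accepted, rejected
```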

d1sxeyes · 2 years ago
Well, if the primary is known not to be in a good state, you might as well fail over and hope that the issue was a fried disk or a cosmic bit flip or something.

The real safety feature is the 4 hour lead time before manual processing becomes necessary.

One of the key safety controls in aviation is “if this breaks for any reason, what do we do”, not so much “how do we stop this breaking in the first place”.

zaphar · 2 years ago
I'm no aviation safety controls expert but it seems to me that there are two types of controls that should be in place:

1. Process controls: What do we do when this breaks for any reason.

2. Engineering controls: What can we do to keep this from breaking in the first place?

Both of them seem to be somewhat essential for a truly safe system.

samus · 2 years ago
It was in a bad state, but in a very inane way: a flight plan in its processing queue was faulty. The system itself was mostly fine. It was just not well-written enough to distinguish an input error from an internal error, and thus didn't just skip the faulty flight plan.

Deleted Comment

j_mo · 2 years ago
No validation, and this point from the article stood out to me:

> The programming style is very imperative. Furthermore, the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained.

Given that description, I'd be surprised if it wasn't just running regex / substring matches against the text, with no classes / objects / data structures involved. Bearing in mind this is likely decades-old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
jameshh · 2 years ago
> Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.

It's new code, from 2018 :) Quote from the report:

> An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA sub- system was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers.

sheepshear · 2 years ago
Failing over is correct because there's no way to discern that the hardware is not at fault. They should have designed a better response to the second failure to avoid the knock-on effects.
anentropic · 2 years ago
I don't think anything in this incident pointed to a hardware fault

The software raised an exception because a "// TODO: this should never happen" case happened

A hardware fault would look like machines not talking to each other, or a corrupted data file being unreadable

1970-01-01 · 2 years ago
Yep. In electrical terms, you replaced the fuse to watch it blow again. There are no more fuses in your shop. Progress?
philjohn · 2 years ago
The Ariane 5 launch failure[1] was a similar issue, albeit with a more spectacular outcome.

Primary suffers integer overflow, fails. Secondary is identical, which also overflows. Angle of attack increases, boosters separate. Rocket goes boom.

[1] https://en.wikipedia.org/wiki/Ariane_flight_V88

asimpleusecase · 2 years ago
And why could the system not put the failed flight plan in a queue for human review and just keep on working for the rest of the flights? I think the lack of that “feature” is what I find so boggling.
adrianmonk · 2 years ago
Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.

If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.

samus · 2 years ago
That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough in considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, which would have gone a long way toward flushing out such errors.
hn_throwaway_99 · 2 years ago
I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.

That is, when processing a sequential queue which is what this job does, it seems to me reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher level queue".

I've actually seen this bug in different contexts before, and the lesson should always be: one bad job shouldn't crash the whole system. Error handling boundaries should be such that a bad job is taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful, when processing jobs, about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
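A sketch of that error boundary, with the failed job diverted to a dead-letter queue for manual handling (the job format and handler here are invented):

```python
from queue import Queue

def run_processor(jobs, handle, dead_letter):
    """Per-job error boundary: one bad flight plan is diverted for manual
    review instead of taking the whole processor down."""
    results = []
    for job in jobs:
        try:
            results.append(handle(job))
        except Exception as exc:       # boundary around the *job*, not the system
            dead_letter.put((job, repr(exc)))
    return results

# Toy demo: the middle "plan" blows up, the others still get processed
dlq = Queue()
out = run_processor(
    ["plan-1", "bad-plan", "plan-2"],
    lambda j: j.upper() if j != "bad-plan" else 1 / 0,
    dlq,
)
```

The judgment call, as discussed above, is deciding which exceptions really are job-local and which indicate the system itself can no longer be trusted.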

jameshh · 2 years ago
That's true, but then, why did engineers try to restart the system several times if they had no clue what was happening, and restarting it could have been dangerous?
raverbashing · 2 years ago
And that's why I never (or very rarely) put "this should never happen" exceptions anymore in my code

Because you eventually figure out that, yes, it does happen

pimterry · 2 years ago
To be fair, the article suggests early on that sometimes these plans are being processed for flights already in the air (although at least 4 hours away from the UK).

If you can stop the specific problematic plane taking off then keeping the system running is fine, but once you have a flight in the air it's a different game.

It's not totally unreasonable to say "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".

If you really can't handle the flight plan, I imagine a reasonable solution would be to somehow force the incoming plane to redirect and land before reaching the UK, until you can work out where it's actually going, but that's definitely something that needs to wait for manual intervention anyway.

krisoft · 2 years ago
> "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".

Flight plans don't tell where the plane is. Where is this assumption coming from?

hn_throwaway_99 · 2 years ago
To be fair that is exactly what the article said was a major problem, and which the postmortem also said was a major problem. I agree I think this is the most important issue:

> The FPRSA-R system has bad failure modes

> All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.

> A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":

>> The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.

Spivak · 2 years ago
Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.

Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.

For these kinds of things the post mortem and remediation have to kinda take as given that eventually a not predictable in advance unhandled unknown error will occur and then work on how it could be handled better. Because of course the solution to a bug is to fix the bug, but the issue and the reason for the meltdown is a DR plan that couldn't be implemented in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling. Something of a similar caliber will happen again eventually with probability 1 even with the best coders.

kccqzy · 2 years ago
I agree with your first paragraph, but your second paragraph is quite defeatist. I was involved in quite a few "premortem" meetings where people think of increasingly improbable failure modes and devise strategies for them. It's a useful meeting before large changes to critical systems are made live. In my opinion, this should totally be a known error.

> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.

It doesn't take much imagination to surmise that perhaps real world data is broken and sometimes you are handed data that doesn't have a valid UK portion of flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted to be invalid, but gracefully dealing with the invalid plan should be a requirement.

jjk166 · 2 years ago
> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.

I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.

krisoft · 2 years ago
> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.

How many KeyError exceptions have brought down your whole server? It doesn't happen because whoever coded your web framework knows better and added a big try-catch around the code which handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer made a mistake.

piva00 · 2 years ago
> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.

What surprised me more is that the amount of data for all waypoints on the globe is quite small. If I were to implement a feature that queries waypoints by name as an identifier, the first thing I'd do is check for duplicates in the dataset. Because if there are any, I need to consider that condition in every place where I'd be querying a waypoint by a potentially duplicated identifier.

I had that thought immediately when looking at flight plan format, noticed the short strings referring to waypoints, way before getting to the section where they point out the name collision issue.

Maybe I'm too used to working with absurd amounts of data (at least in comparison to this dataset); it's a constant part of my job to do some cursory data analysis to understand the parameters of the data I'm working with, what values can be duplicated or malformed, etc.

ummonk · 2 years ago
That it's safety critical is all the more reason it should fail gracefully (albeit surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.
madeofpalk · 2 years ago
That's like saying that because one browser tab tried to parse some invalid JSON then my whole browser should crash.
jameshart · 2 years ago
The algorithm as described in the blogpost is probably not implemented as a straightforward piece of procedural code that goes step by step through the input flightplan waypoints as described. It may be implemented in a way that incorporates some abstractions that obscured the fact that this was an input error.

If from the code’s point of view it looked instead like a sanity failure in the underlying navigation waypoint database, aborting processing of flight plans makes a lot more sense.

Imagine the code is asking some repository of waypoints and routes ‘find me the waypoint where this route leaves UK airspace’; then it asks to find the route segment that incorporates that waypoint; then it asserts that that segment passes through UK airspace… if that assertion fails, that doesn’t look immediately like a problem with the flight plan but rather with the invariant assumptions built into the route data.

And of course in a sense it is potentially a fatal bug because this issue demonstrates that the assumptions the algorithm is making about the data are wrong and it is potentially capable of returning incorrect answers.

micromacrofoot · 2 years ago
I've had brief glimpses of these systems, and honestly I wouldn't be surprised if it took more than a year for a simple feature like this to be implemented. These systems look like decades of legacy code duct-taped together.
cratermoon · 2 years ago
> why could the system not put the failed flight plan in a queue

Because it doesn't look at the data as a "flight plan" consisting of "waypoints" with "segments" along a "route" that has any internal self-consistency. It's a bag of strings and numbers that's parsed, and the result passed along if parsing is successful. If not, give up. In this case, fail the entire system and take it out of production.

Airline industry code is a pile of badly-written legacy wrappers on top of legacy wrappers. (Mostly not including actual flight software on the aircraft. Mostly.) The FPRSA-R system mentioned here is not a flight plan system; it's an ETL system. It's not coded to model or work with flight plans, it's just parsing data from system A, re-encoding it for system B, and failing hard if it can't.

slt2021 · 2 years ago
Good ETLs are usually designed to separate good records from bad records, so even if one or two rows in the stream don't conform to the schema, you can put them aside and process the rest.

Seems like poor engineering.

dboreham · 2 years ago
Because some software developers are crap at their jobs.
dang · 2 years ago
Related. Others?

Coincidentally-identical waypoint names foxed UK air traffic control system - https://news.ycombinator.com/item?id=37430384 - Sept 2023 (64 comments)

UK air traffic control outage caused by bad data in flight plan - https://news.ycombinator.com/item?id=37402766 - Sept 2023 (20 comments)

NATS report into air traffic control incident details root cause and solution - https://news.ycombinator.com/item?id=37401864 - Sept 2023 (19 comments)

UK Air traffic control network crash - https://news.ycombinator.com/item?id=37292406 - Aug 2023 (23 comments)

a_wild_dandan · 2 years ago
The recent episode of The Daily about the (US) aviation industry has convinced me that we’ll see a catastrophic headline soon. Things can’t go on like this.
switch007 · 2 years ago
The title of this post made me think there was a new, current meltdown !
rcostin2k2 · 2 years ago
The fact that they blamed the French flight plan already accepted by Eurocontrol proves that they didn't really know how the software works. And here the Austrian company should take part of the blame for the lack of intensive testing.
littlestymaar · 2 years ago
They blamed the French because they are British, that's it. It's hard to get rid of bad habits.
fransje26 · 2 years ago
But, but, but... ...the EU!
codeulike · 2 years ago
This is a great post. My reading of it:

- waypoint names used around the world are not unique

- as a sortof cludge, "In order to avoid confusion latest standards state that such identical designators should be geographically widely spaced."

- but still you might get the same waypoint name used twice in a route to mean different places

- the software was not written with that possibilty in mind

- route did not compute

- threw 'critical exception' and entered 'maintenance mode' - i.e. crashed

- backup system took over, hit the same bug with the same bit of data, also crashed

- support people have a crap time

- it wasn't until they called the software supplier that they found the low-level logs that revealed the cause of the problem

dboreham · 2 years ago
"Software supplier"??? Why on God's green earth isn't someone familiar with the code on 24/7 pager duty for a system with this level of mission criticality?
seabass-labrax · 2 years ago
That would be... the software supplier. This is quite a specific fault (albeit one that shouldn't have happened if better programming practices had been used), so I don't think anyone but the software's original developers would know what to do. This system is not safety-critical, luckily.
sublimefire · 2 years ago
I think there is a bit of ignorance about how software is sold in some cases. This is not just some Windows or browser application that was sold; it also came with staff training, help procuring the hardware to run the software, and maybe more. Such systems get closed off from the outside without a way to send telemetry to the public internet (I've seen this before; it is bizarre and hard to deal with). The contract would have clauses that deal with such situations, where you will always have someone on call as the last line of defense if a critical issue happens. Otherwise, the trained teams should have been able to deal with it but could not.
noman-land · 2 years ago
My jaw kept dropping with each new bullet point.
xvector · 2 years ago
Same, is aviation technology really this primitive?
teleforce · 2 years ago
Thanks for the summary and TL;DR.

Essentially this is down to the lack of a proper namespace - who'd have thought aerospace engineers need to study operating systems! I have a friend who's a retired air force pilot and graduated from Cranfield University, the UK's foremost postgraduate institution for aerospace engineering, with its own airport for teaching and research [1]. According to him he did study OS at Cranfield, and now I finally understand why.

Apparently, based on the other comments, a standard for the namespace is already available but currently not being used by NATS/ATC; hopefully they've learnt their lesson and will start using it, for goodness' sake. The top comment mentioned the geofencing bug, but if NATS/ATC used a proper namespace, geofencing probably wouldn't be necessary in the first place.

[1] Cranfield University:

https://en.wikipedia.org/wiki/Cranfield_University

seabass-labrax · 2 years ago
It sounds like a great place to study that has its own ~2km long airstrip! It would be nice if they had a spare Trident or Hercules just lying around for student baggage transport :)

Deleted Comment

cratermoon · 2 years ago
"the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained."

Oh, this is typical in airline industry work. Ask programmers about a domain model or parsing and they give you blank stares. They love their validation code, and they love just giving up if something doesn't validate. It's all dumb data pipelines. At no point is there code that models the activities happening in the real world.

In no system is there a "flight plan" type that has any behavior associated with it or anything like a set of waypoint types. Any type found would be a struct of strings in C terms, passed around and parsed not once, but every time the struct member is accessed. As the article notes, "The programming style seems very imperative.".

touisteur · 2 years ago
Giving up when something doesn't validate is indeed standard practice, to avoid propagating badly interpreted data and causing far more complex bugs down the line. Validate soon, validate strongly, report errors, and don't try to interpret whatever the hell is wrong with the input. Don't try to be 'clever', because there lie the safety holes. Crashing on bad input is wrong, but trying to interpret data that doesn't validate, without specs (of course), is fraught with incomprehension and incompatibilities down the line, or unexpected corner cases (or untested ones - but no one wants to pay for a fully tested all-goes system, or just for the tools to simulate 'wrong inputs', or for formal validation of the parser and all the code using the parser's results).

There are already too many problems with non-compliant or legacy (or just buggy) data emitters, with the complexity in semantics or timing of the interfaces, to try and be clever with badly formatted/encoded data.

It's already difficult (and costly) to make a system work as specified, so subtle variations to make it more tolerant to unspecificied behaviour is just asking for bugs (or for more expensive systems that don't clear the purchasing price bar).

cratermoon · 2 years ago
There's a difference between parsing and validating. https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

You're right about all the buggy stuff out there, and that nobody wants to pay to make it better, though.
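The linked post's idea, applied here: parse raw strings into a typed value once, at the boundary, so downstream code never touches unvalidated text. A minimal sketch (the field layout is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    name: str
    lat: float
    lon: float

@dataclass(frozen=True)
class FlightPlan:
    callsign: str
    route: tuple  # tuple of Waypoint

def parse_plan(fields):
    """Parse raw strings into a typed FlightPlan exactly once.
    After this point, code works with Waypoint/FlightPlan values and
    can't even express 'waypoint with unparseable coordinates'."""
    callsign, *points = fields
    if not callsign or len(points) < 2:
        raise ValueError("need a callsign and at least two waypoints")
    route = tuple(Waypoint(name, float(lat), float(lon))
                  for name, lat, lon in (p.split(":") for p in points))
    return FlightPlan(callsign, route)
```

Contrast with validate-then-keep-the-strings, where every downstream access re-parses (and can re-fail).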

jameshh · 2 years ago
That's super interesting (and a little terrifying). It's funny how different industries have developed different "cultures" for seemingly random reasons.
cratermoon · 2 years ago
It was terrifying enough for me in the gig I worked on that dealt with reservations and check-in, where a catastrophic failure would be someone boarding a flight when they shouldn't have. To avoid that sort of failure, the system mostly just gave up and issued the passenger what's called an "Airport Service Document": effectively a record that shows the passenger as having a seat on the flight, but unable to check-in. This allows the passenger to go to the airport and talk to an agent at the check-in desk. At that point, yes, a person gets involved, and a good agent can usually work out the problem and get the passenger on their flight, but of course that takes time.

If you've ever been at the airline desk waiting to check in and an agent spends 10 minutes working with a passenger, it's because they got an ASD and the agent has to screw around directly in the user-hostile SABRE interface to fix the reservation.