So they forgot to "geographically disparate" fence their queries. Having built a flight navigation system before, I know this bug. I've seen this bug. I've followed the spec to include a geofence to avoid this bug.
1. Pilots occasionally have to fat finger them into ruggedized I/O devices and read them off to ATC over radios.
2. These are defined by the various regional aviation authorities. The US FAA will define one list (and the names will be unique within the US), the EU will have one (EASA?), etc.
The AA965 crash (1995-12-20) was due to an aliased waypoint name: Colombia had two waypoints with the same name ('R') within 150 nautical miles of each other. This was in violation of ICAO regulations dating back to around the '70s.
https://en.wikipedia.org/wiki/American_Airlines_Flight_965
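A minimal sketch of the kind of dataset check this implies, assuming waypoints come as (name, lat, lon) tuples; the 150 NM threshold is taken from the comment above purely as an illustration, not as the actual ICAO figure:

```python
from itertools import combinations
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_NM = 3440.1  # mean Earth radius in nautical miles

def distance_nm(lat1, lon1, lat2, lon2):
    # Great-circle distance via the haversine formula.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_NM * asin(sqrt(a))

def conflicting_waypoints(waypoints, min_spacing_nm=150.0):
    # waypoints: iterable of (name, lat, lon) tuples.
    # Returns identically-named waypoints that sit closer together than the
    # minimum spacing, i.e. the cases that need a geographic disambiguation.
    by_name = {}
    for name, lat, lon in waypoints:
        by_name.setdefault(name, []).append((lat, lon))
    conflicts = []
    for name, coords in by_name.items():
        for a, b in combinations(coords, 2):
            if distance_nm(a[0], a[1], b[0], b[1]) < min_spacing_nm:
                conflicts.append((name, a, b))
    return conflicts
```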
The names have to be entered manually by pilots, e.g. if they change the route. They have to be transmitted over the air by humans. So they must be short and simple.
Long story: because changing identifiers is a considerable refactoring, and it takes coordination with multiple partners distributed worldwide to transition safely from the old system to the new one, all to avoid a hypothetical issue some software engineer came up with.
Short story: money. It costs money to do things well.
It sounds like for actual processing they replace them with GPS coordinates (or at least augment them with such). But this is the system that is responsible for actually doing that...
> the backup system applied the same logic to the flight plan with the same result
Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires to activate the trim. The attachment was through a comparator, that would shut off the authority of both boxes if they didn't agree.
The boxes were designed with:
1. different algorithms
2. different programming languages
3. different CPUs
4. code written by different teams with a firewall between them
The idea was that bugs from one box would not cause the other to fail in the same way.
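A toy illustration of the comparator behaviour described above; the command values and tolerance are invented, and a real system would do this in dedicated hardware rather than a Python function:

```python
def compare_channels(cmd_a, cmd_b, tolerance=0.05):
    """Return the trim command to actuate, or None to shut off authority."""
    if abs(cmd_a - cmd_b) <= tolerance:
        # The two independently developed boxes agree (within some slack).
        return (cmd_a + cmd_b) / 2.0
    # Disagreement: remove authority from both boxes; the pilot is the backup.
    return None
```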
This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.
Air traffic control systems should at least be 2oo3[1] (3 systems independently developed of which 2 must concur at any given time) so that a failure of one system would still allow the other two to continue operation without impacting availability of the aviation industry.
Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.
[1] https://en.wikipedia.org/wiki/Triple_modular_redundancy
[2] https://en.wikipedia.org/wiki/Instrument_flight_rules#Separa...
[3] https://en.wikipedia.org/wiki/Visual_flight_rules#Controlled...
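For contrast with the 2oo2 comparator, a toy 2oo3 voter in the spirit of [1]; the values and tolerance are made up, and real systems vote in redundant hardware, but it shows why one failed channel does not take down availability:

```python
def vote_2oo3(a, b, c, tolerance=1e-6):
    # Accept an output as soon as any two of the three independently
    # developed channels agree; a lone dissenting channel is tolerated.
    for x, y in ((a, b), (a, c), (b, c)):
        if abs(x - y) <= tolerance:
            return (x + y) / 2.0
    # No two channels agree: the system as a whole is unavailable.
    raise RuntimeError("no two channels agree")
```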
> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.
But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.
As for the benefits of independent development: it might have helped, but probably not as much as one would hope if one thinks of programming errors as essentially random defects, analogous to, say, weaknesses in a bundle of cables. I had a bit more to say about it here:
https://news.ycombinator.com/item?id=37476624
This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!
In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced, but it seems to me the solution would be to spend a bit more money on reviewing the one implementation rather than writing two new ones from scratch.
> Human backup is not possible because of human resourcing
This is an artificial constraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?".
Considering this bug only showed up now, chances are there was a project manager who decided the risk would be extremely low and not worth spending another 200k or so of yearly operating expenses on.
First thought that came to my mind as well when I read it. This failover system seems to be more designed to mitigate hardware failures than software bugs.
I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.
A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
If this is true, then would it be a better investment to have the 2nd team produce a fuzz-testing/systematic-testing mechanism instead of producing a secondary copy of the same system?
In fact, make it adversarial testing, such that this team is rewarded (maybe financially) if mistakes or problems are found in the 1st team's program.
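As a sketch of what that second team's deliverable could look like, here is a hand-rolled fuzz loop; parse_flight_plan and RejectedFlightPlan are hypothetical stand-ins for the real parser and its documented "bad input" error, and any other exception counts as a finding:

```python
import random
import string

class RejectedFlightPlan(Exception):
    """Placeholder for the parser's documented rejection of bad input."""

def random_route(rng):
    # Generate an ICAO-route-like string of short uppercase tokens.
    tokens = ["".join(rng.choices(string.ascii_uppercase + string.digits,
                                  k=rng.randint(1, 7)))
              for _ in range(rng.randint(0, 30))]
    return " ".join(tokens)

def fuzz(parse_flight_plan, iterations=100_000, seed=0):
    rng = random.Random(seed)
    findings = []
    for _ in range(iterations):
        plan = random_route(rng)
        try:
            parse_flight_plan(plan)       # may accept or reject the input...
        except RejectedFlightPlan:        # ...but only via the documented error
            pass
        except Exception as exc:          # anything else is a bug to report
            findings.append((plan, exc))
    return findings
```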
Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack, that's why there's a "clock" to synchronize things.
I seem to remember another problem at NATS which had the same effect. Primary fell over so they switched over to a secondary that fell over for the exact same reason.
It seems like you should only failover if you know the problem is with the primary and not with the software itself. Failing over "just because" just reinforces the idea that they didn't have enough information exposed to really know what to do.
The bit that makes me feel a bit sick though is that they didn't have a method called "ValidateFlightPlan" that throws an error if for any reason it couldn't be parsed and that error could be handled in a really simple way. What programmer would look at a processor of external input and not think, "what do we do with bad input that makes it fall over?". I did something today for a simple message prompt since I can't guarantee that in all scenarios the data I need will be present/correct. Try/catch and a simple message to the user "Data could not be processed".
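A rough sketch of that boundary, with every name here hypothetical; the only point is that any failure during validation becomes a single expected error type the caller can handle simply:

```python
class InvalidFlightPlan(Exception):
    """Expected error for any flight plan that cannot be processed."""

def validate_flight_plan(raw_text, parse, checks=()):
    # parse: callable turning raw text into a structured plan (placeholder)
    # checks: optional semantic checks run against the parsed plan
    try:
        plan = parse(raw_text)
        for check in checks:
            check(plan)
        return plan
    except Exception as exc:
        # Bad external input is a recoverable condition, not a system fault:
        # convert it into one known error ("data could not be processed").
        raise InvalidFlightPlan(str(exc)) from exc
```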
Well, if the primary is known not to be in a good state, you might as well fail over and hope that the issue was a fried disk or a cosmic bit flip or something.
The real safety feature is the 4 hour lead time before manual processing becomes necessary.
One of the key safety controls in aviation is “if this breaks for any reason, what do we do”, not so much “how do we stop this breaking in the first place”.
It was in a bad state, but in a very inane way: a flight plan in its processing queue was faulty. The system itself was mostly fine. It was just not well-written enough to distinguish an input error from an internal error, and thus didn't just skip the faulty flight plan.
No validation, and this point from the article stood out to me:
---
The programming style is very imperative. Furthermore, the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained.
---
Given that description, I'd be surprised if it wasn't just running a regex / substring matches against the text and there's no classes / objects / data structure involved. Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
> Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
It's new code, from 2018 :)
Quote from the report:
> An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA sub- system was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers.
Failing over is correct because there's no way to discern that the hardware is not at fault. They should have designed a better response to the second failure to avoid the knock-on effects.
And why could the system not put the failed flight plan in a queue for human review and just keep on working for the rest of the flights? I think the lack of that “feature” is what I find so boggling.
Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.
If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.
That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough in considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, which would have gone a long way toward flushing out such errors.
I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.
That is, when processing a sequential queue which is what this job does, it seems to me reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher level queue".
I've actually seen this bug in different contexts before, and the lessons should always be: One bad job shouldn't crash the whole system. Error handling boundaries should be such that a bad job should be taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful when processing jobs about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
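A minimal sketch of such a per-job error boundary; the queue objects and the process() callable are placeholders:

```python
def run_worker(inbox, dead_letter, process, log):
    # One bad job is quarantined for manual review; the loop keeps running.
    while True:
        job = inbox.get()                 # next flight plan from the queue
        try:
            process(job)
        except Exception:
            log.exception("job failed; routing to manual review: %r", job)
            dead_letter.put(job)
```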
That's true, but then, why did engineers try to restart the system several times if they had no clue what was happening, and restarting it could have been dangerous?
To be fair, the article suggests early on that sometimes these plans are being processed for flights already in the air (although at least 4 hours away from the UK).
If you can stop the specific problematic plane taking off then keeping the system running is fine, but once you have a flight in the air it's a different game.
It's not totally unreasonable to say "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".
If you really can't handle the flight plan, I imagine a reasonable solution would be to somehow force the incoming plane to redirect and land before reaching the UK, until you can work out where it's actually going, but that's definitely something that needs to wait for manual intervention anyway.
To be fair that is exactly what the article said was a major problem, and which the postmortem also said was a major problem. I agree I think this is the most important issue:
> The FPRSA-R system has bad failure modes
> All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.
> A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":
>> The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.
Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
For these kinds of things, the post mortem and remediation have to take as given that an unhandled, unknown error that could not have been predicted in advance will eventually occur, and then work out how it could be handled better. Of course the solution to a bug is to fix the bug, but the issue, and the reason for the meltdown, is a DR plan that couldn't be implemented in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling: something of a similar caliber will happen again eventually with probability 1, even with the best coders.
I agree with your first paragraph, but your second paragraph is quite defeatist. I was involved in quite a few "premortem" meetings where people think of increasingly improbable failure modes and devise strategies for them. It's a useful meeting before large changes to critical systems go live. In my opinion, this should totally be a known error.
> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.
It doesn't take much imagination to surmise that perhaps real world data is broken and sometimes you are handed data that doesn't have a valid UK portion of flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted to be invalid, but gracefully dealing with the invalid plan should be a requirement.
> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.
I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.
> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
How many KeyError exceptions have brought down your whole server? It doesn't happen because whoever coded your web framework knows better and added a big try-catch around the code which handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer made a mistake.
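Roughly what that outermost handler looks like, without implying any particular framework's API; one handler blowing up yields a 500 for that request only, never a process-wide shutdown:

```python
def handle_request(route_handler, request, log):
    try:
        return route_handler(request)      # may raise KeyError, anything
    except Exception:
        log.exception("unhandled error in request %r", request)
        return {"status": 500, "body": "internal error"}
```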
> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
What surprised me more is that the amount of data for all waypoints on the globe is quite small. If I were to implement a feature that queries them by name as an identifier, the first thing I'd do is check for duplicates in the dataset. Because if there are any, I need to consider that condition in every place where I'd be querying a waypoint by a potentially duplicated identifier.
I had that thought immediately when looking at the flight plan format and noticing the short strings referring to waypoints, well before getting to the section where they point out the name collision issue.
Maybe I'm too used to working with absurd amounts of data (at least in comparison to this dataset); it's a constant part of my job to do some cursory data analysis to understand the parameters of the data I'm working with, what values can be duplicated or malformed, etc.
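The cursory check being described might be as small as this, assuming each waypoint record carries an "ident" field (the field name is invented):

```python
from collections import Counter

def duplicate_identifiers(waypoints):
    # waypoints: iterable of dicts with at least an "ident" field.
    # Any count > 1 means every lookup by name needs disambiguation.
    counts = Counter(w["ident"] for w in waypoints)
    return {ident: n for ident, n in counts.items() if n > 1}
```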
That it's safety critical is all the more reason it should fail gracefully (albeit surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.
The algorithm as described in the blogpost is probably not implemented as a straightforward piece of procedural code that goes step by step through the input flightplan waypoints as described. It may be implemented in a way that incorporates some abstractions that obscured the fact that this was an input error.
If from the code’s point of view it looked instead like a sanity failure in the underlying navigation waypoint database, aborting processing of flight plans makes a lot more sense.
Imagine the code is asking some repository of waypoints and routes ‘find me the waypoint where this route leaves UK airspace’; then it asks to find the route segment that incorporates that waypoint; then it asserts that that segment passes through UK airspace… if that assertion fails, that doesn’t look immediately like a problem with the flight plan but rather with the invariant assumptions built into the route data.
And of course in a sense it is potentially a fatal bug because this issue demonstrates that the assumptions the algorithm is making about the data are wrong and it is potentially capable of returning incorrect answers.
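A hedged sketch of that distinction, with an entirely hypothetical nav_db API, just to show where the classification decision sits:

```python
class BadFlightPlan(Exception): pass
class NavDataInvariantBroken(Exception): pass

def uk_portion(plan, nav_db):
    # nav_db and its methods are placeholders for some waypoint/route repository.
    entry = nav_db.find_boundary_waypoint(plan, entering=True)
    exit_ = nav_db.find_boundary_waypoint(plan, entering=False)
    if entry is None or exit_ is None:
        # Looks like a problem with this plan: skip it, keep processing others.
        raise BadFlightPlan("plan has no usable UK entry/exit points")
    segment = nav_db.segment_between(entry, exit_)
    if not segment.passes_through_uk():
        # Looks like the route data contradicts our invariants: this is the
        # classification that justifies halting rather than skipping.
        raise NavDataInvariantBroken(f"segment {entry}->{exit_} not in UK airspace")
    return segment
```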
I've had brief glimpses of these systems, and honestly I wouldn't be surprised if it took more than a year for a simple feature like this to be implemented. These systems look like decades of legacy code duct-taped together.
> why could the system not put the failed flight plan in a queue
Because it doesn't look at the data as a "flight plan" consisting of "way points" with "segments" along a "route" that has any internal self-consistency. It's a bag of strings and numbers that's parsed, and the result passed along if parsing is successful. If not, give up; in this case, fail the entire system and take it out of production.
Airline industry code is a pile of badly-written legacy wrappers on top of legacy wrappers. (Mostly not including actual flight software on the aircraft. Mostly.) The FPRSA-R system mentioned here is not a flight plan system, it's an ETL system. It's not coded to model or work with flight plans; it's just parsing data from system A, re-encoding it for system B, and failing hard if it can't.
Good ETLs are usually designed to separate good records from bad records, so even if one or two rows in the stream do not conform to the schema, you can put them aside and process the rest.
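In sketch form, with the validator, transformer, and sinks left as placeholders:

```python
def run_etl(rows, validate, transform, good_sink, bad_sink):
    # validate(row) returns None if the row conforms, otherwise a reason string.
    for row in rows:
        problem = validate(row)
        if problem is not None:
            bad_sink.append((row, problem))   # set aside with a reason
            continue
        good_sink.append(transform(row))      # the rest of the batch still flows
```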
The recent episode of The Daily about the (US) aviation industry has convinced me that we’ll see a catastrophic headline soon. Things can’t go on like this.
The fact that they blamed the French flight plan already accepted by Eurocontrol proves that they didn't really know how the software works. And here the Austrian company should take part of the blame for the lack of intensive testing.
"software supplier"???
Why on God's green earth isn't someone familiar with the code on 24/7 pager duty for a system with this level of mission criticality?
That would be... the software supplier. This is quite a specific fault (albeit one that shouldn't have happened if better programming practices had been used), so I don't think anyone but the software's original developers would know what to do. This system is not safety-critical, luckily.
I think there is a bit of ignorance about how software is sold in some cases. This is not just some windows or browser application that was sold but it also contained the staff training with a help to procure hardware to run that software and maybe even more. Such systems get closed off from the outside without a way to send telemetry to the public internet (I've seen this before, it is bizarre and hard to deal with). The contract would have some clauses that deal with such situations where you will always have someone on call as the last line of defense if a critical issue happens. Otherwise, the trained teams should have been able to deal with it but could not.
Essentially this comes down to the lack of a proper namespace; who'd have thought aerospace engineers would need to study operating systems! I have a friend who's a retired air force pilot and graduated from Cranfield University, the UK's foremost postgraduate institution for aerospace engineering, with its own airport for teaching and research [1]. According to him he did study OSes at Cranfield, and now I finally understand why.
Apparently, based on the other comments, a standard for namespacing is already available but is not currently being used by NATS/ATC; hopefully they've learnt their lesson and start using it, for goodness' sake. The top comment mentioned the geofencing bug, but if NATS/ATC used a proper namespace, geofencing probably wouldn't be necessary in the first place.
[1] Cranfield University: https://en.wikipedia.org/wiki/Cranfield_University
It sounds like a great place to study that has its own ~2km long airstrip! It would be nice if they had a spare Trident or Hercules just lying around for student baggage transport :)
"the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained."
Oh, this is typical in airline industry work. Ask programmers about a domain model or parsing and they give you blank stares. They love their validation code, and they love just giving up if something doesn't validate. It's all dumb data pipelines. At no point is there code that models the activities happening in the real world.
In no system is there a "flight plan" type that has any behavior associated with it or anything like a set of waypoint types. Any type found would be a struct of strings in C terms, passed around and parsed not once, but every time the struct member is accessed. As the article notes, "The programming style seems very imperative.".
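For contrast, a sketch of a minimal domain model, with invented field names, where the text is parsed once into a typed structure instead of re-parsing a struct of strings on every access:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Waypoint:
    ident: str
    lat: float
    lon: float

@dataclass(frozen=True)
class FlightPlan:
    callsign: str
    route: tuple[Waypoint, ...]

    def first_in_region(self, contains):
        # contains: predicate over a Waypoint, e.g. "is inside UK airspace".
        return next((w for w in self.route if contains(w)), None)
```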
Giving up if something doesn't validate is indeed standard to avoid propagating badly interpreted data, causing far more complex bugs down the line. Validate soon, validate strongly, report errors and don't try to interpret whatever the hell is wrong with the input, don't try to be 'clever', because there lie the safety holes. Crashing on bad input is wrong, but trying to interpret data that doesn't validate, without specs (of course) is fraught with incomprehension and incompatibilities down the line, or unexpected corner cases (or untested, but no one wants to pay for a fully tested all-goes system, or just for the tools to simulate 'wrong inputs' or for formal validation of the parser and all the code using the parser's results).
There are already too many problems with non-compliant or legacy (or just buggy) data emitters, with the complexity in semantics or timing of the interfaces, to try and be clever with badly formatted/encoded data.
It's already difficult (and costly) to make a system work as specified, so adding subtle variations to make it more tolerant of unspecified behaviour is just asking for bugs (or for more expensive systems that don't clear the purchasing price bar).
That's super interesting (and a little terrifying). It's funny how different industries have developed different "cultures" for seemingly random reasons.
It was terrifying enough for me in the gig I worked on that dealt with reservations and check-in, where a catastrophic failure would be someone boarding a flight when they shouldn't have. To avoid that sort of failure, the system mostly just gave up and issued the passenger what's called an "Airport Service Document": effectively a record that shows the passenger as having a seat on the flight, but unable to check-in. This allows the passenger to go to the airport and talk to an agent at the check-in desk. At that point, yes, a person gets involved, and a good agent can usually work out the problem and get the passenger on their flight, but of course that takes time.
If you've ever been at the airline desk waiting to check in and an agent spends 10 minutes working with a passenger (or passengers), it's because they got an ASD and the agent has to screw around directly in the user-hostile SABRE interface to fix the reservation.
You need to be able to read, write, hear, and speak the identifier. (And receive/transmit in morse code)
Would it be okay to have an "area code prefix" in the identifier? Plausible (but practically speaking too late for that)
E.g. Yankee = YANKY. The pilot and ATC must be location aware. Apparently their software is not.
Nothing is perfect, though, and the pilot is the backup for failure of that system. I.e. turn off the stab trim system.
1. Process controls: What do we do when this breaks for any reason?
2. Engineering controls: What can we do to keep this from breaking in the first place?
Both of them seem to be somewhat essential for a truly safe system.
The software raised an exception because a "// TODO: this should never happen" case happened
A hardware fault would look like machines not talking to each other or corrupted data file unreadable
Primary suffers integer overflow, fails. Secondary is identical, which also overflows. Angle of attack increases, boosters separate. Rocket goes boom.
[1] https://en.wikipedia.org/wiki/Ariane_flight_V88
If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.
That is, when processing a sequential queue which is what this job does, it seems to me reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher level queue".
I've actually seen this bug in different contexts before, and the lessons should always be: One bad job shouldn't crash the whole system. Error handling boundaries should be such that a bad job should be taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful when processing jobs about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
Because you eventually figure out that, yes, it does happen
Flight plans don't tell where the plane is. Where is this assumption coming from?
seems like poor engineering
Coincidentally-identical waypoint names foxed UK air traffic control system - https://news.ycombinator.com/item?id=37430384 - Sept 2023 (64 comments)
UK air traffic control outage caused by bad data in flight plan - https://news.ycombinator.com/item?id=37402766 - Sept 2023 (20 comments)
NATS report into air traffic control incident details root cause and solution - https://news.ycombinator.com/item?id=37401864 - Sept 2023 (19 comments)
UK Air traffic control network crash - https://news.ycombinator.com/item?id=37292406 - Aug 2023 (23 comments)
- waypoint names used around the world are not unique
- as a sort of kludge, "In order to avoid confusion latest standards state that such identical designators should be geographically widely spaced."
- but still you might get the same waypoint name used twice in a route to mean different places
- the software was not written with that possibilty in mind
- route did not compute
- threw 'critical exception' and entered 'maintenance mode' - i.e. crashed
- backup system took over, hit the same bug with the same bit of data, also crashed
- support people have a crap time
- it wasn't until they called the software supplier that they found the low-level logs that revealed the cause of the problem
You're right about all the buggy stuff out there, and that nobody wants to pay to make it better, though.