jrochkind1 · 9 months ago
I don't know how long that failure mode has been in place or if this is relevant, but it makes me think of analogous situations I've encountered:

When automated systems are first put in place, for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everyone was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process we were all using yesterday than to risk a catastrophe.

In that situation, switching back to yesterday's workflow is something that won't interrupt much.

A couple decades -- or honestly even just a couple years -- later, that same fault handling, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even itself raises the risk of catastrophic mistakes.

The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault-handling, but there may be other cases) that is totally reasonable when put in place but has not aged well, where nobody has noticed or realized it, and even if they did it might be hard to convince anyone it's a priority to improve, and the longer you wait the more expensive it gets.

ronsor · 9 months ago
> The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault-handling, but there may be other cases) that is totally reasonable when put in place but has not aged well, where nobody has noticed or realized it, and even if they did it might be hard to convince anyone it's a priority to improve, and the longer you wait the more expensive it gets.

The solution to this is to trigger all functionality periodically and randomly to ensure it remains tested. If you don't test your backups, you don't have any.
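
Roughly the idea, as a toy sketch in Python (the function names and the drill probability are made up, not from any real system): periodically and randomly force the fallback path so it keeps getting exercised.

  import random

  def process_flight_plan(plan, use_fallback=False):
      # Hypothetical processing step; the real system would do far more.
      if use_fallback:
          return "MANUAL-QUEUE: " + plan
      return "AUTO: " + plan

  def run_with_fault_drill(plans, drill_probability=0.01):
      """Randomly force the fallback path so it never rots unused."""
      results = []
      for plan in plans:
          drill = random.random() < drill_probability
          if drill:
              print("fault drill: forcing fallback for", plan)
          results.append(process_flight_plan(plan, use_fallback=drill))
      return results

  print(run_with_fault_drill(["BA123", "AF456", "LH789"], drill_probability=0.5))

In practice you'd only run drills like this where a failed drill is cheap (staging, or a small random slice of traffic), which is the same trade-off chaos-testing tools make.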

jvanderbot · 9 months ago
That is "a" solution.

Another solution that is very foreign to us in sweng, but is common practice in, say, aviation, is to have that fallback plan in a big thick book, and to have a light that says "Oh it's time to use the fallback plan", rather than require users to diagnose the issue and remember the fallback.

This was one of the key ideas in the design of critical systems*: Instead of automating the execution of a big branching plan, it is often preferable to automate just the detection of the next desirable state, then let the users execute the transition. This is because, if there is time, it allows all users to be fully cognizant of the inner state of the system and the reasons for that state, in case they need to take over.

The worst of both worlds is to automate yourself into a corner, gunk everything up, and then require the user to come in and do a back-breaking cleanup just to get to the point where they can diagnose this. My factorio experiences mirror this last case perfectly.

* "Joint Cognitive Systems" - Hollnagle&Woods

ericjmorey · 9 months ago
Which company deployed a chaos monkey daemon on their systems? It seemed to improve resiliency when I read about it.
telgareith · 9 months ago
Dig into the OpenZFS 2.2.0 data loss bug story. There was at least one ticket (in FreeBSD) where it cropped up almost a year prior and got labeled "look into later," but it got closed.

I'm aware that closing "future investigation" tickets when something no longer seems to be an issue is common. But it shouldn't be.

Arainach · 9 months ago
>it shouldn't be

Software can (maybe) be perfect, or it can be relevant to a large user base. It cannot be both.

With an enormous budget and a strictly controlled scope (spacecraft) it may be possible to achieve defect-free software.

In most cases it is not. Resources are always finite, and there are almost always more ideas than there is time to implement them.

If you are trying to make money, is it worth chasing down issues that affect a minuscule fraction of users, when that eng time could be spent on architectural improvements, features, or bugs affecting more people?

If you are an open source or passion project, is it worth your contributors' limited hours, and will trying to insist people chase down everything drive your contributors away?

The reality in any sufficiently large project is that the bug database will only grow over time. If you leave open every old request and report at P3, users will grow just as disillusioned as if you were honest and closed them as "won't fix". Having thousands of open issues that will never be worked on pollutes the database and makes it harder to keep track of the issues which DO matter.

pj_mukh · 9 months ago
"When automated systems are first put in place, for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan"

Pretty sure this is exactly what happened with Cruise in San Francisco: cars would just stop and await instructions, causing traffic jams. The city got mad, so they added a "pull over" mechanism. Except then the "pull over" mechanism ended up dragging a pedestrian who had been flung into the car's path by a hit-and-run driver.

The real world will break all your test cases.

sameoldtune · 9 months ago
> switching back to a rarely used and much more inefficient manual process is extremely disruptive, and even itself raises the risk of catastrophic mistakes.

Catastrophe is most likely to strike when you try to fix a small mistake: pushing a hot-fix that takes down the server; burning yourself trying to take overdone cookies from the oven; offending someone you are trying to apologize to.

crtified · 9 months ago
Also, as codebases and systems get more (not less) complex over time, the potential for technical debt multiplies. There are more processing and outcome vectors, more (and different) branching paths. New logic maps. Every day/month/year/decade is a new operating environment.
mithametacs · 9 months ago
I don’t think it is exponential. In fact, one of the things that surprises me about software engineering is that it’s possible at all.

Bugs seem to scale log-linearly with code complexity. If it’s exponential you’re doing it wrong.

InDubioProRubio · 9 months ago
Reminds me of the switches we used to put into production machines that could self-destruct:

  if (well_defined_case) {
      // normal operation
  } else {
      scream();            // alert the humans
      while (true) {
          sleep(forever);  // halt and do nothing further
      }
  }

Same idea with a switch statement's default branch.

Basically, for every known unknown it's better to halt and let humans drive the fragile machine back into safe parameters, or expand the program.

PS: Yes, the else: you know what the else is, it's the set of !(well-defined conditions). And it's ever-changing, whenever the well-defined if condition changes.

agos · 9 months ago
Erlang was born out of a similar problem with telephone switches: if a terminal sends bogus data or otherwise crashes, it should not bring everything down, because the subsequent reconnection storm would be catastrophic. So "let it fail" would be a very reasonable approach to this challenge, at least for fault handling.
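
You can't easily reproduce BEAM semantics elsewhere, but here is a minimal Python sketch of the supervision idea (hypothetical handler, nothing to do with the actual NATS system): the unit of work that hits bogus data dies and is retried or dropped, while the rest of the system keeps running.

  import multiprocessing

  def handle_message(msg):
      # A worker that may crash on bogus input.
      if msg == "bogus":
          raise ValueError("corrupt message: " + msg)
      print("handled", msg)

  def supervised(msg, retries=3):
      """Run the handler in its own process; if it dies, retry, then give up
      on this one message instead of taking the whole system down."""
      for _ in range(retries):
          worker = multiprocessing.Process(target=handle_message, args=(msg,))
          worker.start()
          worker.join()
          if worker.exitcode == 0:
              return True
      print("giving up on", repr(msg), "after", retries, "attempts; the rest keeps running")
      return False

  if __name__ == "__main__":
      for m in ["ok-1", "bogus", "ok-2"]:
          supervised(m)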
akavel · 9 months ago
Take a look at the book: "Systemantics. How systems work and especially how they fail" - a classic, has more observations like this.
wkat4242 · 9 months ago
It's been in place for a while, it happens every few months.
jp57 · 9 months ago
FYI: nm = nautical miles, not nanometers.
joemi · 9 months ago
It's quite amusing that they used the incorrect, lowercase abbreviation for "nautical mile" which means something else ("nanometer") in an article about a major issue caused by two things sharing the same abbreviation.
SergeAx · 9 months ago
I learned navigation around 2005, and we used "nm" then. I honestly don't remember the moment when this abbreviation was granted to nanometer.
cduzz · 9 months ago
I was wondering; it seemed like if the two airports were 36000 angstroms apart (3600 nanometers), it'd be reasonable to give them the same airport code since they'd be pretty much on top of each other.

I've also seen "DANGER!! 12000000 μVolts!!!" on tiny little model railroad signs.

atonse · 9 months ago
That's so adorable (for model railroads)
andyjohnson0 · 9 months ago
Even though I knew this was about aviation, I still read nm as nanometres. Now I'm wondering what this says about how my brain works.
lostlogin · 9 months ago
It says ‘metric’. Good.
skykooler · 9 months ago
"Hacker News failure caused by two units 12 orders of magnitude apart sharing 2-letter code"
krick · 9 months ago
I wouldn't have guessed until I read the comments. My assumption was somebody just mistyped km and somehow nobody cared to fix it.
jug · 9 months ago
Yeah, I went into the article thinking this because I expected someone had created waypoints right on top of each other and in the process had also somehow generated the same code for them.
QuercusMax · 9 months ago
Ah! I thought this was a case where the locations were just BARELY different from each other, not that they're very far apart.
barbazoo · 9 months ago
Given the context, I'd say NM actually https://en.wikipedia.org/wiki/Nautical_mile
jp57 · 9 months ago
I was clarifying the post title, which uses "nm".
rob74 · 9 months ago
Yes. And, to quote the Wikipedia article: "Symbol: M, NM, or nmi". Not nm (as used in the title; the article also uses it).


dietr1ch · 9 months ago
Thanks, from the title I was confused on why there was such a high resolution on positions.
hughdbrown · 9 months ago
Wow, I read this article because I could not understand how two labeled points on an air path could be 3600 nanometers apart. Never occurred to me that someone would use 'nm' to mean nautical miles.
ikiris · 9 months ago
Nanometers would be a very short flight.
cheschire · 9 months ago
I could imagine conflict arising when switching between single and double precision causing inequality like this.
ainiriand · 9 months ago
Exactly the first thing that came to my mind when I saw that abbreviation.
larsnystrom · 9 months ago
Ah, yes, like when people put in extraordinary amounts of effort to avoid sending a millibit (mb) of data over the wire.
animal531 · 9 months ago
And for those like myself wondering how much 3600nm is, it is of course 0.0036mm


endoblast · 9 months ago
We all need to stop using abbreviations, in my opinion.

EDIT: I mean the point of abbreviations is to facilitate communication. However with the world wide web connecting multiple countries, languages and fields of endeavour there are simply too many (for example) three letter acronyms in use. There are too many sources of ambiguity and confusion. Better to embrace long-form writing.

fabrixxm · 9 months ago
It took me a while, tbh..
noqc · 9 months ago
man, this ruins everything.


FateOfNations · 9 months ago
Good news: the system successfully detected an error and didn't send bad data to air traffic controllers.

Bad News: the system can't recover from an error in an individual flight plan, bringing the whole system down with it (along with the backup system since it was running the same code).

wyldfire · 9 months ago
> the system can't recover from an error in an individual flight plan, bringing the whole system down with it

From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.

The systems outside of the scope of this one failed to preserve a uniqueness guarantee that was depended on by this system. Was that dependency correctly identified as one that was the job of System X and not System Y?

akira2501 · 9 months ago
> obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights?

Flights are tracked by radar and by transponder. The appropriate thing to do is just flag the flight with a discontinuity error but otherwise operate normally. This happens with other statuses like "radio failure" or "emergency aircraft."

It's not something you'd see on a commercial flight, but on a private IFR flight (one with a flight plan) you can actually cancel your IFR plan mid-flight and revert to VFR (visual flight rules) instead.

Some flights take off without an IFR clearance as a VFR flight, but once airborne, they call up ATC and request an IFR clearance already en route.

The system is vouchsafing where it does not need to.

martinald · 9 months ago
Yes, I agree. From what I understand, the reason the system crashed wasn't the duplicate code itself; it was that the interpreted plan had the plane time travelling, which suggested very serious corruption.
outworlder · 9 months ago
> From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.

Flagging the error is absolutely the right way to go. It should have rejected the flight plan, however. There could be issues if the flight was allowed to proceed and you now have an aircraft you didn't expect showing up.

Crashing is not the way to handle it.

aftbit · 9 months ago
It seems fundamentally unreasonable for the flight processing system to entirely shut itself down just because it detected that one flight plan had corrupt data. Some degree of robustness should be expected from this system IMO.
cryptonector · 9 months ago
There is no need to shut down the whole system just because of one flight plan that the system was able to reject. Canceling (or forcing manual updates to) one flight plan is a lot better than canceling 1,500 flights.
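
As a toy sketch of that containment (invented data model and validation check, not the real FPRSA-R logic): a plan that fails validation goes to a manual queue, and everything else keeps flowing.

  def validate_plan(plan):
      # Hypothetical check: the route must not name the same waypoint twice.
      seen = set()
      for waypoint in plan["route"]:
          if waypoint in seen:
              raise ValueError("duplicate waypoint %r in plan %s" % (waypoint, plan["id"]))
          seen.add(waypoint)

  def process_plans(plans):
      accepted, manual_review = [], []
      for plan in plans:
          try:
              validate_plan(plan)
              accepted.append(plan)
          except ValueError as err:
              # Contain the failure: flag this one plan, keep the system up.
              manual_review.append((plan, str(err)))
      return accepted, manual_review

  plans = [
      {"id": "A1", "route": ["DVL", "MID", "LAM"]},
      {"id": "B2", "route": ["DVL", "XXX", "DVL"]},  # the problematic one
  ]
  ok, review = process_plans(plans)
  print(len(ok), "accepted;", len(review), "sent to manual review")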


steeeeeve · 9 months ago
You know there's a software engineer somewhere that saw this as a potential problem, brought up a solution, and had that solution rejected because handling it would add 40 hours of work to a project.
nightowl_games · 9 months ago
I don't know that, and I don't like this assumption that only 'managers' make mistakes, or that software engineers are always right. I think it's needlessly adversarial, biased, and largely incorrect.
elteto · 9 months ago
Agreed. And most of the people with these attitudes have never written actual safety critical code where everything is written to a very detailed spec. Most likely the designers of the system thought of this edge case and required adding a runtime check and fatal assertion if it was ever encountered.
zer8k · 9 months ago
Spoken like a manager.

Look, when you're barking orders at the guys in the trenches who, understandably in fear for their jobs, do the stupid "business-smart" thing, then it is entirely the fault of management.

I can't tell you how many times just in the last year I've been blamed-by-proxy for doing something that was decreed upon me by some moron in a corner office. Everything is an emergency, everything needs to be done yesterday, everything is changing all the time because King Shit and his merry band of boot-licking middle managers decide it should be.

Software engineers, especially ones with significant experience, are almost surely more right than middle managers. "Shouldn't we consider this case?" is almost always met with some parable about "overengineering" and followed up by a healthy dose of "that's not AGILE". I have grown so tired of this and thanks to the massive crater in job mobility most of us just do as we are told.

It's the power imbalance. In this light, all blame should fall on the manager unless it can be explicitly shown to be a developer problem. The adage "those who can, do, and those who can't, teach" applies equally to management.

When it's my f@#$U neck on the line and the only option to keep my job is to do the stupid thing, you can bet I'll do the stupid thing. Thank god there's no malpractice law in software.

Poor you - only one of our jobs is getting shipped overseas.

CrimsonCape · 9 months ago
C dev: "You are telling me that the three digit codes are not globally unique??? And now we have to add more bits to the struct?? That's going to kill our perfectly optimized bit layout in memory! F***! This whole app is going to sh**"
throw0101a · 9 months ago
> C dev: "You are telling me that the three digit codes are not globally unique???

They are understood not to be. They are generally known to be regionally unique.

The "DVL" code is unique within FAA/Transport Canada control, and the other "DVL" is unique within EASA airspace.

There are pre-defined three-letter codes:

* https://en.wikipedia.org/wiki/IATA_airport_code

And pre-defined four-letter codes:

* https://en.wikipedia.org/wiki/ICAO_airport_code

There are also five-letter names for major route points:

* https://data.icao.int/icads/Product/View/98

* https://ruk.ca/content/icao-icard-and-5lnc-how-those-5-lette...

If there are duplicates there is a resolution process:

* https://www.icao.int/WACAF/Documents/Meetings/2014/ICARD/ICA...

shagie · 9 months ago
CGP Grey The Maddening Mess of Airport Codes! https://youtu.be/jfOUVYQnuhw

I'd rather deal with designing tables to properly represent names.

ryandrake · 9 months ago
... or there's a software engineer somewhere who simply assumed that three letter navaid identifiers were globally unique, and baked that assumption into the code.

I guess we now need a "Falsehoods Programmers Believe About Aviation Data" site :)

metaltyphoon · 9 months ago
Did aviation software for 7 years. This is 100% the first assumption new devs make about waypoints / navaids when they come in.
SoftTalker · 9 months ago
And this is why you always use surrogate keys and not natural keys. No matter how much you convince yourself that your natural key is unique and will never change, if a human created the value then a human can change the value or create duplicates, and eventually will.
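
A minimal Python illustration of the surrogate-key point (made-up schema, approximate coordinates): the stable key is a meaningless generated id, and the human-assigned code is just an attribute that is allowed to collide.

  from dataclasses import dataclass, field
  from itertools import count

  _next_id = count(1)

  @dataclass
  class Waypoint:
      code: str     # human-assigned, NOT unique: "DVL" exists twice
      region: str
      lat: float
      lon: float
      id: int = field(default_factory=lambda: next(_next_id))  # surrogate key

  waypoints = [
      Waypoint("DVL", "FAA", 48.1, -98.9),   # roughly Devils Lake, ND
      Waypoint("DVL", "EASA", 49.3, 0.1),    # roughly Deauville, France
  ]

  # Routes reference the surrogate id, so the two DVLs can never be confused.
  by_id = {wp.id: wp for wp in waypoints}
  route = [waypoints[1].id]
  print([(by_id[i].code, by_id[i].region) for i in route])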
MichaelZuo · 9 months ago
Or even more straightforward, just don’t believe anyone 100% knows what they are doing until they exhaustively list every assumption they are making.
em-bee · 9 months ago
or falsehoods programmers believe about global identifiers


Jtsummers · 9 months ago
There's been some prior discussion on this over the past year, here are a few I found (selected based on comment count, haven't re-read the discussions yet):

From the day of:

https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)

Discussions after:

https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)

https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)

https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)

perihelions · 9 months ago
There's also a much larger one,

https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)", 446 comments)

mstngl · 9 months ago
I remembered this extensive article immediately (only that I'd read it, not what it was or where to find it). Thanks for saving me from endlessly searching for it.
jmvoodoo · 9 months ago
So, essentially the system has a serious denial of service flaw. I wonder how many variations of flight plans can cause different but similar errors that also force a disconnect of primary and secondary systems.

Seems "reject individual flight plan" might be a better system response than "down hard to prevent corruption"

The assumption that a failure to interpret a plan must be a serious coding error seems to be the root cause, but it's hard to say for sure.

mjevans · 9 months ago
Rejecting the flight plan would be the last resort, but that is where the system should have gone once it was out of other options, rather than a total shutdown.

CORRECTING the flight plan (first promoting the exit/entry points for each autonomous region along the route, validating the entry/exit list on its own, and then the arcs within) would be the least error-prone method.

d1sxeyes · 9 months ago
You can’t just reject or correct the flight plan, you’re a consumer of the data. The flight plan was valid, it was the interpretation applied by the UK system which was incorrect and led to the failure.

There are a bunch of ways FPRSA-R can already interpret data like this correctly, but there was a combination of six specific criteria that hadn't been foreseen (e.g. the duplicate waypoints, the waypoints both being outside UK airspace, the exit from UK airspace being implicit in the plan as filed, etc.).

mcfedr · 9 months ago
Rejecting the plan surely should have come many places before shutting down the whole system!
convivialdingo · 9 months ago
I guarantee that piece of code has a comment like

  /* This should never happen */
  if (waypoints.matchcount > 2) {

crubier · 9 months ago
Possibly even just

    waypoint = waypointsMatches[0]
Without even mentioning that waypointsMatches might have multiple elements.

This is why I always consider [0] to be a code smell. It doesn't have a name afaik, but it should.
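
One way to make the smell explicit, sketched in Python (hypothetical helper, not the real code): make the caller say what zero or many matches should mean instead of silently taking element [0].

  def exactly_one(matches, context):
      """Refuse to guess: zero or many matches is an explicit, contained error."""
      if len(matches) == 1:
          return matches[0]
      raise LookupError("%s: expected exactly one match, got %d" % (context, len(matches)))

  # Usage: flag this one flight plan for review rather than grab the wrong DVL.
  dvl_matches = ["DVL (Devils Lake)", "DVL (Deauville)"]
  try:
      waypoint = exactly_one(dvl_matches, "waypoint lookup for DVL")
  except LookupError as err:
      print("reject this plan for manual review:", err)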

CaptainFever · 9 months ago
Silently ignoring conditions where there are multiple or zero elements?
gopher_space · 9 months ago
Race condition?
dx034 · 9 months ago
From the text it sounds like it looked up whether a code was in the flight plan and at which position it was in the plan. It never looked up two codes or assumed there could only be one; it was just comparing how the plan was filed.

I'm sure there'd be a better way to handle this, but it sounds to me like the system failed in a graceful way and acted as specified.

gitaarik · 9 months ago
Don't you mean > 1 ?
GnarfGnarf · 9 months ago
Funny airport call letters story: I once headed to Salt Lake City, UT (SLC) for a conference. My luggage was processed by a dyslexic baggage handler, who sent it to... SCL (Santiago, Chile).

I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.

watt · 9 months ago
Why not pop in to a shop, get another pair of pants?
GnarfGnarf · 9 months ago
Too cheap... :o)