A Matter of Millimeters: The story of Qantas flight 32

I don't know about others, but I can't help but smile when I read the detailed series of events in aviation postmortems. To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry. I say that sincerely since mistakes are going to happen and in my view robustness has less to do with the number of mistakes but how one responds to them.

Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.

And finally, the biggest of kudos to Kyra Dempsey, the writer. What an approachable article despite being (necessarily) heavy on the engineering content.

WalterBright · 2 years ago

As a former Boeing engineer, other industries can learn a great deal from how airplanes are designed. The Fukushima and Deepwater Horizon disasters were both "zipper" failures that showed little thought was given to "when X fails, then what?"

Note I wrote when X fails, not if X fails. It's a different way of thinking.

lloeki · 2 years ago

When I worked in an industrial context, some coding tasks would seem trivial to today's Joe Random software dev, but we had to be constantly thinking about failure modes: from degraded modes that would keep a plant 100% operative 100% of the time in spite of some component being down, to driving a 10 m high oven that could break airborne water molecules from mere ambient humidity into hydrogen, whose buildup could be dangerously explosive if some parameters were not kept in check. That meant the code/system had to have a number of contingency plans. "Sane default" suddenly has a very tangible meaning.

f1shy · 2 years ago

As an engineer I think a lot about tradeoffs of cost vs other criteria. There is little I can learn from the nuclear or aviation industry, as the cost structure is so completely different. I’m very happy that the costs of safety in aviation are very well accepted, but I understand that few people are willing to pay similar costs for other things like, say, cars.

arendtio · 2 years ago

In the context of disasters that happened due to software failures (e.g. Ariane 5 [1]), one of my professors used to tell us that software doesn't break at some point but is broken from the beginning.

I like the idea of thinking 'when' instead of 'if', but the verdict should be even harsher when it comes to software engineering because it has this rare material at its disposal, which doesn't degrade over time.

[1] https://en.wikipedia.org/wiki/Ariane_5#Notable_launches

WalterBright · 2 years ago

An example of zipper failure in the Airbus incident is when a wire bundle gets cut, all the functions of all the wires in that bundle are lost. Having two or more smaller bundles physically separated would greatly reduce that risk. Certainly, having the primary and the backup system in the same bundle is a bad idea.

On the 757, one set of control cables runs under the floor. The backup set runs in the ceiling.

laydn · 2 years ago

What's fascinating about airplane design for me is not the huge technical complexity, but rather, the way it is designed such that a lot of its subsystems are serviceable by technicians so quickly and reliably, not just in a fully controlled environment like a maintenance hangar, but right on the tarmac, waiting for takeoff.

cedivad · 2 years ago

> When my AoA sensor fails, then what?

crickets, let's just randomise which sensor we use during boot, that ought to do it!

asystole · 2 years ago

I agree in principle, but I don't think industries should be looking at current-day Boeing's engineering practices except for an example of how a proud company's culture can rot from the inside out with fatal consequences.

oefnak · 2 years ago

Are you serious in saying that other industries could learn from Boeing?

sylens · 2 years ago

I think many of us are so used to working with software, with its constant need for adaptation and modification in order to meet an ever growing list of integration requirements, that we forget the benefits of working with a finalized spec with known constants like melting points, air pressure, and gravity.

abid786 · 2 years ago

Completely agree - I think it can go one of two ways. Software is more malleable than airplanes are and that also comes with downsides (like how much time and effort it takes to bring a new plane to the market)

WalterBright · 2 years ago

Airliners face constantly changing specifications. No two airliners are built the same.

mzi · 2 years ago

It took hundreds of subject experts from ten organizations in seven countries almost three years to reach that conclusion.

Here at HN we want a post mortem for a cloud failure in a matter of hours.

modernpacifist · 2 years ago

> Here at HN we want a post mortem for a cloud failure in a matter of hours.

I'll go one further - I've yet to finish writing a postmortem on one incident before the next one happens. I also have my doubts that folks wanting a PM in O(hours) actually care about its contents/findings/remediations - it's just a tick box in the process of day-to-day ops.

thaumasiotes · 2 years ago

Something similar that struck me was that, in early February, Russia invaded Ukraine.

And then, I saw an endless stream of aggrieved comments from people who were personally outraged that the outcome, whatever it might be, hadn't been finalized yet at the late, late date of... late February.

mlrtime · 2 years ago

I work at a mid-tier FAANG; our SLA for post mortems is in the 7-14 day range. Nobody seriously wants a full PM in hours.

They may want a mitigation or RCA in hours, but even AWS gives us NDA restricted PMs in > 24 hours.

bitcharmer · 2 years ago

Apples to oranges

crabmusket · 2 years ago

> To be able to zero in on what turned out to be a single faulty part and then trace the entire provenance and environment that led to that defective part entering service speaks to the robustness of the industry.

And to be able to reconstruct the chain of events after the components in question have exploded and been scattered throughout south-east Asia is incredible.

Gare · 2 years ago

My impression was that the defective part was still inside the engine when it landed.

nextos · 2 years ago

Aviation is great because the industry learns so much after incidents and accidents. There is a culture of trying to improve, rather than merely seeking culprits.

However, I have been told by an insider that supply chain integrity is an underappreciated issue. Someone has been caught selling fake plane parts through an elaborate scheme, and there are other suspicious suppliers, which is a bit unsettling:

"Safran confirmed the fraudulent documentation, launching an investigation that found thousands of parts across at least 126 CFM56 engines were sold without a legitimate airworthiness certificate."

https://www.businessinsider.com/scammer-fooled-us-airlines-b...

EdwardDiego · 2 years ago

Admiral Cloudberg has covered a case where counterfeit or EOL-but-with-new-paperwork components were involved in a crash.

https://admiralcloudberg.medium.com/riven-by-deceit-the-cras...

inglor_cz · 2 years ago

I suspect this is precisely what is happening in Russian civil aviation now. No legit parts supplied, so there will be a lot of fake/problematic parts imported through black channels.

bambax · 2 years ago

The Checklist Manifesto (2009) is a great short book that shows how using simple checklists would help immensely in many different industries, esp. in medical (the author is a surgeon).

Checklists of course are not the same as detailed post-mortems but they belong to the same way of thinking. And they would cost pretty much nothing to implement.

Also CRM: it's very important to have a culture where underlings feel they can speak up when something doesn't look right -- or when a checklist item is overlooked, for that matter.

sgarland · 2 years ago

Yes, but they do have one critical failure mode: that the checklist failed to account for something (or that an expected reaction to a step being performed didn’t occur).

I was a submarine nuclear reactor operator, and one of my Commanding Officers once ordered that we stop using checklists during routine operations for precisely this reason. Instead, we had to fully read and parse the source documentation for every step. Before, while we of course had them open, they served as more of a backstop.

His argument – which I to some extent agree with – was that by reading the source documentation every time, we would better engage our critical thinking and assess plant conditions, rather than skimming a simplified version. To be clear, the checklists had been generated and approved by our Engineering Officer, but they were still simplifications.

jacquesm · 2 years ago

Checklists are great if you use them properly: to make sure you remember. Checklists are dangerous when they are used improperly: to replace or shut down critical thinking.

Simon_ORourke · 2 years ago

A colleague of mine came from a major aviation design company before joining tech and said they were in a state of culture shock at how critical systems were designed and monitored. Even if there are no hard real time requirements for a billing system, this guy was surprised at just how lax tech design patterns tended to be.

Horffupolde · 2 years ago

If 200 people died after a db instance crashed, software would be equal in that regard.

girvo · 2 years ago

To prove this, software that deals with medical stuff is somewhat more like aviation.

mlrtime · 2 years ago

Likewise, in "aviation" when the entertainment system completely fails in a 4 hour flight, there is most likely no post mortem at all. They turn it off/on again just like most of us.

mewpmewp2 · 2 years ago

Some people who think this is ideal for any sort of software tech sound like they would also want a 3 hour post mortem with whoever designed the rooms, after slightly stubbing a toe.

blauditore · 2 years ago

This kind of makes sense, but it is only possible because of public pressure/interest. Many people are irrationally emotional about flying (fear, excitement etc.), that's why articles and documentaries like this post are so popular.

On a side note, that's also why there's all the nonsense security theater at airports.

jstanley · 2 years ago

> robustness has less to do with the number of mistakes but how one responds to them

It must have something to do with the number of mistakes, otherwise it's all a waste of time!

It's all well and good responding to mistakes as thoroughly as possible, but if it's not reducing the number of mistakes, what's it all for?

krisoft · 2 years ago

> It must have something to do with the number of mistakes, otherwise it's all a waste of time!

Not really. Imagine two systems with the same amount of mistakes. (Here the mistakes can be either bugs, or operator mistakes.)

One is designed such that every mistake brings the whole system down for a day with millions of dollars of lost revenue each time.

The other is designed such that when a mistake happens it is caught early, and when it is not caught it only impacts some limited parts of the system and recovering from the mistake is fast and reliable.

They both have the same amount of mistakes, yet one of these two systems is vastly more reliable.

> if it's not reducing the number of mistakes, what's it all for

For reducing their impact.
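
A rough sketch of that comparison in Python. All numbers here (mistakes per year, the chance a mistake goes uncaught, the fraction of the system affected, recovery time) are made up for illustration and not taken from any real system:

```python
def expected_impact(mistakes, p_uncaught, fraction_affected, hours_to_recover):
    """Expected full-system-equivalent outage hours for a given number of mistakes."""
    return mistakes * p_uncaught * fraction_affected * hours_to_recover


MISTAKES_PER_YEAR = 20  # identical for both systems (hypothetical figure)

# System A: every mistake takes the whole system down for a day.
fragile = expected_impact(
    MISTAKES_PER_YEAR, p_uncaught=1.0, fraction_affected=1.0, hours_to_recover=24
)

# System B: most mistakes are caught early; the rest hit a small part
# of the system and are recovered from quickly.
contained = expected_impact(
    MISTAKES_PER_YEAR, p_uncaught=0.25, fraction_affected=0.125, hours_to_recover=2
)

print(f"fragile:   {fragile} outage-hours/year")
print(f"contained: {contained} outage-hours/year")
```

Same mistake count in both cases; only detection, blast radius, and recovery time differ.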

colechristensen · 2 years ago

Aerospace things have to be like this or they just wouldn’t work at all. There are just too many points of failure and redundancy is capped by physics. When there’s a million things which if they went wrong could cause catastrophic failure, you have to be really good at learning how to not make mistakes.

WalterBright · 2 years ago

> you have to be really good at learning how to not make mistakes.

Not exactly. The idea is not not making mistakes, it's whatcha gonna do about X when (not if) it fails.

mewpmewp2 · 2 years ago

> Being an SRE at a FAANG and generally spending a lot of my life dealing with reliability, I am consistently in awe of the aviation industry. I can only hope (and do my small contribution) that the software/tech industry can one day be an equal in this regard.

There's a slight difference in terms of what kind of damage an airplane malfunctioning causes compared to a button on an e-commerce shop rendering improperly for one of the browsers. My point is that the level of investment in reliability and process should be proportional to the potential damage of any incidents.

solids · 2 years ago

I agree, and also I enjoy the attitude. While in my profession the postmortem's goal is finding who to blame, here the attitude is towards preventing it from happening again, no matter what. Or at least that’s how I feel.

mewpmewp2 · 2 years ago

Your profession? Or you mean your company? Unless it's a very specific profession I would not know, it would usually imply that the company is dysfunctional.

bomewish · 2 years ago

Richard Hipp talks a lot about how SQLite adopted testing procedures directly from aviation.

switch007 · 2 years ago

> I can only hope that the software/tech industry can one day be an equal in this regard

I’d love to be an engineer with unlimited time budget to worry about “when, not if, X happens” (to quote a sibling comment).

But people don’t tend to die when we mess up, so we don’t get that budget.

akarve · 2 years ago

Hard agree. Civil & mechanical engineering have a culture and history of blameless analysis of failure. Software engineering could learn from them.

See the excellent To Engineer is Human on just this topic of analyzed failures in civil engineering.

The article is complex and well written, but I am a bit perplexed by the victorious tone and never-ending praise of safety. It resembles a sales pitch a bit too much, even though no one is selling anything. Maybe it's unintentional, and being around salesmen just does that to people.

If you are like me, you've probably said “hmm…” to yourself multiple times when certain things were mentioned, because those were things that actually didn't work (that they were left intact really boosts the credibility of the author). From calculation software that had never ever been tested with out-of-ordinary data to the computer keeping the broken engine running. From pure luck with fuel tanks being almost full and unable to explode to absence of any physical kill switch to stop the engine. An hour being generously available to go through ALL the checklists to clear the notifications. An hour of passengers and crew staying on top of a puddle of fuel hoping that nothing would ignite it. Finally, pure randomness in debris flying the way it did. It's not a story of “layers of safety” overlapping, it's a story of “layers of randomness” overlapping.

What would be really interesting is a distribution of outcomes for all possible trajectories of debris, i. e., how (un)lucky they actually were. I guess corporations don't release models like those to the public.

Also, that special chamber for oil filter requiring precise drilling of a perfectly fine pipe seems “ewww” to me. It is not serviceable anyway without reinstalling everything from scratch, as far as I understand, why not make it a single piece?

Game_Ender · 2 years ago

The author is positive because of all the safety layers that existed and stayed intact, despite how flawed humans and companies are. The culture of looking at previous accidents like UA232, where they lost an engine and ALL controls with it, meant the A380 control system was engineered to take even more damage, and it worked.

I do agree though it did not spend enough effort focusing on the areas to improve:

- A computer controlled engine that runs for 60 seconds while on fire, and lets a dangerous part spin too fast. It seems like something that should have been covered ahead of time.

- An engine manufacturing process that is so complex it’s almost impossible to validate.

- A fault management system that only shows you 1 or 2 at a time when you have 40.

genocidicbunny · 2 years ago

> - A fault management system that only shows you 1 or 2 at a time when you have 40.

As long as the system prioritizes the warnings/cautions with the most pressing ones shown first, this is a very good thing. In a high-stress situation, you don't want the pilots to have to deal with figuring out which of the 40 warnings need to be taken care of first.
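
A toy sketch of that idea in Python, not a description of how the real ECAM works. The `Fault` type, the fault messages, and the "lower number means more urgent" convention are all assumptions for illustration:

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Fault:
    priority: int                        # lower = more urgent (assumed convention)
    message: str = field(compare=False)  # not used when ordering faults


def next_to_display(faults, limit=2):
    """Return only the `limit` most urgent faults, most urgent first."""
    return heapq.nsmallest(limit, faults)


# Hypothetical active faults, in the order they happened to be raised.
active = [
    Fault(4, "CABIN TEMP HIGH"),
    Fault(1, "ENGINE FIRE"),
    Fault(3, "FUEL IMBALANCE"),
    Fault(2, "HYDRAULIC PRESSURE LOW"),
]

for fault in next_to_display(active):
    print(fault.priority, fault.message)
```

The crew sees the two most pressing items; the rest stay queued until those are dealt with.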

mixdup · 2 years ago

I suspect the ECAM only showing a couple of failures at a time is a design feature, not a flaw, to prevent overwhelming the crew as they work through them

benhurmarcel · 2 years ago

> the computer keeping the broken engine running

That’s on purpose, you don’t want an automation decide such a drastic move as shutting down an engine. That’s the pilot’s decision.

> absence of any physical kill switch to stop the engine

There is, you shut down the fuel flow with a valve. But that “kill switch” was damaged.

> An hour being generously available to go through ALL the checklists to clear the notifications

Again, pilot decision to do it if time is available. Isn’t it safer that way?

> pure randomness in debris flying the way it did

Well that’s the nature of the failure. It’s like complaining that which HDD fails in a datacenter is random.

> outcomes for all possible trajectories of debris,

Yes, it’s not public data, but all possible trajectories are analyzed at the design stage, and structural and systems components are kept segregated accordingly.

ogurechny · 2 years ago

I'm not an idiot (citation needed). I can see that a storm unplugging some imaginary tiny heartbeat cable, which in turn shuts down all the engines instantly, is not how planes should operate. What I don't understand is the approach to defend the status quo, and pretend that “randomness is now conquered”.

It seems to me that fixing one complex problem creates 10 other complex problems. They can be rare, but it's ignorant to shift focus from them.

otherme123 · 2 years ago

I've read dozens of Admiral Cloudberg articles, and when you do so you notice a pattern: in old aviation crashes, a single error or a single part failure usually took down a plane with tens of dead bodies. There is also the story of how and why the sterile flight deck rule started, in response to some crashes where the pilots were distracted by talking. In modern aviation accidents, that seems very unlikely. Even with an engine exploding and the pieces ripping through half the cables, a wing, the fuel reservoir, and the hydraulics, the airplane is still almost perfectly flyable and landable. Do the same to any car, where nothing is redundant, and let's see how well it performs.

The beauty of it is that everyone in aviation seems eager to learn and build on errors. This event prompted new actions that make future flying even safer, despite having no victims.

ogurechny · 2 years ago

That's the problem. Even if there were victims, one could've written the exact same article about “flying even safer”.

jnsaff2 · 2 years ago

The victorious tone comes in my opinion (though I'm projecting a bit) from this graph[0].

There has been very systematic and deliberate effort to better aviation safety DESPITE commercial pressures.

The swiss cheese means that there are many more layers of randomness that have to line up. Many of those layers came from previous accidents. Those layers are not random at all. Also none of those layers are hole free.

If that disk had disintegrated differently a potentially different set of layers would have applied. Would it have meant fatalities? Possibly. Would it have instantly blown up the plane? We don't know.

But it is pretty obvious that had many of those layers not existed then the chances of a much more disastrous outcome would have been much higher.

[0] https://upload.wikimedia.org/wikipedia/commons/e/ef/Fataliti...
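
To put rough numbers on the swiss cheese picture, here is a small Python sketch. The per-layer probabilities are invented, and the multiplication only holds if the layers fail independently, which a single spray of debris hitting several systems at once does not guarantee:

```python
from math import prod


def p_all_layers_fail(hole_probabilities):
    """Chance that the holes in every layer line up, ASSUMING independent layers."""
    return prod(hole_probabilities)


# Hypothetical chance that each layer fails to stop a given fault.
layers = [0.5, 0.25, 0.125]

print(p_all_layers_fail(layers[:1]))  # one layer only
print(p_all_layers_fail(layers))      # all three layers stacked
```

No layer is hole-free, yet each added layer shrinks the chance that a fault passes through all of them.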

angry_octet · 2 years ago

And on other aviation systems we do examine multiple failure modes. For example, a round going through the fuselage of an Apache, tumbling and smashing and causing spalling, thousands of simulated trials. Then coupled physics models that look at dozens of unintended interactions, avgas squirting out onto electronics, hot manifolds, etc.

There's a whole field of Fault Tree Analysis that looks at how adjacent faults can propagate into unrelated components, then Event Tree Analysis to determine what will happen next. Models that assess robustness against failures even when we have no idea how the failure will occur.

Reliability of cyber physical systems is a constantly evolving field, lots of recent work on concepts like probabilistic model checking, ML for anomaly detection, resistance to cyber attacks, and so on.

ogurechny · 2 years ago

There is more than one way to interpret this history of “triumph of technology and human mind”, yada yada.

This flight can be seen as an expensive (thrilling, entertaining, newsworthy, etc.) experiment on live subjects whose outcome was not controlled by existing tools and procedures.

The same for everything before to which it is compared so lightheartedly.

Please don't forget that your image shows a giant graveyard.

matheusmoreira · 2 years ago

That this plane was maneuverable despite a massive engine explosion that took out 65% of its roll control surfaces is absolutely a victory of the engineers of that aircraft. I was shocked when I read that.

Sheer dumb luck was certainly involved. Those discs could have cleaved the plane in half to say nothing of the humans in its way but somehow missed most of the plane entirely. We definitely need to count every single one of those blessings. It's hard not to be positive when such an episode ended with zero fatalities, zero injuries even.

nojs · 2 years ago

To me it’s impressive because presumably shards of debris cutting through so many distinct parts of the plane at the same time like this is a rare thing compared to more localized failures which the plane would be designed for. Yet all the different failsafes still worked enough to get the plane safely to the ground.

mlrtime · 2 years ago

It is very common and encouraged to add a "What went well" in post mortems. This is not a pat yourself on the back moment. It is to reflect on what failed and what didn't.

Neil44 · 2 years ago

I guess it's a glass half full type situation. There's a lot of universes where that plane did not make it back and a lot of decisions aligned to ensure that it did.

caf · 2 years ago

They do have multiple kill switches to stop the engines, up to dumping a bunch of flame retardant into it which makes it impossible to restart. The problem was that all these systems for the #1 engine were rendered inoperable by the damage caused by the failure of the #2 engine.

Certainly there was a fair bit of luck involved as well.