My workplace has a number of people reporting Windows blue-screening and going into a boot loop. The IT department has had a number of servers go offline recently and says there's a chance the two issues are related, potentially due to a CrowdStrike application update.
My laptop blue-screened and rebooted, but is working fine after the reboot.
A local radio station has also said they've got the same issues with their laptops and their phone system is down as a result.
Not seeing anything on news sites yet. Anyone else seeing similar?
All of the above is based in Australia.
CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or access files they shouldn't be (using some drunk-ass heuristics).
What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.
This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot looping with blue screens, which in the cloud is not something you can fix by just hitting F8 and removing the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.
This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.
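Roughly, in boto3 terms, each iteration of that recovery loop looks something like the sketch below. The instance IDs, device names, and the exact driver path are placeholders, not our actual environment:

    import boto3

    ec2 = boto3.client("ec2")

    BROKEN = "i-0123456789abcdef0"   # boot-looping node (placeholder)
    RESCUE = "i-0fedcba9876543210"   # healthy node used for the repair (placeholder)

    # 1. Force the broken instance off so its root volume can be detached.
    ec2.stop_instances(InstanceIds=[BROKEN], Force=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[BROKEN])

    # 2. Find and detach its root EBS volume.
    inst = ec2.describe_instances(InstanceIds=[BROKEN])["Reservations"][0]["Instances"][0]
    vol = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
               if m["DeviceName"] == inst["RootDeviceName"])
    ec2.detach_volume(VolumeId=vol)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol])

    # 3. Attach it to the working node as a secondary disk; from there, mount it and
    #    delete the offending CrowdStrike .sys file under Windows\System32\drivers\ by hand.
    ec2.attach_volume(VolumeId=vol, InstanceId=RESCUE, Device="/dev/sdf")
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol])

    # 4. Once the file is gone: detach, reattach as the boot volume, start the node again.
    ec2.detach_volume(VolumeId=vol)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol])
    ec2.attach_volume(VolumeId=vol, InstanceId=BROKEN, Device="/dev/sda1")
    ec2.start_instances(InstanceIds=[BROKEN])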
I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.
Edit: to all the people moaning about Windows, we've had no problems with Windows. This is not a Windows issue. This is a third-party security vendor shitting in the kernel.
I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.
And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.
Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.
WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?
Is AWS any better?
In OCI we made a decision years ago that after 15 minutes from sending an ACPI shutdown signal, the instance should be hard powered off. We do the same for VM or BM. If you really want to, we take an optional parameter on the shutdown and reboot commands to bypass this and do an immediate hard power off.
So worst case scenario here, 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
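In SDK terms, that's roughly the difference between a soft and a hard power action. A minimal sketch with the oci Python SDK; the OCID is a placeholder:

    import oci

    config = oci.config.from_file()                  # ~/.oci/config
    compute = oci.core.ComputeClient(config)

    STUCK_INSTANCE = "ocid1.instance.oc1..example"   # placeholder OCID

    # SOFTSTOP sends the ACPI shutdown signal; if the guest never shuts down cleanly,
    # the platform hard powers it off after the 15-minute grace period described above.
    compute.instance_action(STUCK_INSTANCE, "SOFTSTOP")

    # If you can't wait, STOP skips the grace period and hard powers off immediately,
    # after which the boot volume can be detached and attached to another instance.
    # compute.instance_action(STUCK_INSTANCE, "STOP")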
As for "throwing away local SSD", that only happens on AWS with instance store volumes which used to be called ephemeral volumes as the storage was directly attached to the host you were running on and if you did a stop/start of an ebs-backed instance, you were likely to get sent to a different host (vs. a restart API call, which would make an ACPI soft command and after a duration...I think it was 5 minutes, iirc, the hypervisor would kill the instance and restart it on the same host).
When the instance would get sent to a different host, it would get different instance storage and the old instance storage would be wiped from the previous host and you'd be provisioned new instance storage on the new host.
However, EBS volumes travel from host to host across stop/start cycles. They're attached with very low latency across the network from EBS servers and presented as a local block device to the instance. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning + the correct instance size OR through a large enough drive + a large enough instance to maximize the connection to EBS (there's a table detailing IOPS, throughput, and instance size in the docs).
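The reboot-versus-stop/start distinction is visible in the API itself. A hedged boto3 illustration; the instance ID is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")
    instance = "i-0123456789abcdef0"   # placeholder

    # Reboot: an ACPI soft restart on the SAME host, so instance store volumes survive.
    ec2.reboot_instances(InstanceIds=[instance])

    # Stop/start: the instance may come back on a DIFFERENT host; instance store is wiped,
    # while EBS volumes follow the instance over the network to wherever it lands.
    ec2.stop_instances(InstanceIds=[instance])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance])
    ec2.start_instances(InstanceIds=[instance])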
Also, support can detach the volume as well if the instance is stuck shutting down and doesn't get manually shut down by the API after a timeout.
None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.
The key is...lots of the things you talk about are do-able at small scale, but when you add more and more operations and complexity to the tool stack on interacting with systems, you add a lot of back-end network overhead, which leads to extreme congestion, even in very high speed networks (it's an exponential scaling problem).
The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.
It's obviously easier said than done, and most shops on some level still think about VMs/instances as pets rather than cattle, or have hurdles that make treating them as cattle much more challenging. But manual recovery in the cloud should, in general, just be avoided in favor of spinning up something new and re-deploying to it.
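As a sketch of that pattern in boto3 (AMI names, instance types, and IDs are placeholders, not a recommendation for any particular shop):

    import boto3
    from datetime import datetime, timezone

    ec2 = boto3.client("ec2")

    def nightly_backup(instance_id: str) -> str:
        """Bake the instance into an AMI, which snapshots its EBS volumes off-host."""
        image = ec2.create_image(
            InstanceId=instance_id,
            Name=f"backup-{instance_id}-{datetime.now(timezone.utc):%Y%m%d-%H%M}",
            NoReboot=True,                       # don't disturb the running workload
        )
        return image["ImageId"]

    def replace(broken_instance_id: str, last_good_ami: str) -> str:
        """Cattle, not pets: launch a fresh node from the last good image, kill the broken one."""
        new = ec2.run_instances(ImageId=last_good_ami, InstanceType="m5.large",
                                MinCount=1, MaxCount=1)
        ec2.terminate_instances(InstanceIds=[broken_instance_id])
        return new["Instances"][0]["InstanceId"]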
It took a bit to figure out with some customers, but we provide optional VNC access to instances at OCI, and with VNC the trick seems to be to hit esc and then F8, at the right stage in the boot process. Timing seems to be the devil in the details there, though. Getting that timing right is frustrating. People seem to be developing a knack for it though.
Interesting..
> We have to literally take each node down, attach the disk to a working node..
Probably the easiest solution for you is to go back in time to a previous scheduled snapshot, if you have that set up already.
I really want our cages, C7000's and VMware back at this point.
Honest question, I've seen comments in these various threads about people having similar issues (from a few months/weeks back) with kernel extension based deployments of CrowdStrike on Debian/Ubuntu systems.
I haven't seen anything similar regarding Mac OS, which no longer allows kernel extensions.
Is Mac OS not impacted by these kinds of issues with CrowdStrike's product, or have we just not heard about it due to the small scale?
Personally, I think it's a shared-responsibility issue. MS should build a product that is "open to extension but closed for modification".
> they pissed over everyone's staging and rules and just pushed this to production.
I am guessing that act alone is going to create a massive liability for CrowdStrike over this issue. You've made other comments that your organization is actively removing CrowdStrike. I'm curious how this plays out. Did CrowdStrike just SolarWind themselves? Will we see their CISO/CTO/CEO do time? This is just the first part of this saga.
There is no shared responsibility. CrowdStrike pushed a broken driver out, then triggered the breakage, overriding customer requirement and configuration for staging. It is a faulty product with no viable security controls or testing.
"Although Channel Files end with the SYS extension, they are not kernel drivers."
https://www.crowdstrike.com/blog/technical-details-on-todays...
Which is horrible!
I miss my AS/400.
This might be a decent place to recount the experience I had when interviewing for office security architect in 2003. My background is mainframe VM system design and large-system risk management modeling, which I had been doing since the late 80s at IBM, DEC, then Digital Switch and Bell Canada. My resume was pretty decent at the time. I don't like Python and I tell VP/Engs they have a problem when they can't identify benefits from JIRA/SCRUM, so I don't get a lot of job offers these days. Just a crusty greybeard bitching...
But anyway... so I'm up in Redmond and I have a decent couple of interviews with people, and then the 3rd most senior dev in all of MSFT comes in and asks "how's your QA skills?" and I start to answer about how QA and Safety/Security/Risk Management are different things: QA is about ensuring the code does what it's supposed to; software security, et al., is about making sure the code doesn't do what it's not supposed to, and the philosophic sticky wicket you enter when trying to prove a negative (worth a Google deep dive if you're unfamiliar). Dude cuts me off and says "meh. security is stupid. in a month, Bill will end this stupid security stand down and we'll get back to writing code and I need to put you somewhere and I figured QA is the right place."
When I hear that MSFT has systems that expose inadequate risk management abstractions, I think of the culture that promoted that guy to his senior position... I'm sure he was a capable engineer, but the culture in Redmond discounts the business benefits of risk management (to the point they outsource critical system infrastructure to third parties) because senior engineers don't want to be bothered to learn new tricks.
Culture eats strategy for breakfast, and MSFT has been fed on a cultural diet of junk food for almost half a century. At least from the perspective of doing business in the modern world.
Sure, but Windows shares some portion of the blame for allowing third-party security vendors to “shit in the kernel”.
Compare to macOS which has banned third-party kernel extensions on Apple Silicon. Things that once ran as kernel extensions, including CrowdStrike, now run in userspace as “system extensions”.
https://arstechnica.com/information-technology/2006/10/7998/
I'm a reporter with Bloomberg News covering cybersecurity. I'm trying to learn more about this CrowdStrike update potentially bypassing staging rules and would love to hear about your experience. Would you be open to a conversation?
I'm reachable by email at jbleiberg2@bloomberg.net or on Signal at JakeBleiberg.24. Here's my Bloomberg author page: https://www.bloomberg.com/authors/AWuCZUVX-Pc/jake-bleiberg.
Thank you.
Jake
MS don't have testers any more. Where do you think CS learned their radically effective test-in-prod approach?
They shit all over our controls and went to production.
This says we don't control it and should not trust it. It is being removed.
We were discharged at midnight by the doctor, but the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.
That's an extra 4 hours of emergency room fees you ideally wouldn't have to pay for.
It makes my blood boil to be honest that there is no liability for what software has become. It's just not acceptable.
Companies that produce software with the level of access that Crowdstrike has (for all practical purposes a remote root exploit vector) must be liable for the damages that this access can cause.
This would radically change how much attention they pay to quality control. Today they can just YOLO-push barely tested code that bricks large parts of the economy and face no consequences. (Oh, I'm sure there will be some congress testimony and associated circus, but they will not ever pay for the damages they caused today.)
If a person caused the level and quantity of damage Crowdstrike caused today they would be in jail for life. But a company like Crowdstrike will merrily go on doing more damage without paying any consequence.
What about companies that deploy software with the level of quality that Crowdstrike has? Or Microsoft 365 for that matter.
That seems to be the bigger issue here; after all Crowdstrike probably says it is not suitable for any critical systems in their terms of use. You shouldn't be able to just decide to deploy anything not running away fast enough on critical infrastructure.
On the other hand, Crowdstrike Falcon Sensor might be totally suitable for a non-critical systems, say entertainment systems like the Xbox One.
Local emergency services were basically nonfunctional for the better part of the day; combined with the heat wave and various events, it seems like a number of deaths (locally at least, specific to what I know for my mid-sized US city) will be indirectly attributable to this.
And even worse, possibly quite a few deaths as well.
I hope (although I will not be holding my breath) that this is the wake-up call we need to realise that we cannot have so much of our critical infrastructure rely on the bloated OS of a company known for its buggy, privacy-intruding, crapware-riddled software.
I'm old enough to remember the infamous blue-screen-of-death Windows 98 presentation. Bugs exist, but that was hardly a glowing endorsement of high-quality software. That was long ago, yet it is nigh on impossible to believe that the internal company culture has drastically improved since then, with regular high-profile screw-ups reminding us of what is hiding under the thin veneer of corporate respectability.
Our emergency systems don't need windows, our telephone systems don't need windows, our flight management systems don't need windows, our shop equipment systems don't need windows, our HVAC systems don't need windows, and the list goes on, and on, and on.
Specialized, high-quality OSes with low attack surfaces are what we need to run our systems. Not a generic OS stuffed with legacy code from a time when those applications were not even envisaged.
Keep it simple, stupid (KISS) is what we need to go back to; our lives literally depend on it.
With the multi-billion-dollar screw-up that happened yesterday, and an as-of-yet unknown number of deaths, it's impossible to argue that the funds are unavailable to develop such systems. Plurality is what we need, built on top of strong standards for compatibility and interoperability.
Perhaps rather than an indictment on Windows, this is a call to re-evaluate microkernels, at least for critical systems and infrastructure.
What does this mean? Did the power go down? Is all the equipment connected? Or is it that the insurance software can't run, so nothing gets done? Maybe you can't access patient files anymore, but is that taking down the whole thing?
The fact that she was discharged without an overnight admit suggests to me that the MRI did not show a stroke, or perhaps she was outside the treatment window when she went to the hospital.
This should be the standard for any life sustaining or surgical systems, and any critical weapons systems.
I can't believe they pushed updates to 100% of Windows machines and somehow didn't notice a reboot loop. Epic gross negligence. Are their employees really this incompetent? It's unbelievable.
I wonder where MSFT and Crowdstrike are most vulnerable to lawsuits?
Everything about it reeks of incompetence and gross negligence.
It’s the old story of the user and the purchaser being different parties: the software only needs to be good enough to be sold to third parties who never need to use it.
It’s a half-baked rootkit part of performative cyberdefence theatrics.
I am LMFAO at the entire situation. Somewhere, George Carlin is smiling.
This is the result of giving away US jobs overseas at 1/10th the salary
“Loive from NPR news in Washington“
Usually when I write this, devs get all defensive and ask me what the worst thing is that could happen... I don't know... Could you guarantee it doesn't involve people dying?
Dear colleagues, software is great because one person's work multiplies. But it is also a damn fucking huge responsibility to ensure you are not inserting bullshit into the multiplication.
If we can at least get that basis, then we can start to define more things, such as jobs that non-engineers cannot legally do, and legal ramifications for things such as software bugs. If someone could lose their professional license and potentially their career over shipping a large enough bug, suddenly the problem of having 25,000 npm dependencies and continuous deployment breaking things at any moment will magically cease to exist quite quickly.
I hope organisations start revisiting some of these insane decisions.
They ended up giving MS a substantial amount of money to extend support for their use case for some number of years. I can't remember the number he told me but it was extremely large.
Just a few weeks ago I had an OpenBSD box render itself completely unbootable after nothing more than a routine clean shutdown. Turns out their paranoid-idiotic "we re-link the kernel on every boot" coupled with their house-of-cards file system corrupted the kernel, then overwrote the backup copy when I booted from emergency media - which doesn't create device nodes by default so can't even mount the internal disks without more cryptic commands.
Give me the Windows box, please.
Why would Windows systems be anywhere near critical infra ?
Heart attacks and 911 are not things you build with Windows based systems.
We understood this 25 years ago.
This is just a guess, but maybe the client machines are windows. So maybe there are servers connected to phone lines or medical equipment, but the doctors and EMS are looking at the data on windows machines.
Maybe Heartbleed or the xz Utils debacles convinced them to switch.
Good luck teaching administrators an entirely new ecosystem; good luck finding off-the-shelf software for Linux.
Bespoke is expensive, expertise is rare, Linux is sadly niche.
Why would computers be anywhere near critical infra? This sounds like something that should fail safe: the control system goes down, but the thing keeps running. If power goes down, hospitals have generator backups; it seems weird that computers would not be in the same situation.
I mean, if the problem is that hospitals can't function anymore, money is hardly the biggest problem
Not questioning that it happened, but this was a boot loop after a content update. So if the computers were off and didn't get the update, and you booted them, they would be fine. And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.
How did it happen that you were rebooting in the middle of treating a heart attack? [Edit: BSOD -> auto reboot]
> And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.
Windows has been notorious for forcing updates down your throat, and rebooting at the least appropriate moments (like during time-sensitive presentations, because that's when you stepped away from the keyboard for 5 minutes to set up the projector). And that's in private setting. Corporate setting, the IT department is likely setting up even more aggressive and less workaround-able reboot schedule.
Things like this are exactly why people hate auto-updates.
I can see very well how one computer could have screwed all others. It's really not hard to imagine.
- lifts won't operate.
- can't disarm the building alarms. (they have been blaring nonstop...)
- cranes are all locked in standby/return/err.
- laser aligners are all offline.
- lathe hardware runs but the controllers are all down.
- can't email suppliers.
- phones are all down.
- HVAC is also down for some reason (it's getting hot in here).
the police drove by and told us to close up for the day since we don't have 911 either.
alarms for the building are all offline/error, so we chained things as best we could (might drive by a few times today).
we don't know how many orders we have, we don't even know who's on schedule or if we will get paid.
Are they somehow controlled remotely? Or do they need to ping a central server to be able to operate?
I can see how alarms, email and phones are affected but the heavy machinery?
(Clearly not familiar with any of these things so I am genuinely curious)
I used to build sorting machines that use variants of the typical “industrial” tech stack, and the actual controllers are rarely (but not never!) Windows. But it’s common for the HMI to be a Windows box connected into the rest of the network, as well as any server.
"Appalled", "bewildered" and "horrified" and also comes to mind..
Holy cow...
Who on earth requires a Windows-based backend (or whatever else had CrowdStrike, in the shop or outside) for regular (VoIP) phone calls?
This should really lead to some learnings for anyone providing any kind of phone infrastructure.
The next move should be some artisanal, as-mechanical-as-possible quality products, or at least Linux(TM)-certified products or similar (or Windows-free(TM)). The opportunity is here: everybody noticed this clusterfuck, and smart folks don't like ignoring threats that are in their face.
But I suppose in 2 weeks some other bombastic news will roll over this and most will forget. But there is always some hope.
Outage aside, do you feel safe using it while knowing that it accepts updates based on the whims of far away people that you don't know?
I can't even imagine how much worse ransomware would be if, for example, Windows and browsers weren't updating themselves.
BitLocker is a storage driver, so that code turned into a circular dependency: the attempt to page in the code resulted in a call to that not-yet-paged-in code.
The reason I didn't catch it with local testing was because I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. For everyone on the team that did have BitLocker enabled they got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
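A toy sketch of that kind of ring-based rollout, just to make the gating concrete. The ring names, sizes, and health check are made up, not the parent's actual tooling:

    from typing import Callable

    Ring = tuple[str, list[str]]   # (ring name, device IDs)

    def staged_rollout(rings: list[Ring],
                       push: Callable[[list[str]], None],
                       healthy: Callable[[list[str]], bool]) -> str:
        """Deploy ring by ring, halting at the first ring that reports problems."""
        for name, devices in rings:
            push(devices)
            if not healthy(devices):
                return f"halted at ring '{name}'"   # the wider fleet never sees the bad update
        return "rolled out to all rings"

    # toy run: pretend the update only behaves on internal and partner hardware.
    rings = [
        ("internal-qa", ["dev-01", "dev-02"]),                 # rounds of internal testing
        ("trusted-partners", [f"p-{i}" for i in range(24)]),   # ~two dozen close real-world users
        ("opt-in-early", [f"e-{i}" for i in range(200)]),      # users who opted in to early updates
        ("general", [f"g-{i}" for i in range(100_000)]),       # batched general rollout
    ]
    print(staged_rollout(rings,
                         push=lambda devs: None,               # stand-in for the real deploy call
                         healthy=lambda devs: all(d.startswith(("dev", "p")) for d in devs)))
    # -> halted at ring 'opt-in-early'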
Third-party driver/module crashed more than 3 times in a row -> Third-party driver/module is punished and has to be manually re-enabled.
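Something like this hypothetical sketch of that "three strikes" policy; the state file and crash-reporting hook are made up, not an existing Windows mechanism:

    import json
    from pathlib import Path

    STATE = Path("boot_strikes.json")       # hypothetical persistent store
    MAX_CONSECUTIVE_CRASHES = 3

    def record_boot(driver: str, crashed: bool) -> bool:
        """Return True once the driver has crashed the boot 3 times in a row."""
        strikes = json.loads(STATE.read_text()) if STATE.exists() else {}
        strikes[driver] = strikes.get(driver, 0) + 1 if crashed else 0
        STATE.write_text(json.dumps(strikes))
        return strikes[driver] >= MAX_CONSECUTIVE_CRASHES

    # example: three consecutive crash-and-reboot cycles trip the breaker, after which
    # the module would stay disabled until someone manually re-enables it.
    for _ in range(3):
        disable = record_boot("csagent.sys", crashed=True)
    print(disable)   # True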
Discussed elsewhere it is claimed that the file causing the crash was a data file that has been corrupted in the delivery process. So the development team and their CI have probably tested a good version, but the customer received a bad one.
If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk from local attacks. And then it does not do any integrity check on the data the file contains, which is a big no-no for all untrusted data, whether in user space or the kernel.
I assume if the signed file was hacked (or parts missing), then it wouldn't pass verification.
To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.
But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
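For illustration, the kind of check being asked for is not much code. A hedged sketch using the cryptography package; the magic value, key handling, and file layout are hypothetical, not CrowdStrike's actual format:

    from pathlib import Path
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding
    from cryptography.hazmat.primitives.serialization import load_pem_public_key

    MAGIC = b"CSCF"   # hypothetical 4-byte magic marking "this is a channel file"

    def load_channel_file(path: Path, signature: bytes, vendor_pubkey_pem: bytes) -> bytes:
        data = path.read_bytes()

        # 1. Refuse anything the vendor didn't sign (corrupted in transit, truncated, tampered).
        #    verify() raises InvalidSignature on mismatch, so a bad file never gets parsed.
        public_key = load_pem_public_key(vendor_pubkey_pem)
        public_key.verify(signature, data, padding.PKCS1v15(), hashes.SHA256())

        # 2. Even with a good signature, sanity-check the header before handing it to the parser.
        if not data.startswith(MAGIC):
            raise ValueError("bad magic header; refusing to feed this to the kernel driver")
        return data[len(MAGIC):]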
the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers
per a new/green account
At uni I had a professor in database systems who did not like written exams and mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. In my explanation I made the distinction between kernel space and user space; I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never pages its own memory. I did not argue the point and passed the exam with the best grade. I never checked that book again to verify my claim, and I have never done any kernel-space development even vaguely close to memory management, so still today I don't know the exact details.
However, what strikes me here: when that exam happened, in 1985 or so, the NT kernel did not exist yet, I believe. However, IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel. So the concept of paging (a part of) kernel memory went with them? Whether VMS --> WNT (every letter increased by one) is just a coincidence or an intentional nod by those developers I have never understood. As Linux has shown us, today much bigger systems can be handled successfully without the extra complication of paging kernel memory. Whether it's a good idea I don't know; at least it's not a necessary one.
https://www.youtube.com/watch?v=xi1Lq79mLeE
"Perhaps the worst thing about being a systems person is that other, non-systems people think that they understand the daily tragedies that compose your life. For example, a few weeks ago, I was debugging a new network file system that my research group created. The bug was inside a kernel-mode component, so my machines were crashing in spectacular and vindic- tive ways. After a few days of manually rebooting servers, I had transformed into a shambling, broken man, kind of like a computer scientist version of Saddam Hussein when he was pulled from his bunker, all scraggly beard and dead eyes and florid, nonsensical ramblings about semi-imagined enemies. As I paced the hallways, muttering Nixonian rants about my code, one of my colleagues from the HCI group asked me what my problem was. I described the bug, which involved concur- rent threads and corrupted state and asynchronous message delivery across multiple machines, and my coworker said, “Yeah, that sounds bad. Have you checked the log files for errors?” I said, “Indeed, I would do that if I hadn’t broken every component that a logging system needs to log data. I have a network file system, and I have broken the network, and I have broken the file system, and my machines crash when I make eye contact with them. I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood.”
Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.
It was my understanding that MS now sign 3rd party kernel mode code, with quality requirements. In which case why did they fail to prevent this?
The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...
"Oh, crowdstrike? Yeah, yeah, here's that Winodws kernel code signing key you paid for."
Up the chain to automated test machines, right?
Windows kernel paged, Linux non-paged?
Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only things that explicitly need to be kept in RAM being allocated from "non-paged" or "wired" memory.
EDIT: fixed spelling thanks to writing on phone.
Shouldn’t that have been caught in code review?
That they don't even do staged/A-B pushes was also <mind-blowing>.
But the most ironic part was: https://www.theregister.com/2024/07/18/security_review_failu...
We don't ask customers to switch all systems from Windows to Ubuntu, but to consider moving maybe a third to Ubuntu so they won't sit completely helpless next time Windows fail spectacularly.
While I see more and more Ubuntu systems, and recently have even spotted Landscape in the wild I don't think they were as successful as they hoped with that strategy.
That said, maybe there is a silver lining to today's clouds, both WRT Ubuntu and Linux in general, and also WRT IT departments stopping to reconsider some security best practices.
Honestly your comment highlights one of the few defenses... don't sit all on one platform.
[0] https://en.m.wikipedia.org/wiki/Gros_Michel_banana
Debian has automatic updates but they can be manual as well. That's not the case in Windows.
The best practice for security-critical infrastructure where people's lives are at stake is to install some version of BSD stripped down to its bare minimum. But then the company has to pay for much more expensive admins; Windows admins are much cheaper and more plentiful.
Also, as a user of Ubuntu and Debian for more than a decade, I have a hunch that this will not happen in India [1].
[1] https://news.itsfoss.com/indian-govt-linux-windows/
The specifics of this CrowdStrike kernel driver (which AFAIK is intended to intercept and log/deny syscalls depending on threat assessment?) mean that this is badnewsbears no matter which platform you're on.
Like sure, if an OS is vulnerable to kernel panics from code in userland, that's on the OS vendor, but this level of danger is intrinsic to kernel drivers!
Of course that means putting the user in control of when they apply updates, but maybe that would be a good thing anyway.
Yes, distribute your eggs, but check the handles on the baskets being sold to you by the guy pointing out bad handles.
Stable Ubuntu core under the surface, and everything desktop related delivered by the KDE team.
I'm just saying what they said their strategy was, not judging their sales people.
The CEO of Crowdstrike, George Kurtz, was the CTO of McAfee back in 2010 when it sent out a bad update and caused similar issues worldwide.
If at first you don't succeed, .... ;-) j/k
I bet they don't even lose a meaningful amount of customers. Switching costs are too high.
A real shame, and a good reminder that we don't own the things we think we own.
I've been out of IT proper for a while, so to me, I had to ask "the Russiagate guys are selling AV software now?"
When a company makes major headlines for bad news like this investors almost always over react and drive the price too far down.
Edited to add: I wonder what the economic fallout from this will be? 10x his monetary worth? 100x? (not trying to put a price on the people who will die because of the outage; for that he and everyone involved needs to go to jail)
He will be the guy that convinced the investors and stakeholders to pour more money into the company despite some world-wide incident.
He deserves at least 3x the pay.
PS: look at the stocks! They sank, and now they are gaining value again. People can't work, people die, flights get delayed/cancelled because of their software.
Took out the entire company where I worked.
People thought it was a worm/virus — a few minutes after plugging in a laptop, McAfee got the DAT update and quarantined the file, which caused Windows to start a countdown and reboot (leading to endless BSODs).
I know you aren't saying it is, but I think Taleb would argue that this incident, as he did with the coronavirus pandemic for example, isn't even a Black Swan event. It was extremely easy to predict, and you had a large number of experts warning people about it for years but being ignored. A Black Swan is unpredictable and unexpected, not something totally predictable that you decided not to prepare for anyways.
I don't think centrally distributed anti-virus software is the only way to maintain reliability. Instead, I'd say companies tend to centralize anything like administration since it's cost-effective and because they aren't actually concerned about global outages like this.
JM Keynes said "A ‘sound’ banker, alas! is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him." and the same goes for corporate IT.
In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.
> Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.
https://www.overcomingbias.com/p/kaczynskis-collapse-theoryh...
https://en.wikipedia.org/wiki/Anti-Tech_Revolution
I suspect most people in power just don't subscribe to that, which is precisely why it's systemic to see the engineer shouting "no!" when John CEO says "we're doing it anyway." I'm not sure this is something you can just teach, because the audience definitely has reservations about adopting it.
You can't prevent failure. You can only mitigate the impact. Biology has pretty good answers as to how to achieve this without having to increase complexity as a result; in fact, it often shows that simpler systems increase resiliency.
Something we used to understand until OS vendors became publicly traded companies and "important to national security" somehow.
https://simons.berkeley.edu/events/lessons-texas-covid-19-73...
The only possible way to fault tolerance is simplicity, and then more simplicity.
Things like CrowdStrike take the opposite approach: they add a lot of fragile complexity attempting to catch problems, while introducing more attack surfaces than they can remove. This will never succeed.
Basically delegation.