My workplace has a number of people reporting Windows blue-screening and going into a boot loop. The IT department has had a number of servers go offline recently and says there's a chance the two issues are related, potentially due to a CrowdStrike application update.
My laptop blue-screened and rebooted, but is working fine after the reboot.
A local radio station has also said they've got the same issues with their laptops and their phone system is down as a result.
Not seeing anything on news sites yet. Anyone else seeing similar?
All of the above is based in Australia.
CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or access files they shouldn't be (using some drunk-ass heuristics).
What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.
This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot looping with blue screens, which in the cloud is not something you can fix by just hitting F8 and removing the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.
This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.
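Roughly, in boto3 terms, each iteration of that recovery loop looks something like the sketch below. The instance IDs, device names, and the exact driver path are placeholders, not our actual environment:

    import boto3

    ec2 = boto3.client("ec2")

    BROKEN = "i-0123456789abcdef0"   # boot-looping node (placeholder)
    RESCUE = "i-0fedcba9876543210"   # healthy node used for the repair (placeholder)

    # 1. Force the broken instance off so its root volume can be detached.
    ec2.stop_instances(InstanceIds=[BROKEN], Force=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[BROKEN])

    # 2. Find and detach its root EBS volume.
    inst = ec2.describe_instances(InstanceIds=[BROKEN])["Reservations"][0]["Instances"][0]
    vol = next(m["Ebs"]["VolumeId"] for m in inst["BlockDeviceMappings"]
               if m["DeviceName"] == inst["RootDeviceName"])
    ec2.detach_volume(VolumeId=vol)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol])

    # 3. Attach it to the working node as a secondary disk; from there, mount it and
    #    delete the offending CrowdStrike .sys file under Windows\System32\drivers\ by hand.
    ec2.attach_volume(VolumeId=vol, InstanceId=RESCUE, Device="/dev/sdf")
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol])

    # 4. Once the file is gone: detach, reattach as the boot volume, start the node again.
    ec2.detach_volume(VolumeId=vol)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol])
    ec2.attach_volume(VolumeId=vol, InstanceId=BROKEN, Device="/dev/sda1")
    ec2.start_instances(InstanceIds=[BROKEN])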
I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.
Edit: to all the people moaning about Windows, we've had no problems with Windows. This is not a Windows issue. This is a third-party security vendor shitting in the kernel.
I used to have this crazy idea that fancy cloud vendors had competent management tools. Like maybe I could issue an API call to boot an existing instance from an alternate disk or HTTPS netboot URL. Or to insta-stop a VM and get block-level access to its disk via API, even if I had to pay for the instance while doing this.
And I’m not sure that it’s possible to do this sort of recovery at all without blowing away local SSD. There’s a “preview” feature for this on GCP, which seems to be barely supported, and I bet it adds massive latency to the process. Throwing away one’s local SSD on every single machine in a deployment sounds like a great way to cause potentially catastrophic resource usage when everything starts back up.
Hmm, I wonder if you’re even guaranteed to be able to get your instance back after stopping it.
WTF. Why can’t I have any means to access the boot disk of an instance, in a timely manner? Or any better means to recover an instance?
Is AWS any better?
In OCI we made a decision years ago that after 15 minutes from sending an ACPI shutdown signal, the instance should be hard powered off. We do the same for VM or BM. If you really want to, we take an optional parameter on the shutdown and reboot commands to bypass this and do an immediate hard power off.
So worst case scenario here, 15 minutes to get it shut down and be able to detach the boot volume to attach to another instance.
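In SDK terms, that's roughly the difference between a soft and a hard power action. A minimal sketch with the oci Python SDK; the OCID is a placeholder:

    import oci

    config = oci.config.from_file()                  # ~/.oci/config
    compute = oci.core.ComputeClient(config)

    STUCK_INSTANCE = "ocid1.instance.oc1..example"   # placeholder OCID

    # SOFTSTOP sends the ACPI shutdown signal; if the guest never shuts down cleanly,
    # the platform hard powers it off after the 15-minute grace period described above.
    compute.instance_action(STUCK_INSTANCE, "SOFTSTOP")

    # If you can't wait, STOP skips the grace period and hard powers off immediately,
    # after which the boot volume can be detached and attached to another instance.
    # compute.instance_action(STUCK_INSTANCE, "STOP")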
As for "throwing away local SSD", that only happens on AWS with instance store volumes which used to be called ephemeral volumes as the storage was directly attached to the host you were running on and if you did a stop/start of an ebs-backed instance, you were likely to get sent to a different host (vs. a restart API call, which would make an ACPI soft command and after a duration...I think it was 5 minutes, iirc, the hypervisor would kill the instance and restart it on the same host).
When the instance would get sent to a different host, it would get different instance storage and the old instance storage would be wiped from the previous host and you'd be provisioned new instance storage on the new host.
However, EBS volumes travel from host to host across stop/start cycles. They're attached with very low latency across the network from EBS servers and presented as a local block device to the instance. It's not quite as fast as local instance store, but it's fast enough for almost every use case if you get enough IOPS provisioned, either through direct provisioning + the correct instance size OR through a large enough drive + a large enough instance to maximize the connection to EBS (there's a table detailing IOPS, throughput, and instance size in the docs).
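The reboot-versus-stop/start distinction is visible in the API itself. A hedged boto3 illustration; the instance ID is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")
    instance = "i-0123456789abcdef0"   # placeholder

    # Reboot: an ACPI soft restart on the SAME host, so instance store volumes survive.
    ec2.reboot_instances(InstanceIds=[instance])

    # Stop/start: the instance may come back on a DIFFERENT host; instance store is wiped,
    # while EBS volumes follow the instance over the network to wherever it lands.
    ec2.stop_instances(InstanceIds=[instance])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance])
    ec2.start_instances(InstanceIds=[instance])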
Also, support can detach the volume as well if the instance is stuck shutting down and doesn't get manually shut down by the API after a timeout.
None of this is by any means "ideal", but the complexity of these systems is immense and what they're capable of at the scale they operate is actually pretty impressive.
The key is...lots of the things you talk about are do-able at small scale, but when you add more and more operations and complexity to the tool stack on interacting with systems, you add a lot of back-end network overhead, which leads to extreme congestion, even in very high speed networks (it's an exponential scaling problem).
The "ideal" way to deal with these systems is to do regular interval backups off-host (ie. object/blob storage or NFS/NAS/similar) and then just blow away anything that breaks and do a quick restore to the new, fixed instance.
It's obviously easier said than done, and most shops on some level still think about VMs/instances as pets rather than cattle, or have hurdles that make treating them as cattle much more challenging. But manual recovery in the cloud should, in general, just be avoided in favor of spinning up something new and re-deploying to it.
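As a sketch of that pattern in boto3 (AMI names, instance types, and IDs are placeholders, not a recommendation for any particular shop):

    import boto3
    from datetime import datetime, timezone

    ec2 = boto3.client("ec2")

    def nightly_backup(instance_id: str) -> str:
        """Bake the instance into an AMI, which snapshots its EBS volumes off-host."""
        image = ec2.create_image(
            InstanceId=instance_id,
            Name=f"backup-{instance_id}-{datetime.now(timezone.utc):%Y%m%d-%H%M}",
            NoReboot=True,                       # don't disturb the running workload
        )
        return image["ImageId"]

    def replace(broken_instance_id: str, last_good_ami: str) -> str:
        """Cattle, not pets: launch a fresh node from the last good image, kill the broken one."""
        new = ec2.run_instances(ImageId=last_good_ami, InstanceType="m5.large",
                                MinCount=1, MaxCount=1)
        ec2.terminate_instances(InstanceIds=[broken_instance_id])
        return new["Instances"][0]["InstanceId"]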
It took a bit to figure out with some customers, but we provide optional VNC access to instances at OCI, and with VNC the trick seems to be to hit esc and then F8, at the right stage in the boot process. Timing seems to be the devil in the details there, though. Getting that timing right is frustrating. People seem to be developing a knack for it though.
Interesting..
> We have to literally take each node down, attach the disk to a working node..
Probably the easiest solution for you is to go back in time to a previous scheduled snapshot, if you have that set up already.
I really want our cages, C7000's and VMware back at this point.
Honest question, I've seen comments in these various threads about people having similar issues (from a few months/weeks back) with kernel extension based deployments of CrowdStrike on Debian/Ubuntu systems.
I haven't seen anything similar regarding Mac OS, which no longer allows kernel extensions.
Is Mac OS not impacted by these kinds of issues with CrowdStrike's product, or have we just not heard about it due to the small scale?
Personally, I think it's a shared-responsibility issue. MS should build a product that is "open to extension but closed for modification".
> they pissed over everyone's staging and rules and just pushed this to production.
I am guessing that act alone is going to create a massive liability for CrowdStrike over this issue. You've made other comments that your organization is actively removing CrowdStrike. I'm curious how this plays out. Did CrowdStrike just SolarWind themselves? Will we see their CISO/CTO/CEO do time? This is just the first part of this saga.
There is no shared responsibility. CrowdStrike pushed a broken driver out, then triggered the breakage, overriding customer requirement and configuration for staging. It is a faulty product with no viable security controls or testing.
"Although Channel Files end with the SYS extension, they are not kernel drivers."
https://www.crowdstrike.com/blog/technical-details-on-todays...
Which is horrible!
I miss my AS/400.
This might be a decent place to recount the experience I had when interviewing for office security architect in 2003. My background is mainframe VM system design and large-system risk management modeling, which I had been doing since the late 80s at IBM, DEC, then Digital Switch and Bell Canada. My resume was pretty decent at the time. I don't like Python and I tell VP/Engs they have a problem when they can't identify benefits from JIRA/SCRUM, so I don't get a lot of job offers these days. Just a crusty greybeard bitching...
But anyway... so I'm up in Redmond and I have a decent couple of interviews with people, and then the 3rd most senior dev in all of MSFT comes in and asks "how's your QA skills?" and I start to answer about how QA and Safety/Security/Risk Management are different things: QA is about ensuring the code does what it's supposed to; software security, et al., is about making sure the code doesn't do what it's not supposed to, and the philosophic sticky wicket you enter when trying to prove a negative (worth a Google deep dive if you're unfamiliar). Dude cuts me off and says "meh. security is stupid. in a month, Bill will end this stupid security stand down and we'll get back to writing code and I need to put you somewhere and I figured QA is the right place."
When I hear that MSFT has systems that expose inadequate risk management abstractions, I think of the culture that promoted that guy to his senior position... I'm sure he was a capable engineer, but the culture in Redmond discounts the business benefits of risk management (to the point they outsource critical system infrastructure to third parties) because senior engineers don't want to be bothered to learn new tricks.
Culture eats strategy for breakfast, and MSFT has been fed on a cultural diet of junk food for almost half a century. At least from the perspective of doing business in the modern world.
Sure, but Windows shares some portion of the blame for allowing third-party security vendors to “shit in the kernel”.
Compare to macOS which has banned third-party kernel extensions on Apple Silicon. Things that once ran as kernel extensions, including CrowdStrike, now run in userspace as “system extensions”.
https://arstechnica.com/information-technology/2006/10/7998/
I'm a reporter with Bloomberg News covering cybersecurity. I'm trying to learn more about this CrowdStrike update potentially bypassing staging rules and would love to hear about your experience. Would you be open to a conversation?
I'm reachable by email at jbleiberg2@bloomberg.net or on Signal at JakeBleiberg.24. Here's my Bloomberg author page: https://www.bloomberg.com/authors/AWuCZUVX-Pc/jake-bleiberg.
Thank you.
Jake
MS don't have testers any more. Where do you think CS learned their radically effective test-in-prod approach?
They shit all over our controls and went to production.
This says we don't control it and should not trust it. It is being removed.
We were discharged at midnight by the doctor, but the nurse didn't come into our exam room to tell us until 4am. I can't imagine the mess this has caused.
That's an extra 4 hours of emergency room fees you ideally wouldn't have to pay for.
It makes my blood boil to be honest that there is no liability for what software has become. It's just not acceptable.
Companies that produce software with the level of access that Crowdstrike has (for all practical purposes a remote root exploit vector) must be liable for the damages that this access can cause.
This would radically change how much attention they pay to quality control. Today they can just YOLO-push barely tested code that bricks large parts of the economy and face no consequences. (Oh, I'm sure there will be some congress testimony and associated circus, but they will not ever pay for the damages they caused today.)
If a person caused the level and quantity of damage Crowdstrike caused today they would be in jail for life. But a company like Crowdstrike will merrily go on doing more damage without paying any consequence.
What about companies that deploy software with the level of quality that Crowdstrike has? Or Microsoft 365 for that matter.
That seems to be the bigger issue here; after all Crowdstrike probably says it is not suitable for any critical systems in their terms of use. You shouldn't be able to just decide to deploy anything not running away fast enough on critical infrastructure.
On the other hand, Crowdstrike Falcon Sensor might be totally suitable for a non-critical systems, say entertainment systems like the Xbox One.
Local emergency services were basically nonfunctional for the better part of the day; combined with the heat wave and various events, it seems like a number of deaths (locally at least, specific to what I know for my mid-sized US city) will be indirectly attributable to this.
And even worse, possibly quite a few deaths as well.
I hope (although I will not be holding my breath) that this is the wake-up call we need to realise that we cannot have so much of our critical infrastructure rely on the bloated OS of a company known for its buggy, privacy-intruding, crapware-riddled software.
I'm old enough to remember the infamous blue-screen-of-death Windows 98 presentation. Bugs exist, but that was hardly a glowing endorsement of high-quality software. That was long ago, yet it is nigh on impossible to believe that the internal company culture has drastically improved since then, with regular high-profile screw-ups reminding us of what is hiding under the thin veneer of corporate respectability.
Our emergency systems don't need windows, our telephone systems don't need windows, our flight management systems don't need windows, our shop equipment systems don't need windows, our HVAC systems don't need windows, and the list goes on, and on, and on.
Specialized, high-quality OSes with low attack surfaces are what we need to run our systems. Not a generic OS stuffed with legacy code from a time when those applications were not even envisaged.
Keep it simple, stupid (KISS) is what we need to go back to; our lives literally depend on it.
With the multi-billion-dollar screw-up that happened yesterday, and an as-of-yet unknown number of deaths, it's impossible to argue that the funds are unavailable to develop such systems. Plurality is what we need, built on top of strong standards for compatibility and interoperability.
Perhaps rather than an indictment on Windows, this is a call to re-evaluate microkernels, at least for critical systems and infrastructure.
What does this mean? Did the power go down? Is all the equipment connected? Or is it that the insurance software can't run, so nothing gets done? Maybe you can't access patient files anymore, but is that taking down the whole thing?
The fact that she was discharged without an overnight admit suggests to me that the MRI did not show a stroke, or perhaps she was outside the treatment window when she went to the hospital.
This should be the standard for any life sustaining or surgical systems, and any critical weapons systems.
I can't believe they pushed updates to 100% of Windows machines and somehow didn't notice a reboot loop. Epic gross negligence. Are their employees really this incompetent? It's unbelievable.
I wonder where MSFT and Crowdstrike are most vulnerable to lawsuits?
Everything about it reeks of incompetence and gross negligence.
It’s the old story of the user and the purchaser being different parties: the software only needs to be good enough to be sold to third parties who never need to use it.
It’s a half-baked rootkit part of performative cyberdefence theatrics.
I am LMFAO at the entire situation. Somewhere, George Carlin is smiling.
This is the result of giving away US jobs overseas at 1/10th the salary
“Loive from NPR news in Washington“
Usually when I write this, devs get all defensive and ask me what the worst thing is that could happen... I don't know... Could you guarantee it doesn't involve people dying?
Dear colleagues, software is great because one person's work multiplies. But it is also a damn fucking huge responsibility to ensure you are not inserting bullshit into the multiplication.
If we can at least get that basis, then we can start to define more things, such as jobs that non-engineers cannot legally do, and legal ramifications for things such as software bugs. If someone could lose their professional license and potentially their career over shipping a large enough bug, suddenly the problem of having 25,000 npm dependencies and continuous deployment breaking things at any moment will magically cease to exist quite quickly.
I hope organisations start revisiting some of these insane decisions.
They ended up giving MS a substantial amount of money to extend support for their use case for some number of years. I can't remember the number he told me but it was extremely large.
Just a few weeks ago I had an OpenBSD box render itself completely unbootable after nothing more than a routine clean shutdown. Turns out their paranoid-idiotic "we re-link the kernel on every boot" coupled with their house-of-cards file system corrupted the kernel, then overwrote the backup copy when I booted from emergency media - which doesn't create device nodes by default so can't even mount the internal disks without more cryptic commands.
Give me the Windows box, please.
Why would Windows systems be anywhere near critical infra ?
Heart attacks and 911 are not things you build with Windows based systems.
We understood this 25 years ago.
This is just a guess, but maybe the client machines are windows. So maybe there are servers connected to phone lines or medical equipment, but the doctors and EMS are looking at the data on windows machines.
Maybe Heartbleed or the xz Utils debacles convinced them to switch.
Good luck teaching administrators an entirely new ecosystem; good luck finding off-the-shelf software for Linux.
Bespoke is expensive, expertise is rare, Linux is sadly niche.
Why would computers be anywhere near critical infra? This sounds like something that should fail safe: the control system goes down, but the thing keeps running. If power goes down, hospitals have generator backups; it seems weird that computers would not be in the same situation.
I mean, if the problem is that hospitals can't function anymore, money is hardly the biggest problem
Not questioning that it happened, but this was a boot loop after a content update. So if the computers were off and didn't get the update, and you booted them, they would be fine. And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.
How did it happen that you were rebooting in the middle of treating a heart attack? [Edit: BSOD -> auto reboot]
> And if they were on and you were using them, they wouldn't be rebooting, and it would be fine.
Windows has been notorious for forcing updates down your throat, and rebooting at the least appropriate moments (like during time-sensitive presentations, because that's when you stepped away from the keyboard for 5 minutes to set up the projector). And that's in private setting. Corporate setting, the IT department is likely setting up even more aggressive and less workaround-able reboot schedule.
Things like this are exactly why people hate auto-updates.
I can see very well how one computer could have screwed all others. It's really not hard to imagine.
- lifts won't operate.
- can't disarm the building alarms. (they have been blaring nonstop...)
- cranes are all locked in standby/return/err.
- laser aligners are all offline.
- lathe hardware runs but the controllers are all down.
- can't email suppliers.
- phones are all down.
- HVAC is also down for some reason (it's getting hot in here).
the police drove by and told us to close up for the day since we don't have 911 either.
alarms for the building are all offline/error, so we chained things as best we could (might drive by a few times today).
we don't know how many orders we have, we don't even know who's on schedule or if we will get paid.
Are they somehow controlled remotely? Or do they need to ping a central server to be able to operate?
I can see how alarms, email and phones are affected but the heavy machinery?
(Clearly not familiar with any of these things so I am genuinely curious)
I used to build sorting machines that use variants of the typical “industrial” tech stack, and the actual controllers are rarely (but not never!) Windows. But it’s common for the HMI to be a Windows box connected into the rest of the network, as well as any server.
"Appalled", "bewildered" and "horrified" and also comes to mind..
Holy cow...
Who on earth requires a Windows-based backend (or whatever else had CrowdStrike, in the shop or outside) for regular (VoIP) phone calls?
This should really lead to some learnings for anyone providing any kind of phone infrastructure.
The next move should be some artisanal, as-mechanical-as-possible quality products, or at least Linux(TM)-certified products or similar (or Windows-free(TM)). The opportunity is here: everybody noticed this clusterfuck, and smart folks don't like ignoring threats that are in their face.
But I suppose in 2 weeks some other bombastic news will roll over this and most will forget. But there is always some hope.
Outage aside, do you feel safe using it while knowing that it accepts updates based on the whims of far away people that you don't know?
I can't even imagine how much worse ransomware would be if, for example, Windows and browsers weren't updating themselves.
BitLocker is a storage driver, so that code turned into a circular dependency: the attempt to page in the code resulted in a call to that not-yet-paged-in code.
The reason I didn't catch it with local testing was because I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. For everyone on the team that did have BitLocker enabled they got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
That was my first thought too. Our company does firmware updates to hundreds of thousands of devices every month and those updates always go through 3 rounds of internal testing, then to a couple dozen real world users who we have a close relationship with (and we supply them with spare hardware that is not on the early update path in case there is a problem with an early rollout). Then the update goes to a small subset of users who opt in to those updates, then they get rolled out in batches to the regular users in case we still somehow missed something along the way. Nothing has ever gotten past our two dozen real world users.
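A toy sketch of that kind of ring-based rollout, just to make the gating concrete. The ring names, sizes, and health check are made up, not the parent's actual tooling:

    from typing import Callable

    Ring = tuple[str, list[str]]   # (ring name, device IDs)

    def staged_rollout(rings: list[Ring],
                       push: Callable[[list[str]], None],
                       healthy: Callable[[list[str]], bool]) -> str:
        """Deploy ring by ring, halting at the first ring that reports problems."""
        for name, devices in rings:
            push(devices)
            if not healthy(devices):
                return f"halted at ring '{name}'"   # the wider fleet never sees the bad update
        return "rolled out to all rings"

    # toy run: pretend the update only behaves on internal and partner hardware.
    rings = [
        ("internal-qa", ["dev-01", "dev-02"]),                 # rounds of internal testing
        ("trusted-partners", [f"p-{i}" for i in range(24)]),   # ~two dozen close real-world users
        ("opt-in-early", [f"e-{i}" for i in range(200)]),      # users who opted in to early updates
        ("general", [f"g-{i}" for i in range(100_000)]),       # batched general rollout
    ]
    print(staged_rollout(rings,
                         push=lambda devs: None,               # stand-in for the real deploy call
                         healthy=lambda devs: all(d.startswith(("dev", "p")) for d in devs)))
    # -> halted at ring 'opt-in-early'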
Third-party driver/module crashed more than 3 times in a row -> Third-party driver/module is punished and has to be manually re-enabled.
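Something like this hypothetical sketch of that "three strikes" policy; the state file and crash-reporting hook are made up, not an existing Windows mechanism:

    import json
    from pathlib import Path

    STATE = Path("boot_strikes.json")       # hypothetical persistent store
    MAX_CONSECUTIVE_CRASHES = 3

    def record_boot(driver: str, crashed: bool) -> bool:
        """Return True once the driver has crashed the boot 3 times in a row."""
        strikes = json.loads(STATE.read_text()) if STATE.exists() else {}
        strikes[driver] = strikes.get(driver, 0) + 1 if crashed else 0
        STATE.write_text(json.dumps(strikes))
        return strikes[driver] >= MAX_CONSECUTIVE_CRASHES

    # example: three consecutive crash-and-reboot cycles trip the breaker, after which
    # the module would stay disabled until someone manually re-enables it.
    for _ in range(3):
        disable = record_boot("csagent.sys", crashed=True)
    print(disable)   # True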
Discussed elsewhere it is claimed that the file causing the crash was a data file that has been corrupted in the delivery process. So the development team and their CI have probably tested a good version, but the customer received a bad one.
If that is true, the problem is that the driver uses an unsigned file at all, so all customer machines are continuously at risk from local attacks. And then it does not do any integrity check on the data the file contains, which is a big no-no for all untrusted data, whether in user space or the kernel.
I assume if the signed file was hacked (or parts missing), then it wouldn't pass verification.
To me, this is the inexcusable sin. These updates should be signed and signatures validated before the file is read. Ideally the signing/validating would be handled before distribution so that when this file was corrupted, the validation would have failed here.
But even with a good signature, when a file is read and the values don’t make sense, it should be treated as a bad input. From what I’ve seen, even a magic bytes header here would have helped.
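For illustration, the kind of check being asked for is not much code. A hedged sketch using the cryptography package; the magic value, key handling, and file layout are hypothetical, not CrowdStrike's actual format:

    from pathlib import Path
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding
    from cryptography.hazmat.primitives.serialization import load_pem_public_key

    MAGIC = b"CSCF"   # hypothetical 4-byte magic marking "this is a channel file"

    def load_channel_file(path: Path, signature: bytes, vendor_pubkey_pem: bytes) -> bytes:
        data = path.read_bytes()

        # 1. Refuse anything the vendor didn't sign (corrupted in transit, truncated, tampered).
        #    verify() raises InvalidSignature on mismatch, so a bad file never gets parsed.
        public_key = load_pem_public_key(vendor_pubkey_pem)
        public_key.verify(signature, data, padding.PKCS1v15(), hashes.SHA256())

        # 2. Even with a good signature, sanity-check the header before handing it to the parser.
        if not data.startswith(MAGIC):
            raise ValueError("bad magic header; refusing to feed this to the kernel driver")
        return data[len(MAGIC):]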
the flawed data was added in a post-processing step of the configuration update, which is after it's been tested internally but before it's copied to their update servers
per a new/green account
At uni I had a professor in database systems who did not like written exams and mostly did oral exams. Obviously for DBMSes the page buffer is very relevant, so we chatted about virtual memory and paging. In my explanation I made the distinction between kernel space and user space; I am pretty sure I had read that in a book describing VAX/VMS internals. However, the professor claimed that a kernel never pages its own memory. I did not argue the point and passed the exam with the best grade. I never checked that book again to verify my claim, and I have never done any kernel-space development even vaguely close to memory management, so still today I don't know the exact details.
However, what strikes me here: when that exam happened, in 1985 or so, the NT kernel did not exist yet, I believe. However, IIRC a significant part of the DEC VMS kernel team went to Microsoft to work on the NT kernel. So the concept of paging (a part of) kernel memory went with them? Whether VMS --> WNT (every letter increased by one) is just a coincidence or an intentional nod by those developers I have never understood. As Linux has shown us, today much bigger systems can be handled successfully without the extra complication of paging kernel memory. Whether it's a good idea I don't know; at least it's not a necessary one.
https://www.youtube.com/watch?v=xi1Lq79mLeE
"Perhaps the worst thing about being a systems person is that other, non-systems people think that they understand the daily tragedies that compose your life. For example, a few weeks ago, I was debugging a new network file system that my research group created. The bug was inside a kernel-mode component, so my machines were crashing in spectacular and vindic- tive ways. After a few days of manually rebooting servers, I had transformed into a shambling, broken man, kind of like a computer scientist version of Saddam Hussein when he was pulled from his bunker, all scraggly beard and dead eyes and florid, nonsensical ramblings about semi-imagined enemies. As I paced the hallways, muttering Nixonian rants about my code, one of my colleagues from the HCI group asked me what my problem was. I described the bug, which involved concur- rent threads and corrupted state and asynchronous message delivery across multiple machines, and my coworker said, “Yeah, that sounds bad. Have you checked the log files for errors?” I said, “Indeed, I would do that if I hadn’t broken every component that a logging system needs to log data. I have a network file system, and I have broken the network, and I have broken the file system, and my machines crash when I make eye contact with them. I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS. My only logging option is to hire monks to transcribe the subjective experience of watching my machines die as I weep tears of blood.”
Hello from a fellow BitLocker dev from this time! I think I know who this is, but I'm not sure and don't want to say your name if you want it private. Was one of your Win10 features implementing passphrase support for the OS drive? In any case, feel free to reach out and catch up. My contact info is in my profile.
It was my understanding that MS now sign 3rd party kernel mode code, with quality requirements. In which case why did they fail to prevent this?
The problem here is obviously this other file the driver sucks in. Just because the driver didn't crash for Microsoft in their lab doesn't mean a different file can't crash it...
"Oh, crowdstrike? Yeah, yeah, here's that Winodws kernel code signing key you paid for."
Up the chain to automated test machines, right?
Windows kernel paged, Linux non-paged?
Many other systems instead default to using the full power of virtual memory, including swapping kernel space to disk, with only things that explicitly need to be kept in RAM being allocated from "non-paged" or "wired" memory.
EDIT: fixed spelling thanks to writing on phone.
Shouldn’t that have been caught in code review?
That they don't even do staged/A-B pushes was also <mind-blowing>.
But the most ironic part was: https://www.theregister.com/2024/07/18/security_review_failu...
We don't ask customers to switch all systems from Windows to Ubuntu, but to consider moving maybe a third to Ubuntu so they won't sit completely helpless next time Windows fail spectacularly.
While I see more and more Ubuntu systems, and recently have even spotted Landscape in the wild I don't think they were as successful as they hoped with that strategy.
That said, maybe there is a silver lining to today's clouds, both WRT Ubuntu and Linux in general, and also WRT IT departments stopping to reconsider some security best practices.
Honestly your comment highlights one of the few defenses... don't sit all on one platform.
[0] https://en.m.wikipedia.org/wiki/Gros_Michel_banana
Debian has automatic updates but they can be manual as well. That's not the case in Windows.
The best practice for security-critical infrastructure where people's lives are at stake is to install some version of BSD stripped down to its bare minimum. But then the company has to pay for much more expensive admins; Windows admins are much cheaper and more plentiful.
Also, as a user of Ubuntu and Debian for more than a decade, I have a hunch that this will not happen in India [1].
[1] https://news.itsfoss.com/indian-govt-linux-windows/
The specifics of this CrowdStrike kernel driver (which AFAIK is intended to intercept and log/deny syscalls depending on threat assessment?) mean that this is badnewsbears no matter which platform you're on.
Like sure, if an OS is vulnerable to kernel panics from code in userland, that's on the OS vendor, but this level of danger is intrinsic to kernel drivers!
Of course that means putting the user in control of when they apply updates, but maybe that would be a good thing anyway.
Yes, distribute your eggs, but check the handles on the baskets being sold to you by the guy pointing out bad handles.
Stable Ubuntu core under the surface, and everything desktop related delivered by the KDE team.
I'm just saying what they said their strategy was, not judging their sales people.
The CEO of Crowdstrike, George Kurtz, was the CTO of McAfee back in 2010 when it sent out a bad update and caused similar issues worldwide.
If at first you don't succeed, .... ;-) j/k
I bet they don't even lose a meaningful amount of customers. Switching costs are too high.
A real shame, and a good reminder that we don't own the things we think we own.
I've been out of IT proper for a while, so to me, I had to ask "the Russiagate guys are selling AV software now?"
When a company makes major headlines for bad news like this investors almost always over react and drive the price too far down.
Edited to add: I wonder what the economic fallout from this will be? 10x his monetary worth? 100x? (not trying to put a price on the people who will die because of the outage; for that he and everyone involved needs to go to jail)
He will be the guy that convinced the investors and stakeholders to pour more money into the company despite some world-wide incident.
He deserves at least 3x the pay.
PS: look at the stocks! They sank, and now they are gaining value again. People can't work, people die, flights get delayed/cancelled because of their software.
Took out the entire company where I worked.
People thought it was a worm/virus — a few minutes after plugging in a laptop, McAfee got the DAT update and quarantined the file, which caused Windows to start a countdown and reboot (leading to endless BSODs).
I know you aren't saying it is, but I think Taleb would argue that this incident, as he did with the coronavirus pandemic for example, isn't even a Black Swan event. It was extremely easy to predict, and you had a large number of experts warning people about it for years but being ignored. A Black Swan is unpredictable and unexpected, not something totally predictable that you decided not to prepare for anyways.
I don't think centrally distributed anti-virus software is the only way to maintain reliability. Instead, I'd say companies tend to centralize anything like administration since it's cost-effective and because they aren't actually concerned about global outages like this.
JM Keynes said "A ‘sound’ banker, alas! is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him." and the same goes for corporate IT.
In all the cases I know, we traded frequent and localized failure for infrequent but globalized catastrophic failures. Like in this case.
> Why must everything collapse? Because, [Kaczynski] says, natural-selection-like competition only works when competing entities have scales of transport and talk that are much less than the scale of the entire system within which they compete. That is, things can work fine when bacteria who each move and talk across only meters compete across an entire planet. The failure of one bacteria doesn’t then threaten the planet. But when competing systems become complex and coupled on global scales, then there are always only a few such systems that matter, and breakdowns often have global scopes.
https://www.overcomingbias.com/p/kaczynskis-collapse-theoryh...
https://en.wikipedia.org/wiki/Anti-Tech_Revolution
I suspect most people in power just don't subscribe to that, which is precisely why it's systemic to see the engineer shouting "no!" when John CEO says "we're doing it anyway." I'm not sure this is something you can just teach, because the audience definitely has reservations about adopting it.
You can't prevent failure. You can only mitigate the impact. Biology has pretty good answers as to how to achieve this without having to increase complexity as a result; in fact, it often shows that simpler systems increase resiliency.
Something we used to understand until OS vendors became publicly traded companies and "important to national security" somehow.
https://simons.berkeley.edu/events/lessons-texas-covid-19-73...
The only possible way to fault tolerance is simplicity, and then more simplicity.
Things like CrowdStrike take the opposite approach: they add a lot of fragile complexity attempting to catch problems, while introducing more attack surfaces than they can remove. This will never succeed.
Basically delegation.