This should not have passed a competent CI pipeline for a system in the critical path.
I’m not even particularly stringent when it comes to automated testing across the board, but for this level of criticality of system you need exceptionally good state management.
To the point where you should not roll to production without an integration test on every environment that you claim to support.
Like it’s insane to me that a company of this size and criticality doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.
Who is running stuff over there? Total incompetence.
A lot of assumptions here that probably aren't worth making without more info. For example, it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN" code or something, at which point it passes a lot of checks before the failure.
We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.
I haven't seen the file, but surely each build artifact should be signed and verified when it's loaded by the client. The failure mode of bit rot / malice in the CDN should be handled.
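For illustration, here is a minimal user-space sketch of what "verify on load" could look like (this is not CrowdStrike's actual mechanism; the manifest digest source and names are hypothetical): compute the file's SHA-256 with OpenSSL's EVP API and refuse to hand it to any parser unless it matches the digest published alongside it.

    // Hypothetical sketch: verify a data file against an expected SHA-256
    // digest before handing it to the component that parses it.
    // Build with: g++ -std=c++17 verify.cpp -lcrypto
    #include <openssl/evp.h>
    #include <cstdio>
    #include <fstream>
    #include <string>
    #include <vector>

    static std::string sha256_hex(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        if (!in) return "";
        EVP_MD_CTX* ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);
        std::vector<char> buf(1 << 16);
        while (in) {
            in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
            if (in.gcount() > 0)
                EVP_DigestUpdate(ctx, buf.data(), static_cast<size_t>(in.gcount()));
        }
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int len = 0;
        EVP_DigestFinal_ex(ctx, md, &len);
        EVP_MD_CTX_free(ctx);
        static const char* hex = "0123456789abcdef";
        std::string out;
        for (unsigned int i = 0; i < len; ++i) {
            out += hex[md[i] >> 4];
            out += hex[md[i] & 0x0f];
        }
        return out;
    }

    int main(int argc, char** argv) {
        if (argc != 3) {
            std::fprintf(stderr, "usage: verify <file> <expected-sha256-hex>\n");
            return 2;
        }
        // A zeroed, truncated, or bit-rotted file fails here, before any parser sees it.
        if (sha256_hex(argv[1]) != argv[2]) {
            std::fprintf(stderr, "digest mismatch: refusing to load %s\n", argv[1]);
            return 1;
        }
        std::printf("ok: %s matches the expected digest\n", argv[1]);
        return 0;
    }

The point is that corruption anywhere between build and client (CDN bit rot included) gets caught at the load boundary, not in a kernel-mode parser.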
> Like it’s insane to me that a company of this size and criticality doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.
Who is saying they don't have that? Who is saying it didn't pass all of that?
Yeah... the comment above reads like someone who has read a lot of books on CI deployment but has zero experience actually doing it in a real-world environment. Quick to throw stones with absolutely no understanding of any of the nuances involved.
You sound like the guy who, a few years ago, tried to argue that (the company in question) tested OS code that didn't include any drivers for their gear's local storage. It's obvious to anyone competent that it wasn't.
The strange thing is that when I interviewed there years ago with the team that owns the language that runs in the kernel, they said their CI has 20k or 40k machine/OS combinations and configurations. Surely some of them were vanilla Windows!
/* Acceptance criterion #1: do not allow the machine to boot if invalid data
   signatures are present; this could indicate a compromised system. Booting
   could cause the president's diary to transmit to the rival 'Country' of the week. */
if (dataFileIsNotValid) {
    throw FatalKernelException("All your base are compromised");
}
EDIT: Explanation:
With hindsight, not booting may be exactly the right thing to do, since a bad data file would indicate a compromised distribution/network.
The machines should not fully boot until a file with a valid signature is downloaded.
It seems unlikely that a file entirely full of null characters was the output of any automated build pipeline. So I’d wager something got built, passed the CI tests, then the system broke at some point after that when the file was copied ready for deployment.
But at this stage, all we are doing is speculating.
Those signature files should have a checksum, or even a digital signature. I mean even if it doesn't crash the entire computer, a flipped bit in there could still turn the entire thing against a harmless component of the system and lead to the same result.
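As a rough illustration of why even the cheapest check helps (this is not CrowdStrike's file format, just a generic sketch): a plain CRC-32 stored next to the payload already catches an all-zero file or a single flipped bit, though only a real digital signature protects against deliberate tampering.

    // Sketch only: standard CRC-32 (IEEE, reflected, poly 0xEDB88320) over the payload.
    // A 4-byte checksum stored next to the data catches a zeroed file or a flipped bit;
    // it does not protect against malice -- that is what a digital signature is for.
    #include <cstddef>
    #include <cstdint>

    uint32_t crc32_ieee(const uint8_t* data, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; ++bit)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    // Caller side: refuse the blob if the stored checksum doesn't match.
    bool channel_file_plausible(const uint8_t* blob, size_t len, uint32_t stored_crc) {
        return len > 0 && crc32_ieee(blob, len) == stored_crc;
    }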
Yup. I had quite a battle with some sort of system bug (never fully traced) where I wrote valid data but what ended up on disk was all zero. It appeared to involve corrupted packets being accepted as valid.
It doesn't matter how much you test if something down the line zeroes out your stuff.
If a garbage file is pushed out, the program could have handled it by ignoring it. In this case it did not, and now we (the collective IT industry) are dealing with the consequences of one company that can't be bothered to validate its input (they aren't the only ones, but this is a particularly catastrophic demonstration of the importance of input validation).
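A hedged sketch of what "ignoring it" could look like (the header layout, magic bytes, and function names are invented for illustration, not CrowdStrike's real format): parse defensively, reject anything malformed, and keep running on the last known-good data instead of crashing the machine.

    // Hypothetical format: 4-byte magic, 4-byte version, 4-byte payload length
    // (host byte order assumed for brevity). An all-zero file fails the magic check.
    #include <cstdint>
    #include <cstring>
    #include <optional>
    #include <vector>

    struct ChannelFile {
        uint32_t version;
        std::vector<uint8_t> payload;
    };

    std::optional<ChannelFile> parse_channel_file(const std::vector<uint8_t>& raw) {
        static const uint8_t kMagic[4] = { 'C', 'H', 'N', '1' };   // assumed magic bytes
        if (raw.size() < 12 || std::memcmp(raw.data(), kMagic, 4) != 0)
            return std::nullopt;                      // zeroed or foreign file rejected here
        uint32_t version = 0, payload_len = 0;
        std::memcpy(&version, raw.data() + 4, 4);
        std::memcpy(&payload_len, raw.data() + 8, 4);
        if (payload_len != raw.size() - 12)
            return std::nullopt;                      // truncated or padded file rejected
        return ChannelFile{ version, { raw.begin() + 12, raw.end() } };
    }

    // The caller keeps running on stale-but-valid data rather than crashing.
    ChannelFile load_or_keep_previous(const std::vector<uint8_t>& update,
                                      const ChannelFile& last_known_good) {
        if (auto parsed = parse_channel_file(update))
            return *parsed;
        // report the rejection through whatever alerting exists -- but don't BSOD
        return last_known_good;
    }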
What sort of sane system modifies the build output after testing?
Our release process is more like: build and package, sign package, run CI tests on signed package, run manual tests on signed package, release signed package. The deployment process should check those signatures. A test process should by design be able to detect any copy errors between test and release in a safe way.
Keep in mind that this was probably a data file, not necessarily a code file.
It's possible that they run tests on new commits, but not when some other, external, non-git system pushes out new data.
Team A thinks that "obviously the driver developers are going to write it defensively and protect it against malformed data", team B thinks "obviously all this data comes from us, so we never have to worry about it being malformed"
I don't have any non-public info about what actually happened, but something along these lines seems to be the most likely hypothesis to me.
Edit: Now what would have helped here is a "staged rollout" process with some telemetry. Push the update to 0.01% of your users and solicit acknowledgments after 15 minutes. If the vast majority of systems are still alive and haven't been restarted, keep increasing the threshold. If, at any point, too many of the updated systems stop responding or indicate a failure, immediately stop the rollout, page your on-call engineers and give them a one-click process to completely roll the update back, even for already-updated clients.
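Sketching the loop described above (every hook it calls is hypothetical fleet-management plumbing, not any real API):

    #include <chrono>
    #include <functional>
    #include <thread>

    struct RolloutHooks {                              // hypothetical fleet plumbing
        std::function<void(double)>      push_to_fraction;  // deploy to this share of the fleet
        std::function<double()>          healthy_fraction;  // share of updated hosts still checking in
        std::function<void()>            rollback_all;      // one-click revert, even for updated clients
        std::function<void(const char*)> page_oncall;
    };

    void staged_rollout(const RolloutHooks& hooks) {
        const double stages[] = { 0.0001, 0.001, 0.01, 0.05, 0.25, 1.0 };
        const double kMinHealthy = 0.99;               // tolerate telemetry noise, not a bricked fleet
        for (double stage : stages) {
            hooks.push_to_fraction(stage);
            std::this_thread::sleep_for(std::chrono::minutes(15));  // wait for acknowledgments
            if (hooks.healthy_fraction() < kMinHealthy) {
                hooks.rollback_all();
                hooks.page_oncall("staged rollout failed health check, rolled back");
                return;                                // stop the rollout right here
            }
        }
    }

The exact stages and threshold are made up; the point is that an update which bricks machines on boot never gets past the first 0.01% of the fleet.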
This is exactly the kind of issue that non-invasive, completely anonymous, opt-out telemetry would have solved.
> tests all of the possible target images that they claim to support.
Or even at the very least the most popular OS that they support. I'm genuinely imagining right now that for this component, the entirety of the company does not have a single Windows machine they run tests on.
It's wild that I'm out here boosting existing unit testing practices with mutation testing https://github.com/codeintegrity-ai/mutahunter and there are folks out there that don't even do the basic testing.
Without delving into any kind of specific conspiratorial thinking, I think people should also include the possibility that this was malicious. It's much more likely to be incompetence and hubris, but ever since I found out that this is basically an authorized rootkit, I've been concerned about what happens if another Solarwinds incident occurs with Crowdstrike or another such tool. And either way, we have the answer to that question now: it has extreme consequences.
We really need to end this blind checkbox compliance culture and start doing real security.
I don't know if people on Microsoft ecosystems even know what CI pipelines are.
Linux and Unix ecosystems in general work by people thoroughly testing and taking responsibility for their work.
Windows ecosystems work by blame passing. Blame Ron*, the IT guy. Blame Windows Update. Blame Microsoft. That's how stuff works.
It has always worked this way.
But also, all the good devs got offered 3X the salary at Google, Meta, and Apple. Have you ever applied for a job at CrowdStrike? No? That's why they suck.
* A disproportionately large number of Windows IT guys are named Ron, in my experience.
On a related note, I don't think that it's a coincidence that 2 of the largest tech meltdowns in history (this one and the SolarWinds hack from a few years ago) were both the result of "security software" run amok. (Also sad that both of these companies are based in Austin, which certainly gives Austin's tech scene a black eye).
IMO, I think a root cause issue is that the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture. For example, I can't speak so much for CrowdStrike, but it came out that SolarWinds had an egregiously bad security culture at their company. When the root cause of this issue comes out, dollars to donuts it was just a fast-and-loose deployment process.
> the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture
I disagree, security companies suffer from "too big to fail" syndrome where the money comes easy because they have customers who want to check a box. Security is NOT a product you pay for, it's a culture that takes active effort and hard work to embed from day one. There's no product on the market that can provide security, only products to point a finger at when things go wrong.
Crowdstrike was originally founded and headquartered in Irvine, CA (Southern California). In those days, most of its engineering organization was either remote/WFH or in Irvine, CA.
As they got larger, they added a Sunnyvale office, and later moved the official headquarters to Austin, TX.
They've also been expanding their engineering operations overseas which likely includes offshoring in the last few years.
Alternate hypothesis that's not necessarily mutually exclusive: security software tends to need significant and widespread access. That means that fuckups and compromises tend to be more impactful.
100% agree with that. The thing that baffles me a bit, then, is that if you are writing software that is so critical and can have such a catastrophic impact when things go wrong, you double- and triple-check everything you do. What you DON'T do is use the same level of care you may use with some social media CRUD app (move fast and break things and all that...).
To emphasize, I'm really just thinking about the bad practices that were reported after the SolarWinds hack (password of "solarwinds123" and a bunch of other insider reports), so I can't say that totally applies to CrowdStrike, but in general I don't feel like these companies that can have such a catastrophic impact take appropriate care of their responsibilities.
Security software needs kernel-level access. If something breaks, you get boot loops and crashes.
Most other software doesn't need that low level of access, and even if it crashes, it doesn't take the whole system with it, and a quick, automated upgrade process is possible.
The Austin tech culture is…interesting. I stopped trying to find a job here and went remote Bay Area, and talking to tech workers in the area gave me the impression it’s a mix of slacker culture and hype chasing. After moving back here, tech talent seems like a game of telephone, and we’re several jumps past the original.
When I heard CrowdStrike was here, it just kinda made sense
On the plus side of this disaster, I am holding out some pico-sized hope that maybe organizations will rethink kernel level access. No, random gaming company, you are not good enough to write kernel level anti cheat software.
I can't imagine gaming software being affected at all, unless MS does a ton of cracking down (and they would still probably give hooks for gaming, since they have gaming companies under their umbrella).
No corporate org is gonna bat an eye at Riot's anti-cheat practices, because they aren't installing LoL on their line of business machines anyway.
Right, MS just paid $75e9 for a company whose main products are competitive multiplayer games. They are never going to be incentivized to compromise that sector by limiting what anti-cheat they can do.
this seems like the second or third test file any qa person would have tried, after an empty file and maybe a minimal valid file. the level of pervasive incompetence implied here is staggering
in a market where companies compete by impressing nontechnical upper management with presentations, it should be no surprise that technically competent companies have no advantage over incompetent ones
i recently read through the craig wright decision https://www.judiciary.uk/judgments/copa-v-wright/ (the guy who fraudulently claimed to be satoshi nakamoto) and he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist (decompiling malware to c); he didn't know what 'unsigned' meant when questioned on the witness stand. he'd been doing infosec work for big companies going back to the 90s. he'd apparently been tricking people with technobabble and rigged demos and forged documents for his entire career
pci-dss is not a government agency but it might as well be; it's a collusion between visa, mastercard, american express, discover, and jcb to prevent them from having different data security standards (and therefore being able to compete on security)
It's definitely ironic, and compatible with the security engineering world joke that the most secure system is one that cannot be accessed or used at all.
I suppose one way to "stop breaches" is to shut down every host entirely.
In the military world, there is a concept of an "Alpha Strike" which generally relates to a fast-enough and strong-enough first-strike that is sufficient to disable the adversary's ability to respond or fight back (e.g. taking down an entire fleet at once). Perhaps people that have been burned by this event will start calling it a Crowdstrike.
What’s frustrating is that after falsely attributing pci to government regulation and being corrected, you’re probably not going to re-examine the thought process that led you to that false belief.
It seems government IT systems in general fared pretty well over the last 12 hrs, but loads of large private companies were effectively taken offline, so there's that.
You misspelled “private sector”. Use of endpoint monitoring software is coming out of private auditing companies driven by things like PCI or insurers’ requirements – almost nobody wants to pay for highly-skilled security people so they’re outsourcing it to the big auditing companies and checklists so that if they get sued they can say they were following industry practices and the audit firms okayed it.
I find it amusing that the people commenting on that link are offended this is called a "Microsoft" outage when it is "Crowdstrike's fault".
This is just as much a Microsoft failure.
This is even more, another industry failure
How many times does this have to happen before we get some industry reform that lets us do our jobs and build the secure reliable systems we have spent seven decades researching?
It's simple: the failure is not specific to the OS.
Crowdstrike runs on MacOS and Linux workstations too. And it's just as dangerous there; the big thread has stories of Crowdstrike breaking Linux systems in the past months.
Crowdstrike isn't needed by/for Windows, it's mandated by corporate and government bureaucracies, where it serves as a tool of employee control and a compliance checkbox to check.
That's why it makes no sense to blame Microsoft. If the world ran on Linux, ceteris paribus, Crowdstrike would be on those machines too, and would fuck them up just as badly globally.
the reforms you're hoping for are not going to happen until the countries with bad infosec get conquered by the countries with good infosec. that is going to be much worse than you can possibly imagine
it's not microsoft's fault at all; crowdstrike caused the same problem on debian systems three months ago. the only way it could be microsoft's fault is if letting users install kernel drivers is microsoft's fault
> he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist
Pretty ironic that Craig also suggests putting digital certificates on the blockchain, instead of having them issued by a computer or a cluster of computers in one place, which makes them a target. A peer-to-peer network is much more resilient to attacks.
Was the problem, or a part of the problem, that the root certificate of all computers was replaced by CrowdStrike? I didn't follow that part closely. If it was, then certificates registered on the blockchain might be a solution.
I'm totally not an expert in cryptography, but that seems plausible.
As a side note, it is only when I heard Craig Wright's explanation of the blockchain that the bitcoin whitepaper started making a lot of sense. He may not be the greatest of coders, but are mathematicians useless in security?
I don't mean to sound conspiratorial, but it's a little early to rule out malfeasance just because of Hanlon's Razor just yet. Most fuckups are not on a ridonkulous global scale. This is looking like the biggest one to date, the Y2K that wasn't.
We as a society need to start punishing incompetence the same way we punish malice.
Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people. Power allows people to distance themselves from the act. Distance should not affect the punishment.
> Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people.
Legal teams are way ahead of you here, by design. They’ve learned to factor out responsibility in such a diffuse manner the “indirectly” term loses nearly all its power the closer one gets to executive ranks and is a fig leaf when you reach C-levels and BoD levels. We couldn’t feasibly operate behemoth corporations otherwise in litigious environments.
I personally suspect we cannot enshrine earnestness and character as metrics and KPI’s, even indirectly. From personal experience selling into many client organizations and being able to watch them make some sausage, this is where leadership over managing becomes a key distinction, and organizational culture over a long period measured in generations of decisions and actions implemented by a group with mean workforce tenure measured in decades becomes one of the few reliable signals. Would love to hear others’ experiences and observations.
CrowdStrike does this trick where it replaces the file (being transferred over a network socket) with zeroes if it matches the malware signature. Assuming that these are the malware signature files themselves, a match wouldn't be surprising.
This actually makes the most sense, and would help explain how the error didn't occur during testing (in good faith, I assume it was tested).
In testing, the dev may have worked from their primary to deploy the update to a series of secondary drives, then sequentially performed a test boot from each secondary drive configured for each supported OS version. A shortcut/quick way of testing like that would've bypassed how their product updates in customer environments, also bypassing checks their software may have performed (in this case, overwriting their own file's contents).
Yes, this fits in with what I heard on the grapevine about this bug from a friend who knows someone working for Crowdstrike. The bug had been sitting there in the kernel driver for years before being triggered by this flawed data, which actually was added in a post-processing step of the configuration update - after it had been tested but before being copied to their update servers for clients to obtain.
Apparently, Crowdstrike's test setup was fine for this configuration data itself, but they didn't catch it before it was sent out in production, as they were testing the wrong thing. Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure, in whatever post-mortem writeup they may release.
"I heard on the grapevine from a friend who knows someone working for Crowdstrike" is perhaps not the most reliable source of information, due to the game of telephone effect if nothing else.
And post-processing can mean many things. It could be something relatively simple such as "testing passed, so let's mark the file with a version number and release it".
At last an explanation that makes a bit of sense to me.
>Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure
They probably needn't bother, every competent sysadmin from Greenland to New Zealand is probably disabling the autoupdate feature right now, firewalling it off and hatching a plan to get the product off their server estate ASAP.
Marketing budgets for competing product are going to get a bump this quarter probably.
I think Crowdstrike is due for more than just "owning up to it". I sincerely hope for a serious investigation, fines, and legal action, despite the "limited liability" clauses in their agreements.
Seriously, I don't even know how to do the math on the amount of damage this caused (not including the TIME wasted for businesses as well as ordinary people, for instance those taking flights)
There have to be consequences for this kind of negligence.
I thought windows required all kernel modules to be signed..? If there are multiple corrupt copies, rather than just some test escape, how could they have passed the signature verification and been loaded by the kernel?
It's possible that these aren't the original file contents, but rather the result of a manual attempt to stop the bleeding.
Someone may have hoped that overwriting the bad file with an all-0 file of the correct size would make the update benign.
Or following the "QA was bypassed because there was a critical vulnerability" hypothesis, stopping distribution of the real patch may be an attempt to reduce access to the real data and slow reverse-engineering of the vulnerability.
> We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.
I.e. only one link in the chain wasn't tested.
Sorry, but that will not do.
> We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.
The parent post did not suggest they don't test anything. It suggested they did not test the whole chain.
> Who is saying they don't have that? Who is saying it didn't pass all of that?
You're making tons of assumptions here.
I'm not sure: is having test servers that it passed any better than none at all?
> To emphasize, I'm really just thinking about the bad practices that were reported after the SolarWinds hack (password of "solarwinds123" and a bunch of other insider reports), so I can't say that totally applies to CrowdStrike, but in general I don't feel like these companies that can have such a catastrophic impact take appropriate care of their responsibilities.
- CrowdStrike
- SolarWinds
- Kaspersky
- https://en.wikipedia.org/wiki/John_McAfee
I'm sure I'm forgetting a few. Maybe it's the same self-selection scheme that afflicts the legal cannabis industry?
> Most other software doesn't need that low level of access, and even if it crashes, it doesn't take the whole system with it, and a quick, automated upgrade process is possible.
But I see it as the XKCD 2347 ("Dependency") problem.
> No corporate org is gonna bat an eye at Riot's anti-cheat practices, because they aren't installing LoL on their line of business machines anyway.
But if their business is incompatible with strict software whitelisting, their employees might...
george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd... https://en.wikipedia.org/wiki/George_Kurtz
crowdstrike itself caused the same problem on debian stable three months ago: https://old.reddit.com/r/debian/comments/1c8db7l/linuximage6...
it's horrifying that pci compliance regulations have injected crowdstrike (and antivirus) into virtually every aspect of today's it infrastructure
But that's government regulation for you.
Just the same week Kaspersky gets kicked out of the US market...
> How many times does this have to happen before we get some industry reform that lets us do our jobs and build the secure reliable systems we have spent seven decades researching?
1988 all over again again again
> intentionally causing harm
Maybe we need to criminalize groundless accusations too.
Yes.
But competence is marketed - with trade names like "Crowdstrike" and "Microsoft".
Finding a file full of zeroes on a broken computer does not mean it was shipped as all zeroes!
https://x.com/craiu/status/1814339965347610863
https://x.com/cyb3rops/status/1814329155833516492
https://www.crowdstrike.com/blog/tech-analysis-channel-file-...
https://cyberplace.social/@GossiTheDog/112812454405913406
(scroll down a little)
Most likely it is not loaded as a driver binary, but instead is some data file used by the CrowdStrike driver.