Readit News
AndrewKemendo · 2 years ago
This should not have passed a competent CI pipeline for a system in the critical path.

I’m not even particularly stringent when it comes to automated testing across the board, but for a system of this level of criticality you need exceptionally good state management.

To the point where you should not roll to production without an integration test on every environment that you claim to support

Like it’s insane to me that a company of this size and criticality doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is running stuff over there - total incompetence

martinky24 · 2 years ago
A lot of assumptions here that probably aren't worth making without more info -- for example, it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN" code or something, at which point it would have passed a lot of checks before the failure.

We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

EvanAnderson · 2 years ago
I haven't seen the file, but surely each build artifact should be signed and verified when it's loaded by the client. The failure mode of bit rot / malice in the CDN should be handled.
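To make that concrete, here is a minimal sketch of such a load-time check in Python. The file names are hypothetical and nothing here reflects the real channel-file mechanism; a real agent would verify an asymmetric signature rather than a bare digest, but the gate looks the same:

    import hashlib
    import sys
    from pathlib import Path

    def load_verified(data_path: Path, digest_path: Path) -> bytes:
        """Refuse to load a content file whose SHA-256 doesn't match the published digest."""
        data = data_path.read_bytes()
        expected = digest_path.read_text().strip().lower()
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected:
            # Bit rot or tampering between build and client: keep the last good file.
            raise ValueError(f"digest mismatch for {data_path}")
        return data

    if __name__ == "__main__":
        try:
            load_verified(Path("channel-291.bin"), Path("channel-291.bin.sha256"))
        except (OSError, ValueError) as err:
            print(f"refusing to load update: {err}", file=sys.stderr)
            sys.exit(1)

An all-zero payload fails this check and never reaches the parser.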
chrisjj · 2 years ago
> it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN code" or something

I.e. only one link in the chain wasn't tested.

Sorry, but that will not do.

> We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

The parent post did not suggest they don't test anything. It suggested they did not test the whole chain.

arp242 · 2 years ago
> Like it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is saying they don't have that? Who is saying it didn't pass all of that?

You're making tons of assumptions here.

JKCalhoun · 2 years ago
To be sure. But the fact is the release broke.

I'm not sure: is having test servers that the release passed any better than having none at all?

martinky24 · 2 years ago
Yeah... the comment above reads like someone who has read a lot of books on CI deployment, but has zero experience in a real world environment actually doing it. Quick to throw stones with absolutely no understanding of any of the nuances involved.
ikiris · 2 years ago
Dude, the fact that it breaks directly speaks for itself.

You sound like the guy that a few years ago tried to argue (the company in question) tested OS code that didn't include any drivers for their gear's local storage. It's obvious it wasn't, to anyone competent.

carterschonwald · 2 years ago
The strange thing is that when I interviewed there years ago with the team that owns the language that runs in the kernel, they said their CI has 20k or 40k machine/OS combinations/configurations. Surely some of them were vanilla Windows!
dboreham · 2 years ago
They used synthetic test data in CI that doesn't consist of zeros.
dagaci · 2 years ago
/* Acceptance criteria #1: do not allow machine to boot if invalid data signatures are present, this could indicate a compromised system. Booting could cause the president's diary to transmit to rival 'Country' of the week */

if(dataFileIsNotValid) { throw FatalKernelException("All your base are compromised"); }

EDIT+ Explanation:

With hindsight, not booting may be exactly the right thing to do, since a bad data file would indicate a compromised distribution/network.

The machines should not fully boot until a file with a valid signature is downloaded.

hnlmorg · 2 years ago
It seems unlikely that a file entirely full of null characters was the output of any automated build pipeline. So I’d wager something got built, passed the CI tests, then the system broke at some point after that when the file was copied ready for deployment.

But at this stage, all we are doing is speculating.

russdill · 2 years ago
You can have all the CI, staging, test, etc. If some bug after that process nulls the file, the rest doesn't matter
fabian2k · 2 years ago
Those signature files should have a checksum, or even a digital signature. I mean even if it doesn't crash the entire computer, a flipped bit in there could still turn the entire thing against a harmless component of the system and lead to the same result.
LorenPechtel · 2 years ago
Yup. I had quite a battle with some sort of system bug (never fully traced) where I wrote valid data but what ended up on disk was all zero. It appeared to involve corrupted packets being accepted as valid.

It doesn't matter how much you test if something down the line zeroes out your stuff.

Jtsummers · 2 years ago
If a garbage file is pushed out, the program could have handled it by ignoring it. In this case, it did not and now we're (the collective IT industry) dealing with the consequences of one company that can't be bothered to validate its input (they aren't the only ones, but this is a particularly catastrophic demonstration of the importance of input validation).
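As an illustration of that kind of input validation, here is a toy Python sketch using a made-up format (4 magic bytes, a little-endian record count, fixed-size records). It bears no relation to the real channel-file layout; it only shows how a garbage file gets ignored instead of crashing anything:

    import struct

    MAGIC = b"CHNL"       # hypothetical magic bytes
    RECORD_SIZE = 16      # hypothetical fixed record size

    def parse_channel_file(blob: bytes):
        """Return the records, or None if the file isn't something we understand."""
        if len(blob) < 8 or blob[:4] != MAGIC:
            return None  # wrong magic -- ignore the file, keep the last good config
        (count,) = struct.unpack_from("<I", blob, 4)
        body = blob[8:]
        if count * RECORD_SIZE != len(body):
            return None  # truncated or padded -- ignore it
        return [body[i * RECORD_SIZE:(i + 1) * RECORD_SIZE] for i in range(count)]

    # A file full of nulls fails the magic check and is simply skipped:
    assert parse_channel_file(b"\x00" * 4096) is None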
Cerium · 2 years ago
What sort of sane system modifies the build output after testing?

Our release process is more like: build and package, sign package, run CI tests on signed package, run manual tests on signed package, release signed package. The deployment process should check those signatures. A test process should by design be able to detect any copy errors between test and release in a safe way.
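A minimal sketch of that last point - releasing exactly the bytes you tested - might look like this in Python (paths and names are made up): CI records the hash of the signed package that passed tests, and the deploy step refuses to publish anything else.

    import hashlib
    import sys
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def gate_release(artifact: Path, tested_digest_file: Path) -> None:
        """Abort the deploy if the artifact differs from the one that passed tests."""
        tested = tested_digest_file.read_text().strip()
        candidate = sha256_of(artifact)
        if candidate != tested:
            sys.exit(f"refusing to release {artifact}: hash does not match the tested build")

    if __name__ == "__main__":
        gate_release(Path("update-package.signed"), Path("tested.sha256"))

Any copy error between test and release then fails loudly instead of shipping.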

jononor · 2 years ago
The issue is not that a file with nulls was produced. It is that an invalid file (of any kind) can trigger a blue screen of death.
0xcafecafe · 2 years ago
They could even have done slow rollouts. Roll it out to a geographical region and wait an hour or so before deploying elsewhere.
saati · 2 years ago
In theory CrowdStrike protects you from threats, leaving regions unprotected for an hour would be an issue.
xyst · 2 years ago
Or test in local environments first. Slow rollouts like this tend to make deployments very very painful.
daseiner1 · 2 years ago
You say *even* (emphasis mine). Is this not industry standard?
miki123211 · 2 years ago
Keep in mind that this was probably a data file, not necessarily a code file.

It's possible that they run tests on new commits, but not when some other, external, non-git system pushes out new data.

Team A thinks that "obviously the driver developers are going to write it defensively and protect it against malformed data", team B thinks "obviously all this data comes from us, so we never have to worry about it being malformed"

I don't have any non-public info about what actually happened, but something along these lines seems to be the most likely hypothesis to me.

Edit: Now what would have helped here is a "staged rollout" process with some telemetry. Push the update to 0.01% of your users and solicit acknowledgments after 15 minutes. If the vast majority of systems are still alive and haven't been restarted, keep increasing the threshold. If, at any point, too many of the updated systems stop responding or indicate a failure, immediately stop the rollout, page your on-call engineers and give them a one-click process to completely roll the update back, even for already-updated clients.

This is exactly the kind of issue that non-invasive, completely anonymous, opt-out telemetry would have solved.
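A toy sketch of that rollout loop, with the fleet, stage sizes, and health check all invented; the point is only the shape - widen while the canaries stay healthy, halt and roll back the moment they don't:

    import random

    def healthy(host: str) -> bool:
        # Stand-in for "did this host phone home within 15 minutes of updating?"
        return random.random() > 0.001

    def staged_rollout(fleet, stages=(0.0001, 0.001, 0.01, 0.1, 1.0), min_success=0.99):
        updated = set()
        for fraction in stages:
            batch = fleet[: max(1, int(len(fleet) * fraction))]
            updated.update(batch)
            alive = sum(healthy(h) for h in updated)
            if alive / len(updated) < min_success:
                print(f"halting at {fraction:.2%}: rolling back {len(updated)} hosts")
                return False
            print(f"stage {fraction:.2%} ok ({alive}/{len(updated)} healthy)")
        return True

    if __name__ == "__main__":
        staged_rollout([f"host-{i}" for i in range(100_000)])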

adzm · 2 years ago
This was a .dll in all but name fwiw.
ar_lan · 2 years ago
> tests all of the possible target images that they claim to support.

Or even at the very least the most popular OS that they support. I'm genuinely imagining right now that for this component, the entirety of the company does not have a single Windows machine they run tests on.

tinytime · 2 years ago
It's wild that I'm out here boosting existing unit testing practices with mutation testing https://github.com/codeintegrity-ai/mutahunter and there are folks out there that don't even do basic testing.
sonotathrowaway · 2 years ago
That’s not even getting into the fuckups that must have happened to allow a bad patch to get rolled out everywhere all at once.
notabee · 2 years ago
Without delving into any kind of specific conspiratorial thinking, I think people should also include the possibility that this was malicious. It's much more likely to be incompetence and hubris, but ever since I found out that this is basically an authorized rootkit, I've been concerned about what happens if another Solarwinds incident occurs with Crowdstrike or another such tool. And either way, we have the answer to that question now: it has extreme consequences. We really need to end this blind checkbox compliance culture and start doing real security.
dheera · 2 years ago
I don't know if people on Microsoft ecosystems even know what CI pipelines are.

Linux and Unix ecosystems in general work by people thoroughly testing and taking responsibility for their work.

Windows ecosystems work by blame passing. Blame Ron, the IT guy. Blame Windows Update. Blame Microsoft. That's how stuff works.

It has always worked this way.

But also, all the good devs got offered 3X the salary at Google, Meta, and Apple. Have you ever applied for a job at CrowdStrike? No? That's why they suck.

* A disproportionately large number of Windows IT guys are named Ron, in my experience.

kabdib · 2 years ago
That's a pretty broad brush.
hn_throwaway_99 · 2 years ago
On a related note, I don't think that it's a coincidence that 2 of the largest tech meltdowns in history (this one and the SolarWinds hack from a few years ago) were both the result of "security software" run amok. (Also sad that both of these companies are based in Austin, which certainly gives Austin's tech scene a black eye).

IMO, I think a root cause issue is that the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture. For example, I can't speak so much for CrowdStrike, but it came out that SolarWinds had an egregiously bad security culture at their company. When the root cause of this issue comes out, dollars to donuts it was just a fast-and-loose deployment process.

cedws · 2 years ago
> the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture

I disagree, security companies suffer from "too big to fail" syndrome where the money comes easy because they have customers who want to check a box. Security is NOT a product you pay for, it's a culture that takes active effort and hard work to embed from day one. There's no product on the market that can provide security, only products to point a finger at when things go wrong.

Andrex · 2 years ago
The market is crying for some kind of '10s "agile hype" equivalent for security evangelism and processes.
moandcompany · 2 years ago
Crowdstrike was originally founded and headquartered in Irvine, CA (Southern California). In those days, most of its engineering organization was either remote/WFH or in Irvine, CA.

As they got larger, they added a Sunnyvale office, and later moved the official headquarters to Austin, TX.

They've also been expanding their engineering operations overseas which likely includes offshoring in the last few years.

nullify88 · 2 years ago
They bought out Humio in Aarhus, Denmark. Now Falcon Logscale.
NegativeK · 2 years ago
Alternate hypothesis that's not necessarily mutually exclusive: security software tends to need significant and widespread access. That means that fuckups and compromises tend to be more impactful.
hn_throwaway_99 · 2 years ago
100% agree with that. The thing that baffles me a bit, then, is that if you are writing software that is so critical and can have such a catastrophic impact when things go wrong, you double- and triple-check everything you do - what you DON'T do is use the same level of care you may use with some social media CRUD app (move fast and break things and all that...)

To emphasize, I'm really just thinking about the bad practices that were reported after the SolarWinds hack (password of "solarwinds123" and a bunch of other insider reports), so I can't say that totally applies to CrowdStrike, but in general I don't feel like these companies that can have such a catastrophic impact take appropriate care of their responsibilities.

worstspotgain · 2 years ago
The industry's track record is indeed quite peculiar. Just off the top of my head:

- CrowdStrike

- SolarWinds

- Kaspersky

- https://en.wikipedia.org/wiki/John_McAfee

I'm sure I'm forgetting a few. Maybe it's the same self-selection scheme that afflicts the legal cannabis industry?

ajsnigrutin · 2 years ago
Security software needs kernel-level access. If something breaks, you get boot loops and crashes.

Most other software doesn't need that low level of access, and even if it crashes, it doesn't take the whole system with it, and a quick, automated upgrade process is possible.

rahkiin · 2 years ago
Security software needs kernel-level access... *on Windows*. macOS has an Endpoint Security userland extension API.
koliber · 2 years ago
Don't forget Heartbleed, a vulnerability in OpenSSL, the software that secures pretty much everything.
Lennie · 2 years ago
I don't see that as a business problem directly.

But I see it as the XKCD 2347 problem.

compacct27 · 2 years ago
The Austin tech culture is…interesting. I stopped trying to find a job here and went remote Bay Area, and talking to tech workers in the area gave me the impression it’s a mix of slacker culture and hype chasing. After moving back here, tech talent seems like a game of telephone, and we’re several jumps past the original.

When I heard CrowdStrike was here, it just kinda made sense

heraldgeezer · 2 years ago
I would add Kaseya VSA Ransomware attack in 2021 to that list.
0cf8612b2e1e · 2 years ago
On the plus side of this disaster, I am holding out some pico-sized hope that maybe organizations will rethink kernel level access. No, random gaming company, you are not good enough to write kernel level anti cheat software.
majormajor · 2 years ago
I can't imagine gaming software being affected at all, unless MS does a ton of cracking down (and would still probably give hooks for gaming since they have gaming companies in their umbrella).

No corporate org is gonna bat an eye at Riot's anti-cheat practices, because they aren't installing LoL on their line of business machines anyway.

minetest2048 · 2 years ago
Until malware brings its own compromised signed anti-cheat driver, like what happened with the Genshin Impact anti-cheat mhyprot2.
InitialLastName · 2 years ago
Right, MS just paid $75e9 for a company whose main products are competitive multiplayer games. They are never going to be incentivized to compromise that sector by limiting what anti-cheat they can do.
tgsovlerkhgsel · 2 years ago
> because they aren't installing LoL on their line of business machines anyway

But if their business is incompatible with strict software whitelisting, their employees might...

pityJuke · 2 years ago
The problem you’re fighting is cheat customers who go “random kernel-level driver? no problem!”
pvillano · 2 years ago
imo anti-cheat should mostly be server-side behavior based
gruez · 2 years ago
How are you going to catch wallhackers that aren't blatantly obvious?
frizlab · 2 years ago
Unless I’m mistaken, on macOS at least kernel access is just not possible, so at least there's that.
heraldgeezer · 2 years ago
And Valorant, CS2, and CoD don't run on macOS due to anti-cheat :)
kragen · 2 years ago
this seems like the second or third test file any qa person would have tried, after an empty file and maybe a minimal valid file. the level of pervasive incompetence implied here is staggering

in a market where companies compete by impressing nontechnical upper management with presentations, it should be no surprise that technically competent companies have no advantage over incompetent ones

i recently read through the craig wright decision https://www.judiciary.uk/judgments/copa-v-wright/ (the guy who fraudulently claimed to be satoshi nakamoto) and he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist (decompiling malware to c); he didn't know what 'unsigned' meant when questioned on the witness stand. he'd been doing infosec work for big companies going back to the 90s. he'd apparently been tricking people with technobabble and rigged demos and forged documents for his entire career

george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd... https://en.wikipedia.org/wiki/George_Kurtz

crowdstrike itself caused the same problem on debian stable three months ago: https://old.reddit.com/r/debian/comments/1c8db7l/linuximage6...

it's horrifying that pci compliance regulations have injected crowdstrike (and antivirus) into virtually every aspect of today's it infrastructure

GardenLetter27 · 2 years ago
Also ironic that the compliance ended up introducing the biggest vulnerability as a massive single point of failure.

But that's government regulation for you.

kragen · 2 years ago
pci-dss is not government regulation but it might as well be; it's a collusion between visa, mastercard, american express, discover, and jcb to prevent them from having different data security standards (and therefore being able to compete on security)
moandcompany · 2 years ago
It's definitely ironic, and compatible with the security engineering world joke that the most secure system is one that cannot be accessed or used at all.

I suppose one way to "stop breaches" is to shut down every host entirely.

In the military world, there is a concept of an "Alpha Strike" which generally relates to a fast-enough and strong-enough first-strike that is sufficient to disable the adversary's ability to respond or fight back (e.g. taking down an entire fleet at once). Perhaps people that have been burned by this event will start calling it a Crowdstrike.

bandyaboot · 2 years ago
What’s frustrating is that after falsely attributing pci to government regulation and being corrected, you’re probably not going to re-examine the thought process that led you to that false belief.
phatfish · 2 years ago
It seems government IT systems in general fared pretty well the last 12 hrs, but loads of large private companies were effectively taken offline, so there's that.
babypuncher · 2 years ago
this had nothing to do with government regulation, thank private sector insurance companies.
acdha · 2 years ago
> But that's government regulation for you.

You misspelled “private sector”. Use of endpoint monitoring software is coming out of private auditing companies driven by things like PCI or insurers’ requirements – almost nobody wants to pay for highly-skilled security people so they’re outsourcing it to the big auditing companies and checklists so that if they get sued they can say they were following industry practices and the audit firms okayed it.

baxtr · 2 years ago
I think the worst part of the incident is that state actors now have a clear blueprint for a large scale infrastructure attack.
IAmNotACellist · 2 years ago
I can think of a lot better things to put in a kernel-level driver installed on every critical computer ever than a bunch of 0s.
Guthur · 2 years ago
You're assuming it wasn't an attack.

Just the same week Kaspersky gets kicked from the us market...

worik · 2 years ago
> george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd...

I find it amusing that the people commenting on that link are offended this is called a "Microsoft" outage, when it is "Crowdstrike's fault".

This is just as much a Microsoft failure.

More than that, this is another industry failure

How many times does this have to happen before we get some industry reform that lets us do our jobs and build the secure reliable systems we have spent seven decades researching?

1988 all over again again again

TeMPOraL · 2 years ago
It's simple: the failure is not specific to the OS.

Crowdstrike runs on MacOS and Linux workstations too. And it's just as dangerous there; the big thread has stories of Crowdstrike breaking Linux systems in the past months.

Crowdstrike isn't needed by/for Windows, it's mandated by corporate and government bureaucracies, where it serves as a tool of employee control and a compliance checkbox to check.

That's why it makes no sense to blame Microsoft. If the world ran on Linux, ceteris paribus, Crowdstrike would be on those machines too, and would fuck them up just as badly globally.

kragen · 2 years ago
the reforms you're hoping for are not going to happen until the countries with bad infosec get conquered by the countries with good infosec. that is going to be much worse than you can possibly imagine

it's not microsoft's fault at all; crowdstrike caused the same problem on debian systems three months ago. the only way it could be microsoft's fault is if letting users install kernel drivers is microsoft's fault

emporas · 2 years ago
> he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist

Pretty ironic that Craig also suggests putting digital certificates on the blockchain instead of having them issued by a computer or a cluster of computers in one place, which makes them a target. A peer-to-peer network is much more resilient to attacks.

Was the problem, or a part of the problem, that the root certificate of all computers was replaced by CrowdStrike? I didn't follow that part closely. If it was, then certificates registered on the blockchain might be a solution.

I'm totally not an expert in cryptography, but that seems plausible.

As a side note, it is only when I heard Craig Wright's explanation of the blockchain that the bitcoin whitepaper started making a lot of sense. He may not be the greatest of coders, but are mathematicians useless in security?

kragen · 2 years ago
none of your comment makes any sense. possibly it was generated by a large language model with no understanding of what is being talked about
worstspotgain · 2 years ago
I don't mean to sound conspiratorial, but it's a little early to rule out malfeasance on Hanlon's Razor grounds just yet. Most fuckups are not on a ridonkulous global scale. This is looking like the biggest one to date, the Y2K that wasn't.
martin-t · 2 years ago
We as a society need to start punishing incompetence the same way we punish malice.

Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people. Power allows people to distance themselves from the act. Distance should not affect the punishment.

yourapostasy · 2 years ago
> Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people.

Legal teams are way ahead of you here, by design. They’ve learned to factor out responsibility in such a diffuse manner that the “indirectly” term loses nearly all its power the closer one gets to executive ranks, and is a fig leaf by the time you reach C-levels and BoD levels. We couldn’t feasibly operate behemoth corporations otherwise in litigious environments.

I personally suspect we cannot enshrine earnestness and character as metrics and KPIs, even indirectly. From personal experience selling into many client organizations and being able to watch them make some sausage, this is where leadership over managing becomes a key distinction, and organizational culture - built over a long period, measured in generations of decisions and actions implemented by a group with mean workforce tenure measured in decades - becomes one of the few reliable signals. Would love to hear others’ experiences and observations.

Deleted Comment

chuckadams · 2 years ago
So which individual engineer are you locking up and telling their kids that they don't get to see their mommy or daddy again?

> intentionally causing harm

Maybe we need to criminalize groundless accusations too.

worik · 2 years ago
> We as a society need to start punishing incompetence the same way we punish malice.

Yes

But competence is marketed

The trade names like "Crowdstrike" and "Microsoft"

snowwrestler · 2 years ago
I have flagged this because it is wrong, and there is no other way to kick it off the front page.

Finding a file full of zeroes on a broken computer does not mean it was shipped as all zeroes!

https://x.com/craiu/status/1814339965347610863

https://x.com/cyb3rops/status/1814329155833516492

execveat · 2 years ago
CrowdStrike does this trick where it replaces the file (being transferred over a network socket) with zeroes if it matches the malware signature. Assuming that these are the malware signature files themselves, a match wouldn't be surprising.
tsavo · 2 years ago
This actually makes the most sense, and would help explain how the error didn't occur during testing (in good faith, I assume it was tested).

In testing, the dev may have worked from their primary to deploy the update to a series of secondary drives, then sequentially performed a test boot from each secondary drive configured for each supported OS version. A shortcut/quick way of testing like that would've bypassed how their product updates in customer environments, also bypassing checks their software may have performed (in this case, overwriting their own file's contents).

bombcar · 2 years ago
CrowdStrike foot-gunning itself would be amusing, if expected.
dang · 2 years ago
Ah thanks. I've made the title questionable now.
snowwrestler · 2 years ago
Finally confirmed by CrowdStrike themselves:

https://www.crowdstrike.com/blog/tech-analysis-channel-file-...

millero · 2 years ago
Yes, this fits in with what I heard on the grapevine about this bug from a friend who knows someone working for Crowdstrike. The bug had been sitting there in the kernel driver for years before being triggered by this flawed data, which actually was added in a post-processing step of the configuration update - after it had been tested but before being copied to their update servers for clients to obtain.

Apparently, Crowdstrike's test setup was fine for the configuration data itself, but they didn't catch the corruption before it was sent out to production, as they were testing the wrong thing. Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure, in whatever post-mortem writeup they may release.

finaard · 2 years ago
[flagged]
dist-epoch · 2 years ago
"We need to ship this by Friday. Just add a quick post-processing step, and we'll fix it next week properly" - how these things tend to happen.
arp242 · 2 years ago
"I heard on the grapevine from a friend who knows someone working for Crowdstrike" is perhaps not the most reliable source of information, due to the game of telephone effect if nothing else.

And post-processing can mean many things. Could be something relatively simple such as "testing passed, so let's mark the file with a version number and release it".

dgfitz · 2 years ago
Hmm, I post-process autonomous vehicle logs probably daily.

Why is this stupid? It’s pretty useful to see a graph of coolant temp vs ambient temp vs motor speed vs roll/pitch.

I must be especially stupid I suppose. Nuts.

machine_coffee · 2 years ago
At last an explanation that makes a bit of sense to me.

>Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure

They probably needn't bother, every competent sysadmin from Greenland to New Zealand is probably disabling the autoupdate feature right now, firewalling it off and hatching a plan to get the product off their server estate ASAP.

Marketing budgets for competing products are going to get a bump this quarter, probably.

swat535 · 2 years ago
I think Crowdstrike is due for more than just "owning up to it". I sincerely hope for serious investigation, fines and legal pursuits despite having "limited liabilities" in their agreements.

Seriously, I don't even know how to do the math on the amount of damage this caused (not including the TIME wasted for businesses as well as ordinary people, for instance those taking flights)

There have to be consequences for this kind of negligence

neffy · 2 years ago
There's a claim over on Mastodon from Kevin Beaumont that the file is different for every customer he's received it from.

https://cyberplace.social/@GossiTheDog/112812454405913406

(scroll down a little)

drewg123 · 2 years ago
I thought Windows required all kernel modules to be signed? If there are multiple corrupt copies, rather than just some test escape, how could they have passed the signature verification and been loaded by the kernel?
dist-epoch · 2 years ago
This is not even a valid executable.

Most likely is not loaded as a driver binary, but instead is some data file used by the CrowdStrike driver.

j-wags · 2 years ago
It's possible that these aren't the original file contents, but rather the result of a manual attempt to stop the bleeding.

Someone may have hoped that overwriting the bad file with an all-0 file of the correct size would make the update benign.

Or following the "QA was bypassed because there was a critical vulnerability" hypothesis, stopping distribution of the real patch may be an attempt to reduce access to the real data and slow reverse-engineering of the vulnerability.