Ukv · a year ago
> In summary, it was the confluence of these issues that resulted in a system crash: [...] the lack of a specific test for non-wildcard matching criteria in the 21st field.

I feel they focus a lot on their content validator lacking a check to catch this specific error (probably since that sounds like a more understandable oversight) when the more glaring issue is that they didn't try actually running this template instance on even a single machine, which would've instantly revealed the issue.

Even for amateur software with no unit/integration tests, the developer will typically still have run it on their own machine to see it working. Here CrowdStrike seem to have been flying blind, just praying new template instances work if they pass the validation checks.

They do at least promise to "ensure that every new Template Instance is tested" further down.
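For illustration, the missing check could have been as small as an arity assertion in the content validator, something like this sketch (the delimiter and data layout are assumptions on my part, not CrowdStrike's actual channel file format):

    # Hypothetical sketch: reject a template instance whose field count
    # doesn't match what the sensor's content interpreter expects.
    EXPECTED_FIELD_COUNT = 21  # per the RCA, the IPC Template Type defines 21 inputs

    def validate_instance(raw_line: str) -> list[str]:
        fields = raw_line.split(",")  # assumed delimiter, illustrative only
        if len(fields) != EXPECTED_FIELD_COUNT:
            raise ValueError(
                f"expected {EXPECTED_FIELD_COUNT} fields, got {len(fields)}"
            )
        return fields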

grumple · a year ago
Absolutely. This is the number one issue I see causing problems with devs on my teams. It is extremely simple to test your damn work. Smoke test it. Make sure the damn machine boots. Make sure the app runs.

This is covered in part by a staged deployment... but that's just having your users test for you. Where's the automated integration test, or just the boot test?
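Even a crude automated boot test would do. A sketch of what I mean, against an entirely hypothetical VM harness (TestVM, the snapshot name, and its methods are made up for illustration):

    # Hypothetical boot test: deploy the candidate content to a throwaway VM,
    # reboot it, and fail the release if it never comes back.
    from my_vm_harness import TestVM  # hypothetical helper, not a real library

    def test_machine_boots_with_new_content(candidate_file: str) -> None:
        vm = TestVM.from_snapshot("win10-with-sensor")  # hypothetical snapshot
        vm.copy(candidate_file, r"C:\Windows\System32\drivers\CrowdStrike")
        vm.reboot()
        assert vm.wait_for_boot(timeout_s=300), "machine never came back up"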

teyc · a year ago
It doesn't even cover the barest organisational root cause. How are they planning to do defense in depth and prevent any internal threat actor from wedging every machine in the world?
zer00eyz · a year ago
CrowdStrike takes itself seriously, for a security company. That means don't ask questions of the experts.

Everyone else sees these services as the patsy when the problem happens.

From a technical perspective it's a hot mess (you are spot on). But business says "everything is fine, this is fine, carry on", because it meets their goal of CYA.

mrguyorama · a year ago
That's a lot of words to say "We did not test a file that gets ingested by a kernel-level program, not even once".

At no point did they deploy this file to a computer they owned and attempt to boot it. They purposely decided to deploy behavior to every computer they could without even once making sure it wouldn't break from something stupid.

Are these people fucking nuts?

I do more testing than this and I might be incompetent. Also nothing I touch will kill millions of PCs. I get having pressure put on you from above, I get being encouraged to cut corners so some shithead can check off a box on his yearly review and make more money while stiffing you on your raise, I get making mistakes.

But like, fuck man, come on.

pjsg · a year ago
I think it is worse than that. When I make a change to some code or config, I'll run it locally to make sure that the change has the effect that I want. I know that we are human and that bugs occasionally appear in code. But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.

I've made changes on personal projects that I thought were simple, and yet broke stuff. But CrowdStrike is a multi-billion dollar company -- how can it be possible to have such a broken process? Their RCA document was interesting, but didn't cover any of the interesting issues. It seems that they don't know about the 5 Whys process (https://en.wikipedia.org/wiki/Five_whys) or decided that those answers were so embarrassing that they had to omit them.

darylteo · a year ago
> When I make a change to some code or config, I'll run it locally to make sure that the change has the effect that I want.

It's not uncommon for devs to be working against outdated databases / config dumps. Certainly bad practice, but when devs have the option of being lazy vs. doing chores, they will pick the path of least resistance.

> But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.

We're assuming that the person who changed the code also made the choice to initiate the rollout. They are two separate actions which can be taken by separate individuals, and could involve many steps in between, each undertaken by a separate individual as well.

Distance from Prod does introduce a sense of malaise and complacency, I've found.

darylteo · a year ago
The whole thing smells of siloed-teams syndrome.

Team 1 tells Team 2 that the schema is updating.

Team 2 updates their schema.

Team 2 tests against the updated schema.

All green in test.

Team 1 doesn't actually follow the schema.

Deployment fails.

---

It's really hard to assign blame, but I'd put more blame on Team 2 for not being defensive enough with their inputs.

As we all know, there are greater issues with their deployment pipelines (lack of canaries, phased rollouts, etc.), but no point going over those in this context.
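If I were Team 2, I'd also want a contract test that runs against whatever artifact Team 1 actually publishes, not against a hand-written fixture of the agreed schema. Rough sketch, with a made-up endpoint and field set:

    # Hypothetical contract test: validate the artifact the other team actually
    # ships, not a local copy of what the schema document says it should be.
    import json, urllib.request

    AGREED_FIELDS = {"id", "pattern", "severity"}  # illustrative stand-in schema

    def test_upstream_artifact_matches_contract():
        # made-up internal endpoint where Team 1 publishes its real output
        with urllib.request.urlopen("https://content.example.internal/latest.json") as resp:
            artifact = json.load(resp)
        for record in artifact["records"]:
            missing = AGREED_FIELDS - record.keys()
            assert not missing, f"record {record.get('id')} is missing {missing}"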

tantalor · a year ago
Leeeroy Jenkiiiins!
ivanjermakov · a year ago
They should've read "parse, not validate": https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
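The gist, sketched against a made-up comma-separated format (not the actual channel file layout): parse the raw content into a typed value once, at the boundary, so a missing field cannot survive past that point.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class MatchCriteria:
        fields: tuple[str, ...]  # exactly the arity the interpreter expects

    def parse_criteria(raw: str, expected: int = 21) -> MatchCriteria:
        parts = tuple(raw.split(","))  # assumed delimiter, illustrative only
        if len(parts) != expected:
            raise ValueError(f"expected {expected} fields, got {len(parts)}")
        return MatchCriteria(parts)

    # Downstream code only ever receives a MatchCriteria, never the raw string,
    # so an out-of-bounds read of a "21st field" can't happen once parsing succeeds.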
thegreenroom · a year ago
Thanks, that was a good read.
kiririn · a year ago
A lot of mitigation actions, but nothing to really stop it happening again: a fail-safe system in their boot-start driver. Bad programming and QA caused the issue, but bad design allowed it to happen.
simiones · a year ago
I think the QA issues are by far the most important part. A security component of this type, by its nature, has to be able to prevent your computer from doing anything at all, since any part of userspace (at least) could be compromised.

The "fail safe" for a security component is in fact to prevent any user space code from running at all - better that than having it actively harm other systems, exfiltrate data, destroy connected hardware etc. So, no amount of clever design can prevent the CrowdStrike sensor from nuking your system if bad security rules get deployed.

For example, if a bad definition file makes it think that the legit libc or win32 libraries are compromised, it should prevent any userspace program from running, which is just as destructive as failing during boot.

That is why appropriate QA is critical for this type of program. I would expect any definition update of any kind to be tested on dozens of systems with a wide variety of Windows configurations and known-good software well before ever being deployed to any customer system. It seems that CrowdStrike thought the exact opposite of this, and in fact their customers were the first ever to run their new code end-to-end, not the last...
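The gate itself is not much code; the hard part is owning the fleet of test machines. A rough sketch, where the harness hook and the build/version lists are placeholders of my own:

    # Hypothetical pre-release matrix: run the candidate content end-to-end on a
    # spread of configurations before any customer sees it.
    import itertools

    WINDOWS_BUILDS  = ["Win10-22H2", "Win11-23H2", "Server2019", "Server2022"]
    SENSOR_VERSIONS = ["7.15", "7.16"]  # illustrative version strings

    def release_gate(candidate_file, run_end_to_end) -> bool:
        # run_end_to_end is a hypothetical harness hook: deploys the candidate to a
        # machine with the given build/sensor combo and returns True if it boots
        # and the sensor loads the content cleanly.
        combos = itertools.product(WINDOWS_BUILDS, SENSOR_VERSIONS)
        return all(run_end_to_end(build, sensor, candidate_file)
                   for build, sensor in combos)  # any single failure blocks the release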

acdha · a year ago
> So, no amount of clever design can prevent the CrowdStrike sensor from nuking your system if bad security rules get deployed.

This is too binary a way to think about a complex system. Availability is also a security goal so we shouldn’t cavalierly trade it for minor risks which are mostly edge cases.

For example, say that the fail-safe was an old, old idea: keep the second most recent version, and if the system fails to start or crashes repeatedly, automatically roll back to the last known good version. That turns this kind of problem into at most a reboot (a huge win every customer would have taken), and the only case where it introduces a vulnerability is an active attack that only the latest rules will block, and which is so virulent that the number of systems it reaches approximates the number affected by a bad update. That's an unlikely set of events, especially because there's a really tight window in which such a fast-spreading attack wouldn't already have compromised the host before CrowdStrike could ship the update.
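A sketch of that last-known-good idea in Python (paths, thresholds, and the whole mechanism are hypothetical, not how the sensor actually works):

    # Hypothetical boot-time guard: if the sensor has crashed repeatedly since the
    # last content update, fall back to the previous known-good version.
    import json, pathlib, shutil

    STATE        = pathlib.Path("C:/ProgramData/sensor/boot_state.json")  # made-up path
    MAX_FAILURES = 2

    def select_content(current: pathlib.Path, last_known_good: pathlib.Path) -> pathlib.Path:
        state = json.loads(STATE.read_text()) if STATE.exists() else {"failures": 0}
        if state["failures"] >= MAX_FAILURES and last_known_good.exists():
            shutil.copy(last_known_good, current)  # roll back to the previous version
            state["failures"] = 0
        else:
            state["failures"] += 1  # reset elsewhere once a boot completes cleanly
        STATE.write_text(json.dumps(state))
        return current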

Another variation of that idea: any time the system fails to start repeatedly, the service blocks processes other than its updater, so normal apps aren't exposed as potential vectors but the system can self-heal in most cases.

l00tr · a year ago
Famous Windows "guru" Alex Ionescu was their main kernel architect for a long time; funny that he hasn't commented at all on this failure.
Terretta · a year ago
Add a new threat actor to the list, those pesky parameter counts actively trying to evade detection:

"This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field."

Curious that csagent.sys isn't mentioned until the last page, p. 12:

"csagent.sys is CrowdStrike’s file system filter driver, a type of kernel driver that registers with components of the Windows operating system…"

darylteo · a year ago
Well I guess I should post the obligatory

> Some people, when confronted with a problem, think

> “I know, I’ll use regular expressions.”

> Now they have two problems.