Readit News
AndrewKemendo · 2 years ago
This should not have passed a competent CI pipeline for a system in the critical path.

I’m not even particularly stringent when it comes to automated testing across the board, but for a system of this level of criticality you need exceptionally good state management.

To the point where you should not roll to production without an integration test on every environment that you claim to support

Like it’s insane to me that a company of this size and criticality doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is running stuff over there - total incompetence

martinky24 · 2 years ago
A lot of assumptions here that probably aren't worth making without more info -- for example, it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN" code or something, at which point it would have passed a lot of checks before the failure.

We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

EvanAnderson · 2 years ago
I haven't seen the file, but surely each build artifact should be signed and verified when it's loaded by the client. The failure mode of bit rot / malice in the CDN should be handled.
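To make that concrete, here is a minimal sketch of such a load-time check in Python. The file names are hypothetical and nothing here reflects the real channel-file mechanism; a real agent would verify an asymmetric signature rather than a bare digest, but the gate looks the same:

    import hashlib
    import sys
    from pathlib import Path

    def load_verified(data_path: Path, digest_path: Path) -> bytes:
        """Refuse to load a content file whose SHA-256 doesn't match the published digest."""
        data = data_path.read_bytes()
        expected = digest_path.read_text().strip().lower()
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected:
            # Bit rot or tampering between build and client: keep the last good file.
            raise ValueError(f"digest mismatch for {data_path}")
        return data

    if __name__ == "__main__":
        try:
            load_verified(Path("channel-291.bin"), Path("channel-291.bin.sha256"))
        except (OSError, ValueError) as err:
            print(f"refusing to load update: {err}", file=sys.stderr)
            sys.exit(1)

An all-zero payload fails this check and never reaches the parser.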
chrisjj · 2 years ago
> it could certainly be the case that there was a "real" file that worked and the bug was in the "upload verified artifact to CDN code" or something

I.e. only one link in the chain wasn't tested.

Sorry, but that will not do.

> We don't have the answers, but I'm not in a rush to assume that they don't test anything they put out at all on Windows.

The parent post did not suggest they don't test anything. It suggested they did not test the whole chain.

arp242 · 2 years ago
> Like it’s insane to me that this size and criticality of a company doesn’t have a staging or even a development test server that tests all of the possible target images that they claim to support.

Who is saying they don't have that? Who is saying it didn't pass all of that?

You're making tons of assumptions here.

JKCalhoun · 2 years ago
To be sure. But the fact is the release broke.

I'm not sure: is having test servers that the release passed any better than having none at all?

martinky24 · 2 years ago
Yeah... the comment above reads like someone who has read a lot of books on CI deployment, but has zero experience in a real world environment actually doing it. Quick to throw stones with absolutely no understanding of any of the nuances involved.
ikiris · 2 years ago
Dude, the fact that it breaks directly speaks for itself.

You sound like the guy that a few years ago tried to argue (the company in question) tested OS code that didn't include any drivers for their gear's local storage. It's obvious it wasn't, to anyone competent.

carterschonwald · 2 years ago
The strange thing is that when I interviewed there years ago with the team that owns the language that runs in the kernel, they said their CI has 20k or 40k machine/OS combinations/configurations. Surely some of them were vanilla Windows!
dboreham · 2 years ago
They used synthetic test data in CI that doesn't consist of zeros.
dagaci · 2 years ago
/* Acceptance criteria #1: do not allow machine to boot if invalid data signatures are present, this could indicate a compromised system. Booting could cause the president's diary to transmit to rival 'Country' of the week */

if(dataFileIsNotValid) { throw FatalKernelException("All your base are compromised"); }

EDIT+ Explanation:

With hindsight, not booting may be exactly the right thing to do, since a bad data file would indicate a compromised distribution/network.

The machines should not fully boot until a file with a valid signature is downloaded.

hnlmorg · 2 years ago
It seems unlikely that a file entirely full of null characters was the output of any automated build pipeline. So I’d wager something got built, passed the CI tests, then the system broke at some point after that when the file was copied ready for deployment.

But at this stage, all we are doing is speculating.

russdill · 2 years ago
You can have all the CI, staging, test, etc. If some bug after that process nulls the file, the rest doesn't matter
fabian2k · 2 years ago
Those signature files should have a checksum, or even a digital signature. I mean even if it doesn't crash the entire computer, a flipped bit in there could still turn the entire thing against a harmless component of the system and lead to the same result.
LorenPechtel · 2 years ago
Yup. I had quite a battle with some sort of system bug (never fully traced) where I wrote valid data but what ended up on disk was all zero. It appeared to involve corrupted packets being accepted as valid.

It doesn't matter how much you test if something down the line zeroes out your stuff.

Jtsummers · 2 years ago
If a garbage file is pushed out, the program could have handled it by ignoring it. In this case, it did not and now we're (the collective IT industry) dealing with the consequences of one company that can't be bothered to validate its input (they aren't the only ones, but this is a particularly catastrophic demonstration of the importance of input validation).
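As an illustration of that kind of input validation, here is a toy Python sketch using a made-up format (4 magic bytes, a little-endian record count, fixed-size records). It bears no relation to the real channel-file layout; it only shows how a garbage file gets ignored instead of crashing anything:

    import struct

    MAGIC = b"CHNL"       # hypothetical magic bytes
    RECORD_SIZE = 16      # hypothetical fixed record size

    def parse_channel_file(blob: bytes):
        """Return the records, or None if the file isn't something we understand."""
        if len(blob) < 8 or blob[:4] != MAGIC:
            return None  # wrong magic -- ignore the file, keep the last good config
        (count,) = struct.unpack_from("<I", blob, 4)
        body = blob[8:]
        if count * RECORD_SIZE != len(body):
            return None  # truncated or padded -- ignore it
        return [body[i * RECORD_SIZE:(i + 1) * RECORD_SIZE] for i in range(count)]

    # A file full of nulls fails the magic check and is simply skipped:
    assert parse_channel_file(b"\x00" * 4096) is None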
Cerium · 2 years ago
What sort of sane system modifies the build output after testing?

Our release process is more like: build and package, sign package, run CI tests on signed package, run manual tests on signed package, release signed package. The deployment process should check those signatures. A test process should by design be able to detect any copy errors between test and release in a safe way.
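A minimal sketch of that last point - releasing exactly the bytes you tested - might look like this in Python (paths and names are made up): CI records the hash of the signed package that passed tests, and the deploy step refuses to publish anything else.

    import hashlib
    import sys
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def gate_release(artifact: Path, tested_digest_file: Path) -> None:
        """Abort the deploy if the artifact differs from the one that passed tests."""
        tested = tested_digest_file.read_text().strip()
        candidate = sha256_of(artifact)
        if candidate != tested:
            sys.exit(f"refusing to release {artifact}: hash does not match the tested build")

    if __name__ == "__main__":
        gate_release(Path("update-package.signed"), Path("tested.sha256"))

Any copy error between test and release then fails loudly instead of shipping.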

jononor · 2 years ago
The issue is not that a file with nulls was produced. It is that an invalid file (of any kind) can trigger a blue screen of death.
0xcafecafe · 2 years ago
They could even have done slow rollouts. Roll it out to a geographical region and wait an hour or so before deploying elsewhere.
saati · 2 years ago
In theory CrowdStrike protects you from threats, leaving regions unprotected for an hour would be an issue.
xyst · 2 years ago
Or test in local environments first. Slow rollouts like this tend to make deployments very very painful.
daseiner1 · 2 years ago
You say *even* (emphasis mine). Is this not industry standard?
miki123211 · 2 years ago
Keep in mind that this was probably a data file, not necessarily a code file.

It's possible that they run tests on new commits, but not when some other, external, non-git system pushes out new data.

Team A thinks that "obviously the driver developers are going to write it defensively and protect it against malformed data", team B thinks "obviously all this data comes from us, so we never have to worry about it being malformed"

I don't have any non-public info about what actually happened, but something along these lines seems to be the most likely hypothesis to me.

Edit: Now what would have helped here is a "staged rollout" process with some telemetry. Push the update to 0.01% of your users and solicit acknowledgments after 15 minutes. If the vast majority of systems are still alive and haven't been restarted, keep increasing the threshold. If, at any point, too many of the updated systems stop responding or indicate a failure, immediately stop the rollout, page your on-call engineers and give them a one-click process to completely roll the update back, even for already-updated clients.

This is exactly the kind of issue that non-invasive, completely anonymous, opt-out telemetry would have solved.
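A toy sketch of that rollout loop, with the fleet, stage sizes, and health check all invented; the point is only the shape - widen while the canaries stay healthy, halt and roll back the moment they don't:

    import random

    def healthy(host: str) -> bool:
        # Stand-in for "did this host phone home within 15 minutes of updating?"
        return random.random() > 0.001

    def staged_rollout(fleet, stages=(0.0001, 0.001, 0.01, 0.1, 1.0), min_success=0.99):
        updated = set()
        for fraction in stages:
            batch = fleet[: max(1, int(len(fleet) * fraction))]
            updated.update(batch)
            alive = sum(healthy(h) for h in updated)
            if alive / len(updated) < min_success:
                print(f"halting at {fraction:.2%}: rolling back {len(updated)} hosts")
                return False
            print(f"stage {fraction:.2%} ok ({alive}/{len(updated)} healthy)")
        return True

    if __name__ == "__main__":
        staged_rollout([f"host-{i}" for i in range(100_000)])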

adzm · 2 years ago
This was a .dll in all but name fwiw.
ar_lan · 2 years ago
> tests all of the possible target images that they claim to support.

Or even at the very least the most popular OS that they support. I'm genuinely imagining right now that for this component, the entirety of the company does not have a single Windows machine they run tests on.

tinytime · 2 years ago
It's wild that I'm out here boosting existing unit testing practices with mutation testing https://github.com/codeintegrity-ai/mutahunter and there are folks out there that don't even do basic testing.
sonotathrowaway · 2 years ago
That’s not even getting into the fuckups that must have happened to allow a bad patch to get rolled out everywhere all at once.
notabee · 2 years ago
Without delving into any kind of specific conspiratorial thinking, I think people should also include the possibility that this was malicious. It's much more likely to be incompetence and hubris, but ever since I found out that this is basically an authorized rootkit, I've been concerned about what happens if another Solarwinds incident occurs with Crowdstrike or another such tool. And either way, we have the answer to that question now: it has extreme consequences. We really need to end this blind checkbox compliance culture and start doing real security.
dheera · 2 years ago
I don't know if people on Microsoft ecosystems even know what CI pipelines are.

Linux and Unix ecosystems in general work by people thoroughly testing and taking responsibility for their work.

Windows ecosystems work by blame passing. Blame Ron, the IT guy. Blame Windows Update. Blame Microsoft. That's how stuff works.

It has always worked this way.

But also, all the good devs got offered 3X the salary at Google, Meta, and Apple. Have you ever applied for a job at CrowdStrike? No? That's why they suck.

* A disproportionately large number of Windows IT guys are named Ron, in my experience.

kabdib · 2 years ago
That's a pretty broad brush.
hn_throwaway_99 · 2 years ago
On a related note, I don't think that it's a coincidence that 2 of the largest tech meltdowns in history (this one and the SolarWinds hack from a few years ago) were both the result of "security software" run amok. (Also sad that both of these companies are based in Austin, which certainly gives Austin's tech scene a black eye).

IMO, I think a root cause issue is that the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture. For example, I can't speak so much for CrowdStrike, but it came out that SolarWinds had an egregiously bad security culture at their company. When the root cause of this issue comes out, dollars to donuts it was just a fast-and-loose deployment process.

cedws · 2 years ago
> the "hacker types" who are most likely to want to start security software companies are also the least likely to want to implement the "boring" pieces of a process-oriented culture

I disagree, security companies suffer from "too big to fail" syndrome where the money comes easy because they have customers who want to check a box. Security is NOT a product you pay for, it's a culture that takes active effort and hard work to embed from day one. There's no product on the market that can provide security, only products to point a finger at when things go wrong.

Andrex · 2 years ago
The market is crying for some kind of '10s "agile hype" equivalent for security evangelism and processes.
moandcompany · 2 years ago
Crowdstrike was originally founded and headquartered in Irvine, CA (Southern California). In those days, most of its engineering organization was either remote/WFH or in Irvine, CA.

As they got larger, they added a Sunnyvale office, and later moved the official headquarters to Austin, TX.

They've also been expanding their engineering operations overseas which likely includes offshoring in the last few years.

nullify88 · 2 years ago
They bought out Humio in Aarhus, Denmark. Now Falcon Logscale.
NegativeK · 2 years ago
Alternate hypothesis that's not necessarily mutually exclusive: security software tends to need significant and widespread access. That means that fuckups and compromises tend to be more impactful.
hn_throwaway_99 · 2 years ago
100% agree with that. The thing that baffles me a bit, then, is that if you are writing software that is so critical and can have such a catastrophic impact when things go wrong, you double- and triple-check everything you do - what you DON'T do is use the same level of care you may use with some social media CRUD app (move fast and break things and all that...)

To emphasize, I'm really just thinking about the bad practices that were reported after the SolarWinds hack (password of "solarwinds123" and a bunch of other insider reports), so I can't say that totally applies to CrowdStrike, but in general I don't feel like these companies that can have such a catastrophic impact take appropriate care of their responsibilities.

worstspotgain · 2 years ago
The industry's track record is indeed quite peculiar. Just off the top of my head:

- CrowdStrike

- SolarWinds

- Kaspersky

- https://en.wikipedia.org/wiki/John_McAfee

I'm sure I'm forgetting a few. Maybe it's the same self-selection scheme that afflicts the legal cannabis industry?

ajsnigrutin · 2 years ago
Security software needs kernel-level access. If something breaks, you get boot loops and crashes.

Most other software doesn't need that low level of access, and even if it crashes, it doesn't take the whole system with it, and a quick, automated upgrade process is possible.

rahkiin · 2 years ago
Security software needs kernel-level access... *on Windows*. macOS has an Endpoint Security userland extension API.
koliber · 2 years ago
Don't forget Heartbleed, a vulnerability in OpenSSL, the software that secures pretty much everything.
Lennie · 2 years ago
I don't see that as a business problem directly.

But I see it as the XKCD 2347 problem.

compacct27 · 2 years ago
The Austin tech culture is…interesting. I stopped trying to find a job here and went remote Bay Area, and talking to tech workers in the area gave me the impression it’s a mix of slacker culture and hype chasing. After moving back here, tech talent seems like a game of telephone, and we’re several jumps past the original.

When I heard CrowdStrike was here, it just kinda made sense

heraldgeezer · 2 years ago
I would add Kaseya VSA Ransomware attack in 2021 to that list.
0cf8612b2e1e · 2 years ago
On the plus side of this disaster, I am holding out some pico-sized hope that maybe organizations will rethink kernel level access. No, random gaming company, you are not good enough to write kernel level anti cheat software.
majormajor · 2 years ago
I can't imagine gaming software being affected at all, unless MS does a ton of cracking down (and would still probably give hooks for gaming since they have gaming companies in their umbrella).

No corporate org is gonna bat an eye at Riot's anti-cheat practices, because they aren't installing LoL on their line of business machines anyway.

minetest2048 · 2 years ago
Until malware brings its own compromised signed anti-cheat driver, like what happened with the Genshin Impact anti-cheat mhyprot2.
InitialLastName · 2 years ago
Right, MS just paid $75e9 for a company whose main products are competitive multiplayer games. They are never going to be incentivized to compromise that sector by limiting what anti-cheat they can do.
tgsovlerkhgsel · 2 years ago
> because they aren't installing LoL on their line of business machines anyway

But if their business is incompatible with strict software whitelisting, their employees might...

pityJuke · 2 years ago
The problem you’re fighting is cheat customers who go “random kernel-level driver? no problem!”
pvillano · 2 years ago
imo anti-cheat should mostly be server-side behavior based
gruez · 2 years ago
How are you going to catch wallhackers that aren't blatantly obvious?
frizlab · 2 years ago
Unless I’m mistaken, on macOS at least kernel access is just not possible, so at least there's that.
heraldgeezer · 2 years ago
And Valorant, CS2, and CoD don't run on macOS due to anti-cheat :)
kragen · 2 years ago
this seems like the second or third test file any qa person would have tried, after an empty file and maybe a minimal valid file. the level of pervasive incompetence implied here is staggering

in a market where companies compete by impressing nontechnical upper management with presentations, it should be no surprise that technically competent companies have no advantage over incompetent ones

i recently read through the craig wright decision https://www.judiciary.uk/judgments/copa-v-wright/ (the guy who fraudulently claimed to be satoshi nakamoto) and he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist (decompiling malware to c); he didn't know what 'unsigned' meant when questioned on the witness stand. he'd been doing infosec work for big companies going back to the 90s. he'd apparently been tricking people with technobabble and rigged demos and forged documents for his entire career

george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd... https://en.wikipedia.org/wiki/George_Kurtz

crowdstrike itself caused the same problem on debian stable three months ago: https://old.reddit.com/r/debian/comments/1c8db7l/linuximage6...

it's horrifying that pci compliance regulations have injected crowdstrike (and antivirus) into virtually every aspect of today's it infrastructure

GardenLetter27 · 2 years ago
Also ironic that the compliance ended up introducing the biggest vulnerability as a massive single point of failure.

But that's government regulation for you.

kragen · 2 years ago
pci-dss is not government regulation but it might as well be; it's a collusion between visa, mastercard, american express, discover, and jcb to prevent them from having different data security standards (and therefore being able to compete on security)
moandcompany · 2 years ago
It's definitely ironic, and compatible with the security engineering world joke that the most secure system is one that cannot be accessed or used at all.

I suppose one way to "stop breaches" is to shut down every host entirely.

In the military world, there is a concept of an "Alpha Strike" which generally relates to a fast-enough and strong-enough first-strike that is sufficient to disable the adversary's ability to respond or fight back (e.g. taking down an entire fleet at once). Perhaps people that have been burned by this event will start calling it a Crowdstrike.

bandyaboot · 2 years ago
What’s frustrating is that after falsely attributing pci to government regulation and being corrected, you’re probably not going to re-examine the thought process that led you to that false belief.
phatfish · 2 years ago
It seems government IT systems in general fared pretty well the last 12 hrs, but loads of large private companies were effectively taken offline, so there's that.
babypuncher · 2 years ago
this had nothing to do with government regulation, thank private sector insurance companies.
acdha · 2 years ago
> But that's government regulation for you.

You misspelled “private sector”. Use of endpoint monitoring software is coming out of private auditing companies driven by things like PCI or insurers’ requirements – almost nobody wants to pay for highly-skilled security people so they’re outsourcing it to the big auditing companies and checklists so that if they get sued they can say they were following industry practices and the audit firms okayed it.

baxtr · 2 years ago
I think the worst part of the incident is that state actors now have a clear blueprint for a large scale infrastructure attack.
IAmNotACellist · 2 years ago
I can think of a lot better things to put in a kernel-level driver installed on every critical computer ever than a bunch of 0s.
Guthur · 2 years ago
You're assuming it wasn't an attack.

Just the same week Kaspersky gets kicked from the us market...

worik · 2 years ago
> george kurtz, ceo and founder of crowdstrike, was the cto of mcafee when they did the exact same thing 14 years ago: https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd...

I find it amusing that the people commenting on that link are offended this is called a "Microsoft" outage, when it is "Crowdstrike's fault".

This is just as much a Microsoft failure.

More than that, this is another industry failure

How many times does this have to happen before we get some industry reform that lets us do our jobs and build the secure reliable systems we have spent seven decades researching?

1988 all over again again again

TeMPOraL · 2 years ago
It's simple: the failure is not specific to the OS.

Crowdstrike runs on MacOS and Linux workstations too. And it's just as dangerous there; the big thread has stories of Crowdstrike breaking Linux systems in the past months.

Crowdstrike isn't needed by/for Windows, it's mandated by corporate and government bureaucracies, where it serves as a tool of employee control and a compliance checkbox to check.

That's why it makes no sense to blame Microsoft. If the world ran on Linux, ceteris paribus, Crowdstrike would be on those machines too, and would fuck them up just as badly globally.

kragen · 2 years ago
the reforms you're hoping for are not going to happen until the countries with bad infosec get conquered by the countries with good infosec. that is going to be much worse than you can possibly imagine

it's not microsoft's fault at all; crowdstrike caused the same problem on debian systems three months ago. the only way it could be microsoft's fault is if letting users install kernel drivers is microsoft's fault

emporas · 2 years ago
> he lacked even the most basic technical competence in the fields where he was supposedly a world-class specialist

Pretty ironic that Craig also suggests putting digital certificates on the blockchain instead of having them issued by a computer or a cluster of computers in one place, which makes them a target. A peer-to-peer network is much more resilient to attacks.

Was the problem, or a part of the problem, that the root certificate of all computers was replaced by CrowdStrike? I didn't follow that part closely. If it was, then certificates registered on the blockchain might be a solution.

I'm totally not an expert in cryptography, but that seems plausible.

As a side note, it is only when I heard Craig Wright's explanation of the blockchain that the bitcoin whitepaper started making a lot of sense. He may not be the greatest of coders, but are mathematicians useless in security?

kragen · 2 years ago
none of your comment makes any sense. possibly it was generated by a large language model with no understanding of what is being talked about
worstspotgain · 2 years ago
I don't mean to sound conspiratorial, but it's a little early to rule out malfeasance on Hanlon's Razor grounds just yet. Most fuckups are not on a ridonkulous global scale. This is looking like the biggest one to date, the Y2K that wasn't.
martin-t · 2 years ago
We as a society need to start punishing incompetence the same way we punish malice.

Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people. Power allows people to distance themselves from the act. Distance should not affect the punishment.

yourapostasy · 2 years ago
> Of course, we also need to first start punishing individuals for intentionally causing harm through their decisions even if the harm was caused indirectly through other people.

Legal teams are way ahead of you here, by design. They’ve learned to factor out responsibility in such a diffuse manner that the “indirectly” term loses nearly all its power the closer one gets to executive ranks, and is a fig leaf by the time you reach C-levels and BoD levels. We couldn’t feasibly operate behemoth corporations otherwise in litigious environments.

I personally suspect we cannot enshrine earnestness and character as metrics and KPIs, even indirectly. From personal experience selling into many client organizations and being able to watch them make some sausage, this is where leadership over managing becomes a key distinction, and organizational culture - built over a long period, measured in generations of decisions and actions implemented by a group with mean workforce tenure measured in decades - becomes one of the few reliable signals. Would love to hear others’ experiences and observations.

Deleted Comment

chuckadams · 2 years ago
So which individual engineer are you locking up and telling their kids that they don't get to see their mommy or daddy again?

> intentionally causing harm

Maybe we need to criminalize groundless accusations too.

worik · 2 years ago
> We as a society need to start punishing incompetence the same way we punish malice.

Yes

But competence is marketed

The trade names like "Crowdstrike" and "Microsoft"

snowwrestler · 2 years ago
I have flagged this because it is wrong, and there is no other way to kick it off the front page.

Finding a file full of zeroes on a broken computer does not mean it was shipped as all zeroes!

https://x.com/craiu/status/1814339965347610863

https://x.com/cyb3rops/status/1814329155833516492

execveat · 2 years ago
CrowdStrike does this trick where it replaces the file (being transferred over a network socket) with zeroes if it matches the malware signature. Assuming that these are the malware signature files themselves, a match wouldn't be surprising.
tsavo · 2 years ago
This actually makes the most sense, and would help explain how the error didn't occur during testing (in good faith, I assume it was tested).

In testing, the dev may have worked from their primary to deploy the update to a series of secondary drives, then sequentially performed a test boot from each secondary drive configured for each supported OS version. A shortcut/quick way of testing like that would've bypassed how their product updates in customer environments, also bypassing checks their software may have performed (in this case, overwriting their own file's contents).

bombcar · 2 years ago
CrowdStrike foot-gunning itself would be amusing, if expected.
dang · 2 years ago
Ah thanks. I've made the title questionable now.
snowwrestler · 2 years ago
Finally confirmed by CrowdStrike themselves:

https://www.crowdstrike.com/blog/tech-analysis-channel-file-...

millero · 2 years ago
Yes, this fits in with what I heard on the grapevine about this bug from a friend who knows someone working for Crowdstrike. The bug had been sitting there in the kernel driver for years before being triggered by this flawed data, which actually was added in a post-processing step of the configuration update - after it had been tested but before being copied to their update servers for clients to obtain.

Apparently, Crowdstrike's test setup was fine for the configuration data itself, but they didn't catch the corruption before it was sent out to production, as they were testing the wrong thing. Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure, in whatever post-mortem writeup they may release.

finaard · 2 years ago
[flagged]
dist-epoch · 2 years ago
"We need to ship this by Friday. Just add a quick post-processing step, and we'll fix it next week properly" - how these things tend to happen.
arp242 · 2 years ago
"I heard on the grapevine from a friend who knows someone working for Crowdstrike" is perhaps not the most reliable source of information, due to the game of telephone effect if nothing else.

And post-processing can mean many things. Could be something relatively simple such as "testing passed, so let's mark the file with a version number and release it".

dgfitz · 2 years ago
Hmm, I post-process autonomous vehicle logs probably daily.

Why is this stupid? It’s pretty useful to see a graph of coolant temp vs ambient temp vs motor speed vs roll/pitch.

I must be especially stupid I suppose. Nuts.

machine_coffee · 2 years ago
At last an explanation that makes a bit of sense to me.

>Hopefully they own up to this, and explain what they're going to do to prevent another global-impact process failure

They probably needn't bother, every competent sysadmin from Greenland to New Zealand is probably disabling the autoupdate feature right now, firewalling it off and hatching a plan to get the product off their server estate ASAP.

Marketing budgets for competing products are going to get a bump this quarter, probably.

swat535 · 2 years ago
I think Crowdstrike is due for more than just "owning up to it". I sincerely hope for serious investigation, fines and legal pursuits despite having "limited liabilities" in their agreements.

Seriously, I don't even know how to do the math on the amount of damage this caused (not including the TIME wasted for businesses as well as ordinary people, for instance those taking flights)

There have to be consequences for this kind of negligence

neffy · 2 years ago
There's a claim over on Mastodon from Kevin Beaumont that the file is different for every customer he's received it from.

https://cyberplace.social/@GossiTheDog/112812454405913406

(scroll down a little)

drewg123 · 2 years ago
I thought Windows required all kernel modules to be signed? If there are multiple corrupt copies, rather than just some test escape, how could they have passed the signature verification and been loaded by the kernel?
dist-epoch · 2 years ago
This is not even a valid executable.

Most likely is not loaded as a driver binary, but instead is some data file used by the CrowdStrike driver.

j-wags · 2 years ago
It's possible that these aren't the original file contents, but rather the result of a manual attempt to stop the bleeding.

Someone may have hoped that overwriting the bad file with an all-0 file of the correct size would make the update benign.

Or following the "QA was bypassed because there was a critical vulnerability" hypothesis, stopping distribution of the real patch may be an attempt to reduce access to the real data and slow reverse-engineering of the vulnerability.