squirrel · a year ago
There’s only one sentence that matters:

"Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed."

This is where they admit that:

1. They deployed changes to their software directly to customer production machines;
2. They didn’t allow their clients any opportunity to test those changes before they took effect; and
3. This was cosmically stupid and they’re going to stop doing that.

Software that does 1. and 2. has absolutely no place in critical infrastructure like hospitals and emergency services. I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.

98codes · a year ago
Combined with this, presented as a change they could potentially make, it's a killer:

> Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

They weren't doing any test deployments at all before blasting the world with an update? Reckless.

dijksterhuis · a year ago
> our staging environment, which consists of a variety of operating systems and workloads

they have a staging environment at least, but no idea what they were running in it or what testing was done there.

SketchySeaBeast · a year ago
Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means they just won't get pushed and will be chronically exposed.

That said, maybe Crowdstrike should consider validating every step of the delivery pipeline before pushing to customers.

EvanAnderson · a year ago
> That said, maybe Crowdstrike should consider validating every step of the delivery pipeline before pushing to customers.

If they'd just had a lab of a couple dozen PCs acting as canaries they'd have caught this. Apparently that was too complicated or expensive for them.

dmazzoni · a year ago
Why can't they just do it more like Microsoft security patches, making them mandatory but giving admins control over when they're deployed?
throw0101d · a year ago
> Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means they just won't get pushed and will be chronically exposed.

I have a similar feeling.

At the very least perhaps have an "A" and a "B" update channel, where "B" is x hours behind A. This way, in an HA configuration, if a bad update takes the A side down there's time to deal with it while your B side is still up.
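Something like this, as a toy sketch (the channel names, the lag, and the update service are all made up for illustration):

```python
from datetime import datetime, timedelta

# Toy illustration of an A/B channel split: channel "B" only sees content
# that channel "A" has already been running for LAG hours.
LAG = timedelta(hours=4)

# (content version, time it was published to channel A) -- invented data
published_to_a = [
    ("content-291-v1", datetime(2024, 7, 19, 4, 0)),
    ("content-291-v2", datetime(2024, 7, 19, 9, 0)),
]

def visible_versions(channel: str, now: datetime) -> list[str]:
    """Return the content versions a host on this channel may download."""
    if channel == "A":
        return [v for v, published in published_to_a if published <= now]
    # Channel B trails A, so a bad update that takes down the A side of an
    # HA pair has LAG hours to be caught before it reaches the B side.
    return [v for v, published in published_to_a if published + LAG <= now]

print(visible_versions("B", datetime(2024, 7, 19, 10, 0)))  # ['content-291-v1']
```

The mechanism doesn't matter much; the point is just that the two halves of an HA pair should never take the same content at the same instant.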

thaumasiotes · a year ago
> Unfortunately, putting the onus on risk-averse organizations like hospitals and governments to validate the AV changes means they just won't get pushed and will be chronically exposed.

Being chronically exposed may be the right call, in the same way that Roman cities didn't have walls.

Compare this perspective from Matt Levine:

https://archive.is/4AvgO

> So for instance if you run a ransomware business and shut down, like, a marketing agency or a dating app or a cryptocurrency exchange until it pays you a ransom in Bitcoin, that’s great, that’s good money. A crime, sure, but good money. But if you shut down the biggest oil pipeline in the U.S. for days, that’s dangerous, that’s a U.S. national security issue, that gets you too much attention and runs the risk of blowing up your whole business. So:

>> In its own statement, the DarkSide group hinted that an affiliate may have been behind the attack and that it never intended to cause such upheaval.

>> In a message posted on the dark web, where DarkSide maintains a site, the group suggested one of its customers was behind the attack and promised to do a better job vetting them going forward.

>> “We are apolitical. We do not participate in geopolitics,” the message says. “Our goal is to make money and not creating problems for society. From today, we introduce moderation and check each company that our partners want to encrypt to avoid social consequences in the future.”

> If you want to use their ransomware software to do crimes, apparently you have to submit a resume demonstrating that you are good at committing crimes. (“Hopeful affiliates are subject to DarkSide’s rigorous vetting process, which examines the candidate’s ‘work history,’ areas of expertise, and past profits among other things.”) But not too good! The goal is to bring a midsize company to its knees and extract a large ransom, not to bring society to its knees and extract terrible vengeance.

https://archive.is/K9qBm

> We have talked about this before, and one category of crime that a ransomware compliance officer might reject is “hacks that are so big and disastrous that they could call down the wrath of the US government and shut down the whole business.” But another category of off-limits crime appears to be “hacks that are so morally reprehensible that they will lead to other criminals boycotting your business.”

>> A global ransomware operator issued an apology and offered to unlock the data targeted in a ransomware attack on Toronto’s Hospital for Sick Children, a move cybersecurity experts say is rare, if not unprecedented, for the infamous group.

>> LockBit’s apology, meanwhile, appears to be a way of managing its image, said [cybersecurity researcher Chester] Wisniewski.

>> He suggested the move could be directed at those partners who might see the attack on a children’s hospital as a step too far.

> If you are one of the providers, you have to choose your hacker partners carefully so that they do the right amount of crime: You don’t want incompetent or unambitious hackers who can’t make any money, but you also don’t want overly ambitious hackers who hack, you know, the US Department of Defense, or the Hospital for Sick Children. Meanwhile you also have to market yourself to hacker partners so that they choose your services, which again requires that you have a reputation for being good and bold at crime, but not too bold. Your hacker partners want to do crime, but they have their limits, and if you get a reputation for murdering sick children that will cost you some criminal business.

hello_moto · a year ago
> I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.

Absolutely this is what will happen.

I don't know much about how AV definition-style updates are handled across the cybersecurity industry, but I would imagine no vendor does rolling updates today because it would involve opt-in/opt-out, which could slow the vendor's ability to respond to an attack, which in turn affects their "Reputation" as well.

"I bought Vendor-A solution but I got hacked and have to pay Ransomware" (with a side note: because I did not consume the latest critical update of AV definition) is what Vendors worried.

Now that this global outage has happened, it will change the landscape a bit.

XlA5vEKsMISoIln · a year ago
>Now that this global outage has happened, it will change the landscape a bit.

I seriously doubt that. Questions like "why should we use CrowdStrike" will be met with "suppose they've learned their lesson".

bawolff · a year ago
> 1. They deployed changes to their software directly to customer production machines;
> 2. They didn’t allow their clients any opportunity to test those changes before they took effect; and
> 3. This was cosmically stupid and they’re going to stop doing that.

Is it really all that surprising? This is basically their business model - it's a fancy virus scanner that is supposed to instantly respond to threats.

koolba · a year ago
> They didn’t allow their clients any opportunity to test those changes before they took effect

I’d argue that anyone that agrees to this is the idiot. Sure they have blame for being the source of the problem, but any CXO that signed off on software that a third party can update whenever they’d like is also at fault. It’s not an “if” situation, it’s a “when”.

throwaway2037 · a year ago
I felt exactly the same when I read about the outage. What kind of CTO would allow 3rd party "security" software to automatically update? That's just crazy. Of course, your own security team would do some careful (canary-like) upgrades locally... run for a bit... run some tests, then sign off. Then upgrade in a staged manner.
tptacek · a year ago
> They deployed changes to their software directly to customer production machines

This is part of the premise of EDR software.

nathanlied · a year ago
>I predict we’ll see other vendors removing similar bonehead “features” very very quietly over the next few months.

If indeed this happens, I'd hail this event as a victory overall; but industry experience tells me that most of those companies will say "it'd never happen with us, we're a lot more careful", and keep doing what they're doing.

packetlost · a year ago
I really wish we would get some regulation as a result of this. I know people that almost died due to hospitals being down. It should be absolutely mandatory for users, IT departments, etc. to be able to control when and where updates happen on their infrastructure but *especially* so for critical infrastructure.
mr_mitm · a year ago
Does anyone test their antivirus updates individually as a customer? I thought they happen multiple times a day, who has time for that?
toast0 · a year ago
Some sort of comprehensive test is unlikely.

But canary / smoke tests, you can do, if the vendor provides the right tools.

It's a cycle: pick the latest release, do some small cluster testing, including rollback testing, then roll out to 1%; if those machines are (mostly) still available in 5 minutes, roll out to another 2%; if that cumulative 3% is (mostly) still available in 5 minutes, roll out to another 4%, etc. If updates are fast and everything works, it goes quick. If there's a big problem, you'll still have a lot of working nodes. If there's a small problem, you have a small problem.

It's gotta be automated, though, with an easy way for a person to pause if something is going wrong that the automation doesn't catch. If the pace is several updates a day, that's too much for people, IMHO.
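Roughly, as a sketch (the percentages, the health threshold, and the deploy/health-check calls are invented stand-ins for whatever tooling the vendor actually provides):

```python
import time

# Cumulative fraction of the fleet at each step: 1%, then +2%, then +4%, ...
ROLLOUT_STEPS = [0.01, 0.03, 0.07, 0.15, 0.31, 0.63, 1.0]
HEALTH_THRESHOLD = 0.97   # "mostly still available"
WAIT_SECONDS = 5 * 60

def staged_rollout(release, deploy_to, fraction_healthy, paused):
    """Gradually widen a rollout, halting if hosts stop reporting healthy.

    deploy_to(release, fraction)  -> pushes the release to that cumulative slice
    fraction_healthy(fraction)    -> share of that slice still checking in
    paused()                      -> lets a human stop the automation
    """
    for fraction in ROLLOUT_STEPS:
        if paused():
            print("rollout paused by operator")
            return False
        deploy_to(release, fraction)
        time.sleep(WAIT_SECONDS)
        healthy = fraction_healthy(fraction)
        if healthy < HEALTH_THRESHOLD:
            print(f"only {healthy:.0%} healthy at {fraction:.0%}; stopping and rolling back")
            return False
    return True
```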

packetlost · a year ago
Yes? Not consumers typically, but many IT departments with certain risk profiles absolutely do.
Fire-Dragon-DoL · a year ago
Now let's see if Microsoft listens and fixes Windows updates
openasocket · a year ago
I work on a piece of software that is installed on a very large number of servers we do not own. The CrowdStrike incident is exactly our nightmare scenario. We are extremely cautious about updates; we roll them out very slowly with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the CrowdStrike incident and share them with anyone who complains about how slow the update process is.

The two golden rules are to let host owners control when to update whenever possible, and when it isn’t possible, to deploy very, very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism, so your change gets all the same deployment safety guardrails, automated tests, and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop), roll back or at least pause the deployment.
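A toy version of that last guardrail (the names and the 95% floor are invented for the sketch, not our actual system):

```python
# Halt the rollout when updated agents stop phoning home (e.g. a reboot loop).
CHECKIN_FLOOR = 0.95  # invented threshold for illustration

def should_halt(agents_updated: int, agents_checked_in_since_update: int) -> bool:
    if agents_updated == 0:
        return False
    return agents_checked_in_since_update / agents_updated < CHECKIN_FLOOR

# e.g. 10,000 agents took the update but only 6,200 have checked in since:
if should_halt(10_000, 6_200):
    print("pause the deployment and page someone")
```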

taspeotis · a year ago
I don’t have much sympathy for CrowdStrike but deploying slowly seems mutually exclusive to protecting against emerging threats. They have to strike a balance.
zavec · a year ago
Even a staged rollout over a few hours would have made a huge difference here. "Slow" in the context of a rollout can still be pretty fast.
getcrunk · a year ago
Seriously like rolling out on some exponential scale even over the course of 10 minutes would have stopped this dead in its tracks
yardstick · a year ago
In CrowdStrikes case, they could have rolled out to even 1 million endpoints first and done an automated sanity/wellness check before unleashing the content update on everyone.

In the past, when I have designed update mechanisms, I've included basic failsafes such as automatically checking the percentage of failed updates over a sliding 24-hour window and stopping further updates if there are too many failures.
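Something along these lines (the class name and the 2% threshold are invented for the sketch):

```python
from collections import deque
from time import time

WINDOW_SECONDS = 24 * 60 * 60   # sliding 24-hour window
MAX_FAILURE_RATE = 0.02         # invented: stop pushing if >2% of recent updates failed

class UpdateFailsafe:
    """Track recent update outcomes and trip a breaker on too many failures."""

    def __init__(self):
        self.events = deque()   # (timestamp, succeeded)

    def record(self, succeeded: bool, now: float | None = None) -> None:
        now = time() if now is None else now
        self.events.append((now, succeeded))
        self._trim(now)

    def _trim(self, now: float) -> None:
        while self.events and now - self.events[0][0] > WINDOW_SECONDS:
            self.events.popleft()

    def allow_more_updates(self, now: float | None = None) -> bool:
        now = time() if now is None else now
        self._trim(now)
        if not self.events:
            return True
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events) <= MAX_FAILURE_RATE
```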

goalieca · a year ago
They need a lab full of canaries.
Am4TIfIsER0ppos · a year ago
> let [...] owners control when to update

The only acceptable update strategy for all software regardless of size or importance

Cyphase · a year ago
Lots of words about improving testing of the Rapid Response Content, very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".

> Enhance existing error handling in the Content Interpreter.

That's it.

Also, it sounds like they might have separate "validation" code, based on this; why is "deploy it in a realistic test fleet" not part of validation? I notice they haven't yet explained anything about what the Content Validator does to validate the content.

> Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.

Could it say any less? I hope the new check is a test fleet.

But let's go back to, "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes".

SoftTalker · a year ago
> it sounds like they might have separate "validation" code

That's what stood out to me. From the CS post: "Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published."

Lesson learned, a "Validator" that is not actually the same program that will be parsing/reading the file in production, is not a complete test. It's not entirely useless, but it doesn't guarantee anything. The production program could have a latent bug that a completely "valid" (by specification) file might trigger.

modestygrime · a year ago
I'd argue that it is completely useless. They have the actual parser that runs in production and then a separate "test parser" that doesn't actually reflect reality? Why?
pdpi · a year ago
> very little about "the sensor client should not ever count on the Rapid Response Content being well-formed to avoid crashes"

That stood out to me as well.

Their response was the moral equivalent of Apple saying “iTunes crashes when you play a malformed mp3, so here’s how we’re going to improve how we test our mp3s before sending them to you”.

This is a security product that is expected to handle malicious inputs. If they can’t even handle their own inputs without crashing, I don’t like the odds of this thing being itself a potential attack vector.

Cyphase · a year ago
That's a good comparison to add to the list for this topic, thanks. An example a non-techie can understand, where a client program is consuming data blobs produced by the creator of the program.

And great point that it's not just about crashing on these updates, even if they are properly signed and secure. What does this say about other parts of the client code? And if they're not signed, which seems unclear right now, then could anyone who gains access to a machine running the client get it to start boot looping again by copying Channel File 291 into place? What else could they do?

Echoes of the Sony BMG rootkit.

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootk...

WatchDog · a year ago
Focusing on the rollout and QA process is the right thing to do.

The bug itself is not particularly interesting, nor is the fix for it.

The astounding thing about this issue is the scale of the damage it caused, and that scale is all due to the rollout process.

gwd · a year ago
Indeed, the very first thing they should be doing is adding fuzzing of their sensor to the test suite, so that it's not possible (or astronomically unlikely) for any corrupt content to crash the system.
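Even a crude mutation fuzzer run in userspace against the content interpreter would help; a minimal sketch, with `parse_channel_file` standing in for the real parser:

```python
import random

def parse_channel_file(data: bytes) -> None:
    """Stand-in for the content interpreter under test."""
    ...

def mutate(data: bytes, rng: random.Random) -> bytes:
    """Flip, insert, or truncate random bytes of a known-good content file."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, 16)):
        roll = rng.random()
        if roll < 0.5 and buf:
            buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)        # bit flip
        elif roll < 0.8:
            buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256))  # insert a byte
        elif buf:
            del buf[rng.randrange(len(buf)):]                            # truncate
    return bytes(buf)

def fuzz(seed: bytes, iterations: int = 100_000) -> None:
    rng = random.Random(0)
    for _ in range(iterations):
        sample = mutate(seed, rng)
        try:
            parse_channel_file(sample)   # must reject garbage cleanly, never crash
        except ValueError:
            pass                         # a parse error is the acceptable outcome
```
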
hun3 · a year ago
Is error handling enough? A perfectly valid rule file could hang (but not outright crash) the system, for example.
throwanem · a year ago
If the rules are Turing-complete, then sure. I don't see enough in the report to tell one way or another; the way the rules are described, as filling in templates, suggests either possibility about equally (it depends on whether templates may reference other templates), and there is not a lot more detail. Halting seems relatively easy to manage with something like a watchdog timer, though, compared to a sound, crash- and memory-safe* parser for a whole programming language, especially if that language exists more or less by accident. (Again, no claim; there's not enough available detail.)

I would not want to do any of this directly on metal, where the only safety is what you make for yourself. But that's the line Crowdstrike are in.

* By EDR standards, at least, where "only" one reboot a week forced entirely by memory lost to an unkillable process counts as exceptionally good.

ReaLNero · a year ago
Perhaps set a timeout on the operation then? Given this is kernel code it's not as easy as userspace, but I'm sure you could request to set an interrupt on a timer.
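In userspace terms (nothing like real kernel code, and the names are invented), the idea is roughly:

```python
import multiprocessing as mp

EVAL_DEADLINE_SECONDS = 2.0  # invented time budget for evaluating one content file

def evaluate_rules(channel_file: bytes) -> None:
    """Stand-in for interpreting a Rapid Response Content file."""
    ...

def evaluate_with_deadline(channel_file: bytes) -> bool:
    """Run the interpreter in a child process and give up if it hangs or crashes."""
    worker = mp.Process(target=evaluate_rules, args=(channel_file,))
    worker.start()
    worker.join(EVAL_DEADLINE_SECONDS)
    if worker.is_alive():          # deadline blown: treat the content as bad
        worker.terminate()
        worker.join()
        return False
    return worker.exitcode == 0    # a crashed child also means bad content
```
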
DannyBee · a year ago
Increase counter when you start loading

Have timeout

Decrement counter after successful load and parse

Check counter on startup. If it is like 3, maybe consider you are crashing
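Sketched out (the file path and the threshold of 3 are made up; the real counter would have to live somewhere that survives a crash mid-boot):

```python
from pathlib import Path

COUNTER = Path("/var/lib/sensor/content_load_attempts")  # hypothetical persistent location
MAX_ATTEMPTS = 3

def load_content_safely(load_and_parse) -> bool:
    attempts = int(COUNTER.read_text()) if COUNTER.exists() else 0
    if attempts >= MAX_ATTEMPTS:
        return False                         # we keep crashing on this content: skip it
    COUNTER.write_text(str(attempts + 1))    # increment before loading
    load_and_parse()                         # if this crashes, the counter stays raised
    COUNTER.write_text(str(attempts))        # decrement only after a successful load and parse
    return True
```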

mdriley · a year ago
> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.

It compiled, so they shipped it to everyone all at once without ever running it themselves.

They fell short of "works on my machine".

red2awn · a year ago
> How Do We Prevent This From Happening Again?

> Software Resiliency and Testing

> * Improve Rapid Response Content testing by using testing types such as:

> * Local developer testing

So no one actually tested the changes before deploying?!

Narretz · a year ago
And why is it "local developer testing" and not CI/CD? This makes them look like absolute amateurs.
belter · a year ago
> This makes them look like absolute amateurs.

This also applies to all the architects and CTOs at these Fortune 500 companies who allowed these self-updating systems into their critical systems.

I would offer a copy of Antifragile to each of these teams: https://en.wikipedia.org/wiki/Antifragile_(book)

"Every captain goes down with every ship"

radicaldreamer · a year ago
They don't care; CI/CD, like QA, is considered a cost center at some of these companies. The cheapest thing for them is to offload the burden of testing every configuration onto the developer, who is also going to be tasked with shipping as quickly as possible or getting canned.

Claw back executive pay, stock, and bonuses imo and you'll see funded QA and CI teams.

hyperpape · a year ago
It sure sounds like the "Content Validator" they mention is a form of CI/CD. The problem is that it passed that validation, but was capable of failing in reality.
RaftPeople · a year ago
The fact that they even listed "local developer testing" is pretty weird.

That is just part of the basic process and is hardly the thing that ensures a problem like this doesn't happen.

spacebanana7 · a year ago
This also becomes a security issue at some point. If these updates can go in untested, what's to stop a rogue employee from deliberately pushing a malicious update?

I know insider threats are very hard to protect against in general but these companies must be the most juicy target for state actors. Imagine what you could do with kernel space code in emergency services, transport infrastructure and banks.

amluto · a year ago
CrowdStrike is more than big enough to have a real 2000’s-style QA team. There should be actual people with actual computers whose job is to break the software and write bug reports. Nothing is deployed without QA sign off, and no one is permitted to apply pressure to QA to sign off on anything. CI/CD is simply not sufficient for a product that can fail in a non-revertable way.

A good QA team could turn around a rapid response update with more than enough testing to catch screwups like this and even some rather more subtle ones in an hour or two.

cataflam · a year ago
Besides missing the actual testing (!) and the staged rollout (!), it looks like they also weren't fuzzing this kernel driver that routinely takes instant worldwide updates. Oops.
l00tr · a year ago
check their developer github, "i write kernel-safe bytecode interpreters" :D, [link redacted]

brcmthrowaway · a year ago
He Codes With Honor(tm)
rurban · a year ago
They bypassed the tests and staged deployment, because their previous update looked good. Ha.

What if they implemented a release process, and followed it? Like everyone else does. Hackers at the workplace, sigh.

fulafel · a year ago
Also it must have been a manual testing effort, otherwise there would be no motive to skip it. IOW, missing test automation.
yuliyp · a year ago
This feels natural, though: the first time you do something you do it 10x more slowly because there's a lot more risk. Continuing to do things like that forever isn't realistic. Complacency is a double-edged sword: sometimes it gets us to avoid wasting time and energy on needless worry (the first time someone drives a car they go 5 mph and brake at anything surprising), sometimes it gets us to be too reckless (drivers forgetting to check blind spots or driving at dangerous speeds).
throwaway7ahgb · a year ago
Where do you see that? It looks like there was a bug in the template tester. Or do you mean the manual tests?
kasabali · a year ago
> Based on the testing performed before the initial deployment of the Template Type (on March 05, 2024), trust in the checks performed in the Content Validator, and previous successful IPC Template Instance deployments, these instances were deployed into production.
CommanderData · a year ago
They know better obviously, transcending process and bureaucracy.
rurban · a year ago
Same thing happened with Falcon on Debian before. Later they admitted that they didn't test some of the platforms they were releasing for. Never heard of Docker?

How can you keep on with such a Q&R manager? He'll cost them billions