> As Bitsight continues to investigate the traffic patterns exhibited by CrowdStrike machines across organizations globally, two distinct points emerge as “interesting” from a data perspective. Firstly, on July 16th at around 22:00 there was a huge traffic spike, followed by a clear and significant drop off in egress traffic from organizations to CrowdStrike. Second, there was a significant drop, between 15% and 20%, in the number of unique IPs and organizations connected to CrowdStrike Falcon servers, after the dawn of the 19th.
> While we can not infer what the root cause of the change in traffic patterns on the 16th can be attributed to, it does warrant the foundational question of “Is there any correlation between the observations on the 16th and the outage on the 19th?”. As more details from the event emerge, Bitsight will continue investigating the data.
Interested to know how they're capturing sample data for IPs accessing Crowdstrike Falcon APIs and the corresponding packet data.
EDIT: Not to mention that they're able to distill their dataset to group IPs by their representative organizations. Since they have that info, I feel a proper analysis would include actually analyzing which orgs (types, country of origin, etc.) started dropping off on the 16th, something like the sketch below. Alas, since this seems like just a marketing fluff piece, we'll never get anything substantial :(
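For illustration only: assuming a hypothetical table of sampled connections with timestamp, org, org_type, and country columns (Bitsight hasn't published its schema, so these names and values are all made up), the breakdown they could have shown is roughly:

    import pandas as pd

    # Hypothetical sample: one row per observed connection to Falcon infrastructure.
    # Bitsight hasn't published a schema; these columns and values are invented.
    conns = pd.DataFrame({
        "timestamp": pd.to_datetime([
            "2024-07-15 10:00", "2024-07-16 21:00", "2024-07-17 09:00",
            "2024-07-15 11:00", "2024-07-16 20:00",
        ]),
        "org": ["acme", "acme", "acme", "globex", "globex"],
        "org_type": ["healthcare", "healthcare", "healthcare", "airline", "airline"],
        "country": ["US", "US", "US", "DE", "DE"],
    })

    # The spike the article calls out: July 16th, around 22:00.
    cutoff = pd.Timestamp("2024-07-16 22:00")

    # Last time each org was seen talking to Falcon servers.
    last_seen = (conns.groupby(["org", "org_type", "country"])["timestamp"]
                      .max()
                      .reset_index())

    # Orgs that went quiet at or before the spike are the interesting ones.
    dropped = last_seen[last_seen["timestamp"] <= cutoff]

    # Break the drop-off down by the attributes Bitsight says it has.
    print(dropped.groupby(["org_type", "country"]).size().sort_values(ascending=False))

Whether the orgs that went quiet around the 16th cluster by sector or by country is exactly the substance the post never gets to.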
I'm not sure what exactly they are trying to say. They saw some CrowdStrike traffic logs, saw a random spike a few days before the outage, and...that's it? Why is that "strange", and how does it relate to the incident timeline?
Just a random security company with a fluff piece with "CrowdStrike" in the title trying to get in the headlines.
It would be interesting if the spike only happened on CrowdStrike computers, but I'm not sure this article checked traffic logs on non-CrowdStrike computers and confirmed there was no spike there. Even if it did, I agree there's little reason to believe it's related to the CrowdStrike incident; this is probably just a company trying to catch the CrowdStrike PR wave.
I felt very disappointed too after reading it; there's no "Mystery" in the behavior of those numbers. The only unexplained part (as of now) is the huge spike; the subsequent reduction in traffic is explained by "the client hosts were crashing", hence the lower numbers.
I would be interested to know what the distribution of release times for these "channel files" is like. Dropping them at 8pm Eastern time is in line with some companies' idea of well-timed system maintenance windows, whereas others prefer to do things during the workday so that if they need all hands on deck, they can get them more easily.
The latter works better for organizations that release often and have reasonable surety that their updates are not going to cause disruption: releases become a normal part of the day, most commonly cause no noticeable disruption at all, and so it makes sense not to have eng/ops working late hours for them. That surety can come from different places, but the one I've seen is a very methodical rollout: at minimum a smoke test (affecting a very small subset of "production", not internal or lab machines, so in CRWD's case that would be customers' machines), then a rollout to a random percentage of machines starting at 1%, and, depending on your level of confidence, some schedule that gets you to 100% before the end of business for your easternmost co-workers.
Additional ways to gain confidence include a 1% rollout to a set of machines picked to cover, ideally, every type of machine in the fleet, and a 100% rollout to customers who have agreed to be on the cutting edge (how you get them to accept that risk is an exercise for the reader, but maybe cut them a deal, like 30% off their license).
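Concretely, the kind of schedule I have in mind is nothing fancy. A rough sketch (the stage percentages, bake time, and health gate are made up for illustration, not anything CrowdStrike has documented):

    import random
    import time

    # Illustrative ramp only; real stages and bake times vary per org.
    STAGES = [0.01, 0.05, 0.25, 1.00]   # 1% -> 5% -> 25% -> 100%
    BAKE_SECONDS = 2 * 60 * 60          # watch each stage this long before widening

    def healthy(hosts):
        """Placeholder health gate: in reality you'd watch crash reports,
        lost heartbeats, support tickets, etc. from the hosts already updated."""
        return all(host.get("reporting", True) for host in hosts)

    def staged_rollout(fleet, deploy):
        """Push an update to a random, growing slice of the fleet (a list of
        host dicts), halting -- and presumably rolling back -- if the gate fails."""
        shuffled = random.sample(fleet, len(fleet))
        done = 0
        for fraction in STAGES:
            target = int(len(shuffled) * fraction)
            for host in shuffled[done:target]:
                deploy(host)
            done = target
            if not healthy(shuffled[:done]):
                raise RuntimeError(f"halting rollout at {fraction:.0%}")
            if fraction < 1.0:
                time.sleep(BAKE_SECONDS)  # let the stage bake before widening

The point of the 1% stage is that an update which bricks hosts trips the gate while it's still an incident for a handful of customers, not all of them.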
The reason I'm curious about the distribution of channel file drops, for the case of Crowdstrike, is that if it's an atypically-timed release, that could indicate that it's a response to whatever caused the dip in traffic on the 16th mentioned in the Bitsight article.
Edit: From what I understand, Crowdstrike does have at least some segmentation of releases for the kernel extension, but the configuration file / channel file updates appear to be "Oh well, fire ze missiles".
How exactly is Bitsight collecting the data used in this analysis? I understand it’s just a sampling, but how are they sampling traffic between two arbitrary parties (Crowdstrike and customers in this case)?
The obvious inference from this is that the bad update was trickled out to some customers on the 16th and it took them 2 days to report the issue because they were all busy figuring out why every machine was blue-screening.
Alternatively it took CrowdStrike 2 days to notice that their traffic was disappearing and put 2 and 2 together as to why.
I infer that they pushed a (slightly) bad update on the 16th and tried to correct on the 19th, and that the correction was the update that hosed the world.
I feel like there's an army of sysadmins who would be speaking up right now if things had started going down on the 16th, but that doesn't seem to be the case.
I want my 10 minutes back.
Not sure if that would make them more or less incompetent...
Collect evidence first, draw conclusions when you have enough evidence.
Makes a refreshing change from deciding what happened and then collecting only the evidence that supports it.