1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system-wide rapid config propagation, it took them two hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is to roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. This was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but they didn't quite have the visibility and rapid-rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
(laid off) Microsoft PM here who worked on the patch described, which came out of the research from Wiz.
One correction I’d like to suggest to the article: the guidance given is to check either the “iss” or “tid” claim when authorizing multi-tenant apps.
The actual recommended guidance we provided is slightly more involved. If you validate only the tenant, any service principal in that tenant could be granted access.
You should always validate the subject in addition to validating the tenant for the token being authorized. One method for this would be to validate the token using a combined key (for example, tid+oid) or perform checks on both the tenant and subject before authorizing access. More info can be found here:
https://learn.microsoft.com/en-us/entra/identity-platform/cl...
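A minimal sketch of the check described above, assuming the token has already been signature-validated and decoded into a claims dict. The allowlist contents and the helper name are hypothetical, for illustration only:

```python
# Sketch: authorize a decoded, signature-verified token by checking BOTH
# the tenant ("tid") and the subject/object id ("oid"), not just the tenant.
# The allowlist below is a placeholder; a real app would load it from config.

ALLOWED_IDENTITIES = {
    # (tid, oid) pairs: tenant id + object id of the authorized service principal
    ("tenant-a-guid", "sp-1-guid"),
    ("tenant-b-guid", "sp-2-guid"),
}

def is_authorized(claims: dict) -> bool:
    """Return True only if both the tenant and the subject match the allowlist."""
    tid = claims.get("tid")
    oid = claims.get("oid")
    if not tid or not oid:
        return False
    return (tid, oid) in ALLOWED_IDENTITIES

# A token from an allowed tenant but an unknown service principal is rejected:
print(is_authorized({"tid": "tenant-a-guid", "oid": "attacker-guid"}))  # False
print(is_authorized({"tid": "tenant-a-guid", "oid": "sp-1-guid"}))      # True
```

The point is that the composite (tid, oid) key is the authorization boundary; checking `tid` alone would let any principal in an allowed tenant through.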
1. Working out in the open
2. Dogfooding their own product
3. Pushing the state of the art
Given that the negative impact here falls mostly (completely?) on the Microsoft team which opted into this, is there any reason why we shouldn't be supporting progress here?
It’s showing the actual capabilities in practice. That’s much better and way more illuminating than what normally happens with sales and marketing hype.
I would much rather give to charities focusing on countries that don’t have the economy/ability to fix their basic issues.
For example my STB is on a VLAN that has WAN access (otherwise it won't do anything), but that makes it untrustworthy, so it is completely isolated from the rest of the LAN.
On the other hand some "smart"/IoT devices are on a VLAN that has no WAN access so that they can't phone home, become a botnet, or download firmware updates that remove functionality in favor of subscription services. Only a VM running homeassistant can talk to them.
This will work until Amazon Sidewalk / built-in LTE modems become too common; at that point I'll have to start ripping the radio modules out of the things I buy.
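The policy described above might look roughly like this as a firewall config (an nftables-style sketch; the interface names and the Home Assistant VM address are hypothetical placeholders for whatever your network actually uses):

```
# Sketch of the VLAN isolation described above (nftables syntax).
table inet filter {
  chain forward {
    type filter hook forward priority 0; policy drop;

    # STB VLAN: WAN access only, no access to the trusted LAN
    iifname "vlan10" oifname "wan0" accept
    iifname "wan0" oifname "vlan10" ct state established,related accept

    # IoT VLAN: no WAN at all; only the Home Assistant VM may reach it
    ip saddr 192.168.30.5 oifname "vlan20" accept
    iifname "vlan20" ip daddr 192.168.30.5 ct state established,related accept
  }
}
```

With a default-drop forward policy, anything not explicitly allowed (IoT-to-WAN, STB-to-LAN) is blocked.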
With the 4-way stop there is never a time in the cycle when all traffic is stopped. The drivers present are continuously paying attention to what the other drivers are doing, which robs them of the situational awareness to notice pedestrians. You can try to time it, but that's risky. With the walk signal there is a brief moment when all the drivers are stopped and doing nothing but waiting for you, so as a pedestrian you can account for them just before you get your signal and make your move.
The author can get lost with this sort of textbook-correct but questionable-in-reality take. Legally having the right of way doesn't make you any less dead when the driver who has three other drivers to pay attention to doesn't see you.
I guess it's an organizational consequence of mitigating attacks in real time, where rollout delays can be risky as well. But if you're going to do that, it would appear that the code has to be written much more defensively than what they're doing it right now.
I really don't buy this requirement to always deploy state changes 100% globally immediately. Why can't they just roll out to 1%, scaling to 100% over 5 minutes (configurable), with automated health checks and pauses? That would go a long way toward reducing the impact of these regressions.
Then, if they really think something is so critical that it has to go everywhere immediately, sure, set the rollout to start at 100%.
Point is, design the rollout system to give you that flexibility. Routine/non-critical state changes should go through slower ramping rollouts.
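The ramping scheme described above can be sketched in a few lines. This is a toy illustration, not anyone's actual deploy system; the stage fractions, soak time, and health-check hook are all placeholders:

```python
import time

def ramped_rollout(apply_to_fraction, healthy,
                   stages=(0.01, 0.10, 0.50, 1.0), wait_seconds=60):
    """Roll a change out in stages; pause to check health after each stage,
    and roll back to the known good state (0%) if health checks fail."""
    for fraction in stages:
        apply_to_fraction(fraction)
        time.sleep(wait_seconds)      # let the canary soak
        if not healthy():
            apply_to_fraction(0.0)    # automated rollback
            return False
    return True

# For a genuine emergency (e.g. an active attack), the same machinery can be
# told to skip the ramp entirely:
#   ramped_rollout(apply, healthy, stages=(1.0,), wait_seconds=0)
```

The flexibility lives in the `stages` parameter: routine changes get the slow ramp by default, and the fast path is an explicit opt-in rather than the only mode the system supports.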